Concept Formation in Computational Creativity

8 Conclusions

Throughout this investigation, the technology associated with the topics addressed in this thesis has been in a constant state of evolution. There are no clear signs that the improvements in this area will stop or slow down, making it very challenging to come up with conclusions that can stay relevant for an extended period of time. Nevertheless, in this final chapter I will attempt to summarize some theoretical considerations emerging from this research that should endure, while also highlighting the many limitations that characterized this doctoral journey. The hope is that these insights can foster further research and collaborations to better understand and anticipate the future challenges that creative practices may encounter in the face of rapidly advancing technology.

8.1 Research summary

When this research began in 2019, it set out to explore how different computational approaches might affect the creative process, with the assumption that they implicitly embed diverging concept ontologies (TOCs). Compositionality had emerged early on from the literature review as one of the factors that are fundamental to concept formation. While traditional rule-based approaches to computation implement crisp analytical rules of composition that are pre-defined by humans, DL originally struggled to achieve compositionality due to the inherent locality of neuron states in a network. During the course of this doctoral journey, the innovation of self-attention and its ability to scale over massively large datasets proved to be an effective way to address compositionality using data-driven technologies.

In order to describe different computational approaches and their respective nuances, an extension to mediation theory (Ihde 1993) that addressed two different contextualizations of input (R[] and D[]) has been proposed. This distinction has been discussed in relation to existing TOCs as well as models of creativity (Boden 1996, 2003; Rhodes 1961; Hoorn 2014) and used in the studies to analyze specific forms of mediation afforded by DL technologies such as VAEs, Transformers, GANs and the most recent LDMs. The insights that emerged from each study highlight several themes that can be related back to the TOCs discussed in the literature review as well as provide further research directions for existing creativity theories.

As part of this doctoral thesis, three studies have been conducted. The first and second studies embraced a practice-led research approach (Candy 2011), in the form of close collaborations with a musician and a photographer. These studies addressed existing DL tools that could be trained on datasets we created ourselves. In the second study, the act of putting together a dataset was used as a reflective practice (Schön 1983) to investigate how D[] contextualizations form concepts from a series of examples. The third study focused on the usage patterns of a Discord bot powered by Stable Diffusion (Rombach et al. 2021), with particular attention to the construct of Tolerance for Ambiguity (TA) (Norton 1975) and its relationship with user expectations about text-to-image technology. The data collected from 76 participants shows an exponential distribution of the number of images generated per user and a weak correlation between TA and expectations of image alignment. No significant correlation between TA and usage was found in the collected data. These results suggest the hypothesis that social factors, such as network effects, might be influencing usage, while personal traits (TA) only affect expectations about text-to-image technology capabilities.

8.2 Post-phenomenology of creativity

One of the main gaps exposed in the literature review was the lack of a non-dual way to represent human and non-human roles in creative endeavors. The adoption of the post-phenomenological lens allows for a description of the creative process that does not separate them, but instead puts them in a mediating relationship. Existing models of creativity taken under consideration either formalize the distinction, as in Rhodes (1961), or are agnostic about the nature of the actor performing each step, as proposed by Hoorn (2014).

Adopting a post-phenomenological approach might provide an explanation of why certain aspects of a technology might be overlooked or not fully understood by its users, as discussed by Eede (2010) with regard to technological opacity. Indeed, the difference between alterity relations that use R[] or D[] might not be immediately obvious. For example, users might be inclined to think that GPT is capable of analytical reasoning and consider its output as bound to some kind of truth, while the reality is that GPT is most definitely not fact-checking its output in any way and will likely make up a probable answer if it has not seen any data related to a topic during training. Understanding this is roughly equivalent to being able to discern R[] from D[] technologies.

The post-phenomenological framework discussed in this thesis is able to explain some of the unexpected results that emerged from the third study. As already highlighted by Flusser (1999, 2000), the Apparatus is to be considered part of the technological mediation. This perspective provides some insights into the unexpected finding that network effects might be a factor that affects tool use (see Section 6.5). This explanation also extends to the discussion about the increasingly important role of dataset curation presented in Section 7.1, as the increasing popularity of model sharing sites like Civitai seems to confirm.

By adopting a perspective that allows language to be seen as a technology, the framework might also provide some insight into its ability to bridge between R[] and D[], thus mediating our experience of concepts. By enabling both analytic and synthetic contextualizations, language has the unique potential to bridge perception and cognition. This also highlights its key role in the development of multi-modal deep learning models that can effectively interface with various forms of human experience and understanding.

Finally, as highlighted throughout Chapter 7, possessing intimate knowledge of how technology affects creative work is crucial for future generations of artists and designers, as the attention of communities of practice is shifting from Product to Process (Rhodes 1961). The extension of mediation theory (Ihde 1990) proposed in this thesis should be useful to educators to promote a more conscious and effective use of both types of computational approaches. I believe it is also particularly important to teach how the two can integrate with one another. R[] and D[] have strengths and weaknesses that are complementary and, as seen in Table 3.1, they might be more or less suited for a task, based on which ACASIA module they are embedded in. For the educators in this field, a teaching objective that this research suggests is to provide students with an understanding of technological opacity and background relations, so that they can master their creative expression through any technology.

8.3 Limitations

The studies address new technologies as they develop. This is not ideal, as discussed in Section 3.4, because the methods used to conduct research must continuously change and adapt to new advancements in order to remain relevant. The methodology addressing this topic is still in its infancy. Constructs such as alignment between prompt and image are yet to be formalized and operationalized properly, and by the time they are, new solutions might render them obsolete. It is simply not possible to keep up to date with the latest release of every tool and conduct rigorous research at the same time. Fast research is not good research, but slow research becomes irrelevant fast in this field.

Regarding the first two studies, the practice-led approach has proven to be limiting in terms of potential generalization of results. While the experience gained through practice turned out to be valuable, N = 1 studies are very specific by nature and the results cannot be discussed in objective terms. In hindsight, the first study focused too much on the output rather than the process, which might be hindering its generalizability even further. Practice-led research conducted in this form is effectively limited to self-reflective exercises, which can only be valuable at a later stage.

Another limitation that was felt in all the studies was the lack of adequate parallel computing power and memory (i.e. GPUs) to run experiments. Unfortunately, the ability to run certain DL models depends on the hardware at hand, which, in turn, is contingent on budget constraints and accessibility. For example, in the first study, we could only use a VAE capable of producing two bars of melody because of the limited amount of memory of our GPU. Coincidentally, at the time there was also a global shortage of GPUs due to the COVID-19 pandemic. Again, in the second study, the resolution of the images we could generate was limited for the same reason. When we were finally able to buy a better GPU for the third study, as we put the Discord bot online, we committed our only available unit to this task, which prevented us from running other experiments without creating disruptions to users. Having only one GPU also imposed a limit on the number of users we could host in a workshop.

A limiting factor for the third study was also the use of Discord as the interface for generation. While on one hand the platform provides a relatively easy interface to operate the bot, on the other hand it does require some explanation and the user experience might not be very intuitive for new users. This introduces another layer of complexity which is unrelated to the study topic, potentially introducing a confounding variable. Unfortunately, there is no easy way around this if the goal is to expose a group of participants to these technologies. Most UIs for SD are designed for one user only and therefore do not provide a queue system, which is necessary in the case of simultaneous requests.
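To make the queueing requirement concrete, the following is a minimal sketch of how simultaneous requests from several users could be serialized onto a single GPU. It is not the implementation used in the study: the names `fake_generate`, `gpu_worker` and `serve` are hypothetical, and the generation call is replaced by a short sleep so that the example runs without a model or a Discord connection.

```python
import asyncio


async def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a text-to-image call on the single GPU."""
    await asyncio.sleep(0.01)  # simulate the time spent generating
    return f"image for: {prompt}"


async def gpu_worker(queue: asyncio.Queue, results: list) -> None:
    """Process queued requests one at a time, in arrival order."""
    while True:
        user, prompt = await queue.get()
        try:
            image = await fake_generate(prompt)
            results.append((user, image))
        finally:
            queue.task_done()


async def serve(requests: list) -> list:
    """Enqueue all requests at once and let a single worker drain them."""
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    worker = asyncio.create_task(gpu_worker(queue, results))
    for request in requests:
        queue.put_nowait(request)  # simultaneous requests wait here
    await queue.join()  # block until every request has been processed
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return results


def run_demo() -> list:
    requests = [("alice", "sunset"), ("bob", "beach"), ("alice", "portrait")]
    return asyncio.run(serve(requests))


if __name__ == "__main__":
    for user, image in run_demo():
        print(user, "->", image)
```

A single worker is a deliberate simplification: with one GPU, running requests strictly one after another keeps memory use predictable, at the cost of longer waiting times as the number of participants grows.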

Another limitation of this research is the lack of a comparative study between R[] and D[] technologies. A pilot workshop aiming to formalize a protocol for this kind of experimental research was conducted in early 2022, but the gathering restrictions imposed by the COVID-19 pandemic made it difficult to put it in place. Fortunately, it is very likely that this study will eventually take place within 2023.

8.4 Further research

This doctoral thesis opens many questions but leaves them only partially answered. A topic that demands further inquiry is the study of technologies that integrate R[] and D[]. While this can be formulated as a practical problem, it does indeed hold many philosophical implications. Some examples of this type of hybrid approach are already surfacing after the release of GPT-4, such as AutoGPT¹, which combines D[] elements of GPT with traditional R[] elements by giving it access to internet search, a memory system, a way to execute code and other scripted functionalities. It is possible that this type of integration will evolve into something more sophisticated in the future, which could be able to encompass both analytic and synthetic realms.

An additional avenue for future research stemming from this doctoral study involves exploring community sharing and compositionality in user-trained models. As highlighted in Section 7.1, the active involvement of community members in developing personalized model extensions (such as custom concept embeddings, hypernetworks, LoRA, etc.) is transforming the way in which artists and creators present content to their audiences. If this trend persists, it is conceivable that these sharing platforms may evolve into a central hub for both content creators and consumers.

Furthermore, there are implications arising from the popularization of tools such as SD and GPT with regard to the understanding of truth. For example, while in the early days of photography the artifacts produced by cameras could be considered as somehow bound to reality, as of today, it is almost impossible to discern a generated image from a real one. A similar problem is evident today in the education sector, as students make use of GPT for their assignments without necessarily questioning the origin of the information that is returned by the model. While to some extent these issues already existed before DL platforms became available, the problem is now becoming extremely obvious because these tools are so easy to use and millions of people are adopting them. How will these tools change our understanding of truth?

Another direction for further research that naturally follows from the journey presented in this thesis is to investigate multi-modality in DL. As discussed in Section 7.2, language could become the unifying interface which allows us to generate all types of media using just a single tool. For example, a future version of the popular streaming site Netflix could offer the possibility to describe a setting, upload a few photos of ourselves or our friends and generate a whole TV series where we are the protagonists. While this may sound like wild speculation, it is not so hard to believe that this will be possible one day. Crude yet functional text-to-video is already a reality (Esser et al. 2023) and so are language models for audio (Borsos et al. 2022) and music (Agostinelli et al. 2023) generation.

Finally, there is one last important question to be asked, which is what, if anything at all, can slow down or stop this trend. On March 22nd, 2023, an open letter titled "Pause Giant AI Experiments"² was published on the internet, which warns:

AI systems with human-competitive intelligence can pose profound risks to society and humanity, as shown by extensive research and acknowledged by top AI labs. As stated in the widely-endorsed Asilomar AI Principles, Advanced AI could represent a profound change in the history of life on Earth, and should be planned for and managed with commensurate care and resources. Unfortunately, this level of planning and management is not happening, even though recent months have seen AI labs locked in an out-of-control race to develop and deploy ever more powerful digital minds that no one – not even their creators – can understand, predict, or reliably control.

These are legitimate concerns that the creative industry can fully relate to. It seems we are witnessing the beginning of a new era, where, once again, the impact of technology on society is about to redefine what it means to be creative.

In conclusion, I would like to share a personal reflection that may leave readers on an optimistic note. I experienced an enlightening moment while observing a stunning sunset on a tropical beach. Many individuals around me were viewing this breathtaking scene through their mobile phone screens, attempting to preserve the moment in digital form. However, given my experience working on this thesis, I felt no such compulsion, confident in my ability to generate a virtually limitless number of sunset images, even featuring myself. Thus, I was content to merely immerse myself in the experience, allowing the beauty of the moment to spark creative inspiration.