Concept Formation in Computational Creativity

7 Discussion

The studies presented offer evidence as to why we should adopt a post-phenomenological lens when examining human and non-human creativity in the context of art and design. The post-phenomenological perspective provides a way of looking at computational creativity that does not separate humans and society from the technology that shapes them, allowing us to form a more comprehensive and trans-disciplinary understanding of the phenomena we observe in this context. This is particularly important in today’s world, where the boundaries between the human and the non-human are becoming increasingly blurred. In particular, the studies highlighted how the trending data-driven approach might bring about new opportunities and challenges for creative practices, as it affects the way output is consumed and evaluated by the audience. This chapter further discusses the key insights that emerge from the studies in relation to TOCs and creativity theories, with the overarching goal of providing an outlook on what computational creativity might evolve into if this trend continues.

7.1 Datasets curation

Datasets are the central aspect of D[] contextualizations. Computational systems that learn must inevitably learn from something. In small-scale scenarios, such as those described in Chapters 4 and 5, the need for large amounts of training data in most implementations makes it extremely difficult for practitioners to use data-driven tools effectively without relying on external data sources. While on the one hand we can blame the inherent limitations of DL for this, on the other we must take into account that creative practices do not exist in a vacuum. Artists and designers often have a degree of visibility into other methods and products, previous and contemporary, which might consciously or unconsciously affect them. Boden’s (2003) account of combinational and exploratory creativity aptly describes this process of repurposing and representing old elements in a new way. Bruno Latour also points out the inherent remedial nature of design. What artists and designers do in their practice is perhaps better understood not as creation, but as selection of the old, abstraction and combination, to put it in ACASIA terms. The emphasis here is on the selection of something as input, which is then decomposed and reconstructed anew.

Putting together a large-scale dataset can be a daunting task, let alone having the resources to train a model at the scale we have witnessed in recent years. It is remarkable that, as the scale increases, the prediction accuracy does not plateau, making tools like GPT, Dall-E and SD possible, but in the long run this race for scale might also have detrimental effects in terms of market concentration and accessibility. The impact of this is already evident from the literature review of CC, which shows a decreasing number of novel systems being developed. First-hand experience also confirms this difficulty, as presented in the first two studies.

The response of the open-source community has perhaps shown a possible solution to this issue, both in practical and theoretical terms. For example, textual-inversion (Gal et al. 2022) provides a way to add token-embeddings and hyper-networks based on as few as 3-4 images. These custom-trained concepts can then be combined with other tokens to generate virtually infinite visual outputs. Dreambooth (Ruiz et al. 2022) also provides a way to fine-tune SD with custom visual content, yet this results in an entirely new model, which is inconvenient for sharing and makes merging with other models difficult. Low-rank adaptation of LLMs, or LoRA (Hu et al. 2021), was applied to SD and released in February 2023. LoRA removes the need to re-release the whole model when fine-tuning, while also providing a way for users to extend SD’s concept library with reduced memory requirements, allowing training on consumer hardware. Alongside fine-tuned models, LoRAs are then easily published as relatively small files on online sharing platforms like Civitai (https://civitai.com).
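To make the size argument concrete, the following is a minimal sketch of the low-rank update described by Hu et al. (2021), written in plain Python rather than with any DL library. The function names, the toy dimensions and the random seed are assumptions introduced here for illustration; they are not taken from the studies or from any SD tooling.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_delta(B, A, alpha, rank):
    """Low-rank weight update: (alpha / rank) * B @ A."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(B, A)]

def apply_delta(W, delta):
    """Return W + delta, leaving the frozen base weights untouched."""
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

def lora_param_count(d_out, d_in, rank):
    """Number of trainable values in one adapter (B is d_out x r, A is r x d_in)."""
    return rank * (d_out + d_in)

random.seed(0)
d_out, d_in, rank, alpha = 6, 8, 2, 4.0  # toy sizes, chosen only for readability

# Frozen base weights (stand-in for one layer of a pre-trained model).
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Adapter factors: B starts at zero and A is small and random, so the
# adapted layer is identical to the base layer before any training.
B = [[0.0] * rank for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]

W_adapted = apply_delta(W, lora_delta(B, A, alpha, rank))

full_params = d_out * d_in
adapter_params = lora_param_count(d_out, d_in, rank)
```

At these toy sizes the saving is modest (28 values instead of 48), but it grows quickly with layer width: for a hypothetical 768 x 768 layer at rank 4, `lora_param_count(768, 768, 4)` gives 6,144 values against 589,824 for the full matrix, which is why only the small factors need to be shared.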

LoRAs can also be combined with one another, thus enabling interoperability across user-defined visual concepts. This opens up the possibility for community-powered compositionality of user-trained models. In a sense, such a configuration is not unlike what Jackendoff (1989) and his neo-classical theory would predict: a set of core concepts (SD base model) and a series of semantic fields (user-trained models) which adjust and complete the base model by incorporating examples that are more personal and specific (see Section [sec:the-regress-problem-for-semantic-fields]).
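The combination of several adapters can be written compactly. The expression below is a sketch that assumes the additive, weighted merging commonly offered by community tools; the symbols W_0 (frozen base weights), B_i and A_i (the low-rank factors of the i-th adapter), α_i and r_i (its scaling constant and rank) and λ_i (a user-chosen mixing weight) are notation introduced here for illustration.

```latex
W' = W_0 + \sum_{i=1}^{k} \lambda_i \, \frac{\alpha_i}{r_i} \, B_i A_i,
\qquad
B_i \in \mathbb{R}^{d_{\mathrm{out}} \times r_i},\quad
A_i \in \mathbb{R}^{r_i \times d_{\mathrm{in}}}
```

Read in Jackendoff's terms, W_0 plays the role of the core concepts and each term of the sum that of a semantic field. Because the terms are simply added, nothing in this formulation prevents two adapters from interfering with each other, so in practice the mixing weights are usually tuned by trial and error.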

In this context, I bring forward a consideration about what the shift of attention towards D[] might bring for artists and designers. It may very well be that, in the future, practitioners will choose to release their particular graphic style, interior design, or product aesthetic as a generative model, rather than as a set of fixed outputs such as images, soundtracks, text or 3D models. Much like the generative art movement in the 1960s was not about the content, this wave of user-generated text-to-image models might take the same path and shift the attention towards the Process perspective and away from a Product-centric view, because content is becoming far too abundant to be of any value.

In the not-so-distant future, the content producers of today might be required to evolve their skills to better understand the foundations of DL and, more specifically, to learn how to build their own datasets and train their own models. Extending large pre-trained models with unique subjects or styles might be one of the ways individual artists can provide creative value to their audiences. Similarly, fine-tuning weights with further examples to integrate models into the flow of applications and interfaces might become the order of the day for creative practitioners and designers. Customization of the generative capabilities of these tools requires both knowledge and skill. The selection of examples representing a coherent visual concept is perhaps an art that can be learned and mastered, a process that can reveal our biases and challenge our imagination. The combination-making capabilities of language might then expand the reach of our creations and let us explore conceptual spaces through computable compositionality.

By understanding the use of computational tools in the creative process as an act of engaging in human-technology-world relations, we can further refine and give meaning¹ to the interactions we have with any tool or instrument. In particular, as suggested by the differentiation between R[] and D[] considered in this thesis, the contextualization acting in the background of a specific interaction is an important aspect to consider when engaging with computational tools for creative tasks. Furthermore, the fitness of the two contextualizations varies based on the ACASIA component that they are applied to.

As highlighted in all studies, training a model can be thought of as the act of creating a conceptual space by assessing the similarities across all examples provided. By doing so, the model is performing its own alien² abstraction (as data compression), so that associations and combinations can happen in latent space. The modularity of the ACASIA (Hoorn 2014) meta-model of creativity is particularly useful here because it allows us to compare the two approaches by looking at how individual modules are implemented. The studies suggested that, while in R[] contextualizations the abstraction has already been performed by humans defining analytical rules, in D[] high-level human-compatible abstraction is achieved through large-scale datasets and training. Because of this, D[] only affords the illusion of analyticity (GPT is an unreliable calculator, SD struggles to draw exactly two circles, etc…) and its abstraction is much more attuned to the complexities of human perception, as opposed to cognition.
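The idea of association and combination happening in latent space can be illustrated with a toy example. The sketch below does not train anything: the three-dimensional codes are hand-picked, hypothetical values standing in for what a model would learn from data, and the function names are introduced here only for illustration. Association is approximated as a nearest-neighbour lookup by cosine similarity, and combination as linear interpolation between two codes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(query, space, exclude=()):
    """Association: the concept whose latent code is most similar to the query."""
    candidates = {name: code for name, code in space.items() if name not in exclude}
    return max(candidates, key=lambda name: cosine(query, candidates[name]))

def blend(u, v, t=0.5):
    """Combination: linear interpolation between two latent codes."""
    return [(1 - t) * a + t * b for a, b in zip(u, v)]

# Hand-picked codes (hypothetical): furniture-like concepts load on the first
# axis, fruit-like concepts on the second.
latent = {
    "chair":   [0.9, 0.1, 0.0],
    "stool":   [0.8, 0.2, 0.1],
    "avocado": [0.0, 0.9, 0.3],
    "pear":    [0.1, 0.8, 0.4],
}

associated = nearest(latent["chair"], latent, exclude=("chair",))
avocado_chair = blend(latent["chair"], latent["avocado"])
```

Here `associated` is `"stool"`, and the blended code sits between the two source concepts: it is closer to "chair" than "avocado" is, and closer to "avocado" than "chair" is. In a trained model the codes have far more dimensions and are not directly interpretable, which is part of what makes the abstraction alien.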

In conclusion, attention to Process is required in order to select examples, form new combinations, iterate and learn. While the evaluation of output (Product) is necessary to improve, the adjustments are typically to be made in the Process. Without a perspective on the technological relations that exist between the human and the non-human, the intersection of technology and creativity is hard to navigate. As shown in Chapter 5, reflective practices act as a compass, making us aware of how we expect these tools to behave and helping us distinguish this from how they actually work. This developed self-awareness can increase the effectiveness and range of expression of using technology for creative endeavors. With intimate knowledge of the creative Process of a specific technology, it is possible to experiment with its boundaries in search of unexpected behaviors (Hoorn 2023).

7.2 Language as interface for concepts

As the text-to-image study (Chapter 6) showed, language can be a powerful interface for visual concepts. Similar types of text conditioning have been successfully applied to audio (Borsos et al. 2022; R. Huang et al. 2023; H. Liu et al. 2023; J. Huang et al. 2023; Dong et al. 2023; Ghosal et al. 2023) and music (Agostinelli et al. 2023; Schneider et al. 2023; K. Chen et al. 2023; Melechovsky et al. 2023), which suggests that language can serve as a multi-modal interface to multiple forms of perception. However, as the literature review has thoroughly exposed, it would be wrong to state that we can describe the nature of concepts only in linguistic terms. Doing so would require either relying on formal definitions, hence falling into all the problems of CT, or limiting the study of concepts to lexical concepts, one of the limitations of NT. Reflecting on the studies conducted, it became clear that having language as an interface for conditioning DL models was a tremendous asset, which unlocked the immense creative potential existing in latent space. A hypothesis that could follow from these ideas is to think of language as a technology in itself, so we could write:

(I−Language) → Speech

Ihde (1993) is cautious about addressing language as a technology and argues that because language involves human intentionality and meaning-making processes - something he calls - it cannot simply be treated as an objectified tool like other forms of technology. Instead, understanding how people use language requires an appreciation for the cultural contexts in which they operate, including shared assumptions about meanings and values. Ihde notes that unlike most technologies, which have specific functions and purposes (e.g., a hammer is designed for pounding nails), language has an open-endedness and flexibility in terms of what it can express. This means that while we use language to achieve particular ends (such as conveying information or persuading someone), there is no limit to what we might say using this medium. Ihde also warns that addressing language purely as a technology risks missing out on its deeper dimensions such as creativity, playfulness, and poetic expression.

Researchers in other fields such as biolinguistics (Koster 2009) and evolutionary linguistics (Mufwene 2013) have also brought forward the idea that language can be seen as technology. These views typically follow from Chomsky (1957) and his theory that language is an innate ability that humans are born with - it is part of our genetic makeup or universal grammar. Assessing language as a technology heavily depends on what we mean by Language and Technology, which are both quite expansive and controversial concepts themselves, but as should be evident since Section [sec:platosproblem], looking for exact formal definitions seems to be a dead end.

I will explore here the perspective that a Language can be broadly understood as a compositional system for concepts. I will also adopt the perspective that a Technology is anything that mediates our experience of the world. These two perspectives might then allow for Language to be a Technology that mediates our experience of concepts. According to the post-phenomenological notation, we might formalize how language is acting as an interface for concepts as follows:

I → Mathematics/(L_M→R[C₁,C₂,...,C_n])(−Result)
I → Storytelling/(L_S→D[C₁,C₂,...,C_n])(−Story)

Where L_x is a form of language and C_n can be any concept, such as number, addition, cause, character, die and so on. While in the first example the use of R[] is reflecting the analytic nature of mathematics, the use of D[] in the second example highlights the synthetic nature of storytelling, a process influenced by the author’s experience and cultural context. In both examples the hermeneutic intentionality (Ihde 1993) sets the concept contextualization required by the specific form of language.

It could be argued that for the storytelling example there are some elements of R[] as well. A more obvious case is that of a poem that follows a specific meter, such as a sonnet. It seems that R[] and D[] can happily coexist in the same mediation. It is perfectly possible to have:

I → Poetry/(L_P→(R[Meter]−D[Theme]))(−Sonnet)
I → Music/(L_Mu→(R[Tempo]−D[Theme]))(−Song)
I → Architecture/(L_A→(R[Materials]−D[Aesthetics]))(−Building)

Once again, language is intended here very broadly as a compositional system, grounded in concept tokens and a representation of their relationships. The ability of compositional systems (i.e. languages) to deal with both R[] and D[] contextualizations is perhaps their most remarkable feature, as it allows us to bridge perception and cognition. As the experience of the first two studies suggests, similarity spaces constructed without language-like elements in support of compositionality are relatively difficult to control. In the third study, which leverages the compositional properties of word tokens and large-scale pre-trained models, the relationships between certain words and the visual representations of the concepts they refer to become immediately available to the users as an interface to visual concepts. Arguably, this is why in the third study most participants picked up relatively easily on how to generate content using Botticello.

As a final remark, it should be noted that as LLMs evolve to be more powerful and capable, they will also become multi-modal. GPT-4 already incorporates the ability to describe, generate and modify images based on natural language. It is only a matter of time before the compositional capabilities of these tools unify other domains under the same roof. However, it is still unclear how future DL solutions will integrate or achieve the ability to deal with concepts analytically, and whether this is possible at all using machine learning.

7.3 Blurring human and non-human creativity

The concepts of ownership and authorship in art and design are constantly challenged by technological advancements, and the current DL trend is perhaps a challenge that will leave a lasting trace. For example, there are many concerns regarding whether the output of image generators such as Dall-E or SD is to be considered derivative work because it relies on existing visual content that has been taken from the internet without explicit consent from the authors (Brittain 2023). There are also many cases in which using these tools can be considered fair use, especially when there is substantial tweaking of the generation parameters, custom datasets or manual post-processing efforts. In all studies presented, the boundaries of human and non-human efforts are being twisted and blurred in different ways. This section addresses the implications of this ambiguity and how existing creativity theories might adjust to accommodate it.

In the attempt to frame CC systems under the four Ps of creativity, Jordanous (2016) suggests changing Person into Producer, in order to accommodate non-human agents as potential actors or co-actors. While this extension is welcome, it perhaps also leaves out other aspects that are characteristic of CC, for example how a specific technology (Process) might affect the criteria that the audience adopts to deem an artifact creative (Press). Furthermore, communities of practice (Press) might foster individual learning (Person), which in turn will affect the way technology is used or even its core functioning, as seen in the numerous open-source projects spawned after the release of SD (Process). Separating the different perspectives might have its benefits, but it also promotes a reductionist view of creativity, which gives an incomplete picture.

The ACASIA meta-model by Hoorn (2014) does not, by definition, take into consideration the nature of the agent performing each module, and this agnostic stance makes the model better suited to describe these interactions. I believe that it might be valuable to express more specifically, within the ACASIA modules, how the interaction between the human and the non-human shapes the creative process. For example, in the third study, the interaction loop happened much faster compared to the first two studies, in which the technology only allowed two iterations in total. As discussed in Section 6.5, the shorter generation time of SD affords both human and non-human association and combination in extremely short cycles. This fast-paced interaction dramatically increases the speed at which a technology’s background relations can become more transparent to the user.

Under the post-phenomenological view, the use of a technology for creative endeavors can be both personal and impersonal. On one hand, the non-human can be perceived as external to us, as is the case for alterity relations, where technology is to be considered as a quasi-other. On the other hand, a technology can become more transparent as its user gains insight into its background relations, to the point where hermeneutic intentionality becomes possible through the interaction with it. With regard to the studies presented in this thesis, the affordances of specific DL tools seem to have had a significant impact in defining these boundaries. In addition to the increased speed of the iteration cycle discussed earlier, the introduction of text-based conditioning provides a more transparent interface, given that the compositionality of language is so familiar to us.

In the last two years, it has not been uncommon to hear the term prompt engineer, referring to someone who specializes in the craft of writing prompts for LLMs or LDMs. Arguably, this skill requires a deeper understanding of a D[] contextualization, which is acquired primarily by experimentation. Prompt-engineering is characterized by the ability to see through the opacity of alterity relations and gain intimate knowledge of the technologies acting in the background of DL models. By stepping into the technological mediation and understanding its components, one is then able to see the bigger picture and have full control over the technology. Seeing through alterity relations enables us to become aware of a technology’s inner functioning, and this awareness affects the mediation itself, making us perceive the output as more ours.

However, there is no shared understanding of authorship and ownership in the community of practice, because most users approaching GDL use algorithms and models off the shelf, often with little knowledge about their internal functioning. If one comes up with the idea of an avocado-shaped chair and uses an LDM to visualize what it may look like, is it still possible to argue that the final image was made by the human just because they typed the right sequence of words as a prompt? Is the model simply solving a production problem? Can a prompt engineer be considered the author of a prompt? These are extremely difficult questions to answer, but one thing is certain: the value of merely executing existing ideas is diminishing rapidly in the face of these technological advancements. As DL tools become more accessible and widespread, fostering unique and innovative approaches to using these technologies in creative practices becomes increasingly crucial. Consequently, the focus should shift towards developing a deeper understanding of the underlying mechanisms that drive these systems. In doing so, we can strive to ensure that the resulting creative outputs maintain a sense of authenticity and originality, even as the lines between human and non-human contributions continue to blur.

I think we should think of AI as the intellectual equivalent of a backhoe. It will be much better than us at a lot of things.