What do you (really) want from AI Music generation?

Kunwoo Kim


MusicLM’s performance was measured by adherence to the text descriptions and by audio quality (Agostinelli et al., 2023). My initial reaction as I listened through their examples was surprise: the music impressively embodied the nuances of the text.

“Audio quality” is a vague term, and I do not quite understand the FAD score (nor how audio quality can be quantified in the first place), yet my subjective reaction was weighed down by the low sampling rate, which automatically gives me an impression of not-so-good ‘audio quality’. They do mention that they will try to integrate a higher sampling rate in future studies, but why 24 kHz? I could have missed it, but I assumed it was chosen for efficiency in training and generation; still, having all the resulting music sound like it is playing from a cheap radio is a design choice that deserves reconsideration.
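For what it is worth, the FAD (Fréchet Audio Distance) score the paper relies on is the audio analogue of FID for images: embed reference audio and generated audio with a pretrained classifier, fit a Gaussian to each set of embeddings, and compute the Fréchet distance between the two Gaussians. Below is a minimal sketch of that final distance computation; the function name and the random “embeddings” are illustrative, not from the paper, and in practice the embeddings would come from a pretrained model such as VGGish.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_real, emb_gen):
    """Fréchet distance between two sets of audio embeddings.

    Each row is one embedding vector (in practice, from a pretrained
    audio classifier). Both sets are modeled as multivariate Gaussians;
    the score is the Fréchet distance between those two Gaussians.
    """
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    # sqrtm can return tiny imaginary parts from numerical error; drop them.
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy check with synthetic "embeddings": identical distributions score
# near zero, while a mean-shifted distribution scores higher.
rng = np.random.default_rng(0)
same = rng.normal(size=(500, 8))
shifted = rng.normal(loc=0.5, size=(500, 8))
print(frechet_audio_distance(same, same))     # near 0
print(frechet_audio_distance(same, shifted))  # clearly larger
```

A lower FAD thus means the generated audio’s embedding statistics sit closer to those of real reference audio, which is why it serves as a proxy for perceived quality even though it never listens to the music directly.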

I think the instrumentation and rhythms were integrated well, but the music all seemed to be missing a kind of intention, message, or idea (maybe this is my stereotype of AI). For example, some examples are prompted with a “catchy melody”, yet I could not quite catch it. Subtle nuances like accents, grooves, or memorable melodic contours seem to be missing. However, this could be where humans intervene, as the authors specifically mention that MusicLM could be a creative tool to “assist” music creation. Listening to the examples as finished musical experiences may not do them justice.

When could AI Music generation such as MusicLM be used? One initial idea that comes to mind is a ‘mood board’. I remember making a game with my friends: for each situation, such as characters like a two-faced princess or three dumb goblins, or locations like a bat cave or a backstage, I composed 30 to 60 seconds of music. Sometimes this music helped my friends come up with visual illustrations or a story. I think MusicLM could be a great way of generating ‘moods’; each example I listened to gave me an interesting impression of a different ‘mood’.

However, could AI Music generation ever go beyond a means to an end? I personally believe so. In the near future, AI music generation will likely produce music that really cannot be differentiated from human-created music. But could it expand the boundary of music? When computer music arrived in the realm of music, it significantly broadened the musical world with novel timbres, synthesis techniques, and more. However, could a generator limited to clips composed by expert musicians produce something novel enough to widen that boundary? I am not sure.

What I ultimately want from AI Music Generation (frankly, I don’t really know what I want) might be novelty. Yes, it is super awesome to experience how music adheres to a given text, and it seems useful in many ways, but its end product is another form of Western music contained within the given clips. One day, I hope AI Music Generation develops idiosyncrasies of its own, enough to be a distinct genre; another expansion of the boundaries of music; something more than useful, perhaps beautiful.