Critical Response #1

Soohyun Kim


Let's consider two cases separately: one where the musical audio generated by MusicLM is used directly as the final product, and one where it is used as a reference for human musicians or songwriters.

Also, although MusicLM's output is not yet at the level of a publishable music record in terms of audio quality and song form structure, let's assume a situation in which text-to-music generation technology has matured and its output has reached that level. (Maybe within 3 years? Maybe within 1 year?)

The most fundamental limitation of a text-to-music generation model is that it is difficult to fine-tune individual features of the audio it generates. In the actual music production industry, mixing and mastering engineers care about tuning parameters to a precision of 0.1 dB, but with MusicLM it is difficult for users to make even a simple adjustment directly, such as raising the volume of the bass guitar. A user could prompt MusicLM in text to make the bass guitar louder, but the result still would not precisely reflect a very specific intention (e.g. increase the volume of the bass guitar by 0.5 dB for a specific part of the currently generated music).

I think that tuning individual instrument tracks or audio effect parameters will remain a challenge for text-to-music generation models for several years. The reason is that building a suitable dataset (one in which each final mix is labeled with the audio of every instrument track that makes it up), which would have to be done manually by humans, is very difficult as of now. There is a massive number of records out there, but no music producers have been interested in making labeled data from the raw instrument track recordings of their songs. I am not sure we could build such a dataset within several years even if we started now.

I think that this issue can be partially solved by adding a source separation stage after the generative model and allowing users to remix the separated sources.
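To make the remixing idea concrete, here is a minimal sketch of how a user could apply a precise gain change to one separated source over a specific part of the song. It assumes the source separation stage has already run and produced time-aligned stems of equal length, represented here as plain lists of float samples. The names `db_to_gain`, `remix`, and the stem labels are hypothetical and are not part of MusicLM or of any particular separation library.

```python
import math


def db_to_gain(db):
    """Convert a level change in decibels to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def remix(stems, gains_db, sample_rate, start_s=0.0, end_s=None):
    """Sum separated stems back into one mix.

    stems:     dict of stem name -> list of float samples (hypothetical
               output of a source separation stage, all the same length)
    gains_db:  dict of stem name -> gain change in dB; missing stems get 0 dB
    The gain change is applied only to samples inside [start_s, end_s).
    """
    lengths = {len(samples) for samples in stems.values()}
    if len(lengths) != 1:
        raise ValueError("all stems must have the same number of samples")
    length = lengths.pop()

    start = int(start_s * sample_rate)
    end = length if end_s is None else min(length, int(end_s * sample_rate))

    mix = [0.0] * length
    for name, samples in stems.items():
        gain = db_to_gain(gains_db.get(name, 0.0))
        for i, x in enumerate(samples):
            # Outside the selected region the stem is left at unity gain.
            mix[i] += x * (gain if start <= i < end else 1.0)
    return mix


# Toy example: raise the bass by 0.5 dB during the second half only.
stems = {"bass": [1.0, 1.0, 1.0, 1.0], "drums": [0.5, 0.5, 0.5, 0.5]}
mix = remix(stems, {"bass": 0.5}, sample_rate=2, start_s=1.0, end_s=2.0)
```

In practice the quality of such a remix would be limited by the separation stage itself, since any leakage between the separated sources is scaled along with the intended instrument.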


When I posted the news about MusicLM on South Korea's largest web forum for musicians (which many professional musicians use), the first reaction from the musicians was, “If I use audio generated by MusicLM only as a guide demo, then play it myself, record it, and mix/master it into a separate publishable record, I will be able to pretend it is my own song, even though I did not compose it and only recorded and produced it.” In a case like this, where MusicLM output serves as a demo from which human musicians or songwriters produce a record, it is very hard to tell whether the song was AI-generated. It is also unclear who owns the copyright of the song once its record is published.


Lastly, let's discuss music generated by MusicLM from the perspective of music consumers. Here is one important aspect of the music business: when someone likes a song by a certain singer-songwriter, is it solely because of the quality of the song? Certainly not. The personal appeal, appearance, charisma, or personality of the singer-songwriter accounts for a large part of the reason, and these are what make music consumers open their wallets (or take out their credit cards). Can people become fans of MusicLM? Will people follow MusicLM’s Instagram? Will people want to pay money for MusicLM? I believe that is unlikely as of now.

I think one solution to this issue is to create a virtual e-girl persona for MusicLM, with a cute appearance and an attractive character, whom music consumers can idolize.