Counterfactuals in music generation

Introduction

The space of human-AI co-creation of content is ripe with possibility; machine learning models can be used to augment human abilities, resulting in outcomes that would not be possible with humans or AI alone. However, many state-of-the-art ML systems today act as “black boxes” that don’t afford end-users any control over their outputs. In the context of creativity, we desire ML systems that are both expressive in their outputs and controllable by an end-user. Specifically in the context of music generation, current models are designed more for listeners than composers. While generative models such as MuseNet and GANSynth can create outputs with impressive harmonies, rhythms, and styles, they lack any method for the user to refine those features. If the user doesn’t like the musical output, their only option is to re-generate, producing a completely different composition. Moreover, changing the way these models generate output requires machine learning expertise and hours of training time, which is not feasible for composers.

Counterfactual inference is reasoning about “what would have happened, had some intervention been performed, given that something else in fact occurred” (Bareinboim et al. 2020). Counterfactual scenarios are useful for reasoning about which components of a model are necessary and/or sufficient to produce certain kinds of outputs.
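
To make this concrete, here is a toy, self-contained sketch of a counterfactual query on a two-variable structural causal model (the variables and equations are invented for illustration, not taken from the paper): we hold the exogenous "noise" of the factual world fixed and re-run the model under a different intervention.

```python
# Toy illustration of a counterfactual on a structural causal model.
# Hypothetical variables: a control knob influences brightness, which
# influences a listener's rating.

def simulate(knob, noise_b, noise_r):
    brightness = knob + noise_b          # structural equation for brightness
    rating = 2 * brightness + noise_r    # structural equation for rating
    return brightness, rating

# Factual world: knob = 0.3, with some fixed exogenous noise.
noise_b, noise_r = 0.1, -0.2
_, factual_rating = simulate(0.3, noise_b, noise_r)

# Counterfactual: "what would the rating have been had the knob been 0.8,
# given the same underlying circumstances (i.e., the same exogenous noise)?"
_, counterfactual_rating = simulate(0.8, noise_b, noise_r)

print(factual_rating, counterfactual_rating)
```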

Relevant Background

On Pearl’s Hierarchy and the Foundations of Causal Inference -- the foundations of counterfactuals, rooted in causal inference / probability

Counterfactual Visual Explanations -- a good example of how counterfactual inference can be used to produce human-readable explanations of machine learning classifiers

Active learning of intuitive control knobs for synthesizers using Gaussian processes -- paper where the authors learn mappings from synthesizer control space to a high-level concept space (as defined by user ratings)

Updates

Weeks 1 & 2

These first two weeks were primarily spent on literature review. I learned a lot of background on causal inference, largely informed by Pearl's Causal Hierarchy.

Week 3

This week, I tentatively narrowed the scope of the project to music generation. I have been experimenting with various music generation models from Google Magenta, including the Music Transformer and Coconet.

Week 4

Still playing around with the generative models, trying to get some intuition into their workings and what parameters I can adjust. The orderless composition property of Coconet is particularly interesting — it seems like the sampling strategy is non-deterministic, and we could run a counterfactual in the vein of "what would have happened if the notes were generated in a different order..."
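
As a rough, self-contained sketch of that counterfactual (not Coconet's actual code), the comparison would look something like the following, where predict_note is a deterministic placeholder standing in for the trained model's conditional distribution over a masked note:

```python
# Fill in masked notes in two different orders and compare the results.

def predict_note(piece, position):
    # Placeholder "model": derives a pitch from the unmasked context.
    # (Real Coconet samples from a learned conditional distribution.)
    context = [p for p in piece if p is not None]
    return 60 + (sum(context) + position) % 12 if context else 60

def fill(piece, order):
    piece = list(piece)
    for pos in order:
        piece[pos] = predict_note(piece, pos)
    return piece

masked = [60, None, None, 67, None]          # toy melody with masked notes
positions = [i for i, p in enumerate(masked) if p is None]

order_a = positions                           # "factual" generation order
order_b = list(reversed(positions))           # counterfactual: different order
print(fill(masked, order_a))
print(fill(masked, order_b))
```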

Week 5

I am now using the DDSP framework in my project. My project abstract can be viewed here. DDSP is particularly interesting because rather than generating audio directly with a neural network, it uses neural networks to drive the parameters of traditional sound synthesis, which in turn generates the audio. This greatly reduces the number of parameters that the neural networks need, and thus drastically reduces the amount of training data needed to produce models that can do impressive feats such as timbre transfer.
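
A minimal sketch of the core idea (not the ddsp library's actual API): a handful of interpretable parameters, which in DDSP would come from a decoder network, drive a classic additive/harmonic synthesizer that produces the waveform.

```python
import numpy as np

sample_rate = 16000
n_samples = sample_rate  # one second of audio

# Pretend these came from a decoder network: fundamental frequency,
# overall amplitude, and the relative weight of each harmonic.
f0 = 220.0
amplitude = 0.5
harmonic_weights = np.array([1.0, 0.5, 0.25, 0.125])
harmonic_weights /= harmonic_weights.sum()

# Additive synthesis: sum of sinusoids at integer multiples of f0.
t = np.arange(n_samples) / sample_rate
audio = np.zeros(n_samples)
for k, w in enumerate(harmonic_weights, start=1):
    audio += w * np.sin(2 * np.pi * k * f0 * t)
audio *= amplitude  # a handful of parameters -> a full waveform
```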

One particularly interesting work that DDSP cites is this paper: Active learning of intuitive control knobs for synthesizers using Gaussian processes. In this paper, they learn high-level "knobs" that map from synthesizer control space (parameters such as F0, amplitude, harmonic distribution, etc.) to high-level concepts such as "scariness" or "steadiness." This is relevant and interesting: they are building an interactive ML system for music generation, and one that is ripe for exploring the space of counterfactual possibilities (e.g., if the "scariness" knob had been set lower, what would the composition have sounded like?).
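
For intuition, here is a minimal sketch of that kind of mapping using scikit-learn's Gaussian process regressor; the control parameters and "scariness" ratings below are made up, and this is not the paper's implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Each row of X is a synthesizer setting: [f0_hz, amplitude, brightness].
# Each y is a (made-up) user rating of "scariness" in [0, 1].
X = np.array([[110, 0.2, 0.1],
              [220, 0.5, 0.4],
              [440, 0.8, 0.9],
              [330, 0.6, 0.7]])
y = np.array([0.9, 0.5, 0.1, 0.3])

# Anisotropic RBF kernel: one length scale per control parameter.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=[100, 0.5, 0.5]))
gp.fit(X, y)

# Predict how "scary" an unseen setting would sound, with uncertainty.
mean, std = gp.predict(np.array([[165, 0.3, 0.2]]), return_std=True)
```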

Another interesting paper in this space of learning high-level concepts through Bayesian optimization is Design Adjectives.

Week 6

I have been diving more into the algorithm behind the "intuitive control knobs" paper discussed above — specifically, the Bayesian optimization they use to learn the mapping from control space to high-level concept space. This notebook serves as a good interactive introduction to the relevant concepts. Essentially, the algorithm learns an uncertain approximation of the real mapping, by iteratively polling the user for a high-level concept rating of an input, and using that input-rating pair to update the posterior distribution.
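
A hedged sketch of that loop (my own simplification, not the notebook's code): rate_by_user is a stand-in for the real rating UI, and the acquisition rule here is simply "query the candidate the model is most uncertain about."

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def rate_by_user(x):
    # Placeholder for the human-in-the-loop rating; a made-up preference.
    return float(np.exp(-np.sum((x - 0.7) ** 2)))

candidates = np.random.rand(50, 2)            # points in a 2-D control space
X, y = [candidates[0]], [rate_by_user(candidates[0])]
chosen = {0}

gp = GaussianProcessRegressor()
for _ in range(10):
    gp.fit(np.array(X), np.array(y))          # update the posterior
    _, std = gp.predict(candidates, return_std=True)
    std[list(chosen)] = -np.inf               # don't re-query rated points
    i = int(np.argmax(std))                   # most uncertain candidate
    chosen.add(i)
    X.append(candidates[i])
    y.append(rate_by_user(candidates[i]))     # poll the "user" for a rating
```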

Building off of this tutorial, I have made a very rudimentary "knob" that can be trained by the user to output a simple sound.

A couple of questions/difficulties I have are 1) which specific sound synthesis parameters to control (and how coarse-grained they should be), and 2) how to "encourage" the Bayesian optimization to converge to interesting outputs. There are several different "settings" of the algorithm that I need to explore, such as which acquisition function to use.
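
One acquisition function I'm likely to try is expected improvement; below is the standard formula as a small helper (general-purpose, not tied to any particular paper's implementation), which could plug into the loop sketched above in place of the "most uncertain candidate" rule.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best_y, xi=0.01):
    # Expected improvement of each candidate over the best rating so far,
    # given the GP's predictive mean and standard deviation.
    std = np.maximum(std, 1e-9)               # avoid division by zero
    z = (mean - best_y - xi) / std
    return (mean - best_y - xi) * norm.cdf(z) + std * norm.pdf(z)

# Usage with a fitted GP:
#   mean, std = gp.predict(candidates, return_std=True)
#   x_next = candidates[np.argmax(expected_improvement(mean, std, max(y)))]
```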

I also met with Tobi Gerstenberg, a professor doing work on counterfactual judgment in psychology. It led to some interesting discussions: rather than just counterfactuals, it may make more sense to use direct intervention to provide feedback, and then, after the high-level concepts have been learned, present the user with counterfactual output examples that vary along the dimensions of those concepts.

Another idea I had was to have a "hierarchy" of learnable concepts. The motivation is that sometimes when training these knobs, I noticed that one feature of the sound was what I wanted, but another was not. For example, "a sound may have a 'sharp attack' like a guitar but may 'warble' too quickly to be one, and a composer may still rate it above average because it possess at least one of the main characters of a guitar sound." (Huang et al. 2014)

The idea is that the user can "drop down" from a higher-level concept to a lower-level one, teach the model that concept, and then in the case where the low-level concept was unsatisfactory, the user can adjust that concept up, while maintaining the other desirable quality.

Week 7

In my quest to build improved "knobs" that can be learned, I'm trying to build intuition for which synthesis parameters to control and how fine-grained that control should be (since the parameters can be different at every timestep, it can be impossible to sufficiently explore the whole parameter space). Julius brought up the concept of an envelope in class, which seems like a useful abstraction: it provides users with a degree of control while not requiring every single timestep to be defined separately.

I made a new "knob" where you can interactively teach the algorithm the attack, decay, sustain, and release of the F0 envelope. Here's the very messy Colab notebook.
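
For reference, here is a rough sketch of the four-parameter envelope abstraction itself: the user (or the learned knob) only picks attack, decay, sustain, and release, and the full per-timestep curve is derived from those. The numbers are arbitrary.

```python
import numpy as np

def adsr(attack, decay, sustain, release, duration, sample_rate=16000):
    # Attack/decay/release are times in seconds; sustain is a level in [0, 1].
    n = int(duration * sample_rate)
    a = int(attack * sample_rate)
    d = int(decay * sample_rate)
    r = int(release * sample_rate)
    s = max(n - a - d - r, 0)
    env = np.concatenate([
        np.linspace(0.0, 1.0, a, endpoint=False),      # attack: ramp up
        np.linspace(1.0, sustain, d, endpoint=False),   # decay: fall to sustain
        np.full(s, sustain),                            # sustain: hold
        np.linspace(sustain, 0.0, r),                   # release: fade out
    ])
    return env[:n]

# e.g. a one-second envelope that could scale an F0 or amplitude contour:
env = adsr(attack=0.05, decay=0.1, sustain=0.7, release=0.2, duration=1.0)
```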

The last couple of days I've been exploring wavetables as a parameter space for the user to control; the learned wavetable is then fed into a wavetable synthesizer. I found this video to be particularly intuitive and inspiring in terms of the vast range of evolving sounds you can make with wavetables. Letting the user learn a wavetable is interesting because, given a melody, the user can essentially refine the timbre.
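
A minimal sketch of what this parameter space looks like as a synthesizer (a generic wavetable oscillator, not any particular library's implementation): a single-cycle table is read out repeatedly at the desired pitch with linear interpolation, so learning the table changes the timbre while the melody stays fixed.

```python
import numpy as np

def wavetable_osc(table, f0, duration, sample_rate=16000):
    n = int(duration * sample_rate)
    # Phase advances through the table at f0 cycles per second.
    phase = np.cumsum(np.full(n, f0 * len(table) / sample_rate)) % len(table)
    idx = phase.astype(int)
    frac = phase - idx
    nxt = (idx + 1) % len(table)
    # Linear interpolation between neighboring table entries.
    return (1 - frac) * table[idx] + frac * table[nxt]

# A saw-like single-cycle table; in the project, this is what the user refines.
table = np.linspace(-1.0, 1.0, 512)
audio = wavetable_osc(table, f0=220.0, duration=1.0)
```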

Also, on the counterfactual side, I have refined the idea I talked about last time: I found a paper on Bayesian optimization and attribute adjustment that is very relevant algorithmically. Essentially, after learning one mapping from control space to a high-level attribute, this algorithm enables the user to adjust another attribute while remaining invariant to the original attribute. This would be great for allowing the user to provide the algorithm with counterfactual examples during training.
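
As a rough paraphrase of how I imagine using this (my own framing, not the paper's actual algorithm), the adjustment step could be posed as a constrained search over candidate settings: raise attribute B's predicted value while keeping attribute A's prediction close to its current value. Here gp_a and gp_b stand for two learned attribute models, e.g. GPs like the ones sketched above.

```python
import numpy as np

def adjust(gp_a, gp_b, x_current, candidates, tolerance=0.05):
    # Among candidate settings, keep those whose predicted attribute A stays
    # within tolerance of its current value, then pick the one that maximizes
    # the predicted attribute B.
    a_target = gp_a.predict(x_current.reshape(1, -1))[0]
    a_pred = gp_a.predict(candidates)
    b_pred = gp_b.predict(candidates)
    feasible = np.abs(a_pred - a_target) < tolerance
    if not feasible.any():
        return x_current
    return candidates[feasible][np.argmax(b_pred[feasible])]
```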

Week 8

The Bayesian optimization and attribute adjustment paper uses a VAE to guide the Bayesian optimization -- also very relevant for my project. I'm now working towards implementing the algorithm described in the paper; the authors are from Stanford, so I'm in touch with them and getting some of the implementation details, such as which libraries they used to implement the Bayesian optimization.


Week 9

I have successfully been able to overfit a VAE to training data. Now working on setting up a subset of the NSynth dataset for more large-scale training.
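
For reference, the VAE I've been overfitting is on the order of the tiny model below (a generic sketch; the input dimensionality and architecture are placeholders, not the final NSynth setup).

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, input_dim=1024, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, input_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z from N(mu, sigma^2).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction error plus KL divergence to the standard normal prior.
    recon = nn.functional.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```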

Also, in parallel, I talked to Stephan Eismann, the first author of the Bayesian optimization and attribute adjustment paper mentioned last week. He offered some helpful tips about re-implementing the algorithms in his paper, and also pointed me to a challenge with the project -- the attributes that users can adjust need to be known at training time. So perhaps I will have to predefine some attributes of the sound that can be quantitatively assessed at training time and train a representation that is invariant to those. Users can still teach the algorithm a user-defined high-level concept, but the lower-level adjustable concepts would need to be predefined.

In light of that, I'm inclined to explore different approaches for attribute adjustment. Given that identifying something viable might take a while, my 220c final demo might end up just being an exploration of the latent space of the VAE, though :)

Week 10

So ultimately, it turns out it's quite hard to train DDSP-based generative models -- the autoencoder setup in the original DDSP paper is a "supervised" setup, wherein the latent vector z is supplemented with pre-extracted pitch and loudness information. Interestingly enough, when I train on NSynth, the model not only leans on the f0/loudness information, it completely ignores the latent z. If you zero out the latent z, the synthesis is virtually the same -- leading me to believe that the model is essentially memorizing the timbres of the various instruments and using the f0/loudness info like keys to a glorified dictionary. You can play around with that model here.
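
The check itself is simple; schematically it looks like the following, where model is a stand-in for the trained DDSP decoder + synthesizer (the real code goes through the ddsp library).

```python
import numpy as np

def model(features):
    # Placeholder decoder: synthesizes from f0 only, mimicking how the trained
    # model appeared to behave (it ignores z entirely).
    phase = 2 * np.pi * np.cumsum(features["f0_hz"]) / 16000
    return 0.1 * np.sin(phase)

features = {
    "f0_hz": np.full(16000, 440.0),
    "loudness_db": np.full(16000, -30.0),
    "z": np.random.randn(16000, 16),
}

audio_with_z = model(features)
audio_without_z = model(dict(features, z=np.zeros_like(features["z"])))

# If this difference is near zero, the decoder is ignoring z and leaning
# entirely on the supervised f0/loudness features.
print(np.mean(np.abs(audio_with_z - audio_without_z)))
```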

To combat this problem of over-reliance on the supervised features, I tried getting rid of f0/loudness and training end-to-end with just the z vector, but that turns out to be quite hard. I realized that the DDSP authors wrote a follow-up paper where they shed more light on ways to do "end-to-end" autoencoding using a ResNet + sinusoidal synth (albeit with a different end goal of pitch detection). That was cool, except they pretrained for 1M iterations, which I don't really have time for right now 🤷🏾‍♂️


So, for the final class presentation, I've decided to take a non-DDSP VAE, based off of https://github.com/yjlolo/gmvae-synth/, and provide a user-friendly interface for doing timbre interpolation with it. My final website is viewable at https://music220c.ketan.me/. Since this model disentangles pitch and timbre, those can be controlled independently. This isn't super novel, as the authors of the paper associated with this repo also did timbre interpolation -- but hopefully it can be seen from this progress log that I pursued a lot of novel directions this quarter, and I hope to incorporate that progress into my future work on this project. I think that having a model with disentangled / interpretable latents, such as the GMVAE, lends itself to counterfactual intervention, since we can manually intervene on one of the latents (e.g., pitch or timbre).
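
Schematically, the timbre-interpolation interaction looks like the sketch below, where encode and decode are placeholders for the trained GMVAE's methods (not the repo's actual API): the pitch latent is held fixed while the timbre latent is interpolated, which is exactly the kind of manual intervention on a single latent I have in mind.

```python
import numpy as np

def encode(audio):
    # Placeholder encoder: returns a (pitch latent, timbre latent) pair.
    rng = np.random.default_rng(abs(hash(audio.tobytes())) % (2**32))
    return rng.standard_normal(8), rng.standard_normal(16)

def decode(z_pitch, z_timbre):
    # Placeholder decoder: the real model would synthesize audio.
    return np.tanh(np.concatenate([z_pitch, z_timbre]))

source, target = np.random.randn(16000), np.random.randn(16000)
z_pitch_src, z_timbre_src = encode(source)
_,           z_timbre_tgt = encode(target)

# Intervene only on the timbre latent; pitch is held fixed, so each step
# is a counterfactual "same melody, different timbre."
for alpha in np.linspace(0.0, 1.0, 5):
    z_timbre = (1 - alpha) * z_timbre_src + alpha * z_timbre_tgt
    audio = decode(z_pitch_src, z_timbre)
```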