Counterfactuals in music generation

Introduction

The space of human-AI co-creation of content is ripe with possibility; machine learning models can be used to augment human abilities, resulting in outcomes that would not be possible with humans or AI alone. However, many state-of-the-art ML systems today act as “black boxes” that don’t afford end-users any control over their outputs. In the context of creativity, we desire ML systems that are both expressive in their outputs and controllable by an end-user. Specifically in the context of music generation, current models are designed more for listeners than composers. While generative models such as MuseNet and GANSynth can create outputs with impressive harmonies, rhythms, and styles, they lack any method for the user to refine those features. If the user doesn’t like the musical output, their only option is to re-generate, producing a completely different composition. Moreover, changing the way these models generate output requires machine learning experience and hours of training time, which is not feasible for composers.

Counterfactual inference is reasoning about “what would have happened, had some intervention been performed, given that something else in fact occurred.” (Bareinboim et al. 2020) Counterfactual scenarios are therefore useful for reasoning about which components of the model are necessary and/or sufficient to produce certain kinds of outputs.
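
As an aside, in Pearl's standard notation this is the counterfactual quantity (my gloss, not a quote from the reference):

 P(Y_{x'} = y' \mid X = x, Y = y)

i.e., the probability that the outcome would have been y' had X been set to x', given that we actually observed X = x and Y = y.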

Updates

Weeks 1 & 2 were primarily spent doing a literature review. I learned a lot of background information on causal inference, primarily informed by Pearl's Causal Hierarchy.

Week 3

This week, I tentatively narrowed the scope of the project to music generation. I have been experimenting with various music generation models from Google Magenta, including the Music Transformer and Coconet.

Week 4

Still playing around with the generative models, trying to get some intuition into their workings and what parameters I can adjust. The orderless composition property of Coconet is particularly interesting — it seems like the sampling strategy is non-deterministic, and we could run a counterfactual in the vein of "what would have happened if the notes were generated in a different order..."
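
As a rough illustration of what "a different generation order" could mean, here is a minimal sketch (my own, not Coconet's actual API) of orderless infilling over a pianoroll; fill_one_cell is a hypothetical stand-in for the model's masked prediction of a single cell given the current context:

 import numpy as np
 
 def orderless_fill(pianoroll, mask, fill_one_cell, rng):
     """Fill the masked cells one at a time, in a random order.
 
     pianoroll: (time, pitch) array; mask: boolean array marking cells to generate.
     fill_one_cell: hypothetical callable standing in for the model's masked
     prediction of one cell given everything generated so far.
     """
     cells = list(zip(*np.nonzero(mask)))
     order = rng.permutation(len(cells))      # the generation order is itself random
     for i in order:
         t, p = cells[i]
         pianoroll[t, p] = fill_one_cell(pianoroll, (t, p))
     return pianoroll
 
 # The counterfactual: same model, same starting context, different order.
 # roll_a = orderless_fill(roll.copy(), mask, fill_one_cell, np.random.default_rng(0))
 # roll_b = orderless_fill(roll.copy(), mask, fill_one_cell, np.random.default_rng(1))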

Week 5

I am now using the [https://magenta.tensorflow.org/ddsp DDSP] framework in my project. My project abstract can be viewed [https://docs.google.com/document/d/1tbPepxLPPV2MJjuzKfh2UqA_vHr0eMc81PPHPzi0mUg here]. It is particularly interesting because rather than generating audio directly with a neural network, it uses neural networks to drive the parameters of traditional sound synthesis, which in turn generates the audio. This greatly reduces the number of parameters that the neural networks need, and thus drastically reduces the amount of training data needed to produce models that can do impressive feats such as timbre transfer.
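
To make that argument concrete, here is a minimal sketch of the underlying idea (plain NumPy, not the actual DDSP library code): a harmonic oscillator bank turns a handful of frame-level controls into audio, so a network only has to predict those controls rather than raw samples.

 import numpy as np
 
 def harmonic_synth(f0_hz, amplitude, harm_dist, sample_rate=16000, hop=64):
     """Render audio from frame-level controls via additive synthesis.
 
     f0_hz:     (n_frames,) fundamental frequency per frame
     amplitude: (n_frames,) overall loudness per frame
     harm_dist: (n_frames, n_harmonics) relative weight of each harmonic
     """
     n_frames, n_harm = harm_dist.shape
     n_samples = n_frames * hop
     # Upsample frame-level controls to the sample rate (linear interpolation).
     t_frames = np.arange(n_frames) * hop
     t_samples = np.arange(n_samples)
     f0 = np.interp(t_samples, t_frames, f0_hz)
     amp = np.interp(t_samples, t_frames, amplitude)
     harm = np.stack([np.interp(t_samples, t_frames, harm_dist[:, k])
                      for k in range(n_harm)], axis=1)
     # Integrate instantaneous frequency to phase, then sum the weighted harmonics.
     phase = 2 * np.pi * np.cumsum(f0 / sample_rate)
     harmonics = np.sin(phase[:, None] * np.arange(1, n_harm + 1))
     return amp * np.sum(harm * harmonics, axis=1)

A network in this setup only has to predict a small set of slowly varying controls per frame rather than every raw audio sample, which is where the reduction in model size and training data comes from.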

One particularly interesting work that DDSP cites is this paper: [https://dl.acm.org/doi/10.1145/2557500.2557544 Active learning of intuitive control knobs for synthesizers using Gaussian processes]. In this paper, they learn high-level "knobs" that map from synthesizer control space (i.e., parameters such as F0, amplitude, harmonic distribution, etc.) to high-level concepts, such as "scariness" or "steadiness." This is relevant and interesting, since they are building an interactive ML system for music generation, but also one that is ripe for exploring the space of counterfactual possibilities (e.g., if the "scariness" knob had been set lower, what would the composition have been like?).

Another interesting paper in this space of learning high-level concepts through Bayesian optimization is [http://graphics.cs.cmu.edu/projects/design-adjectives/ Design Adjectives].

Week 6

I have been diving more into the algorithm behind the "intuitive control knobs" paper discussed above — specifically, the Bayesian optimization they use to learn the mapping from control space to high-level concept space. [https://colab.research.google.com/github/yujko/5thSummerSchoolCourseMaterials/blob/master/Day1-Oulasvirta/Lecture_2_BayesianOptimization_Oulasvirta.ipynb This notebook] serves as a good interactive introduction to the relevant concepts. Essentially, the algorithm learns an uncertain approximation of the real mapping by iteratively polling the user for a high-level concept rating of an input and using that input-rating pair to update the posterior distribution.
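
A minimal sketch of that loop (my own toy version, assuming scikit-learn is available; rate_user is a stand-in for asking a human to rate the sound rendered at a given parameter setting):

 import numpy as np
 from sklearn.gaussian_process import GaussianProcessRegressor
 from sklearn.gaussian_process.kernels import RBF
 
 def rate_user(x):
     # Stand-in for the human's rating of the sound rendered at parameter x.
     return float(np.exp(-(x - 0.7) ** 2 / 0.02))
 
 candidates = np.linspace(0.0, 1.0, 200).reshape(-1, 1)   # 1-D synth parameter
 gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), alpha=1e-2)
 X, y = [], []
 
 for step in range(15):
     if X:
         gp.fit(np.array(X), np.array(y))
         mean, std = gp.predict(candidates, return_std=True)
         ucb = mean + 1.0 * std       # acquisition: exploit the mean, explore the std
         x_next = float(candidates[np.argmax(ucb), 0])
     else:
         x_next = float(candidates[np.random.randint(len(candidates)), 0])
     X.append([x_next])               # "poll the user" at this setting
     y.append(rate_user(x_next))
 
 # The GP posterior mean over `candidates` is now the learned "knob" curve.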

Building off of this tutorial, I have made a very rudimentary "knob" that can be trained by the user to output a simple sound.

A couple of questions/difficulties I have are: 1) which specific sound synthesis parameters to control (and how coarse-grained they should be), and 2) how to "encourage" the Bayesian optimization to converge to interesting outputs. There are several different "settings" of the algorithm that I need to explore, such as which acquisition function to use.
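
For reference, expected improvement is one common alternative to the upper-confidence-bound rule used in the sketch above; under the same GP posterior (mean and std over the candidates), it can be written as:

 import numpy as np
 from scipy.stats import norm
 
 def expected_improvement(mean, std, best_y, xi=0.01):
     """EI acquisition: expected improvement over the best rating seen so far,
     given the GP posterior mean and standard deviation at each candidate."""
     std = np.maximum(std, 1e-9)              # avoid division by zero
     z = (mean - best_y - xi) / std
     return (mean - best_y - xi) * norm.cdf(z) + std * norm.pdf(z)
 
 # x_next = candidates[np.argmax(expected_improvement(mean, std, max(y))), 0]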

I also met with Tobi Gerstenberg, a professor doing work on counterfactual judgment in psychology. It led to some interesting discussions: rather than just counterfactuals, it may make more sense to use direct intervention to provide feedback, and then, after the high-level concepts have been learned, present the user with counterfactual output examples that vary along the dimensions of the high-level concepts.

Another idea I had was to have a "hierarchy" of learnable concepts. The motivation is that sometimes when training these knobs, I noticed that some feature of the sound was what I wanted, but another was not. For example, "a sound may have a 'sharp attack' like a guitar but may 'warble' too quickly to be one, and a composer may still rate it above average because it possess at least one of the main characters of a guitar sound." ([https://dl.acm.org/doi/10.1145/2557500.2557544 Huang et al. 2014])

The idea is that the user can "drop down" from a higher-level concept to a lower-level one, teach the model that concept, and then, in the case where the low-level concept was unsatisfactory, adjust that concept up while maintaining the other desirable quality.
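
A purely hypothetical sketch of how that adjustment could sit on top of the GP knobs above: score candidate settings by the low-level concept while penalizing drift in the high-level one (both predict_* functions are stand-ins for separately trained concept models):

 import numpy as np
 
 def adjust_low_level(candidates, predict_attack, predict_guitarness,
                      guitar_target, weight=5.0):
     """Pick the candidate that maximizes the low-level concept ("sharp attack")
     while keeping the already-satisfactory high-level concept near its target."""
     attack = predict_attack(candidates)        # hypothetical low-level concept model
     guitar = predict_guitarness(candidates)    # hypothetical high-level concept model
     score = attack - weight * np.abs(guitar - guitar_target)
     return candidates[np.argmax(score)]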

Week 7

In my quest to build improved "knobs" that can be learned, I'm trying to build intuition for which control parameters to expose and how fine-grained that control should be (since the parameters can be different at every timestep, it can be infeasible to sufficiently explore the whole parameter space). Julius brought up the concept of an envelope in class, which seems like a useful abstraction: it provides users with a degree of control while not requiring every single timestep to be defined separately.
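
For concreteness, here is a minimal sketch (my own, not library code) of an ADSR envelope: four scalar knobs expand into a full per-frame curve that can then shape F0 or amplitude without specifying every timestep.

 import numpy as np
 
 def adsr_envelope(attack, decay, sustain, release, total=1.0, frame_rate=250):
     """Expand four scalar knobs into a per-frame envelope.
 
     attack, decay, release: durations in seconds; sustain: level in [0, 1];
     total: overall note duration in seconds; frame_rate: frames per second.
     """
     a = np.linspace(0.0, 1.0, max(int(attack * frame_rate), 1))
     d = np.linspace(1.0, sustain, max(int(decay * frame_rate), 1))
     r = np.linspace(sustain, 0.0, max(int(release * frame_rate), 1))
     s = np.full(max(int(total * frame_rate) - len(a) - len(d) - len(r), 0), sustain)
     return np.concatenate([a, d, s, r])
 
 # e.g. bend a base F0 contour with the four-parameter envelope:
 # f0_hz = 220.0 * (1.0 + 0.1 * adsr_envelope(0.05, 0.1, 0.7, 0.2))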


I made a new "knob" where you can interactively teach the algorithm the attack, decay, sustain, and release of the F0 envelope. Here's the [https://colab.research.google.com/drive/10GYRguNJUddROIMNzFEE47YxzBmRi2Ab very messy Colab notebook].