I created a tool that uses interactive machine learning to build mappings for controlling audio synthesis algorithms. Specifically, the tool lets users place examples of specific sounds at 3D locations in a virtual reality environment; it then generalizes a mapping over the entire space based on the positions and/or velocities of the example sounds. The motivation is that synthesis mappings are usually built by hand, a slow process that ends when the programmer runs out of time, not necessarily when they are satisfied with the mapping they have created; the manual approach is also limited in the kinds of mappings it can achieve.
Here's a demo video of the neural-net model using position features (controller x, y, z) to create a synthesis mapping.
In the video, I first create a simple mapping with four examples. This shows that, in runtime mode, I can correctly reproduce the user's training examples, with interpolation between them that does more than just change the volume.
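The post doesn't show the model internals, so as a minimal sketch of the "examples in, full-space mapping out" idea, here is a Gaussian RBF interpolator standing in for the neural net: like the demo, it reproduces each training example exactly and interpolates smoothly between them. All names, positions, and parameter values are illustrative assumptions.

```python
import numpy as np

def fit_rbf_mapping(positions, params, sigma=0.5):
    """Fit an exact Gaussian-RBF interpolator from 3D controller
    positions to synthesis-parameter vectors. This is a stand-in
    for the post's neural-net model, not its actual implementation."""
    positions = np.asarray(positions, dtype=float)
    params = np.asarray(params, dtype=float)
    # Pairwise squared distances between the example positions.
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma ** 2))
    # Solving K w = params makes the mapping exact at every example.
    weights = np.linalg.solve(K, params)

    def predict(p):
        p = np.asarray(p, dtype=float)
        d2 = ((positions - p) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2)) @ weights

    return predict

# Four examples: controller position -> two synth parameters
# (think pitch and filter cutoff -- the names are hypothetical).
examples = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
targets = [(0.2, 0.1), (0.9, 0.3), (0.4, 0.8), (0.6, 0.6)]
mapping = fit_rbf_mapping(examples, targets)
```

Moving the controller between examples then blends the parameter vectors rather than just crossfading volume.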
Notably, I capture a preset from runtime mode, then place a few copies of it in a triangle to create a small region where I can reliably play a particular sound.
Then I place another sound above and below it (using the preset functionality again), easily creating a mapping that would be impossible with a linear model. This shows the flexibility of this model over my original linear regression model, which would have struggled to reproduce all of the user's examples so closely with this many examples provided.
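To see why the "above and below" layout defeats a linear model: a linear map cannot output the same value at z = +1 and z = -1 while outputting something different at z = 0. A small sketch with hypothetical coordinates and parameter values (not the actual demo data) makes the failure concrete:

```python
import numpy as np

# Three copies of sound A on a triangle at z = 0, and sound B
# directly above and below the triangle's interior.
X = np.array([
    [0.0, 0.0, 0.0],   # sound A (triangle vertex)
    [1.0, 0.0, 0.0],   # sound A
    [0.5, 1.0, 0.0],   # sound A
    [0.5, 0.3, 1.0],   # sound B, above
    [0.5, 0.3, -1.0],  # sound B, below
])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0])  # one synth parameter

# Best linear fit (with a bias column) via least squares.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residual = np.abs(A @ coef - y).max()
# The residual is large: no plane can satisfy all five examples.
```

A nonlinear model like the neural net (or the RBF sketch above it stands in for) fits all five examples exactly.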
A few more quality-of-life features appear in passing: in example mode, examples turn yellow and preview their sound when you hover over them; the ground changes color in runtime mode so you can easily tell which mode you're in; and examples become more transparent the farther away you are, as a reminder that you're using a position-based mapping.
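The distance-based transparency could be as simple as a clamped linear fade on the example's alpha. The fade distances below are made-up values, not what the tool actually uses:

```python
import math

def example_alpha(controller_pos, example_pos, fade_start=0.3, fade_end=1.5):
    """Fade an example toward transparent as the controller moves away.
    fade_start/fade_end are hypothetical distances in meters."""
    d = math.dist(controller_pos, example_pos)
    t = (d - fade_start) / (fade_end - fade_start)
    # Fully opaque inside fade_start, fully transparent past fade_end.
    return max(0.0, min(1.0, 1.0 - t))
```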
Here's a demo of the velocity/gesture-based interaction model, using only velocity features. Previous versions used so many features that the model could overfit beyond the point where the user could replicate their own motions; I think this version achieves the right amount of fitting: it's expressive, it makes the user feel like they're being listened to, and it only learns velocity gestures that the user can reasonably reproduce.
(The features here are velocity x, y, z, plus its magnitude and squared magnitude.)
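Building that five-dimensional feature vector is straightforward; a sketch (the function name is mine, not from the tool):

```python
import numpy as np

def velocity_features(vx, vy, vz):
    """Build the five velocity features listed above: the x, y, z
    components, the magnitude, and the squared magnitude."""
    mag2 = vx * vx + vy * vy + vz * vz
    return np.array([vx, vy, vz, np.sqrt(mag2), mag2])
```

Magnitude and squared magnitude are redundant for a flexible model, but they give a small model direction-independent "speed" terms at two different scales without any extra engineering.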
You can see that with just two examples, we can already make a very compelling mapping.
Still, the position-based neural-net mapping is the one I find most conducive to fine-tuning.