From CCRMA Wiki
Kevin Montag's Music 256A Final Project, Fall 2009
Intuition is a program that seeks to make audio production more intuitive. It focuses on transforming existing audio to give it desired perceptual qualities, such as a particular smoothness or brightness.
The user starts by specifying some songs that she likes, and what she likes about them - the great shimmer to that new Jay-Z single, or the starkness of that old Johnny Cash ballad. Then she specifies the tools she wants to use to achieve that kind of sound, in the form of LADSPA plugins. Finally, she specifies some sonic qualities that she thinks will be relevant when the program is trying to figure out what to do with those plugins when faced with sounds it hasn't seen before.
Intuition then uses machine learning algorithms to deduce a map from qualities of an input sound to parameters for the LADSPA plugins. This map allows it to take arbitrary inputs and make them sound more like that Johnny Cash song in real time. Or, depending on the plugin and feature choice, it may become an instrument of its own.
Intuition has a relatively minimal UI. The user makes four selections - input features, plugins to use, songs to mimic, and features of those songs to target - and then presses "Go!" The algorithm does its work to learn a mapping - this is fairly time-consuming - and then opens JACK ports at the inputs and outputs of the plugin chain that the user has specified. While audio is being fed to the program, the values of control parameters for the plugins are displayed, along with a happy or sad face to let the user know how well the program is doing at matching her desired features. The program does its best to display all the necessary information without feeling cluttered.
In the future, options will be added to fine-tune the parameters of the machine learning algorithm straight from the interface, but these wouldn't be placed in a prominent spot.
The other main UI improvement for the future is to add the ability to set ranges for individual plugin parameters. Currently, plugin parameters can take on any values, so there are often large changes in the sound over small periods of time if parameters are not chosen very carefully.
A number of machine learning algorithms are used in Intuition. The primary one is a locally-weighted linear regression, which can be performed in real time once data points have been provided. Applying input-specific weights to our training data allows us to avoid assuming that the mapping from input features to plugin parameters is linear - it clearly isn't!
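To make the idea of input-specific weights concrete, here is a minimal sketch of locally-weighted linear regression for a single input feature and a single plugin parameter. It is not Intuition's actual code: the function name `lwr_predict`, the Gaussian kernel, and the `bandwidth` parameter are all assumptions chosen for illustration.

```python
import math

def lwr_predict(query, xs, ys, bandwidth=1.0):
    """Predict y at `query` from a line fitted with query-specific weights.

    Hypothetical helper: one input feature, one output parameter.
    """
    # Gaussian kernel (an assumption): training points near the query count more.
    ws = [math.exp(-((x - query) ** 2) / (2.0 * bandwidth ** 2)) for x in xs]
    total = sum(ws)
    # Weighted means of the inputs and outputs.
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    # Weighted covariance over weighted variance gives the local slope.
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    slope = sxy / sxx if sxx > 1e-12 else 0.0
    # Evaluate the locally fitted line at the query point.
    return mean_y + slope * (query - mean_x)

# A curved mapping: a single global line would fit this poorly.
xs = [i * 0.5 for i in range(9)]   # 0.0, 0.5, ..., 4.0
ys = [x * x for x in xs]
local_estimate = lwr_predict(2.0, xs, ys, bandwidth=0.3)
```

Because the weights are recomputed for every query, each prediction fits its own line, which is why the overall mapping does not have to be linear. A smaller `bandwidth` follows the data more closely but is more sensitive to noise.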
The program can also perform unsupervised clustering of features to extract more than one "quintessential" point in feature-space from a given piece of target audio, and can learn using logistic regression to choose which of these points to target given a new piece of input.
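The text does not say which clustering method is used, so the following sketch swaps in plain k-means as a stand-in. It shows how several representative points could be extracted from per-frame feature vectors. The names `kmeans`, `centroid`, and `squared_distance`, and the choice to pass initial centers explicitly, are assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, initial_centers, iterations=20):
    """Plain k-means (an assumed stand-in for the clustering step)."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(iterations):
        # Assign every point to its nearest current center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return centers

# Hypothetical two-feature frames taken from one target song.
frames = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.0),
          (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
targets = kmeans(frames, initial_centers=[frames[0], frames[3]])
```

Each resulting center plays the role of one "quintessential" point. Choosing among the centers for a new input would be the job of the separate classifier described above, which is not shown here.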
The most challenging computational aspect of the program, though, turns out not to be the learning algorithms, but rather generating data points for them - the parameter space for most plugins is very large, and it's hard to find an optimum value. The program finds optima by computing the values of target features for random points in the parameter space, and then performing gradient ascent seeded with the point that looks most promising - this works to an extent, but is still susceptible to local optima in the space.
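The search strategy described above can be sketched as follows. This is an illustration under stated assumptions, not the program's implementation: `score` stands in for "how well the target features are matched", the gradient is estimated by finite differences, and the `bounds` on each parameter are hypothetical.

```python
import random

def numeric_gradient(f, point, eps=1e-4):
    """Central-difference estimate of the gradient of f at `point`."""
    grad = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad

def find_optimum(score, bounds, samples=200, steps=500, rate=0.05, seed=0):
    """Random sampling followed by gradient ascent from the best sample."""
    rng = random.Random(seed)
    # Stage 1: score random points in the (assumed bounded) parameter space.
    candidates = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(samples)]
    best = max(candidates, key=score)
    # Stage 2: climb uphill from the most promising sample.
    for _ in range(steps):
        grad = numeric_gradient(score, best)
        best = [min(max(x + rate * g, lo), hi)
                for x, g, (lo, hi) in zip(best, grad, bounds)]
    return best

# Toy score with a single peak at (0.3, -0.5); real plugin scores are not this tidy.
def toy_score(p):
    return -((p[0] - 0.3) ** 2 + (p[1] + 0.5) ** 2)

optimum = find_optimum(toy_score, bounds=[(-1.0, 1.0), (-1.0, 1.0)])
```

The random stage only decides where the climb starts. If the score surface has several peaks, the climb still ends at whichever peak is nearest to that starting point, which is the local-optimum problem mentioned above.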
The program is designed with extensibility in mind. Currently features are extracted from audio using functions from libxtract, but it's easy to create new features (by subclassing off of an abstract feature class) and integrate them seamlessly with the program. Algorithms and parameters are modular, and interfaces to external libraries and protocols are presented in terms of the abstractions that are "native" to Intuition. The algorithmic side of the program is very much independent of its UI; it was designed with the intention that it could be used as a standalone library for other applications.
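As an illustration of the subclassing pattern, here is a small sketch of an abstract feature class with two concrete features. The class names, the `compute` method, and the `name` attribute are hypothetical; they are not taken from Intuition's source.

```python
import math
from abc import ABC, abstractmethod

class Feature(ABC):
    """Hypothetical abstract base: one scalar feature of a block of samples."""
    name = "feature"

    @abstractmethod
    def compute(self, samples):
        """Return the feature value for a non-empty list of samples."""

class RMS(Feature):
    name = "rms"

    def compute(self, samples):
        # Root-mean-square level of the block.
        return math.sqrt(sum(s * s for s in samples) / len(samples))

class ZeroCrossingRate(Feature):
    name = "zero_crossing_rate"

    def compute(self, samples):
        # Fraction of adjacent sample pairs whose signs differ.
        pairs = list(zip(samples, samples[1:]))
        crossings = sum(1 for a, b in pairs if (a < 0) != (b < 0))
        return crossings / len(pairs) if pairs else 0.0

# The rest of the program only needs the shared interface.
features = [RMS(), ZeroCrossingRate()]
block = [3.0, -3.0, 3.0, -3.0]
values = {f.name: f.compute(block) for f in features}
```

Code that consumes features iterates over `Feature` objects without knowing which concrete class it holds, so adding a new feature means writing one subclass and nothing else.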
The program is also designed for performance, since it needs to do some heavy lifting in real time. For a given piece of audio, features are computed only once, and libxtract's philosophy of "cascading" features is built into the abstractions for audio objects. Numerical corners are cut where appropriate in the learning algorithms to speed them up.
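One way to combine "compute each feature only once" with cascading features is a per-clip cache in which a feature may request the features it depends on. The sketch below is an assumption about how this might look: `AudioClip`, `FEATURES`, and the `evaluations` counter are hypothetical names, and no libxtract calls are made.

```python
import math

class AudioClip:
    """Hypothetical audio object that memoizes its feature values."""

    def __init__(self, samples):
        self.samples = samples
        self._cache = {}
        self.evaluations = 0  # counts actual feature computations

    def feature(self, name):
        # Compute on first request only; later requests hit the cache.
        if name not in self._cache:
            self.evaluations += 1
            self._cache[name] = FEATURES[name](self)
        return self._cache[name]

def mean(clip):
    return sum(clip.samples) / len(clip.samples)

def variance(clip):
    # Cascades from the mean instead of recomputing it.
    m = clip.feature("mean")
    return sum((s - m) ** 2 for s in clip.samples) / len(clip.samples)

def standard_deviation(clip):
    # Cascades from the variance.
    return math.sqrt(clip.feature("variance"))

FEATURES = {
    "mean": mean,
    "variance": variance,
    "standard_deviation": standard_deviation,
}

clip = AudioClip([1.0, 3.0])
spread = clip.feature("standard_deviation")  # computes mean, variance, then this
again = clip.feature("variance")             # served from the cache
```

Requesting the standard deviation triggers three computations in total, and every later request for any of the three values costs only a dictionary lookup.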
Intuition uses a number of external libraries and APIs: libxtract and FFTW for feature extraction, libsndfile for reading training data, JACK for outputting and receiving real-time audio, LADSPA for transforming audio, and GTK for its UI.
The milestones accomplished for the project were, in approximately chronological order:
Milestone 1: Get a framework up and running for reading/writing files, extracting features, processing collections of files, etc.
Milestone 2: Figure out the "extent" of the program - better define the user experience and the program's capabilities.
Milestone 3: Get a user interface up and running that allows the input features, plugins, target files, and target features to be specified.
Milestone 4: Design an API for linking the interface with the computational backend, refactor existing code.
Milestone 5: Implement the computational backend.
Milestone 6: Integrate the UI and learning algorithms, revising the API and incrementally redesigning the backend with performance in mind.
Milestone 7: Polish the UI, and add new ways of giving feedback.
Thanks to Matt Hoffman for some good advice regarding the problem of generating data points for the learning algorithms! Thanks to Ge Wang for helping me turn a nebulous idea into a concrete one!