Lead Instrument Extraction


Lead instrument extraction is the extraction of a lead instrument from a recording of multiple instruments. It can be used for automatic karaoke, automatic “guitar hero” tracks, and automatic jam tracks. It is also useful when we wish to process a single instrument in a mix (add effects, change the volume, etc.): we can extract the instrument, process it, and put it back in the mix. It also helps musicians who learn parts of songs by ear. For example, a guitar solo is easier to work out from a recording if we can isolate the lead guitar and remove the background music.

We treat lead instrument extraction as a source separation problem in which the lead instrument is considered to be one source and all of the background music is considered to be another source. We consider single channel recordings.

The first step in describing the procedure that we use for source separation is the representation of the audio signal. Consider the following clip of piano music.

The audio signal is a large vector of numbers. The spectrogram that is shown above is a large matrix of numbers. We can clearly see and hear five notes. In a perceptual sense, we can think of the matrix as having a rank of five. Also, as can be seen, the matrix has approximately five different patterns so its effective rank is approximately five as well.
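As a rough illustration of the representation described here, the sketch below computes a magnitude spectrogram with NumPy and counts its significant singular values. The piano clip is not available in this text, so a synthetic sequence of five pure tones stands in for it; the pitches, frame length, hop size, and the 5% threshold are all assumptions chosen for the example.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=1024, hop=256):
    """Return |STFT| as a non-negative (frame_len // 2 + 1, n_frames) matrix."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    return np.abs(np.fft.rfft(frames, axis=0))

# Synthetic stand-in for the piano clip: five tones played one after another.
sr = 8000
t = np.arange(sr // 2) / sr
notes = [262.0, 294.0, 330.0, 349.0, 392.0]  # hypothetical pitches in Hz
signal = np.concatenate([np.sin(2 * np.pi * f * t) for f in notes])

V = magnitude_spectrogram(signal)

# Rough "effective rank": singular values above 5% of the largest one.
s = np.linalg.svd(V, compute_uv=False)
effective_rank = int(np.sum(s > 0.05 * s[0]))
print(V.shape, effective_rank)
```

For a clip like this, the count should be small compared with the matrix dimensions, which is the sense in which the spectrogram is low rank.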

We need a matrix decomposition that will give us a meaningful representation of such low rank matrices. We will soon be dealing with mixtures of sounds so the representation needs to stem from some kind of additive model in which the individual components are non-negative.

We use probabilistic latent component analysis (PLCA), which is numerically equivalent to non-negative matrix factorization (NMF) with a Kullback-Leibler divergence cost. The spectrogram is modeled as a weighted sum of outer products, each pairing one spectral vector with one temporal vector.

We treat the spectrogram as a two dimensional probability distribution, P(f,t). Using the EM algorithm, we then estimate a set of spectral basis vectors, P(f|z), a set of temporal basis vectors, P(t|z), and a set of mixture weights, P(z). The decomposition of our piano example can be seen below.
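Written out in the notation above, the model and the standard EM updates for it look as follows. This is a restatement of the usual two-dimensional PLCA formulation rather than something taken from this page; the index z runs over K latent components (K is a symbol introduced here), and P(f,t) is the spectrogram normalized to sum to one.

```latex
\begin{align}
P(f,t) &\approx \sum_{z=1}^{K} P(z)\,P(f \mid z)\,P(t \mid z) \\
\text{E-step:}\quad
P(z \mid f,t) &= \frac{P(z)\,P(f \mid z)\,P(t \mid z)}
                      {\sum_{z'} P(z')\,P(f \mid z')\,P(t \mid z')} \\
\text{M-step:}\quad
P(f \mid z) &= \frac{\sum_{t} P(f,t)\,P(z \mid f,t)}
                    {\sum_{f',t} P(f',t)\,P(z \mid f',t)} \\
P(t \mid z) &= \frac{\sum_{f} P(f,t)\,P(z \mid f,t)}
                    {\sum_{f,t'} P(f,t')\,P(z \mid f,t')} \\
P(z) &= \sum_{f,t} P(f,t)\,P(z \mid f,t)
\end{align}
```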

Once we perform the decomposition, the spectral basis vectors give us a spectral model of the sound.
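A minimal NumPy sketch of such a decomposition is given below. It is not the code used for the examples on this page; the function name `plca`, the iteration count, and the random initialization are assumptions, and the random matrix stands in for a real spectrogram.

```python
import numpy as np

def plca(V, n_components, n_iter=100, seed=0):
    """Fit P(f,t) ~ sum_z P(z) P(f|z) P(t|z) to a non-negative matrix V with EM.

    Returns (Pf_z, Pt_z, Pz) with shapes (F, K), (T, K) and (K,).
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    eps = 1e-12
    Pf_z = rng.random((F, n_components))
    Pf_z /= Pf_z.sum(axis=0)
    Pt_z = rng.random((T, n_components))
    Pt_z /= Pt_z.sum(axis=0)
    Pz = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # Current model of the spectrogram (up to scale), shape (F, T).
        model = (Pf_z * Pz) @ Pt_z.T + eps
        ratio = V / model
        # E-step folded into the M-step sums:
        # new_f[f, z] = sum_t V[f, t] * P(z | f, t), and likewise for new_t.
        new_f = Pf_z * Pz * (ratio @ Pt_z)
        new_t = Pt_z * Pz * (ratio.T @ Pf_z)
        mass = new_f.sum(axis=0)            # unnormalized weight of each component
        Pf_z = new_f / (mass + eps)
        Pt_z = new_t / (new_t.sum(axis=0) + eps)
        Pz = mass / (mass.sum() + eps)
    return Pf_z, Pt_z, Pz

# Stand-in data: a non-negative matrix built from three components.
rng = np.random.default_rng(1)
V = rng.random((64, 3)) @ rng.random((3, 40))
Pf_z, Pt_z, Pz = plca(V, n_components=3)

# Rescale the model back to the magnitude of V for comparison.
approx = V.sum() * ((Pf_z * Pz) @ Pt_z.T)
print(np.linalg.norm(V - approx) / np.linalg.norm(V))
```

The columns of `Pf_z` play the role of the spectral model: each one is a distribution over frequency for one latent component.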

When we deal with mixtures of sounds, each sound source can be associated with a spectral model. Consider the following recording. Lead guitar extraction is demonstrated with this example.

The recording contains a number of instruments. We would like to extract the lead guitar, which starts about halfway through the recording. We approach this problem by first learning a spectral model for the background music (all instruments except for the lead guitar). Since there is no lead guitar in the first half, we can use that half as training data, and we learn the spectral model with the PLCA technique described above.

An alternative method to obtain a model for the background music is to use every column of the training data as a spectral basis vector. The resulting model is highly overcomplete and redundant. We can deal with this by using an entropic prior (for sparsity) and a continuity prior (to enforce temporal coherence). The intuition behind this alternative is that the background music is likely to largely repeat itself underneath the lead guitar. We can therefore use the entire training data rather than learning an approximate model.
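The first step of this alternative, turning training frames into basis vectors, can be sketched as below. The entropic and continuity priors are not implemented here, and the helper name `frames_as_basis` and its `energy_floor` parameter are hypothetical additions used to skip near-silent frames.

```python
import numpy as np

def frames_as_basis(V_train, energy_floor=1e-8):
    """Turn each sufficiently loud column of V_train into a spectral basis vector."""
    energy = V_train.sum(axis=0)
    keep = energy > energy_floor * energy.max()   # drop near-silent frames
    W = V_train[:, keep]
    return W / W.sum(axis=0)                      # each column sums to 1, like P(f|z)

# Tiny stand-in for a training spectrogram; the middle frame is silent.
V_train = np.array([[1.0, 0.0, 3.0],
                    [1.0, 0.0, 1.0]])
W_bg = frames_as_basis(V_train)
print(W_bg)
```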

Once we learn the model for the background music, we consider the segment with the lead guitar and background music. We run PLCA on this segment. Since we already have a model for the background music, we tell PLCA to learn a model only for the lead guitar. It will learn a model to explain the things that were not well explained by the background music model. We can then reconstruct the background music and lead guitar separately.
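One way to sketch this step is to run the EM updates on the mixture while holding the background basis vectors fixed and re-estimating only the lead ones. The code below is an illustration under several assumptions: the function name `separate` is hypothetical, `W_bg` is expected to have columns that sum to one, and the final split uses a soft mask on the mixture magnitudes, which is a common choice but is not stated in the text above. Recovering audio would additionally need the mixture phase and an inverse STFT.

```python
import numpy as np

def separate(V_mix, W_bg, n_lead, n_iter=200, seed=0):
    """Split V_mix into (background, lead) magnitude estimates.

    W_bg holds fixed, column-normalized background bases; n_lead new bases
    are learned to explain what the background model cannot.
    """
    rng = np.random.default_rng(seed)
    F, T = V_mix.shape
    n_bg = W_bg.shape[1]
    eps = 1e-12
    W_lead = rng.random((F, n_lead))
    W_lead /= W_lead.sum(axis=0)
    W = np.hstack([W_bg, W_lead])          # (F, n_bg + n_lead), a fresh copy
    H = rng.random((n_bg + n_lead, T))     # per-frame weights of every basis
    for _ in range(n_iter):
        ratio = V_mix / (W @ H + eps)
        new_W = W * (ratio @ H.T)          # sum over t of V * P(z | f, t)
        H = H * (W.T @ ratio)              # sum over f of V * P(z | f, t)
        # Only the lead bases are re-estimated; the background model stays fixed.
        lead = new_W[:, n_bg:]
        W[:, n_bg:] = lead / (lead.sum(axis=0) + eps)
    model = W @ H + eps
    mask_bg = (W[:, :n_bg] @ H[:n_bg]) / model
    return V_mix * mask_bg, V_mix * (1.0 - mask_bg)

# Toy mixture: background energy in bins 0-3, lead energy in bins 4-5.
rng = np.random.default_rng(2)
W_bg = np.array([[0.5, 0.0], [0.5, 0.0], [0.0, 0.5],
                 [0.0, 0.5], [0.0, 0.0], [0.0, 0.0]])
w_lead = np.array([0.0, 0.0, 0.0, 0.0, 0.7, 0.3])
V_mix = W_bg @ rng.random((2, 30)) + np.outer(w_lead, rng.random(30))

bg_est, lead_est = separate(V_mix, W_bg, n_lead=1)
```

Because the two estimates come from complementary masks, they always add back up to the mixture spectrogram.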

The original recording of the background music with the lead guitar (second half of the above recording) is below.

Once we use the algorithm described above to remove the lead guitar, we get the following estimate of the background music.

Although there are some artifacts, we get a fairly clean separation with no bleed from the lead guitar.

Here is another example. We wish to remove the lead guitar in this example as well. We start with the original recording.

We deal with bleed in the lead guitar estimate using post-processing. As mentioned above, the spectral model consists of a number of spectral basis vectors. The guitar estimate is the combination of all of the spectral basis vectors that correspond to the guitar model. Some of these vectors have erroneously explained the percussion. If we can find these vectors, we can remove them from the guitar model and re-assign them to the background music.

For example, the reconstructions from two of the erroneous vectors that clearly explain the percussion are below.

We extract a set of features from each of the vectors that allow us to distinguish the ones that correspond to percussion sounds from the ones that correspond to non-percussion sounds. Once we have these feature vectors for each of the vectors from the model, we need to cluster them.

We use K-means clustering to perform an unsupervised classification of the percussion and non-percussion vectors. With suitable initializations, we know in advance which cluster corresponds to which class, so the procedure remains completely unsupervised. We then reconstruct the lead guitar with the vectors that correspond to the non-percussion cluster. The cleaned up guitar estimate is below. As can be seen and heard, the percussion bleed is completely removed. There is still a little bit of bleed from the rhythm guitar and bass guitar, but that can be removed using a similar technique.
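The clustering step can be sketched as follows. The text does not say which features are extracted, so this example assumes a single feature, the spectral flatness of each basis vector, on the reasoning that percussion spectra tend to be broadband. The helper names and the toy basis vectors are hypothetical. Starting cluster 0 at the flattest vector is one way to know beforehand that cluster 0 is the percussion cluster.

```python
import numpy as np

def spectral_flatness(W, eps=1e-12):
    """Geometric mean over arithmetic mean of each column; near 1 for noise-like spectra."""
    W = W + eps
    return np.exp(np.mean(np.log(W), axis=0)) / np.mean(W, axis=0)

def two_means(X, init_centroids, n_iter=50):
    """Plain K-means with K=2 on the rows of X, starting from the given centroids."""
    centroids = np.array(init_centroids, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for k in range(2):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels

# Toy guitar model: four peaky (tonal) vectors and two broadband ones.
rng = np.random.default_rng(0)
F = 128
tonal = np.full((F, 4), 1e-3)
tonal[[10, 20, 30, 40], [0, 1, 2, 3]] = 1.0
noisy = 0.5 + 0.5 * rng.random((F, 2))
W_lead = np.hstack([tonal, noisy])
W_lead /= W_lead.sum(axis=0)

X = spectral_flatness(W_lead)[:, None]     # one feature per basis vector
init = [[X.max()], [X.min()]]              # cluster 0 starts at the flattest vector
labels = two_means(X, init)
keep = labels == 1                         # vectors kept for the guitar estimate
print(labels)
```

The vectors with `keep` set to false would be re-assigned to the background model before reconstruction.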

One interesting application of lead guitar extraction is that, given a recording of a number of instruments (such as in our example), we can process the lead guitar on its own. This is demonstrated in the example below. A heavy amount of reverb is applied to the extracted lead guitar and it is then put back into the mix.

I don’t advocate using such large amounts of reverb in practice. This is merely a demonstration to show that we can process an individual instrument in a single channel mix.

This system allows us to perform any kind of processing from a slight change in volume to heavy effects processing on the extracted instrument.
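The process-and-remix idea can be sketched on time-domain signals as below. It assumes the background and lead estimates have already been converted back to waveforms of equal length; the function name `remix`, the test signals, and the decaying-noise impulse response are illustrative assumptions, not the processing used for the demonstration on this page.

```python
import numpy as np

def remix(background, lead, gain=1.0, impulse_response=None):
    """Process the lead (optional convolution reverb, then gain) and add it back."""
    processed = lead
    if impulse_response is not None:
        # Convolution reverb, truncated to the original length.
        processed = np.convolve(lead, impulse_response)[: len(lead)]
    return background + gain * processed

sr = 8000
t = np.arange(sr) / sr
background = 0.5 * np.sin(2 * np.pi * 110.0 * t)   # stand-in background estimate
lead = 0.5 * np.sin(2 * np.pi * 440.0 * t)         # stand-in lead guitar estimate

# Hypothetical impulse response: half a second of exponentially decaying noise.
rng = np.random.default_rng(0)
n = np.arange(sr // 2)
ir = 0.05 * rng.standard_normal(sr // 2) * np.exp(-8.0 * n / sr)
ir[0] = 1.0                                        # keep the direct sound

quieter = remix(background, lead, gain=0.5)               # slight volume change
wet = remix(background, lead, impulse_response=ir)        # heavy effect
```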

We now return to our original example and focus on the extracted lead guitar. It has a significant amount of bleed from the other instruments (mainly the percussion) as can be seen and heard below.

Once we use the algorithm described above to remove the lead guitar, we get the following estimate of the background music.

[Audio examples: Background Music + Lead Guitar and Background Music Estimate (listed twice, once per example recording); Lead Guitar Estimate; Lead Guitar Estimate with Post-Processing; Remix with Processed Lead Guitar]