Source Separation by Humming

Standard audio editors offer various interfaces for working with audio, but these are predominantly visual. Visual interfaces are not the most intuitive for audio, which is inherently sonic rather than visual data. They are also quite limiting and rule out various interesting tasks. Source separation is one such task, and it has numerous interesting applications, some of which are described here. Separation can be performed if we have training data for some of the sources; it is, however, interesting to explore how it can be done with no training data at all.


We want a way to tell the system which source to extract from a recording. With real music, this is pretty much impossible in a visual editor. Given a spectrogram of an audio signal, we can roughly make out some of the sources, but with a great deal of ambiguity. For example, in the spectrogram below, we can see some harmonic series that correspond to the vocals. It is, however, extremely laborious and next to impossible to pick out the voice in a visual editor. Moreover, in a lot of music, the visual location of a source is far more ambiguous, and it would not be clear whether such patterns correspond to the voice or to an instrument.
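For reference, a log-magnitude spectrogram of this kind can be computed with a few lines of Python. The sketch below uses scipy and matplotlib, and the file name "mixture.wav" is a hypothetical stand-in for the recording.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.io import wavfile
    from scipy.signal import stft

    sr, y = wavfile.read("mixture.wav")       # hypothetical file name
    if y.ndim > 1:                            # mix down to mono if needed
        y = y.mean(axis=1)
    f, t, Z = stft(y, fs=sr, nperseg=1024)
    S_db = 20 * np.log10(np.abs(Z) + 1e-9)    # log magnitude in dB

    plt.pcolormesh(t, f, S_db, shading="gouraud")
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.show()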

We have therefore developed a more intuitive interface for specifying the source to be extracted. Since this is an audio problem, we believe that the user input should also be audio. In our system, the user vocalizes the sound that he/she wishes to extract. This could mean singing the vocals, whistling the guitar part, or beatboxing the drum part.

In the above example, if we wish to extract the vocals, we can simply sing them in our own voice. An example user query is below.

Original Music

Query

When we apply our separation algorithm, we get the following separated vocals and background music. It should be noted that the above user query is the only input to the system; no training data is used.

Separated Vocals

Separated Background Music

The goal of separation is often to process an individual source within the mix. For example, we might want to pitch-shift the vocals in the current example, as can be heard below. The artifacts of the separation are then largely masked, while the goal of processing an individual source is still achieved. A short code sketch of such a remix follows the audio example.

Remix with Processed Voice
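Such a remix could be scripted roughly as follows. This is only a sketch: the file names are hypothetical, and the librosa and soundfile libraries are assumed to be available.

    import librosa
    import soundfile as sf

    # Hypothetical file names for the separated stems above.
    vocals, sr = librosa.load("separated_vocals.wav", sr=None)
    backing, _ = librosa.load("separated_background.wav", sr=sr)

    # Shift the vocals up two semitones, then sum the stems back together.
    shifted = librosa.effects.pitch_shift(vocals, sr=sr, n_steps=2)
    n = min(len(shifted), len(backing))
    sf.write("remix.wav", shifted[:n] + backing[:n], sr)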

In the above example, the user input and the source to be extracted were both vocals. The same system can, however, be used to extract other sources as well. The user simply needs to vocalize, in some form, the source to be extracted.


In the following example, we wish to extract the lead guitar.

Original Music

Our user input, which is just whistling the lead guitar, is below.

Query

When we apply our separation algorithm, we get the following separated lead guitar and background music.

Separated Lead Guitar

Separated Background Music

It should be noted that the separation follows the user input. Since the user input contains only the first two phrases of the lead guitar, the separated lead guitar also contains only those two phrases. The start of the third phrase can be heard in the separated background music.


One artifact left behind in the separated background music is the attacks of the lead guitar. This is because the whistled query does not contain the attacks. The majority of the lead guitar is nevertheless separated. Moreover, if the goal is to process the lead guitar and then mix it back with the background music, this degree of separation works quite well.


Let us consider another example. The goal of separation is sometimes to remove an artifact introduced during the recording process. For example, multiple instruments are sometimes recorded with a single microphone (due either to budgetary constraints or to artistic choices such as microphone technique). Consider a classical music recording in which we have a perfect take (all of the musicians have played well). If one of the musicians forgot to take his/her cough medicine that day, we could end up with a frustrating situation like the one below.

Music with Artifact

We would, of course, like to remove the cough from the above recording. The user input in this case is very simple, as shown below.

Query

The cleaned-up recording is below.

Clean Music

As can be seen from the above examples, the user input does not need to match the source we wish to remove exactly; a rough approximation suffices. The key point is that the user input must resemble the source we wish to extract more closely than it resembles any other source.


The basic methodology that we use is illustrated below. The diagram corresponds to the example in which the user input is whistling.

The PLCA algorithm (as described here) is applied to the sound mixture. This results in a number of learned basis vectors that characterize the sound mixture, with different basis vectors corresponding to different sources; some of them will correspond to the source that we wish to extract or remove. If, during the learning process, we have a way of suggesting that some of these basis vectors should correspond to a particular source, we can reconstruct that source and the remaining sources separately.
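The decomposition step can be sketched in a few lines. PLCA on a normalized magnitude spectrogram is closely related to non-negative matrix factorization under the KL divergence, so the sketch below uses standard multiplicative updates as a stand-in; the function and parameter names are our own, not from the paper.

    import numpy as np

    def kl_nmf(V, n_components, n_iter=200, eps=1e-9, rng=None):
        # Factor a magnitude spectrogram V (freq x time) as W @ H with
        # multiplicative updates for the KL divergence. The columns of W
        # play the role of PLCA's spectral basis vectors, and the rows
        # of H the role of their time-varying gains.
        rng = np.random.default_rng(rng)
        F, T = V.shape
        W = rng.random((F, n_components)) + eps
        H = rng.random((n_components, T)) + eps
        for _ in range(n_iter):
            WH = W @ H + eps
            W *= ((V / WH) @ H.T) / (H.sum(axis=1) + eps)
            WH = W @ H + eps
            H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        return W, H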


We first learn a set of basis vectors for the user input using PLCA. These basis vectors are then used as prior distributions on the first n components when learning the basis vectors of the sound mixture. This suggests to the algorithm that the first n basis vectors should approximately correspond to the source vocalized by the user. We can then reconstruct that source using the first n basis vectors, and the remaining music using the other basis vectors.
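Building on the kl_nmf function above, a simplified sketch of this guided decomposition follows. Instead of the soft priors used in the paper, the query's basis vectors are simply held fixed while the remaining components are free to model the rest of the mixture, and the target is reconstructed with a soft mask; all names and parameter values here are illustrative assumptions.

    import numpy as np

    def separate_with_query(V_mix, V_query, n_query=8, n_other=24,
                            n_iter=200, eps=1e-9, seed=0):
        # V_mix and V_query are magnitude spectrograms (freq x time) of
        # the mixture and of the hummed/whistled query.
        W_q, _ = kl_nmf(V_query, n_query)    # spectral shapes of the query
        rng = np.random.default_rng(seed)
        F, T = V_mix.shape
        W = np.hstack([W_q, rng.random((F, n_other)) + eps])
        H = rng.random((n_query + n_other, T)) + eps
        for _ in range(n_iter):
            WH = W @ H + eps
            H *= (W.T @ (V_mix / WH)) / (W.sum(axis=0)[:, None] + eps)
            WH = W @ H + eps
            # Update only the free components; the first n_query basis
            # vectors stay pinned to the query's spectral shapes.
            W[:, n_query:] *= ((V_mix / WH) @ H[n_query:].T) / (
                H[n_query:].sum(axis=1) + eps)
        WH = W @ H + eps
        mask = (W[:, :n_query] @ H[:n_query]) / WH  # soft mask for target
        return mask * V_mix, (1.0 - mask) * V_mix

In practice, the mask would be applied to the complex STFT of the mixture and the two masked spectrograms inverted back to the time domain.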

Reference


  1. Paris Smaragdis and Gautham J. Mysore, “Separation by ‘Humming’: User-Guided Sound Extraction from Monophonic Mixtures,” in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, October 2009.