Multipitch Estimation
Multipitch estimation is the concurrent estimation of the pitches of multiple instruments in a sound mixture. It is an integral part of automatic music transcription and of various other applications in areas such as music information retrieval.
The first step is to obtain a time-frequency representation of the audio signal. Because the constant-Q transform has a logarithmic frequency axis, the spectral pattern of a note is approximately shift invariant: for a given instrument, playing a different note simply translates the pattern along the frequency axis. This property makes the constant-Q transform better suited to pitch estimation than a standard spectrogram.
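This shift property can be checked numerically. In the sketch below, the note frequencies and constant-Q parameters (fmin, bins per octave) are illustrative choices of ours:

```python
import numpy as np

def log_freq_bin(freq_hz, fmin_hz=32.7, bins_per_octave=12):
    # Constant-Q bins are spaced uniformly in log2-frequency,
    # so this maps a frequency to a (fractional) bin index.
    return bins_per_octave * np.log2(freq_hz / fmin_hz)

# First five harmonics of two notes (C4 and G4, approximate frequencies in Hz).
c4_harmonics = 261.63 * np.arange(1, 6)
g4_harmonics = 392.00 * np.arange(1, 6)

# On the log-frequency axis, every harmonic moves by the same number of bins:
# the whole spectral pattern of the note is simply translated.
shift = log_freq_bin(g4_harmonics) - log_freq_bin(c4_harmonics)
print(np.round(shift, 2))  # all entries equal (about 7 bins, a perfect fifth)
```

On a linear-frequency axis the harmonic spacing changes with pitch, so no single translation relates two notes; on the log axis it does.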
We treat the (normalized) constant-Q transform of a sound as a 2D multinomial distribution and model it as the convolution of two other multinomial distributions:
1) Kernel Distribution - Characterizes repetitive spectral structure
2) Impulse Distribution - Characterizes dynamics and yields a pitch track
Given only the constant-Q transform, the EM algorithm is used to estimate the kernel distribution and the impulse distribution. The estimated pitch at every time step is the location of the peak of the impulse distribution at that time step. The decomposition of a clip of clarinet music can be seen below.
[Figure: constant-Q transform of a clarinet clip, with the estimated kernel distribution and impulse distribution]
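In symbols, the single-source model is V(f, t) ≈ Σ_τ K(f − τ) I(τ, t), where K is the kernel distribution and I is the impulse distribution. A minimal numpy sketch of this forward model, with toy dimensions of our choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
n_kernel, n_shift, n_time = 11, 30, 8

kernel = rng.random(n_kernel)                  # spectral shape of the instrument
kernel /= kernel.sum()                         # P(f | kernel)
impulse = rng.random((n_shift, n_time))        # pitch and loudness over time
impulse /= impulse.sum()                       # P(tau, t)

# The reconstruction places a shifted copy of the kernel at every pitch,
# i.e. convolves the kernel with each impulse column along frequency.
recon = np.column_stack(
    [np.convolve(kernel, impulse[:, t]) for t in range(n_time)]
)
assert recon.shape == (n_kernel + n_shift - 1, n_time)
assert abs(recon.sum() - 1.0) < 1e-9           # still a distribution
```

Because both factors are normalized, the reconstruction is itself a 2D distribution over the time-frequency plane.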
The above decomposition is a fairly ideal result. The pitch track is clearly visible in the estimated impulse distribution, and the timbral signature of the clarinet appears in the estimated kernel distribution (we use the textbook definition of timbre here).
In practice, achieving such results requires a fair amount of constraining. Running the EM algorithm without any constraints, we get the following results.
[Figure: unconstrained decomposition, showing the estimated kernel distribution and impulse distribution]
The objective of the EM algorithm is to make the reconstruction as close to the original as possible, and that is exactly what it does in the example above. Although it reconstructs the input well, it does not solve the problem at hand: the impulse distribution ends up capturing all of the nuances of the timbre.
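A toy numpy sketch of one unconstrained EM iteration for the single-source convolutive model (variable names and dimensions are ours, not the paper's notation):

```python
import numpy as np

def em_step(V, kernel, impulse):
    # One EM iteration for the convolutive model
    #   V(f, t) ~ sum_tau kernel(f - tau) * impulse(tau, t)
    n_kernel = kernel.shape[0]
    n_shift, n_time = impulse.shape
    new_kernel = np.zeros_like(kernel)
    new_impulse = np.zeros_like(impulse)
    for t in range(n_time):
        # E-step: the reconstruction is the posterior's denominator.
        recon = np.convolve(kernel, impulse[:, t]) + 1e-12
        for tau in range(n_shift):
            # Responsibility of shift tau for bins [tau, tau + n_kernel),
            # weighted by the observed distribution V.
            contrib = (V[tau:tau + n_kernel, t] * kernel * impulse[tau, t]
                       / recon[tau:tau + n_kernel])
            new_impulse[tau, t] = contrib.sum()   # M-step: impulse update
            new_kernel += contrib                 # M-step: kernel accumulation
    return new_kernel / new_kernel.sum(), new_impulse / new_impulse.sum()

# Toy data: a fixed spectral shape played at one pitch in every frame.
true_kernel = np.array([0.6, 0.3, 0.1])
true_impulse = np.zeros((5, 3))
true_impulse[2, :] = 1.0 / 3
V = np.column_stack([np.convolve(true_kernel, true_impulse[:, t])
                     for t in range(3)])

kernel = np.full(3, 1.0 / 3)
impulse = np.full((5, 3), 1.0 / 15)
for _ in range(50):
    kernel, impulse = em_step(V, kernel, impulse)
```

Each iteration is guaranteed not to decrease the likelihood of the observed distribution under the model, which is why the unconstrained algorithm drives the reconstruction toward the original at any cost.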
We want the impulse distribution to be sparse so that the kernel distribution captures these nuances instead. A low-entropy distribution is sparse, so we place an entropic prior on the impulse distribution to enforce sparsity. We then get the following results.
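The entropic prior has its own specialized M-step; a simple tempering heuristic that captures the same effect (our illustration, not the paper's exact update) is to exponentiate and renormalize the impulse distribution:

```python
import numpy as np

def sparsify_columns(impulse, beta=1.2):
    # Tempering heuristic: raising each column to a power beta > 1 and
    # renormalizing concentrates mass on the largest entries, which
    # lowers the column's entropy, i.e. makes it sparser.
    p = impulse ** beta
    return p / p.sum(axis=0, keepdims=True)

def column_entropy(p):
    q = p + 1e-12
    return -(q * np.log(q)).sum(axis=0)

col = np.array([[0.5], [0.3], [0.2]])
assert column_entropy(sparsify_columns(col))[0] < column_entropy(col)[0]
```

Applied after every M-step, this kind of step pushes each column of the impulse distribution toward a few strong peaks.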
Although the impulse distribution is now sparse, each column is not necessarily unimodal, as we would like it to be.
To remedy this, we impose a prior on the impulse distribution such that each column is a Gaussian. We then get the following results, which are what we want.
[Figures: estimated kernel and impulse distributions with the sparsity prior, and with the additional unimodality prior]
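The unimodality constraint can be sketched by moment-matching each column to a Gaussian (our simplification, not necessarily the exact prior used in the paper):

```python
import numpy as np

def gaussianize_columns(impulse):
    # Moment-match each column to a discretized, renormalized Gaussian:
    # fit the column's mean and variance, then replace the column with a
    # Gaussian of those moments, which is unimodal by construction.
    n_shift, n_time = impulse.shape
    tau = np.arange(n_shift)
    out = np.empty_like(impulse, dtype=float)
    for t in range(n_time):
        col = impulse[:, t] / impulse[:, t].sum()
        mu = (tau * col).sum()
        var = (((tau - mu) ** 2) * col).sum() + 1e-6
        g = np.exp(-0.5 * (tau - mu) ** 2 / var)
        out[:, t] = g / g.sum()
    return out

# A bimodal column becomes a single peak near its centre of mass.
col = np.array([0.40, 0.05, 0.05, 0.05, 0.45]).reshape(-1, 1)
print(gaussianize_columns(col)[:, 0].argmax())  # → 2
```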
The goal is now to concurrently estimate the pitch of each instrument in a mixture. An example of a mixture of a clarinet and a flute is shown below; the solo recordings are shown as well, purely for illustration.
We now extend the model to handle multiple instruments by introducing another latent variable, the source. The EM algorithm is used as before, but we now estimate one kernel distribution and one impulse distribution per source, along with a distribution of mixture weights. The model is given by
P(f, t) = Σ_s P(s) Σ_τ P(f − τ | s) P(τ, t | s)

where P(s) is the distribution of mixture weights, P(f | s) is the kernel distribution of source s, and P(τ, t | s) is the impulse distribution of source s.
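A numpy sketch of the multi-source forward model, with toy dimensions and weights of our choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sources, n_kernel, n_shift, n_time = 2, 11, 25, 10

weights = np.array([0.6, 0.4])                       # P(s): mixture weights
kernels = rng.random((n_sources, n_kernel))          # P(f | s): one kernel per source
kernels /= kernels.sum(axis=1, keepdims=True)
impulses = rng.random((n_sources, n_shift, n_time))  # P(tau, t | s): one per source
impulses /= impulses.sum(axis=(1, 2), keepdims=True)

# P(f, t) = sum_s P(s) sum_tau P(f - tau | s) P(tau, t | s)
recon = sum(
    weights[s] * np.column_stack(
        [np.convolve(kernels[s], impulses[s, :, t]) for t in range(n_time)]
    )
    for s in range(n_sources)
)
assert abs(recon.sum() - 1.0) < 1e-9  # the mixture is still a distribution
```

The EM updates generalize accordingly: the posterior is now over both the shift τ and the source s, and the mixture weights are re-estimated from the total responsibility of each source.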
We would like to see a clean pitch track in each of the estimated impulse distributions.
Using the priors mentioned earlier, the following two impulse distributions are estimated. Each impulse distribution contains pitch tracks from both instruments, so this is not very helpful.
This tells us that the problem must be constrained further when dealing with multiple instruments. The intuition is that each pitch track should be fairly smooth, which we enforce using Kalman-filter-like smoothing. The estimated impulse distributions with this smoothing are shown below.
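A simple causal sketch of the idea, loosely inspired by Kalman-style smoothing (the paper's exact procedure may differ): each frame's peak is picked after re-weighting the column by a Gaussian centred on the previous frame's pitch estimate, which penalizes large jumps.

```python
import numpy as np

def smooth_pitch_track(impulse, sigma=2.0):
    # Re-weight each column by a Gaussian centred on the previous frame's
    # pitch estimate before taking the peak, so the track changes smoothly.
    n_shift, n_time = impulse.shape
    tau = np.arange(n_shift)
    track = np.empty(n_time, dtype=int)
    track[0] = impulse[:, 0].argmax()
    for t in range(1, n_time):
        weight = np.exp(-0.5 * (tau - track[t - 1]) ** 2 / sigma**2)
        track[t] = (impulse[:, t] * weight).argmax()
    return track

# Toy impulse distribution: the true pitch sits near bin 5, but frame 2
# contains a spurious peak far away at bin 0.
I = np.full((10, 4), 0.01)
I[5, 0] = I[5, 1] = I[5, 3] = 1.0
I[5, 2] = 0.9
I[0, 2] = 1.0
print(smooth_pitch_track(I))  # → [5 5 5 5]
```

Without the re-weighting, frame 2 would jump to the outlier; with it, the distant peak is suppressed and the track stays smooth.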
We see that two clean pitch tracks have now been estimated. This is exactly what we want!
Reference
• Gautham J. Mysore and Paris Smaragdis, "Relative Pitch Estimation of Multiple Instruments," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, April 2009.