Next: Results Up: Polyphonic Instrument Identification Using Previous: ISA of a single


Classification System

The Infomax ICA algorithm of Bell and Sejnowski [8] is used to learn $ \mathbf{A}$ and $ \mathbf{S}$ from $ \mathbf{X}$ in the linear model, keeping at most $ N=8$ components for use in the subsequent classification. The analysis window is 10 ms long, short enough to capture the transient, with a 50% hop size. Convergence is fast, but annealing is also applied.
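The framing arithmetic implied by the 10 ms window and 50% hop can be sketched as follows. This is only an illustration under stated assumptions: the function name `frame_signal` and the 16 kHz sample rate are hypothetical (the sample rate is not given in the text), and the sketch covers the framing step only, not the ICA itself.

```python
def frame_signal(samples, sample_rate, window_ms=10.0, hop_fraction=0.5):
    """Split a sequence of samples into overlapping fixed-length frames.

    window_ms and hop_fraction follow the values quoted in the text
    (10 ms window, 50% hop); sample_rate is supplied by the caller.
    """
    window_len = int(round(sample_rate * window_ms / 1000.0))
    hop_len = max(1, int(round(window_len * hop_fraction)))
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + window_len <= len(samples):
        frames.append(samples[start:start + window_len])
        start += hop_len
    return frames


# Hypothetical input: one second of silence at an assumed 16 kHz sample rate.
frames = frame_signal([0.0] * 16000, sample_rate=16000)
```

At the assumed 16 kHz, each frame holds 160 samples and consecutive frames start 80 samples apart.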

Various features are calculated from the magnitude spectral bases and temporal envelopes to be used as input to the classifiers in the next stage. They include the Mel-Frequency Cepstral Coefficients (MFCC) and the Perceptual-Linear Prediction Cepstra (PLPC), both calculated using Malcolm Slaney's Auditory Toolbox [9] with 40 frequency bands. The first coefficient is omitted to ignore scaling differences, leaving twelve coefficients each (MFCC-12 and PLPC-12). These features describe the shape of the spectral envelope on a log-frequency scale similar to that of the human ear, and they have enjoyed considerable success in past recognition tasks, especially in speech. An additional spectral feature tried in this experiment is the log-scale spectral centroid (SC) in kHz.

While it is possible to use temporal features, they are notably hard to extract from a polyphonic signal because they require a good segmentation, which is in general hard to do automatically. However, for pre-segmented notes, as in the case of the two-tonal mixtures, some easy-to-calculate temporal features are tried. They are the temporal centroid (TC), as a ratio of total duration; the crest factor (CF), as the peak/rms ratio; and the amplitude modulation content, as a ratio of total energy, in the bands 4-8 Hz (AM48) and 10-40 Hz (AM1040).
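Two of these temporal features are simple enough to sketch directly. The code below is a minimal illustration, not the implementation used in the experiment: the function names are hypothetical, and weighting the temporal centroid by the envelope amplitude (rather than, say, its energy) is an assumption. The amplitude modulation features are not covered.

```python
import math


def temporal_centroid(envelope):
    """Centroid of a temporal envelope as a fraction of total duration (0 to 1).

    Assumption: each time index is weighted by the envelope amplitude.
    """
    total = sum(envelope)
    if total == 0 or len(envelope) < 2:
        return 0.0
    weighted = sum(i * e for i, e in enumerate(envelope))
    # Normalise by the last index so the result is a ratio of duration.
    return weighted / total / (len(envelope) - 1)


def crest_factor(samples):
    """Peak absolute amplitude divided by the RMS amplitude."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    if rms == 0:
        return 0.0
    return max(abs(x) for x in samples) / rms
```

A flat envelope has its centroid at the midpoint of the note, while a percussive envelope that decays quickly yields a value closer to zero.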

The k-nearest neighbor (k-NN) classifier and Gaussian Mixture Models (GMM) are used as classifiers in this experiment. For k-NN, the Mahalanobis distance is used to deal with the different scaling of, and correlation among, the features. In the experiments it almost always gives 2-3% better results than the Euclidean distance. Each ``independent'' component and basis is classified individually before votes are taken to decide which two sources make up the mixture. At most eight components take part in the vote. If a draw occurs, the source assigned to more of the higher-energy components prevails. If the draw is still unresolved, the higher total number of nearest neighbors and the lower total distance (for k-NN), or the higher total log likelihood (for GMM), are considered until two sources are chosen.
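The voting stage can be sketched roughly as below. This is a simplified illustration under assumptions: the function name `pick_two_sources` and the instrument labels are hypothetical, the energy tie-break is approximated by the summed energy of the components assigned to each source, and the further tie-breaks (neighbor counts, distances, log likelihood) are not implemented.

```python
from collections import Counter, defaultdict


def pick_two_sources(component_labels, component_energies):
    """Choose two sources from per-component classification results.

    component_labels: the label assigned to each component by a classifier.
    component_energies: the energy of each component, in the same order.
    Ties in vote count are broken by summed component energy (an assumption).
    """
    votes = Counter(component_labels)
    energy = defaultdict(float)
    for label, e in zip(component_labels, component_energies):
        energy[label] += e
    ranked = sorted(votes, key=lambda lab: (votes[lab], energy[lab]), reverse=True)
    return ranked[:2]


# Hypothetical classification of eight components, ordered by decreasing energy.
labels = ["flute", "violin", "flute", "piano", "violin", "piano", "flute", "oboe"]
energies = [5.0, 4.0, 3.0, 2.5, 2.0, 1.0, 0.5, 0.2]
chosen = pick_two_sources(labels, energies)
```

In this example "flute" wins outright with three votes, and "violin" beats "piano" on the tie-break because its two components carry more energy.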

Samples of instruments (about 60 each) were taken from the Iowa and McGill chromatic scale samples. To limit the effect of pitch, only the octave C4-C5 was used. 80% of the available notes were used for training, while the remaining notes were combined exhaustively to make mixtures.
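The data preparation can be illustrated with a short sketch. It rests on assumptions that the text does not confirm: notes are split at random per instrument, and "combined exhaustively" is read as pairing every held-out note of one instrument with every held-out note of each other instrument. All names and the toy data are hypothetical.

```python
import random
from itertools import combinations, product


def split_notes(notes, train_fraction=0.8, seed=0):
    """Randomly split one instrument's notes into training and held-out sets."""
    shuffled = list(notes)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(len(shuffled) * train_fraction))
    return shuffled[:n_train], shuffled[n_train:]


def exhaustive_pairs(held_out_by_instrument):
    """Pair every held-out note with every held-out note of a different instrument."""
    pairs = []
    for inst_a, inst_b in combinations(sorted(held_out_by_instrument), 2):
        for note_a, note_b in product(held_out_by_instrument[inst_a],
                                      held_out_by_instrument[inst_b]):
            pairs.append(((inst_a, note_a), (inst_b, note_b)))
    return pairs


# Toy data: three hypothetical instruments with ten note identifiers each.
notes = {name: [f"{name}_{i}" for i in range(10)] for name in ("flute", "oboe", "violin")}
held_out = {}
for name, note_list in notes.items():
    train, test = split_notes(note_list)
    held_out[name] = test
mixtures = exhaustive_pairs(held_out)
```

With two held-out notes per instrument and three instruments, this yields three instrument pairs of four note combinations each.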


Pamornpol Jinachitra 2004-02-25