
Sinusoid Modeling

This approach involves sinusoidal parameter estimation and source grouping. Virtually all CASA-based systems represent the signal in a multi-dimensional space (e.g., time-frequency), followed by frequency-amplitude analysis and grouping. Sinusoid modeling is one instance, in which the signal is modeled as a sum of sinusoids (a sine+noise+transient model also exists, but no separation technique has used it). While it can give high-quality separation when the mixture is not too complex and the sources are harmonic, existing methods do not account for the non-sinusoidal part of the sources and can fail badly when many sources are present, as in real songs. The computation required also grows many-fold with each added source, owing to the increased dimension of the parameter space. Its advantage is that it is parametric, which tends to be parsimonious for compression and flexible for modification.
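Concretely, the model represents the signal as a sum of K sinusoids with slowly varying amplitudes and frequencies, in the standard McAulay-Quatieri form [16]:

    s(t) = \sum_{k=1}^{K} a_k(t) \cos\theta_k(t), \qquad
    \theta_k(t) = \theta_k(0) + 2\pi \int_0^t f_k(\tau)\, d\tau

Estimating the model amounts to fitting the amplitude tracks a_k(t) and frequency tracks f_k(t) frame by frame, and separation amounts to assigning each fitted partial to a source.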

One of the earliest works on harmonic source separation, demonstrated on vowel sounds with non-overlapping harmonics, was done by Parsons (1976) [15]. An FFT was used as a front end, and spectral peaks were detected before grouping by pitch. A similar but more perceptually motivated algorithm was developed by Weintraub (1985) [28], in which autocorrelation peaks with similar periodicities across the channels of a cochlear filterbank were used to group partials. Others used a comb filter to notch out unwanted harmonics, e.g. de Cheveigné (1993) [7]. In general, pitch estimation is needed for this type of separation.

Meddis and Hewitt (1992) [12] proposed a way to separate vowel sounds using pitch cues, similar to [15]; their method appears to have become a popular reference for pitch estimation. They used a cochlear filterbank followed by an element that computes the probability of firing of a given inner hair cell along the basilar membrane. The autocorrelation of this probability function is then computed in each channel to find a dominant periodicity, and channels with consistent periodicities are grouped to obtain the whole spectrum belonging to one source. It is also one of the first algorithms to shift toward a physiological model of human audition. A polyphonic pitch estimation based on spectral smoothing, claimed to occur in the human ear, was also proposed by Klapuri [22]. Iterative parameter estimation was then carried out in [24] [25], while a grouping algorithm was presented in [23]. Similar earlier work on iterative estimation for source separation was done by McAulay and Quatieri [16], where a least-squares solution was found for the frequency and amplitude estimates; overlapping harmonics, which make the matrix ill-conditioned, were left out during the iteration. A sketch of the common front-end pipeline follows below.
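Since most of these systems share a similar front end, a minimal sketch of the common pipeline may help: autocorrelation-based pitch estimation (in the spirit of [28] and [12], though applied to the raw waveform rather than to cochlear-channel firing probabilities) followed by FFT peak picking and pitch-based grouping as in [15]. It is written in Python with NumPy; the function names, pitch range, and 3% tolerance are illustrative assumptions, and a real system would run this frame by frame and track partials over time.

    import numpy as np

    def acf_pitch(x, fs, fmin=60.0, fmax=500.0):
        # Dominant periodicity via the autocorrelation peak, restricted
        # to lags corresponding to the [fmin, fmax] pitch range.
        r = np.correlate(x, x, mode='full')[len(x) - 1:]
        lo, hi = int(fs / fmax), int(fs / fmin)
        lag = lo + np.argmax(r[lo:hi])
        return fs / lag

    def harmonic_peaks(x, fs, f0, n_harm=20, tol=0.03):
        # Pick FFT magnitude peaks lying near integer multiples of f0
        # and return the (frequency, amplitude) pairs grouped to that pitch.
        win = np.hanning(len(x))
        spec = np.abs(np.fft.rfft(x * win))
        freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
        # indices of local maxima of the magnitude spectrum
        peaks = np.where((spec[1:-1] > spec[:-2]) & (spec[1:-1] > spec[2:]))[0] + 1
        grouped = []
        for h in range(1, n_harm + 1):
            target = h * f0
            if target > fs / 2 or len(peaks) == 0:
                break
            # nearest detected peak to the h-th harmonic, within +/- tol
            k = peaks[np.argmin(np.abs(freqs[peaks] - target))]
            if abs(freqs[k] - target) < tol * target:
                grouped.append((freqs[k], spec[k]))
        return grouped

    # Example: a 220 Hz harmonic tone; acf_pitch recovers f0 and
    # harmonic_peaks collects the partials belonging to that source.
    fs = 16000
    t = np.arange(2048) / fs
    x = sum(np.cos(2 * np.pi * 220 * h * t) / h for h in range(1, 6))
    partials = harmonic_peaks(x, fs, acf_pitch(x, fs))

With two simultaneous voices, the same grouping step would be repeated per estimated pitch, which is where the overlapping-harmonic and ill-conditioning problems discussed above arise.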


