Single Mixture Blind Source Separation

After some promising work on separation of degenerate mixtures (more sources than sensors), a single-channel ICA method is presented in  [34]  [33]  [35]. It assumes the audio source signal can be represented by a set of ICA basis functions learned from training data. These bases are then used to discriminate between sources under an assumption of source independence in the probability model. When trained on the test signals themselves, the method shows significant separation, leaving the target source intact while suppressing the other. However, it does depend on training and parameter fine-tuning, which may need more extensive experimentation. Testing with sources outside the training set gives worse performance, although the case of one speaker's voice embedded in a background of many people talking shows some improvement. The author suggests an extensive investigation of how to build a dictionary of bases for use on arbitrary real recordings. The need for learned bases is what limits this algorithm. Additionally, since the bases are not required to be orthogonal (only the source coefficients are constrained to be independent), there is usually too much overlap in signal space between similar sources, e.g. speech and speech, as is evident in the inferior separation compared to the speech-music case. Performance also depends on the model of the source PDF, but this seems to pose no big problem, especially when a generalized Gaussian PDF model is used as proposed. A problem also arises when dealing with more than two sources; division into subproblems of two sources is suggested, but a larger dictionary of bases is then needed.
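The basis-learning stage can be sketched as follows. This is a minimal numpy illustration, not the cited authors' implementation: signal frames (one per row) are whitened and a symmetric FastICA-style fixed point extracts the bases, with a tanh nonlinearity standing in for the generalized Gaussian source prior used in the cited work.

```python
import numpy as np

def learn_ica_bases(frames, n_bases, n_iter=200, seed=0):
    """Learn ICA basis functions from signal frames (one frame per row).

    Sketch of the training stage: whiten the frame matrix, then run a
    symmetric FastICA-style fixed point.  The tanh nonlinearity is a
    stand-in for the generalized Gaussian prior of the cited method.
    """
    rng = np.random.default_rng(seed)
    X = frames - frames.mean(axis=0)            # center each dimension
    d, E = np.linalg.eigh(X.T @ X / len(X))     # covariance eigenpairs
    keep = np.argsort(d)[::-1][:n_bases]
    V = E[:, keep] / np.sqrt(d[keep])           # whitening matrix
    Z = X @ V                                   # whitened data
    W = rng.standard_normal((n_bases, n_bases))
    for _ in range(n_iter):
        Y = np.tanh(Z @ W.T)
        W_new = (Y.T @ Z) / len(Z) - np.diag((1 - Y**2).mean(axis=0)) @ W
        U, _, Vt = np.linalg.svd(W_new)         # symmetric decorrelation
        W = U @ Vt
    unmix = W @ V.T                 # rows: filters in the frame domain
    bases = np.linalg.pinv(unmix)   # columns: learned basis functions
    return bases, unmix
```

Given the learned filters, the source coefficients of new frames are obtained as `frames @ unmix.T`; the separation step proper (inferring which coefficients belong to which source) additionally needs the independence prior and is omitted here.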

Another way to cope with the degeneracy of having just one microphone is to project the signal onto a higher-dimensional subspace before the usual analysis. Independent Subspace Analysis was first introduced by Hyvärinen  [32]. Casey and Westner extended the subsequent work to circumvent the problem of having only one recording channel in conventional ICA  [30]. Basically, the signal is projected onto mutually independent subspaces. Each source, corresponding to an energy track, can however be spanned by more than one of these subspaces, i.e. the independence assumption is relaxed. The problem of having just one channel is removed by taking the STFT magnitude and treating each frequency bin as a different channel recording. Any conventional ICA can then be applied to extract the independent outputs, corresponding to the magnitude (or energy) tracks in time that are most independent. In  [39], Smaragdis took the idea further and investigated in particular an ICA technique exploiting the mutual information criterion. He showed that basic acoustic cues such as harmonicity, common AM/FM modulation, and frequency proximity yield higher mutual information and hence tend to be grouped together into one source by the algorithm. The aim of the experiment was to unify the framework of perceptual grouping under a mutual-information hypothesis of what the brain does. The technique has been applied successfully to a drum track, showing good separation of the kick drum, the snare, and the hi-hat. On a complex mixture such as a typical song, the results are not as good, showing unclean energy separation, especially in frequency regions where components of more than one source overlap. Also, even though a singing voice is perceived as one source, the algorithm cannot guarantee that the same auditory object stays in the same output; the singing voice can be split across two output channels. The remedy is either to segment the signal into appropriate intervals, e.g. single-note parts versus chords, before analysis, or to use some measure of similarity across time frames or output channels to group streams of the same source together, as proposed in the original papers.
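The bins-as-channels trick can be sketched as follows. This is a minimal numpy illustration under simplifying assumptions, not the cited implementations: the rows of a magnitude spectrogram are treated as parallel channel recordings, PCA reduces them to a few dimensions, and a FastICA-style fixed point extracts independent time tracks.

```python
import numpy as np

def isa_tracks(M, n_comp, n_iter=200, seed=0):
    """ISA sketch on a magnitude spectrogram M of shape (freq, time).

    Each frequency bin is treated as a separate 'channel recording';
    PCA reduces the bins to n_comp dimensions and a FastICA-style
    fixed point extracts independent time tracks (rows of the result).
    """
    rng = np.random.default_rng(seed)
    X = M.T - M.T.mean(axis=0)                  # time frames as samples
    d, E = np.linalg.eigh(X.T @ X / len(X))     # covariance eigenpairs
    keep = np.argsort(d)[::-1][:n_comp]
    Z = X @ (E[:, keep] / np.sqrt(d[keep]))     # whitened, reduced
    W = rng.standard_normal((n_comp, n_comp))
    for _ in range(n_iter):
        Y = np.tanh(Z @ W.T)
        W_new = (Y.T @ Z) / len(Z) - np.diag((1 - Y**2).mean(axis=0)) @ W
        U, _, Vt = np.linalg.svd(W_new)         # symmetric decorrelation
        W = U @ Vt
    return (Z @ W.T).T                          # independent time tracks
```

On a toy spectrogram built from two spectral profiles with sparse, independent activations (kick-like and hat-like bursts), the recovered tracks line up with the true activations up to sign and scale; the grouping of multiple tracks back into one source is the separate clustering step discussed above.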

In  [19], the so-called refiltering technique is used to separate streams of sources, assuming they are disjoint in time. The algorithm is a hybrid of CASA and a form of statistical learning, though not ICA. A speaker-dependent HMM is fit to the training data and then used to determine a binary masking function for each time sample. The states of the HMM are derived from pairs of STFT coefficients. The separation is successful when applied to a mixture of training samples, but it is training- and speaker-dependent, and the time-disjoint assumption is invalid in many instances. This is one of several hybrid CASA-statistical systems for source separation; another is due to Cichocki in  [31]. Masking is also one of the most common schemes used in source separation. Its success obviously depends on the masking function and on how the signal representation allocates the source energy among the coefficients.
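The refiltering step itself (as opposed to the HMM inference that chooses the mask) can be sketched as follows. This numpy illustration uses an oracle mask computed from the clean sources rather than an HMM, purely to show the mask-and-overlap-add mechanics: with a periodic Hann analysis window at 50% overlap, the shifted windows sum to one, so unmasked time-frequency regions reconstruct exactly.

```python
import numpy as np

def stft_frames(x, n=256, hop=128):
    """Windowed frames of x and their rFFTs (periodic Hann, 50% overlap)."""
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)   # periodic Hann
    idx = np.arange(0, len(x) - n + 1, hop)
    frames = np.stack([w * x[i:i + n] for i in idx])
    return np.fft.rfft(frames, axis=1), idx

def refilter(mix, mask, n=256, hop=128):
    """Apply a binary time-frequency mask to the mixture, then overlap-add.

    Periodic Hann windows at hop n/2 sum to one, so regions where the
    mask is one are reconstructed exactly (away from the signal edges).
    """
    X, idx = stft_frames(mix, n, hop)
    y = np.zeros(len(mix))
    for k, i in enumerate(idx):
        y[i:i + n] += np.fft.irfft(X[k] * mask[k], n)
    return y
```

For two sources that are well separated in frequency, the oracle mask `np.abs(S1_frames) > np.abs(S2_frames)` recovers the first source from the mixture with high SNR; the quality of any real masking system then hinges entirely on how well the mask is estimated, which is exactly the point made above.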

Pamornpol Jinachitra
Tue Jun 17 16:27:28 PDT 2003