Dynamics Modeling in Sound Mixtures
Gautham J. Mysore
To model sounds accurately, we need to exploit as much of their structure as possible. Dictionary learning methods such as non-negative matrix factorization (NMF) and probabilistic latent component analysis (PLCA) model the spectral structure of sounds well. However, they do not provide a statistical description of the temporal structure (dynamics) of sounds. The importance of dynamics is demonstrated by the use of hidden Markov models (HMMs) in speech recognition. HMMs, however, have a rigid observation model that is not amenable to capturing variations in spectral structure across different occurrences of a single state. We propose a novel algorithm for jointly learning the spectral structure of sounds and a statistical description of their dynamics. In this algorithm, rather than learning a single dictionary to characterize spectral structure, we learn several small dictionaries, each describing a different aspect of the sound. We jointly learn a Markov chain that describes the dynamics between the dictionaries. We then propose a method of combining models of individual sounds through an additive interaction model. The result is a model of multiple sound sources that incorporates the spectral and temporal structure of each source. This is a general model for mixtures of sounds and can be used for various inference tasks. We discuss the application of this algorithm to source separation.
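The structure described above can be illustrated with a small generative sketch. It is not the learning algorithm itself: it only samples from a toy source in which a Markov chain selects one small dictionary per time frame, and then adds two such sources to form a mixture. All names (`sample_source`, `random_dictionaries`, `sticky_transition`) are hypothetical, and the random dictionaries, Dirichlet-distributed atom weights, and "sticky" transition matrix are assumptions made purely for illustration.

```python
import numpy as np

def random_dictionaries(n_states, n_freq, n_atoms, rng):
    """Hypothetical stand-in for learned dictionaries: one small dictionary per state."""
    d = rng.random((n_states, n_freq, n_atoms))
    # Normalize each atom so it is a distribution over frequency.
    return d / d.sum(axis=1, keepdims=True)

def sticky_transition(n_states, stay=0.9):
    """Assumed transition matrix that favors staying in the current dictionary."""
    off = (1.0 - stay) / (n_states - 1)
    transition = np.full((n_states, n_states), off)
    np.fill_diagonal(transition, stay)
    return transition

def sample_source(dictionaries, transition, initial, n_frames, rng):
    """Sample a toy magnitude spectrogram and the dictionary index used per frame."""
    n_states, n_freq, n_atoms = dictionaries.shape
    states = np.empty(n_frames, dtype=int)
    spectrogram = np.empty((n_freq, n_frames))
    state = rng.choice(n_states, p=initial)
    for t in range(n_frames):
        states[t] = state
        # Non-negative weights over the atoms of the active dictionary only,
        # so frames in the same state can still differ spectrally.
        weights = rng.dirichlet(np.ones(n_atoms))
        spectrogram[:, t] = dictionaries[state] @ weights
        # The Markov chain decides which dictionary explains the next frame.
        state = rng.choice(n_states, p=transition[state])
    return spectrogram, states

rng = np.random.default_rng(0)
n_states, n_freq, n_atoms, n_frames = 3, 64, 4, 50
initial = np.full(n_states, 1.0 / n_states)
transition = sticky_transition(n_states)

source_a, states_a = sample_source(
    random_dictionaries(n_states, n_freq, n_atoms, rng), transition, initial, n_frames, rng)
source_b, states_b = sample_source(
    random_dictionaries(n_states, n_freq, n_atoms, rng), transition, initial, n_frames, rng)

# Additive interaction: the mixture spectrogram is the sum of the source spectrograms.
mixture = source_a + source_b
```

In this sketch each frame of a source is explained by a single small dictionary, while the weights within that dictionary vary from frame to frame. That combination is what distinguishes the model from a single large dictionary on one side and from an HMM with a fixed per-state spectrum on the other. Source separation would amount to inferring `source_a` and `source_b` from `mixture`, which the sketch does not attempt.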