A Non-negative Framework for Joint Modeling of Spectral Structure and Temporal Dynamics in Sound Mixtures
A common theme in most successful approaches to modeling audio is the exploitation of structure. In particular, audio has strong spectral and temporal structure. When dealing with sound mixtures, the structure of the individual sources becomes especially important if we wish to treat the sources separately. In recent years, there has been a great deal of work on modeling audio using non-negative matrix factorization (NMF) and its probabilistic counterparts. These methods, however, fail to model the temporal dynamics of a sound source. Hidden Markov models (HMMs), on the other hand, have been used for decades to model temporal dynamics and can be very powerful for audio analysis, as demonstrated by their application to speech recognition. However, they have several limitations when it comes to high-quality reconstruction. We propose a new model, the non-negative hidden Markov model (N-HMM), that combines the best of both worlds. In the proposed model, we jointly learn several small dictionaries that characterize the spectral structure of a given sound source, as well as a Markov chain that characterizes its temporal dynamics. We then propose a model of sound mixtures, the non-negative factorial hidden Markov model (N-FHMM), that combines the models of the individual sources. We demonstrate this model on the application of source separation.
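To make the NMF baseline concrete, the following is a minimal sketch (not the paper's N-HMM) of factorizing a magnitude spectrogram V into a spectral dictionary W and activations H with the standard multiplicative updates for the generalized KL divergence; the matrix sizes and synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative "spectrogram": F frequency bins x T frames (synthetic, illustrative).
F, T, K = 20, 40, 3
V = rng.random((F, T)) + 1e-3

# Random non-negative initialization of dictionary W (F x K) and activations H (K x T).
W = rng.random((F, K)) + 1e-3
H = rng.random((K, T)) + 1e-3

def kl_div(V, WH):
    # Generalized KL divergence between V and its approximation WH.
    return np.sum(V * np.log(V / WH) - V + WH)

losses = []
for _ in range(100):
    # Multiplicative updates for the generalized KL objective (Lee & Seung style).
    WH = W @ H
    H *= (W.T @ (V / WH)) / W.sum(axis=0, keepdims=True).T
    WH = W @ H
    W *= ((V / WH) @ H.T) / H.sum(axis=1, keepdims=True).T
    losses.append(kl_div(V, W @ H))
```

Each column of W acts as a spectral template and each row of H as its time-varying gain; note that nothing in these updates constrains how H evolves from frame to frame, which is exactly the temporal structure the N-HMM's Markov chain is introduced to capture.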