Modeling Spectral and Temporal Structure in Sound Mixtures
Mathematical modeling of sound has been pursued for decades. Audio is highly structured, and good models need to exploit that structure; in particular, audio has strong spectral and temporal structure. When dealing with sound mixtures, the structure of the individual sources becomes especially important if we wish to deal with them separately. In recent years, dictionary learning methods such as non-negative matrix factorization (NMF) and probabilistic latent component analysis (PLCA) have become popular because they provide a rich representation of audio spectra and are amenable to high-quality reconstruction of sounds. However, they do not provide a statistical description of temporal structure. Hidden Markov models (HMMs), on the other hand, have been used for decades to model temporal structure. They can be very powerful for audio analysis, as shown by their application to speech recognition, but they have several limitations when it comes to reconstruction, which is an issue if we desire high-quality audio output. We propose a new algorithm that combines the strengths of both approaches. The proposed method jointly learns several small dictionaries that characterize the spectral structure of a given sound, together with the temporal structure of that sound. As in NMF and PLCA, the dictionary elements are all non-negative, which gives them a semantic interpretation and allows non-destructive mixing of the dictionary elements. The method additionally imposes a hierarchical structure on the dictionaries. We use this algorithm to decompose sounds, process the individual parts, and reconstruct them. We demonstrate this on content-aware audio processing; for example, we change a major arpeggio to a minor arpeggio. We then propose a method of modeling sound mixtures by combining models of individual sources. Because sound mixtures are commonly encountered, this can be used for various applications. We demonstrate it on source separation.
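For reference, the kind of non-negative dictionary learning mentioned above can be sketched in a few lines. This is a minimal illustration of standard NMF with the multiplicative updates for the generalized Kullback-Leibler divergence, not the proposed method: it learns a single dictionary and models no temporal structure. The function name `nmf_kl`, its parameters, and the synthetic toy "spectrogram" are all assumptions made for this example.

```python
import numpy as np

def nmf_kl(V, n_components, n_iter=200, eps=1e-10, seed=0):
    """Factor a non-negative matrix V (freq x frames) as V ~= W @ H.

    W holds the dictionary elements (one spectrum per column) and
    H holds their non-negative activations over time.
    Hypothetical helper written for illustration only.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_components)) + eps
    H = rng.random((n_components, n_frames)) + eps
    for _ in range(n_iter):
        # Update activations: H <- H * (W^T (V / WH)) / (W^T 1)
        R = V / (W @ H + eps)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)
        # Update dictionary: W <- W * ((V / WH) H^T) / (1 H^T)
        R = V / (W @ H + eps)
        W *= (R @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

# Toy data: a synthetic non-negative matrix built from 3 components.
rng = np.random.default_rng(1)
W_true = rng.random((16, 3))
H_true = rng.random((3, 40))
V = W_true @ H_true

W, H = nmf_kl(V, n_components=3)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Because every factor stays non-negative, each column of `W` can be read as a spectrum that is only ever added to the reconstruction, never subtracted, which is the property the abstract refers to as non-destructive mixing.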