We present a method for recognizing sound sources in a mixture. Using source separation ideas based on probabilistic latent component analysis (PLCA), we learn a dictionary for each source and estimate the relative proportions of the sound sources in a mixture by decomposing the mixture with these dictionaries and summing the corresponding activations. In addition to the basic model, we introduce a new method for learning temporal dependencies among dictionary elements using a transition matrix. We show that this temporally constrained model yields better results than the basic model.
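The basic (unconstrained) decomposition can be illustrated with a minimal sketch. It assumes the common PLCA formulation in which a magnitude spectrogram V[f, t] is modeled as sum_z P(f|z) P_t(z) and fitted by EM. The function names (`plca`, `source_proportions`), the toy data, and the choice of two components per source are hypothetical and are not taken from the authors' implementation. The temporally constrained model with the transition matrix is not covered.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero


def _normalize_columns(M):
    """Scale each column so that it sums to one."""
    return M / (M.sum(axis=0, keepdims=True) + EPS)


def plca(V, W=None, n_components=2, n_iter=100, seed=0):
    """Fit V[f, t] ~ sum_z W[f, z] * H[z, t] with EM updates.

    Columns of W play the role of P(f|z) (dictionary elements) and
    columns of H the role of P_t(z) (per-frame activations).
    If W is supplied it is held fixed and only H is estimated.
    """
    rng = np.random.default_rng(seed)
    learn_W = W is None
    if learn_W:
        W = _normalize_columns(rng.random((V.shape[0], n_components)))
    H = _normalize_columns(rng.random((W.shape[1], V.shape[1])))
    for _ in range(n_iter):
        R = V / (W @ H + EPS)        # ratio of data to current model
        H_new = H * (W.T @ R)        # activation update (uses old W)
        if learn_W:
            W = _normalize_columns(W * (R @ H.T))  # dictionary update (uses old H)
        H = _normalize_columns(H_new)
    return W, H


def source_proportions(V_mix, dictionaries, n_iter=100):
    """Decompose a mixture with fixed per-source dictionaries and
    sum the activations belonging to each source, frame by frame."""
    W = np.hstack(dictionaries)
    _, H = plca(V_mix, W=W, n_iter=n_iter)
    edges = np.cumsum([0] + [D.shape[1] for D in dictionaries])
    return np.vstack([H[a:b].sum(axis=0) for a, b in zip(edges[:-1], edges[1:])])


# Toy data (an assumption for illustration): two sources that occupy
# disjoint frequency bands, so the expected proportions are easy to check.
rng = np.random.default_rng(1)
n_bins, n_frames = 8, 40
train_a = np.zeros((n_bins, n_frames))
train_a[:4] = rng.random((4, n_frames))
train_b = np.zeros((n_bins, n_frames))
train_b[4:] = rng.random((4, n_frames))

# Learn one dictionary per source from its own training spectrogram.
W_a, _ = plca(train_a, n_components=2)
W_b, _ = plca(train_b, n_components=2, seed=1)

# Mix the sources with 80% / 20% of the energy in every frame.
mix = 0.8 * _normalize_columns(train_a) + 0.2 * _normalize_columns(train_b)
proportions = source_proportions(mix, [W_a, W_b])
print(proportions.mean(axis=1))
```

Because each column of the activation matrix sums to one, the summed activations per source can be read directly as relative proportions for that frame. In this toy setup the two rows should come out close to 0.8 and 0.2; real sources overlap in frequency, so the estimates there depend on how well the dictionaries discriminate between sources.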
This video demo shows the levels of three different sound sources (speech, gun, and airplane) in the audio track of a movie. The bars on the left are based on the basic model, and those on the right are based on the improved model (temporally constrained with the transition matrix). The video shows that the improved model produces fewer false alarms, which are marked with red circles.