Language Informed Speech Separation


High-level information about a problem can be very useful for constraining it, if used in the right way. When performing speech separation (separating a mixture of multiple speakers), high-level information about what the speakers are saying is a powerful constraint. Although we will rarely know the exact utterances that were spoken, we may know that the utterances are drawn from a constrained vocabulary of words. Moreover, we may know the grammatical rules by which these words are combined into utterances. Such a specification is called a language model, and it has been crucial to speech recognition performance. We use this idea to constrain speech separation.


The models that we use are the non-negative hidden Markov model (N-HMM) and the non-negative factorial hidden Markov model (N-FHMM), as explained here.


The N-HMM models a sound source with multiple dictionaries, such that the spectrum of each time frame of audio is explained by a linear combination of the spectral components of exactly one (of the many) dictionaries. Additionally, it models the transitions between dictionaries using a Markov chain. The model is conceptually depicted below.

Non-negative hidden Markov model with four dictionaries and a left-right Markov chain.
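As a minimal illustration (not the authors' implementation; the dimensions and values below are assumed), the observation model can be sketched as follows: whichever hidden state is active in a frame selects one dictionary, and the frame's normalized spectrum is a convex combination of that dictionary's spectral components alone.

```python
import numpy as np

rng = np.random.default_rng(0)

n_freq, n_states, n_components = 513, 4, 10    # assumed sizes, for illustration

# One non-negative spectral dictionary per hidden state; each column is a
# spectral component, normalized to a distribution over frequency.
dictionaries = rng.random((n_states, n_freq, n_components))
dictionaries /= dictionaries.sum(axis=1, keepdims=True)

def frame_spectrum(state, weights):
    """Modeled spectrum of one frame: a convex combination of the components
    of the single dictionary selected by the active hidden state."""
    weights = weights / weights.sum()      # mixture weights for this frame
    return dictionaries[state] @ weights   # a distribution over frequency

spectrum = frame_spectrum(state=2, weights=rng.random(n_components))
```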

We learn an N-HMM for each word in the given vocabulary, for each individual speaker. Specifically, we use the data from the speech separation challenge, which has 52 words per speaker. For each word, we learn an N-HMM with a left-right Markov chain from multiple training instances of that word.
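For intuition, a left-right Markov chain of the kind used here can be encoded as a transition matrix in which each state either repeats or advances to the next state (a sketch with an assumed self-transition probability, not the trained values):

```python
import numpy as np

def left_right_transitions(n_states, p_stay=0.7):
    """Left-right Markov chain: each state self-loops with probability p_stay
    or advances to the next state; the final state is absorbing."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = p_stay
        A[i, i + 1] = 1.0 - p_stay
    A[-1, -1] = 1.0
    return A

A = left_right_transitions(4)   # four dictionaries, as in the figure above
```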


We then combine the word-level N-HMMs into a single large N-HMM according to the grammatical rules specified by the following language model from the speech separation challenge. The model specifies that we choose exactly one word from each category, in the sequence shown.

Language Model


We combine the word-level models by connecting the relevant parts of the Markov chains of the individual words, and by retaining the correspondences between the states and the dictionaries. This gives us a large N-HMM for each speaker, as conceptually depicted below (a sketch of this construction follows the figure).


Speaker Level N-HMM
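To make this construction concrete, here is a sketch of how per-word transition matrices might be stitched into one large chain according to the grammar. The data structures, probabilities, and function names are assumptions for illustration, not the authors' code; each state keeps its dictionary, now at an offset index in the large model.

```python
import numpy as np

def concatenate_word_models(word_A, follows, p_exit=0.3):
    """Build one large transition matrix from per-word left-right chains.

    word_A  : list of per-word transition matrices
    follows : follows[w] lists the words the grammar allows after word w
    """
    offsets = np.cumsum([0] + [A.shape[0] for A in word_A])
    big = np.zeros((offsets[-1], offsets[-1]))
    for w, A in enumerate(word_A):
        start, end = offsets[w], offsets[w + 1]
        big[start:end, start:end] = A          # the word's own chain
        if follows[w]:
            # Route probability mass from the word's final state to the
            # first state of each grammatically allowed successor word.
            last = end - 1
            big[last, last] = 1.0 - p_exit
            for nxt in follows[w]:
                big[last, offsets[nxt]] = p_exit / len(follows[w])
    return big
```

For the challenge grammar, follows would simply route every word in one category to every word in the next category of the sequence.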

We combine the N-HMMs of pairs of speakers into a non-negative factorial hidden Markov model (N-FHMM). We then perform speech separation on test data consisting of mixtures of various pairs of speakers.
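The N-FHMM inference itself is detailed in the paper, but the final reconstruction step can be sketched generically. Assuming the fitted model provides magnitude estimates est1 and est2 for the two speakers (hypothetical names), each source is recovered by soft-masking the mixture spectrogram:

```python
import numpy as np

def separate(mixture_stft, est1, est2, eps=1e-12):
    """Wiener-style soft masking: split each time-frequency bin of the
    mixture in proportion to the two speakers' modeled magnitudes."""
    mask1 = est1 / (est1 + est2 + eps)     # speaker 1's share of each bin
    src1 = mask1 * mixture_stft            # masking retains the mixture phase
    src2 = (1.0 - mask1) * mixture_stft
    return src1, src2
```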


An example of speech separation of two speakers of different genders using this method is shown below. As a comparison, we also show the results of separation using probabilistic latent component analysis (PLCA).

Mixture (different genders)

Speaker 1, separated using the N-FHMM

Speaker 2, separated using the N-FHMM

Speaker 1, separated using PLCA

Speaker 2, separated using PLCA

We performed this experiment on ten pairs of speakers of different genders. The mean BSS-EVAL metrics are shown in the table below. As shown, the proposed method outperforms PLCA with respect to all metrics. (A minimal example of computing such metrics follows the table.)

                  SAR (dB)   SDR (dB)   SIR (dB)
Proposed method      10.29       9.08      14.91
PLCA                  8.78       4.86       7.96
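As a point of reference, BSS-EVAL metrics like these can be computed with the mir_eval Python package (one common implementation; the original BSS-EVAL MATLAB toolbox is another). A minimal sketch with synthetic signals:

```python
import numpy as np
import mir_eval

rng = np.random.default_rng(0)
# Two reference sources and two estimates, shape (n_sources, n_samples).
reference_sources = rng.standard_normal((2, 16000))
estimated_sources = reference_sources + 0.1 * rng.standard_normal((2, 16000))

# Returns per-source SDR, SIR, and SAR in dB, plus the best permutation
# matching estimates to references.
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
    reference_sources, estimated_sources)
```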

A more difficult case is one in which the speakers are of the same gender, since their voices are spectrally much more similar to each other. An example of this is shown below.

Mixture (same gender)

Speaker 1, separated using the N-FHMM

Speaker 2, separated using the N-FHMM

Speaker 1, separated using PLCA

Speaker 2, separated using PLCA

We report the mean BSS-EVAL metrics of performing this experiment on ten pairs of speakers of the same gender. In this case as well, the proposed method outperforms PLCA with respect to all metrics.

                  SAR (dB)   SDR (dB)   SIR (dB)
Proposed method       9.89       8.77      13.88
PLCA                  8.24       2.85       5.11

The sound files for the separation results of all mixtures in the same-gender and different-gender cases can be found here.


We conclude that the high-level information provided by a language model can greatly improve speech separation performance.

Reference


  1. Gautham J. Mysore and Paris Smaragdis, “A Non-negative Approach to Language Informed Speech Separation,” to be presented at the International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), Tel Aviv, Israel, March 2012.