Robust Structured Voice Extraction for Flexible Expressive Resynthesis

 

Author : Pamornpol (Tak) Jinachitra

Advisor : Prof. Julius O. Smith

Center for Computer Research in Music and Acoustics (CCRMA)

Stanford University

 

Thesis Download (Final)

Last update : June 5, 2007

 

 

Abstract

Parametric representation of audio allows for compression of the sound. If chosen carefully, the parameters can capture the expressiveness of the sound as well as reflect the production mechanism of the sound source, and thus provide intuitive controls for changing the original sound in desirable ways. The human voice is a ubiquitous sound source used in communication and artistic expression. Informative parametric coding of speech enables low-bandwidth transmission over a network while allowing simple modifications in applications such as emotional speech synthesis for human-machine interaction and special emphasis or slowing for the hearing impaired. In the artistic realm, a modification-flexible encoding of the singing voice is also desirable for both amateur and professional recording artists.

 

To achieve the desired parametric encoding, algorithms that can robustly identify the model parameters are needed. For the whole system to be practical in real life, these algorithms must also be robust to environmental background noise. As a result, not only do we obtain an expressiveness-flexible coding system, we also obtain a model-based speech enhancement that reconstructs speech embedded in noise cleanly, free of the musical noise usually associated with filter-based approaches. This thesis describes a combination of analysis algorithms for automatic encoding of a human voice recorded in noise, based on recent developments in statistical tools and knowledge of speech sound production. The source-filter model is employed to parameterize the speech sound, especially voiced speech, and an iterative joint estimation of the glottal source and vocal tract parameters based on Kalman filtering and the Expectation-Maximization (EM) algorithm is presented. To select the right production model for each speech segment, speech segmentation is required, which is especially challenging in noise. A switching state-space model is adopted to represent the underlying speech production mechanism, involving smoothly varying hidden variables and their relation to the speech observation. To allow for variety within each production mode, training and classification based on the Generalized Pseudo Bayesian (GPB) method and a mixture model are presented. The Unscented Transform is incorporated into the algorithm to improve segmentation performance in noise. In addition, during voiced periods, the choice of glottal source model requires detection of the glottal closure instants. A dynamic programming-based algorithm with strong parametric modeling of the source, which yields a modification-flexible voice coding in itself, is also proposed. Each algorithm is evaluated against the state of the art. The combined system demonstrates the possibility of parametric extraction of speech from a clean or moderately noisy recording, so that the voice can be reconstructed or modified at will, subject to the application.
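Since the noise-robust components above build on it, a minimal illustration of the Unscented Transform may help: rather than linearizing a nonlinearity, it pushes a small set of deterministically chosen sigma points through it and re-estimates the mean and covariance from the results. The sketch below is generic Python/NumPy, not the thesis code; the function name and the test nonlinearity are arbitrary.

    import numpy as np

    def unscented_transform(mean, cov, f, alpha=1e-3, beta=2.0, kappa=0.0):
        """Propagate a Gaussian (mean, cov) through a nonlinearity f
        using the standard sigma-point construction."""
        n = mean.size
        lam = alpha**2 * (n + kappa) - n
        # Sigma points: the mean, plus/minus scaled matrix-square-root columns.
        S = np.linalg.cholesky((n + lam) * cov)
        sigma = np.vstack([mean, mean + S.T, mean - S.T])    # (2n+1, n)
        # Standard weights for the mean and covariance estimates.
        wm = np.full(2 * n + 1, 0.5 / (n + lam))
        wc = wm.copy()
        wm[0] = lam / (n + lam)
        wc[0] = wm[0] + (1 - alpha**2 + beta)
        # Push the sigma points through f, then re-estimate the moments.
        y = np.array([f(s) for s in sigma])
        y_mean = wm @ y
        d = y - y_mean
        y_cov = (wc[:, None] * d).T @ d
        return y_mean, y_cov

    # Example: a scalar Gaussian through a soft nonlinearity.
    m, P = np.array([0.5]), np.array([[0.2]])
    print(unscented_transform(m, P, np.tanh))

The estimates match the true transformed mean and covariance to second order; this sigma-point machinery is the UKF ingredient of the GPB2-UKF segmentation used for the noisy samples below.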

 

 

 

Sound Samples

  1. Clean Environment : This section uses LPC coefficients extracted from pre-emphasized clean speech, together with the glottal parameter extraction and segmentation algorithm of Chapter 4 of the thesis, to reconstruct the voice. The retained parameters are the vocal tract filter coefficients, the fundamental period (T0), the amplitude of voicing (AV), and the open quotient (OQ). A rough resynthesis sketch appears after this list.
    a. Male singing voice (normal mode)

        i. Original
        ii. Reconstruction

    b. Male singing voice (press mode)

        i. Original
        ii. Reconstruction

    c. Male singing voice

        i. Original
        ii. Reconstruction

    d. Male speech “Where were you while you were away?” (TIMIT msjs/sx9)

        i. Original
        ii. Reconstruction

    e. Male speech “Where are you?”

        i. Original
        ii. Reconstruction

    f. Male speech “She saw a fire.” : Fricatives are generated by filtered white noise, using energy estimates from basic LPC.

        i. Original
        ii. Reconstruction
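To make the reconstruction concrete, here is a rough sketch in Python (NumPy/SciPy) of driving an all-pole LPC filter with a glottal pulse parameterized by T0, AV, and OQ. It is a simplified stand-in, not the glottal model of Chapter 4: a Rosenberg-style pulse substitutes for the thesis's source, and the unvoiced branch shows the filtered-white-noise fricative generation of sample f. All function names and constants are illustrative.

    import numpy as np
    from scipy.signal import lfilter

    def rosenberg_pulse(T0, OQ, fs):
        """One Rosenberg-style glottal flow pulse: an open phase lasting
        OQ*T0 seconds (rise then fall), followed by a closed phase of zeros."""
        N = int(round(T0 * fs))
        n_open = max(2, min(N, int(round(OQ * N))))
        n_rise = max(1, int(0.6 * n_open))   # rising part of the open phase
        n_fall = n_open - n_rise             # falling part (always >= 1 here)
        rise = 0.5 * (1 - np.cos(np.pi * np.arange(n_rise) / n_rise))
        fall = np.cos(0.5 * np.pi * np.arange(n_fall) / n_fall)
        return np.concatenate([rise, fall, np.zeros(N - n_open)])

    def resynthesize(frames, fs=16000):
        """frames: list of (lpc_a, T0, AV, OQ), one entry per pitch period;
        lpc_a = [1, a1, ..., ap], and T0 <= 0 marks an unvoiced frame."""
        out, zi = [], None
        for a, T0, AV, OQ in frames:
            if zi is None:                   # assumes a fixed filter order
                zi = np.zeros(len(a) - 1)
            if T0 > 0:   # voiced: differentiated flow (radiation/pre-emphasis)
                exc = AV * np.diff(rosenberg_pulse(T0, OQ, fs), prepend=0.0)
            else:        # unvoiced: white noise scaled by the energy estimate
                exc = AV * np.random.randn(int(0.01 * fs))
            y, zi = lfilter([1.0], a, exc, zi=zi)  # carry state across periods
            out.append(y)
        return np.concatenate(out)

    # Toy usage: a constant 120 Hz vowel-like segment.
    a = np.array([1.0, -1.3, 0.7])           # illustrative 2nd-order filter
    y = resynthesize([(a, 1 / 120, 1.0, 0.6)] * 40)

Carrying the filter state across pitch periods avoids boundary clicks; the thesis's parametric glottal source would take the place of rosenberg_pulse here.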

 

 

 

  2. Noisy Environment : This section first applies basic noise suppression before the glottal segmentation algorithm of Chapter 4 is applied. The EM algorithm presented in Chapter 5 of the thesis is then used to jointly estimate the vocal tract filter and the glottal source parameters from the original noisy signal. Its variants include integration with Post Kalman Smoothing (EM-PKS) and a VQ-codebook constraint (EM-VQ). A sketch of the underlying enhancement loop appears after this section's samples.
    a. Male singing voice (normal mode)

       Original

                                            Pink noise    Pink noise    White noise   White noise
                                            SNR=10dB      SNR=20dB      SNR=10dB      SNR=20dB
       Noisy                                     x             x             x             x
       Reconstruction (EM-PKS)                   x             x             x             x
       Reconstruction (EM-VQ)                    x             x             x             x
       EM-Kalman smoothing noise suppression     x             x             x             x

    b. Male speech “Where are you?”

       Original

                                            Pink noise    Pink noise    White noise   White noise
                                            SNR=10dB      SNR=20dB      SNR=10dB      SNR=20dB
       Noisy                                     x             x             x             x
       Reconstruction (EM-PKS)                   x             x             x             x
       Reconstruction (EM-VQ)                    x             x             x             x
       EM-Kalman smoothing noise suppression     x             x             x             x

 

    c. Male speech (with fricatives) : Segmentation is first performed with the noise-compensated GPB2-UKF algorithm of Chapter 3. For voiced segments, glottal segmentation then uses the Chapter 4 algorithm (on the noise-suppressed version), followed by further smoothed EM iterations for the parameter estimates. For fricative segments, the energy and filter coefficients are simply estimated from the noise-suppressed signal, which turns out to be good enough. (The GPB2 hypothesis-collapse step is sketched after this section as well.)

        i. Original
        ii. Noisy (white noise SNR=20dB)
        iii. Reconstruction
        iv. EM-Kalman smoothing noise suppression
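The EM-based reconstructions above alternate between inferring the clean speech under the current model (a Kalman smoothing pass) and re-estimating the model from that inference. The sketch below, in Python/NumPy, conveys the shape of that loop for an AR signal in white observation noise. It is deliberately simplified: the M-step refits LPC to the smoothed waveform instead of using the smoother's sufficient statistics and the glottal source model as the thesis's EM does, and all names and constants are illustrative.

    import numpy as np

    def kalman_smooth_ar(y, a, q, r):
        """RTS smoother for an AR(p) signal in white noise.
        a: prediction coefficients (s_t = sum_k a_k s_{t-k} + w_t),
        q: drive-noise variance, r: observation-noise variance."""
        p, n = len(a), len(y)
        F = np.zeros((p, p)); F[0] = a; F[1:, :-1] = np.eye(p - 1)
        Q = np.zeros((p, p)); Q[0, 0] = q
        H = np.zeros(p); H[0] = 1.0
        xf = np.zeros((n, p)); Pf = np.zeros((n, p, p))
        xp = np.zeros((n, p)); Pp = np.zeros((n, p, p))
        x, P = np.zeros(p), np.eye(p)
        for t in range(n):                   # forward (filtering) pass
            x, P = F @ x, F @ P @ F.T + Q    # predict
            xp[t], Pp[t] = x, P
            k = P @ H / (H @ P @ H + r)      # Kalman gain
            x = x + k * (y[t] - H @ x)       # correct with the observation
            P = P - np.outer(k, H @ P)
            xf[t], Pf[t] = x, P
        xs = xf.copy()
        for t in range(n - 2, -1, -1):       # backward (RTS) pass
            G = Pf[t] @ F.T @ np.linalg.inv(Pp[t + 1])
            xs[t] = xf[t] + G @ (xs[t + 1] - xp[t + 1])
        return xs[:, 0]                      # smoothed speech estimate

    def iterative_enhance(y, p=10, iters=5, r=0.01):
        """Alternate (E) Kalman smoothing under the current AR fit with
        (M) refitting the AR model to the smoothed signal.  r is an
        assumed noise variance; in practice it would be estimated
        from non-speech frames."""
        s = y.copy()
        for _ in range(iters):
            c = np.correlate(s, s, 'full')[len(s) - 1:len(s) + p]
            R = np.array([[c[abs(i - j)] for j in range(p)] for i in range(p)])
            a = np.linalg.solve(R, c[1:p + 1])   # Yule-Walker solve
            q = max(c[0] - a @ c[1:p + 1], 1e-8) / len(s)
            s = kalman_smooth_ar(y, a, q, r)
        return s

    # e.g. enhanced = iterative_enhance(noisy_speech, p=10)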

 
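The Chapter 3 segmenter's other ingredient, the Generalized Pseudo Bayesian recursion, runs one filter per (previous mode, current mode) pair and then collapses the hypotheses for each mode by moment matching, so the hypothesis count stays bounded. A minimal sketch of that collapse step (generic Python/NumPy, not the thesis implementation):

    import numpy as np

    def collapse(weights, means, covs):
        """Moment-match a Gaussian mixture to a single Gaussian.
        weights: (k,), means: (k, n), covs: (k, n, n)."""
        w = np.asarray(weights, dtype=float)
        w /= w.sum()
        mean = w @ means                      # mixture mean
        cov = np.zeros_like(covs[0])
        for wi, mi, Pi in zip(w, means, covs):
            d = mi - mean
            cov += wi * (Pi + np.outer(d, d)) # within- plus between-spread
        return mean, cov

    # Toy: two mode-conditioned hypotheses about a 2-D state.
    m = np.array([[0.0, 1.0], [0.5, 0.8]])
    P = np.array([0.1 * np.eye(2), 0.2 * np.eye(2)])
    print(collapse([0.7, 0.3], m, P))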

 

  3. Applications : All modifications are applied to the canonical set of parameters {T0, AV, OQ} before the waveform is reconstructed. These parameters are extracted from clean voice only. A sketch of such parameter-track edits appears after this list.

    a. Male singing voice (normal mode)

 

    b. Male speech “Where are you?”

        o Original
        o Time-scaling : Slower x2
        o Pitch Shifting : F0 x1.5
        o Glottal fry + flat pitch : very small OQ + fixed T0
        o Whisper : Random Gaussian noise excitation

 

    c. Male speech “She saw a fire.” : Fricatives are simply translated (shifted in time).
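Each modification above is a simple edit to the per-period parameter tuples before resynthesis. A minimal sketch, reusing the illustrative (lpc_a, T0, AV, OQ) frame convention and resynthesize() from the clean-environment example; the fry pitch, fry OQ, and whisper gain below are arbitrary illustrative values:

    def modify(frames, mode):
        """frames: list of (lpc_a, T0, AV, OQ) as in resynthesize().
        Returns an edited copy implementing the modifications above."""
        out = []
        for a, T0, AV, OQ in frames:
            if mode == 'slower_x2':          # time-scaling: repeat each period
                out += [(a, T0, AV, OQ)] * 2
            elif mode == 'pitch_x1.5':       # pitch shift: shrink the period
                out.append((a, T0 / 1.5, AV, OQ))
            elif mode == 'glottal_fry':      # very small OQ + fixed T0
                out.append((a, 1 / 70, AV, 0.15))
            elif mode == 'whisper':          # noise excitation (unvoiced mark)
                out.append((a, 0.0, 0.3 * AV, OQ))
        return out

    # e.g. y = resynthesize(modify(frames, 'pitch_x1.5'))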

 

Email : pj97@ccrma.stanford.edu