Toward a high-quality singing synthesizer with vocal texture control


To achieve high-quality singing synthesis, both spectral modeling and physical modeling have been used in the past. However, spectral models are known to make articulation difficult and to limit expressivity. Physical models, on the other hand, have parameters that are not straightforward to adjust to reproduce a specific recording. In this thesis, a high-quality singing synthesizer is proposed, together with an analysis procedure that retrieves the model parameters automatically from the desired voices. Since 95% of singing is voiced sound, the focus of this research is to improve the naturalness of the vowel tone quality. In addition, an intuitive parametric model is developed to control the vocal textures of the synthetic voices, ranging from "pressed" to "normal" to "breathy" phonation.

To balance the complexity of the model against that of the corresponding analysis procedure, a source-filter type synthesis model is proposed. Based on a simplified human voice production system, the source-filter synthesis model describes human voices as the output of the vocal tract filter excited by a glottal excitation. The vocal tract is modeled as an all-pole filter, since the focus is restricted to non-nasal voiced sounds. To accommodate variations of vocal textures, the glottal excitation model consists of two elements: the derivative glottal wave and the aspiration noise. The derivative glottal wave is modeled by the transformed Liljencrants-Fant (LF) model. The aspiration noise is represented as pitch-synchronous, amplitude-modulated Gaussian noise.
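The source-filter structure described above can be illustrated with a minimal sketch. It is not the thesis implementation: the derivative glottal wave here is a simple polynomial pulse used as a stand-in for the transformed LF model, the noise envelope is an assumed Hann window spanning each pitch period, and the formant frequencies, bandwidths, and all function names are hypothetical values chosen for illustration.

```python
import math
import random


def allpole_coeffs(formants_hz, bandwidths_hz, fs):
    """Build A(z) = 1 + a1*z^-1 + ... from one conjugate pole pair per formant."""
    a = [1.0]
    for f, bw in zip(formants_hz, bandwidths_hz):
        r = math.exp(-math.pi * bw / fs)  # pole radius set by the bandwidth
        section = [1.0, -2.0 * r * math.cos(2.0 * math.pi * f / fs), r * r]
        product = [0.0] * (len(a) + 2)
        for i, ai in enumerate(a):
            for j, sj in enumerate(section):
                product[i + j] += ai * sj
        a = product
    return a


def allpole_filter(a, x):
    """Vocal tract filter 1/A(z): y[n] = x[n] - sum_{k>=1} a[k] * y[n-k]."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


def glottal_derivative_period(n_period, open_quotient):
    """Stand-in pulse (NOT the LF model): derivative of the flow t^2 - t^3."""
    n_open = max(1, int(open_quotient * n_period))
    pulse = [0.0] * n_period  # closed phase stays at zero
    for n in range(n_open):
        t = n / n_open
        pulse[n] = 2.0 * t - 3.0 * t * t
    return pulse


def aspiration_period(n_period, rng, gain):
    """Gaussian noise shaped by a per-period Hann envelope (assumed shape)."""
    return [
        gain * 0.5 * (1.0 - math.cos(2.0 * math.pi * n / n_period)) * rng.gauss(0.0, 1.0)
        for n in range(n_period)
    ]


def synthesize(f0, fs, n_periods, open_quotient, noise_gain, seed=0):
    """Return (excitation, voice) for a steady vowel with illustrative formants."""
    rng = random.Random(seed)
    n_period = int(round(fs / f0))
    excitation = []
    for _ in range(n_periods):
        pulse = glottal_derivative_period(n_period, open_quotient)
        noise = aspiration_period(n_period, rng, noise_gain)
        excitation.extend(p + q for p, q in zip(pulse, noise))
    a = allpole_coeffs([700.0, 1200.0, 2600.0], [80.0, 90.0, 120.0], fs)
    return excitation, allpole_filter(a, excitation)
```

In this sketch, raising `noise_gain` increases the noise component of the excitation relative to the pulse, which is the direction the abstract associates with breathier phonation. How the thesis maps textures to parameters is not stated in this summary, so the mapping here is only an assumption.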

The major contribution of this thesis is the development of an analysis procedure that estimates the parameters of the proposed synthesis model so that it mimics the desired voices. First, a source-filter de-convolution algorithm based on convex optimization is proposed to estimate the vocal tract filter from sound recordings. Second, the inverse-filtered glottal excitation is decomposed into a smoothed derivative glottal wave and a noise residual component via wavelet packet analysis. Proper parameterizations of the glottal excitation can then be found. By analyzing baritone recordings, a parametric model is constructed for controlling vocal textures in synthesized singing.
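The inverse-filtering and decomposition steps can be sketched as follows, under strong simplifying assumptions. The all-pole coefficients are taken as already known rather than estimated by convex optimization, and a plain moving average is swapped in for the wavelet packet analysis used in the thesis. The function names and the smoothing width are hypothetical.

```python
def allpole_filter(a, x):
    """Vocal tract filter 1/A(z): y[n] = x[n] - sum_{k>=1} a[k] * y[n-k]."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


def inverse_filter(a, s):
    """Apply the FIR filter A(z) to the signal: e[n] = sum_k a[k] * s[n-k]."""
    return [
        sum(a[k] * s[n - k] for k in range(len(a)) if n - k >= 0)
        for n in range(len(s))
    ]


def moving_average(x, width):
    """Centered moving average; a crude stand-in for wavelet-based smoothing."""
    half = width // 2
    out = []
    for n in range(len(x)):
        lo, hi = max(0, n - half), min(len(x), n + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def decompose(excitation, width=9):
    """Split the excitation into a smooth part and a noise-like residual."""
    smooth = moving_average(excitation, width)
    residual = [e - s for e, s in zip(excitation, smooth)]
    return smooth, residual
```

If a signal is produced by `allpole_filter(a, e)`, then `inverse_filter(a, ...)` with the same coefficients recovers `e` up to rounding error. With real recordings the filter is only an estimate, so the recovered excitation is only as good as that estimate.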

Sound examples:

  Synthetic sound examples for variation of vocal textures: pressed, normal, breathy

  Source-filter de-convolution results: normal phonation (original / re-synthesized), pressed phonation (original / re-synthesized)

Synthetic sounds were generated by exciting the estimated vocal tract filter with the estimated KLGLOTT88 derivative glottal wave obtained during the source-filter de-convolution step of the analysis procedure.

  Analysis/resynthesis results: normal phonation (original / re-synthesized), pressed phonation (original / re-synthesized)

Synthetic sounds were generated by exciting the estimated vocal tract filter with the fitted LF derivative glottal wave.

  A breathy vowel: derivative glottal wave + noise source, output breathy vowel (output sound when the noise source is not present)

These sound examples illustrate the importance of the noise source for breathy voice. The derivative glottal wave source is secondary to the noise source: without the noise source, the output vowel sounds nasal rather than breathy.



  Defense slides (PowerPoint file)


Last modified: 6/20/02 11:17AM Pst 2002