Toward a high-quality singing synthesizer


Text-to-speech (TTS) synthesis and score-to-singing synthesis share much of their theoretical foundation. Like TTS, score-to-singing synthesis can be decomposed into two subsystems. The first is score analysis, which converts the score into an abstract representation (e.g. a MIDI-like representation) that includes the phonemes and the prosody contour. The second is sound rendering, which converts this abstract representation into acoustic output. Knowledge of letter-to-phoneme conversion from TTS systems can be applied directly to the score analysis subsystem in singing synthesis. Because the prosody is constrained by the input score, the prosody contour is easier to specify in singing synthesis than in TTS.
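To make the score analysis step concrete, here is a minimal Python sketch of turning a score into a MIDI-like abstract representation: a list of timed phoneme events plus a fundamental-frequency contour. It is only an illustration under stated assumptions, not the analysis used in this work. The `Note` class, the `TOY_LEXICON` table (a stand-in for a real letter-to-phoneme module), the 10 ms frame size, and the piecewise-constant contour are all hypothetical; only the MIDI-to-Hz formula (A4 = note 69 = 440 Hz, equal temperament) is standard.

```python
from dataclasses import dataclass


@dataclass
class Note:
    midi_pitch: int    # MIDI note number, 69 = A4
    duration_s: float  # note length in seconds
    syllable: str      # lyric syllable sung on this note


# Hypothetical toy lexicon standing in for a real letter-to-phoneme module.
TOY_LEXICON = {"la": ["l", "a"], "do": ["d", "o"]}


def midi_to_hz(midi_pitch):
    """Equal-tempered pitch in Hz for a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12.0)


def analyze_score(notes, frame_s=0.01):
    """Return (phoneme events, F0 contour in Hz sampled every frame_s seconds).

    Each event is (onset time in s, duration in s, list of phonemes).
    The contour is piecewise constant: one pitch value per note, repeated
    for every frame the note lasts (an assumption made for brevity).
    """
    events = []
    f0 = []
    onset = 0.0
    for note in notes:
        events.append((onset, note.duration_s, TOY_LEXICON[note.syllable]))
        n_frames = round(note.duration_s / frame_s)
        f0.extend([midi_to_hz(note.midi_pitch)] * n_frames)
        onset += note.duration_s
    return events, f0


score = [Note(69, 0.5, "la"), Note(72, 0.25, "do")]
events, f0 = analyze_score(score)
```

A real system would replace the flat per-note pitch with a smoother contour, which is what the prosody generation part of this page is concerned with.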

    On the other hand, singing synthesis and TTS emphasize different objectives. Naturalness of the sound quality matters more for singing synthesis, whereas intelligibility matters more for speech synthesis. Moreover, singers are trained to lengthen vowels to make the voice sound better: about 95% of singing is voiced sound, whereas the time ratio of voiced/unvoiced/silent phonation in speech is roughly 60%/25%/15%. Hence, improving the naturalness of the vowel tone quality is essential for singing synthesis. This is also my research focus.

    On this web page, the first part presents the proposed model for voiced sound and its associated analysis method for copy synthesis. Other related analysis methods are included as well. A source-filter model is chosen as the synthesis model; its advantage over articulatory or formant synthesis models is that the analysis procedure for copy synthesis is simple. Coarticulation has not yet been investigated for this model. However, since the model is a reasonable approximation of the voice production system, simple rules (or even linear interpolation between phonemes) are expected to be sufficient to describe coarticulation. The second part describes the prosody contour generation procedure.
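The source-filter idea, and the suggestion that linear interpolation between phonemes might be enough for coarticulation, can be sketched in a few lines of Python. This is a rough illustration under loud assumptions, not the model proposed here. The source is a plain impulse train rather than a glottal excitation model, the vocal-tract filter is reduced to a single two-pole resonator, and the (center frequency, bandwidth) targets for the two phonemes are made-up illustrative values, not measurements. All function names are hypothetical.

```python
import math


def impulse_train(f0_hz, n_samples, fs):
    """Crude stand-in for a glottal source: one unit impulse per pitch period."""
    period = round(fs / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def lerp(a, b, w):
    """Linear interpolation between a and b for a weight w in [0, 1]."""
    return (1.0 - w) * a + w * b


def synthesize(source, start, end, fs):
    """Filter `source` through a time-varying two-pole resonator.

    `start` and `end` are (center frequency in Hz, bandwidth in Hz) targets
    for two neighboring phonemes; the parameters glide linearly from one to
    the other over the length of the signal.
    """
    out = []
    y1 = y2 = 0.0
    last = max(len(source) - 1, 1)
    for n, x in enumerate(source):
        w = n / last
        freq = lerp(start[0], end[0], w)
        bw = lerp(start[1], end[1], w)
        r = math.exp(-math.pi * bw / fs)    # pole radius from bandwidth
        theta = 2.0 * math.pi * freq / fs   # pole angle from center frequency
        y = x + 2.0 * r * math.cos(theta) * y1 - r * r * y2
        out.append(y)
        y2, y1 = y1, y
    return out


FS = 16000  # sampling rate in Hz (assumed)
source = impulse_train(220.0, FS // 10, FS)  # 100 ms at about 220 Hz
samples = synthesize(source, (700.0, 80.0), (300.0, 60.0), FS)
```

Interpolating resonance frequency and bandwidth, rather than raw filter coefficients, keeps the pole radius below one at every sample, so each intermediate filter remains stable.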


  • Comparison of different glottal excitation models under source-filter voice production modeling
  • Glottal source modeling for singing voice synthesis
  • Model parameter estimation for sustained voiced sound: source-filter deconvolution
  • Fundamental frequency estimation
  • Synthesis

  • Prosody generation for score-to-singing synthesis system
  • References
      • General voice synthesis and coding
      • Inverse filtering and other analysis methods
      • Prosody generation
      • Others
  • Links to other singing synthesizers or related research
      • Perry Cook's SPASM
          • Source-filter synthesis model (using LPC to extract the VT filter)
      • LYRICOS: Synthesis of Singing Voice
          • Sinusoidal synthesis model with concatenative scheme

    Last modified: 8/28/99 11:17AM Pst 1999