Text-to-speech (TTS) synthesis and score-to-singing synthesis share much theoretical common ground. Like TTS, score-to-singing synthesis can be decomposed into two subsystems. The first is score analysis, which converts the score into an abstract representation (e.g. a MIDI-like representation) that includes the phonemes and the prosody contour. The second is sound rendering, which converts this abstract representation into acoustic output. Knowledge of letter-to-phoneme conversion from TTS systems can be applied directly to the score analysis subsystem in singing synthesis. Because the prosody is constrained by the input score, the prosody contour is easier to specify in singing synthesis than in speech synthesis.
On the other hand, singing synthesis and TTS have different priorities. Naturalness of the sound quality matters more for singing synthesis, whereas intelligibility matters more for speech synthesis. Moreover, singers are trained to lengthen the vowels to make the singing sound better, so voiced sound accounts for 95% of singing. (In speech, the time ratio of voiced/unvoiced/silent phonation is roughly 65%/25%/15%.) Hence, improving the naturalness of the vowel tone quality is essential for singing synthesis, and it is the focus of my research.
In this web page, the first part presents the
proposed model for voiced sound and its associated copy-synthesis
analysis method. Other related analysis methods are included as well. A
source-filter model is chosen as the synthesis model. Its advantage
over articulatory or formant synthesis models is that the analysis
procedure for copy synthesis is simple.
Coarticulation has not yet been investigated for this model. However, since the
model is a reasonable approximation of the voice production system,
simple rules (or even linear interpolation between phonemes)
are expected to be sufficient to describe coarticulation.
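
As a rough illustration of the source-filter idea and of linear interpolation between phonemes, the sketch below passes a pulse-train source through a cascade of two-pole resonators whose centre frequencies glide linearly from one vowel to the next. It is not the model proposed on this page: the resonator filter, the function names, and the vowel frequencies are all assumptions chosen only to keep the example short.

```python
import numpy as np

def pulse_train(f0_hz, duration_s, sample_rate):
    """Source: one unit impulse per pitch period (a very crude excitation)."""
    n = int(round(duration_s * sample_rate))
    source = np.zeros(n)
    period = int(round(sample_rate / f0_hz))
    source[::period] = 1.0
    return source

def render(source, start_freqs, end_freqs, bandwidth_hz, sample_rate):
    """Filter: cascade of resonators y[n] = x[n] + a1[n]*y[n-1] + a2*y[n-2].

    Each resonator's centre frequency is linearly interpolated from its
    start value to its end value over the length of the signal.
    """
    n = len(source)
    weights = np.linspace(0.0, 1.0, n)
    r = np.exp(-np.pi * bandwidth_hz / sample_rate)  # pole radius, < 1
    a2 = -r * r
    signal = source.copy()
    for f_start, f_end in zip(start_freqs, end_freqs):
        freqs = (1.0 - weights) * f_start + weights * f_end
        a1 = 2.0 * r * np.cos(2.0 * np.pi * freqs / sample_rate)
        out = np.zeros(n)
        y1 = y2 = 0.0
        for i in range(n):
            y = signal[i] + a1[i] * y1 + a2 * y2
            out[i] = y
            y2, y1 = y1, y
        signal = out
    return signal / np.max(np.abs(signal))  # normalise to peak amplitude 1

# Illustrative (assumed) resonance frequencies in Hz for two vowels.
VOWEL_A = (700.0, 1200.0)
VOWEL_I = (300.0, 2300.0)

sample_rate = 16000
source = pulse_train(f0_hz=220.0, duration_s=0.25, sample_rate=sample_rate)
audio = render(source, VOWEL_A, VOWEL_I, bandwidth_hz=100.0, sample_rate=sample_rate)
```

Here `audio` holds a quarter second of a vowel-like tone at 220 Hz that moves from the first vowel setting to the second. A real system would replace both the excitation and the filter with the parameters obtained from the copy-synthesis analysis.
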
The second part illustrates the prosody contour generation procedure.
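
Because the score already fixes the pitch and duration of every note, the simplest possible prosody contour can be read directly from it. The sketch below is only such a baseline, not the generation procedure described on this page: it turns a list of notes into a stepwise F0 contour. The `Note` class, the phoneme labels, and the 100 frames-per-second rate are assumptions made for illustration.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Note:
    midi_pitch: int     # MIDI note number (69 = A4 = 440 Hz)
    duration_s: float   # note length in seconds
    phonemes: tuple     # hypothetical phoneme labels sung on this note

def midi_to_hz(midi_pitch):
    """Equal-tempered conversion from MIDI note number to frequency."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12.0)

def f0_contour(notes, frame_rate=100):
    """Return one F0 value per frame, held constant within each note."""
    frames = []
    for note in notes:
        n_frames = int(round(note.duration_s * frame_rate))
        frames.append(np.full(n_frames, midi_to_hz(note.midi_pitch)))
    return np.concatenate(frames)

# A three-note fragment: A4, B4, C5.
score = [
    Note(69, 0.5, ("l", "a")),
    Note(71, 0.25, ("l", "a")),
    Note(72, 0.75, ("l", "a")),
]
contour = f0_contour(score)
```

A stepwise contour like this sounds mechanical, which is why a dedicated generation procedure is needed to shape the transitions between notes and the movement within them.
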