Natural sound quality is essential for singing synthesis. Since 95% of singing is voiced sound, this study focuses on improving the naturalness of vowel tone quality through glottal excitation modeling.
Beyond flexible pitch and volume control, the desired excitation model should be able to change the voice quality, moving it from laryngealized (pressed) through normal to breathy phonation.
To balance the complexity of the model against that of the analysis procedure needed to acquire its parameters, we propose a source-filter synthesis model based on a simplified human voice production system. The source-filter model decomposes voice production into three linear systems: glottal source, vocal tract, and radiation. The radiation is simplified as a differencing filter. The vocal tract filter is assumed to be all-pole for non-nasal sounds. The glottal source and the radiation are then combined into the derivative glottal wave, which we call the glottal excitation.
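The source-filter chain described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the model used in this work: the raised-cosine flow pulse, the single 700 Hz formant, the 110 Hz pitch and the 16 kHz sampling rate are all hypothetical choices. Because every stage is linear and time-invariant, applying the differencing (radiation) filter to the source before the vocal tract filter gives the same output as applying it afterwards, which is what allows source and radiation to be merged into one glottal excitation.

```python
import math

FS = 16000   # sampling rate in Hz (assumed for this sketch)
F0 = 110.0   # pitch in Hz (assumed, roughly baritone range)

def all_pole_filter(excitation, a):
    """Filter with 1 / (1 + a[0] z^-1 + a[1] z^-2 + ...), zero initial state."""
    out = []
    for n, x in enumerate(excitation):
        y = x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                y -= ak * out[n - k]
        out.append(y)
    return out

def difference(signal):
    """Radiation approximated as a first-order differencing filter."""
    return [signal[0]] + [signal[n] - signal[n - 1] for n in range(1, len(signal))]

def resonator(freq_hz, bandwidth_hz, fs=FS):
    """Denominator coefficients of a two-pole resonator (one formant)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    theta = 2.0 * math.pi * freq_hz / fs
    return [-2.0 * r * math.cos(theta), r * r]

def toy_glottal_flow(n_samples, fs=FS, f0=F0, open_quotient=0.6):
    """Hypothetical flow pulse: raised cosine while open, zero while closed."""
    period = fs / f0
    flow = []
    for n in range(n_samples):
        phase = (n % period) / period
        if phase < open_quotient:
            flow.append(0.5 * (1.0 - math.cos(2.0 * math.pi * phase / open_quotient)))
        else:
            flow.append(0.0)
    return flow

flow = toy_glottal_flow(800)
a = resonator(700.0, 100.0)               # single hypothetical formant
excitation = difference(flow)             # source + radiation = glottal excitation
vowel = all_pole_filter(excitation, a)    # excitation through the vocal tract
# Same result with radiation applied last, as in the three-stage description.
vowel_alt = difference(all_pole_filter(flow, a))
```

A realistic vowel would need several formants and a better pulse shape; the point here is only the ordering of the linear stages.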
The task is then to estimate the vocal tract filter parameters and the glottal excitation so as to mimic the desired sung vowels. We developed a de-convolution of the vocal tract filter and the glottal excitation based on convex optimization. This de-convolution yields both the vocal tract filter parameters and the glottal excitation waveform.
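The relation that the de-convolution relies on can be illustrated with a small sketch. It does not reproduce the convex optimization step, whose formulation is not given here. It only shows that, once all-pole coefficients are known, the excitation follows by inverse filtering. The coefficients and the pulse-train excitation below are hypothetical values chosen for the example.

```python
def all_pole_filter(excitation, a):
    """Synthesis: s[n] = e[n] - sum_k a[k-1] * s[n-k], zero initial state."""
    out = []
    for n, e in enumerate(excitation):
        s = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                s -= ak * out[n - k]
        out.append(s)
    return out

def inverse_filter(signal, a):
    """Analysis: e[n] = s[n] + sum_k a[k-1] * s[n-k], the FIR inverse of the above."""
    out = []
    for n, s in enumerate(signal):
        e = s
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                e += ak * signal[n - k]
        out.append(e)
    return out

# Hypothetical stable all-pole filter (pole radius about 0.95).
a = [-1.6, 0.9]

# Hypothetical excitation: one negative pulse every 20 samples.
excitation = [-1.0 if n % 20 == 0 else 0.0 for n in range(100)]

signal = all_pole_filter(excitation, a)
recovered = inverse_filter(signal, a)   # matches excitation up to rounding
```

In practice the filter coefficients are unknown and must be estimated jointly with the excitation, which is the harder problem that the de-convolution addresses.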
Once the vocal tract filter has been found, the next step is to build the glottal excitation synthesis model. Because both the wave shape of the glottal excitation and the aspiration noise strongly affect the breathiness of the sound, the glottal excitation is treated as two parts: a smoothed quasi-periodic derivative glottal wave and the glottal noise (turbulence noise). The two components are separated by wavelet decomposition of the glottal excitation waveform obtained from de-convolution. The coarse shape of the smoothed derivative wave is intended to be modeled with the LF model. The noise part is roughly modeled as pitch-synchronous, amplitude-modulated Gaussian noise with larger power around the glottal closure instants. Because of model mismatch and source-tract interaction, a model for the residual fine structure of the smoothed derivative glottal wave becomes necessary. (This part is still under investigation; input is very welcome.)
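The noise component described above can be sketched as white Gaussian noise multiplied by a pitch-synchronous envelope that peaks at each glottal closure instant. The Gaussian-bump envelope shape, the gain values, the 145-sample period and the closure positions are assumptions made for illustration; the text only specifies that the noise power is larger around the closure instants.

```python
import math
import random

def noise_envelope(n, gci_indices, sigma, base_gain=0.2, burst_gain=1.0):
    """Amplitude envelope: constant floor plus a bump centred on the nearest GCI.

    The Gaussian bump is an assumed shape, not one taken from the source text.
    """
    d = min(abs(n - g) for g in gci_indices)
    return base_gain + burst_gain * math.exp(-0.5 * (d / sigma) ** 2)

def modulated_noise(n_samples, gci_indices, period, width=0.1, seed=0):
    """Pitch-synchronous amplitude-modulated Gaussian noise.

    width is the bump standard deviation as a fraction of the pitch period.
    """
    rng = random.Random(seed)   # seeded so the sketch is repeatable
    sigma = width * period
    return [noise_envelope(n, gci_indices, sigma) * rng.gauss(0.0, 1.0)
            for n in range(n_samples)]

PERIOD = 145                            # samples per pitch period (assumed)
gcis = list(range(100, 5800, PERIOD))   # hypothetical closure instants
noise = modulated_noise(5800, gcis, PERIOD)
```

Lowering `burst_gain` relative to `base_gain` flattens the envelope toward stationary noise, while raising it concentrates the noise energy around the closure instants.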
In this talk, I will show de-convolution and glottal excitation modeling results for both synthetic data and real baritone recordings. I will focus on the synthetic data simulation results and discuss the impact of aspiration noise, glottal closure instant (GCI) detection error and source-filter.