Spectral modeling can be viewed as ``sampling synthesis done right'' [154]. That is, in spectral modeling synthesis, segments of the time-domain signal are replaced by their short-time Fourier transforms, thus providing a sound representation much closer to the perception of sound by the brain [66,109,205]. This yields two immediate benefits: (1) computational cost reductions based on perceptual modeling, and (2) more perceptually fundamental data structures. Cost reductions follow naturally from the observation [168] that roughly 90% of the information contained in a typical sound is not perceived by the brain. For example, the popular MP3 audio compression format [27,28] can achieve an order of magnitude data reduction with little or no loss in perceived sound quality because it is based on the short-time Fourier transform, and because it prioritizes the information retained in each spectral frame based on psychoacoustic principles. To first order, MPEG audio coding eliminates all spectral components that are masked by nearby louder components.
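The first-order idea of discarding masked components can be illustrated on a single spectral frame. The following is only a minimal sketch under stated assumptions: the function name `prune_masked_bins`, the neighborhood width, and the 30 dB drop threshold are hypothetical choices made for illustration, and the rule used here (drop any bin far below the loudest bin nearby) is a crude stand-in for a real psychoacoustic model, not the procedure used by MPEG audio coders.

```python
import numpy as np

def prune_masked_bins(frame, neighborhood=8, drop_db=30.0):
    """Zero the spectral bins of one frame that sit far below a louder nearby bin.

    Illustrative assumption: a bin counts as "masked" when it lies more than
    drop_db decibels below the loudest bin within +/- neighborhood bins.
    Real coders use frequency-dependent psychoacoustic masking curves.
    """
    window = np.hanning(len(frame))
    spectrum = np.fft.rfft(frame * window)            # one short-time spectrum
    mag_db = 20.0 * np.log10(np.abs(spectrum) + 1e-12)

    # Loudest level found within +/- neighborhood bins of each bin.
    padded = np.pad(mag_db, neighborhood, mode="edge")
    local_max = np.array([
        padded[k:k + 2 * neighborhood + 1].max() for k in range(len(mag_db))
    ])

    keep = mag_db >= local_max - drop_db              # True where a bin survives
    return np.where(keep, spectrum, 0.0), keep
```

Applied to a frame containing a strong sinusoid with a much weaker one a few bins away, the weak component is zeroed while an equally weak component far from any loud neighbor is retained. Only the surviving bins would then need to be stored or transmitted.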
The disadvantages of spectral modeling are the same as those of sampling synthesis, except that memory usage can be greatly reduced. Sampling the full playing range of a musical instrument is made more difficult, however, by the need to capture every detail in the form of spectral transformations. Sometimes this is relatively easy, such as when playing harder only affects brightness. In other cases, it can be difficult, such as when nonlinear noise effects begin to play a role.
An excellent recent example of spectral modeling synthesis is the so-called Vocaloid developed by Yamaha in collaboration with others [5]. In this method, the short-time spectrum is modeled as sinusoids plus a residual signal, together with higher level spectral features such as vocal formants. The model enables the creation of ``vocal fonts'' which effectively provide a ``virtual singer'' who can be given any material to sing at any pitch. Excellent results can be achieved with this approach (and some of the demos are very impressive), but it remains a significant amount of work to encode a particular singer into the form of a vocal font. Furthermore, while the sound quality is generally excellent, subtle ``unnaturalness'' cues may creep through from time to time, rendering the system most immediately effective for automatic back-up vocals, or choral synthesis, as opposed to highly exposed foreground lead-singer synthesis.
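The ``sinusoids plus residual'' decomposition mentioned above can be sketched for a single frame as follows. This is a simplified illustration, not the method used in Vocaloid: the function name `sines_plus_residual` and its parameters `num_peaks` and `half_width` are hypothetical, and the sinusoidal part is formed simply by keeping the spectral bins around the strongest magnitude peaks, whereas practical systems estimate and track sinusoidal amplitudes, frequencies, and phases across frames.

```python
import numpy as np

def sines_plus_residual(frame, num_peaks=3, half_width=2):
    """Split one windowed frame into a sinusoidal part and a residual.

    Illustrative assumption: the bins within half_width of the num_peaks
    strongest spectral peaks form the sinusoidal part; all remaining bins
    form the residual.
    """
    n = len(frame)
    spectrum = np.fft.rfft(frame * np.hanning(n))
    mag = np.abs(spectrum)

    # Local maxima of the magnitude spectrum (interior bins only).
    is_peak = (mag[1:-1] > mag[:-2]) & (mag[1:-1] >= mag[2:])
    peak_bins = np.flatnonzero(is_peak) + 1

    # Keep the num_peaks strongest peaks.
    strongest = peak_bins[np.argsort(mag[peak_bins])[::-1][:num_peaks]]

    sine_mask = np.zeros(len(spectrum), dtype=bool)
    for k in strongest:
        sine_mask[max(k - half_width, 0):k + half_width + 1] = True

    # Complementary masks, so the two parts sum to the windowed frame.
    sines = np.fft.irfft(np.where(sine_mask, spectrum, 0.0), n)
    residual = np.fft.irfft(np.where(sine_mask, 0.0, spectrum), n)
    return sines, residual, np.sort(strongest)
```

Because the two masks are complementary, the sinusoidal part and the residual add back up to the windowed frame. For a frame consisting of a few steady partials plus low-level noise, nearly all of the energy lands in the sinusoidal part, and the residual is left to carry the noise-like remainder.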
Zooming out, spectral modeling synthesis can be regarded as modeling sound inside the inner ear, enabling reductions and manipulations in terms of human perception of sound.