With the phase vocoder, the instantaneous amplitude and frequency are normally computed only for each ``channel filter''. A consequence of using a fixed-frequency filter bank is that the frequency of each sinusoid is not normally allowed to vary outside the bandwidth of its channel band-pass filter, unless one is willing to combine channel signals in some fashion, which requires extra work. Ordinarily, the band-pass center frequencies are harmonically spaced, i.e., they are integer multiples of a base frequency. So, for example, when analyzing a piano tone, the intrinsic progressive sharpening of its partial overtones leads to some sinusoids falling ``in the cracks'' between adjacent filter channels. This is not an insurmountable problem, since the adjacent channels can be combined in a straightforward manner to provide accurate amplitude and frequency envelopes, but it is inconvenient and outside the original scope of the phase vocoder. (Recall that the phase vocoder was originally developed for speech, which, ignoring ``jitter'', is fundamentally periodic when voiced at a constant pitch.) Moreover, it is relatively unwieldy to work with the instantaneous amplitude and frequency signals from all of the filter-bank channels. For these reasons, the phase vocoder has largely been replaced by sinusoidal modeling in the context of analysis for additive synthesis of inharmonic sounds, except in constrained computational environments (such as real-time systems).
In sinusoidal modeling, the fixed, uniform filter bank of the vocoder is replaced by a sparse, peak-adaptive filter bank, implemented by following magnitude peaks in a sequence of FFTs. The efficiency of the FFT makes it computationally feasible to implement an enormous number of band-pass filters in a fine-grained analysis filter bank, from which the sparse, adaptive analysis filter bank is derived.
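To make the piano example concrete, the following minimal sketch computes where inharmonic partials land relative to harmonically spaced channels. It assumes the common stiff-string approximation f_k = k f0 sqrt(1 + B k^2), which is not stated in the text above. The fundamental f0, the inharmonicity coefficient B, and the function names are hypothetical values chosen only for illustration.

```python
import math

def stiff_string_partials(f0, B, num_partials):
    """Partial frequencies k*f0*sqrt(1 + B*k^2) (assumed stiff-string model)."""
    return [k * f0 * math.sqrt(1.0 + B * k * k)
            for k in range(1, num_partials + 1)]

def channel_offsets(partials, f0):
    """For each partial, return (nearest channel index, offset in channel spacings).

    Channels are assumed to be centered at integer multiples of f0.
    """
    result = []
    for f in partials:
        position = f / f0          # partial position in units of channel spacing
        nearest = round(position)  # index of the closest harmonic channel
        result.append((nearest, position - nearest))
    return result

f0 = 110.0   # hypothetical fundamental frequency in Hz
B = 4e-4     # hypothetical inharmonicity coefficient
partials = stiff_string_partials(f0, B, 40)
offsets = channel_offsets(partials, f0)

for k in (10, 20, 30, 40):
    chan, off = offsets[k - 1]
    print(f"partial {k:2d}: {partials[k - 1]:8.1f} Hz, "
          f"nearest channel {chan:2d}, offset {off:+.2f} channels")
```

An offset near plus or minus 0.5 means the partial sits roughly midway between two channels, and a nearest-channel index larger than the partial number shows how far the sharpening has accumulated.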
Thus, many modern sinusoidal models can be thought of as ``pruned phase vocoders'' in that they follow only the peaks of the short-time spectrum rather than the instantaneous amplitude and frequency from every channel of a uniform filter bank. Peak-tracking in a sliding short-time Fourier transform has a long history going back approximately half a century [193,260]. Sinusoidal modeling of the STFT of speech was introduced by Quatieri and McAulay [204,153,205,158,171,206], and application to musical sounds was initiated by Smith and Serra [248,224].
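The idea of following only the peaks of the short-time spectrum can be sketched for a single frame as follows. This is an illustrative sketch, not the method of any cited system: it assumes a Hann window, a zero-padded naive DFT (an FFT would be used in practice), a simple threshold relative to the largest bin, and parabolic interpolation of the dB magnitude around each local maximum. The test signal, threshold, and all function names are hypothetical.

```python
import cmath
import math

def hann(n_len):
    """Periodic Hann window of length n_len."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_len) for n in range(n_len)]

def dft_magnitude_db(frame, n_fft):
    """Magnitude in dB for bins 0..n_fft/2 of a naive zero-padded DFT."""
    mags = []
    for k in range(n_fft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n, x in enumerate(frame))
        mags.append(20 * math.log10(abs(acc) + 1e-12))
    return mags

def pick_peaks(mag_db, threshold_db):
    """Return (fractional bin, interpolated dB level) for each local maximum."""
    peaks = []
    for k in range(1, len(mag_db) - 1):
        a, b, c = mag_db[k - 1], mag_db[k], mag_db[k + 1]
        if b > threshold_db and b > a and b >= c:
            # Vertex of the parabola through the three samples, in bins
            p = 0.5 * (a - c) / (a - 2 * b + c)
            peaks.append((k + p, b - 0.25 * (a - c) * p))
    return peaks

fs = 8000.0                    # hypothetical sampling rate in Hz
frame_len, n_fft = 128, 512    # zero-padding factor of 4
window = hann(frame_len)
frame = [w * (math.cos(2 * math.pi * 440.0 * n / fs)
              + 0.5 * math.cos(2 * math.pi * 1234.5 * n / fs))
         for n, w in enumerate(window)]

mag_db = dft_magnitude_db(frame, n_fft)
peaks = pick_peaks(mag_db, max(mag_db) - 20.0)
peaks_hz = [(k * fs / n_fft, level) for k, level in peaks]
for freq, level in peaks_hz:
    print(f"peak at {freq:7.1f} Hz, {level:6.1f} dB")
```

A full sinusoidal model would repeat this for every frame and link peaks across frames into tracks; that matching step is omitted here.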
For carrying out additive synthesis, early systems used an explicit sum of sinusoidal oscillators [151,168,215,248]. For large numbers of sinusoidal components, it is more efficient to use the inverse FFT. Inverse-FFT synthesis was introduced in computer music by Chamberlin [30] and Rodet and Depalle [221]. The technique has been extended more recently by Laroche and Dolson [126,125,123]. See §7.6.2 below for further discussion.
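As a rough illustration of the explicit oscillator-sum approach, the sketch below adds up sinusoidal oscillators whose amplitude and frequency are linearly interpolated over a block of samples. The partial format (start and end amplitude, start and end frequency) and the function name are assumptions made for this example and are not taken from the cited systems.

```python
import math

def oscillator_bank(partials, num_samples, fs):
    """Sum of sinusoidal oscillators with linearly ramped amplitude and frequency.

    partials: list of (amp_start, amp_end, freq_start_hz, freq_end_hz) tuples.
    """
    out = [0.0] * num_samples
    for a0, a1, f0, f1 in partials:
        phase = 0.0
        for n in range(num_samples):
            t = n / max(num_samples - 1, 1)   # 0..1 across the block
            amp = a0 + (a1 - a0) * t
            freq = f0 + (f1 - f0) * t
            out[n] += amp * math.sin(phase)
            # Accumulate phase so that frequency changes stay continuous
            phase += 2 * math.pi * freq / fs
    return out

fs = 8000.0
# Two hypothetical partials: one decaying, one gliding slightly upward
partials = [(1.0, 0.2, 220.0, 220.0),
            (0.5, 0.1, 441.0, 445.0)]
y = oscillator_bank(partials, 4000, fs)
```

The cost of this loop grows with the number of partials times the number of samples, which is the motivation for inverse-FFT synthesis when many components are present.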
In the late 1980s, Serra and Smith combined sinusoidal modeling with noise modeling to enable more efficient synthesis of the noise-like components of sounds [224,227,228]. In this extension, the output of the sinusoidal model is subtracted from the original signal, leaving a residual signal. Assuming that the residual is a random signal, it is modeled as filtered white noise, where the magnitude envelope of its short-time spectrum becomes the filter characteristic through which white noise is passed during resynthesis.
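The residual-modeling step can be sketched for a single frame as follows. This is a simplified illustration under several assumptions that are not from the text: a naive DFT stands in for the FFT, the magnitude envelope is obtained with a circular moving average, the filtering is done by multiplying spectra (which amounts to circular convolution within the frame), and windowing and overlap-add across frames are omitted. All names and parameter values are hypothetical.

```python
import cmath
import math
import random

def dft(x):
    """Naive DFT of a real or complex sequence."""
    n_len = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_len)
                for n in range(n_len)) for k in range(n_len)]

def idft_real(X):
    """Real part of the naive inverse DFT."""
    n_len = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / n_len)
                for k in range(n_len)).real / n_len for n in range(n_len)]

def spectral_envelope(mag, half_width):
    """Circular moving average of a magnitude spectrum (assumed smoother)."""
    n_len = len(mag)
    width = 2 * half_width + 1
    return [sum(mag[(k + j) % n_len]
                for j in range(-half_width, half_width + 1)) / width
            for k in range(n_len)]

def resynthesize_residual(original, sinusoidal, half_width=2, rng=None):
    """Return (residual, noise frame shaped by the residual's envelope)."""
    rng = rng or random.Random(0)
    # Subtract the sinusoidal model output from the original signal
    residual = [o - s for o, s in zip(original, sinusoidal)]
    envelope = spectral_envelope([abs(c) for c in dft(residual)], half_width)
    # White noise whose DFT bins have unit expected power after scaling
    noise = [rng.gauss(0.0, 1.0) for _ in residual]
    scale = 1.0 / math.sqrt(len(noise))
    shaped = [w * e * scale for w, e in zip(dft(noise), envelope)]
    return residual, idft_real(shaped)

N = 64
rng = random.Random(1)
sinusoidal = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]
original = [s + 0.1 * rng.gauss(0.0, 1.0) for s in sinusoidal]
residual, noise_model = resynthesize_residual(original, sinusoidal)
```

The resynthesized frame is a new noise realization with roughly the same spectral shape as the residual; it is not a sample-by-sample copy of it.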
Historically, both vocoders and sinusoidal models have focused on modeling single-pitched sound sources such as a single saxophone note. By moving to multiresolution sinusoidal modeling, it becomes possible to encode general polyphonic sound sources with a single unified system [132,130,131]. Multiresolution refers to the use of a non-uniform filter bank, such as a wavelet or ``constant Q'' filter bank, in the underlying spectrum analysis. See Fig. 10.17 for an example time-frequency resolution grid.
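As a small illustration of what a ``constant Q'' layout implies, the sketch below lists geometrically spaced center frequencies whose bandwidths are proportional to frequency. It assumes the usual definition Q = (center frequency)/(bandwidth) with the bandwidth set equal to the spacing to the next band; the lowest frequency, the number of bands per octave, and the function name are hypothetical choices, and no particular multiresolution system from the cited work is implied.

```python
def constant_q_bands(f_min, bands_per_octave, num_bands):
    """Return (Q, [(center_hz, bandwidth_hz), ...]) for a constant-Q layout."""
    ratio = 2.0 ** (1.0 / bands_per_octave)   # frequency ratio between bands
    q = 1.0 / (ratio - 1.0)                   # same Q for every band
    bands = []
    for k in range(num_bands):
        center = f_min * 2.0 ** (k / bands_per_octave)
        bands.append((center, center / q))    # bandwidth grows with frequency
    return q, bands

q, bands = constant_q_bands(55.0, 12, 49)     # four octaves, 12 bands per octave
for center, bandwidth in bands[::12]:
    print(f"center {center:7.1f} Hz, bandwidth {bandwidth:6.2f} Hz")
```

Because time resolution is roughly inversely proportional to bandwidth, the doubling of bandwidth per octave corresponds to analysis windows that become shorter at higher frequencies.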