The ability of STFT techniques to modify the analysis results before resynthesis has an enormous number of musical applications. Quatieri and McAulay [222] give a good discussion of some useful modifications for speech applications. By scaling and/or resampling the amplitude and frequency trajectories, a host of sound transformations can be accomplished.
Time-scale modifications can be accomplished by resampling the amplitude, frequency, and phase trajectories. This can be done simply by changing the hop size in the resynthesis (although for best results the hop size should change adaptively, avoiding time-scale modifications during voice consonants or attacks, for example). This has the effect of slowing down or speeding up the sound while maintaining pitch and formant structure. A time-varying modification is obtained in the same way by using a time-varying hop size. However, due to the sinusoidal representation, when a considerable time stretch is applied to a ``noisy'' part of a sound, the individual sinewaves start to be heard and the noise-like quality is lost.
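As a rough illustration of the trajectory resampling described above, the following Python sketch stretches a set of sinusoidal tracks by interpolating their amplitude and frequency trajectories onto a denser (or sparser) time grid. The function name `time_scale_partials`, its arguments, and the array layout are assumptions made for this example. It also discards the measured phases and rebuilds phase by integrating frequency, which is a simplification and not the full method.

```python
import numpy as np

def time_scale_partials(amps, freqs, hop, fs, stretch):
    """Resynthesize sinusoidal tracks with a modified time scale.

    amps, freqs : arrays of shape (n_frames, n_partials), freqs in Hz
                  (hypothetical layout, one row per analysis frame)
    hop         : analysis hop size in samples
    fs          : sampling rate in Hz
    stretch     : time-scale factor (2.0 doubles the duration)
    """
    n_frames, n_partials = amps.shape
    # An effective synthesis hop of hop * stretch sets the output length.
    n_out = int(round((n_frames - 1) * hop * stretch)) + 1
    # Position of every output sample on the original frame axis.
    frame_pos = np.arange(n_out) / (hop * stretch)
    frame_idx = np.arange(n_frames)
    out = np.zeros(n_out)
    for k in range(n_partials):
        a = np.interp(frame_pos, frame_idx, amps[:, k])
        f = np.interp(frame_pos, frame_idx, freqs[:, k])
        # Frequencies are left unscaled, so pitch and formants are kept.
        # Assumption: measured phases are ignored; phase is the running
        # integral of the interpolated frequency.
        phase = 2.0 * np.pi * np.cumsum(f) / fs
        out += a * np.cos(phase)
    return out
```

A time-varying stretch could be sketched the same way by replacing the uniform `frame_pos` grid with a nonuniform one.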
Frequency transformations, with or without time scaling, are also possible. A simple one is to scale the frequencies to alter pitch and formant structure together. A more powerful class of spectral modifications comes about by decoupling the sinusoidal frequencies (which convey pitch and inharmonicity information) from the spectral envelope (which conveys formant structure so important to speech perception and timbre). By measuring the formant envelope of a harmonic spectrum (e.g., by drawing straight lines or splines across the tops of the sinusoidal peaks in the spectrum and then smoothing), modifications can be introduced which alter only the pitch or only the formants. Other ways to measure formant envelopes include cepstral windowing [198] and the fitting of low-order LPC models to the inverse FFT of the squared magnitude of the spectrum [157]. By multiplying the flattened spectrum of one sound (obtained by dividing out its formant envelope) by the formant envelope of a second sound, ``cross-synthesis'' is obtained. Much more complex modifications are possible.
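One way to picture the envelope measurement and cross-synthesis steps is the following Python sketch, which estimates a formant envelope by cepstral windowing and then imposes the envelope of one frame on the flattened spectrum of another. The function names, the `n_keep` lifter length, and the use of a rectangular cepstral window are illustrative assumptions, not details taken from the cited references.

```python
import numpy as np

def cepstral_envelope(mag, n_keep):
    """Smooth envelope of a full-length (two-sided) FFT magnitude.

    Keeps only the n_keep lowest-quefrency cepstral coefficients
    (rectangular lifter, an assumption made for simplicity).
    """
    log_mag = np.log(np.maximum(mag, 1e-12))   # guard against log(0)
    cep = np.fft.ifft(log_mag).real            # real cepstrum
    lifter = np.zeros(len(cep))
    lifter[:n_keep] = 1.0
    if n_keep > 1:
        lifter[-(n_keep - 1):] = 1.0           # matching negative quefrencies
    return np.exp(np.fft.fft(cep * lifter).real)

def cross_synthesize_frame(X_source, X_formant, n_keep=30):
    """Flatten X_source by its own envelope, then apply X_formant's envelope."""
    env_src = cepstral_envelope(np.abs(X_source), n_keep)
    env_fmt = cepstral_envelope(np.abs(X_formant), n_keep)
    return X_source / env_src * env_fmt
```

Applied frame by frame to the STFTs of two sounds, this keeps the sinusoidal fine structure of the first while borrowing the formants of the second. A smaller `n_keep` gives a smoother envelope.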
Not all spectral modifications are ``legal,'' however. As mentioned earlier, multiplicative modifications (simple filtering, equalization, etc.) are straightforward; we simply zero-pad sufficiently to accommodate spreading in time due to convolution. It is also possible to approximate nonlinear functions of the spectrum in terms of polynomial expansions (which are purely multiplicative). When using data-derived filters, such as measured formant envelopes, it is a good idea to smooth the spectral envelopes sufficiently that their inverse FFT is shorter in duration than the amount of zero-padding provided. One way to monitor time-aliasing distortion is to measure the signal energy at the midpoint of the inverse-FFT output buffer, relative to the total energy in the buffer, just before adding it to the final outgoing overlap-add reconstruction; little relative energy in the ``maximum-positive'' and ``minimum-negative'' time regions indicates little time aliasing. The general problem to avoid here is drastic spectral modifications which correspond to long filters in the time domain for which insufficient zero-padding has been provided. An inverse FFT of the spectral modification function will show its time duration and indicate zero-padding requirements. The general rule (worth remembering in any audio filtering context) is ``be gentle in the frequency domain.''
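The midpoint-energy check can be sketched in Python as follows. This is a minimal example under stated assumptions: the buffer uses a zero-phase layout (time zero at index 0, negative times wrapped to the end), so the buffer midpoint corresponds to the maximum-positive and minimum-negative times. The function names and the `width` parameter are hypothetical.

```python
import numpy as np

def midpoint_energy_ratio(buf, width):
    """Fraction of the buffer energy lying in `width` samples centered
    on the buffer midpoint (assumes a zero-phase buffer layout)."""
    n = len(buf)
    lo = n // 2 - width // 2
    total = np.sum(np.abs(buf) ** 2)
    if total == 0:
        return 0.0
    return float(np.sum(np.abs(buf[lo:lo + width]) ** 2) / total)

def modification_impulse_response(H):
    """Inverse FFT of a real, symmetric spectral modification function,
    showing how far the modification spreads in time."""
    return np.fft.ifft(H).real

# Compare a gentle (smooth) modification with a drastic (brick-wall) one.
N = 1024
f = np.fft.fftfreq(N)                        # cycles per sample
gentle = np.exp(-(f / 0.1) ** 2)             # smooth lowpass shape
brick = (np.abs(f) < 0.1).astype(float)      # abrupt lowpass shape
for name, H in (("gentle", gentle), ("brick-wall", brick)):
    h = modification_impulse_response(H)
    print(name, midpoint_energy_ratio(h, width=64))
```

The brick-wall response decays slowly and leaves noticeably more energy near the midpoint than the smooth one, which is the symptom to watch for. The same `midpoint_energy_ratio` function can be applied to each inverse-FFT output frame before overlap-add.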