This paper has presented an analysis/synthesis technique based on a sinusoidal representation. The technique has proven well suited to signals that are well characterized as a sum of inharmonic sinusoids with slowly varying amplitudes and frequencies. The previously used harmonic vocoder techniques have been relatively unwieldy in the inharmonic case, and less robust even in the harmonic case. PARSHL obtains the sinusoidal representation of the input sound by tracking the amplitude, frequency, and phase of the most prominent peaks in a series of spectra. These spectra are computed using the Fast Fourier Transform of successive, overlapping, windowed data frames taken over the duration of a sound. We have mentioned some of the musical applications of this sinusoidal representation.
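The frame-by-frame peak extraction summarized above can be illustrated with a minimal sketch. It is not PARSHL's code: the function names (`hann`, `dft_magnitudes`, `pick_peaks`, `analyze`), the Hann window, and the default frame size and hop are all assumptions made for illustration. To stay dependency-free it uses a naive DFT in place of an FFT, reports frequency only at bin resolution, and neither estimates phase nor links peaks from one frame to the next.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n (an assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def dft_magnitudes(frame):
    """Magnitudes of the non-negative-frequency bins (naive O(N^2) DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def pick_peaks(mags, max_peaks):
    """Bin indices of the max_peaks largest local maxima, in bin order."""
    candidates = [k for k in range(1, len(mags) - 1)
                  if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]]
    strongest = sorted(candidates, key=lambda k: mags[k], reverse=True)
    return sorted(strongest[:max_peaks])


def analyze(signal, sample_rate, frame_size=256, hop=64, max_peaks=3):
    """For each overlapping windowed frame, list (frequency_hz, amplitude)."""
    window = hann(frame_size)
    # A bin-centered sinusoid of amplitude A peaks at A * sum(window) / 2.
    scale = 2.0 / sum(window)
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_size)]
        mags = dft_magnitudes(frame)
        peaks = pick_peaks(mags, max_peaks)
        frames.append([(k * sample_rate / frame_size, mags[k] * scale)
                       for k in peaks])
    return frames
```

Calling `analyze(signal, 8000, max_peaks=2)` on a list of samples returns one list of `(frequency_hz, amplitude)` pairs per frame. The amplitude estimate is accurate only when a sinusoid falls on a bin center; a practical analyzer would refine both values between bins.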
Continuing the work with this analysis/synthesis technique, we are implementing PARSHL on a Lisp Machine with an attached FPS AP120B array processor. We plan to study its sound-transformation possibilities further, as well as the use of PARSHL in conjunction with other analysis/synthesis techniques such as Linear Predictive Coding (LPC).
The basic ``FFT processor'' at the heart of PARSHL provides a ready point of departure for many other STFT applications, such as FIR filtering, speech coding, noise reduction, adaptive equalization, and cross-synthesis. The basic parameter trade-offs discussed in this paper are universal across all of these applications.
Although PARSHL was designed to analyze piano recordings, it has proven very successful in extracting additive synthesis parameters for radically inharmonic sounds. It provides interesting effects when made to extract peak trajectories in signals which are not describable as sums of sinusoids (such as noise or ocean recordings). PARSHL has even demonstrated that speech can be intelligible after reducing it to only the three strongest sinusoidal components.
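The reduction to a few strongest components can be sketched as an oscillator bank driven by peak tracks. This is an illustrative assumption rather than PARSHL's synthesis code: the names `strongest_tracks` and `synthesize` are hypothetical, and each track is assumed to be an equal-length list of `(frequency_hz, amplitude)` pairs, one per frame. Tracks are ranked by mean amplitude, which is one possible meaning of ``strongest'', and every oscillator starts at zero phase, so measured phases are ignored.

```python
import math


def strongest_tracks(tracks, keep):
    """Return the `keep` tracks with the largest mean amplitude."""
    def mean_amp(track):
        return sum(amp for _, amp in track) / len(track)
    return sorted(tracks, key=mean_amp, reverse=True)[:keep]


def synthesize(tracks, sample_rate, hop):
    """Sum one oscillator per track.

    Frequency and amplitude are linearly interpolated between frames,
    and phase is obtained by accumulating the instantaneous frequency.
    """
    n_frames = len(tracks[0])
    out = [0.0] * ((n_frames - 1) * hop)
    for track in tracks:
        phase = 0.0  # assumption: start at zero phase
        for m in range(n_frames - 1):
            f0, a0 = track[m]
            f1, a1 = track[m + 1]
            for i in range(hop):
                t = i / hop
                freq = f0 + (f1 - f0) * t
                amp = a0 + (a1 - a0) * t
                out[m * hop + i] += amp * math.cos(phase)
                phase += 2.0 * math.pi * freq / sample_rate
    return out
```

For example, `synthesize(strongest_tracks(tracks, 3), sample_rate, hop)` resynthesizes a sound from its three strongest tracks only.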
The surprising success of additive synthesis from spectral peaks suggests a close connection with audio perception. Perhaps timbre perception is based on data reduction in the brain similar to that carried out by PARSHL. This data reduction goes beyond what is provided by critical-band masking. Perhaps a higher-level theory of ``timbral masking'' or ``main feature dominance'' is appropriate, wherein the principal spectral features serve to define the timbre, masking lower-level structure (even though that structure is not masked in the critical-band sense). The lower-level features would have to be restricted to qualitatively similar behavior in order that they be ``implied'' by the louder features. Another point of view is that the spectral peaks are analogous to the outlines of figures in a picture--they capture enough of the perceptual cues to trigger the proper percept; memory itself may then serve to fill in the implied spectral features (at least for a time).
Techniques such as PARSHL provide a powerful analysis tool for extracting signal parameters matched to the characteristics of hearing. Such an approach is perhaps the best single way to obtain cost-effective, analysis-based synthesis of any sound.