The following correlograms are shown here:
Vowel sounds can be thought of as periodic signals with distinctive spectral patterns. Their spectral envelopes are dominated by the formant resonances, which are largely independent of the fundamental frequency. Through learning spoken language, people become particularly sensitive to the spectral envelopes of the vowels in that language.
In his doctoral thesis, Stephen McAdams showed that when different vowels having the same fundamental frequency were mixed, the resulting mixture did not have a vowel-like quality; however, when the glottal pulse train for any one vowel was frequency modulated, the sound of that vowel would "emerge" as a separate, recognizable sound stream [McAdams84]. The effect was very strong, again suggesting that the auditory system makes central use of the comodulation of harmonic components to separate sound sources.
Several experiments were performed to see how both amplitude and frequency modulation of the glottal pulse train affected the auditory perception and the correlograms of the vowel mixtures. All experiments used the same three synthetic vowels: /a/ (as in "hot"), /i/ (as in "beet") and /u/ (as in "boot"). These vowels were synthesized using a cascade model due to [Rabiner68]. Specifically, a pulse train with fundamental frequency f0 was passed through a cascade of two single-pole glottal-pulse filters, three two-pole formant filters, and a first-difference stage to simulate the effects of radiation. The glottal-pulse filters had their poles at 250 Hz, and the formant resonances had a 50-Hz bandwidth. The following table lists the fundamental and formant frequencies:
Vowel | F0 (Hz) | F1 (Hz) | F2 (Hz) | F3 (Hz) |
---|---|---|---|---|
/a/ | 140 | 730 | 1090 | 2440 |
/i/ | 132 | 270 | 2290 | 3010 |
/u/ | 125 | 300 | 870 | 2240 |
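The cascade described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the synthesizer actually used: the sampling rate `FS`, the mapping from a pole frequency to a filter coefficient, the unity-DC-gain scaling of the resonators, and all function names are assumptions introduced here.

```python
import math

FS = 16000  # sampling rate in Hz (assumed; not stated in the text)

# Fundamental and formant frequencies in Hz, taken from the table above.
VOWELS = {
    "a": (140, (730, 1090, 2440)),
    "i": (132, (270, 2290, 3010)),
    "u": (125, (300, 870, 2240)),
}

def pulse_train(f0, duration, fs=FS):
    """Unit impulses spaced 1/f0 seconds apart, rounded to the nearest sample."""
    n = int(duration * fs)
    x = [0.0] * n
    t = 0.0
    while True:
        idx = int(round(t * fs))
        if idx >= n:
            break
        x[idx] = 1.0
        t += 1.0 / f0
    return x

def one_pole(x, fc, fs=FS):
    """Single real pole; the coefficient mapping exp(-2*pi*fc/fs) is an assumption."""
    a = math.exp(-2.0 * math.pi * fc / fs)
    y, prev = [], 0.0
    for s in x:
        prev = s + a * prev
        y.append(prev)
    return y

def resonator(x, fc, bw, fs=FS):
    """Two-pole resonance at fc with bandwidth bw, scaled to unity gain at DC."""
    r = math.exp(-math.pi * bw / fs)
    c1 = 2.0 * r * math.cos(2.0 * math.pi * fc / fs)
    c2 = -r * r
    g = 1.0 - c1 - c2
    y, y1, y2 = [], 0.0, 0.0
    for s in x:
        y0 = g * s + c1 * y1 + c2 * y2
        y.append(y0)
        y2, y1 = y1, y0
    return y

def first_difference(x):
    """y[n] = x[n] - x[n-1], a crude stand-in for the radiation characteristic."""
    return [x[0]] + [x[i] - x[i - 1] for i in range(1, len(x))]

def synthesize(vowel, duration=0.5, fs=FS):
    """Pulse train -> two glottal poles -> three formants -> first difference."""
    f0, formants = VOWELS[vowel]
    y = pulse_train(f0, duration, fs)
    for _ in range(2):
        y = one_pole(y, 250.0, fs)
    for fc in formants:
        y = resonator(y, fc, 50.0, fs)
    return first_difference(y)
```

Calling `synthesize("a")` returns half a second of samples for /a/; summing the three outputs sample by sample would give an unmodulated mixture under the same assumptions.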
The correlograms of these individual vowel sounds and their mixture are shown on the video tape. One notes at once the distinctly different visual patterns of these three vowels. The horizontal organization (rows of energy blobs) reveals the formants, the high-frequency formants of the /i/ being particularly distinctive. The vertical organization (columns of energy blobs) reveals the distinctly different pitch periods, which are about a semitone apart. Note, however, that there is no clear pitch period in the mixture, whose low-frequency organization is murky. The mixture sounds comparably murky, like a dissonant blend of unidentifiable tones.
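The "about a semitone apart" spacing can be checked directly from the fundamental frequencies listed for the three vowels. The short sketch below uses the standard equal-tempered definition of an interval in semitones, 12·log2 of the frequency ratio; the helper name `semitones` is introduced here for illustration.

```python
import math

# Fundamental frequencies in Hz for the three synthetic vowels.
F0 = {"a": 140.0, "i": 132.0, "u": 125.0}

def semitones(f_high, f_low):
    """Interval between two frequencies in equal-tempered semitones."""
    return 12.0 * math.log2(f_high / f_low)

a_to_i = semitones(F0["a"], F0["i"])  # roughly 1.02 semitones
i_to_u = semitones(F0["i"], F0["u"])  # roughly 0.94 semitones
```

Both adjacent intervals come out within a few percent of one semitone, which is consistent with the dissonant quality of the unmodulated mixture.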
A variety of different synthesized vowel mixtures were produced by modulating the glottal pulse trains in different ways. The standard method was to create 4-second pulse trains for the three vowels as follows:
Vowel | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 | 2.5 | 3.0 | 3.5 | 4.0 (sec) |
---|---|---|---|---|---|---|---|---|---|
/a/ | | | MMM | | | | | | |
/b/ | MMM | |||||||
/c/ | MMM |
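A pulse train that is steady except for one modulated segment can be sketched as follows. This is a hedged illustration only: the modulation used here is a simple 1% upward step in fundamental frequency, the sampling rate is assumed, and the function names are hypothetical. The default segment of 1.0 to 1.5 seconds matches the schedule given in the text for /a/.

```python
FS = 16000  # sampling rate in Hz (assumed; not stated in the text)

def f0_schedule(t, f0, start=1.0, stop=1.5, shift=0.01):
    """Steady f0, raised by the fraction `shift` during [start, stop) seconds.

    A step shift is an assumption; other modulation shapes would replace this.
    """
    return f0 * (1.0 + shift) if start <= t < stop else f0

def modulated_pulse_train(f0, duration=4.0, fs=FS, **schedule):
    """Emit a unit pulse each time the accumulated phase completes one cycle."""
    n = int(duration * fs)
    x = [0.0] * n
    phase = 1.0  # start with a pulse at t = 0
    for i in range(n):
        if phase >= 1.0:
            x[i] = 1.0
            phase -= 1.0
        phase += f0_schedule(i / fs, f0, **schedule) / fs
    return x

def pulse_intervals(x, lo, hi):
    """Spacings in samples between successive pulses lying in [lo, hi)."""
    idx = [i for i in range(lo, hi) if x[i] == 1.0]
    return [b - a for a, b in zip(idx, idx[1:])]

train = modulated_pulse_train(140.0)
steady = pulse_intervals(train, 0, FS)
shifted = pulse_intervals(train, FS, FS + FS // 2)
```

Accumulating phase sample by sample keeps the pulse train continuous across the segment boundaries, so the only change inside the modulated segment is a slightly shorter pulse spacing.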
For example, the pulse train for /a/ was held steady for 1 second, modulated (M) for the next 0.5 second, and then held steady again for the remaining 2.5 seconds. Signals with the following kinds of modulation were generated in this fashion (not all are shown on the tape):
The perceptual character of these signals can be summarized as follows:
All of the changes that could be easily heard could also be easily seen in videotapes of the correlograms. Even the small 1% frequency shifts produced clearly visible motions. While the formants can also be seen in the correlograms, the untrained eye is not naturally sensitive to these spectral patterns. That is, recognizing the vowel from viewing the changes is not obvious, and would require an ability to read correlograms similar to the ability of trained spectrogram readers to identify vowels in spectrograms [Zue85]. However, it seems likely that a vowel-recognition procedure that worked on the correlograms of isolated vowels would also work on the fragments of correlograms separated by comodulation.
However, these experiments also revealed a phenomenon that might limit a recognition strategy based on unsophisticated motion detection procedures. Low-frequency beats between the three fundamental frequencies produced moving patterns in the correlogram even in the absence of any modulation of the pulse train. The difference between motion due to beats and motion due to comodulation or other temporal changes in the acoustic input needs to be better understood before grouping models based on comodulation can be developed.
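To give a sense of the beat rates involved, the difference frequencies between the three fundamentals can be computed directly. The sketch below considers only the fundamentals listed in the vowel table; beats between higher harmonics are not included.

```python
from itertools import combinations

# Fundamental frequencies in Hz for the three synthetic vowels.
F0 = {"a": 140, "i": 132, "u": 125}

# Two steady tones beat at the difference of their frequencies.
beats = {
    f"/{p}/-/{q}/": abs(F0[p] - F0[q])
    for p, q in combinations(F0, 2)
}
# {'/a/-/i/': 8, '/a/-/u/': 15, '/i/-/u/': 7}
```

Beat rates of 7 to 15 Hz are slow enough to appear as visible motion in a correlogram movie, which fits the observation that patterns move even when no modulation is applied.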
[Figure: three correlogram panels; axes: Frequency and Time Delay.]