


Analysis and Synthesis of Pure Vowels

Though the discussion that follows treats pure vowels (defined in §3), almost identical principles apply to the liquids or semi-vowels. Diphthongs (explained previously) may be thought of as two pure vowels chained together.

During the phonation of pure vowels, a roughly periodic acoustic pressure wave is produced at the vocal folds and subsequently transmitted through the vocal tract. It is then radiated from the oral and/or nasal cavities to the environment. The vocal folds may thus be thought of as a sound source, and the vocal tract (including the oral and/or nasal cavities) as a filter (hence the phrase ``source-filter description of speech sounds'' used by Fant in [Fant 1960]). The waveform produced by the vocal folds looks roughly like a lowpass-filtered band-limited impulse train. Its fundamental frequency $f_{0}$, which need not be constant, gives rise to the apparent pitch of the voiced sound. More information may be found in [Fant 1960].
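The source waveform described above can be approximated in a few lines. The following is a minimal sketch, not a model taken from [Fant 1960]: it assumes the band-limited impulse train is built as a sum of equal-amplitude cosine harmonics of $f_{0}$ below half the sampling rate, and that the lowpass filter is a simple one-pole smoother. The function names, the sampling rate of 8 kHz, and the smoothing coefficient are all illustrative assumptions.

```python
import math

def bandlimited_impulse_train(f0, fs, n_samples):
    """Sum equal-amplitude cosine harmonics of f0 that lie strictly below fs/2.

    The result is normalized so that each pulse peaks at 1.0.
    """
    n_harm = math.ceil(fs / (2.0 * f0)) - 1  # largest k with k*f0 < fs/2
    out = []
    for i in range(n_samples):
        total = 0.0
        for k in range(1, n_harm + 1):
            total += math.cos(2.0 * math.pi * k * f0 * i / fs)
        out.append(total / n_harm)
    return out

def one_pole_lowpass(x, a):
    """y[n] = (1 - a) * x[n] + a * y[n-1], with 0 <= a < 1 (unity gain at DC)."""
    y, prev = [], 0.0
    for sample in x:
        prev = (1.0 - a) * sample + a * prev
        y.append(prev)
    return y

# Hypothetical settings: f0 = 100 Hz at fs = 8000 Hz gives a period of 80 samples.
pulses = bandlimited_impulse_train(100.0, 8000.0, 160)
source = one_pole_lowpass(pulses, 0.9)
```

With these settings the pulses recur every 80 samples, which corresponds to the 100 Hz fundamental. Letting `f0` vary over time would model the non-constant pitch mentioned above; this sketch keeps it fixed for simplicity.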

The resonances (or modes) of the vocal tract give rise to peaks in the spectrum of the vowel sound, or ``formants.'' It is thought that the formants of a vowel sound play a primary role in distinguishing it from other vowels, since any two vowels may well have the same fundamental frequency. The meaning of the term ``formant'' has evolved over the past two centuries [Dunn 1950]: phoneticians and phonologists first used it loosely to describe the spectral properties of vowel sounds, and acousticians, physicists, and electrical engineers subsequently used it to denote spectral peaks created by vocal tract resonances. The earliest formal use of the term, according to the Oxford English Dictionary, was in 1901, and the first technical use was in 1952 [Simpson and Weiner 1989]. Fant gives a precise definition in [Fant 1960]: ``The spectral peaks of the sound spectrum $\left\vert P(f)\right\vert$ are called formants,'' where $P(f) = S(f)T(f)$, $S(f)$ is the spectrum of the glottal waveform, and $T(f)$ is the transfer function of the vocal tract. Note that although, strictly speaking, the term formant refers to the peaks of $\left\vert P(f)\right\vert$, it is often used simply to refer to the peaks of the magnitude of the vocal tract transfer function, $\left\vert T(f)\right\vert$ [Fant 1960].
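The distinction between the peaks of $\left\vert P(f)\right\vert$ and those of $\left\vert T(f)\right\vert$ can be seen in a small numerical sketch. Everything below is an illustrative assumption rather than data from [Fant 1960]: the source spectrum is taken to be a set of harmonics of a 120 Hz fundamental whose amplitudes fall off as $1/k$, and the vocal tract is reduced to a single second-order resonance centered at 500 Hz with a 60 Hz bandwidth.

```python
import math

def source_magnitude(k):
    """Hypothetical |S| at the k-th harmonic: amplitude falls off as 1/k."""
    return 1.0 / k

def tract_magnitude(f, center=500.0, bandwidth=60.0):
    """|T(f)| for one hypothetical second-order resonance (gain 1 at f = 0)."""
    r = f / center
    q = center / bandwidth
    return 1.0 / math.sqrt((1.0 - r * r) ** 2 + (r / q) ** 2)

f0 = 120.0
orders = range(1, 34)                        # harmonics up to about 4 kHz
harmonics = [k * f0 for k in orders]

# |P(f)| = |S(f)| |T(f)|, evaluated only where the periodic source has energy.
p = [source_magnitude(k) * tract_magnitude(k * f0) for k in orders]
peak_of_p = harmonics[p.index(max(p))]

# Peak of |T(f)| alone, searched on a 1 Hz grid.
grid = [float(f) for f in range(1, 4001)]
peak_of_t = max(grid, key=tract_magnitude)

print(peak_of_p, peak_of_t)
```

Under these assumptions the largest component of $\left\vert P(f)\right\vert$ is the fourth harmonic (480 Hz), while $\left\vert T(f)\right\vert$ peaks close to 500 Hz. The two notions of ``formant'' therefore agree only approximately when the source is periodic.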

The source-filter model of speech production described above lends itself nicely to implementation on a digital computer. Such an implementation is described in [Klatt 1980]. First, a band-limited impulse train with a fundamental frequency of approximately 100 to 200 Hz is created and lowpass-filtered. This signal is then applied to a network of second-order all-pole filters. Finally, the network output is filtered to simulate radiation of the sound from the nose and mouth.
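The three stages just listed can be sketched as follows. This is a simplified illustration under stated assumptions, not a reproduction of the synthesizer in [Klatt 1980]: the filter network is reduced to a plain cascade, each second-order all-pole section uses a common two-pole resonator recurrence with unity gain at DC, radiation is approximated by a first difference, and the formant frequencies and bandwidths in the usage example are hypothetical values loosely typical of an open vowel. All function names are invented for this sketch.

```python
import math

def glottal_source(f0, fs, n_samples, smooth=0.9):
    """Band-limited impulse train (cosine harmonics below fs/2), then a
    one-pole lowpass y[n] = (1 - smooth) * x[n] + smooth * y[n-1]."""
    n_harm = math.ceil(fs / (2.0 * f0)) - 1
    out, prev = [], 0.0
    for i in range(n_samples):
        pulse = sum(math.cos(2.0 * math.pi * k * f0 * i / fs)
                    for k in range(1, n_harm + 1)) / n_harm
        prev = (1.0 - smooth) * pulse + smooth * prev
        out.append(prev)
    return out

def resonator(x, freq, bw, fs):
    """Second-order all-pole section: y[n] = a*x[n] + b*y[n-1] + c*y[n-2].

    The pole radius is set by the bandwidth bw (Hz) and the pole angle by
    the center frequency freq (Hz); a = 1 - b - c gives unity gain at DC.
    """
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw * t)
    b = 2.0 * math.exp(-math.pi * bw * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for sample in x:
        y0 = a * sample + b * y1 + c * y2
        out.append(y0)
        y2, y1 = y1, y0
    return out

def radiate(x):
    """First-difference approximation of radiation: y[n] = x[n] - x[n-1]."""
    out, prev = [], 0.0
    for sample in x:
        out.append(sample - prev)
        prev = sample
    return out

def synthesize_vowel(f0, formants, fs=8000.0, dur=0.1):
    """Source -> cascade of resonators -> radiation filter."""
    signal = glottal_source(f0, fs, int(round(fs * dur)))
    for freq, bw in formants:
        signal = resonator(signal, freq, bw, fs)
    return radiate(signal)

# Hypothetical (frequency, bandwidth) pairs in Hz; assumptions, not source data.
vowel = synthesize_vowel(120.0, [(700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)])
```

A cascade is only one possible arrangement of the filter network; a parallel arrangement with per-formant gains would need amplitude controls that this sketch omits. The returned list holds 0.1 s of samples at 8 kHz and would still need scaling before being written to an audio file.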



``Audio Speech Research Note'', Ryan J. Cassidy, published electronically by author, July 2003.

Copyright © 2003-11-28 by Ryan J. Cassidy.
Please email errata, comments, and suggestions to Ryan J. Cassidy <ryanc@ieee.org>
Stanford University