In this section, we study the behavior of our vocal mechanism. Although many aspects of this system are not completely understood (particularly the behavior of the vocal folds), we can experiment with our own speech mechanism and so quickly verify much of its behavior.
- Lungs serve as an air reservoir and energy source.
- The Larynx and the Vocal Cords:
- The larynx contains the vocal folds.
- The vocal cords consist of folds of ligament extending from the thyroid cartilage in the front to the arytenoid cartilages at the back.
- The space between the vocal folds, called the glottis, is controlled by the arytenoid cartilages.
- For normal breathing, the arytenoids are spaced well apart. They come together when sound is produced.
- The vocal cords may be closed, blocking the flow of air, and then opened suddenly to produce a glottal stop.
- For unvoiced consonants, the folds may be completely open (such as when producing ``s'', ``sh'', and ``f'' sounds) or partially open (for ``h'' sounds).
- Voiced sounds are created by vibrations of the vocal folds.
- The rate of vibration of the vocal cords is determined primarily by their mass and tension, though air pressure and velocity contribute to a lesser extent.
- The pitch of normal speech varies over a range of roughly one octave. Typical center frequencies are 110 Hz (men), 220 Hz (women), and 300 Hz (children).
- A ``breathy'' voice quality is produced during an open phase mode of vibration, such that the folds never completely stop the air flow through them.
- A minimum of air passes through the folds, in short puffs, when producing a ``creaky'' voice.
- Feedback from the vocal tract has little influence on the vibrations of the vocal folds (in contrast to the lips and air column interaction for brass instrument playing).
- For ``normal'' vocal effort, the vocal cords open and close completely during each cycle and generate an air flow waveform that is roughly triangular in shape over time. This produces a ``buzzy'' sound that is rich in harmonics, falling off in amplitude as $1/n^2$, where $n$ is the harmonic number.
- Unvoiced consonants make extensive use of broadband noise, caused by turbulent air flow through a constriction in the vocal tract.
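The harmonic content of a roughly triangular glottal flow can be checked numerically. The following is a minimal Python sketch, not a model of real vocal fold motion: it assumes an idealized triangular pulse with no closed phase, a rise fraction of 0.6, and a 110 Hz fundamental (all illustrative choices; the function names are hypothetical). It computes the first few harmonic amplitudes by a direct Fourier sum and multiplies each by $n^2$, which stays bounded if the envelope falls off as $1/n^2$.

```python
import cmath
import math


def triangular_pulse(num_samples=2048, rise_fraction=0.6):
    """One period of an idealized flow pulse: linear rise to 1, linear fall to 0.

    rise_fraction is an assumed, illustrative value (opening slower than closing).
    """
    wave = []
    for i in range(num_samples):
        phase = i / num_samples
        if phase < rise_fraction:
            wave.append(phase / rise_fraction)
        else:
            wave.append((1.0 - phase) / (1.0 - rise_fraction))
    return wave


def harmonic_amplitudes(wave, num_harmonics=10):
    """Amplitudes of harmonics 1..num_harmonics of one period, by direct Fourier sum."""
    n_samples = len(wave)
    amps = []
    for n in range(1, num_harmonics + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * n * i / n_samples)
            for i, x in enumerate(wave)
        )
        amps.append(2.0 * abs(coeff) / n_samples)
    return amps


if __name__ == "__main__":
    fundamental_hz = 110.0  # assumed speaking pitch, taken from the typical male value
    amps = harmonic_amplitudes(triangular_pulse())
    for n, amp in enumerate(amps, start=1):
        # amp * n^2 stays below a fixed bound when the envelope is 1/n^2
        print(f"n={n:2d}  f={n * fundamental_hz:6.1f} Hz  amp={amp:.5f}  amp*n^2={amp * n * n:.5f}")
```

Individual harmonics can fall well below the envelope (for this rise fraction, every fifth harmonic nearly vanishes), so the $1/n^2$ behavior describes the upper bound of the spectrum rather than every partial.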
- The Vocal Tract:
- The vocal tract can be considered a single tube extending from the vocal folds to the lips, with a side branch leading to the nasal cavity.
- The length of the vocal tract is typically about 17 centimeters, though this can be varied slightly by lowering or raising the larynx and by shaping the lips.
- The pharynx connects the larynx (as well as the esophagus) with the oral cavity.
- The oral cavity is the most important component of the vocal tract because its size and shape can be varied by adjusting the relative positions of the palate, the tongue, the lips, and the teeth.
- The smallest units of speech sounds are called phonemes. One or more phonemes combine to form a syllable, and one or more syllables to form a word.
- Phonemes can be divided into two groups: vowels and consonants. Vowels are always voiced.
- Estimates of the number of distinct vowel sounds used in the English language range from about 12 to 21. The discrepancy is usually due to disagreement over what constitutes a pure vowel sound rather than a diphthong (a combination of two vowels into one phoneme).
- Consonants involve rapid and sometimes subtle changes in sound.
- Consonants may be classified according to their manner of articulation as plosive (p, b, t, etc.), fricative (f, s, sh, etc.), nasal (m, n, ng), liquid (r, l), and semivowel (w, y).
- Consonants are more independent of language than vowels are.
- Phonemes are distinguished from one another by the resonances of the vocal tract.
- The peaks that occur in the sound spectra of the vowels, independent of pitch, are called formants.
- Just three formants are typically distinguished.
- Though the exact shape of the vocal tract is quite complex, many of its most prominent features can be recreated with simple models.
- The resonances of a closed-open cylinder 17 centimeters long occur around 500, 1500, and 2500 Hz, which are close to the formant frequencies of a neutral vowel sound.
- Two-tube models of the vocal tract capture many of the important features of the vowel sounds ``ah'', ``ee'', and ``oo'' (Matlab example of formants produced by a two-tube model).
- Models composed of two cavities with a connecting constriction can approximate the formants associated with several consonant sounds.
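The tube models above can be sketched numerically. The following Python example is an illustration under stated assumptions, not the Matlab example referenced in the notes: it assumes lossless plane-wave propagation, a speed of sound of 343 m/s, an ideal closed end at the glottis, and an ideal open end at the lips. For a single closed-open tube of length $L$ the resonances are $f_n = (2n-1)c/(4L)$. For two joined tubes (back tube of length $l_1$ and area $A_1$, front tube of length $l_2$ and area $A_2$), matching impedances at the junction gives the condition $\tan(k l_1)\tan(k l_2) = A_2/A_1$ with $k = 2\pi f/c$, which is solved here by a frequency scan and bisection. The function names and the tube dimensions in the usage section are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def closed_open_resonances(length_m, count=3, c=SPEED_OF_SOUND):
    """Quarter-wavelength resonances of a uniform tube closed at one end."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]


def _bisect(func, lo, hi, tol=1e-3):
    """Refine a root of func bracketed by [lo, hi] to within tol."""
    f_lo = func(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if f_lo * f_mid <= 0.0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


def two_tube_resonances(l_back, a_back, l_front, a_front,
                        f_max=4000.0, step=1.0, c=SPEED_OF_SOUND):
    """Resonances below f_max of two joined lossless tubes.

    Solves tan(k*l_back) * tan(k*l_front) = a_front / a_back, rewritten
    without tangents so the function scanned for sign changes is continuous.
    """
    def mismatch(freq):
        k = 2.0 * math.pi * freq / c
        return (a_back * math.sin(k * l_back) * math.sin(k * l_front)
                - a_front * math.cos(k * l_back) * math.cos(k * l_front))

    roots = []
    for i in range(int(f_max / step)):
        f_lo, f_hi = i * step, (i + 1) * step
        g_lo, g_hi = mismatch(f_lo), mismatch(f_hi)
        if g_lo * g_hi < 0.0:
            roots.append(_bisect(mismatch, f_lo, f_hi))
        elif g_hi == 0.0:
            roots.append(f_hi)
    return roots


if __name__ == "__main__":
    # Uniform 17 cm tube, as in the text.
    print([round(f) for f in closed_open_resonances(0.17)])
    # Illustrative "ah"-like shape: narrow back tube, wide front tube (assumed dimensions).
    print([round(f) for f in two_tube_resonances(0.085, 1.0, 0.085, 8.0)])
```

With equal areas the two-tube condition reduces to the single-tube case, which gives a quick consistency check. Narrowing the back tube relative to the front raises the first resonance and lowers the second, the general pattern associated with ``ah''.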
- Prosodic features are characteristics of speech that convey meaning, emphasis, and emotion without actually changing the phonemes.
- These features include pitch, rhythm, and accent.