Vocal Models for Data Sonification

Ryan J. Cassidy (ryanc(at)ieee.org)
Kyogu Lee (kglee(at)ccrma.stanford.edu)
Center for Computer Research in Music and Acoustics (CCRMA)
Department of Music, Stanford University
Stanford, CA 94305

            Abstract: This document discusses a few popular synthesis methods for voice simulation and presents their implementations on different platforms. It also illustrates hyperspectral data sonification using these vocal models.

    1. FM Voice Model
  This model uses the FM synthesis technique developed by John Chowning [1. Chowning 1973] to produce sounds with a vocal texture [2. Chowning 1989]. A command-line application, fm_vowel, has been written to let users experiment with various FM synthesis parameters for vowel synthesis. It may be downloaded at http://www-ccrma.stanford.edu/~rjc/audio/speech/vowel/fm_vowel/fm_vowel.tar.gz. It takes three control parameters: 1) freq determines the pitch of the synthesized voice; 2) tilt sets the spectral tilt; 3) md sets the modulation depth. Although the model is easy to implement and computationally very cheap, it is quite difficult to produce convincing sounds with it. Furthermore, with only three control parameters, it is not well suited to the sonification of hyperspectral data.
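
  As a rough illustration of the idea (not the fm_vowel source, which is not reproduced here), the following Python sketch generates a single FM tone whose modulator runs at the pitch and whose carrier sits on the harmonic nearest an assumed formant frequency. The function name, the formant_hz argument, and all default values are hypothetical, and the tilt parameter is not modeled.

```python
import math

def fm_vowel_sketch(freq=220.0, md=1.5, formant_hz=730.0, dur=0.5, sr=22050):
    """Two-oscillator FM tone with the carrier placed near a formant.

    freq       -- pitch in Hz; also used as the modulator frequency
    md         -- modulation depth (FM index); larger values spread the sidebands
    formant_hz -- assumed formant centre frequency (illustrative value)
    Returns the carrier frequency actually used and the list of samples.
    """
    # Snap the carrier to the harmonic of freq closest to the formant, so that
    # every sideband (fc +/- k*freq) lands on a harmonic of the pitch.
    harmonic = max(1, round(formant_hz / freq))
    fc = harmonic * freq
    samples = []
    for i in range(int(dur * sr)):
        t = i / sr
        phase = 2 * math.pi * fc * t + md * math.sin(2 * math.pi * freq * t)
        samples.append(math.sin(phase))
    return fc, samples
```

  Calling fm_vowel_sketch() with the defaults places the carrier on the third harmonic (660 Hz). A fuller voice model would typically sum several such carriers, one per formant.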

    2. Formant Synthesis (Source-Filter model)

  When distinguishing sounds such as human vowels, the ear is most sensitive to peaks in the signal spectrum. These resonant peaks are called formants. Their frequencies correspond to the resonant frequencies of the vocal tract, through which the glottal pulse is filtered. Each vowel has different formant frequencies and bandwidths, and every speaker has unique formant frequencies and bandwidths as well. Exploiting this production mechanism, a band-limited impulse train can be used as the glottal source and then filtered by multiple resonators (arranged in parallel or in cascade) tuned to the corresponding formant frequencies and bandwidths [3. Klatt 1980] to generate vowel sounds.

  We implemented the formant synthesis technique in Matlab, using Peterson and Barney's formant table to create a formant matrix that contains the first three formant frequencies of 10 American-English monophthong vowels as spoken by 76 speakers (33 men, 28 women and 15 children) [4. Peterson, Barney 1952]. Although the model has only three control parameters, namely pitch, gender (male, female, or child), and vowel type, it is far more suitable for hyperspectral data sonification than the FM model: by mapping data values to the amplitudes and bandwidths of the formant peaks, we can obtain vowel sounds of differing sonority.

  In addition to the Matlab implementation, the STK (Synthesis Tool Kit), a CCRMA-created collection of C++ classes for the synthesis and processing of musical instrument sounds, contains a C++ class, VoicForm, for the synthesis of vowel sounds based on formant filtering of a band-limited impulse train. Sound examples as well as a Matlab GUI for sonification are available at http://www-ccrma.stanford.edu/~kglee/sonification/formant_synthesis/formant_synthesis.html
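
  The source-filter chain described above can be sketched in Python as follows. This is a minimal illustration, not the authors' Matlab code: the resonator uses the two-pole difference equation given by Klatt (1980), the resonators are summed in parallel, and the helper names (impulse_train, resonator, scale, formant_vowel), the sample rate, and the mapping ranges are assumptions made for the example.

```python
import math

def impulse_train(f0, n, sr):
    """Band-limited impulse train: equal-amplitude harmonics of f0 below Nyquist."""
    k_max = math.ceil(sr / (2 * f0)) - 1
    return [sum(math.cos(2 * math.pi * k * f0 * i / sr) for k in range(1, k_max + 1))
            for i in range(n)]

def resonator(x, freq, bw, sr):
    """Two-pole resonator y[n] = a*x[n] + b*y[n-1] + c*y[n-2] (unity gain at DC)."""
    c = -math.exp(-2 * math.pi * bw / sr)
    b = 2 * math.exp(-math.pi * bw / sr) * math.cos(2 * math.pi * freq / sr)
    a = 1 - b - c
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = a * s + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def scale(v, lo, hi):
    """Linearly map a data value v, assumed to lie in [0, 1], onto [lo, hi]."""
    return lo + (hi - lo) * v

def formant_vowel(f0, formants, amps, bandwidths, dur=0.1, sr=16000):
    """Sum parallel resonators, one per formant, driven by the impulse train."""
    n = int(dur * sr)
    src = impulse_train(f0, n, sr)
    out = [0.0] * n
    for f, amp, bw in zip(formants, amps, bandwidths):
        for i, y in enumerate(resonator(src, f, bw, sr)):
            out[i] += amp * y
    return out
```

  For example, with the formants 730, 1090 and 2440 Hz (the averages commonly quoted from Peterson and Barney for an adult male "ah"), three data values d1, d2, d3 in [0, 1] could set the amplitudes via scale(d, 0.1, 1.0) and three more the bandwidths via scale(d, 50.0, 300.0). Those ranges are illustrative only.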

    3. Digital Waveguide Modeling of the Vocal Tract

  In his thesis [5. Cook 1990], Perry Cook describes a method of vocal tract modeling superior to the formant-filter approach described above. The method approximates the vocal tract by a series of acoustic tube sections, each with a radius that varies from one vowel sound to the next. As shown in [5. Cook 1990], the radii of adjacent tube sections govern the transmission and reflection of acoustic energy at the junction between them. For each tube section, discrete-time delay elements model the forward- and reverse-traveling wave components of the digital waveguide simulation [6. Smith 2002]. Between the delay elements, a scattering junction handles the change in radius from one tube section to the next. Besides generating convincing sounds, this physical model can have as many tube sections as desired, which makes it well suited to the sonification of very high-dimensional data. A C++ command-line application allows users to control three parameters: 1) freq sets the frequency of the underlying glottal pulse train; 2) shape sets the tract radii for a desired phoneme, whose presets are stored in a separate file; 3) radii is a vector that sets the radii of the N tube sections. In a future version, users will be able to set the lengths of the tube sections as well as their radii. A PD patch with the same functionality has also been designed for real-time use (PD, or Pure Data, is a real-time graphical programming environment for audio signal processing by Miller S. Puckette).
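
  The scattering-junction structure can be illustrated with a short Python sketch. It is a simplified Kelly-Lochbaum style ladder under stated assumptions, not the C++ application described above: each section is one sample long, the glottis and lip reflection values are illustrative constants, and the function names (reflection_coeffs, radii_from_data, simulate_tract) and the radius range are hypothetical.

```python
import math

def reflection_coeffs(radii):
    """Pressure-wave reflection coefficient at each junction between sections."""
    areas = [math.pi * r * r for r in radii]
    return [(a0 - a1) / (a0 + a1) for a0, a1 in zip(areas, areas[1:])]

def radii_from_data(values, r_min=0.2, r_max=2.0):
    """Map data values, assumed to lie in [0, 1], to tube radii (one per section)."""
    return [r_min + (r_max - r_min) * v for v in values]

def simulate_tract(radii, source, glottis_refl=0.75, lip_refl=-0.85):
    """Run the source through N one-sample tube sections; return the lip output."""
    n = len(radii)
    k = reflection_coeffs(radii)
    fwd = [0.0] * n  # right-going wave leaving each section this step
    bwd = [0.0] * n  # left-going wave leaving each section this step
    out = []
    for x in source:
        new_fwd = [0.0] * n
        new_bwd = [0.0] * n
        # Glottis end: inject the source plus a partial reflection.
        new_fwd[0] = x + glottis_refl * bwd[0]
        # Interior junctions: one-multiply scattering between sections j and j+1.
        for j in range(n - 1):
            w = k[j] * (fwd[j] - bwd[j + 1])
            new_fwd[j + 1] = fwd[j] + w
            new_bwd[j] = bwd[j + 1] + w
        # Lip end: part of the wave reflects back, the rest radiates as output.
        new_bwd[n - 1] = lip_refl * fwd[n - 1]
        out.append((1.0 + lip_refl) * fwd[n - 1])
        fwd, bwd = new_fwd, new_bwd
    return out
```

  Because each data value maps to one radius, a data vector of dimension N can drive an N-section tract directly, for example simulate_tract(radii_from_data(data), source), where source is a glottal pulse train at the desired pitch.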

  The following table summarizes the above three vocal models with a few sound examples.

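Vocal Model                   | Control Parameters                                                     | Implementation Tool | Sound Examples | Sonification Examples
------------------------------+------------------------------------------------------------------------+---------------------+----------------+----------------------
FM Voice                      | pitch, tilt, modulation depth                                          |                     |                |
Formant Synthesis             | pitch, gender, vowel type (and amplitudes/bandwidths of formant peaks) | Matlab, C/C++       |                |
Vocal Tract Physical Modeling | pitch, shape, radii                                                    | C++, PD             |                |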


  (under construction: description of data sonification, sound examples...)


1. Chowning, J. 1973
"The Synthesis of Complex Audio Spectra by Means of Frequency Modulation"
Journal of the Audio Engineering Society, 21(7):526-534

2. Chowning, J. 1989
"Frequency Modulation Synthesis of the Singing Voice"
Pages 57-63 of: Mathews, M. V., and J. R. Pierce (eds), Current Directions in Computer Music Research
Cambridge, MIT Press

3. Klatt, D. 1980
"Software for a Cascade/Parallel Formant Synthesizer"
Journal of the Acoustical Society of America, 67(3):971-995

4. Peterson, G.E. & Barney, H.L. 1952
"Control methods used in a study of the vowels"
Journal of the Acoustical Society of America, 24:175-184

5. Cook, P. R. 1990
"Identification of Control Parameters in an Articulatory Vocal Tract Model, with Applications to the Synthesis of Singing"
Ph.D. thesis, Elec. Engineering Dept., Stanford University (CCRMA)

6. Smith III, J. O. 2002
"Digital Waveguide Modeling of Musical Instruments"