Vocal Models for Data Sonification
Abstract: This document discusses a few popular synthesis methods for
voice simulation and presents their implementations on different
platforms. In addition, hyperspectral data sonification using these
vocal models is exemplified.
1. FM Voice Model
This model uses the FM synthesis technique [1. Chowning 1973]
developed by John Chowning to produce sounds with a vocal texture [2.
Chowning 1989]. A command-line application has been written to allow
users to experiment with various FM synthesis parameters for vowel
synthesis. The utility is called fm_vowel, and may be
downloaded at http://www-ccrma.stanford.edu/~rjc/audio/speech/vowel/fm_vowel/fm_vowel.tar.gz.
It takes three control parameters: 1) freq determines the pitch of the
synthesized voice; 2) tilt sets the spectral tilt parameter; 3) md
sets the modulation depth. Although the model is easy to implement and
its computational cost is very low, it is quite difficult to produce
convincing sounds with it. Furthermore, this model is poorly suited to
sonification of hyperspectral data because it has only three control
parameters.
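As a rough illustration of the idea behind this model, the sketch below follows the general scheme of FM voice synthesis: one carrier per formant, placed at the harmonic of the pitch nearest the formant center and phase-modulated by a sinusoid at the pitch frequency. It is not the fm_vowel source. The function name fm_vowel_sketch, the formant frequencies and amplitudes, and the way md is applied are all assumptions made for illustration, and the tilt parameter is not modeled.

```python
import math

def fm_vowel_sketch(freq, md, formants, amps, dur=0.5, sr=16000):
    """Sum of FM carrier/modulator pairs, one per formant (hypothetical helper)."""
    n = int(dur * sr)
    out = [0.0] * n
    for f_c, amp in zip(formants, amps):
        # Snap the carrier to the nearest harmonic of the pitch so that
        # all sidebands fall on multiples of freq (a harmonic spectrum).
        harmonic = max(1, round(f_c / freq))
        carrier = harmonic * freq
        for i in range(n):
            t = i / sr
            # md is used here as the FM index (assumption).
            phase = (2 * math.pi * carrier * t
                     + md * math.sin(2 * math.pi * freq * t))
            out[i] += amp * math.sin(phase)
    # Normalize to a peak of 1.0 so the result can be written to a file.
    peak = max(abs(x) for x in out) or 1.0
    return [x / peak for x in out]

# Illustrative values only: 110 Hz pitch, three assumed formant centers.
samples = fm_vowel_sketch(freq=110.0, md=1.0,
                          formants=[730.0, 1090.0, 2440.0],
                          amps=[1.0, 0.5, 0.25])
```

With only pitch, tilt, and modulation depth exposed, there are few independent places to attach data dimensions, which is the limitation noted above.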
2. Formant Synthesis (Source-Filter Model)
When distinguishing between sounds such as human vowels, the ear is
most sensitive to peaks in the signal spectrum. These resonant
peaks in the spectrum are called formants. The frequencies of these
peaks correspond to the resonant frequencies of the vocal tract, through
which the glottal pulse is filtered. Each vowel has different formant
frequencies and bandwidths. Furthermore, every speaker has unique
formant frequencies and bandwidths. Following this vowel production
mechanism, a band-limited impulse train can be used as a glottal
source, which is then filtered by multiple resonators (arranged in
parallel or cascade) with the corresponding formant frequencies and
bandwidths [3. Klatt 1980] to generate vowel sounds. We used Matlab
to implement the formant synthesis technique, and referred to Peterson
and Barney's formant table to create a formant matrix, which contains
the first three formant frequency values of 10 American-English
monophthong vowels as spoken by 76 speakers (33 men, 28 women and 15
children) [4. Peterson, Barney 1952]. Although it has only three
control parameters - pitch, gender (male, female, or child), and vowel
type - it is far more suitable for hyperspectral data sonification
than the FM voice model, because mapping data values to the amplitudes
and bandwidths of the formant peaks yields vowel sounds with different
sonority. In addition to the Matlab implementation, the STK (Synthesis
Tool Kit), a CCRMA-created collection of C++ classes for the synthesis
and processing of musical instrument sounds, contains a C++ class
VoicForm for the synthesis of vowel sounds based on formant filtering
of a band-limited impulse train. Sound examples, as well as a Matlab
GUI for sonification, are available at http://www-ccrma.stanford.edu/~kglee/sonification/formant_synthesis/formant_synthesis.html
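The source-filter idea and the data mapping described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the Matlab implementation or STK's VoicForm: the impulse train is a plain one (not band-limited), the resonators are arranged in parallel so that each formant has its own gain, and the helper names, the bandwidth values, and the min-max mapping in data_to_gains are hypothetical. The formant frequencies 730, 1090 and 2440 Hz are the commonly quoted adult-male averages for the vowel in "father". The resonator coefficients follow the usual two-pole form with unity gain at DC.

```python
import math

def impulse_train(freq, dur, sr):
    """Plain impulse train standing in for the glottal source (not band-limited)."""
    n = int(dur * sr)
    out = [0.0] * n
    period = sr / freq
    k = 0
    while int(round(k * period)) < n:
        out[int(round(k * period))] = 1.0
        k += 1
    return out

def resonator(x, f, bw, sr):
    """Two-pole resonator: center frequency f, bandwidth bw, unity DC gain."""
    c = -math.exp(-2 * math.pi * bw / sr)
    b = 2 * math.exp(-math.pi * bw / sr) * math.cos(2 * math.pi * f / sr)
    a = 1 - b - c
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = a * s + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def data_to_gains(values, floor=0.1):
    """Min-max scale data into [floor, 1] (hypothetical mapping)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [floor + (1 - floor) * (v - lo) / (hi - lo) for v in values]

def formant_vowel(freq, formants, bandwidths, gains, dur=0.5, sr=16000):
    """Filter the source through parallel resonators and sum the outputs."""
    src = impulse_train(freq, dur, sr)
    mix = [0.0] * len(src)
    for f, bw, g in zip(formants, bandwidths, gains):
        for i, v in enumerate(resonator(src, f, bw, sr)):
            mix[i] += g * v
    return mix

# Three made-up data values drive the three formant amplitudes.
gains = data_to_gains([0.2, 0.9, 0.5])
vowel = formant_vowel(110.0, [730.0, 1090.0, 2440.0],
                      [60.0, 90.0, 150.0], gains)
```

The parallel arrangement is chosen here only because it exposes a separate amplitude per formant; bandwidths could be driven by data in the same way by passing a mapped list as the bandwidths argument.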
3. Digital Waveguide Modeling of the Vocal Tract
In his thesis [5. Cook 1990], Perry Cook describes a method
of vocal tract modeling superior to the previously described
formant-filter based approach. The method involves approximating the
vocal tract by a series of acoustic tube sections, each with a radius
that varies from one vowel sound to the next. As shown
in [5. Cook 1990], the radii of adjacent tube sections govern
the transmission and reflection of acoustic energy at the junction
between such sections. For each tube section, discrete-time delay
elements are used to model the forward- and reverse-traveling wave
components of the digital waveguide simulation [6. Smith 2002]. Between
the delay elements, a scattering junction is used to handle the
change in radius from one tube section to the next. In addition to the
convincing sounds that it generates, this physical model can have as
many tube sections as desired, which makes it well suited to
sonification of very high-dimensional data. A C++ command-line
application allows users to control three parameters: 1) freq sets the
underlying glottal pulse train frequency; 2) shape sets the tract radii
for a desired phoneme, whose presets are saved in a separate file; 3)
radii is a vector that sets the radii of the N tube sections. In a
future version, users will be able to set the lengths of the tube
sections as well as their radii. A PD (Pure Data, a real-time graphical
programming environment for audio signal processing by Miller S.
Puckette) patch with the same function has also been designed for
real-time usage.
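The tube-section model described above can be sketched as a chain of one-sample delays joined by scattering junctions. This is a minimal sketch assuming pressure waves, equal-length sections, and fixed real reflection coefficients at the glottis and lips; it is not the C++ application or the PD patch. The function names and the two end-reflection values are assumptions chosen for illustration. At each junction the reflection coefficient is k = (A1 - A2) / (A1 + A2), where A1 and A2 are the cross-sectional areas of the adjacent sections (proportional to radius squared).

```python
def reflection_coeffs(radii):
    """Reflection coefficient at each junction, from adjacent section areas."""
    areas = [r * r for r in radii]  # pi cancels in the ratio
    return [(a1 - a2) / (a1 + a2) for a1, a2 in zip(areas, areas[1:])]

def tract_response(x, radii, glottal_reflection=0.75, lip_reflection=-0.85):
    """Run signal x through an N-section tube model (hypothetical helper).

    fwd[i] / bwd[i] hold the right- and left-going pressure waves in
    section i; each section contributes one sample of delay per direction.
    The two end-reflection defaults are assumed values.
    """
    n = len(radii)
    k = reflection_coeffs(radii)
    fwd = [0.0] * n
    bwd = [0.0] * n
    out = []
    for s in x:
        new_fwd = [0.0] * n
        new_bwd = [0.0] * n
        # Glottis end: inject the source and partially reflect the returning wave.
        new_fwd[0] = s + glottal_reflection * bwd[0]
        # Scattering junctions between adjacent sections (one-multiply form).
        for i in range(n - 1):
            w = k[i] * (fwd[i] - bwd[i + 1])
            new_fwd[i + 1] = fwd[i] + w
            new_bwd[i] = bwd[i + 1] + w
        # Lip end: part of the wave is reflected, the rest is radiated.
        new_bwd[n - 1] = lip_reflection * fwd[n - 1]
        out.append((1 + lip_reflection) * fwd[n - 1])
        fwd, bwd = new_fwd, new_bwd
    return out

# Impulse response of an 8-section tract with made-up radii.
radii = [0.6, 0.8, 1.2, 1.6, 1.4, 1.0, 0.9, 1.1]
impulse = [1.0] + [0.0] * 255
response = tract_response(impulse, radii)
```

Because every section has its own radius, an N-dimensional data vector can be mapped directly onto the radii argument, which is what makes the model attractive for high-dimensional data.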
The following table summarizes the above three vocal models with
a few sound examples.
Vocal Model | Control Parameters | Implementation Tool | Sound Examples | Sonification Examples |
FM Voice | pitch, tilt, modulation depth | C++ | | |
Formant Synthesis | pitch, gender, vowel type (and amplitudes/bandwidths of formant peaks) | Matlab, C/C++ | | |
Vocal Tract Physical Modeling | pitch, shape, radii | C++, PD | | |
(under construction: description of data sonification, sound
examples...)
Bibliography
1. Chowning, J. 1973
"The Synthesis of Complex Audio Spectra by Means of Frequency Modulation"
Journal of the Audio Engineering Society, 21(7):526-534
2. Chowning, J. 1989
"Frequency Modulation Synthesis of the Singing Voice"
Pages 57-63 of: Mathews, M. V., and J. R. Pierce
(eds), Current Directions in Computer Music Research
Cambridge, MIT Press
3. Klatt, D. 1980
"Software for a Cascade/Parallel Formant Synthesizer"
Journal of the Acoustical Society of America, 67(3):971-995
4. Peterson, G. E. & Barney, H. L. 1952
"Control methods used in a study of the vowels"
Journal of the Acoustical Society of America, 24:175-184
5. Cook, P. R. 1990
"Identification of Control Parameters in an Articulatory Vocal Tract Model, with Applications to the Synthesis of Singing"
Ph.D. thesis, Elec. Engineering Dept., Stanford University (CCRMA)
6. Smith III, J. O. 2002