- Numerical Integration of Partial Differential Equations (February 1999)
- Efficient Pitch Detection Techniques for Interactive Music (April 2002)
- Estimation of Multiple Fundamental Frequencies in Audio Signals Using a Genetic Algorithm (April 2000)
- Synthesis of Ecologically-Based Sound Events (April 2000)
- Modeling the Sound Event at the Micro Level (April 2000)
- Scalable Audio Models for Data Compression and Modifications (February 1999)
- Multichannel Acoustic Echo Cancellation for Telepresence Applications (April 2002)
- Linear Prediction Analysis of Voice Under the Presence of Sinusoidal Interference (July 2001)
- Virtual Analog Synthesis with the Comb Filter
- Nonstationary Sinusoidal Modeling
- Scanned Synthesis (April 2002)
- Scanned Synthesis, A New Synthesis Technique (April 2000)
- Perceptual Audio Coding Based on the Sinusoidal Transform (April 2000)
- Optimal Signal Processing for Acoustical Systems (January 1998)
- An Efficient and Fast Octave-Band Filter Bank for Audio Signal Processing of a Low-Power Wireless Device (July 2001)
- Spectral Modeling Synthesis (SMS) (January 1998)
- Doppler Simulation and the Leslie
- Signal Processing Algorithm Design Stressing Efficiency and Simplicity of Control (April 2000)
- Synthesis and Algorithmic Composition Techniques Derived from the Sonification of Particle Systems; And a Resultant Meta-Philosophy for Music and Physics (April 2000)
- An Iterative Filterbank Approach for Extracting Sinusoidal Parameters from Quasi-Harmonic Sounds
- Analysis and Resynthesis of Quasi-Harmonic Sounds: an Iterative Filterbank approach
- Antialiasing for Nonlinearities: Acoustic Modeling and Synthesis Applications (April 2000)
- A Flexible Analysis/Synthesis Method for Transient Phenomena (April 2000)
- Applying Psychoacoustic Principles to Soundfield Reconstruction (January 1998)
- Voice Gender Transformation with a Modified Vocoder (May 1996)
- A Speech Feature Based on Bark Frequency Warping - The Non-uniform Linear Prediction (NLP) Cepstrum (February 1999)

This work focuses on a numerical integration method for partial differential equations (PDEs) which grew out of Wave Digital Filtering (WDF), a well-known digital filter design technique. The idea, as in the lumped case, is to map a Kirchhoff circuit to a signal flow diagram in such a way that the energetic properties of the various circuit elements are preserved; the method extends to distributed systems as well. The chief benefit of this method, which has been in use for some ten years now, is its guaranteed stability under very general conditions. Applications include modeling distributed acoustic, electromagnetic, and even highly nonlinear fluid-dynamic phenomena. Current work is concerned with, in particular, a WDF version of the perfectly matched layer (PML) used for unbounded-domain problems in electromagnetics, flux splitting in the WDF framework, incorporating the entropy-variable formulation of fluid dynamics, higher-order accurate discretization formulas that preserve passivity, and other projects.
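To illustrate the lumped case that the method builds on (this is a generic WDF sketch with illustrative component values, not the distributed PDE scheme itself), the code below simulates an LC loop with wave digital filters: each element becomes a reflective one-port, and a lossless two-port series adaptor preserves the circuit's pseudo-energy, so the simulation is stable by construction.

```python
def wdf_lc_oscillator(L=1e-3, C=1e-6, fs=48000.0, n_samples=2000):
    """Simulate an ideal LC loop with wave digital filters.

    Capacitor one-port: b[n] = a[n-1],  port resistance R_C = 1/(2*fs*C).
    Inductor one-port:  b[n] = -a[n-1], port resistance R_L = 2*fs*L.
    Two-port series adaptor: b_k = a_k - (2*R_k / (R1 + R2)) * (a1 + a2).
    """
    R1 = 2.0 * fs * L          # inductor port resistance (bilinear transform)
    R2 = 1.0 / (2.0 * fs * C)  # capacitor port resistance
    g1 = 2.0 * R1 / (R1 + R2)
    g2 = 2.0 * R2 / (R1 + R2)

    b_ind = 0.0   # wave leaving the inductor
    b_cap = 1.0   # wave leaving the capacitor (capacitor initially charged)
    v_cap = []
    for _ in range(n_samples):
        a1, a2 = b_ind, b_cap          # waves entering the adaptor
        s = a1 + a2
        b1 = a1 - g1 * s               # waves leaving the adaptor
        b2 = a2 - g2 * s
        v_cap.append(0.5 * (a2 + b2))  # port voltage = (incident + reflected)/2
        b_ind = -b1                    # inductor: negate, one-sample delay
        b_cap = b2                     # capacitor: one-sample delay
    return v_cap
```

Because the adaptor is lossless and both elements are passive, the output oscillates indefinitely with bounded amplitude regardless of the component values chosen, which is the stability-by-construction property the text describes.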

Several pitch detection algorithms have been examined for use in interactive computer-music performance. We define criteria necessary for successful pitch tracking in real-time and survey four tracking techniques: Harmonic Product Spectrum (HPS), Cepstrum-Biased HPS (CBHPS), Maximum Likelihood (ML), and the Weighted Autocorrelation Function (WACF).
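Of the surveyed techniques, the Harmonic Product Spectrum is the simplest to sketch (the window length, harmonic count, and search range below are illustrative, not the paper's settings): the magnitude spectrum is downsampled by integer factors and the copies are multiplied, so the harmonics of the true fundamental line up at its bin.

```python
import numpy as np

def hps_pitch(x, fs, n_harmonics=5):
    """Estimate the fundamental frequency via the Harmonic Product Spectrum."""
    n = len(x)
    spec = np.abs(np.fft.rfft(x * np.hanning(n)))
    hps = spec.copy()
    for r in range(2, n_harmonics + 1):
        dec = spec[::r]                 # spectrum downsampled by factor r
        hps[:len(dec)] *= dec           # harmonics align at the f0 bin
    lo = int(30 * n / fs)               # ignore DC / very low bins
    peak = lo + np.argmax(hps[lo:len(spec) // n_harmonics])
    return peak * fs / n

# usage sketch: a 220 Hz tone with five harmonics
fs = 8000
t = np.arange(4096) / fs
x = sum(np.sin(2 * np.pi * 220 * h * t) / h for h in range(1, 6))
f0 = hps_pitch(x, fs)
```

The resolution is limited by the FFT bin width (here about 2 Hz), one reason real-time trackers refine the raw HPS peak.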

**Reference:**

- Cuadra, P., Master, A., and Sapp, C.,
*Efficient Pitch Detection Techniques for Interactive Music*, Proceedings of the 2001 International Computer Music Conference, Havana, Cuba, pp. 403-406. Computer Music Association. Available online at`http://www-ccrma.stanford.edu/~craig/papers/01/icmc01-pitch.pdf`.

A method for estimating multiple, simultaneous fundamental frequencies in a polyphonic audio spectrum is presented. The method takes advantage of the power of genetic algorithms to explore a large search space, and to find a globally optimal combination of fundamental frequencies that best models the polyphonic signal spectrum. A genetic algorithm with variable chromosome length, a special crossover operator and other features is proposed. No a priori knowledge about the number of fundamental frequencies present in the spectrum is assumed. Assessment of the first version of this method has shown correct detection (in number and value) of up to five fundamental frequencies. Planned refinements to the genetic algorithm operators could enhance this performance.
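A toy sketch of the general idea (not the authors' actual operators or fitness function; every constant here is illustrative): chromosomes are variable-length lists of candidate fundamentals, fitness rewards the spectral energy covered by each candidate's harmonics while penalizing extra candidates, and mutation can add, drop, or perturb a fundamental.

```python
import random

def make_spectrum(f0s, fmax=2000, n_harm=4):
    """Toy magnitude 'spectrum' with 1 Hz bins and decaying harmonic peaks."""
    spec = [0.0] * fmax
    for f0 in f0s:
        for h in range(1, n_harm + 1):
            f = int(round(h * f0))
            if f < fmax:
                spec[f] += 1.0 / h
    return spec

def fitness(chrom, spec, n_harm=4, cost=0.6):
    """Energy covered by the candidates' harmonics, minus a per-candidate cost."""
    covered = set()
    for f0 in chrom:
        for h in range(1, n_harm + 1):
            b = int(round(h * f0))
            if b < len(spec):
                covered.add(b)
    return sum(spec[b] for b in covered) - cost * len(chrom)

def mutate(chrom, rng):
    chrom = list(chrom)
    op = rng.random()
    if op < 0.3 and len(chrom) > 1:
        chrom.pop(rng.randrange(len(chrom)))               # drop a candidate
    elif op < 0.6:
        chrom.append(rng.uniform(80, 500))                 # add a candidate
    else:
        i = rng.randrange(len(chrom))
        chrom[i] = max(80, min(500, chrom[i] + rng.gauss(0, 5)))  # perturb
    return chrom

def ga_multi_f0(spec, pop_size=60, gens=200, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(80, 500)] for _ in range(pop_size)]
    best = max(pop, key=lambda c: fitness(c, spec))
    for _ in range(gens):
        scored = sorted(pop, key=lambda c: fitness(c, spec), reverse=True)
        if fitness(scored[0], spec) > fitness(best, spec):
            best = scored[0]
        parents = scored[:pop_size // 2]
        pop = [best]                                       # elitism
        while len(pop) < pop_size:
            a, b = rng.sample(parents, 2)
            child = a[:len(a) // 2 + 1] + b[len(b) // 2:]  # crude list crossover
            pop.append(mutate(child, rng))
    return best

# usage sketch: a spectrum built from two fundamentals
spec = make_spectrum([200.0, 310.0])
best = ga_multi_f0(spec)
```

With enough generations the surviving candidates typically settle near the true fundamentals, and the variable chromosome length lets the population discover the number of sources on its own.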

We present techniques for the efficient synthesis of everyday sounds, that is, sounds like rain, fire, breaking glass, scraping and bouncing objects, etc. These sounds present dynamic temporal and spectral states that cannot be described by either deterministic or stochastic models alone (Cook, 1997; Roads, 1997). We propose a conceptually simple method for resynthesizing decorrelated, unique sound events using constrained parametric control of stochastic processes.

Granular synthesis has proven to be a convenient and efficient method for stochastic, time-based synthesis (Truax, 1988). To better control spectral details, we extend asynchronous granular synthesis to include phase-correlation, time-dependent overlap, amplitude scaling, and synchronicity between granular streams. We propose a representation of ecologically-based sound events comprising three control levels: micro, meso, and macro. By having a control structure across all three time resolutions we can better manage time-frequency boundary phenomena, thus taking into account windowing and overlap effects, spectral evolutions, and emergent perceptual properties.
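A minimal asynchronous granular stream, before any of the extensions above are applied (all constants here are illustrative, not the authors' control structure): Hann-windowed sinusoidal grains with randomized onsets, frequencies, and phases are overlap-added into an output buffer.

```python
import numpy as np

def granular_stream(dur=1.0, fs=22050, grain_dur=0.05, density=80, seed=1):
    """Asynchronous granular synthesis: randomly timed, Hann-windowed grains."""
    rng = np.random.default_rng(seed)
    n = int(dur * fs)
    g = int(grain_dur * fs)
    win = np.hanning(g)
    t = np.arange(g) / fs
    out = np.zeros(n + g)                 # extra room for the last grain
    for _ in range(int(density * dur)):
        onset = rng.integers(0, n)        # asynchronous onset
        freq = rng.uniform(300, 3000)     # per-grain frequency
        phase = rng.uniform(0, 2 * np.pi) # uncorrelated phase
        grain = win * np.sin(2 * np.pi * freq * t + phase)
        out[onset:onset + g] += 0.1 * grain   # overlap-add
    return out[:n]

y = granular_stream()
```

The phase-correlation, synchronized-stream, and time-dependent-overlap extensions described above would constrain exactly the quantities this sketch randomizes freely.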

**Related Article**

- Keller, D. ``Introduction to the Ecological Approach.''
*Virtual Sound*, R. Bianchini and A. Cipriani, Eds. Contempo Edizioni, Italy, 1999. (CD-ROM)

Environmental sounds present a difficult problem for sound modeling because spectral and temporal cues are tightly correlated. These cues interact to produce sound events with complex dynamics. In turn, these complex sounds form large classes which can be defined by statistical measurements. Thus, environmental sounds cannot be handled by traditional deterministic synthesis methods. The objective of this project is to implement algorithmic tools that allow sound events to be defined through multilevel parameter manipulation.

Micro-level representations of sounds provide a way to control spectral and spatial cues in sound synthesis. Meso-level representations determine the temporal structure of sound events. By integrating these approaches into a coherent data structure we expect to be able to model sound events with complex dynamic evolutions both at a micro and at a meso level. Consequently, these tools will extend the parameter space of ecological models to include spectral and spatial cues.

The best current methods for high-quality, low-bit-rate audio compression are based on filterbanks. While current algorithms such as MPEG-AAC (Advanced Audio Coding) achieve very high data efficiency, it is very difficult to perform modifications such as time stretching and pitch shifting on the compressed data.

In this study, we investigate a more flexible audio model that achieves competitive, scalable data compression rates while allowing simple modifications of the compressed data. Through a combination of multiresolution sinusoidal modeling, transient modeling, and noise modeling, we obtain a scalable, efficient audio data representation that is also easy to modify.

See Also: http://www-ccrma.stanford.edu/~scottl/thesis.html

With the increased communication bandwidth available in recent years, interest has grown in full-duplex, multichannel sound for telepresence services because of its potential to provide a much richer listening experience. Nevertheless, in any full-duplex audio connection, acoustic echoes arise from the coupling between loudspeaker(s) and microphone(s) placed in the same room. Moreover, multichannel acoustic echo cancellation is known to be a mathematically ill-conditioned problem: because the signals in the multiple channels are highly correlated, an adaptive echo canceller tends to converge to a degenerate solution and fails to identify the true coupling paths.

Although several types of algorithms for decorrelating the channels have been proposed to regularize the problem in the context of speech teleconferencing, these algorithms were all developed under the criterion that no sound other than the speech signals should be heard. However, this criterion need not apply in applications such as video games and the performing arts, where background sound effects and background music are common and desirable.

I am currently working on utilizing background sounds for multichannel echo canceling. Methods are developed to generate arbitrarily many orthogonal and perceptually similar sounds from a mono source, and the sounds are fed into a multichannel echo canceler for the canceler to better identify the echo paths.

The acoustic echo cancellation (AEC) experiments conducted so far show that there are tunable parameters in both the sound pre-processing stage and the adaptive learning stage. While these parameters are presently selected in an *ad hoc* way, I would like to pursue formulating their selection as an optimization problem. Another interesting direction is to design an adaptive learning algorithm customized to the AEC problem, possibly with some physical knowledge incorporated.

**Reference:**

- Liu, Y. and Smith, J.O. (2002)
*Perceptually similar orthogonal sounds and applications to multichannel acoustic echo canceling*, Proceedings of the AES 22nd International Conference, Espoo, Finland (June 2002). Available online at `http://www.stanford.edu/~jacobliu/papers/`.

We are interested in tackling the single-channel sound source separation problem for voice and non-voice signals. An interesting task would be to separate singing from instrumental accompaniment, pianos or guitars for example. In that case, it is crucial to estimate the glottal source of the voice part in the presence of interfering sinusoids.

The focus of our ongoing research is to study linear prediction analysis of voice and to come up with new methods for separating voice and non-voice components from a single-channel mixture.

In particular, we have worked on an adaptive linear prediction (LP) analysis framework based on the LMS algorithm. The adaptive algorithm is causal and has the potential to follow the statistics of the voice more closely. However, the estimated LP coefficients fluctuate around the optimal solution due to the nature of the LMS algorithm.
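A minimal version of such an adaptive predictor (here normalized LMS rather than plain LMS, with illustrative constants): the filter predicts each sample from the previous p samples and nudges its coefficients along the instantaneous error gradient, sample by sample.

```python
import numpy as np

def nlms_lp(x, order=8, mu=0.2, eps=1e-8):
    """Adaptive linear prediction via normalized LMS.

    Returns the prediction-error signal. The coefficient vector w is
    updated causally, so it can track slowly varying statistics, at the
    cost of fluctuating around the optimal (Wiener) solution.
    """
    w = np.zeros(order)
    err = np.zeros(len(x))
    for n in range(order, len(x)):
        past = x[n - order:n][::-1]      # most recent sample first
        err[n] = x[n] - w @ past
        w += (mu / (eps + past @ past)) * err[n] * past   # NLMS update
    return err

# usage sketch: on an AR(2) signal the error shrinks toward the innovation
rng = np.random.default_rng(0)
e = rng.standard_normal(5000) * 0.01
x = np.zeros(5000)
for n in range(2, 5000):
    x[n] = 1.6 * x[n - 1] - 0.81 * x[n - 2] + e[n]
res = nlms_lp(x)
```

After convergence the residual power sits slightly above the innovation power; that excess is exactly the coefficient fluctuation mentioned above, and it grows with the step size mu.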

The bandlimited digital synthesis model of Stilson and Smith is extended with a single feed-forward comb filter. Comb filter techniques are shown to produce a variety of classic analog waveform effects, including waveform morphing, pulse-width modulation, harmonization, and frequency modulation. The techniques discussed do not guarantee perfect bandlimiting; however, they are generally applicable to any waveform synthesis method.
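The building block involved is the feed-forward comb y[n] = x[n] + g·x[n-M]; for example, with g = -1 a comb turns a sawtooth into a rectangular pulse wave whose duty cycle is set by the delay M, which is one route to pulse-width modulation. A minimal sketch (using a naive, non-bandlimited sawtooth purely for illustration):

```python
import numpy as np

def ff_comb(x, M, g):
    """Feed-forward comb filter: y[n] = x[n] + g * x[n - M]."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[M:] += g * x[:-M]
    return y

# usage sketch: sawtooth minus delayed sawtooth = pulse wave
P = 100                                  # period in samples
saw = (np.arange(1000) % P) / P          # naive sawtooth in [0, 1)
pulse = ff_comb(saw, 30, -1.0)           # duty cycle = 30/100
```

Varying M smoothly (with an interpolated delay) then sweeps the pulse width, mirroring the analog behavior described above.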


The sinusoidal model has been a fundamentally important signal representation for coding and analysis of audio. We present an enhancement to sinusoidal modeling in the form of a linear frequency chirp parameter estimator applicable to Hann-windowed quasi-sinusoidal signals. The estimator relies on models of the phase curvature and peak width of a given chirp signal's FFT magnitude domain peak. We show that different models are applicable for smaller and larger values of the chirp parameter, derived respectively from Taylor series and Fresnel integral analysis of the signal. We construct an estimator for the transition region between the two models via a neural net. Results indicate that the estimator is robust to noise and outperforms previously known chirp parameter estimators for Hann-windowed signals.

**References:**

- Aaron S. Master and Yi-Wen Liu (2003),
*Robust Chirp Parameter Estimation for Hann Windowed Signals*, Proceedings of the 2003 IEEE International Conference on Multimedia and Expo (ICME), Baltimore, Maryland.
- Aaron S. Master and Yi-Wen Liu (2003),
*Nonstationary Sinusoidal Modeling with Efficient Estimation of Linear Frequency Chirp Parameters*, Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong.
- Aaron Master (2002),
*Nonstationary Sinusoidal Modeling*, Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL.

``Scanned synthesis'' is done by scanning the slowly varying shape of an object and converting this shape to samples of a sound wave. The shape of the object is determined by the dynamic reactions of the object to forces applied by the performer. These forces vary at ``haptic'' rates (0-20 Hz). If the scanning path is closed, the sound wave is quasiperiodic and a fundamental pitch is perceived at the cycling frequency (20 Hz-20 kHz). Scanned synthesis provides direct dynamic control by the performer over the timbre of sounds as they are produced. The object can be real or simulated. With finite-element models, we have simulated the one-dimensional wave equation for a generalized slowly vibrating string. Timbres generated by manipulating the string at haptic rates are perceived as having a very pleasing live quality caused by the continually changing spectrum. To achieve additional richness, the performer can change the properties of the string in time and over the length of the string.

Developed at the Interval Research Corporation in 1998 and 1999, Scanned Synthesis is a new technique for the synthesis of musical sounds. We believe it will become as important as existing methods such as wave table synthesis, additive synthesis, FM synthesis, and physical modeling. Scanned Synthesis is based on the psychoacoustics of how we hear and appreciate timbres and on our motor control (haptic) abilities to manipulate timbres during live performance. A unique feature of scanned synthesis is its emphasis on the performer's control of timbre.

Scanned synthesis involves a slow dynamic system whose frequencies of vibration are below about 15 Hz. The system is directly manipulated by motions of the performer. The vibrations of the system are a function of the initial conditions, the forces applied by the performer, and the dynamics of the system. Examples include slowly vibrating strings, two-dimensional surfaces obeying the wave equation, and a waterbed. We have simulated the string and surface models on a computer. Our waterbed model is purely conceptual.

The ear cannot hear the low frequencies of the dynamic system. To make audible frequencies, the "shape" of the dynamic system, along a closed path, is scanned periodically. The "shape" is converted to a sound wave whose pitch is determined by the speed of the scanning function. Pitch control is completely separate from the dynamic system control. Thus timbre and pitch are independent. This system can be looked upon as a dynamic wave table controlled by the performer.

The psychophysical basis for Scanned Synthesis comes from our knowledge of human auditory perception and human motor control. In the 1960s, Risset showed that the spectra of interesting timbres must change with time. We observe that musically interesting rates of change are below about 15 Hz, which is also the rate at which humans can move their bodies. We have named these haptic rates.

We have studied Scanned Synthesis chiefly with a finite element model of a generalized string. Cadoz showed the musical importance of finite element models in the 1970s. Our models differ from Cadoz's in our focus on slow (haptic) vibration frequencies. Our finite element models are collections of masses connected by springs and dampers, and they can be analyzed with Newton's laws. We have generalized a traditional string by adding dampers and springs to each mass. All parameters (mass, damping, earth-spring strength, and string tension) can vary along the string. The performer manipulates the model by pushing or hitting different masses and by manipulating parameters.
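The two time scales can be sketched in a few lines (all constants here are illustrative, not the Interval implementation): a damped mass-spring ring is advanced at a slow haptic rate, and its instantaneous shape is scanned cyclically at an audio-rate pitch, exactly like a wavetable whose contents evolve under Newton's laws.

```python
import numpy as np

def scanned_synthesis(n_masses=64, fs=22050, dur=1.0, pitch=110.0,
                      tension=4.0, earth=0.5, damping=0.02, haptic_hz=200):
    """Scan the evolving shape of a slow mass-spring string as a wavetable."""
    # string state: displacement x and velocity v of each mass (closed path)
    x = np.sin(2 * np.pi * np.arange(n_masses) / n_masses)  # initial 'pluck'
    v = np.zeros(n_masses)
    dt = 1.0 / haptic_hz
    out = np.zeros(int(dur * fs))
    phase = 0.0
    samples_per_step = int(fs / haptic_hz)
    for i in range(len(out)):
        if i % samples_per_step == 0:      # update the shape at haptic rate
            neighbors = np.roll(x, 1) + np.roll(x, -1) - 2 * x
            a = tension * neighbors - earth * x - damping * v
            v += a * dt                    # Newton's laws, semi-implicit Euler
            x += v * dt
        # audio-rate scan of the current shape (linear interpolation)
        pos = phase * n_masses
        j = int(pos) % n_masses
        frac = pos - int(pos)
        out[i] = (1 - frac) * x[j] + frac * x[(j + 1) % n_masses]
        phase = (phase + pitch / fs) % 1.0
    return out

y = scanned_synthesis(dur=0.5)
```

Note how pitch appears only in the scan-rate increment, while the timbre lives entirely in the slowly evolving shape, the separation of pitch and timbre described above.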

We have already synthesized rich and interesting timbres and we have barely started to explore the range of possibilities in our present models. Many other different models can be conceived. We find the prospects exciting.

In this work, we have explored the possibilities of the sinusoidal model as a frequency-domain representation for perceptual audio coding of various types of audio signals. We have designed a set of techniques for data rate reduction and developed a codec software prototype consisting of three basic blocks:

- Partial pruning based upon psychoacoustic masking.
- Smart sinusoidal frame decimation based upon transient detection.
- Bit allocation based upon psychoacoustic masking, and quantization.
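A crude sketch of the first block, partial pruning (the triangular spreading model and every threshold below are simplified stand-ins for a real psychoacoustic model, not the codec's actual rules): each partial is compared against the masking threshold induced by the louder partials around it, and sub-threshold partials are dropped.

```python
from math import log2

def prune_partials(partials, offset_db=12.0, slope_db_per_oct=27.0):
    """Drop partials masked by louder neighbors.

    partials: list of (freq_hz, level_db). A masker at level L is assumed
    to mask everything below L - offset_db at its own frequency, with the
    threshold falling off by slope_db_per_oct per octave of distance.
    """
    kept = []
    for i, (f, lev) in enumerate(partials):
        masked = False
        for j, (fm, lm) in enumerate(partials):
            if i == j:
                continue        # a partial does not mask itself
            thr = lm - offset_db - slope_db_per_oct * abs(log2(f / fm))
            if lev < thr:
                masked = True
                break
        if not masked:
            kept.append((f, lev))
    return kept

# usage sketch: the -40 dB partial at 1100 Hz is masked by the 0 dB partial
# at 1000 Hz, while the equally quiet partial at 4000 Hz is far enough
# away in frequency to survive
parts = [(1000.0, 0.0), (1100.0, -40.0), (4000.0, -40.0)]
survivors = prune_partials(parts)
```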

We have evaluated the codec on monophonic musical instruments (harmonic and inharmonic), polyphonic orchestral music, singing voice and speech. Results have been quite satisfying and have shown that the sinusoidal model can be used to achieve interesting compression factors at high quality, for a wide variety of audio signals. In particular, we believe this work shows that the sinusoidal model is not at all limited to monophonic, harmonic signals, when high quality audio compression is the goal.

Recent advances in optimization theory have made it feasible to solve a class of very large-scale optimization problems in an efficient manner. Specifically, if a problem can be shown to be *convex*, then one can make use of recent advances in interior-point optimization methods to achieve optimal solutions to problems whose scale is beyond the capabilities of more traditional optimization techniques.

Many interesting problems in audio and acoustical signal processing can be shown to belong to the class of convex optimization problems. My research has focused on several of these problems.

**Inverse Filtering of Room Acoustics - Echo Cancellation**: Previous research in the field has shown that, under certain conditions, one can achieve perfect cancellation of a room's acoustic response using multiple sources. However, no technique had been presented for designing a set of optimal filters to achieve this goal. My work in this area applies convex optimization theory to determine such an optimal set of filters.

**Broadband Acoustical Arrays**: Beamforming using arrays of transducers operating at a single frequency is a well-understood and well-researched topic. Typical audio applications, however, require an array to perform over a wide range of frequencies (typically 1-2 octaves). The problem becomes one of designing multiple filters (one per transducer) to achieve the desired beam pattern over the frequency range of interest.

A real-time system is being developed for both of the above applications. This system is capable of measuring, and subsequently implementing, a parallel bank of filters. The `Frankenstein' hardware is used to allow for up to 16 separate channels of audio. A version of the software using commercially available DSP hardware will be available from `http://www.ccrma.stanford.edu/~putnam`.

**Fractional Delay Filters**: In order to accurately *tune* a physical model of a musical instrument, delay lines with fractional sample delay are needed. To achieve this, one needs to implement a filter whose group delay is a fraction of a sample. Typical optimal filter design methods such as the Remez exchange do not extend to the case where the desired frequency response is complex, as it is here. This problem is convex, however, and hence can be solved with interior-point methods.
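One standard baseline for such a fractional delay (the classical Lagrange-interpolation design, which the convex-optimization approach above would replace or refine) is a short FIR filter whose coefficients are the Lagrange basis evaluated at the desired delay:

```python
import numpy as np

def lagrange_fd(order, delay):
    """FIR fractional-delay filter via Lagrange interpolation.

    h[k] = prod over m != k of (delay - m) / (k - m). For an integer
    delay this collapses to a pure shift; in between, the response is
    maximally flat at low frequencies.
    """
    h = np.ones(order + 1)
    for k in range(order + 1):
        for m in range(order + 1):
            if m != k:
                h[k] *= (delay - m) / (k - m)
    return h

# usage sketch: order-3 filter for a delay of 1.5 samples
h = lagrange_fd(3, 1.5)                 # coefficients sum to 1 (unity DC gain)
y = np.convolve(np.sin(0.1 * np.arange(200)), h)
```

The design is cheap to update when the delay changes, which matters when a physical model is retuned continuously.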

As hardware technology advances rapidly, computers and portable wireless devices are becoming increasingly integrated. In this noisy and power-hungry environment, a fast and efficient signal-processing front end that compactly represents the audio signal is important.

The need to perform signal processing at extremely low power, for applications such as a portable MP3 player or cellular phone, motivates the study of FIR filters having only a few taps with small integer coefficients. Digital watches, for example, have no floating-point operations at all.

The proposed octave-band filter bank, built from Gabor-like wavelets, consists of 12 bandpass filters covering a frequency range from 1 Hz to 4 kHz. The signal is stored in a one-dimensional buffer before processing, just long enough to be processed by the lowpass and highpass filters, which makes both filters non-causal and symmetric. The buffer index is partitioned so that downsampling by two happens automatically.

What is unique in this approach is the organization of the code: one loop goes through the samples, performing just a few assignment statements per sample. The proposed algorithm runs in O(N) time, much faster than the widely used FFT-based algorithms. The only operations needed are addition and bit shifting; there are no floating-point operations in the algorithm. The resolution of the filter is 6 samples per cycle at the maximum sensitivity of each filter.
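The flavor of the computation can be seen in a tiny add-and-shift smoother (an illustrative example, not the actual 12-band code): a 3-tap binomial lowpass whose coefficients (1, 2, 1)/4 need only additions and bit shifts.

```python
def binomial_smooth(x):
    """Integer-only lowpass: y[n] = (x[n-1] + 2*x[n] + x[n+1]) >> 2.

    The multiply by 2 is a left shift and the divide by 4 a right shift,
    so the whole filter uses only additions and shifts -- no floating
    point, as required on a low-power device. Like the filters above it
    is non-causal and symmetric (zero phase) over the buffered samples.
    """
    y = list(x)
    for n in range(1, len(x) - 1):
        y[n] = (x[n - 1] + (x[n] << 1) + x[n + 1]) >> 2
    return y
```

The filter passes DC unchanged and nulls the Nyquist frequency, which is the behavior a halfband smoothing stage needs before downsampling by two.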

Spectral Modeling Synthesis (SMS) is a set of techniques and software implementations for the analysis, transformation and synthesis of musical sounds. SMS software implementations were first done by Xavier Serra and Julius Smith at Stanford University, and more recently by the first author and the music technology group of the Audiovisual Institute of the Pompeu Fabra University in Barcelona. The aim of this work is to obtain general and musically meaningful sound representations based on analysis, from which musical parameters can be manipulated while maintaining high sound quality. These techniques can be used for synthesis, processing and coding applications, while some of the intermediate results might also be applied to other music related problems, such as sound source separation, musical acoustics, music perception, or performance analysis.

Our current focus is on the development of a general purpose musical synthesizer. This application goes beyond the analysis and resynthesis of single sounds and some of its specific requirements are:

- it should work for a wide range of sounds;
- it should have an efficient real time implementation for polyphonic instruments;
- the stored data should take little space;
- it should be expressive and have controls that are musically meaningful;
- a wide range of sound effects, such as reverberation, should be easily incorporated into the synthesis without much extra cost.

The implementation of these techniques has been done in C++ and Matlab, and the graphical interfaces with Visual C++ for Windows 95. Most of the software and the detailed specifications of the techniques and protocols used are publicly available via the SMS Web site.

The Doppler effect causes the pitch of a sound source to appear to rise or fall due to motion of the source and/or listener relative to each other. The Doppler effect has been used to enhance the realism of simulated moving sound sources for compositional purposes, and it is an important component of the ``Leslie effect.'' The Leslie is a popular audio processor used with electronic organs and other instruments. It employs a rotating horn and rotating speaker port to ``choralize'' the sound. Since the horn rotates within a cabinet, the listener hears multiple reflections at different Doppler shifts, giving a kind of chorus effect. Additionally, the Leslie amplifier distorts at high volumes, producing a pleasing ``growl'' highly prized by keyboard players.

In this research, an efficient algorithm for simulating the Doppler effect using interpolating and de-interpolating delay lines was developed. The Doppler simulator is used to simulate a rotating horn to achieve the Leslie effect. Measurements of a horn from a real Leslie were used to calibrate angle-dependent digital filters which simulate the changing, angle-dependent, frequency response of the rotating horn.
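The core of such a simulator (a minimal sketch; the full system adds de-interpolating writes and the angle-dependent horn filters) is a delay line read at a time-varying, interpolated position. A delay that shrinks at rate s raises the pitch by the factor 1 + s, emulating an approaching source:

```python
import numpy as np

def doppler_read(x, d0, slope):
    """Read signal x through a linearly shrinking delay (linear interpolation).

    Delay d(n) = d0 - slope * n samples, so the read position advances at
    rate 1 + slope and the output pitch is scaled by the same factor.
    """
    y = np.zeros(len(x))
    for n in range(len(x)):
        pos = n - (d0 - slope * n)      # fractional read position
        i = int(np.floor(pos))
        frac = pos - i
        if 0 <= i and i + 1 < len(x):
            y[n] = (1 - frac) * x[i] + frac * x[i + 1]
    return y

# usage sketch: a 220 Hz tone read with slope 0.1 comes out near 242 Hz
fs = 8000
x = np.sin(2 * np.pi * 220 * np.arange(4000) / fs)
y = doppler_read(x, d0=300.0, slope=0.1)
```

In the Leslie simulation the delay trajectory is sinusoidal (the horn rotates), so the pitch shift sweeps up and down once per rotation.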

**Reference:**

- Julius O. Smith, Jonathan Abel, Stefania Serafin, and David Berners,
*Doppler Simulation and the Leslie*, in Proceedings of the International Conference on Digital Audio Effects (DAFx-02), Hamburg, Germany, September 26 2002, pp. 13-20.

This project deals with the design of digital filters, oscillators, and other structures that have parameters that can be varied efficiently and intuitively. The main criteria for the algorithms are:

- **Efficiency:** The algorithms are intended to be as efficient as possible. This constraint is weighted very heavily in design decisions.
- **Non-Complexity of Controls:** As a large part of efficiency, the amount of processing that must be done on an input control to make it useful to the algorithm should be minimized. For example, a filter may have ``center frequency'' as a control input but actually go through a series of expensive calculations to turn it into the lower-level coefficients used in the filter computation. Another filter may have a design whereby center frequency goes directly into the filter with little change, and the filter uses it in a rather simple calculation (i.e., the ugly math has not simply been absorbed into the filter). This constraint often influences the choice of basic algorithms, but it also influences the control paradigms. For example, some algorithms may turn out to be vastly more efficient if given some variation of frequency as an input, say period or log(frequency). To remain efficient, the control paradigm may also need to change (the whole system may use period rather than frequency, for example); otherwise there will be excessive parameter conversions, which violate the control-complexity criterion.
- **Intuitiveness of Controls:** As alluded to in the previous item, certain forms of controls can be more efficient than others. Unfortunately, some efficient parameters may be hard for an end user to work with; a musician will likely prefer to specify center frequency to a filter algorithm rather than filter coefficients. To make algorithms usable, one must either introduce parameter-conversion procedures (inefficient) or look for an algorithm that has the desired inputs yet remains efficient.

Often, one decides that a certain amount of inefficiency is livable, and in cases where a parameter changes only rarely, large amounts of inefficiency can be tolerated. But when a parameter must change very often, such as in a smooth sweep or a modulation, inefficiency is intolerable.

In this project, the main application is the field referred to as ``virtual analog synthesis,'' which implements analog synthesis algorithms (in particular, subtractive synthesis) in digital systems. Characteristic of many analog patches was the blurring of the distinction between control signals and audio signals, as in modulation schemes, and the ability to dynamically (smoothly) control any parameter. Both of these abilities require parameters to change at very high rates, even as fast as the sampling rate; hence the need for efficiently controllable algorithms.

Two subprojects within this project are currently being researched. The first is the design and implementation of an efficient signal generator producing bandlimited pulse trains, square waves, and sawtooth waves. The algorithm is being designed for basic efficiency, with particular attention to efficient variation of its main parameters: frequency and duty cycle.
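The classic bandlimited impulse train (BLIT) of Stilson and Smith illustrates the kind of generator involved (the parameter choices below are illustrative): an impulse train with period P samples and an odd number M of spectral components, generated in closed form as a ratio of sines so that no component exceeds the Nyquist limit.

```python
import numpy as np

def blit(n_samples, period, n_harmonics=None):
    """Bandlimited impulse train via the sum-of-sines closed form.

    y[n] = sin(pi*M*n/P) / (P * sin(pi*n/P)), with M odd and M <= P so
    that all components stay below Nyquist. Near the singularities
    (n a multiple of P) the limit M/P is substituted.
    """
    P = float(period)
    if n_harmonics is None:
        M = int(P) if int(P) % 2 == 1 else int(P) - 1   # largest odd M <= P
    else:
        M = n_harmonics
    n = np.arange(n_samples)
    den = np.sin(np.pi * n / P)
    y = np.full(n_samples, M / P)
    ok = np.abs(den) > 1e-9
    y[ok] = np.sin(np.pi * M * n[ok] / P) / (P * den[ok])
    return y

# usage sketch: period 100 samples, 21 components (10 harmonics plus DC)
y = blit(1000, 100, n_harmonics=21)
```

Integrating such a train (leaky integration in practice) yields a bandlimited sawtooth, and differencing two trains offset in time yields a variable-duty-cycle pulse wave, the waveforms named above.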

Second, the connections between control-system theory and filter theory are being explored. One particular avenue of research is the application of root-locus design techniques to audio filter design. Root locus explores the movement of system (filter) poles as a single parameter changes. Certain patterns in root loci appear repeatedly, and can be used in audio filter design to get various effects. A good example is the Moog VCF, which uses one of the most basic patterns in root-locus analysis to generate a filter that has trivial controls for both corner frequency and Q. Several other families of sweepable digital filters based on root locus have already been found. A particular goal is to find a filter family that efficiently implements constant-Q sweepable digital filters (a problem that turns out to be particularly simple in continuous time, as in the Moog VCF, but quite difficult in discrete time).

de Broglie's hypothesis from quantum mechanics states that matter can behave as either a particle or a wave; a system of particles can thus become a complex superposition of dynamic waves. Motivated by this, the author develops a method for the sonification of particle systems in a logical manner. Thinking of sound in terms of an evolving system of particles, potentials, and initial conditions yields a unique perspective. A direct correspondence between sound composition and many-body physics allows ideas from each field to enrich the other, such as using sound to gain a deeper comprehension of a phenomenon, or using radioactivity as a compositional device. One application explored so far has been algorithmic composition using a simulated particle system. It has been readily observed that the composer must also become a physicist to make effective musical use of these techniques. Paradoxically, the audience need not be versed in physics to visualize and appreciate what they hear, a sign of a successful analogue. But by the very act of uniting physics and music, several interesting questions arise, encouraging a possible meta-philosophy of the two. The traditional purposes, meanings, and practices of each are challenged, and the results are very pertinent to our current techno-culture. Several sound examples will be presented, and, if accepted for programming, the first composition made with these techniques: *50 Particles in a Three-Dimensional Harmonic Potential: An Experiment in 5 Movements*.

We propose an iterative filterbank method for tracking the parameters of exponentially damped sinusoidal components of quasi-harmonic sounds. The quasi-harmonic criteria specialize our analysis to a wide variety of acoustic instrument recordings while allowing for inharmonicity. The filterbank splits the recorded signal into subbands, one per harmonic, in which time-varying parameters of multiple closely-spaced sinusoids are estimated using a Steiglitz-McBride/Kalman approach. Averaged instantaneous frequency estimates are used to update the center frequencies and bandwidths of the subband filters; by so doing, the filterbank progressively adapts to the inharmonicity structure of a source recording.

We employ a hybrid state-space sinusoidal model for general use in analysis-synthesis based audio transformations. This model combines the advantages of a source-filter model with the flexible, time-frequency based transformations of the sinusoidal model.

For this paper, we specialize the parameter identification task to a class of ``quasi-harmonic'' sounds. The latter represent a variety of acoustic sources in which multiple, closely spaced modes cluster about principal harmonics loosely following a harmonic structure (some inharmonicity is allowed). To estimate the sinusoidal parameters, an iterative filterbank splits the signal into subbands, one per principal harmonic. Each filter is optimally designed by a linear programming approach to be concave in the passband, monotonic in transition regions, and to specifically null out sinusoids in other subband regions. Within each subband, the constant frequencies and exponential decay rates of each mode are estimated by a Steiglitz-McBride approach, then time-varying amplitudes and phases are tracked by a Kalman filter. The instantaneous phase estimate is used to derive an average instantaneous frequency estimate; the latter, averaged over all modes in the subband region, updates the filter's center frequency for the next iteration. In this way, the filterbank structure progressively adapts to the specific inharmonicity structure of the source recording. Analysis-synthesis applications are demonstrated with standard (time/pitch-scaling) transformation protocols, as well as some possibly novel effects facilitated by the ``source-filter'' aspect.

Nonlinear elements have manifold uses in acoustic modeling, audio synthesis and effects design. Of particular importance is their capacity to control oscillation dynamics in feedback models, and their ability to provide digital systems with a natural overdrive response. Unfortunately, nonlinearities are a major source of aliasing in a digital system. In this paper, alias suppression techniques are introduced which are particularly tailored to preserve response dynamics in acoustic models. To this end, a multirate framework for alias suppression is developed along with the concept of an aliasing signal-to-noise ratio (ASNR). Analysis of this framework proceeds as follows: first, relations are established between ASNR and computational cost/delay given an estimate of the reconstructed output spectrum; second, techniques are given to estimate this spectrum in the worst case given only a few statistics of the input (amplitude, bandwidth and DC offset). These tools are used to show that "hard" circuit elements (i.e., the saturator, rectifier, and other piecewise linear systems found in bowed-string and single-reed instrument models) generate significant aliasing under reasonable computational constraints. To solve this problem, a parameterizable, general-purpose method for constructing monotonic "softening approximations" is developed and demonstrated to greatly suppress aliasing without additional computational expense. The monotonicity requirement is sufficient to preserve response dynamics in a variety of practical cases. Applications to bowed-string modeling and virtual analog filter emulation are discussed.
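The ASNR idea can be illustrated numerically: drive a full-scale sine through a memoryless nonlinearity and compare the power landing on the intended harmonic bins with the power that folds onto all other bins. This is only a sketch of the concept (the bin choices, drive level, and the use of `tanh` as a "softening approximation" of the hard clipper are our assumptions, not the paper's construction):

```python
import numpy as np

def asnr_db(nonlinearity, drive=4.0, N=4096, k0=101):
    """Estimate an aliasing signal-to-noise ratio for a memoryless
    nonlinearity: drive a sine at FFT bin k0 through it, then compare
    power at the intended harmonic bins (m*k0 below Nyquist) against
    power everywhere else, where aliased components land."""
    n = np.arange(N)
    y = nonlinearity(drive * np.sin(2 * np.pi * k0 * n / N))
    spec = np.abs(np.fft.rfft(y)) ** 2
    harmonic_bins = np.arange(k0, N // 2 + 1, k0)  # bins m*k0 below Nyquist
    sig = spec[harmonic_bins].sum()
    alias = spec[1:].sum() - sig                   # all other bins (skip DC)
    return 10 * np.log10(sig / alias)

hard = asnr_db(lambda x: np.clip(x, -1.0, 1.0))   # "hard" saturator
soft = asnr_db(np.tanh)                           # smooth softening stand-in
print(f"hard clip ASNR: {hard:.1f} dB, tanh: {soft:.1f} dB")
```

Because the hard clipper's slope discontinuities give harmonics decaying only polynomially, far more energy folds back than with the smooth `tanh`, whose harmonics decay exponentially; the measured ASNR gap is substantial at this drive level.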

Sinusoidal models provide an intuitive, parametric representation for time-varying spectral transformations. However, resynthesis artifacts arise to the degree that the signal violates assumptions of local stationarity. Common types of transients (or locally non-stationary regions) are abrupt changes in spectra, rapidly decaying exponential modes, and rapid spectral variations (e.g. fast vibrato, chirps, etc.). These phenomena cover a considerably wider class than onset regions in monophonic contexts. Our extended sinusoidal model proceeds with a presegmentation phase followed by region-dependent modeling and resynthesis. In presegmentation, information-theoretic criteria are used to localize abrupt change boundaries, windows are aligned with segment boundaries, then segments are classified as locally stationary or transient. Locally stationary regions are handled by a sinusoids+noise model. For transients, we adapt parametric models which naturally extend the sinusoids+noise model, such as the time-varying Prony/Kalman model, to mode decay/variation problems. As well as reducing artifacts, extended sinusoids+noise models permit different kinds of processing to be applied to transients, which offers the composer considerable flexibility in timestretching-related applications. Finally, we show applications to the single-channel source separation problem and to rhythm following, using a Bayesian framework to handle side information concerning the change boundaries.
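The abrupt-change localization step can be sketched with a toy detector: compute normalized magnitude spectra of adjacent frames and threshold a symmetric Kullback-Leibler divergence between them. This is only a stand-in for the information-theoretic presegmentation criteria described above; the frame sizes, threshold, and test signal are our choices.

```python
import numpy as np

def change_boundaries(x, frame=512, hop=256, thresh=2.0):
    """Flag abrupt spectral changes by thresholding a symmetric KL
    divergence between normalized magnitude spectra of adjacent frames --
    a toy stand-in for information-theoretic change detection."""
    win = np.hanning(frame)
    nframes = 1 + (len(x) - frame) // hop
    specs = []
    for i in range(nframes):
        s = np.abs(np.fft.rfft(win * x[i * hop:i * hop + frame])) + 1e-12
        specs.append(s / s.sum())             # normalize like a probability
    bounds = []
    for i in range(1, nframes):
        p, q = specs[i - 1], specs[i]
        d = np.sum((p - q) * np.log(p / q))   # symmetric KL divergence
        if d > thresh:
            bounds.append(i * hop)            # sample index of the boundary
    return bounds

# Toy signal: a 440 Hz tone switching abruptly to 1 kHz at the 1-second mark
fs = 8000
t = np.arange(2 * fs) / fs
x = np.where(t < 1.0, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 1000 * t))
print(change_boundaries(x))  # boundaries reported near sample 8000
```

In the full system, windows would then be aligned to the detected boundaries and each segment classified and modeled with the appropriate (stationary or transient) model.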

Simulations and simple experiments indicate that a broad class of musical signals can benefit from simple processes aimed at accurately reproducing the perception of a soundfield through loudspeakers. These processes attempt to recreate relative phase and amplitude information accurately at the listeners' ears, while allowing distortions elsewhere. The net effect should be a more accurate reproduction of important localization cues and other perceptual factors for the listeners. Current work is aimed at extending these results by increasing their mathematical rigor and generality, and by examining how other psychoacoustic effects, such as masking, can be exploited to further improve the perceived accuracy of the reproduced soundfield.

The goal of this project is to develop a voice transformation system that makes the transformed voice sound like a natural voice of the opposite sex. The transformation accounts for differences in fundamental frequency (pitch) contours and spectral characteristics.

The transformation algorithm employs components of the well-known LPC-10 vocoder. By inserting a transformer stage between the LPC-10 analyzer and synthesizer, we can modify the LPC analysis parameters and thereby change the acoustic character of the input speech when the modified parameters are fed to the synthesizer.

In converting the gender of a voice, two parameters, pitch and formants, are modified. Pitch is transformed by treating it as a random variable and changing the mean and standard deviation of the original pitch values. A formant frequency is defined as the frequency of a peak in the speech spectrum, and the formant bandwidth as the 3-dB bandwidth of that peak. The first three formant frequencies are scaled separately by empirically derived factors; the scale factors for the formant bandwidths are set equal to those for the formant frequencies.
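The pitch transformation amounts to normalizing the contour by its own mean and standard deviation, then rescaling to the target gender's statistics. A minimal sketch (the function name, target values, and the convention of zero for unvoiced frames are ours):

```python
import numpy as np

def transform_pitch(f0, target_mean, target_std):
    """Map a pitch contour to target statistics by treating pitch as a
    random variable: normalize by the source mean/std, then rescale to
    the target mean/std. Unvoiced frames (f0 == 0) are passed through."""
    voiced = f0 > 0
    src = f0[voiced]
    out = f0.copy()
    out[voiced] = (src - src.mean()) / src.std() * target_std + target_mean
    return out

# Example: shift a male-range contour (~110 Hz) toward a female range (~210 Hz)
f0 = np.array([100., 110., 120., 0., 105., 115.])  # 0 marks unvoiced frames
print(transform_pitch(f0, target_mean=210.0, target_std=30.0))
```

After the transform, the voiced frames have exactly the target mean and standard deviation; the formant scaling step would then modify the LPC spectral envelope separately.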

Based on these ideas, an algorithm for voice gender transformation was implemented. Its performance depends greatly on the original speaker. Female-to-male conversion was also found to produce more natural-sounding speech than male-to-female conversion, mainly because the LPC-10 vocoder is poor at synthesizing female voices.

In statistically based speech recognition systems, choosing a feature that captures the essential linguistic properties of speech while suppressing other acoustic details is crucial: the performance of the recognition system is bounded by the amount of linguistically relevant information extracted from the raw speech waveform, and information lost at the feature extraction stage can never be recovered during recognition.

Some researchers have tried to build perceptual relevance into speech features by warping the spectrum to resemble the auditory spectrum. One example is the mel cepstrum (Davis, 1980), in which a filterbank whose bandwidths resemble the critical bands of human hearing is used to obtain a warped spectrum. Another is the Perceptual Linear Prediction (PLP) method proposed by Hermansky (Hermansky, 1990), in which a filterbank similar to the mel filterbank warps the spectrum, followed by perceptually motivated scaling and compression of the spectrum. Low-order all-pole modeling is then performed to estimate the smooth envelope of the modified spectrum.
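The non-uniform band spacing of such auditory filterbanks can be sketched directly from the standard mel-scale formula: bands spaced uniformly in mel are narrow at low frequencies and wide at high frequencies, roughly tracking critical bandwidths. A minimal illustration (band count and sample rate are arbitrary choices of ours):

```python
import numpy as np

def hz_to_mel(f):
    """Standard mel-scale mapping used to space mel-filterbank bands."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_band_edges(n_bands, fs):
    """Edge frequencies of a mel filterbank covering 0..fs/2:
    uniform in mel, non-uniform (critical-band-like) in Hz."""
    mels = np.linspace(0.0, hz_to_mel(fs / 2), n_bands + 2)
    return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)  # invert the mel map

edges = mel_band_edges(20, 8000)
print(np.round(edges))  # narrow low bands, progressively wider high bands
```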

While PLP provides a good representation of the speech waveform, it has some disadvantages. First, since the PLP method relies on obtaining the FFT spectrum before warping, its ability to model the peaks of the speech spectrum (the formants) depends on the characteristics of the harmonic peaks for vowels. This can hinder the modeling of formants in female speech through filterbank analysis, since there are fewer harmonic peaks under a formant region than in the male case. Second, the various processing steps (e.g. Bark-scale transformation, equal-loudness weighting, cube-root compression) require memory, table lookups, and/or interpolation, which can be computationally inefficient.

We propose a new method of obtaining parameters from speech that is based on frequency warping of the vocal-tract spectrum, rather than the FFT spectrum. The Bark Bilinear Transform (BBT) (Smith, 1995) is first applied on a uniform frequency grid to generate a grid that incorporates the non-uniform resolution properties of the human ear. Frequency warping is performed by evaluating the non-uniform DFT (NDFT) of the impulse response of the vocal-tract transfer function on the warped grid. The warped spectrum is then modeled by low-order Linear Prediction (LP), which provides a good estimate of the spectral envelope, especially near peaks. This results in features that effectively model the warped peaks of the vocal-tract spectrum, which are considered to be perceptually important. Results of vowel classification experiments show that the proposed feature effectively captures linguistic information while suppressing speaker-dependent acoustic variation across speakers.
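The warped-grid and NDFT steps can be sketched as follows. The BBT corresponds to a first-order allpass substitution, so a uniform grid maps through the allpass phase function; the closed-form allpass coefficient below is the published Smith-Abel fit to the Bark scale, but treat its exact constants as an assumption here, as is the toy impulse response:

```python
import numpy as np

def bark_warped_grid(n_bins, fs):
    """Map a uniform frequency grid (0..pi, normalized radians) through the
    first-order allpass warping underlying the Bark bilinear transform.
    rho is the Smith-Abel closed-form fit of the allpass coefficient."""
    rho = 1.0674 * np.sqrt((2 / np.pi) * np.arctan(0.06583 * fs / 1000.0)) - 0.1916
    w = np.linspace(0.0, np.pi, n_bins)
    return w + 2 * np.arctan2(rho * np.sin(w), 1 - rho * np.cos(w))

def ndft_spectrum(h, freqs):
    """Evaluate the DTFT of impulse response h on an arbitrary (warped)
    frequency grid -- the NDFT step of the proposed feature."""
    n = np.arange(len(h))
    return np.exp(-1j * np.outer(freqs, n)) @ h

# Toy vocal-tract impulse response evaluated on the Bark-warped grid
h = np.r_[1.0, 0.5, 0.25, np.zeros(5)]
warped_spec = np.abs(ndft_spectrum(h, bark_warped_grid(128, 8000.0)))
print(warped_spec.shape)  # one warped-spectrum sample per grid point
```

Low-order LP would then be fit to this warped spectrum (e.g. via its autocorrelation) to obtain the NLP cepstrum coefficients.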

**References**

- Davis, S. B. and Mermelstein, P., ``Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences'', *IEEE Transactions on Acoustics, Speech and Signal Processing*, Vol. 28, 1980, pp. 357-366.
- Hermansky, H., ``Perceptual linear predictive (PLP) analysis of speech'', *Journal of the Acoustical Society of America*, Vol. 87, No. 4, 1990, pp. 1738-1752.
- Smith, J. O. and Abel, J. S., ``The Bark bilinear transform'', *Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics*, New Paltz, NY, 1995.

© Copyright 2005 CCRMA, Stanford University. All rights reserved.