
Audio Signal Processing


Dynamic Range Compression of Audio Signals Consistent with Recent Time-Varying Loudness Models

Ryan J. Cassidy

Dynamic range compression may be used to increase the volume of the softer passages of an audio signal relative to its louder portions, thus making the signal better suited to transmission through or storage on a given medium. The level-detection characteristics of typical contemporary dynamic range compressors are analyzed, revealing the shortcomings of such detectors in light of what is known about steady-state and time-varying loudness as perceived by the human auditory system. The design of an equal-loudness filter, intended to improve the steady-state properties of compressor level detection, is presented. Finally, the time-varying properties of the presented level-detection scheme, configured via attack and release times, are tuned to correspond as closely as possible to a recently proposed model of time-varying loudness.
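As a rough illustration of what "level detection configured via attack and release times" usually means in a conventional compressor, the sketch below tracks the rectified signal with a one-pole smoother that switches coefficients depending on whether the level is rising or falling. It is not the detector designed in this work: the function names, the default time constants, the threshold, and the ratio are all assumptions chosen for the example, and no equal-loudness filter is included.

```python
import math

def smoothing_coeff(time_s, sample_rate):
    """One-pole smoothing coefficient for a time constant given in seconds."""
    return math.exp(-1.0 / (time_s * sample_rate))

def detect_level(x, sample_rate, attack_s=0.005, release_s=0.100):
    """Follow |x| with separate attack (rising) and release (falling) smoothing.

    attack_s and release_s are illustrative defaults, not values from the text.
    """
    a_att = smoothing_coeff(attack_s, sample_rate)
    a_rel = smoothing_coeff(release_s, sample_rate)
    level = 0.0
    out = []
    for sample in x:
        rect = abs(sample)
        coeff = a_att if rect > level else a_rel
        level = coeff * level + (1.0 - coeff) * rect
        out.append(level)
    return out

def compressor_gain(level, threshold_db=-20.0, ratio=4.0):
    """Linear gain of a hard-knee downward compressor for a detected level."""
    level_db = 20.0 * math.log10(max(level, 1e-12))
    over_db = max(level_db - threshold_db, 0.0)
    gain_db = -over_db * (1.0 - 1.0 / ratio)
    return 10.0 ** (gain_db / 20.0)
```

Multiplying each input sample by `compressor_gain(level)` for the corresponding detected level gives the compressed signal; make-up gain would then raise the softer passages relative to the louder ones.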

A Tunable, Nonsubsampled, Non-Uniform Filter Bank for Multi-Band Audition and Level Modification of Audio Signals

Ryan J. Cassidy

The need for investigation and (possibly non-linear) modification (e.g. dynamic range compression) of the components of an audio signal corresponding to different spectral bands is widespread. Two examples of present focus include multi-band dynamic range compression of a musical signal and frequency-dependent gain control in a hearing aid. We present a technique we call sub-band audition, which enables the user to listen to individual sub-bands of an audio signal, and thus better determine required modifications. The filter bank requirements of such an application are presented. Finally, the design of a tunable, nonsubsampled, non-uniform filter bank, based on allpass-complementary filter banks and Elliptic Minimal Q-Factor (EMQF) filters, is presented.
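The idea of an allpass-complementary, nonsubsampled band split can be sketched with the simplest possible case: a two-band split built from a single first-order allpass section. This is only an illustration under that assumption; the filter bank described above uses EMQF filters and more bands, and the function names and the coefficient `a` here are hypothetical.

```python
def allpass1(x, a):
    """First-order allpass A(z) = (a + z^-1) / (1 + a z^-1), with |a| < 1."""
    y = []
    x_prev = 0.0
    y_prev = 0.0
    for s in x:
        out = a * s + x_prev - a * y_prev
        y.append(out)
        x_prev, y_prev = s, out
    return y

def split_two_bands(x, a):
    """Split x into low and high bands at the full sample rate.

    low = (x + A(x)) / 2 and high = (x - A(x)) / 2, so low + high
    reconstructs x exactly and low - high is an allpass-filtered copy.
    """
    ap = allpass1(x, a)
    low = [0.5 * (s + p) for s, p in zip(x, ap)]
    high = [0.5 * (s - p) for s, p in zip(x, ap)]
    return low, high
```

Because neither band is subsampled, either one can be auditioned on its own or scaled before summing; changing `a` moves the crossover frequency, which is one simple sense in which such a bank is tunable.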

Distributed Internet Reverberation for Audio Collaboration

Chris Chafe

Low-latency, high-quality audio transmission over next-generation Internet is a reality. Bidirectional, multichannel flows over continental distances have been demonstrated in musical jam sessions and other experimental situations. The dominant source of delay is no longer the end systems but the transmission time, which is bounded by the speed of light. This paper addresses a method for creating shared acoustical spaces by ``echo construction.'' Where delays in bidirectional paths are sufficiently short and ``room-sized,'' they can be used to advantage as components in synthetic, composite reverberation.

The project involves setting up two collaborating audio hosts (e.g., in Seattle and San Francisco) separated by a short Internet delay (in this example, a round-trip time of RTT = 20ms). Monitoring on both ends includes a composite reverberation in which the round-trip delay is used to construct multipath echoes, corresponding to multiple ``rays'' in a composite room.

The first implementation involves two identical rooms with identical monitoring (microphone and speaker locations). For simplicity, the rooms can be thought of as small, 10ft on a side. Using the technique described, a composite room is heard which incorporates the 10ms one-way network delay (half of the 20ms RTT) in a synthetic reverberation circuit running in software as part of the audio transmission system. The added 10ms roughly corresponds to an additional 10ft inserted between the monitoring locations. The listeners have the impression of communicating with each other in the same 30ft room.
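The correspondence between delay and distance can be checked with a short calculation. The sketch assumes a speed of sound of about 1125 ft/s (air at room temperature), a figure not given in the text; the function names are illustrative.

```python
SPEED_OF_SOUND_FT_PER_S = 1125.0  # assumed: dry air at roughly 20 degrees C

def one_way_ms(rtt_ms):
    """One-way network delay, assuming a symmetric path."""
    return rtt_ms / 2.0

def delay_to_feet(delay_ms):
    """Acoustic path length that sound covers in delay_ms milliseconds."""
    return SPEED_OF_SOUND_FT_PER_S * delay_ms / 1000.0

extra_ft = delay_to_feet(one_way_ms(20.0))  # about 11 ft for a 20 ms RTT
```

Under that assumption a 10ms one-way delay corresponds to a little over 11ft, which is the "roughly 10ft" inserted between the two 10ft rooms.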

A recent paper describes the audio transmission techniques, the multichannel monitoring and reverberation circuit, and an initial subjective evaluation of this ``echo construction'' method.


Loudness-Based Display and Analysis Applied to Artificial Reverberation

Patty Huang and Julius O. Smith III

We propose a psychoacoustically motivated method to analyze the quality of reverberation and other audio signals. A time-varying loudness model is used as a front end to produce a visual display of a signal which emphasizes its perceptual features. A time-frequency display of reverberation impulse responses based on specific loudness is shown to produce a psychoacoustically relevant visualization of response features. In addition, a metric based on instantaneous loudness is proposed as an objective measure of quality and texture of late reverberation.

Reference: ICMC-04 paper (same title and authors).

Audio Watermarking based on Parametric Representations

Yi-Wen Liu

Synthesized multimedia objects are emerging everywhere now. One can talk on the phone to a virtual representative that speaks a synthesized tongue, drink soda of synthesized taste, such as Coke, or even fall in love with Simone, a synthesized character. It is becoming urgent to protect such objects as intellectual property, since synthesizing them often requires considerable computational power and human labor. In my dissertation research, a framework is proposed for the design of robust watermarking algorithms in the synthesis parameter domain.

Particularly for audio signals that are sinusoidal in nature, such as vowels of human speech or sustained tones of a musical instrument, I have been experimenting with the idea of watermarking by quantizing the frequency. To implement it, a signal first has to be decomposed into sinusoidal and non-sinusoidal components. Then, the frequencies of the sinusoidal partials are quantized to carry binary information. The quantization step is set small enough that the change cannot be heard by human ears, but otherwise as large as possible, so that the step can easily be resolved when the watermark's binary information needs to be extracted. The frequency-quantized sinusoids are carefully synthesized and superposed with the non-sinusoidal components, which are unaltered, to form a watermarked version of the original signal.
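One way to picture frequency quantization carrying a bit is to snap each partial's frequency to a grid and let the parity of the grid index encode the bit. The sketch below does this with a fixed step in Hz, which is an assumption made for simplicity: the text does not specify the grid, and a perceptually motivated step would more plausibly vary with frequency. The function names are hypothetical.

```python
def embed_bit(freq_hz, bit, step_hz):
    """Move freq_hz to the nearest grid point whose index parity equals bit."""
    scaled = freq_hz / step_hz
    index = round(scaled)
    if index % 2 != bit:
        # Wrong parity: take the neighbouring grid point on the side of freq_hz.
        index += 1 if scaled > index else -1
    return index * step_hz

def extract_bit(freq_hz, step_hz):
    """Recover the bit from an estimated frequency on the same grid."""
    return round(freq_hz / step_hz) % 2
```

With this scheme the embedded frequency moves by at most one step, and the bit survives as long as the decoder's frequency estimate is off by less than half a step, which is why an accurate frequency estimator matters for decoding.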

To decode the watermark embedded as described above, a frequency estimator with very high accuracy is necessary. I developed an efficient algorithm that can track 50-100 partials from a mildly noisy observation. The algorithm often (but not always) approaches the Cramér-Rao lower bound, a theoretical limit on the variance of unbiased parameter estimators. Empirically, the algorithm works well when all partials in the spectrum are well separated to begin with.
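For orientation, the textbook form of this bound for the simplest case may help; it is an assumption-laden reference point, not the multi-partial bound used in the dissertation. For a single real sinusoid of amplitude $A$ with unknown amplitude, frequency, and phase, observed over $N$ samples in white Gaussian noise of variance $\sigma^{2}$, and with the frequency $f$ in cycles per sample and not close to 0 or 1/2, the variance of any unbiased frequency estimate $\hat{f}$ satisfies approximately:

```latex
\operatorname{var}(\hat{f}) \;\ge\; \frac{12}{(2\pi)^{2}\,\eta\,N\,(N^{2}-1)},
\qquad \eta = \frac{A^{2}}{2\sigma^{2}}
```

Here $\eta$ is the signal-to-noise ratio. The bound falls roughly as $1/N^{3}$, so longer analysis frames allow much finer frequency resolution, and hence a smaller quantization step, at a given noise level.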


Bayesian Two-Source Models for Stereo Sound Source Separation of N Sources

Aaron Master

We presently consider an enhancement to the DUET sound source separation system [1]. Specifically, we expand the system and the related delay and scale subtraction scoring (DASSS) [2] to allow sounds from exactly two sources to be active at the same point in time-frequency (STFT) space. We begin with a review of the DUET system and its sparsity and independence assumptions. We then consider how the DUET system and DASSS respond when faced with two active sources at the same point in time-frequency space. For this case, we show the important result that the DUET and DASSS data may reveal which two sources are active. To exploit this result, we present a Bayesian framework for determining the most likely sources given DUET and DASSS data. We then show how to use this framework on DASSS data. We conclude with an example showing the efficacy of using DASSS data for determining and demixing two active sources.
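The single-source case that DUET relies on can be sketched as follows. Assuming the usual stereo model in which the second channel is a scaled and delayed copy of the first, the ratio of the two channels' spectra at one frequency bin yields the source's relative amplitude and delay. The code uses a single DFT frame rather than a full STFT, and the function names are illustrative only.

```python
import cmath
import math

def dft_bin(x, k):
    """Value of the length-len(x) DFT of x at bin k."""
    n = len(x)
    return sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(x))

def duet_features(x1, x2, k):
    """Return (amplitude ratio, delay in samples) of x2 relative to x1 at bin k.

    Assumes one source dominates the bin, so X2/X1 = a * exp(-j * omega * delay),
    and that omega * delay stays below pi so the phase does not wrap.
    """
    omega = 2.0 * math.pi * k / len(x1)
    ratio = dft_bin(x2, k) / dft_bin(x1, k)
    return abs(ratio), -cmath.phase(ratio) / omega
```

When two sources are active in the same bin, this ratio no longer matches the mixing parameters of either source alone, which is the situation the two-source model above is meant to handle.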

Aaron S. Master. Bayesian Two Source Modeling for Separation of N Sources from Stereo Signals. Submitted to ICASSP 2004, Montreal.

Aaron S. Master. Bayesian Two Source Modeling for Stereo Sound Source Separation of N Sources, Stanford University EE 391 Report, Summer 2003.



Juan Reyes

Banded waveguides are an efficient method for physical modeling of bar percussion instruments, such as bowed bars of wood, metal, or glass, as well as bowls and even corrugated surfaces, as proposed by Essl and Cook in 1999. A CLM instrument has been written using the bow-table paradigm for signal generation plus an array of banded waveguides arranged in parallel. A banded waveguide consists of a bandpass filter, fine-tuned to the frequency of one of the modes of the vibrating object, plus a delay line that is also mode dependent. Since different materials and shapes have different modes of vibration, the algorithm allows the number of modes and the frequency of each mode to be manipulated. Delay-line lengths are a function of the sampling rate and of the position of each mode. The CLM version has a few advantages over a real-time implementation: scores can be generated, modes can be specified independently, and specific frequencies and resonances can be narrowed in on. This instrument can be interfaced to algorithmic composition packages, and its signal can be scattered over a multi-path multichannel system, given the high-frequency components of this kind of timbre.
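The structure described, one bandpass filter plus one delay line per mode, with the bands summed in parallel, can be sketched in Python as below. This is not the CLM instrument: the bow-table excitation is replaced by a single impulse, the bandpass is a standard two-pole, two-zero design with unity peak gain, and the mode ratios (those of an ideal free-free uniform bar), the Q, and the feedback gain are all assumed values for illustration.

```python
import math

BAR_MODE_RATIOS = (1.0, 2.756, 5.404, 8.933)  # assumed: ideal free-free bar

def banded_waveguide(mode_hz, sample_rate, n_samples, q=50.0, feedback=0.995):
    """One band: an impulse circulating through a bandpass filter and a delay line."""
    delay_len = max(1, round(sample_rate / mode_hz))  # one period of the mode
    w0 = 2.0 * math.pi * mode_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    b0 = alpha / (1.0 + alpha)            # bandpass with peak gain 1 at w0
    b2 = -b0
    a1 = -2.0 * math.cos(w0) / (1.0 + alpha)
    a2 = (1.0 - alpha) / (1.0 + alpha)
    delay = [0.0] * delay_len
    idx = 0
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for n in range(n_samples):
        excitation = 1.0 if n == 0 else 0.0
        x = excitation + feedback * delay[idx]   # sample written delay_len ago
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        delay[idx] = y
        idx = (idx + 1) % delay_len
        out.append(y)
    return out

def bar_tone(fundamental_hz, sample_rate, n_samples, ratios=BAR_MODE_RATIOS):
    """Sum parallel banded waveguides, skipping modes above the Nyquist frequency."""
    total = [0.0] * n_samples
    for r in ratios:
        f = fundamental_hz * r
        if f < sample_rate / 2.0:
            band = banded_waveguide(f, sample_rate, n_samples)
            total = [t + b for t, b in zip(total, band)]
    return total
```

Changing the `ratios` tuple changes the number of modes and the frequency of each one, and each delay length follows from the sampling rate and that mode's frequency, mirroring the dependencies stated above.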

Spectral Audio Signal Processing

Julius Smith

The Fast Fourier Transform (FFT) revolutionized signal processing practice in the 1960s. Today, it continues to spread as a practical basis for digital systems implementation. Only in the past decade or so has it become cost-effective to use the short-time FFT in real-time digital audio systems, thanks to the availability of sufficiently powerful, low-cost, single-chip solutions.

In the digital audio field, FFT-based techniques are useful in digital mixing consoles, post-production editing facilities, and top-quality digital audio gear. Many music and digital audio ``effects'' can be conveniently implemented in a unified way using a short-time Fourier analysis, modification, and resynthesis facility.

In contrast with physical modeling synthesis, which models the source of a sound, spectral modeling techniques model sound at the receiver, the human ear. Spectral modeling is more immediately general than physical modeling since it is capable of constructing an arbitrary stimulus along the basilar membrane of the ear. While complex coarticulation effects are more naturally provided by physical models, the short-time Fourier transform can be applied to any sound demonstrating any desired effect to determine what must happen in a spectral sequence to produce that effect.

FFT-based techniques play an important role in (1) the practical implementation of general signal processing systems (fast convolution), (2) advanced effects such as ``cross synthesis,'' time compression/expansion, duration-invariant frequency shifting, and other ``phase vocoder'' type techniques, (3) noise reduction, (4) perceptually calibrated spectral display, (5) perceptual audio compression, and (6) novel synthesis systems based on the direct creation and transformation of spectral events and envelopes.
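Item (1), fast convolution, is the easiest of these to show concretely: zero-pad both sequences, multiply their FFTs, and invert. The sketch below uses a small recursive radix-2 FFT so that it is self-contained; a practical system would use an optimized FFT library and block-wise (overlap-add) processing. The function names are illustrative.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT of a power-of-two-length list (unscaled inverse)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def fast_convolve(a, b):
    """Linear convolution of two real sequences via zero-padded FFTs."""
    out_len = len(a) + len(b) - 1
    size = 1
    while size < out_len:
        size *= 2
    fa = fft([complex(v) for v in a] + [0j] * (size - len(a)))
    fb = fft([complex(v) for v in b] + [0j] * (size - len(b)))
    prod = [p * q for p, q in zip(fa, fb)]
    result = fft(prod, inverse=True)
    return [v.real / size for v in result[:out_len]]
```

Zero-padding to at least `len(a) + len(b) - 1` is what turns the FFT's circular convolution into the linear convolution a filter requires.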


Using a perceptually based timbre metric for parameter control estimation in a physical model of the clarinet

Hiroko Terasawa, Jonathan Berger, and Julius Smith

Time-intensive, trial-and-error manual adjustment of control parameters limits the practical use of physical modeling synthesis. The goal of this study is to provide an efficient approach for automatic estimation of optimal control parameters for physical modeling synthesis. The method is based on psychoacoustically motivated timbre comparisons between a recorded reference sound and a set of corresponding synthesized sounds. The timbre comparisons are based upon the sample mean and standard deviation between Mel-Frequency Cepstral Coefficients (MFCC) computed using several steady-state time frames from the reference and synthesized sounds.

Perceptual Distance in Timbre Space

Hiroko Terasawa, Malcolm Slaney, and Jonathan Berger

This research is concerned with developing a perceptual space for timbre. We define an objective metric that takes into account perceptual orthogonality, and measures the quality of timbre interpolation applicable to perceptually valid timbral sonification. We discuss two timbre representations and measure perceptual judgment. We determined that a timbre space based on Mel-frequency cepstral coefficients (MFCC) is a good model for perceptual timbre space.


A Bayesian Framework for Joint Onset Detection, Transient Region Characterization, and Melody Transcription

Harvey Thornburg, Randal Leistikow, and Jonathan Berger

Onset detection and transient region identification for musical audio signals have proven to be vital components in many improved time and pitch scaling algorithms which reduce artifacts due to these transient phenomena. However, these detection tasks are generally quite difficult owing to the complexities of real-world musical audio signals. Fortunately, musical signals are highly structured--both at the signal level, in terms of the spectrotemporal structure of note events, and at higher levels, in terms of melody and rhythm. These structures generate context useful in predicting attributes such as pitch content, the presence and location of onsets, and the boundaries of transient regions. To this end, we propose a dynamic Bayesian framework in which contextual predictions may be integrated with signal information in order to make optimal decisions concerning these attributes. The result is a joint segmentation and melody retrieval for nominally monophonic signals (nevertheless containing reverberation, polyphony due to note overlaps, and even background instrumentation). The system detects note event boundaries and pitches, also yielding a frame-level sub-segmentation of these events into transient/steady-state regions. The approach is successfully applied to notoriously difficult examples like bowed string recordings captured in highly reverberant environments.

Transient Detection and Modeling

Harvey Thornburg

Transient events are regions not ``well-modeled'' by a locally stationary sinusoidal model. Examples include abrupt changes or fast decays/modulations in mode amplitudes/frequencies. Transient regions usually follow onsets, which are preceded by an abrupt change. This suggests a twofold approach for transient detection:

Segmentation - Bayesian Approach:


A Robust Maximum Likelihood F0 Estimation from STFT Peaks: Exact and Fast Approximate MCMC Approaches

Harvey Thornburg and Randal J. Leistikow

We propose a robust $f_{0}$-estimation method for signals with interference and low SNR, without the computational requirements of optimal time-domain methods. Sinusoidal peaks are extracted by a windowed STFT; the collection of peak frequencies and amplitudes drives our analysis. Given $f_{0}$ and a reference amplitude $A_{0}$, peak frequency/amplitude observations are modeled probabilistically in a sense robust to undetected harmonics, spurious peaks, skewed peak estimates, and inherent deviations from ideal or other assumed harmonic structure. Parameters $f_{0}$ and $A_{0}$ are estimated by maximizing the observations' likelihood, though $A_{0}$ is treated as a nuisance parameter. Our model utilizes a hidden, discrete-valued descriptor variable identifying spurious/undetected peaks. The likelihood evaluation, requiring a computationally unwieldy summation over all descriptor states, is successfully approximated by an MCMC traversal chiefly amongst high-probability states.

Speaker Array Calibration Using Inter-Speaker Range Measurements

R. Scott Wilson, Jeffrey H. Walters and Jonathan S. Abel

Given an array of speakers and a set of noisy inter-speaker range estimates, we consider the problem of estimating the relative positions of the array elements. A closed-form position estimator which minimizes an equation error norm is presented and shown to be related to a multidimensional scaling analysis. The information inequality is used to bound position estimate mean square error and to gauge the accuracy of the closed-form estimator. A geometric interpretation of the bound variance is given and used in examining our simulation results.
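Since the estimator is said to be related to multidimensional scaling, a sketch of classical (Torgerson) MDS may help show how relative positions fall out of a range matrix. This is the textbook technique, not the authors' estimator, and it is shown for the noiseless 2-D case; the eigenvectors are found by plain power iteration to keep the example free of external libraries. The function name and iteration count are assumptions.

```python
import math

def classical_mds_2d(dist, iters=500):
    """Recover 2-D coordinates (up to rotation, reflection, shift) from ranges."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    total = sum(row) / n
    # Double-centre the squared ranges to obtain a Gram (inner-product) matrix.
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
         for i in range(n)]
    coords = [[0.0, 0.0] for _ in range(n)]
    for axis in range(2):
        v = [float(i + 1) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        for _ in range(iters):  # power iteration for the dominant eigenvector
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm < 1e-15:
                break
            v = [c / norm for c in w]
        lam = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # Deflate so the next pass finds the next-largest eigenvector.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords
```

The recovered layout is only defined up to a rigid motion, which matches the "relative positions" wording above; with noisy ranges the same procedure gives an approximate layout rather than an exact one.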


Audio and Gesture Latency Measurements on Linux and OSX

Matt Wright

We have measured the total system latencies of MacOS 10.2.8, Red Hat Linux (2.4.25 kernel with low-latency patches), and Windows XP from stimulus in to audio out, with stimuli including analog and digital audio, and the QWERTY keyboard. We tested with a variety of audio hardware interfaces, audio drivers, buffering and related configuration settings, and scheduling modes. All measured audio latencies scaled with buffer size as expected, but with a consistent amount of unexplained additional latency. With analog I/O there was also a consistent additional bandwidth-dependent latency, seemingly caused by hardware. Gesture tests with the QWERTY keyboard indicate discouragingly large amounts of latency and jitter, but large improvements on Linux when real-time priorities are set.
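The latency one would expect from buffering alone is simply the buffered frames divided by the sample rate, and the "unexplained" part is whatever a measurement shows beyond that. The helper names and the example figures below are illustrative and are not measurements from the paper.

```python
def buffer_latency_ms(frames, sample_rate_hz, n_buffers=1):
    """Latency contributed by n_buffers buffers of the given size, in ms."""
    return 1000.0 * frames * n_buffers / sample_rate_hz

def unexplained_latency_ms(measured_ms, frames, sample_rate_hz, n_buffers=1):
    """Measured latency minus the part accounted for by buffering."""
    return measured_ms - buffer_latency_ms(frames, sample_rate_hz, n_buffers)
```

For example, a single 64-frame buffer at 44100 Hz accounts for about 1.45 ms, so a hypothetical measurement of 5 ms with that setting would leave roughly 3.55 ms unexplained.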


Wright, M., R. J. Cassidy and M. F. Zbyszynski (2004). ``Audio and Gesture Latency Measurements on Linux and OSX''. Proc. International Computer Music Conference, Miami, FL, ICMA: 423-429.

© Copyright 2005 CCRMA, Stanford University. All rights reserved.