- Dynamic Range Compression of Audio Signals Consistent with Recent Time-Varying Loudness Models
- A Tunable, Nonsubsampled, Non-Uniform Filter Bank for Multi-Band Audition and Level Modification of Audio Signals
- Distributed Internet Reverberation for Audio Collaboration
- Loudness-Based Display and Analysis Applied to Artificial Reverberation
- Audio Watermarking based on Parametric Representations
- Bayesian Two-Source Models for Stereo Sound Source Separation of N Sources
- BandedWG.ins
- Spectral Audio Signal Processing
- Using a perceptually based timbre metric for parameter control estimation in a physical model of the clarinet
- Perceptual Distance in Timbre Space
- A Bayesian Framework for Joint Onset Detection, Transient Region Characterization, and Melody Transcription
- Transient Detection and Modeling
- A Robust Maximum Likelihood F0 Estimation from STFT Peaks: Exact and Fast Approximate MCMC Approaches
- Speaker Array Calibration Using Inter-Speaker Range Measurements
- Audio and Gesture Latency Measurements on Linux and OSX

Dynamic range compression may be used to increase the level of the softer passages of an audio signal relative to its louder portions, thus making the signal better suited to transmission through, or storage on, a given medium. The level-detection characteristics of typical contemporary dynamic range compressors are analyzed, revealing the shortcomings of such models in light of current knowledge about steady-state and time-varying loudness as perceived by the human auditory system. The design of an equal-loudness filter, intended to improve the steady-state properties of compressor level detection, is presented. Finally, the time-varying properties of the level-detection scheme, configured via attack and release times, are tuned to provide optimal correspondence with a recently proposed model of time-varying loudness.
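The level-detection stage discussed above can be sketched as follows. This is a minimal illustration, not the detector analyzed in the work: it implements a generic one-pole peak detector with separate attack and release time constants, and the parameter values (5 ms / 50 ms) are hypothetical rather than the tuned values the abstract refers to.

```python
import math

def level_detector(x, fs, attack_ms=5.0, release_ms=50.0):
    """One-pole peak level detector with separate attack and release
    time constants, as found in typical dynamic range compressors."""
    # One-pole smoothing coefficients derived from the time constants.
    a_att = math.exp(-1.0 / (fs * attack_ms / 1000.0))
    a_rel = math.exp(-1.0 / (fs * release_ms / 1000.0))
    level, out = 0.0, []
    for sample in x:
        rect = abs(sample)                    # rectify the input
        a = a_att if rect > level else a_rel  # attack while rising, release while falling
        level = a * level + (1.0 - a) * rect
        out.append(level)
    return out

# A 10 ms full-scale burst followed by silence: the detected level rises
# on the fast attack time constant, then decays on the slower release.
fs = 48000
x = [1.0] * 480 + [0.0] * 4800
env = level_detector(x, fs)
```

The attack/release asymmetry is exactly what the abstract proposes to tune against a time-varying loudness model.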

The need for investigation and (possibly non-linear) modification
(e.g. dynamic range compression) of the components of an audio signal
corresponding to different spectral bands is widespread. Two examples
of present focus include multi-band dynamic range compression of a
musical signal and frequency-dependent gain control in a hearing aid.
We present a technique we call *sub-band audition*, which enables
the user to listen to individual sub-bands of an audio signal, and
thus better determine required modifications. The filter bank
requirements of such an application are presented. Finally, the
design of a tunable, nonsubsampled, non-uniform filter bank, based on
allpass-complementary filter banks and Elliptic Minimal Q-Factor
(EMQF) filters, is presented.

Low-latency, high-quality audio transmission over next-generation Internet is a reality. Bidirectional, multichannel flows over continental distances have been demonstrated in musical jam sessions and other experimental situations. The dominating factor in delay is no longer system issues, but the transmission time bounded by lightspeed. This paper addresses a method for creating shared acoustical spaces by ``echo construction.'' Where delays in bidirectional paths are sufficiently short and ``room-sized,'' they can be used to advantage as components in synthetic, composite reverberation.

The project involves setting up two collaborating audio hosts (e.g., Seattle and San Francisco locations) separated by short internet delay times (e.g., in this example RTT = 20ms). Monitoring on both ends includes a composite reverberation in which the round-trip delay is used to construct multipath echoes, corresponding to multiple ``rays'' in a composite room.

The first implementation involves two identical rooms with identical monitoring (microphone and speaker locations). For simplicity, the rooms can be thought of as small, 10ft on a side. Using the technique described, a composite room is heard which incorporates the 10ms one-way network delay in a synthetic reverberation circuit running in software as part of the audio transmission system. The added 10ms roughly corresponds to an additional 10ft inserted between the monitoring locations. The listeners have the impression of communicating with each other in the same 30ft room.
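The delay-to-distance correspondence above follows directly from the speed of sound. As a quick check (a sketch, assuming a nominal 343 m/s; the function name is ours, not from the paper):

```python
def delay_to_path_ft(delay_ms, c_mps=343.0):
    """Acoustic path length corresponding to a propagation delay,
    at a nominal speed of sound of about 343 m/s."""
    meters = c_mps * delay_ms / 1000.0
    return meters * 3.281  # meters to feet

# The 10 ms one-way network delay (half the 20 ms RTT) maps to roughly
# 11 ft of extra acoustic path, i.e. a "room-sized" distance.
extra_ft = delay_to_path_ft(10.0)
```

This is why sufficiently short bidirectional delays can pass as plausible room reflections rather than being perceived as transmission lag.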

A recent paper describes the audio transmission techniques, multichannel monitoring and reverberation circuit and initial subjective evaluation of this ``echo construction'' method.

**References:**

- Chafe, C. (2002). *Oxygen Flute, A Computer Music Instrument that Grows*. Proceedings of the 2002 Keihanna Multimedia Festival, Kyoto, Japan.
- Chafe, C. (2003). *Distributed Internet Reverberation for Audio Collaboration*. Proceedings of the 24th International Conference of the Audio Engineering Society, Banff, Canada.

We propose a psychoacoustically motivated method to analyze the quality of reverberation and other audio signals. A time-varying loudness model is used as a front end to produce a visual display of a signal which emphasizes its perceptual features. A time-frequency display of reverberation impulse responses based on specific loudness is shown to produce a psychoacoustically relevant visualization of response features. In addition, a metric based on instantaneous loudness is proposed as an objective measure of quality and texture of late reverberation.

**Reference:** ICMC-04 paper (same title and authors).

Synthesized multimedia objects are emerging everywhere. One can talk on the phone with a virtual representative speaking in a synthesized voice, drink a soda with a synthesized flavor, such as Coke, or even fall in love with Simone, a synthesized character. It is becoming urgent to protect such objects as intellectual property, for their synthesis often involves substantial computational power and human labor. In my dissertation research, a framework is proposed for the design of robust watermarking algorithms in the synthesis-parameter domain.

Particularly for audio signals that are sinusoidal in nature, such as vowels of human speech or sustained tones of a musical instrument, I have been experimenting with the idea of watermarking by quantizing partial frequencies. To implement it, a signal must first be decomposed into sinusoidal and non-sinusoidal components. Then, the frequencies of the sinusoidal partials are quantized to carry binary information. The quantization step is chosen to be small enough to be inaudible, yet as large as possible so that it can easily be resolved when the watermark's binary information needs to be extracted. The frequency-quantized sinusoids are carefully synthesized and superposed with the unaltered non-sinusoidal components to form a watermarked version of the original signal.

To decode the watermark embedded as described above, a frequency estimator with very high accuracy is necessary. I developed an efficient algorithm that can track 50-100 partials from a mildly noisy observation. The algorithm often (but not always) approaches the Cramer-Rao lower bound, a theoretical limit in parameter estimation. Empirically, the algorithm works well when all partials in the spectrum are well separated to begin with.
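The quantization idea described above can be sketched as quantization index modulation (QIM) over two interleaved frequency grids. This is a minimal illustration, not the dissertation's actual algorithm: the 2 Hz step is an arbitrary stand-in for the perceptually chosen step size, and the real system operates on estimated partial frequencies rather than exact values.

```python
def embed_bit(freq_hz, bit, step_hz=2.0):
    """Quantization index modulation (QIM): snap a partial's frequency to
    the nearest point of one of two interleaved grids selected by the bit.
    The 2 Hz step is purely illustrative; a real system must choose it
    below the audibility threshold for frequency shifts."""
    offset = 0.0 if bit == 0 else step_hz / 2.0
    return round((freq_hz - offset) / step_hz) * step_hz + offset

def extract_bit(freq_hz, step_hz=2.0):
    """Decode a bit by testing which grid the (accurately estimated)
    partial frequency lies closer to."""
    d0 = abs(freq_hz - embed_bit(freq_hz, 0, step_hz))
    d1 = abs(freq_hz - embed_bit(freq_hz, 1, step_hz))
    return 0 if d0 <= d1 else 1
```

The decoder's reliance on distinguishing grids half a step apart is what drives the need for the high-accuracy frequency estimator described above.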

**References:**

- Liu, Y. and Smith, J.O. (2003). *Watermarking parametric representations for synthetic audio*. Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), Hong Kong. Available online at `http://ccrma.stanford.edu/~jacobliu/DWM/`.
- Liu, Y. and Smith, J.O. (2004). *Audio watermarking based on sinusoidal analysis and synthesis*. Proceedings of the 2004 International Symposium on Musical Acoustics, Nara, Japan.
- Liu, Y. and Smith, J.O. (2004). *Watermarking sinusoidal audio representations by quantization index modulation in multiple frequencies*. Proceedings of the 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), Montreal, Canada.

We presently consider an enhancement to the DUET sound source separation system [1]. Specifically, we expand the system and the related delay and scale subtraction scoring (DASSS) [2] to allow sounds from exactly two sources to be active at the same point in time-frequency (STFT) space. We begin with a review of the DUET system and its sparsity and independence assumptions. We then consider how the DUET system and DASSS respond when faced with two active sources at the same point in time-frequency space. For this case, we show the important result that the DUET and DASSS data may reveal which two sources are active. To exploit this result, we present a Bayesian framework for determining the most likely sources given DUET and DASSS data. We then show how to use this framework on DASSS data. We conclude with an example showing the efficacy of using DASSS data for determining and demixing two active sources.
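The per-bin mixing features that DUET relies on can be sketched as follows. This illustrates only the DUET-style feature extraction (amplitude ratio and phase-derived delay per STFT bin), not DASSS or the Bayesian two-source framework; the function name and the demo's gain/delay values are ours.

```python
import numpy as np

def duet_features(X1, X2, freqs):
    """Per time-frequency-bin mixing features in the spirit of DUET: the
    amplitude ratio and phase-derived delay between two stereo STFT channels.
    Under the sparsity assumption (one dominant source per bin), bins cluster
    around each active source's true (attenuation, delay) pair."""
    ratio = X2 / (X1 + 1e-12)                 # guard against empty bins
    atten = np.abs(ratio)
    omega = 2.0 * np.pi * np.asarray(freqs)   # rad/sample
    delay = -np.angle(ratio) / np.where(omega == 0.0, np.inf, omega)
    return atten, delay

# One source mixed with gain 0.5 and a 3-sample delay in the right channel;
# low bins only, so the inter-channel phase does not wrap.
freqs = np.fft.rfftfreq(64)[:8]
X1 = np.ones(8, dtype=complex)
X2 = 0.5 * X1 * np.exp(-1j * 2.0 * np.pi * freqs * 3.0)
atten, delay = duet_features(X1, X2, freqs)
```

When two sources are simultaneously active in a bin, these features deviate from both sources' true pairs, which is precisely the case the Bayesian extension above addresses.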


**References:**

- Master, Aaron S. *Bayesian Two Source Modeling for Separation of N Sources from Stereo Signals*. Submitted to ICASSP 2004, Montreal.
- Master, Aaron S. *Bayesian Two Source Modeling for Stereo Sound Source Separation of N Sources*. EE 391 Technical Report, Stanford University, Summer 2003.

BandedWG.ins

Banded waveguides are an efficient method for physical modeling of bar percussion instruments, such as bowed bars of wood, metal, or glass, bowls, and even corrugated surfaces, as proposed by Essl and Cook in 1999. A CLM instrument has been written using the bow-table paradigm for signal generation plus an array of banded waveguides arranged in parallel. A banded waveguide consists of a bandpass filter, finely tuned to the frequency of one of the modes of the vibrating object, plus a mode-dependent delay line. Since different materials and shapes have different modes of vibration, the algorithm allows the number of modes and the frequency of each mode to be manipulated. Delay-line lengths are a function of the sampling rate and of each mode's frequency. The CLM version has a few advantages over a real-time implementation: scores can be generated, modes can be specified independently, and the instrument allows narrowing in on specific frequencies and resonances. This instrument can be interfaced to algorithmic composition packages, and its signal can be scattered over a multi-path multichannel system, given the high-frequency components of this kind of timbre.
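The bandpass-plus-delay-line structure described above can be sketched in Python (the actual instrument is written in CLM, a Lisp-based system). This is a minimal illustration under our own assumptions: a second-order resonator per band, a loop gain held just below unity for stability, and illustrative bar-like modal ratios; it omits the bow-table excitation entirely.

```python
import cmath
import math

class BandedWaveguide:
    """One band of a banded waveguide: a delay line whose loop length is set
    by the mode frequency, in a feedback loop with a two-pole resonant
    bandpass filter tuned to that mode."""
    def __init__(self, fs, mode_hz, radius=0.999, loop_gain=0.99):
        self.delay = [0.0] * max(1, int(round(fs / mode_hz)))
        self.idx = 0
        w0 = 2.0 * math.pi * mode_hz / fs
        self.a1 = -2.0 * radius * math.cos(w0)
        self.a2 = radius * radius
        # Scale b0 so the filter's gain at the mode frequency is loop_gain;
        # keeping it below 1 keeps the feedback loop stable.
        z = cmath.exp(-1j * w0)
        self.b0 = loop_gain * abs(1.0 + self.a1 * z + self.a2 * z * z)
        self.y1 = self.y2 = 0.0

    def tick(self, excitation):
        x = excitation + self.delay[self.idx]        # add the fed-back sample
        y = self.b0 * x - self.a1 * self.y1 - self.a2 * self.y2
        self.y2, self.y1 = self.y1, y
        self.delay[self.idx] = y                     # write back into the loop
        self.idx = (self.idx + 1) % len(self.delay)
        return y

# A parallel bank, one waveguide per mode (illustrative bar-like ratios).
fs = 44100
modes = [220.0, 220.0 * 2.76, 220.0 * 5.40]
bank = [BandedWaveguide(fs, f) for f in modes]
impulse_response = [sum(wg.tick(1.0 if n == 0 else 0.0) for wg in bank)
                    for n in range(2000)]
```

Because each band is narrow, the parallel sum rings only at the specified modal frequencies, which is what makes the method so economical compared with a full waveguide mesh.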

The Fast Fourier Transform (FFT) revolutionized signal processing practice in the 1960s. Today, it continues to spread as a practical basis for digital systems implementation. Only in the past decade or so has it become cost-effective to use the short-time FFT in real-time digital audio systems, thanks to the availability of sufficiently powerful, low-cost, single-chip solutions.

In the digital audio field, FFT-based techniques are useful in digital mixing consoles, post-production editing facilities, and top-quality digital audio gear. Many music and digital audio ``effects'' can be conveniently implemented in a unified way using a short-time Fourier analysis, modification, and resynthesis facility.

In contrast with physical modeling synthesis, which models the
*source* of a sound, spectral modeling techniques model sound
at the *receiver*, the human ear. Spectral modeling is more
immediately general than physical modeling since it is capable of
constructing an arbitrary stimulus along the basilar membrane of the
ear. While complex coarticulation effects are more naturally provided
by physical models, the short-time Fourier transform can be applied to
any sound demonstrating any desired effect to determine what must
happen in a spectral sequence to produce that effect.

FFT-based techniques play an important role in (1) the practical implementation of general signal processing systems (fast convolution), (2) advanced effects such as ``cross synthesis,'' time compression/expansion, duration-invariant frequency shifting, and other ``phase vocoder'' type techniques, (3) noise reduction, (4) perceptually calibrated spectral display, (5) perceptual audio compression, and (6) novel synthesis systems based on the direct creation and transformation of spectral events and envelopes.
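Item (1), fast convolution, can be sketched via the standard overlap-add method: filter each zero-padded block in the frequency domain and sum the overlapping tails. A minimal sketch, with an arbitrary block size and an arbitrary smoothing FIR for the demonstration:

```python
import numpy as np

def fast_convolve(x, h, block=256):
    """Overlap-add fast convolution: filter x with impulse response h by
    multiplying block FFTs with the filter FFT and summing the overlapping
    tails of the inverse transforms."""
    n = 1
    while n < block + len(h) - 1:   # FFT size must hold block + filter tail
        n *= 2
    H = np.fft.rfft(h, n)
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        seg = np.fft.irfft(np.fft.rfft(x[start:start + block], n) * H, n)
        y[start:start + n] += seg[:len(y) - start]
    return y

# Equivalent to direct convolution, but O(N log N) per block.
x = np.random.default_rng(0).standard_normal(1000)
h = np.hanning(64)   # an arbitrary smoothing FIR for illustration
y = fast_convolve(x, h)
```

Choosing the FFT size to cover `block + len(h) - 1` samples is what prevents circular-convolution wraparound, the classic pitfall of frequency-domain filtering.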

**References:**

- Abe, M., and J. O. Smith, ``Design Criteria for Simple Sinusoidal Parameter Estimation based on Quadratic Interpolation of FFT Magnitude Peaks,'' pre-print 6256, Audio Engineering Society Convention, San Francisco, 2004.
- Levine, S., *Audio Representations for Data Compression and Compressed Domain Processing*, Ph.D. thesis, Electrical Engineering Department, Stanford University (CCRMA), December 1998. Available online at `http://ccrma.stanford.edu/~scottl/thesis.html`.
- Serra, X., and J. O. Smith, ``Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic plus Stochastic Decomposition,'' *Computer Music J.*, vol. 14, no. 4, pp. 12-24, Winter 1990. The latest free Spectral Modeling Synthesis (SMS) software can be obtained from the SMS home page at `http://www.iua.upf.es/~sms`.
- Smith, J. O., and X. Serra, ``PARSHL: A Program for the Analysis/Synthesis of Inharmonic Sounds Based on a Sinusoidal Representation'' (ICMC-87). Available online at `http://ccrma.stanford.edu/~jos/parshl/`.
- Smith, J. O., ``Music 421 (EE 367B) Bibliography: Audio Spectral Modeling,'' `http://ccrma.stanford.edu/CCRMA/Courses/421/References.html`.
- Smith, J. O., *Mathematics of the Discrete Fourier Transform*, `http://ccrma.stanford.edu/~jos/mdft/`.

Time-intensive, trial-and-error-based manual adjustment of control parameters poses a limitation on applied physical modeling synthesis. The goal of this study is to provide an efficient approach for automatic estimation of optimal control parameters for physical modeling synthesis. The method is based on psychoacoustically motivated timbre comparisons between a recorded reference sound and a set of corresponding synthesized sounds. The timbre comparisons are based upon the sample mean and standard deviation of Mel-Frequency Cepstral Coefficients (MFCCs) computed over several steady-state time frames of the reference and synthesized sounds.
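The comparison described above can be sketched as a distance over MFCC frame statistics. A minimal sketch under our own assumptions: the MFCC matrices (frames × coefficients) come from an external front end, and a plain Euclidean norm over the concatenated mean/std vectors stands in for the study's comparison; the function names are ours.

```python
import numpy as np

def timbre_distance(mfcc_ref, mfcc_syn):
    """Distance between two sounds from the sample mean and standard
    deviation of their MFCC frames (rows = time frames, columns =
    coefficients). The MFCC matrices are assumed to come from an
    external front end."""
    stats = lambda m: np.concatenate([m.mean(axis=0), m.std(axis=0)])
    return float(np.linalg.norm(stats(mfcc_ref) - stats(mfcc_syn)))

def best_match(mfcc_ref, candidates):
    """Index of the synthesized candidate closest in timbre to the reference."""
    return min(range(len(candidates)),
               key=lambda i: timbre_distance(mfcc_ref, candidates[i]))
```

Searching synthesis control parameters then reduces to minimizing this distance over candidate parameter settings, replacing manual trial and error.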

This research is concerned with developing a perceptual space for timbre. We define an objective metric that takes into account perceptual orthogonality, and measures the quality of timbre interpolation applicable to perceptually valid timbral sonification. We discuss two timbre representations and measure perceptual judgment. We determined that a timbre space based on Mel-frequency cepstral coefficients (MFCC) is a good model for perceptual timbre space.

**Reference:**

- Terasawa, H., Slaney, M., and Berger, J. (2005). *Perceptual distance in timbre space*. Proceedings of the International Conference on Auditory Display (ICAD 2005), Limerick, Ireland.

Onset detection and transient region identification for musical audio signals have proven to be vital components in many improved time- and pitch-scaling algorithms, which reduce artifacts due to these transient phenomena. However, these detection tasks are generally quite difficult due to the complexities of real-world musical audio signals. Fortunately, musical signals are highly structured, both at the signal level, in terms of the spectrotemporal structure of note events, and at higher levels, in terms of melody and rhythm. These structures generate context useful in predicting attributes such as pitch content, the presence and location of onsets, and the boundaries of transient regions. To this end, we propose a dynamic Bayesian framework in which contextual predictions may be integrated with signal information in order to make optimal decisions concerning these attributes. The result is a joint segmentation and melody retrieval for nominally monophonic signals (which may nevertheless contain reverberation, polyphony due to note overlaps, and even background instrumentation). The system detects note event boundaries and pitches, also yielding a frame-level sub-segmentation of these events into transient/steady-state regions. The approach is successfully applied to notoriously difficult examples such as bowed string recordings captured in highly reverberant environments.

Transient events are regions not ``well-modeled'' by a locally stationary sinusoidal model. Examples include abrupt changes or fast decays/modulations in mode amplitudes/frequencies. Transient regions usually follow onsets, which are marked by an abrupt change. This suggests a twofold approach to transient detection:

- *Segmentation:* Find boundaries of abrupt change in the underlying signal model (e.g. sinusoidal, Gaussian AR, or some physically-informed model).
- *Transient Characterization:* Classify each region depending on some predetermined cost criteria (e.g. expected bitrate).

**Segmentation - Bayesian Approach:**

- The classical approach assumes a piecewise-constant signal model with statistically independent segments. We have recently adopted a more general, unified Bayesian framework, as follows: let an indicator variable mark whether an abrupt change occurs at each time; the model for the signal is then described conditionally on this sequence of change indicators and on the segment parameters.

- This framework allows us to exploit additional prior information and information about musical structure, according to the following specifications:
  - *Prior probability of change*: This distribution may encode information about the structure of rhythm.
  - *Markov evolution of parameter jumps*: This distribution may encode information about melodic/timbral evolution.
  - *Allowance for slow parameter variations*: This distribution may be used to allow for slow, continuous variations in the model parameters within a segment, emphasizing the ``abruptness'' of change.

**Applications:**

*Joint Rhythm Tracking and Onset Detection:*

- A three-layer switching state-space model (rhythm tracker) is used to learn the pattern of onsets. The top layer encodes the discrete rhythmic interval and metrical position; the middle layer encodes the tempo and inherent onset position; the bottom layer gives the observed onset position.
- The rhythm tracker produces posterior distributions over the next segment points, given the segments observed so far. The segmenter uses these distributions as local priors to detect the next batch of segments, which in turn provide subsequent observations for the rhythm tracker. The net effect is improved segmentation performance on musically relevant changes (onsets); spurious changes are ignored or suppressed. The behavior somewhat mimics the cognitive activity of the human listener, though rigorous parallels have not yet been established.

*Harmonic Comb Models for Piano Transcription:*

- To improve segmentation performance for specific musical signals, we wish to exploit a higher degree of structure than is available from generic (unconstrained) AR or sinusoidal models. Additional structure allows us to support a high model order (a high number of modeled sinusoids), because the model is highly constrained in a probabilistic sense. The additional structure may be motivated by explicit knowledge of the physics of a particular instrument, say, the piano.

*Changeograms:*

The changeogram gives a nonparametric view of the posterior probability that an abrupt change occurs at a particular time, based only on information in local windows. Uses and properties are as follows:

- The changeogram may be peak-picked/thresholded to yield a ``quick and dirty'' estimation of change points.
- The changeogram itself serves as an ``empirical Bayes'' prior for further offline Bayesian segmentation. The inherent structural assumption is that changes are infrequent but occur in clumps.
- Changes spaced far enough apart with respect to the window size appear resolved as ``peaks'' in the representation. The height of the peaks corresponds to the intensity of the change.
- The size of the window limits resolution: When two changes are spaced at less than the window size, the change with less intensity does not survive.
- A kernel may be chosen such that peaks are dilated in the representation. For the ``empirical Bayes'' approach, the kernel expresses uncertainty that additional change points have been masked by the main peaks.
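The windowed, resolution-limited behavior described above can be illustrated with a crude stand-in: instead of the posterior change probability, the sketch below scores the disparity between mean energies of adjoining left/right windows. The function name and the energy-disparity statistic are our simplification, not the changeogram's actual computation.

```python
import numpy as np

def changeogram(x, win=50):
    """Crude nonparametric change score: at each time, the disparity between
    the mean energies of the adjoining left and right windows. Peaks mark
    likely abrupt changes; resolution is limited by the window size."""
    e = np.asarray(x, dtype=float) ** 2
    score = np.zeros(len(e))
    for t in range(win, len(e) - win):
        score[t] = abs(e[t:t + win].mean() - e[t - win:t].mean())
    return score

# A stylized amplitude step at sample 200 produces a peak there, with the
# triangular skirt (width = window size) that limits resolution.
x = [0.1] * 200 + [1.0] * 200
peak_at = int(np.argmax(changeogram(x)))
```

The triangular skirt around each peak is exactly the masking effect noted above: two changes closer together than the window size cannot both survive.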

We propose a robust F0 estimation in the presence of interference and low SNR, without the computational requirements of optimal time-domain methods. Sinusoidal peaks are extracted by a windowed STFT; the collection of peak frequencies and amplitudes drives our analysis. Given F0 and a reference amplitude, peak frequency/amplitude observations are modeled probabilistically, in a sense robust to undetected harmonics, spurious peaks, skewed peak estimates, and inherent deviations from an ideal or otherwise assumed harmonic structure. The parameters F0 and reference amplitude are estimated by maximizing the observations' likelihood, though the reference amplitude is treated as a nuisance parameter. Our model utilizes a hidden, discrete-valued descriptor variable identifying spurious/undetected peaks. The likelihood evaluation, requiring a computationally unwieldy summation over all descriptor states, is successfully approximated by an MCMC traversal chiefly amongst high-probability states.
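The robustness idea can be sketched with a simplified frequency-only score. This is not the paper's likelihood or its MCMC approximation: the saturated per-peak penalty and the missing-harmonic cost below merely mimic the effect of the hidden spurious/undetected-peak descriptor, the amplitudes are ignored, and all constants are illustrative.

```python
def f0_score(peak_freqs, f0, sigma=3.0, outlier_cost=1.0, miss_cost=0.5):
    """Robust score for a candidate F0: each observed STFT peak contributes a
    squared deviation from its nearest harmonic, saturated at outlier_cost so
    spurious peaks cannot dominate; expected harmonics with no nearby peak
    incur miss_cost (which also discourages subharmonic F0 candidates)."""
    score = 0.0
    for f in peak_freqs:
        harmonic = max(1, round(f / f0)) * f0
        dev = (f - harmonic) / sigma
        score -= min(dev * dev, outlier_cost)
    k = 1
    while k * f0 <= max(peak_freqs) + sigma:
        if not any(abs(f - k * f0) < 3.0 * sigma for f in peak_freqs):
            score -= miss_cost            # expected harmonic never observed
        k += 1
    return score

def estimate_f0(peak_freqs, candidates):
    """Grid-search maximum of the score over candidate F0 values."""
    return max(candidates, key=lambda f0: f0_score(peak_freqs, f0))

# Harmonics of 220 Hz (with small errors) plus one spurious peak at 533 Hz.
peaks = [220.4, 439.7, 660.2, 881.1, 533.0]
f0_hat = estimate_f0(peaks, [100.0 + 0.5 * k for k in range(600)])
```

Capping each peak's penalty is what keeps the spurious 533 Hz peak from dragging the estimate away from 220 Hz, analogous to the descriptor states in the full model.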

Given an array of speakers and a set of noisy inter-speaker range estimates, we consider the problem of estimating the relative positions of the array elements. A closed-form position estimator which minimizes an equation error norm is presented and shown to be related to a multidimensional scaling analysis. The information inequality is used to bound the position-estimate mean square error and to gauge the accuracy of the closed-form estimator. A geometric interpretation of the variance bound is given and used in examining our simulation results.
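The connection to multidimensional scaling can be sketched with the classical MDS construction (a related textbook method, not necessarily the paper's exact estimator): double-center the squared range matrix and factor the resulting Gram matrix.

```python
import numpy as np

def positions_from_ranges(D, dim=2):
    """Closed-form relative positions from a matrix of pairwise ranges via
    classical multidimensional scaling: double-center the squared distances
    and factor the resulting Gram matrix. Positions are recovered only up to
    rotation, translation, and reflection."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering operator
    B = -0.5 * J @ (D ** 2) @ J              # Gram matrix of centered positions
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:dim]          # keep the dim largest eigenvalues
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))

# Recover a square 2 m speaker layout from exact inter-speaker ranges.
pts = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
est = positions_from_ranges(D)
D_est = np.linalg.norm(est[:, None, :] - est[None, :, :], axis=-1)
```

With noisy ranges the same construction gives a least-squares-flavored estimate, whose accuracy the paper benchmarks against the Cramer-Rao lower bound.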

**References:**

- Jeffrey Walters, Scott Wilson, and Jonathan Abel (2004). *Speaker Array Calibration Using Inter-Speaker Range Measurements*. Presented at the 116th Convention of the Audio Engineering Society, Berlin, Germany, May 2004.
- Scott Wilson, Jeffrey Walters, and Jonathan Abel (2004). *Speaker Locations From Inter-Speaker Range Measurements: Closed-Form Estimator and Performance Relative to the Cramer-Rao Lower Bound*. Proceedings of the 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), Montreal, Canada, May 2004.

We have measured the total system latencies of MacOS 10.2.8, Red Hat Linux (2.4.25 kernel with low-latency patches), and Windows XP from stimulus in to audio out, with stimuli including analog and digital audio, and the QWERTY keyboard. We tested with a variety of audio hardware interfaces, audio drivers, buffering and related configuration settings, and scheduling modes. All measured audio latencies tracked expectedly with buffer sizes but with a consistent amount of unexplained additional latency. With analog I/O there was also a consistent additional bandwidth-dependent latency seemingly caused by hardware. Gesture tests with the QWERTY keyboard indicate discouragingly large amounts of latency and jitter, but large improvements on Linux when real-time priorities are set.

**Reference:**

- Wright, M., R. J. Cassidy and M. F. Zbyszynski (2004). ``Audio and Gesture Latency Measurements on Linux and OSX''. Proc. International Computer Music Conference, Miami, FL, ICMA: 423-429.

© Copyright 2005 CCRMA, Stanford University. All rights reserved.