MUSIC 422/EE 367C - Prof. Marina Bosi
RESEARCH AND PROGRAMMING PROJECT

Perceptual Audio Coding Based on the Sinusoidal Transform

Guillermo Garcia Juan C. Pampin
guille@ccrma.stanford.edu juan@ccrma.stanford.edu

Abstract

In this work, we have explored the sinusoidal model as a frequency-domain representation for perceptual audio coding of various types of audio signals. We have designed a set of techniques for data-rate reduction and developed a codec software prototype consisting of three basic blocks that we describe below. We have evaluated the codec on monophonic musical instruments (harmonic and inharmonic), polyphonic orchestral music, singing voice and speech. Results have been very satisfying and have shown that the sinusoidal model can achieve substantial compression factors at high quality for a wide variety of audio signals. In particular, we believe this work shows that the sinusoidal model is by no means limited to monophonic, harmonic signals when high-quality audio compression is the goal.

1. Introduction

The sinusoidal model underlies one of the most ancient techniques for sound synthesis: additive synthesis was implemented in the first analog synthesizers and is widely known by musicians. The model represents the signal as a sum of sinusoids whose frequencies and amplitudes vary over time.
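In its usual formulation (given here for reference; the notation is ours), the model writes the signal as

    s(t) = Σ_k A_k(t) cos(θ_k(t)),    θ_k(t) = θ_k(0) + 2π ∫_0^t f_k(τ) dτ

where A_k(t) and f_k(t) are the slowly varying amplitude and frequency of the k-th partial, and θ_k(t) is its accumulated phase.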

However, the use of additive synthesis for high-quality, natural-sounding synthesis was severely limited in the digital era by three main disadvantages:

1- Determining the sinusoidal parameters - frequency and amplitude - by hand, as needed to achieve natural-sounding synthesis and/or recreate musical instruments, is extremely hard and in most cases impossible.

2- The classic additive synthesizer, structured as a bank of oscillators, has a very high computational cost.

3- It is not easy to manipulate hundreds of oscillators in order to consistently control the timbre of the sound.

Problem #1 was addressed by the development of analysis techniques that automatically determine the model parameters. McAulay and Quatieri of MIT Lincoln Laboratory were the first to propose an automatic analysis technique for sinusoidal modelling of speech ([3]). This idea was taken up by the computer music community, which has developed methods suited to additive synthesis of musical sounds ([1],[2],[5],[6],[7],[8]).

Problems #2 and #3 have been addressed by the development of inverse-FFT-based synthesizers and the use of spectral envelopes ([4]).

Since then, there has been a great deal of work on sinusoidal modelling for both speech and computer music applications, including low bit-rate speech coding, speech synthesis and modification, and music synthesis. However, when high quality is the target, the use of this type of model has been limited to monophonic harmonic signals, because of the difficulty of performing a correct analysis of more complex signals.

We have spent much of the past seven years doing research and development on sinusoidal modelling techniques for music synthesis. In this work, we have used the sinusoidal representation for a different kind of application: perceptually coding and compressing a wide variety of audio signals - monophonic musical instruments (harmonic and inharmonic), polyphonic orchestral music, singing voice and speech.

Our goal has been to determine what data rates can be achieved while maintaining very high quality for all these types of signals, targeting scores close to 4.0 or higher on the ITU-R 5-point impairment scale.

We describe below the details of the sinusoidal codec that we have designed and implemented.

2. The analysis/resynthesis system

The sinusoidal analysis and additive synthesis system is the counterpart of the perfect reconstruction filter banks used in transform-based codecs.

We have used a sinusoidal analysis system based upon an HMM (Hidden Markov Model) partial-tracking technique ([2]). This technique is designed to track partials in a wide variety of sounds (e.g. inharmonic, noisy/unvoiced, polyphonic, including partial crossings). It is of course well suited for harmonic sounds as well, but makes no harmonicity hypothesis. This analysis system has proved to be very robust for a wide variety of signal types.
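The HMM tracker of [2] is too involved for a short example, but the following toy sketch (Python; all names and the 50 Hz jump limit are our own illustrative choices) shows the simpler greedy nearest-frequency matching that HMM tracking improves upon, and makes concrete what a partial trajectory is:

    import numpy as np

    def greedy_track(frames, max_jump_hz=50.0):
        """Toy partial tracker: links spectral peaks across frames by
        nearest frequency. frames is a list of 1-D arrays of peak
        frequencies (Hz), one per analysis frame; the result is a list
        of trajectories, each a list of (frame_index, freq) pairs.
        The analysis in [2] replaces this greedy rule with HMM-based
        tracking, which handles partial crossings and noisy spectra
        far more robustly."""
        tracks, active = [], []
        for i, peaks in enumerate(frames):
            unused = list(peaks)
            survivors = []
            for tr in active:
                last_freq = tr[-1][1]
                if unused:
                    j = int(np.argmin([abs(f - last_freq) for f in unused]))
                    if abs(unused[j] - last_freq) <= max_jump_hz:
                        tr.append((i, unused.pop(j)))
                        survivors.append(tr)
                        continue
                tracks.append(tr)          # no continuation: the partial dies
            for f in unused:               # leftover peaks are born as new partials
                survivors.append([(i, f)])
            active = survivors
        return tracks + active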

In particular, this technique makes it possible to represent noise as a sum of sinusoids, which is quite unusual compared with the classic filtered-white-noise model. The synthetic noise quality achieved with the sinusoidal representation is particularly good. Thus, the stochastic and deterministic parts of the signal are represented and treated in the same way.

Resynthesis is performed with a classic oscillator-bank additive synthesizer. The phase information obtained from the analysis is taken into account in the synthesis, thus reconstructing the original waveform; this is particularly important for accurately reconstructing transients. However, for many sounds the original phase information can be discarded without significant loss of quality, saving about one third of the memory space.
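As a minimal sketch of this synthesis stage (our simplification: linear interpolation of the frame parameters and free-running phase, i.e. the variant that discards the analyzed phases):

    import numpy as np

    def additive_synth(amps, freqs, hop, sr=44100.0):
        """Oscillator-bank resynthesis sketch. amps and freqs are
        (n_frames, n_partials) arrays of per-frame linear amplitudes
        and frequencies in Hz (a dead partial has zero amplitude);
        hop is the frame step in samples. Parameters are interpolated
        linearly between frames, and each oscillator's phase is
        accumulated from its interpolated frequency."""
        n_frames, n_partials = amps.shape
        out = np.zeros((n_frames - 1) * hop)
        phase = np.zeros(n_partials)            # running phase of each oscillator
        t = np.arange(hop) / float(hop)         # interpolation ramp within a frame
        for i in range(n_frames - 1):
            a = np.outer(1 - t, amps[i]) + np.outer(t, amps[i + 1])
            f = np.outer(1 - t, freqs[i]) + np.outer(t, freqs[i + 1])
            ph = phase + np.cumsum(2 * np.pi * f / sr, axis=0)
            out[i * hop:(i + 1) * hop] = np.sum(a * np.sin(ph), axis=1)
            phase = ph[-1] % (2 * np.pi)
        return out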





[Figure: Partial Trajectories (vega.hmm)]





3. Software tools and development

We have written all the encoder/decoder software for this project, leveraging some of the code that we wrote for the Music 422 homework assignments.

Our code also uses the library "Sm" (stands for "sinusoidal modelling") to manipulate partials. "Sm" and the sinusoidal analysis/resynthesis system were developed by Guillermo Garcia at IRCAM, between 1991 and 1995, as part of the work described in [1] and [2].

4. The encoder

The sinusoidal representation usually takes much more memory space - typically between 2 and 10 times more - than the original 16-bit sound file. Thus, to compete with codecs based upon critically-sampled filter banks, we must compress the sinusoidal data file by factors 2 to 10 times greater than those achieved by such systems. For example, if the sinusoidal file is 4 times larger than the PCM original, matching a 4:1 transform codec requires compressing the sinusoidal data by 16:1.

Our encoder is structured as a chain of three blocks:

1- Partial pruning based upon psychoacoustic masking.
2- Smart sinusoidal frame decimation based upon transient detection.
3- Bit allocation based upon psychoacoustic masking, and quantization.

We now describe each of these blocks in detail.

4.1 Partial pruning

This operation is performed in two steps:

1- Masking curve evaluation:

Using a sliding window of K frames - typically three to ten frames long - the partial trajectories of frequency and amplitude are averaged over time within the window. Frequencies are converted to the bark scale and amplitudes are expressed in decibels. For each position of the sliding window we compute the masking curve of the resulting "average" frame, and the masking curve value at each partial is stored in a file. This step is time-consuming.

2- Reduction of partials

A mask threshold is specified, and both the partials file and the mask file are piped into this block. Partials whose mask value is above the threshold are discarded. This step has very low complexity.

This two-step strategy lets us evaluate the mask values once and for all, and then optimize the actual partial pruning by making several passes with different mask thresholds until the optimal threshold value is found. Of course, this strategy is only appropriate at the research stage; in a fully automatic encoder the partial pruning would be performed in a single step.
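A minimal sketch of both steps follows, assuming a crude symmetric triangular spreading function in the bark domain (the 15 dB drop, the 10 dB/bark slope and the exact definition of the stored mask value are our illustrative assumptions, not the codec's actual masking model):

    import numpy as np

    def hz_to_bark(f):
        # Zwicker's approximation of the bark scale.
        return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

    def mask_values(freqs_hz, amps_db):
        """Step 1 for one 'average' frame: the mask value at each partial
        is the strongest masking contribution of all other partials,
        spread with 10 dB/bark slopes from a level 15 dB below each
        masker."""
        z = hz_to_bark(np.asarray(freqs_hz, float))
        a = np.asarray(amps_db, float)
        dz = np.abs(z[:, None] - z[None, :])      # pairwise bark distances
        spread = a[None, :] - 15.0 - 10.0 * dz    # masker j seen at partial i
        np.fill_diagonal(spread, -np.inf)         # a partial does not mask itself
        return spread.max(axis=1)

    def prune(freqs_hz, amps_db, mask_db, threshold_db=0.0):
        """Step 2: discard partials whose mask value exceeds their own
        level by more than threshold_db."""
        keep = (mask_db - np.asarray(amps_db)) <= threshold_db
        return np.asarray(freqs_hz)[keep], np.asarray(amps_db)[keep]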

4.2 Frame decimation

Since simple frame decimation by a constant factor would degrade the time resolution of the decoded signal, we have designed an algorithm that eliminates redundant frames, i.e. frames that can be well approximated by interpolation between the two surrounding frames.

In this way, the original frame rate is kept during transients in order to maintain good time resolution. During stationary parts of the signal, we take advantage of the nature of the sinusoidal representation to greatly reduce the frame rate.

This algorithm is the counterpart of the automatic window-length selection in transform-based codecs, but is much more powerful, since sinusoidal frames can be severely decimated during steady parts of the sound, reducing the data rate dramatically. Note that transform-based codecs must increase the window size in order to increase the frame step; in our codec, we have achieved frame steps of more than one hundred milliseconds.
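A greedy sketch of the redundancy test (amplitudes only, with an illustrative 0.5 dB tolerance; the codec's actual error measure, such as the amplitude continuity measurement shown below, also accounts for frequency trajectories and transient detection):

    import numpy as np

    def decimate_frames(amps_db, times, tol_db=0.5):
        """Returns the indices of the frames to keep. amps_db is an
        (n_frames, n_partials) array of partial amplitudes in dB and
        times a 1-D array of frame times. Starting from each kept
        frame, the gap to the next kept frame is extended as long as
        every skipped frame is reproduced within tol_db by linear
        interpolation between the two kept neighbours."""
        keep, i, n = [0], 0, len(amps_db)
        while i < n - 1:
            j = i + 1
            while j + 1 < n:
                t = (times[i + 1:j + 1] - times[i]) / (times[j + 1] - times[i])
                interp = (1 - t)[:, None] * amps_db[i] + t[:, None] * amps_db[j + 1]
                if np.max(np.abs(interp - amps_db[i + 1:j + 1])) > tol_db:
                    break
                j += 1
            keep.append(j)
            i = j
        return keep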





[Figure: Amplitude Continuity Measurement (vega.error)]





4.3 Quantization

Frequencies and amplitudes are quantized in a floating-point format (scale factor plus mantissa); phases are uniformly quantized.
Bit allocation is based upon psychoacoustic masking. Scale factors have a fixed number of bits.
The mantissa length Rm is calculated as:

Rm = R + [20 log10(amplitude / mask)] / K

where R is the average number of mantissa bits and K is the number of decibels per additional bit.
R is larger for frequencies than for amplitudes, because the ear is more sensitive to frequency modulation than to amplitude modulation.
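In code, with amplitude and mask both expressed in dB (so that 20 log10(amplitude/mask) = amp_db - mask_db), the allocation reduces to the following sketch; r_avg, the 6 dB-per-bit default and the clipping range are illustrative values, not the codec's:

    import numpy as np

    def mantissa_bits(amp_db, mask_db, r_avg=4.0, k_db_per_bit=6.0,
                      r_min=0, r_max=12):
        """Rm = R + [20 log10(amplitude / mask)] / K, evaluated in dB:
        partials far above their mask get more mantissa bits, heavily
        masked partials get fewer. The result is rounded and clipped
        to a legal bit-count range."""
        r = r_avg + (np.asarray(amp_db) - np.asarray(mask_db)) / k_db_per_bit
        return np.clip(np.round(r), r_min, r_max).astype(int)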





[Figure: Encoder Block Diagram and Compression Ratios for different types of sounds]





5. Complexity

The sinusoidal encoder is very sophisticated and thus has extremely high computational complexity. The decoder, on the other hand, can easily be implemented in real time; it has two stages: dequantization and additive synthesis.

6. Results

We have achieved significant compression ratios (up to 14:1) while maintaining high quality. The codec is robust enough to deal with a wide variety of sounds; sound examples will be played in our presentation. We will pursue further research to improve the codec.

7. References

  1. Depalle, Ph., G. Garcia and X. Rodet. 1993. "Analysis of Sound for Additive Synthesis: Tracking of Partials Using Hidden Markov Models." Proceedings of the 1993 International Computer Music Conference. San Francisco: Computer Music Association.

  2. Garcia, G. 1992. "Analyse des Signaux Sonores en Termes de Partiels et de Bruit. Extraction Automatique des Trajets Fréquentiels par des Modèles de Markov Cachés." Mémoire de DEA en Automatique et Traitement du Signal, Orsay, 1992.

  3. McAulay, R.J. and T.F. Quatieri. 1986. "Speech Analysis/Synthesis based on a Sinusoidal Representation." IEEE Transactions on Acoustics, Speech and Signal Processing 34(4):744--754.

  4. Rodet, X. and P. Depalle. 1992. "Spectral Envelopes and Inverse FFT Synthesis." 93rd Convention of the Audio Engineering Society. San Francisco, October 1992.

  5. Serra, X. 1989. A System for Sound Analysis/Transformation/Synthesis based on a Deterministic plus Stochastic Decomposition. Ph.D. Dissertation, Stanford University.

  6. Serra, X. and J. Smith. 1990. "Spectral Modeling Synthesis: A Sound Analysis/Synthesis System based on a Deterministic plus Stochastic Decomposition." Computer Music Journal 14(4):12--24.

  7. Serra, X. 1994. "Residual Minimization in a Musical Signal Model based on a Deterministic plus Stochastic Decomposition." Journal of the Acoustical Society of America 95(5-2):2958--2959.

  8. Smith, J.O. and X. Serra. 1987. "PARSHL: An Analysis/Synthesis Program for Non-Harmonic Sounds based on a Sinusoidal Representation." Proceedings of the 1987 International Computer Music Conference. San Francisco: Computer Music Association.

  9. Levine, S. and J.O. Smith III. 1998. "A Sines+Transients+Noise Audio Representation for Data Compression and Time/Pitch-Scale Modifications." 105th Convention of the Audio Engineering Society. San Francisco, 1998.