The tracker algorithm implements high-resolution sinusoidal analysis suitable for both harmonic and non-harmonic sounds. The technique used for tracking partials is similar to the deterministic analysis of Xavier Serra's SMS (Spectral Modeling Synthesis). One difference from SMS's deterministic analysis is that tracker also uses psychoacoustic information to determine the salience of detected peaks. This information (measured as signal-to-mask ratio, or SMR) is derived from masking effects produced within critical bands, and accounts for the audibility of sinusoidal trajectories. To achieve coherent sinusoidal trajectories, both SMR and frequency deviation information are used to track partials across frames.
Tracker consists of five main modules. The windowing module breaks the analyzed sound into short-time overlapping segments and applies an analysis window to the signal. Windows from the Blackman-Harris family are normally used, but any other window type implemented by CLM can be used (see the documentation section below). The hop size (the number of samples skipped by the analysis window) is expressed as a proportion of the window size (0.25 means 1/4 of the window). The size of the analysis window is calculated as a function of the number of cycles of the lowest frequency to be tracked by the system (usually the fundamental in the case of harmonic sounds). The size of the Fast Fourier Transform (FFT), used to compute the Short Time Fourier Transform (STFT) in the analysis, is internally calculated as the closest power of two greater than two windows, assuring enough zero padding. Both the window size (M) and the FFT size (N) can be forced to any number of samples, provided that M<=N and N is a power of 2.
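The window, FFT, and hop sizes described above can be sketched in plain Common Lisp as follows. This is only an illustration: the helper names and the rounding choices (floor for M, smallest power of two not below 2M for N) are assumptions, not taken from the ATS source.

```lisp
;;; Sketch only: helper names and rounding choices are assumptions,
;;; not the ATS implementation.

(defun next-power-of-2 (n)
  "Smallest power of two that is >= N (N a positive integer)."
  (expt 2 (integer-length (1- n))))

(defun window-size (sampling-rate lowest-frequency window-cycles)
  "Window length M in samples: WINDOW-CYCLES periods of LOWEST-FREQUENCY."
  (values (floor (* window-cycles sampling-rate) lowest-frequency)))

(defun fft-size (m)
  "FFT length N: a power of two covering at least two windows of M samples."
  (next-power-of-2 (* 2 m)))

(defun hop-samples (m hop-size)
  "Hop in samples for HOP-SIZE given as a proportion of the window size M."
  (values (floor (* m hop-size))))
```

With a sampling rate of 44100 Hz, a lowest frequency of 100 Hz, and 4 window cycles, this sketch gives M = 1764, N = 4096, and a hop of 441 samples for a hop size of 1/4.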
The spectrum issued from the Short Time Fourier Transform (STFT) of the windowed signal is converted to polar form to obtain the magnitude and phase of each bin. After this, peaks are detected along the dB magnitude spectrum. A peak is a local maximum in the spectrum, defined as |X[k-1]| < |X[k]| > |X[k+1]|, where X[k] is the peak bin and X[k-1], X[k+1] are its neighboring bins. Once a peak is detected, its real amplitude, frequency, and phase values are obtained by means of parabolic interpolation. Only peaks with magnitude above an indicated threshold are kept in the analysis. For more details on these peak detection and interpolation techniques you can visit Julius Smith's web page on PARSHL.
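The peak detection and parabolic interpolation steps can be illustrated with the sketch below. It uses the standard three-point parabolic fit on dB magnitudes; the function names are hypothetical, and applying the threshold to the interpolated height (rather than to the raw bin) is an assumption. Phase interpolation is left out.

```lisp
;;; Sketch only: names are hypothetical, phase is not handled.

(defun parabolic-peak (a b c)
  "Fit a parabola through the dB magnitudes A, B, C of bins k-1, k, k+1.
Returns the fractional bin offset of the vertex and its height in dB."
  (let ((p (/ (* 0.5d0 (- a c)) (+ a (* -2 b) c))))
    (values p (- b (* 0.25d0 (- a c) p)))))

(defun detect-peaks (db-mags sampling-rate fft-size lowest-magnitude-db)
  "Return a list of (frequency-in-Hz . magnitude-in-dB) pairs, one for each
local maximum of the vector DB-MAGS whose interpolated height is above
LOWEST-MAGNITUDE-DB."
  (loop for k from 1 below (1- (length db-mags))
        for a = (aref db-mags (1- k))
        for b = (aref db-mags k)
        for c = (aref db-mags (1+ k))
        when (and (< a b) (> b c))
          append (multiple-value-bind (offset height) (parabolic-peak a b c)
                   (when (> height lowest-magnitude-db)
                     (list (cons (/ (* (+ k offset) sampling-rate) fft-size)
                                 height))))))
```

Because the local-maximum test is strict on both sides, the denominator of the parabolic fit is always negative, so no division by zero can occur inside `detect-peaks`.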
Peaks detected in one analysis frame have to be integrated into sinusoidal trajectories. This is done in three steps. First, candidates to continue a particular trajectory are found in the peak pool of the new frame. Second, the best candidate is chosen based on masking and frequency information. Each trajectory's frequency and SMR are averaged across frames, and these averaged values, called tracks, are used to evaluate which peak of the pool best continues the trajectory. An adjustable number of frames is used to average track values; the latest peak incorporated into the trajectory can also be used (in a weighted fashion) to compute track parameters (this can be useful for tracking unstable sinusoids, see the documentation below). The best candidate is the peak with minimal SMR difference and frequency deviation from the track (the contribution of masking information to this process can also be weighted, see the documentation below). Taking both parameters into account practically eliminates conflicts between tracks (i.e. more than one track claiming the same peak). Finally, tracked peaks are incorporated into their sinusoidal trajectories, trajectories that did not incorporate a peak in this frame are "turned off" (their tracks keep their last values and wait for candidate peaks in subsequent frames), and peaks left over in the pool "start up" new trajectories and tracks.
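The candidate selection step can be sketched as a small cost minimization. The cost formula below (relative frequency deviation plus a weighted relative SMR difference) is an assumption made for illustration, as are the `peak` structure, the function names, and the reading of frequency-deviation as a proportion of the track frequency; the ATS source may combine these terms differently.

```lisp
;;; Sketch only: the cost formula and all names are assumptions.

(defstruct peak frq smr)

(defun candidate-cost (peak track-frq track-smr alpha)
  "Weighted distance between PEAK and a track. ALPHA weights the SMR term
(0 means frequency only)."
  (let ((delta-frq (/ (abs (- (peak-frq peak) track-frq)) track-frq))
        (delta-smr (/ (abs (- (peak-smr peak) track-smr))
                      (max 1d-6 (abs track-smr)))))
    (/ (+ delta-frq (* alpha delta-smr)) (+ 1 alpha))))

(defun best-candidate (peaks track-frq track-smr frequency-deviation alpha)
  "Return the peak of the list PEAKS that best continues the track, or NIL
when no peak lies within FREQUENCY-DEVIATION (a proportion of TRACK-FRQ)."
  (let ((best nil)
        (best-cost nil))
    (dolist (peak peaks best)
      (when (<= (abs (- (peak-frq peak) track-frq))
                (* frequency-deviation track-frq))
        (let ((cost (candidate-cost peak track-frq track-smr alpha)))
          (when (or (null best-cost) (< cost best-cost))
            (setf best peak
                  best-cost cost)))))))
```

With ALPHA set to 0 the closest peak in frequency wins; raising ALPHA lets a peak that is slightly further away in frequency, but closer in SMR, take over.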
Signal to Mask Ratio (SMR) information used during tracking is computed by the Psychoacoustic Processing module. The magnitudes of the peaks present in a frame are used to evaluate a masking threshold across frequency. All peak frequencies are converted to Bark scale, and linear masking curves are traced up and down in frequency from each peak location (note the asymmetry between the left and right slopes of the lines on the picture: the right slope is inversely proportional to the magnitude of the peak). After the masking curves for all peaks are traced, they are combined to create a masking threshold across the 25 critical bands (from 1 to 25 Barks). Then the SMR for each peak is computed as the ratio in dB between the peak's magnitude and the level of the masking threshold at the peak's location.
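A much simplified sketch of the SMR computation follows. It is not the ATS masking model: the slopes are constant (the level-dependent right slope is ignored), the curves are combined by taking their maximum, and a flat threshold floor is used, all of which are assumptions chosen to keep the example short. Peak positions are given in Barks and levels in dB.

```lisp
;;; Sketch only: constant slopes, max-combination and the threshold floor
;;; are simplifying assumptions, not the ATS masking model.

(defun masking-threshold-at (i barks levels lower-slope upper-slope floor-db)
  "Masking threshold in dB at peak I produced by all the other peaks.
Slopes are in dB per Bark."
  (let ((threshold floor-db))
    (dotimes (j (length barks) threshold)
      (unless (= i j)
        (let* ((distance (- (aref barks i) (aref barks j)))
               ;; Peak I above masker J: use the masker's upward slope
               (slope (if (plusp distance) upper-slope lower-slope)))
          (setf threshold
                (max threshold
                     (- (aref levels j) (* slope (abs distance))))))))))

(defun peak-smrs (barks levels
                  &key (lower-slope 27d0) (upper-slope 15d0) (floor-db 0d0))
  "Vector of SMR values in dB, one per peak: peak level minus the masking
threshold at the peak's location."
  (let ((smrs (make-array (length barks))))
    (dotimes (i (length barks) smrs)
      (setf (aref smrs i)
            (- (aref levels i)
               (masking-threshold-at i barks levels
                                     lower-slope upper-slope floor-db))))))
```

In this sketch a negative SMR means that the peak lies below the masking threshold produced by its neighbors.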
After the partials have been tracked, an ATS-SOUND structure is created to store the spectral data. In a post-processing stage, short trajectories with a low average SMR are removed and gaps in continuous trajectories are fixed. Also in this step, the frequency centroid and average SMR of each partial are computed and stored in the structure.
Once sinusoidal trajectories are fixed and stored, the residual is computed. This step is performed by re-synthesizing the tracked partials using phase information, and subtracting them from the original sound in the time domain. The resulting signal is called the residual and contains what was left out by the tracking process, usually noise. A two-channel file with the re-synthesis and the residual is generated by the system, and can be used as an intuitive measure of the sinusoidal tracking quality. Normally, a noisy and low-energy residual is a sign of successful tracking (see the documentation below).
After being computed, the residual is analyzed at frame rate using a sliding rectangular window and the STFT. The frequency spectrum of the residual is transformed to Bark scale and the energy is computed at each of the 25 critical bands. The residual's energy is then re-injected as modulated narrow-bandwidth noise into the partials present at each sub-band of the spectrum. Band regions with significant energy where no partials were tracked are kept in a complementary model. Modulated critical-bandwidth noise is used to model the residual energy in those remaining sub-band regions.
tracker file snd &key (start 0.0)(duration nil)(lowest-frequency 20)(highest-frequency 20000.0)(frequency-deviation 0.1)(window-cycles 4)(force-M NIL)(window-type 'blackman-harris-4-1)(force-window NIL)(hop-size 1/4)(fft-size nil)(lowest-magnitude (db-amp -60))(track-length 3)(last-peak-contribution 0.0)(SMR-continuity 1.0)(amp-threshold nil)(min-segment-length 3)(residual "ats-residual.snd")(verbose nil)
;;; clarinet analysis
(tracker (concatenate 'string *ats-snd-dir* "clarinet.aif")
         'cl
         :start 0.0
         :hop-size 1/4
         :lowest-frequency 100.0
         :highest-frequency 20000.0
         :frequency-deviation 0.05
         :lowest-magnitude (db-amp -70)
         :SMR-continuity 0.7
         :track-length 6
         :min-segment-length 3
         :residual "/tmp/cl-res.snd"
         :verbose nil)

;;; crotale analysis
(tracker (concatenate 'string *ats-snd-dir* "crt-cs6.snd")
         'crt-cs6
         :start 0.1
         :lowest-frequency 500.0
         :highest-frequency 20000.0
         :frequency-deviation 0.15
         :window-cycles 4
         :window-type 'blackman-harris-4-1
         :hop-size 1/8
         :lowest-magnitude (db-amp -90)
         :amp-threshold -80
         :track-length 6
         :min-segment-length 3
         :last-peak-contribution 0.5
         :SMR-continuity 0.3
         :residual "/tmp/crt-cs6-res.snd"
         :verbose nil)
Parameters passed to the transformation functions can, most of the time, take any of these forms:
Transpose the frequencies of the partials of a sound.
;;; Note:
;;; (ats-sound-partials my-sound) returns the number
;;; of partials of the sound structure my-sound. Here we
;;; are transposing the even and odd partials using different
;;; envelopes. The loop macro is creating the list of
;;; envelopes transp-env that we use in the call to the function.
(let ((transp-env (loop for i from 0 below (ats-sound-partials my-sound)
                        with even-env = '(0 1.0 1 2.0)
                        with odd-env = '(0 1.0 1 0.5)
                        collect (if (oddp i) odd-env even-env))))
  (trans-sound 'my-sound transp-env :formants T :name 'my-new-sound :simp T))
Performs time stretching over the partials of a sound.
;;; Note:
;;; stretch-env is a list with stretch factors going from 1.0
;;; for the first partial up to 8.2 for the last partial. After
;;; stretching, higher partials are longer than lower partials.
;;; As we apply stretch-sound to my-new-sound, the
;;; transformation will be cumulative, being the output sound
;;; structure my-new-sound-1 the result of stretching the
;;; previously transposed version of the original my-sound.
(let* ((par (ats-sound-partials my-new-sound))
       (stretch-env (loop for i from 0 below par
                          for j from 1 by (/ 8.0 par)
                          collect j)))
  (stretch-sound 'my-new-sound stretch-env :name 'my-new-sound-1))
Operates frequency shifting over the frequencies of the partials of a sound.
This picture shows the shifted voice spectra generated with ATS for Jonathan Harvey's piece Ashes Dance Back, for choir and electronic sounds.
;;; Note:
;;; (ats-sound-frq-av my-sound) returns a vector
;;; containing frequency centroids of the partials of my-sound.
;;; Here we are shifting the even partials
;;; up by 1/8 of their frequency and the odd partials down by
;;; 1/8 of their frequency centroid. The loop macro is creating
;;; the list of shift values shift-env that we use in the
;;; call to the function.
;;; frq is bound with "for" (not "with") so that it is
;;; re-read for every partial i.
(let ((shift-env (loop for i from 0 below (ats-sound-partials my-sound)
                       for frq = (aref (ats-sound-frq-av my-sound) i)
                       collect (if (oddp i) (* 1/8 frq) (* -1/8 frq)))))
  (shift-sound 'my-sound shift-env :formants T :name 'my-new-sound-2 :simp T))
Transformation functions are built using ATS's API. The API functions and macros give easy access to the spectral data, so users can develop their own transformation algorithms.
Returns a vector with amplitude data for partial
Returns a vector with frequency data for partial
Returns a vector with time data for partial
Returns a vector with noise energy data for partial
A more general interface wrapping the previous set of macros. Returns a vector with parameter (amp, frq, time, energy) data for partial
Returns amplitude value for partial at frame
Returns frequency value for partial at frame
Returns time value for partial at frame
Returns noise energy value for partial at frame
Returns noise energy value for band at frame
Returns amplitude value for partial at a fractional frame using linear interpolation.
Returns frequency value for partial at a fractional frame using linear interpolation.
Returns time value for partial at a fractional frame using linear interpolation.
Returns noise energy value for partial at a fractional frame using linear interpolation.
Returns noise energy value for band at a fractional frame using linear interpolation.
A more general interface wrapping the previous set of macros. Returns parameter (amp, frq, time, energy, or band energy) for partial or band at fractional frame.
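The fractional-frame lookups above all rely on linear interpolation between two neighboring frames. A minimal standalone sketch, working on a plain vector rather than on an ATS-SOUND, could look like this (the function name is hypothetical, and a non-negative frame is assumed):

```lisp
;;; Sketch only: operates on a plain vector, not on an ATS-SOUND.

(defun value-at-frame (data frame)
  "Linearly interpolate the vector DATA at the fractional index FRAME
(FRAME >= 0). Indexes past the last frame return the last value."
  (multiple-value-bind (index fraction) (floor frame)
    (let ((last-index (1- (length data))))
      (if (>= index last-index)
          (aref data last-index)
          (+ (aref data index)
             (* fraction (- (aref data (1+ index))
                            (aref data index))))))))
```

For example, `(value-at-frame #(0.0 10.0 20.0) 1.25)` lies a quarter of the way between frames 1 and 2.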
The following set of functions should be used to access data from sounds with non-linear time structure alterations (such as those performed by stretch-sound using envelopes as stretch factor). These functions are less efficient than the previous ones because data has to be interpolated using time information instead of frame locations.
Returns amplitude value for partial at time using linear interpolation.
Returns frequency value for partial at time using linear interpolation.
Returns noise energy value for partial at time using linear interpolation.
Returns noise energy value for band at time using linear interpolation.
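The extra cost of the time-based functions comes from having to locate the two frames that bracket the requested time before interpolating. The sketch below shows the idea on plain vectors; the function name is hypothetical, a simple linear search is used for clarity, and the time vector is assumed to be strictly increasing.

```lisp
;;; Sketch only: plain vectors, linear search, TIMES strictly increasing.

(defun value-at-time (times data time)
  "Linearly interpolate DATA at TIME, where TIMES holds the time of each
frame. Times outside the covered range are clamped to the end values."
  (let ((last-index (1- (length times))))
    (cond ((<= time (aref times 0)) (aref data 0))
          ((>= time (aref times last-index)) (aref data last-index))
          (t
           ;; HI is the first frame located after TIME, LO the one before it
           (let* ((hi (position-if (lambda (x) (> x time)) times))
                  (lo (1- hi))
                  (fraction (/ (- time (aref times lo))
                               (- (aref times hi) (aref times lo)))))
             (+ (aref data lo)
                (* fraction (- (aref data hi) (aref data lo)))))))))
```

A binary search over the time vector would be the natural refinement when sounds have many frames.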
The slots of an ATS-SOUND are accessible both in Lisp and in CLM's run-loop (see the Synthesis section below). Analysis information (such as number of frames and partials) is stored in the structure together with spectral data. To access slot values, Lisp accessor functions should be used. Accessor function names have the ats-sound- prefix followed by the name of the slot, for instance to access the frames slot of a sound called my-sound:
(ats-sound-frames my-sound)
Spectral data can also be dereferenced using Lisp's aref function:
(aref (ats-sound-frq my-sound) 0)
(aref (aref (ats-sound-frq my-sound) 0) 3)
In the first case we access the frequency vector of partial 0 (this is equivalent to (get-par-frq my-sound 0)); in the second case we access the frequency value of partial 0 at frame 3 (this is equivalent to (get-frq my-sound 0 3)).
The following list describes the ATS-SOUND structure slots:
(frq-to-bark 1000.0) -> 9.520021
To find which band a particular frequency falls in, you can use the find-band function:
(find-band 1000.0) -> 8
Note that the Bark scale is defined from 1 to 25, but ATS band numbers go from 0 to 24; that is why 1000Hz falls into band 8 and not 9. To get the edges and center frequency of an ATS band you can use the band-edges and band-center macros:
(band-edges 8) -> (920.0 1080.0)
(band-center 8) -> 1000.0
The macro band-partials (band-partials band sound frame) returns a list with the numbers of the partials present in a particular band at a particular frame:
(band-partials 8 my-sound 40) -> (10 11 12)
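The band lookup and Bark conversion can be sketched with a table of critical-band edges. The table below holds the commonly used critical-band edge frequencies and is an assumption (only the 920.0 and 1080.0 edges of band 8 are confirmed by the text above), as are the function names and the logarithmic interpolation inside a band, which was chosen because it gives a value close to the 9.520021 reported for 1000 Hz.

```lisp
;;; Sketch only: the edge table, the names and the interpolation rule
;;; are assumptions, not the ATS implementation.

(defparameter *band-edges*
  #(0.0 100.0 200.0 300.0 400.0 510.0 630.0 770.0 920.0 1080.0 1270.0
    1480.0 1720.0 2000.0 2320.0 2700.0 3150.0 3700.0 4400.0 5300.0
    6400.0 7700.0 9500.0 12000.0 15500.0 20000.0)
  "Edges in Hz of 25 critical bands (26 values).")

(defun find-band-sketch (frq)
  "Zero-based index of the band containing FRQ, or NIL outside the table."
  (loop for band from 0 below (1- (length *band-edges*))
        when (and (>= frq (aref *band-edges* band))
                  (< frq (aref *band-edges* (1+ band))))
          return band))

(defun frq-to-bark-sketch (frq)
  "Bark value (1 to 26) of FRQ: band number plus one, plus the position of
FRQ inside the band on a logarithmic frequency axis."
  (let ((band (find-band-sketch frq)))
    (when band
      (let ((lo (aref *band-edges* band))
            (hi (aref *band-edges* (1+ band))))
        (+ 1 band
           (if (zerop lo)
               (/ frq hi)               ; linear inside the first band
               (/ (log (/ frq lo)) (log (/ hi lo)))))))))
```

The first band starts at 0 Hz, where a logarithmic position is undefined, so the sketch falls back to linear interpolation there.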
ATS-HEADER
ATS-FRAME-#0
...
ATS-FRAME-#N-1
for N frames of data. An ATS-HEADER contains the following data:
The global parameter *ats-magic-number* has a default value of 123.0; it is used only as a data sanity test (byte endianness). An ATS-FRAME can be of the following four types:
time (frame starting time)
amp (par#0 amplitude)
frq (par#0 frequency)
...
amp (par#N-1 amplitude)
frq (par#N-1 frequency)
for N partials.
time (frame starting time)
amp (par#0 amplitude)
frq (par#0 frequency)
pha (par#0 phase)
...
amp (par#N-1 amplitude)
frq (par#N-1 frequency)
pha (par#N-1 phase)
for N partials.
time (frame starting time)
amp (par#0 amplitude)
frq (par#0 frequency)
...
amp (par#N-1 amplitude)
frq (par#N-1 frequency)
energy (band#0 energy)
...
energy (band#24 energy)
for N partials and 25 critical bands.
time (frame starting time)
amp (par#0 amplitude)
frq (par#0 frequency)
pha (par#0 phase)
...
amp (par#N-1 amplitude)
frq (par#N-1 frequency)
pha (par#N-1 phase)
noise (band#0 energy)
...
noise (band#24 energy)
for N partials and 25 critical bands.
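The four layouts differ only in whether phase and noise data are present, so the number of values stored per frame follows directly from the frame type and the number of partials. The sketch below assumes the four types are numbered 1 to 4 in the order listed above; the function and variable names are hypothetical.

```lisp
;;; Sketch only: type numbering 1..4 in the order listed is an assumption.

(defparameter *critical-bands* 25
  "Number of critical bands stored in frames that carry noise data.")

(defun values-per-frame (frame-type partials)
  "Number of values in one ATS-FRAME of FRAME-TYPE (1 to 4) holding
PARTIALS partials: one time value, then amp and frq (and pha for types
2 and 4) per partial, then 25 band energies for types 3 and 4."
  (let ((per-partial (if (member frame-type '(2 4)) 3 2))
        (noise-values (if (member frame-type '(3 4)) *critical-bands* 0)))
    (+ 1 (* per-partial partials) noise-values)))
```

For a sound with 10 partials this gives 21, 31, 46, and 56 values per frame for the four types respectively.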
The ats-save function saves an ATS-SOUND to disk:
ats-save sound file &key (save-phase T)(save-noise T)
The ats-load function loads a sound from disk into the system:
ats-load file sound &key (dist-energy T)
;;; saving sound with both phase and noise information
(ats-save my-sound "/tmp/my-sound.ats")
;;; saving sound with no phase information
(ats-save my-sound "/tmp/my-sound-no-pha.ats" :save-phase NIL)
;;; loading a file from disk
(ats-load "/tmp/my-sound.ats" 'my-new-sound)
Performs additive synthesis using oscillators (only sinusoidal components are synthesized).
;;; synthesize all partials of a clarinet
(with-sound (:play nil :output "/tmp/cl-1.snd" :srate 44100
             :statistics t :verbose t)
  (sin-synth 0.0 cl))

;;; synthesize only odd partials
(with-sound (:play nil :output "/tmp/cl-2.snd" :srate 44100
             :statistics t :verbose t)
  (sin-synth 0.0 cl
             :par (loop for i from 1 by 2 below (ats-sound-partials cl)
                        collect i)))

;;; transpose a semitone up during synthesis
(with-sound (:play nil :output "/tmp/cl-3.snd" :srate 44100
             :statistics t :verbose t)
  (sin-synth 0.0 cl :frq-scale (expt 2 1/12)))

;;; expand 4 times
(with-sound (:play nil :output "/tmp/cl-4.snd" :srate 44100
             :statistics t :verbose t)
  (sin-synth 0.0 cl :duration (* (ats-sound-dur cl) 4)))
General Purpose ATS Synthesizer. This instrument synthesizes both sinusoids and noise. The noise part can contain the partials' energy only (band-noise NIL), or both the partials' energy and the complementary critical-band energy (if it exists). Time information can be handled in two ways: using time information from partials (time-ptr NIL), or using a time-pointer envelope. In time-pointer mode, X values of the time-ptr envelope are proportional times in the ATS sound (1.0=ats-sound-dur) and Y values are proportional times in the output sound (1.0=duration).
;;; plain resynthesis (sines plus noise) using time pointer
(with-sound (:play nil :output "/tmp/cl-5.snd" :srate 44100
             :statistics t :verbose t)
  (sin-noi-synth 0.0 cl :time-ptr '(0 0 1 1)))

;;; plain resynthesis (noise only)
(with-sound (:play nil :output "/tmp/cl-6.snd" :srate 44100
             :statistics t :verbose t)
  (sin-noi-synth 0.0 cl :time-ptr '(0 0 1 1) :noise-only t))

;;; using time pointer to modify the attack
(with-sound (:play nil :output "/tmp/cl-7.snd" :srate 44100
             :statistics t :verbose t)
  (sin-noi-synth 0.0 cl :time-ptr '(0.0 0.0 0.5 0.1 0.7 0.7 1.0 1.0)))

;;; play backwards and gradually adding noise
(with-sound (:play nil :output "/tmp/cl-8.snd" :srate 44100
             :statistics t :verbose t)
  (sin-noi-synth 0.0 cl
                 :time-ptr '(0.0 1.0 0.9 0.3 1.0 0.0)
                 :noise-env '(0.0 0.0 0.9 1.0 1.0 1.0)
                 :amp-env '(0 0 0.1 0 0.9 1 1 1)))
(Temporary sources for CMUCL: ftp://ccrma-ftp.stanford.edu/pub/Lisp/ATS/ATS-1.0-CMUCL.tar.gz)
Read the README file coming with the distribution for installation details.