
Machine Recognition in Music




Optical Recognition of Printed Music: A New Approach

Walter Hewlett

Recent projects in optical recognition of printed music have tended to give top priority to the extraction of pitch symbols (i.e., noteheads). Noteheads give some information about duration (i.e., they are filled or unfilled), but definitive information also requires the accurate reading of stems, flags, and beams. Symbols for articulation (staccato marks, dynamics, slurs, and so forth) are sometimes ignored if the intended use of the scanned material is in sound applications.

In an effort to create a scanning front-end for the CCARH databases of classical music, which are stored in an attribute-rich format (MuseData) to support notation, sound, and analysis, we have taken the following approach: large objects are identified first. This clarifies contextual properties that may bear on pitch (key signatures, clef changes, octave-transposition signs), duration (beams, stems, and flags), and articulation (slurs, ornaments, etc.). The pitch content of the notehead is the last item to be recognized and completes the representation.
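A minimal sketch of this ordering is given below, with each recognition stage written as a placeholder stub; the function names and returned structures are illustrative assumptions, not the actual CCARH recognition code.

    # Illustrative sketch of the "large objects first" ordering; the detector
    # functions are placeholder stubs, not the actual recognition code.

    def find_large_objects(image):
        """Staves, beams, and slurs are located first, since they establish context."""
        return {"staves": [], "beams": [], "slurs": []}  # stub

    def find_context_symbols(image, large_objects):
        """Key signatures, clef changes, octave-transposition signs, stems, flags,
        and ornaments, all of which bear on how later symbols are interpreted."""
        return {"clefs": [], "key_signatures": [], "octave_signs": [],
                "stems": [], "flags": [], "ornaments": []}  # stub

    def find_noteheads(image, large_objects, context_symbols):
        """Noteheads are read last; pitch and duration are resolved against the
        clef, key signature, octave signs, and beam/stem/flag context."""
        return []  # stub

    def recognize_page(image):
        large = find_large_objects(image)
        context = find_context_symbols(image, large)
        notes = find_noteheads(image, large, context)
        return {"large_objects": large, "context": context, "notes": notes}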

Realtime Chord Recognition of Musical Sound: A System Using Common Lisp Music

Takuya Fujishima

I designed an algorithm to recognize musical chords from an input signal of musical sound. The key elements of the algorithm are:

  1. use of a ``pitch class profile (PCP)'', an intensity map of the twelve semitone pitch classes;
  2. numerical pattern matching between the PCP and built-in ``chord-type templates'' to determine the most likely root and chord type (a minimal sketch appears below).
I introduced two major heuristics and some other improvements to Marc Leman's ``Simple Auditory Model'' to make it practical.
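
The two elements listed above can be illustrated with a short sketch. The following Python fragment is only an illustration of the general idea, not the Common Lisp Music implementation; the tuning reference, the three example templates, and the inner-product scoring are assumptions made for brevity (the actual system used 27 chord types plus the two heuristics mentioned above).

    import numpy as np

    A4 = 440.0  # reference tuning (assumption)

    def pitch_class_profile(magnitudes, sample_rate, n_fft):
        """Fold spectral energy into an intensity map of the twelve pitch classes."""
        pcp = np.zeros(12)
        for k in range(1, n_fft // 2):
            freq = k * sample_rate / n_fft
            # map the bin frequency to a semitone pitch class (A = pitch class 9)
            pc = (int(round(12 * np.log2(freq / A4))) + 9) % 12
            pcp[pc] += magnitudes[k] ** 2
        norm = np.linalg.norm(pcp)
        return pcp / norm if norm > 0 else pcp

    # Three illustrative chord-type templates (intervals above the root, in semitones).
    TEMPLATES = {"maj": (0, 4, 7), "min": (0, 3, 7), "dom7": (0, 4, 7, 10)}

    def recognize_chord(pcp):
        """Match the PCP against every template at every root; return the best fit."""
        best = (None, None, -1.0)
        for root in range(12):
            for name, intervals in TEMPLATES.items():
                template = np.zeros(12)
                for interval in intervals:
                    template[(root + interval) % 12] = 1.0
                score = float(pcp @ template)  # simple inner-product match
                if score > best[2]:
                    best = (root, name, score)
        return best  # (root pitch class, chord type, score)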

I implemented the algorithm using Common Lisp Music. The system worked continuously in realtime on a Silicon Graphics O2 workstation and on Intel PC platforms running Linux. It could take input sound from the audio input or from a sound file, and could display the recognition results on the fly.

In the experiments, I first used a pure tone and three other timbres from electronic musical instruments to estimate the potential capability of the system. The system could distinguish all 27 built-in chord types for chord tones played in the above-mentioned timbres. I then fed the system a 50-second audio excerpt from the opening theme of Smetana's Moldau. Without the two heuristics, the accuracy remained around the 80 percent level. When they were applied, the accuracy rose to 94 percent: 196 of the 208 guesses that the system made were musically correct. These experiments showed that the system could recognize triadic harmonic events, and to some extent more complex chords such as sevenths and ninths, at the signal level.


Speaker Normalization for Speech Recognition Based on Maximum-Likelihood Linear Transformation

Yoon Kim

In speaker-independent speech recognition, where a pre-trained model is used to recognize speech uttered by an arbitrary speaker, minimizing the effects of speaker-dependent acoustics is crucial. Acoustic mismatches between the test speakers and the statistical model result in considerable degradation of recognition performance. In this work, we apply a linear transformation to the cepstral space, which can be viewed as the Fourier dual of the log spectral space. The Gaussian mixture is the most popular distribution for modeling speech segments. If we restrict the feature transformation to a convolution, which is equivalent to filtering in the log spectral domain, the resulting normalization matrix exhibits a Toeplitz structure, simplifying the parameter estimation. The problem of finding the optimal matrix coefficients that maximize the likelihood of the utterance with respect to the existing Gaussian-based model then reduces to a constrained least-squares problem, which is convex and therefore has a unique optimum. Applying the optimal linear transformation to the test feature space yielded considerable improvements over the baseline system in frame-based vowel recognition using data from 23 British speakers, and in isolated digit recognition using the TI digits database, which contains over 300 speakers.
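
The estimation step described above can be sketched in a few lines. The following Python fragment is a simplified illustration, assuming diagonal-covariance Gaussians already aligned to the test frames; the filter length and variable names are assumptions, and the constraint on the transform is dropped so that the problem reduces to a plain weighted least squares rather than the system's exact constrained formulation.

    import numpy as np

    def convolution_matrix(h, dim):
        """Lower-triangular Toeplitz matrix T such that T @ x equals h convolved
        with x, truncated to the cepstral dimension."""
        T = np.zeros((dim, dim))
        for d in range(dim):
            for k in range(min(len(h), d + 1)):
                T[d, d - k] = h[k]
        return T

    def estimate_filter(frames, means, inv_vars, filter_len=3):
        """Find the filter h minimising sum_t || T(h) x_t - mu_t ||^2 weighted by
        the inverse variances -- a linear least-squares problem with a unique optimum.

        frames   : (N, D) test-speaker cepstra
        means    : (N, D) Gaussian means aligned to each frame
        inv_vars : (N, D) inverse diagonal variances of those Gaussians
        """
        N, D = frames.shape
        design, target = [], []
        for t in range(N):
            x, mu, w = frames[t], means[t], np.sqrt(inv_vars[t])
            # Each output dimension of T(h) x is linear in h: sum_k h[k] * x[d - k].
            A = np.zeros((D, filter_len))
            for d in range(D):
                for k in range(min(filter_len, d + 1)):
                    A[d, k] = x[d - k]
            design.append(A * w[:, None])
            target.append(mu * w)
        h, *_ = np.linalg.lstsq(np.vstack(design), np.concatenate(target), rcond=None)
        return h, convolution_matrix(h, D)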



©2000 CCRMA, Stanford University. All Rights Reserved.