
Machine Recognition in Music (Past)


Statistical Pattern Recognition for Prediction of Solo Piano Performance (February 1999)

Chris Chafe

The research involves modeling human aspects of musical performance. As in speech, the exquisite precision of trained performance and mastery of an instrument does not lead to an exactly repeatable performed musical surface with respect to note timings and other parameters. The goal is to achieve sufficient modeling capabilities to predict some aspects of expression in the performance of a score. The present approach attempts to capture the variety of ways a particular passage might be played by a single individual, so that a predicted performance can be defined from within a closed sphere of possibilities characteristic of that individual. Ultimately, artificial realizations might be produced by chaining together different combinations at the level of the musical phrase, or by guiding a synthetic or predicted performance in real time.

A pianist was asked to make recordings (in the Disklavier MIDI data format) from a progression of rehearsals during preparation of a work (by Charles Ives) for concert. The samples include repetitions of the excerpt from the same day as well as recordings made over a period of months. This performance data (containing timing and velocity information) was analyzed using classical statistical feature extraction methods tuned to classify the variety of realizations. Chunks of data representing musical phrases were segmented from the recordings according to an "effort parameter" that has been previously described. Presently under study is a simulation system stocked with a comprehensive set of distinct musical interpretations, which permits the model to create artificial performances. It is possible that such a system could eventually be guided in real time by a pianist's playing, so that the system predicts ahead of an unfolding performance. Possible applications include present performance situations in which appreciable electronic delay (on the order of hundreds of milliseconds) is musically problematic.

Koto Musical Score Database (April 2002)

Sachiko Deguchi and Craig Stuart Sapp

Research is being done on the representation of koto musical scores in electronic format for use in music analysis with computers. The koto is a Japanese instrument with 13 plucked strings. Each string can be tuned to any pitch by positioning its bridge, although the tunings used in traditional koto music are limited. Koto music was originally transmitted orally, using syllables to indicate basic musical patterns but not exact pitches. Western music notation is not sufficient to describe the numerous ornamental techniques, so modern koto players have developed their own notation systems based on numbers for each string plucked by the right hand and a set of modifying symbols for the pitch ornamentations played primarily with the left hand. Converting these scores to Western notation is not sufficient for analysis by computer because many of the ornamental techniques are unique to koto playing. Sample musical encodings of koto scores are available on the web at

Realtime Chord Recognition of Musical Sound: a System Using Common Lisp Music (April 2000)

Takuya Fujishima

I designed an algorithm to recognize musical chords from the input signal of musical sounds. The key elements of the algorithm are:

  1. use of a "pitch class profile (PCP)", an intensity map of the twelve semitone pitch classes
  2. numerical pattern matching between the PCP and built-in "chord-type templates" to determine the most likely root and chord type.
I introduced two major heuristics and some other improvements to Marc Leman's "Simple Auditory Model" so as to make it a practical one.
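The two elements listed above can be illustrated with a minimal Python sketch. It is not the Common Lisp Music implementation and does not include the two heuristics, which are not described here. The four binary templates, the Hann window, the naive DFT, the frequency limits, and all function names are assumptions made for illustration only.

```python
import cmath
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Hypothetical binary templates: semitone offsets above the root.
# (The original system is reported to have 27 chord types; only 4 are shown.)
CHORD_TEMPLATES = {
    "maj": (0, 4, 7),
    "min": (0, 3, 7),
    "dim": (0, 3, 6),
    "aug": (0, 4, 8),
}


def synth_chord(freqs, sample_rate=8000, n=1024):
    """Sum of pure tones, used here only as test input."""
    return [sum(math.sin(2 * math.pi * f * t / sample_rate) for f in freqs)
            for t in range(n)]


def pitch_class_profile(samples, sample_rate, ref_freq=261.63,
                        fmin=100.0, fmax=2000.0):
    """Fold DFT bin energies into 12 pitch classes (index 0 = C)."""
    n = len(samples)
    # Hann window to limit leakage between neighbouring bins.
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * t / n))
                for t, s in enumerate(samples)]
    pcp = [0.0] * 12
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        if not fmin <= freq <= fmax:
            continue
        w = -2j * math.pi * k / n
        bin_value = sum(x * cmath.exp(w * t) for t, x in enumerate(windowed))
        pitch_class = round(12 * math.log2(freq / ref_freq)) % 12
        pcp[pitch_class] += abs(bin_value) ** 2
    total = sum(pcp) or 1.0
    return [v / total for v in pcp]


def recognize_chord(pcp):
    """Return (root name, chord type) of the best-scoring template."""
    best = None
    for root in range(12):
        for name, intervals in CHORD_TEMPLATES.items():
            # Score = PCP energy that falls on the template's chord tones.
            score = sum(pcp[(root + i) % 12] for i in intervals)
            if best is None or score > best[0]:
                best = (score, NOTE_NAMES[root], name)
    return best[1], best[2]
```

For example, `recognize_chord(pitch_class_profile(synth_chord([261.63, 329.63, 392.00]), 8000))` scores every root and template against the PCP of a synthetic C major triad. Symmetric chords such as the augmented triad produce ties between roots, which this sketch resolves arbitrarily in favour of the first root tried.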

I implemented the algorithm using Common Lisp Music. The system worked continuously in realtime on a Silicon Graphics O2 workstation and on Intel PC platforms running Linux. It could take the input sound from the audio input, or from a sound file, and could display the recognition results on the fly.

In the experiments, I first used a pure tone and three other timbres from electronic musical instruments to estimate the potential capability of the system. The system could distinguish all 27 built-in chord types for chord tones played in these timbres. Then I input to the system a 50-second audio excerpt from the opening theme of Smetana's Moldau. Without the two heuristics, the accuracy remained at around 80 percent. When they were applied, the accuracy rose to 94 percent: 196 of the 208 guesses that the system made were musically correct. These experiments showed that the system could recognize triadic harmonic events, and to some extent more complex chords such as sevenths and ninths, at the signal level.


Estimation of Sinusoids in Audio Signals Using an Analysis-By-Synthesis Neural Network (July 2001)

Guillermo Garcia

In this paper we present a new method for estimating the frequency, amplitude and phase of sinusoidal components in audio signals. An analysis-by-synthesis system of neural networks is used to extract the sinusoidal parameters from the signal spectrum at each window position of the Short-Time Fourier Transform. The system attempts to find the set of sinusoids that best fits the spectral representation in a least-squares sense. Unlike the traditional approach, the method does not require preliminary detection of spectral peaks, and it works even when spectral peaks are not well resolved in frequency. This allows for shorter analysis windows and therefore better time resolution of the estimated sinusoidal parameters. Results have also shown robust performance in the presence of high levels of additive noise, with signal-to-noise ratios as low as 0 dB.
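The idea of choosing sinusoidal parameters by least-squares fit, without first picking spectral peaks, can be illustrated with a much simpler stand-in. The sketch below is not the neural-network system described above. It works in the time domain rather than on the spectrum, handles a single sinusoid, and searches a grid of candidate frequencies supplied by the caller. All function names and the grid search itself are assumptions for illustration.

```python
import math


def fit_fixed_frequency(frame, freq, sample_rate):
    """Least-squares fit of a*cos(wn) + b*sin(wn) to one frame.

    Returns (amplitude, phase, squared residual) for the model
    amplitude * cos(w*n + phase).
    """
    w = 2.0 * math.pi * freq / sample_rate
    cos_b = [math.cos(w * n) for n in range(len(frame))]
    sin_b = [math.sin(w * n) for n in range(len(frame))]
    # Normal equations for the 2-column basis [cos, sin].
    cc = sum(c * c for c in cos_b)
    ss = sum(s * s for s in sin_b)
    cs = sum(c * s for c, s in zip(cos_b, sin_b))
    xc = sum(x * c for x, c in zip(frame, cos_b))
    xs = sum(x * s for x, s in zip(frame, sin_b))
    det = cc * ss - cs * cs
    a = (xc * ss - xs * cs) / det
    b = (xs * cc - xc * cs) / det
    residual = sum((x - a * c - b * s) ** 2
                   for x, c, s in zip(frame, cos_b, sin_b))
    # a = A*cos(phase), b = -A*sin(phase)
    return math.hypot(a, b), math.atan2(-b, a), residual


def estimate_sinusoid(frame, sample_rate, candidate_freqs):
    """Synthesize each candidate, keep the one with the smallest residual."""
    best = None
    for freq in candidate_freqs:
        amp, phase, residual = fit_fixed_frequency(frame, freq, sample_rate)
        if best is None or residual < best[0]:
            best = (residual, freq, amp, phase)
    return best[1], best[2], best[3]
```

Given a 256-sample frame and a grid such as `[400 + 0.5 * k for k in range(161)]`, `estimate_sinusoid` returns the candidate frequency, amplitude and phase whose resynthesis is closest to the frame. The accuracy of the frequency estimate is limited by the spacing of the grid.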

Optical Recognition of Printed Music: A New Approach (April 2000)

Walter Hewlett

Recent projects in optical recognition of printed music have tended to give top priority to the extraction of pitch symbols (i.e., noteheads). Noteheads give some information about duration (i.e., they are filled or unfilled), but definitive information also requires the accurate reading of stems, flags, and beams. Symbols for articulation (staccato marks, dynamics, slurs, and so forth) are sometimes ignored if the intended use of the scanned material is in sound applications.

In an effort to create a scanning front-end for the CCARH databases of classical music, which are stored in an attribute-rich format (MuseData) to support notation, sound, and analysis, we have taken the following approach: large objects are identified first. This clarifies contextual properties that may bear on pitch (key signatures, clef changes, octave-transposition signs), duration (beams, stems, and flags), and articulation (slurs, ornaments, and so on). The pitch content of the notehead is the last item to be recognized and completes the representation.

Instrument Identification of Polyphonic Signals Using Independent Subspace Analysis

Pamornpol (Tak) Jinachitra

The problem of sound source identification is not only a matter of academic curiosity about how the human brain works and how to make a computer system that can do the same. A desire for automatic classification of audio materials according to instruments makes the problem a practical one. Now, with the MPEG-7 standard enabling more search-amenable databases of audio material, a system for automatic instrument identification from a real song is even more desirable.

In this research, a system which tries to identify the musical instruments playing concurrently in a mixture is investigated. The features used in classification are derived from Independent Subspace Analysis (ISA), which approximately decomposes each source, and the mixture, into its statistically "independent" components. Without re-grouping or actually separating the sources, these features can be used as fingerprints of each instrument, assuming the decomposition is robust to the mixing process. A test on mixtures of two tonal instruments drawn from a set of five gives a 67 percent rate of identifying one instrument correctly and a 40 percent rate of identifying both correctly.

While the results show that features corresponding to note attacks or non-stationary components such as FM may be useful in an identification problem of simple tonal mixes, their roles diminish greatly in real songs. In the future, a system which can better deal with real recordings will be investigated along with alternative statistical decompositions.

Constrained EM Estimates for Harmonic Source Separation

Pamornpol (Tak) Jinachitra

A constrained iterative method for estimating the sinusoidal parameters of harmonic sources is proposed, based on an EM algorithm and intended for harmonic source separation. The problem of coinciding partials, and of interference among partials in general, is mitigated by constraining the "weak" partials relative to the stronger ones of the same harmonic source. A useful scheme to determine the weakness of a partial is proposed. The constrained iteration is shown to give more accurate estimates of the sinusoidal parameters, which results in good source separation for most cases of mixtures with overlapping spectra.


Speaker Normalization for Speech Recognition Based On Maximum-Likelihood Linear Transformation (April 2000)

Yoon Kim

In speaker-independent speech recognition, where a pre-trained model is used to recognize speech uttered by an arbitrary speaker, minimizing the effects of speaker-dependent acoustics is crucial. Acoustic mismatches between the test speakers and the statistical model result in considerable degradation of recognition performance. In this work, we apply a linear transformation to the cepstral space, which can be viewed as the Fourier dual of the log spectral space. The Gaussian mixture is the most popular distribution used for modeling speech segments. If we restrict the feature transformation to be a convolution, which is equivalent to filtering in the log spectrum domain, the resulting normalization matrix exhibits a Toeplitz structure, simplifying the parameter estimation. The problem of finding the optimal matrix coefficients that maximize the likelihood of the utterance with respect to the existing Gaussian-based model then becomes a constrained least-squares problem, which is convex and therefore yields a unique optimum. Applying the optimal linear transformation to the test feature space yielded considerable improvements over the baseline system in frame-based vowel recognition using data from 23 British speakers, and in isolated digit recognition using the TI digits database, consisting of over 300 speakers.
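The link between convolution and Toeplitz structure can be checked with a short sketch. It assumes a causal kernel and an output truncated to the length of the cepstral vector. The kernel values, vector values, and function names are hypothetical, and no likelihood maximization is performed.

```python
def convolve_truncated(kernel, vector):
    """Causal convolution, keeping only the first len(vector) outputs."""
    n = len(vector)
    return [sum(kernel[i - j] * vector[j]
                for j in range(n) if 0 <= i - j < len(kernel))
            for i in range(n)]


def toeplitz_from_kernel(kernel, n):
    """n x n matrix T with T[i][j] = kernel[i - j] (zero outside the kernel)."""
    return [[kernel[i - j] if 0 <= i - j < len(kernel) else 0.0
             for j in range(n)]
            for i in range(n)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def is_toeplitz(matrix):
    """True if every diagonal of the matrix is constant."""
    return all(matrix[i][j] == matrix[i - 1][j - 1]
               for i in range(1, len(matrix))
               for j in range(1, len(matrix[0])))
```

Multiplying a cepstral vector by `toeplitz_from_kernel(kernel, n)` gives the same result as `convolve_truncated(kernel, vector)`. The matrix has n x n entries but only `len(kernel)` free parameters, which is one way to see why the restriction simplifies estimation.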


Extensions: From Spectral Pitch Estimation to Automatic Polyphonic Transcription

Randal J. Leistikow and Harvey Thornburg

Recently, we have begun to extend the pitch estimation developments, and are presently pursuing a method for Bayesian spectral chord recognition, with our likelihood evaluation as the main computational engine. Octave, key, mode, root, chroma and tuning are abstracted, and may jointly be identified given only the peaks from an STFT frame. We also outline a scheme for integrating framewise correspondences. The latter results in nothing more sophisticated than an HMM, where exact inference is possible (apart from the aforementioned MCMC steps in likelihood evaluation), and it is at least clear where to insert "domain knowledge" (from music theory) in the hierarchy. Two levels of structure exist: interframe dependencies and dependencies across musical transitions. A third, independent level of structure may be exploited in terms of metrical accent. Our approach provides an extremely simple and hopefully robust method for handling automatic transcription tasks, at least those involving polyphonic recordings of a single instrument.

Computational Models for Musical Style Identification

Yi-Wen Liu and Craig Sapp

Research is underway to identify musical features which can be used to distinguish between different composers writing in a common style. In the preliminary experiments on Mozart's and Haydn's string quartet movements, probabilities of transition between classes of musical events (such as pitch classes, rhythmic classes, etc.) are computed and compared in the information-theoretic sense. It is shown that the classification is 66% "successful" (68/100 in Mozart and 136/212 in Haydn) based only on examining note transition probabilities of the first violin part. A web-paper can be found at

As a control for the identification accuracy of computational models, a human-based experiment is being conducted over the web at where randomly selected MIDI files of string quartet movements composed by either Mozart or Haydn are played to listeners. The test takers must then choose the composer who they think wrote the musical sample being played. Summary statistics for identifications are viewable in real-time on the experiment's webpage. For example, a prototypical Mozart composition would be K 285, movement 1, where 84% (21/25) correct identifications have been made. A prototypical Haydn quartet movement would be Op. 74, no. 2, movement 3, which so far has 92% (11/12) correct identifications. For all test takers, the average accuracy of distinguishing between Mozart and Haydn string quartets is 56% over 4172 trials. Trained classical musicians with no experience listening to the string quartets of either Mozart or Haydn can identify the correct composers with about 70% accuracy.

Musical data for the experiments has been provided by CCARH, which has electronically encoded the scores for nearly all of Mozart's and Haydn's string quartets.
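The transition-probability comparison described in the first paragraph of this section can be sketched as follows. The sketch assumes a first-order model over pitch classes with add-one smoothing, and it uses cross-entropy (average bits per transition) as the information-theoretic measure. The exact event classes and measure used in the experiments are not specified here, and the function names are illustrative.

```python
import math


def transition_model(sequences, alphabet, smoothing=1.0):
    """Estimate P(next | previous) from event sequences, with smoothing."""
    counts = {a: {b: smoothing for b in alphabet} for a in alphabet}
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    model = {}
    for a in alphabet:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in alphabet}
    return model


def cross_entropy(seq, model):
    """Average negative log2-probability of the transitions in seq."""
    pairs = list(zip(seq, seq[1:]))
    return -sum(math.log2(model[a][b]) for a, b in pairs) / len(pairs)


def classify(seq, models):
    """Pick the label whose model explains seq with the fewest bits."""
    return min(models, key=lambda label: cross_entropy(seq, models[label]))
```

With one model trained per composer, `classify` assigns an unseen first-violin line to the composer whose transition table gives it the lower cross-entropy. Any training sequences passed to this sketch would be the user's own encodings; none are supplied here.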

A Method of Automatic Recognition of Structural Boundaries in Recorded Musical Signals

Unjung Nam

This research explores a method of determining appropriate analysis settings for the self-similarity method in order to determine meaningful structural segmentation of the music. Instead of arbitrarily selecting a kernel size, or pre-determining a kernel size at a level that will detect redundancy at a particular structural level (e.g. note, measure, phrase structure), we recursively grow the kernel size in order to find multiple hierarchical musical structures within the signal. The meaningful kernel sizes are extracted by detecting the local peaks from the normalized variances of the novelty matrix. Finally, novelty scores at these kernel sizes are plotted to observe the hierarchical musical structure of the signal with regard to the novelty and the redundancy.
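A minimal sketch of novelty scoring at several kernel sizes is given below. It assumes cosine similarity between feature frames and a checkerboard kernel slid along the diagonal of the self-similarity matrix, which is a common formulation of the self-similarity method. The selection of kernel sizes from normalized variances is not reproduced, and all names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def self_similarity(features):
    """Frame-by-frame similarity matrix."""
    return [[cosine_similarity(u, v) for v in features] for u in features]


def novelty_curve(ssm, half_width):
    """Correlate a checkerboard kernel along the main diagonal.

    Frames before index i count as 'past', frames from i on as 'future'.
    Same-side blocks are weighted +1 and cross blocks -1, so the score
    peaks where past and future are each self-similar but unlike each other.
    Positions where the kernel would overrun the matrix are left at 0.
    """
    n = len(ssm)
    scores = [0.0] * n
    for i in range(half_width, n - half_width + 1):
        total = 0.0
        for a in range(-half_width, half_width):
            for b in range(-half_width, half_width):
                sign = 1.0 if (a < 0) == (b < 0) else -1.0
                total += sign * ssm[i + a][i + b]
        scores[i] = total
    return scores


def multiscale_novelty(ssm, half_widths):
    """Novelty curves for a growing series of kernel sizes."""
    return {hw: novelty_curve(ssm, hw) for hw in half_widths}
```

Calling `multiscale_novelty(self_similarity(features), [2, 4, 8])` returns one novelty curve per kernel size. Small kernels respond to short-term changes and larger ones to longer sections, which is the basis for reading a hierarchy out of the curves.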

The research is applicable both to segmentation tasks within a recorded musical excerpt or work, and to comparative tasks among multiple excerpts or works. The implications of this research for machine recognition of music and music information retrieval are explored, as are applications of automatic music segmentation to music summarization and genre classification methods.

Audio Content-Based Retrieval Methods and Automatic Style Classification

Unjung Nam

The rapid proliferation of user-accessible digital music data poses significant challenges to the tasks of searching and retrieving music from massive data sets. Signal analysis methods to derive high-level musical information such as pitch and instrumentation could be used to formulate classification methods that may prove useful in efficient and flexible content retrieval. My research evaluates some current approaches to content-based music retrieval and proposes a model that attempts to distinguish among three classes of music (jazz, popular, and classical) by analysis and extraction of features in the frequency and time domains.

Harmonic Visualizations of Tonal Music

Craig Stuart Sapp

Multi-timescale visualization techniques for displaying the output from key-finding algorithms have been developed for harmony analysis in music. The horizontal axis of the key graphs represents time in the score, while the vertical axis represents the duration of an analysis window used to select music for the key-finding algorithm. Each analysis window result is shaded according to the output key's tonic pitch. The resulting diagrams can be used to compare differences between key-finding algorithms at different time scales and to view the harmonic structure and relationships between key regions in a musical composition. Example plots are available on the web at:


Themefinder: A Musical Theme Search Engine

Craig Stuart Sapp

Suppose you have a melody stuck in your head, but you don't know the name of it. You can now search a collection of over 36,000 musical themes on the web at in an attempt to identify it. Musical themes can be searched with different levels of exactness, going from a precise sequence of pitch names to basic melodic contours. Wildcards similar to those used in regular expressions are supported in most types of searches. Themefinder is useful for research purposes as well. It has been used to identify melodies suitable for musical performance experiments and has also been used to identify common starting pitch patterns in music. Themefinder is a collaboration between the Center for Computer Assisted Research in the Humanities at Stanford University and the Cognitive and Systematic Musicology Laboratory at Ohio State University. Themefinder uses the Humdrum data format for encoding and manipulating music for search and display on the website.
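Contour search with wildcards can be illustrated with a small sketch. It is not Themefinder's implementation or query syntax. The U/D/S contour letters, the use of MIDI note numbers, the three-theme dictionary, and the function names are all assumptions for illustration. Wildcards are handled by Python's standard `re` module.

```python
import re

# Hypothetical miniature theme collection: MIDI note numbers of openings.
THEMES = {
    "ode_to_joy": [64, 64, 65, 67, 67, 65, 64, 62, 60, 60, 62, 64, 64, 62, 62],
    "twinkle": [60, 60, 67, 67, 69, 69, 67],
    "frere_jacques": [60, 62, 64, 60, 60, 62, 64, 60],
}


def gross_contour(pitches):
    """Reduce a melody to Up / Down / Same steps between successive notes."""
    return "".join("U" if b > a else "D" if b < a else "S"
                   for a, b in zip(pitches, pitches[1:]))


def search_contour(themes, pattern):
    """Names of themes whose contour matches a regular expression."""
    regex = re.compile(pattern)
    return sorted(name for name, pitches in themes.items()
                  if regex.search(gross_contour(pitches)))
```

A query such as `search_contour(THEMES, "^U+D")` asks for themes that begin with one or more rising steps followed by a fall. A coarser representation like this tolerates errors in remembered pitches at the cost of returning more candidate matches.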

© Copyright 2005 CCRMA, Stanford University. All rights reserved.