A central part of my research revolves around characterizing and modeling expressive characteristics of instrumental performance, with the final aim of reproducing and/or expanding human performance in virtual environments. In particular, I have been interested in excitation-continuous musical instruments, such as wind instruments, bowed strings, or the singing voice. In general, my approach to expressive performance modeling can be divided into three tasks: (a) acquisition of performance-related cues from real recordings, (b) extraction of meaningful descriptors suited to the characterization of performance expression, and (c) construction of flexible, generative models able to render expressive instrumental controls from low-dimensional user inputs and/or a partially annotated score.
      While much of my early work has been based on the extraction of expressive characteristics from audio analysis (a more classical approach), during the past few years I have been leaning towards multi-modal data capture and processing. In order to obtain a deeper understanding of instrumental playing technique, I strongly believe it is necessary to measure, describe, and model instrumental gestures (i.e., sound-producing gestures) during music performance. Besides jazz saxophone and recorder playing, I have worked on singing voice and classical violin as part of research projects with Yamaha Japan. While my work on jazz saxophone and singing voice has been based on analysis of audio recordings, violin and recorder playing techniques were studied by combining sound (microphone) and gesture (sensors) data acquired from real performance. Here is a list of relevant papers:

Maestre, E., Bonada, J., Mayor, O. Modeling musical articulation gestures in singing voice performances. AES 121st Convention, 2006.

Maestre, E., Ramírez, R., Kersten, S., Serra, X. Expressive concatenative synthesis by reusing samples from real performance recordings. Computer Music Journal, Vol. 33, No. 4, pp. 23-42, 2009.

Ramírez, R., Maestre, E., Serra, X. Automatic performer identification in commercial monophonic Jazz performances. Pattern Recognition Letters, Vol. 31, pp. 1514-1523, 2010.

Maestre, E., Blaauw, M., Bonada, J., Guaus, E., Pérez, A. Statistical modeling of bowing control applied to violin sound synthesis. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 4, pp. 855-871, 2010.

García, F., Vinceslas L., Maestre, E., Tubau, J. Acquisition and study of blowing pressure profiles in recorder playing. International Symposium on New Interfaces for Musical Expression (NIME), 2011.


In the context of expressive performance modeling, studying playing technique through analysis and synthesis of instrumental gestures has become my main topic of research. In particular, it is my belief that future sound synthesis paradigms must incorporate an explicit generative model able to render gestural controls, therefore mimicking the role of the performer, who translates the symbolic, discrete events appearing in a music score into the continuous controls (instrumental gestures) used to drive sound production.
      During the past six years, I have devoted a significant amount of my research efforts to the problem of acquiring, analyzing, modeling, and synthesizing bowing gestures in bowed-string performance. By means of specialized sensing techniques, bowing control parameter signals are acquired from real performance and modeled using piecewise parametric functions, so that, at the synthesis stage, realistic bowing controls can be obtained from an annotated music score. The figure below illustrates bowing control modeling applied to violin sound synthesis:

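As a rough illustration of the piecewise parametric representation described above, a bow-velocity profile can be modeled as a sequence of cubic Bézier segments, one per bow stroke. The control-point values below are hypothetical, not taken from measured data:

```python
def cubic_bezier(p0, p1, p2, p3, n=50):
    """Sample one cubic Bezier segment at n equally spaced values of t."""
    return [(1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3
            for t in (i / (n - 1) for i in range(n))]

def bow_velocity_profile(strokes, n=50):
    """Concatenate one Bezier segment per stroke into a bow-velocity curve."""
    curve = []
    for p0, p1, p2, p3 in strokes:
        curve.extend(cubic_bezier(p0, p1, p2, p3, n))
    return curve

# Hypothetical two-segment stroke in cm/s: accelerate towards ~30, decay to rest.
strokes = [(0.0, 25.0, 30.0, 28.0), (28.0, 26.0, 10.0, 0.0)]
profile = bow_velocity_profile(strokes)
```

Because each Bézier segment stays within the convex hull of its control points, concatenated segments yield smooth, bounded control curves whose few parameters can be fitted to measured bowing data and later rendered from score annotations.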
Regarding research on instrumental gestures (in particular on acquisition and analysis), I have also worked on studying blowing pressure profiles in recorder playing. Although off-line sound synthesis from an annotated score is presented here as the most direct application of my work on gesture modeling, many other applications are easily envisioned, such as (i) interactive high-level control of automatic performance, (ii) analysis / modeling / morphing of performance styles, or (iii) educational / pedagogical use (e.g., high-performance training). Below are some of my publications on gesture analysis / synthesis:

Maestre, E., Bonada, J., Blaauw, M., Pérez, A., Guaus, E. Acquisition of violin instrumental gestures using a commercial EMF device. International Computer Music Conference (ICMC), 2007.

Maestre, E. Modeling Instrumental Gestures: an Analysis/Synthesis Framework for Violin Bowing. PhD Thesis, Universitat Pompeu Fabra, 2009. Advisors: Xavier Serra (Universitat Pompeu Fabra) and Julius O. Smith III (Stanford University).

Maestre, E., Pérez, A., Ramírez, R. Gesture sampling for instrumental sound synthesis: violin bowing as a case study. International Computer Music Conference (ICMC), 2010.

Maestre, E., Blaauw, M., Bonada, J., Guaus, E., Pérez, A. Statistical modeling of bowing control applied to violin sound synthesis. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 4, pp. 855-871, 2010.

García, F., Vinceslas L., Maestre, E., Tubau, J. Acquisition and study of blowing pressure profiles in recorder playing. International Symposium on New Interfaces for Musical Expression (NIME), 2011.

Marchini, M., Papiotis P., Pérez, A., Maestre, E. A Hair Ribbon Deflection Model for Low-Intrusiveness Measurement of Bow Force in Violin Performance. International Symposium on New Interfaces for Musical Expression (NIME), 2011.


Difficulties in quantitatively analyzing and modeling the continuous nature of instrumental controls have prevented excitation-continuous instrumental sound synthesis from achieving greater success. The inherently continuous nature and flexibility of physical models make them the most promising approach to sound synthesis, but the unavailability of appropriate input controls (i.e., controls faithfully representing those of the actual performer) remains a crucial issue.
       As a clear application of my research on bowing gesture synthesis, I have worked on controlling bowed-string physical models (mostly based on digital waveguides), obtaining very promising results even with very simplified models. In an attempt to prove the potential of gesture modeling applied to physical modeling sound synthesis, I am currently working towards a more sophisticated bowed-string physical model (violin, viola, cello), involving research on two-dimensional bridge admittance modeling through digital filter design (see picture below), as well as on bow-string and finger-string interaction models combining finite-difference schemes with digital waveguides. The objective of my research on physical modeling of bowed-string instruments is to construct a fully functional physical model of a string quartet, controlled by means of bowing and fingering controls rendered from a four-voice annotated score. Some of my publications involving different aspects of physical modeling synthesis are:

Maestre, E., Blaauw, M., Bonada, J., Guaus, E., Pérez, A. Statistical modeling of bowing control applied to violin sound synthesis. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 4, pp. 855-871, 2010.

Maestre, E., Scavone, G. P., Smith III, J. O. Modeling of a violin input admittance by direct positioning of second-order resonators. 162nd Meeting of the Acoustical Society of America, 2011.

Maestre, E., Scavone, G. P., Smith III, J. O. Digital modeling of bridge driving-point admittances from measurements on violin-family instruments. Stockholm Music Acoustics Conference, 2013.

Maestre, E. Analysis/synthesis of bowing control applied to violin sound rendering via physical models. International Congress on Acoustics / Proceedings of Meetings of Acoustics, 2013.
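The admittance modeling mentioned above approximates a measured driving-point admittance with a parallel bank of second-order resonators, one per structural mode. A minimal sketch of evaluating such a bank's magnitude response, with made-up mode parameters rather than measured ones, could look like this:

```python
import cmath
import math

def admittance_magnitude(freqs_hz, modes, sr=44100):
    """Magnitude response of a parallel bank of second-order resonators.
    Each mode is (center_freq_hz, pole_radius, gain); the bank approximates
    a driving-point admittance as a sum of modal contributions."""
    out = []
    for f in freqs_hz:
        z1 = cmath.exp(-2j * math.pi * f / sr)  # z^-1 evaluated on the unit circle
        h = 0j
        for f_k, r_k, g_k in modes:
            th = 2 * math.pi * f_k / sr          # pole angle of mode k
            h += g_k / (1 - 2 * r_k * math.cos(th) * z1 + (r_k * z1) ** 2)
        out.append(abs(h))
    return out

# Made-up modes, loosely in the range of low violin body resonances (Hz).
modes = [(280.0, 0.999, 1.0), (450.0, 0.998, 0.8), (550.0, 0.997, 0.6)]
mags = admittance_magnitude([140.0, 280.0, 2000.0], modes)
```

Placing each conjugate pole pair directly at a measured modal frequency and bandwidth is what makes this parameterization attractive: the filter coefficients have a direct physical reading, and the bank runs cheaply inside a real-time string model.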


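Most of these bowed-string models build on the digital waveguide paradigm: a recirculating delay line with a lossy loop filter emulating wave propagation along the string. Its simplest form is the classic Karplus-Strong plucked string, sketched below; an actual bowed-string model would replace the noise-burst excitation with a nonlinear bow-string friction junction, omitted here:

```python
import random

def waveguide_pluck(freq_hz, sr=44100, dur_s=0.5, damping=0.996):
    """Simplest digital-waveguide string (Karplus-Strong): a delay line of
    ~one period, excited with a noise burst, with a two-point-average
    lowpass in the feedback loop standing in for string losses."""
    random.seed(0)                      # reproducible excitation
    n = max(2, int(sr / freq_hz))       # delay-line length ~ one period
    line = [random.uniform(-1.0, 1.0) for _ in range(n)]
    out = []
    for _ in range(int(sr * dur_s)):
        s = line.pop(0)                 # sample leaving the delay line
        line.append(damping * 0.5 * (s + line[0]))  # lossy lowpass feedback
        out.append(s)
    return out

tone = waveguide_pluck(220.0)           # a decaying 220 Hz string tone
```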
In the context of the ongoing EU-funded SIEMPRE research project on social behavior in music making, I have been leading the Universitat Pompeu Fabra partnership and supervising PhD students working on ensemble (violin duets and string quartets) performance analysis. Our approach consists of multi-modal data capture (sound, video, motion capture) in a real performance situation, with the aim of observing significant differences between solo and ensemble performances, and eventually identifying the mechanisms driving interdependence among musicians (synchronization, entrainment, leadership).
       One of the cornerstones of our approach is the explicit consideration of the music score (musical structure) as an influential aspect determining the dynamics of interdependence in joint music making, paying special attention to audio-extracted features and sound-producing gestures. This requires accurate, automatic note-level segmentation of multi-modal data signals (sound, instrumental gestures, ancillary gestures), so that a subsequent analysis of timing, dynamics, intonation, timbre, and articulation can be performed.
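As a toy illustration of one ingredient of such note-level segmentation, a short-time-energy onset detector over an audio signal might look like the sketch below; the real multi-modal, score-informed segmentation is considerably more involved, and the frame sizes and threshold here are arbitrary:

```python
import math

def note_onsets(signal, sr, frame=512, hop=256, threshold=0.02):
    """Toy energy-based segmenter: report a note onset wherever short-time
    RMS rises above the threshold after having been below it."""
    onsets, above = [], False
    for start in range(0, len(signal) - frame, hop):
        chunk = signal[start:start + frame]
        rms = math.sqrt(sum(x * x for x in chunk) / frame)
        if rms > threshold and not above:
            onsets.append(start / sr)   # onset time in seconds
            above = True
        elif rms <= threshold:
            above = False
    return onsets

# Two synthetic 440 Hz "notes" separated by silence, at an 8 kHz sample rate.
sr = 8000
note = [0.5 * math.sin(2 * math.pi * 440 * i / sr) for i in range(4000)]
silence = [0.0] * 4000
onsets = note_onsets(silence + note + silence + note, sr)
```

On this synthetic input the detector returns two onset times, near 0.45 s and 1.47 s; note-level segmentation then anchors the analysis of timing, dynamics, intonation, timbre, and articulation at these boundaries.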
       With respect to the technical setup for data acquisition, below is a diagram depicting the actual setup being used for a series of experiments on string quartet performance carried out through a joint collaboration between Universitat Pompeu Fabra, McGill University, and Stanford University. In the setup, two different motion capture systems (optical IR tracking plus electro-magnetic field sensing) are combined, together with sound and video.

Acquired data is processed and conveniently stored in repoVizz, an integrated online system capable of structural formatting and remote storage, browsing, exchange, annotation, and visualization of synchronous multi-modal, time-aligned data. A screen capture of repoVizz on a web browser is shown next.

Some of my publications related to this topic are:

Papiotis, P., Marchini, M., Maestre, E. Computational Analysis of Solo versus Ensemble Performance in String Quartets: Intonation and Dynamics, International Conference on Music Perception and Cognition, 2012.

Marchini, M., Ramírez, R., Papiotis, P., Maestre, E. Inducing Rules of Ensemble Music Performance: a Machine Learning Approach. International Conference on Music and Emotion, 2013.

Mayor, O., Llop, J., Maestre, E. repoVizz: A multi-modal on-line database and browsing tool for music performance research. International Society for Music Information Retrieval Conference (ISMIR), 2011.


Apart from the main research topics described above, I have been active in different areas, namely: Spectral modeling synthesis from gesture data, Sound mosaicing, Voice conversion for speech synthesis, and Context-aware mobile applications. Here are some recent publications related to these lines:

Pérez, A., Bonada, J., Maestre, E., Blaauw, M., Guaus, E. Performance Control driven Violin Timbre Model based on Neural Networks. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 20, No. 3, 2012.

Vinceslas, L., García, F., Pérez, A., Maestre, E. Mapping blowing pressure and sound features in recorder playing. International Conference on Digital Audio Effects, 2011.

Coleman, G., Maestre, E., Bonada, J. Augmenting sound mosaicing with descriptor-driven transformation. International Conference on Digital Audio Effects, 2010.

Villavicencio, F., Maestre, E. GMM-PCA based speaker-timbre conversion on full-quality speech. INTERSPEECH Speech Synthesis Workshop, 2010.

Camurri, A., Volpe, G., Vinet, H., Bresin, R., Maestre, E., Llop, J., Kleimola, J., Oksanen, S., Välimäki, V., Seppänen, J. User-centric context-aware mobile applications for embodied music listening. International ICST Conference on User Centric Media, 2009.
Esteban Maestre, March 2014