A Model of Attention Driven Scene Analysis
Who: Malcolm Slaney (Yahoo! Research)
Why: How does attention affect what we hear?
What: Attention Driven Auditory Scene Analysis
When: Friday October 7th at 1:15PM
Where: CCRMA Seminar Room
This is joint work with Trevor Agus, Shih-Chii Liu, Merve Kaya, Mounya Elhilali, Barbara Shinn-Cunningham, Ozlem Kalinli, Kailash Patil, Nuno Vasconcelos, DeLiang Wang and Jude Mitchell at the 2011 Telluride Neuromorphic Cognition Engineering Workshop. Time permitting, I will also talk about some of the other attention-related projects from this year's workshop.
This paper describes a model of attention-driven auditory scene analysis (ASA). ASA is the process of listening to a complicated auditory environment (the cocktail party is the canonical example) and picking out and understanding a single talker. Because of its significance in both the perceptual and engineering sciences, the ASA problem has prompted multidisciplinary efforts spanning the engineering, artificial-intelligence, and neuroscience communities. Most current work on auditory scene analysis takes one of two simplified approaches. The first computational auditory scene analysis (CASA) systems use an exclusively bottom-up approach: low-level perceptual cues are grouped using simple rules such as common onsets or common modulations. These systems rely heavily on the conspicuity and salience of stimulus elements, and they can perform reasonably well in simple, well-controlled scene-analysis conditions. More recent systems take a more sophisticated approach by including expectations in the analysis: simple models of what a talker sounds like, or of what was said before. In this paper we describe a third approach, based on a user's goals.
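To illustrate the bottom-up grouping rules mentioned above, here is a minimal sketch (not from the talk or paper) of "common onset" grouping: frequency channels whose energy envelopes rise at nearly the same time are bound into one auditory stream. The function names, envelope representation, and tolerance parameter are all illustrative assumptions.

```python
import numpy as np

def detect_onset(envelope, threshold=0.5):
    """Return the first frame index where the envelope crosses threshold,
    or None if the channel never becomes active (illustrative onset cue)."""
    above = np.nonzero(np.asarray(envelope) >= threshold)[0]
    return int(above[0]) if above.size else None

def group_by_common_onset(envelopes, tolerance=2):
    """Group channel indices whose onsets fall within `tolerance` frames
    of each other -- a toy version of the common-onset grouping rule."""
    onsets = {ch: detect_onset(env) for ch, env in enumerate(envelopes)}
    onsets = {ch: t for ch, t in onsets.items() if t is not None}
    groups = []
    for ch, t in sorted(onsets.items(), key=lambda kv: kv[1]):
        for group in groups:
            if abs(onsets[group[0]] - t) <= tolerance:
                group.append(ch)  # shares an onset with this group
                break
        else:
            groups.append([ch])   # starts a new stream
    return groups

# Two synthetic "sources": channels 0-1 start near frame 3, channel 2 at frame 10.
env = np.zeros((3, 20))
env[0, 3:] = 1.0
env[1, 4:] = 1.0
env[2, 10:] = 1.0
print(group_by_common_onset(env))  # → [[0, 1], [2]]
```

Real CASA systems combine several such cues (onsets, common modulation, harmonicity); this sketch shows only why onset grouping alone works in well-controlled conditions and degrades in dense scenes where onsets overlap.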
Parsing complex acoustic scenes involves an intricate interplay between bottom-up, stimulus-driven salient elements of the scene and top-down, goal-directed mechanisms that shift our attention to particular parts of the scene. Here we present a framework for exploring the interaction between these two processes in a simulated cocktail-party setting. With the goal of tracking the source uttering the highest-valued digit, the model shows improved digit recognition in a multi-talker environment. This work highlights the relevance of both data-driven and goal-driven processes in tackling real multi-talker, multi-source sound analysis.
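The interaction the abstract describes can be caricatured in a few lines: attention to a stream is a product of its bottom-up salience and a top-down gain set by the listener's goal. This is a hedged sketch under my own assumptions, not the authors' model; the gain rule (scale by the value of the recognized digit) merely instantiates the "track the highest-valued digit" goal.

```python
import numpy as np

def select_stream(salience, task_gain):
    """Pick the stream that maximizes bottom-up salience times
    top-down, goal-driven gain (toy attention model)."""
    combined = np.asarray(salience, dtype=float) * np.asarray(task_gain, dtype=float)
    return int(np.argmax(combined))

salience  = [0.9, 0.4, 0.6]          # conspicuity of each talker (assumed)
digits    = [2, 9, 5]                # hypothetical digit recognized per stream
task_gain = [d / 9 for d in digits]  # goal: favor the highest-valued digit

print(select_stream(salience, task_gain))  # → 1
```

Note that pure bottom-up selection (argmax of salience alone) would pick stream 0, the loudest talker; the goal-driven gain redirects attention to stream 1, which carries the highest digit. That shift is the point of combining the two processes.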
Dr. Malcolm Slaney is a principal scientist at Yahoo! Research, where he works on multimedia analysis and music- and image-retrieval algorithms in databases with billions of items, and a (consulting) Professor at Stanford CCRMA, where he has led the Hearing Seminar for the last 20 years. He is a Fellow of the IEEE and has served as Associate Editor of the IEEE Transactions on Audio, Speech and Signal Processing, IEEE Multimedia Magazine, and the Proceedings of the IEEE. He has given successful tutorials at ICASSP 1996 and 2009 on "Applications of Psychoacoustics to Signal Processing", on "Multimedia Information Retrieval" at SIGIR and ICASSP, and on "Web-Scale Multimedia Data" at ACM Multimedia 2010. He is coauthor, with A. C. Kak, of the IEEE book "Principles of Computerized Tomographic Imaging", which was republished by SIAM in its "Classics in Applied Mathematics" series, and coeditor, with Steven Greenberg, of the book "Computational Models of Auditory Function." Before Yahoo!, Dr. Slaney worked at Bell Laboratories, Schlumberger Palo Alto Research, Apple Computer, Interval Research, and IBM's Almaden Research Center. For the last several years he has helped lead the auditory group at the Telluride Neuromorphic Cognition Workshop.