Steve Greenberg on Speech Perception
Just how do we recognize the sound of an /a/ vs. a flute? Speech recognition systems commonly do this task with measurements of a single spectral slice: just 39 numbers describing the frequency content at one instant in time. But is this how the brain does it?
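(The 39 numbers are typically 13 mel-cepstral coefficients plus their first and second time derivatives.) As a minimal sketch of what "one spectral slice" means, here is the simplest version: the magnitude spectrum of a single windowed frame. The function name, frame length, and synthetic "vowel" signal are illustrative choices, not anyone's actual recognizer front end.

```python
import numpy as np

def spectral_slice(signal, center, frame_len=512):
    """Magnitude spectrum of one windowed frame centered at sample `center`."""
    half = frame_len // 2
    frame = signal[center - half : center + half]
    windowed = frame * np.hanning(frame_len)      # taper to reduce spectral leakage
    return np.abs(np.fft.rfft(windowed))          # frame_len//2 + 1 frequency bins

# Toy "vowel": a 200 Hz fundamental with decaying harmonics, at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
vowel = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 8))

slice_ = spectral_slice(vowel, center=8000)
print(slice_.shape)  # (257,)
```

A real front end would warp these bins onto a mel scale and take a cosine transform, but the point stands: each slice is a snapshot of frequency content at one moment, with no memory of what came before.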
There is a wonderful demo called a "white vowel." Imagine taking a recording of a word and measuring the frequency content at the center of its vowel. Design a filter with the inverse response and apply it to the sound. Now the vowel has a perfectly flat spectrum, with no formants to be seen. What do you think you will hear when you play it? Surprise: it still sounds like a perfectly normal word! Yes, it sounds different, especially when played next to the original, but the vowel remains clearly audible. This suggests the auditory system is doing something more than pattern matching against spectral slices. But what?
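A rough sketch of the manipulation, assuming a signal held in a numpy array. This is a cruder, per-bin version of the envelope-inverse filter described above: dividing each FFT bin by its own magnitude forces the spectrum flat while keeping the original phase, i.e. the temporal fine structure that evidently carries the percept. The function name and the noise test signal are illustrative.

```python
import numpy as np

def whiten(signal, eps=1e-8):
    """Inverse-filter a signal so its magnitude spectrum is flat,
    preserving the original phase (the temporal structure)."""
    spectrum = np.fft.rfft(signal)
    flat = spectrum / (np.abs(spectrum) + eps)    # unit magnitude, same phase
    return np.fft.irfft(flat, n=len(signal))

# Stand-in for a recorded vowel: one second of noise at 16 kHz.
rng = np.random.default_rng(0)
vowel_like = rng.standard_normal(16000)
white = whiten(vowel_like)

mags = np.abs(np.fft.rfft(white))
# mags is now (nearly) flat: every bin sits at ~unit magnitude
```

The real demo inverts a smoothed spectral envelope measured at the vowel center rather than flattening bin by bin, but the result is the same kind of sound: no formant peaks left in the spectrum, yet the word survives in the time structure.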
Steve has been interested in temporal patterns. He'll be talking about his model, based on neural oscillators operating at different time scales, at Friday's Hearing Seminar.
Who: Steven Greenberg
Why: We don't know how sound is recognized (speech or music)
What: Time, Speech, Memory – How Does the Brain Go From Sound to Meaning?
When: Friday November 18th at 1:15PM
Where: CCRMA Seminar Room (Top Floor of the Knoll)
I also want to thank the discussants at the last seminar: Nick, Juhan and Gautham. We had a wonderful discussion about their favorite papers from ISMIR. They have put together a page of papers and links at
See you at CCRMA on Friday, for what promises to be an interesting talk and certainly a stimulating discussion!
Time, Speech, Memory – How Does the Brain Go From Sound to Meaning?
Spoken language is highly variable, reflecting factors of environmental (e.g., acoustic-background noise, reverberation), linguistic (e.g., speaking-style) and idiosyncratic (e.g., voice-quality) origin. Despite such variability listeners rarely experience difficulty understanding speech. What brain mechanisms underlie this perceptual resilience, and where does the invariance reside (if anywhere) that enables the signal to be reliably decoded and understood? A theoretical framework – DejaNets – is described for how the brain may go from “sound to meaning.” Key is speech representations in memory, crucial for the parsing, analysis and interpretation of sensory signals. The acoustic waveform is viewed as inherently ambiguous, its interpretation dependent on combining data streams, some sensory (e.g., visual-speech cues), others internal, derived from memory and knowledge schema. This interpretative process is mediated by a hierarchical network of neural oscillators spanning a broad range of time constants (ca. 15 ms–2,000 ms), consistent with the temporal structure of spoken language. They reflect data-fetching, parsing and pattern-matching involved in decoding and interpreting the speech signal. DejaNets accounts for many (otherwise) paradoxical and mysterious properties of spoken language including categorical perception, the McGurk effect, phonemic restoration, semantic context and robustness/sensitivity to variation in pronunciation, speaking rate and the ambient acoustic environment. [Supported by AFOSR]
Steve is the principal of Silicon Speech, a small scientific research company based in Northern California. Prior to founding the company, he was Senior Scientist/Affiliate Faculty at the International Computer Science Institute and Associate Professor of Linguistics at the University of California, Berkeley. In a previous life, Steve studied how single auditory neurons in the auditory nerve and cochlear nucleus respond to complex sounds. He holds a Ph.D. from UCLA (Linguistics and Neuroscience) and an A.B. from the University of Pennsylvania (Linguistics and Anthropology). His recent research has been funded by the Air Force Office of Scientific Research.