Learning Audio Embeddings: From Signal Representation, Audio Transformation to Understanding
Date:
Fri, 05/31/2019 - 10:30am - 12:00pm
Location:
CCRMA Seminar Room
Event Type:
Hearing Seminar 
One common technique in the DNN world is to use a deep network to learn some task, and then take the output from an intermediate layer to help guide a new task. This relatively low-dimensional intermediate representation is called an embedding, and it contains all the information needed to perform the task. Prateek will talk about using this new type of representation for supervised/unsupervised audio transforms, speech recognition, emotion recognition, and end-to-end spoken language translation.
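Here is a rough sketch of the basic idea (my illustration, not Prateek's code), assuming a small PyTorch-style classifier; the layer sizes, layer names, and task are placeholders only:

import torch
import torch.nn as nn

class AudioClassifier(nn.Module):
    def __init__(self, n_mels=128, n_classes=10, embed_dim=64):
        super().__init__()
        # Encoder maps an input feature vector (e.g. a mel frame) to a
        # low-dimensional vector: this vector is the "embedding".
        self.encoder = nn.Sequential(
            nn.Linear(n_mels, 256), nn.ReLU(),
            nn.Linear(256, embed_dim), nn.ReLU(),
        )
        # Task head trained on the original task (e.g. genre labels).
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, x):
        return self.head(self.encoder(x))

model = AudioClassifier()
# ... train the model on its original task, then reuse the encoder alone:
frames = torch.randn(32, 128)           # 32 frames, 128 mel bins (dummy data)
with torch.no_grad():
    embeddings = model.encoder(frames)  # shape: (32, 64)
# `embeddings` can now feed a new task: emotion recognition, retrieval, etc.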
Who: Prateek Verma (Stanford)
What: Embedding spaces for audio analysis (emotion recognition, genre classification and speech translation)
When: 10:30AM on Friday May 31, 2019
Where: CCRMA Seminar Room, Top floor of the Knoll at Stanford
Why: DNNs are really good at summarizing the world; what can they do for audio?
Bring your favorite DNN to the Hearing Seminar and we’ll talk about how they represent knowledge.
- Malcolm
Title:
Learning Audio Embeddings: From Signal Representation, Audio Transformation to Understanding
Abstract:
The advent of machine learning has brought a radical shift in approaches to classical signal processing and audio processing problems. One of these shifts is the rise of “new representations,” or embeddings, which have been successful in abstracting the information of interest. Embeddings are low-dimensional vector representations mapped from the signal of interest (images, text, audio, etc.) via techniques from machine learning, linear algebra and optimization. In this talk, we will highlight ways in which these representations, or embeddings, can be computed, interpreted and used for tasks in music and audio signals.
We will discuss how we can create alternative representations, similar to the family of Fourier/correlation-based representations (spectrograms, constant-Q, correlograms), by learning and stacking these embedding vectors. For applications in supervised/unsupervised audio transforms, speech recognition, etc., we show how these embeddings are computed and analysed, and how they help in solving the problem of interest. We will show how these embedding vectors can summarize different attributes of the input signal at both the micro and macro level, such as pitch, timbre, rhythm, emotion, and spectral comb structure. We will discuss how these fundamental characteristics of audio signals were never explicitly trained for, yet are somehow encoded and implicitly learned in these embeddings, depending on the application of interest.
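To make the “stacking” idea concrete, here is a small sketch (again my illustration, not code from the talk): frame a signal, map each frame to a vector, and stack the vectors into a spectrogram-like array. The framing parameters and the stand-in embedding function are assumptions; in the talk's setting the per-frame map would be a learned embedding rather than a Fourier magnitude.

import numpy as np

def embedding_gram(signal, frame_len=1024, hop=256, embed_fn=None):
    # Slice a 1-D signal into overlapping frames, embed each frame, and
    # stack the vectors into a (num_frames, embed_dim) array, analogous
    # to how a spectrogram stacks Fourier magnitudes over time.
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.stack([embed_fn(f) for f in frames])

audio = np.random.randn(16000)                      # 1 s of dummy audio at 16 kHz
rep = embedding_gram(audio, embed_fn=lambda f: np.abs(np.fft.rfft(f)))
print(rep.shape)                                    # (num_frames, frame_len // 2 + 1)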
This work was done jointly with Jonathan Berger, Albert Haque, Michelle Guo, Chris Chafe, Julius Smith and Alexandre Alahi at Stanford University.
Bio:
Prateek Verma is a Stanford CCRMA graduate interested in the intersection of machine learning, audio processing and optimization for music and audio signals. Before coming to Stanford, he graduated from IIT Bombay in Electrical Engineering with a specialization in Signal Processing. He has held research positions in the Stanford Artificial Intelligence Lab in the Computer Science Department, in both the Natural Language Processing Group and the Machine Learning group. At Stanford, he co-taught the inaugural course on “Deep Learning for Music and Audio” with Julius Smith, giving several lectures, and has given a guest lecture in the signal processing course in the Electrical Engineering Department. He is continuing his research at Stanford in the areas of hearing perception, unsupervised learning, and sound analysis and synthesis.
FREE
Open to the Public