Prateek Verma - Fourier Transforms and Filter-Banks in the Era of Transformers and GPT
Date:
Fri, 04/07/2023 - 10:30am - 12:00pm
Location:
CCRMA Seminar Room
Event Type:
Hearing Seminar 
Prateek Verma has done a large number of interesting audio ML experiments, from speech to music and many other problem areas. He’ll be talking about learning a basis for the front end.
Who: Prateek Verma
What: Fourier Transforms and Filter-Banks in the Era of Transformers and GPT
When: Friday April 7 at 10:30AM
Where: CCRMA Seminar Room (top floor of the Knoll at Stanford)
Why: Usually a little bit of knowledge goes a long way. Is that still true?
See you at CCRMA. Bring your favorite auditory front end.
- Malcolm
Prateek Verma
Fourier Transforms and Filter-Banks in the Era of Transformers and GPT
Abstract:
Transformers have revolutionized the field of artificial intelligence by powering self-supervised architectures such as GPT and, more recently, ChatGPT. They are advancing the state of the art in almost every problem thrown at them.
This talk will re-imagine Fourier transforms in this age of Transformers/GPT. Before the modern advent of deep learning, music and audio research used fixed, non-learnable front ends such as the spectrogram or mel-spectrogram, with or without neural architectures on top. With convolutional architectures supporting applications such as ASR and audio understanding, the field shifted to learnable front ends in which both the basis functions and their weights are learned from scratch and optimized for the task of interest (e.g., raw-waveform CLDNNs). With the move to Transformer-based architectures containing no convolutional blocks, a linear layer projects small waveform patches into a small latent dimension before feeding them to the Transformer.
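The last step above — a linear layer projecting small waveform patches into a latent dimension — can be sketched in a few lines. This is a minimal illustration, not the talk's actual model; the patch size, latent dimension, and random projection matrix are all hypothetical stand-ins (in practice the projection would be learned during training).

```python
import numpy as np

# Hypothetical sizes, for illustration only: one second of 16 kHz audio,
# split into non-overlapping patches, each projected to a latent vector.
sample_rate = 16000
patch_size = 400          # 25 ms of audio per patch
latent_dim = 64

rng = np.random.default_rng(0)
waveform = rng.standard_normal(sample_rate)          # stand-in for real audio

# Reshape the waveform into (num_patches, patch_size).
num_patches = len(waveform) // patch_size
patches = waveform[: num_patches * patch_size].reshape(num_patches, patch_size)

# A single linear layer (here a random matrix, learnable in practice)
# replaces the fixed Fourier/mel front end: each patch becomes a token.
projection = rng.standard_normal((patch_size, latent_dim)) / np.sqrt(patch_size)
tokens = patches @ projection    # shape: (num_patches, latent_dim)

print(tokens.shape)  # (40, 64)
```

The resulting token sequence is what a convolution-free Transformer consumes in place of a spectrogram.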
What can be the next evolution in this series? Can we learn a better time-frequency representation according to the constraints we provide, by making these front-end transforms entirely learnable for a given task? Additionally, we will explore the strengths of Wavelet Transforms together with powerful Transformer architectures and showcase gains achieved for acoustic understanding tasks. By incorporating various inductive biases of audio signals, we will see significant improvements in audio-understanding performance with no additional parameters. Then, we will tinker with these models and open them up to explore what they learn. We will see that they acquire quite a rich vocabulary of basis functions, all learned from scratch rather than taken from a sinusoidal Fourier basis, and that they discover all kinds of signal-processing properties. This work can potentially impact every audio/signal processing task that takes a Fourier transform as its first step or operates directly on raw waveforms with neural architectures such as Transformers. It pieces together almost three decades of signal processing research, from the STFT to filter banks, to CLDNN acoustic models, to the current era of Transformers.
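To make the idea of a fully learnable front end concrete, here is a minimal sketch (again illustrative, not the talk's architecture): a bank of filters whose shapes are free parameters is slid over the raw waveform exactly the way a fixed Fourier or mel filter bank would be, and the magnitude of the responses gives a learned time-frequency-like representation. All sizes and the random initialization below are assumptions; training would update `filters`.

```python
import numpy as np

rng = np.random.default_rng(1)
num_filters, filter_len, hop = 32, 256, 128

# Filter shapes are free parameters (learnable in practice); a classical
# front end would instead fix these to sinusoids or mel-spaced bandpasses.
filters = rng.standard_normal((num_filters, filter_len)) * 0.01

waveform = rng.standard_normal(16000)   # stand-in for 1 s of 16 kHz audio

# Slice the waveform into overlapping frames, one per hop.
frames = np.lib.stride_tricks.sliding_window_view(waveform, filter_len)[::hop]

# Correlate each frame with every filter; the magnitude is a learned
# analogue of a spectrogram: (num_frames, num_filters).
representation = np.abs(frames @ filters.T)
print(representation.shape)
```

When the filters are initialized as sinusoids, this reduces to (the magnitude of) a short-time Fourier analysis, which is what makes the comparison between fixed and learned bases direct.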
This work is done with Chris Chafe, with the backbone Transformer architecture developed jointly with Jonathan Berger in 2021, all at Stanford University.
Bio:
Prateek Verma is currently a researcher at Stanford University. He has held research positions in various interdisciplinary groups at Stanford and has published in a variety of conferences and journals, drawing on his acoustics/music/signal processing background. He received his Master's degree from Stanford CCRMA and completed an AI residency at Google X, initiating a new direction for robotics research. Before coming to Stanford, he graduated from IIT Bombay in Electrical Engineering with a specialization in Signal Processing. His primary research interest and passion lie at the intersection of classic signal processing, acoustics, music/audio/speech processing, AI, music information retrieval, and music understanding/synthesis.
Background reading:
Audio Transformers: Transformers For Large Scale Audio Understanding — Adieu Convolutions
https://arxiv.org/abs/2105.00335
A Content Adaptive Front End For Audio Signal Processing
https://arxiv.org/abs/2303.10446
FREE
Open to the Public