Date:
Thu, 05/19/2022 - 5:30pm - 6:30pm
Location:
CCRMA Classroom [Knoll 217]
Abstract: Transformers have touched many fields of research, and music/audio is no different. This talk will present three of my papers as case studies on how we can leverage the power of Transformers in representation learning, signal processing, and clustering. First, we discuss how we were able to beat the wildly popular WaveNet architecture, proposed by Google DeepMind, for raw audio synthesis, and how we overcame the quadratic complexity constraint of Transformers by conditioning on context. Second, a version of Audio Transformers for large-scale audio understanding, inspired by ViT and operating on raw waveforms, is presented. It combines powerful ideas from traditional signal processing, specifically wavelets, applied to intermediate transformer embeddings, to produce state-of-the-art results. Investigating the front end to see why it performs so well, we show that it learns an auditory filter bank with a time-frequency representation optimized for the task. Third, we discuss the power of operating on latent-space encodings and of language modeling over continuous audio signals using discrete tokens, describing how simple unsupervised tasks can give results competitive with end-to-end supervised approaches. We also give an overview of recent work by Google, OpenAI, and others on current “fashion trends” in the field. It will be fun too! Finally, as time permits, we will discuss our advances in packet-loss concealment for network music performance, and touch on the power of approaches based purely on representation learning, without any modern neural nets, and on building learning systems of that nature.
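To give a flavor of the wavelet idea mentioned above, the short Python sketch below shows one way a Haar-style multi-scale decomposition could be applied to a sequence of intermediate transformer embeddings. It is a minimal, hypothetical illustration only, not the method used in the Audio Transformers paper; the function name, shapes, and pooling choices are assumptions for the example.

import numpy as np

def wavelet_pool(embeddings, levels=3):
    """Haar-style multi-scale pooling over intermediate transformer
    embeddings of shape (time_steps, dim). At each level, adjacent
    time steps are split into an average (coarse) band and a
    difference (detail) band; the detail band is summarized per level
    and the coarse band is decomposed again, giving a multi-scale
    feature vector. Illustrative sketch, not the paper's method."""
    x = np.asarray(embeddings, dtype=np.float32)
    pooled = []
    for _ in range(levels):
        if x.shape[0] % 2:                       # pad to an even length
            x = np.vstack([x, x[-1:]])
        coarse = (x[0::2] + x[1::2]) / 2.0       # approximation band
        detail = (x[0::2] - x[1::2]) / 2.0       # detail band
        pooled.append(detail.mean(axis=0))       # summarize this scale
        x = coarse                               # recurse on the coarser band
    pooled.append(x.mean(axis=0))                # coarsest approximation
    return np.concatenate(pooled)

# Example: 200 frames of 64-dim embeddings -> one fixed-size vector
features = wavelet_pool(np.random.randn(200, 64), levels=3)
print(features.shape)                            # (4 * 64,) = (256,)

In such a scheme, the fixed-size multi-scale vector could then feed a classifier head for audio understanding tasks.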
This talk was originally given for CS 25 in the Fall of 2021 at Stanford University.
This work was done in collaboration with Prof. Chris Chafe, Prof. Jonathan Berger, and Prof. Julius Smith, all at the Center for Computer Research in Music and Acoustics at Stanford University. Thanks to Stanford’s Institute for Human-Centered AI (HAI) for supporting this work with a generous Google Cloud computing grant.
Bio: Prateek Verma is currently working on audio research at Stanford’s Center for Computer Research in Music and Acoustics (CCRMA), collaborating with Prof. Chris Chafe and Prof. Jonathan Berger. He received his master’s degree from Stanford CCRMA, and before that he was at IIT Bombay.