Estimating pitch with time-domain DNNs
Without a doubt, the most important recent advance in machine learning (and AI) has been the success of deep neural networks. They are key to modern speech recognition and many other areas.
But perhaps even more interesting is the move to systems without features. For many decades, pattern recognition, machine learning, and other forms of machine intelligence have been built on carefully engineered features. DNNs have thoroughly outperformed systems that rely on careful feature engineering, and the most recent work has done away with even simple features. The best results are often possible with no features at all—just feed the waveform or the pixels into a deep-enough neural network. With enough data, state-of-the-art performance can be achieved by anybody.
Prateek, working with Prof. Ron Schafer, has been building pitch estimators that apply DNNs directly to the raw waveform, with no hand-crafted features. The results are state of the art. Prateek will talk about how he did it, and we’ll discuss the broader trends for neural networks and perception.
Who: Prateek Verma (CCRMA)
What: Estimating pitch with time-domain DNNs
When: Friday, November 4, at 10:30 AM
Where: CCRMA Seminar Room (Top Floor)
Why: Because DNNs seem to rule the world.
Bring your favorite neural network to CCRMA and we’ll talk about how to use a neural network to recognize pitch.
Frequency Estimation from Waveforms using Multi-Layered Neural Networks
For frequency estimation in noisy speech or music signals, time-domain methods based on signal processing techniques such as autocorrelation or the average magnitude difference function often do not perform well. As deep neural networks (DNNs) have become feasible, some researchers have attempted, with some success, to improve on signal-processing-based methods by learning on autocorrelation, Fourier transform, or constant-Q filter bank representations. In our approach, blocks of signal samples are input directly to a neural network to perform end-to-end learning. The emergence of subharmonic structure in the posterior vector of the output layer, along with analysis of the filter-like structures emerging in the DNN, shows strong correlations with some signal-processing-based approaches. These networks appear to learn a nonlinearly spaced frequency representation in the first layer, followed by comb-like filters. We find that learning representations from raw time-domain signals can achieve performance on par with current state-of-the-art algorithms for frequency estimation in noisy and polyphonic settings. The emergence of subharmonic structure in the posterior vector suggests that existing post-processing techniques such as harmonic product spectra and salience mapping may further improve performance.
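To make the comparison concrete, here is a minimal sketch of the classical autocorrelation baseline the abstract refers to: pick the lag at which a block of samples best correlates with itself, restricted to a plausible pitch range. This is an illustrative numpy implementation, not the authors' code; the function name and the search range are our own choices.

```python
import numpy as np

def autocorr_pitch(x, fs, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency of a signal block by
    finding the peak of its autocorrelation within [fmin, fmax].

    x  : 1-D array of samples (one analysis block)
    fs : sampling rate in Hz
    """
    x = x - np.mean(x)                       # remove DC offset
    # Autocorrelation for non-negative lags only.
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    lag_min = int(fs / fmax)                 # shortest lag = highest pitch
    lag_max = int(fs / fmin)                 # longest lag = lowest pitch
    best_lag = lag_min + np.argmax(r[lag_min:lag_max])
    return fs / best_lag
```

In clean, monophonic signals this works well, but noise and competing harmonics shift or bury the autocorrelation peak, which is exactly the regime where the learned time-domain approach is reported to do better.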