Etude #2

James Zheng

Phase One

Trial 1 – Centroid, Flux, RMS, MFCC (20 coefs, 10 filters)
# of data points: 1000, dimensions: 23
fold 0 accuracy: 0.4897
fold 1 accuracy: 0.4260
fold 2 accuracy: 0.4201
fold 3 accuracy: 0.3985
fold 4 accuracy: 0.4515

Trial 2 – Centroid, Flux, RMS, MFCC (20 coefs, 10 filters), RollOff, ZeroX, Chroma, Kurtosis
# of data points: 1000, dimensions: 38
fold 0 accuracy: 0.0882
fold 1 accuracy: 0.0980
fold 2 accuracy: 0.1225
fold 3 accuracy: 0.0833
fold 4 accuracy: 0.0686

Trial 3 – Centroid, Flux
# of data points: 1000, dimensions: 2
fold 0 accuracy: 0.2260
fold 1 accuracy: 0.2157
fold 2 accuracy: 0.2025
fold 3 accuracy: 0.2176
fold 4 accuracy: 0.2147

Trial 4 – Centroid
# of data points: 1000, dimensions: 1
fold 0 accuracy: 0.1775
fold 1 accuracy: 0.1814
fold 2 accuracy: 0.1618
fold 3 accuracy: 0.1848
fold 4 accuracy: 0.1858

Trial 5 – Centroid, Flux, RMS, MFCC (30 coefs, 15 filters), Chroma
# of data points: 1000, dimensions: 45
fold 0 accuracy: 0.4382
fold 1 accuracy: 0.4466
fold 2 accuracy: 0.4691
fold 3 accuracy: 0.4559
fold 4 accuracy: 0.4515

On average, the best performer is Trial 5 at ~0.452 accuracy, followed by Trial 1 at ~0.437. Typically, the higher the number of dimensions, the better the performance, except for Trial 2, which had the worst performance (~0.092 accuracy) despite having the second-most dimensions. Trial 2 shows that adding more features doesn't always mean better performance; here it drastically hurt accuracy. It is interesting to see that even with just Centroid in Trial 4, the accuracy of ~0.178 already beats the random-guessing baseline of 0.1. For Phase Three, I plan on trying more combinations of MFCC parameters, such as the number of coefficients and filters, as well as combining Trial 1's features with a smaller subset of the features added in Trial 2.
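As a point of reference, a rough sketch of the Trial 1 pipeline might look like the following. This is a minimal offline version assuming librosa for feature extraction and scikit-learn for 5-fold cross-validation; the file layout, the KNN classifier, and the STFT settings are illustrative placeholders, not the actual setup.

```python
# Sketch of the Trial 1 pipeline: per-clip features (centroid, flux, RMS,
# 20 MFCCs) averaged over time, then scored with 5-fold cross-validation.
# Everything here is illustrative: file layout, classifier, STFT settings.
import glob
import os

import numpy as np
import librosa
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def extract_features(path):
    y, sr = librosa.load(path, sr=22050, mono=True)
    S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
    centroid = librosa.feature.spectral_centroid(S=S, sr=sr).mean()
    rms = librosa.feature.rms(S=S).mean()
    # Spectral flux: frame-to-frame change in the magnitude spectrum.
    flux = np.sqrt((np.diff(S, axis=1) ** 2).sum(axis=0)).mean()
    # 20 MFCCs; librosa's mel filter count differs from the 10 used above.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)
    return np.hstack([centroid, flux, rms, mfcc])  # 23 dims, as in Trial 1

# Placeholder dataset layout: clips/<genre>/<clip>.wav, 1000 clips total.
paths = sorted(glob.glob("clips/*/*.wav"))
labels = [os.path.basename(os.path.dirname(p)) for p in paths]

X = np.array([extract_features(p) for p in paths])
clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
for i, acc in enumerate(cross_val_score(clf, X, np.array(labels), cv=5)):
    print(f"fold {i} accuracy: {acc:.4f}")
```

Note the StandardScaler: KNN distances are sensitive to features that live on very different scales, so some normalization step is assumed here.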

Phase Two

In my exploration, I extracted features from the Mii Channel theme and the Kahoot waiting-room theme. I remembered having seen a remix of the two in the past, and was interested in what kind of "remix" I could create through an audio mosaic. I tested various parameters, including the number of MFCC coefficients, the FFT size, the hop size, the number of frames aggregated before averaging, the value of k for KNN, and the number of voices. The result is linked below. In the demo, you will notice that the Mii Channel music plays when there is voice input, with the pitch of the tones roughly matching the pitch of the input. When I start adding percussive sounds (clapping), you will hear the Kahoot theme start playing. With both at once, both themes play together and the output builds as the input gets louder.

Demo
Code
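To make the matching mechanics concrete, here is a rough offline sketch of the idea. The real version linked above presumably runs in real time; the file names, parameter values, and the librosa/scikit-learn stand-ins below are all illustrative assumptions rather than the actual implementation.

```python
# Offline sketch of the mosaic's matching step: chunk the input into short
# grains, describe each grain with averaged MFCCs, and replace it with the
# nearest grain from the source corpus. Parameter values are guesses.
import numpy as np
import librosa
import soundfile as sf
from sklearn.neighbors import NearestNeighbors

N_MFCC, N_FFT, HOP = 13, 1024, 512  # swept in Phase Two; values illustrative
AGG = 4                             # frames averaged together per grain
K = 3                               # neighbors retrieved (the "voices" knob)

def grains_and_feats(path):
    """Split audio into AGG-frame grains; return per-grain MFCC means."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC,
                                n_fft=N_FFT, hop_length=HOP)
    n = mfcc.shape[1] // AGG
    feats = mfcc[:, :n * AGG].reshape(N_MFCC, n, AGG).mean(axis=2).T
    size = AGG * HOP
    grains = [y[i * size:(i + 1) * size] for i in range(n)]
    return feats, grains

# Corpus: the two source themes (file names assumed).
corpus_feats, corpus_grains = [], []
for f in ["mii_channel.wav", "kahoot_lobby.wav"]:
    feats, grains = grains_and_feats(f)
    corpus_feats.append(feats)
    corpus_grains.extend(grains)
nn = NearestNeighbors(n_neighbors=K).fit(np.vstack(corpus_feats))

# Resynthesize: swap each input grain for its best corpus match.
# (A k > 1 version would layer the top-k matches as simultaneous voices.)
in_feats, _ = grains_and_feats("input_recording.wav")
_, idx = nn.kneighbors(in_feats)
out = np.concatenate([corpus_grains[i] for i in idx[:, 0]])
sf.write("mosaic_out.wav", out, 22050)
```

Roughly, a larger aggregation window smooths the matching (fewer, longer grains), while the FFT size and hop size trade time resolution against frequency resolution, which is presumably why those parameters were worth sweeping.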

Phase Three

Below, you will find a new and improved (subjectively speaking) take on musical genre classification. I constructed an audio mosaic from various anime trap remixes (credit to Musicality). The mosaic outputs sound based on input songs drawn from a collection of genres. See if you can guess each genre before the answer is revealed!

Demo
Code

This project involved a lot of thought and creativity, and I enjoyed the entire process of curating the right collection of sounds, tuning for a good set of parameters, and developing the final artist statement by combining audio and visual elements. One challenge came in that first step of curating the sounds: I realized that many input songs were mapping to the same output sound, simply because it was the only track in the mosaic with strings, or the only mellow one. When I removed those tracks from the mosaic, the resulting sounds were more diverse and satisfying. Through this kind of testing and playing with the mosaic, I found myself discovering new patterns and connections across genres based on which output sound each one mapped to.

The creative development process for this assignment was also nonlinear. Initially, I wanted to build something where I would select pairs of songs that mixed well together and then have a relevant pair play based on the input sounds. I quickly realized it would be extremely difficult to get the "mixing" to work and sound good, so I decided to embrace what the audio mosaic offered rather than force something unnatural. This led me to find sounds that work well together even when not everything is perfectly lined up, and it let me spend more time on the creative elements of my project rather than on figuring out how to technically "mix" two songs perfectly for the mosaic.

Overall, I think applying an audio mosaic to genre classification showcases its purpose quite well: depending on the genre of the music playing, there are clear differences in which sounds get output. As a potential future exploration, it would be interesting to see what the system outputs when given non-musical sounds, such as speech or more percussive material.