Calvin Laughlin | Winter 2024 | Music 356 / CS 470
Learning to work with audio features for supervised and unsupervised tasks. This includes a real-time genre-classifier and a feature-based audio mosaic tool. For more information on the project, visit the project handout here.
Experimented with different configurations.
I began by adding every available feature to the extractor, thinking that more features might lead to better accuracy. Unfortunately, this strategy did not pan out and led to accuracies even below the baseline, with an average accuracy across 5 folds of 0.092. One possible explanation is that, with so many features, the model overfits the data.
# of data points: 1000 dimensions: 31
fold 0 accuracy: 0.0980
fold 1 accuracy: 0.0931
fold 2 accuracy: 0.0980
fold 3 accuracy: 0.0784
fold 4 accuracy: 0.1275
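To make the "accuracy across 5 folds" numbers concrete, here is a minimal sketch of k-fold cross-validation in Python. The write-up does not state which classifier produced these numbers, so the k-nearest-neighbor vote, the function names, and the toy data below are all assumptions for illustration only.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to `query`.
    `train` is a list of (feature_vector, label) pairs."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_validate(data, folds=5, k=3):
    """Return one accuracy per fold; fold i holds out every `folds`-th point starting at i."""
    accuracies = []
    for i in range(folds):
        test = data[i::folds]
        train = [item for j, item in enumerate(data) if j % folds != i]
        correct = sum(knn_predict(train, vec, k) == label for vec, label in test)
        accuracies.append(correct / len(test))
    return accuracies

# Toy data (hypothetical): two well-separated "genres" in a 2-dimensional feature space.
toy = [((0.0 + 0.01 * n, 0.0), "a") for n in range(10)] + \
      [((5.0 + 0.01 * n, 5.0), "b") for n in range(10)]
fold_accuracies = cross_validate(toy)
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
```

On real audio features the classes overlap far more than in this toy set, so per-fold accuracies would vary from fold to fold as they do in the runs listed here.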
I then removed many of the features and went for a balanced approach, choosing one feature per audio component: centroid for the spectral component, zeroX for the temporal component, chroma for the harmonic component, and kurtosis for the statistical component. I thought that engineering the features so that each one captured something unique would give better results. I was wrong: the average was 0.094, slightly better than the kitchen-sink approach but still poor.
# of data points: 1000 dimensions: 15
fold 0 accuracy: 0.0980
fold 1 accuracy: 0.0882
fold 2 accuracy: 0.0735
fold 3 accuracy: 0.1324
fold 4 accuracy: 0.1078
Since the example was now showing much better results than either of my experiments, I decided to build on what was already present, first adding only rolloff, for a total of centroid, flux, RMS, rolloff, and MFCC with 20 coefficients. This yielded much better results, with an average accuracy of 0.432. I assume this is because rolloff gives the model more context about the spectrum.
# of data points: 1000 dimensions: 24
fold 0 accuracy: 0.4324
fold 1 accuracy: 0.4245
fold 2 accuracy: 0.4446
fold 3 accuracy: 0.4436
fold 4 accuracy: 0.4142
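As a quick check on how the quoted average relates to the per-fold numbers, the five fold accuracies from this run can be averaged directly. This is only a small sketch using Python's standard library; the variable names are illustrative.

```python
from statistics import mean, pstdev

# Fold accuracies copied from the run above.
fold_accuracies = [0.4324, 0.4245, 0.4446, 0.4436, 0.4142]

average = round(mean(fold_accuracies), 3)   # mean accuracy across the 5 folds
spread = pstdev(fold_accuracies)            # how much the folds disagree with each other
```

The rounded mean comes out to 0.432, matching the figure quoted for this configuration, and the spread gives a rough sense of how much of a difference between two configurations could be fold-to-fold noise.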
Next I added kurtosis. I was expecting it to throw the model off the way the kitchen sink did, but adding this feature improved the accuracy further. This is most likely because it gives the model a new statistical metric that is not just noisy data. This configuration achieved an accuracy of 0.446, even higher than the last.
# of data points: 1000 dimensions: 25
fold 0 accuracy: 0.4676
fold 1 accuracy: 0.4662
fold 2 accuracy: 0.4172
fold 3 accuracy: 0.4270
fold 4 accuracy: 0.4873
This configuration uses every feature except zeroX. Adding chroma gives the model harmonic information, and once again the added feature led to higher accuracy. So it seems that zeroX is what throws the model off when it is included. This model achieved an accuracy of 0.448.
# of data points: 1000 dimensions: 37
fold 0 accuracy: 0.4681
fold 1 accuracy: 0.4500
fold 2 accuracy: 0.4333
fold 3 accuracy: 0.4725
fold 4 accuracy: 0.4407
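The "dimensions" value printed with each run is consistent with simply summing the size of each feature in the configuration. The sketch below checks that arithmetic; the per-feature sizes are assumptions (chroma taken as 12 pitch classes, MFCC as 20 coefficients, everything else as a single value per frame), not something stated in the output itself.

```python
# Assumed output size of each feature per frame (hypothetical table for illustration).
FEATURE_DIMS = {
    "centroid": 1, "flux": 1, "rms": 1, "rolloff": 1,
    "kurtosis": 1, "zerox": 1, "chroma": 12, "mfcc": 20,
}

def total_dims(features):
    """Length of the concatenated feature vector for a list of feature names."""
    return sum(FEATURE_DIMS[name] for name in features)

balanced = total_dims(["centroid", "zerox", "chroma", "kurtosis"])                          # 15
with_rolloff = total_dims(["centroid", "flux", "rms", "mfcc", "rolloff"])                   # 24
with_kurtosis = total_dims(["centroid", "flux", "rms", "mfcc", "rolloff", "kurtosis"])      # 25
with_chroma = total_dims(["centroid", "flux", "rms", "mfcc", "rolloff", "kurtosis", "chroma"])  # 37
```

Under these assumptions the totals line up with the 15, 24, 25, and 37 dimensions reported for the four configurations after the kitchen-sink run.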
Build a database mapping sound frames (100::ms to 1::second) <=> feature vectors
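One way to picture such a database is a table keyed by each frame's start time, with the frame's feature vector as the value. The Python sketch below is only an illustration of that idea: the two features (RMS and zero-crossing rate), the non-overlapping 100 ms frames, the function names, and the synthetic sine input are all assumptions, not the project's actual extractor.

```python
import math

def frame_features(samples):
    """Two simple stand-in features for one frame: RMS energy and zero-crossing rate."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return (rms, crossings / (len(samples) - 1))

def build_database(samples, sample_rate, frame_seconds=0.1):
    """Map each non-overlapping frame's start time (in seconds) to its feature vector."""
    frame_len = int(round(sample_rate * frame_seconds))
    database = {}
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        database[start / sample_rate] = frame_features(samples[start:start + frame_len])
    return database

# One second of a 440 Hz sine at a deliberately low sample rate, for illustration only.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
db = build_database(signal, sample_rate)
```

Keeping the start time as the key means a matched feature vector can be traced back to the exact stretch of sound it came from, which is what the mosaic step needs.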
Warning: the following video contains flashing, fast-sequenced videos with bright colors set to high-BPM (173) music.
For this milestone, most of my effort went into figuring out how to connect OSC to ChucK and how to get the keyboard to control the videos. This was quite challenging for me, but after some trial and error I have a working prototype that is closer to the idea I have in my head. This prototype matches the spectrum of the chosen song to its 1-nearest neighbor in each desired dance video. The user can choose which dance video to align the music with by using the keyboard (left or right arrows, or the number keys 1-9) to bounce around. In addition, the user can freeze the frame with ‘f’ and apply a rainbow tint to the dancing by pressing ‘r’.
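The 1-nearest-neighbor matching described here can be sketched as a lookup of the closest feature vector within whichever video is currently selected. This is a hedged Python illustration only: the per-video feature lists, the video names, and the two-dimensional vectors are hypothetical, and the real prototype runs in ChucK and Processing with keyboard input choosing the video.

```python
import math

def nearest_frame(query, video_frames):
    """Return the index of the video frame whose feature vector is closest to `query`."""
    return min(range(len(video_frames)), key=lambda i: math.dist(query, video_frames[i]))

# Hypothetical per-video databases: one feature vector per stored video frame.
videos = {
    "video_1": [(0.1, 0.9), (0.5, 0.5), (0.9, 0.1)],
    "video_2": [(0.2, 0.2), (0.8, 0.8)],
}

selected = "video_1"           # in the prototype this would change with the keyboard
audio_features = (0.85, 0.2)   # features of the song frame currently playing
frame_to_show = nearest_frame(audio_features, videos[selected])
```

Switching `selected` to another key in `videos` reruns the same lookup against a different database, which is roughly what bouncing between dance videos amounts to.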
phase2 code
Use prototype from Phase Two to create a feature-based musical mosaic in the form of a musical statement or performance.
Another version, using "Singing in the Rain" (the training song) on the rest of the videos...
Interestingly, the clips match quite well to the original song!
What does it mean for something to be “coherent”? And how does the medium affect what sort of meaning we take out of a performance? This project allowed me to explore these themes and more as I was presented with the challenge of matching popular songs to iconic dance scenes from movies. It was tough to connect ChucK to Processing, but the hardest part overall was getting a good performance for the video. Just like an instrument, I had to practice when to switch videos, when to turn on the rainbow effect, which videos to switch between, and how often. Some of the videos responded better to certain parts of the song, so I had to ensure those were playing when that point was reached. Just like an instrument, there were sequences that flowed well together, and those that didn’t.
But perhaps the most intriguing part for me was that, even with the videos all jumbled around and flashing quickly, one could still recognize that they were dancing and that the clip was from a certain movie. However, when sound is rearranged like that, it is almost completely incoherent and sounds like a fever dream. Why is it the case that visuals can be rearranged and repeated without loss of meaning, but sound cannot? I presume it has something to do with the temporal nature of music, how it is strictly connected to the time in which it plays. In fact, this connection has been explored quite extensively in music theory, with scholars arguing that music’s aesthetic experience comes from play between expectation and realization, which are bound to a temporal, unfolding process. That interplay is on full display with this musical mosaic, and in fact enforced, since the visuals re-introduce earlier themes of the music and give the listener a reminder of the past.
Overall, I had a very good time both making and performing with these tools, and I hope to keep expanding upon this and try it out with more songs and videos in the future.
phase3 code
Song: leavemealone by Fred Again..
Videos (in order of appearance)