Programming Etude 2: "Featured Artist"

CS 470: Music and AI | Kiran Bhat | Winter 2024

Phase One: Extract, Classify, Validate

Feature Configuration 1

For my first feature configuration, I used the one provided in the feature-extract.ck code. It contains four features: Centroid, Flux, RMS, and MFCCs (20 coefficients, 10 filters).
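For reference, the wiring looks roughly like this (a sketch of the chain, not a verbatim copy of the provided file):

    // analysis chain: mic -> FFT -> features -> FeatureCollector
    adc => FFT fft;
    FeatureCollector combo => blackhole;

    // the four features in Configuration 1
    fft =^ Centroid centroid =^ combo;
    fft =^ Flux flux =^ combo;
    fft =^ RMS rms =^ combo;
    fft =^ MFCC mfcc =^ combo;

    // MFCC parameters: 20 coefficients, 10 filters
    20 => mfcc.numCoeffs;
    10 => mfcc.numFilters;

    // analysis window
    4096 => fft.size;
    Windowing.hann(fft.size()) => fft.window;

    // one analysis frame: trigger the chain and read the feature vector
    // (in the real program this runs in a timed loop)
    combo.upchuck() @=> UAnaBlob blob;
    <<< "feature dim:", blob.fvals().size() >>>;  // 1+1+1+20 = 23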

Feature Configuration 2

For my second feature configuration, I tried using MFCCs as the only feature, keeping the provided values of 20 coefficients and 10 filters.

Feature Configuration 3

For my third feature configuration, I was curious whether increasing the number of filters used in the MFCC computation would let the same number of coefficients describe the data better. I again used MFCCs exclusively and kept 20 coefficients, but increased the number of filters to 20.
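In ChucK terms, Configurations 2 and 3 only reduce the chain to a single MFCC feature and adjust its parameters, roughly (a sketch, same wiring conventions as above):

    // MFCC-only chain (Configurations 2 and 3)
    adc => FFT fft;
    FeatureCollector combo => blackhole;
    fft =^ MFCC mfcc =^ combo;

    4096 => fft.size;
    Windowing.hann(fft.size()) => fft.window;

    20 => mfcc.numCoeffs;    // both configurations
    10 => mfcc.numFilters;   // Configuration 2
    // 20 => mfcc.numFilters;  // ...or Configuration 3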

Feature Configuration 4

For my fourth feature configuration, I wanted to capture as much information about the audio as possible in my feature vectors, so I collected every available feature: Centroid, Flux, RMS, RollOff, ZeroX, MFCCs (20 coefficients, 20 filters), Chroma, and Kurtosis.

Feature Configuration 5

For my fifth feature configuration, I tried using every feature except MFCCs. My second and third configurations used MFCCs exclusively, so I was curious whether the opposite could produce better results, and how far accuracy would drop from Configuration 4's once MFCCs were omitted. So I collected only the following features: Centroid, Flux, RMS, RollOff, ZeroX, Chroma, and Kurtosis.

Results

Configuration                              Dimensionality   Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Mean Accuracy
Configuration 1 (Default)                              23   0.4520   0.4392   0.4221   0.4529   0.4250   0.4382
Configuration 2 (MFCCs, 10 filters)                    20   0.4054   0.4010   0.3995   0.3779   0.4123   0.3992
Configuration 3 (MFCCs, 20 filters)                    20   0.4265   0.4314   0.3838   0.4402   0.4422   0.4248
Configuration 4 (Everything)                           58   0.4520   0.4676   0.4789   0.4882   0.4740   0.4721
Configuration 5 (Everything minus MFCCs)               18   0.3804   0.4069   0.3917   0.3706   0.3426   0.3784

(Per-fold numbers are classification accuracy from 5-fold cross-validation; Dimensionality is the length of the feature vector.)

Unsurprisingly, Configuration 4 produced the highest mean accuracy in cross-validation, since its feature vectors contained every available feature. However, it also had the slowest feature extraction, which may make Configurations 1 or 3 more practical for real-time use. Notably, Configurations 2 and 3 outperformed Configuration 5: all of the non-MFCC features combined (Centroid, Flux, RMS, RollOff, ZeroX, Chroma, Kurtosis) still capture less useful information for genre classification than the 20 MFCC coefficients alone. Adding the MFCCs back to the rest of the features raises mean accuracy by roughly 9 percentage points (0.3784 to 0.4721, Configuration 5 to Configuration 4), which underscores how important MFCCs are as a feature for genre classification.
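For context, here is a condensed sketch of how one fold's accuracy can be computed with ChucK's KNN2. I'm assuming the train/predict interface from the ChAI examples, and the fold-split arrays trainX/trainY/testX/testY are placeholder names:

    // hypothetical per-fold evaluation; stub declarations so the sketch parses
    float trainX[0][0]; int trainY[0];   // one fold's training features + labels
    float testX[0][0];  int testY[0];    // held-out fold

    KNN2 knn;
    knn.train( trainX, trainY );  // feature matrix + integer genre labels

    10 => int K;                  // number of neighbors (assumed value)
    10 => int NUM_GENRES;         // adjust to the dataset
    float probs[NUM_GENRES];
    0 => int correct;

    for( 0 => int i; i < testX.size(); i++ )
    {
        // class probabilities from the K nearest neighbors
        knn.predict( testX[i], K, probs );
        // argmax over genres is the predicted label
        0 => int best;
        for( 1 => int g; g < NUM_GENRES; g++ )
            if( probs[g] > probs[best] ) g => best;
        if( best == testY[i] ) correct++;
    }
    <<< "fold accuracy:", (correct $ float) / testX.size() >>>;

The mean accuracy column is then just the average of the five fold accuracies.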


Phase Two Milestone: Designing an Audio Mosaic Tool

I would like to be able to trigger samples of different instruments by mimicking those instruments with my voice. For this milestone, I am simply playing back clips of myself mimicking different instruments, but for the final submission I plan to map those clips to actual instrument samples (and potentially allow users to define their own ways of mimicking instruments). I think it would be fun to build up a song over time by mimicking instruments with my voice (and possibly by manually looping samples).

I used the provided mosaic-extract.ck and mosaic-synth-mic.ck programs to create the demo. In the early stages of testing, I was specifically interested in reproducing drum sounds through beatboxing, but I had trouble getting the playback timing to work. I experimented with adjusting the FFT size and HOP duration, and even tried syncing my beatboxing recordings to the FFT/hop size, but the timing problems persisted. So I pivoted to recreating a broader range of sounds by mimicking multiple instruments. This approach worked better, since the sounds are more distinct and more easily identified by KNN (at least with the features I extracted).
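For reference, these are the knobs I was experimenting with, roughly as they appear in the extraction program (the values here are just examples, not my final settings):

    // analysis window and hop (example values)
    adc => FFT fft;
    FeatureCollector combo => blackhole;
    fft =^ MFCC mfcc =^ combo;

    4096 => fft.size;
    Windowing.hann(fft.size()) => fft.window;
    (fft.size()/2)::samp => dur HOP;  // hop = half a window (50% overlap)

    // extraction loop: one feature frame per hop
    while( true )
    {
        combo.upchuck();
        HOP => now;
    }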

Demo



Phase Two: Designing an Audio Mosaic Tool

What This Tool Does

This tool provides a musician with four tracks (A, B, X, and Y) that they can manipulate with an Xbox controller and their voice.

The musician can assign an instrument of their choosing to a track by holding the track button (A/B/X/Y), holding the instrument selection button (LB), and mimicking the desired instrument into the microphone. Whatever instrument the musician mimics is assigned to the track whose button was held. For example, if I hold A+LB on the controller and start whistling into the mic, the tool assigns woodwinds to track A.

Additionally, the musician can alternate between two sections of a track by holding the track button (A/B/X/Y) and pressing the switch-section button (RB). For example, if track X is playing the first section of some piano music, holding X and pressing RB causes track X to play the second section instead.

Moreover, the musician can mute and unmute tracks on the fly by simply pressing the track button for whichever track they want to toggle. All tracks play on a loop, so this effectively enables and disables layers of the music.

How This Tool Works

This was one of the most complex coding projects I have ever taken on. It involved three main components:

  1. control flow of the Xbox controller,
  2. beat-aligned playback of instruments, and
  3. instrument detection with microphone input.

I first implemented the control flow of the Xbox controller. This was quite difficult, since I wanted combinations of button presses to trigger different actions. For example, I had to make sure that if someone presses A+RB to switch sections on track A, releasing A doesn't cause track A to mute; only uninterrupted button presses should mute tracks. There were other edge cases like this to consider as well.

Next, I implemented the beat-aligned playback of instruments. I realized early on that trying to find the right starting point within a clip to keep it beat-aligned would be a nightmare. Instead, I play all possible instruments/sections on loop from the start, and simply unmute whichever ones the musician chooses (and re-mute them when they switch instruments). This lets instruments and sections switch seamlessly, without interrupting the flow of the music.
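Here's a simplified sketch of how those two pieces fit together. The button numbers, file names, and combo-tracking flag are illustrative (and this version tracks only one held track button at a time), not my exact code:

    // simplified sketch: Hid button logic + always-looping tracks muted via gain

    // four tracks, all looping from time zero so everything stays beat-aligned
    SndBuf tracks[4];
    for( 0 => int i; i < 4; i++ )
    {
        tracks[i] => dac;
        "track" + i + ".wav" => tracks[i].read;  // placeholder file names
        1 => tracks[i].loop;
        0 => tracks[i].gain;                     // start muted
    }

    // open the controller
    Hid hid; HidMsg msg;
    if( !hid.openJoystick(0) ) me.exit();

    // illustrative button numbers for A/B/X/Y and the bumpers
    [0, 1, 2, 3] @=> int TRACK_BTN[];
    4 => int LB; 5 => int RB;

    // which track button is held, and whether a combo consumed the press
    -1 => int heldTrack;
    false => int comboUsed;

    while( true )
    {
        hid => now;
        while( hid.recv(msg) )
        {
            if( msg.isButtonDown() )
            {
                for( 0 => int i; i < 4; i++ )
                    if( msg.which == TRACK_BTN[i] )
                    { i => heldTrack; false => comboUsed; }
                // RB while a track button is held: switch that track's section
                if( msg.which == RB && heldTrack >= 0 )
                { true => comboUsed; /* switch section on heldTrack */ }
                // LB while a track button is held: start vocal instrument selection
                if( msg.which == LB && heldTrack >= 0 )
                { true => comboUsed; /* signal the detection shred */ }
            }
            if( msg.isButtonUp() && heldTrack >= 0
                && msg.which == TRACK_BTN[heldTrack] )
            {
                // only an uninterrupted press toggles mute
                if( !comboUsed )
                {
                    if( tracks[heldTrack].gain() > 0 ) 0 => tracks[heldTrack].gain;
                    else 1 => tracks[heldTrack].gain;
                }
                -1 => heldTrack;
            }
        }
    }

This matches the key idea above: nothing ever starts or stops; toggling gain is what the musician hears as tracks entering and leaving.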

Finally, I moved on to instrument detection with microphone input. I was able to repurpose the existing feature extraction code (mosaic-extract.ck) pretty easily, but had to make significant alterations to the mic synthesis code (mosaic-synth-mic.ck) to integrate it with my project. While the musician holds the instrument selection button (LB), I take repeated feature-extraction + KNN results from the microphone, and when they release LB, I select the instrument whose corresponding audio was identified as the nearest neighbor the greatest number of times. Merging this process with the controller's control flow was pretty difficult, and I needed to use ChucK events to prevent race conditions.
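Roughly, the voting loop looks like this (a sketch using KNN2's search form; the chain, the KNN setup, and the lbHeld flag are stubs standing in for the real pieces described above):

    // sketch of the LB voting loop; stubs make it self-contained
    adc => FFT fft;
    FeatureCollector combo => blackhole;
    fft =^ MFCC mfcc =^ combo;
    4096 => fft.size;
    Windowing.hann(fft.size()) => fft.window;
    (fft.size()/2)::samp => dur HOP;

    KNN2 knn;            // assume trained on the instrument recordings,
                         // with labels = instrument index
    4 => int NUM_INSTRUMENTS;
    int votes[NUM_INSTRUMENTS];
    int nearest[1];
    true => int lbHeld;  // in the real tool, the controller shred clears this

    while( lbHeld )
    {
        combo.upchuck() @=> UAnaBlob blob;        // one feature frame
        knn.search( blob.fvals(), 1, nearest );   // nearest neighbor's label
        votes[nearest[0]]++;
        HOP => now;
    }

    // majority vote decides the instrument assigned to the held track
    0 => int winner;
    for( 1 => int i; i < NUM_INSTRUMENTS; i++ )
        if( votes[i] > votes[winner] ) i => winner;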

As difficult as this project was, it's honestly one of my favorite things that I've ever built. Type 2 fun I guess.


Phase Three: Make a Musical Mosaic!

Ideas From Milestone

Here's what I said in the milestone: "I still don't know exactly what I want to do for this part, but I think it will involve layering several instrument loops, creat[ed] by mimicking those instruments and using keyboard input to loop. I might use ChuGL to display the instruments that the user is mimicking, or the actively playing instruments if there are several layered."

What I Did

I decided to have looping instruments, but I opted to use an Xbox controller instead of keyboard input as initially planned. At first this was because I had difficulty getting keyboard input to work on my computer, but I found that using a controller actually feels more natural too. I also didn't have time to get to ChuGL, given the complexity of the other components of this project.

For my performance, I wanted to showcase all of the tool's features, so I incorporated vocal instrument selection, section switching, and muting/unmuting (including multi-track muting/unmuting). The song I played is a piano piece I have been working on for a few weeks; for the performance I wrote two parts for each of four additional instruments (drums, guitar, brass, woodwinds) and blended them in ways I thought were interesting. I hope you like the result as much as I do.

Performance