For my first feature configuration, I used the configuration provided in the feature-extract.ck code.
This feature configuration contains: Centroid, Flux, RMS, and MFCCs (20 coefficients, 10 filters).
For my 2nd feature configuration, I tried using exclusively MFCCs as a feature. I kept the provided values of 20 coefficients and 10 filters for this configuration.
For my 3rd feature configuration, I was curious if increasing the number of filters used in the MFCC operations would allow the same number of coefficients to better describe the data. I again used exclusively MFCCs as a feature and kept 20 coefficients, but I increased the number of filters to 20 for this configuration.
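The intuition behind configuration 3 is that the filter count sets the frequency resolution of the mel filterbank that the MFCC coefficients summarize. The course code computes MFCCs in ChucK; as a language-neutral illustration (not the actual ChucK implementation), here is a hypothetical Python sketch of how 10 vs. 20 filters change the filterbank's spacing, using the standard mel-scale formula:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Standard mel-scale conversion (O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def filter_centers(n_filters: int, f_min: float = 0.0, f_max: float = 22050.0):
    """Center frequencies (Hz) of a triangular mel filterbank.

    n_filters + 2 evenly spaced points on the mel scale give the left
    edge, center, and right edge of each triangular filter; the inner
    n_filters points are the centers.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    points = [mel_to_hz(lo + (hi - lo) * i / (n_filters + 1))
              for i in range(n_filters + 2)]
    return points[1:-1]  # drop the outer edges, keep the centers

# 10 filters vs. 20 filters over the same frequency range:
coarse = filter_centers(10)
fine = filter_centers(20)
print(len(coarse), len(fine))  # 10 20
```

With 20 filters, the centers are packed roughly twice as densely on the mel scale, so the same 20 DCT coefficients summarize a finer-grained spectral envelope.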
For my 4th feature configuration, I wanted to capture as much information about the audio as possible in my feature vectors, so I collected all available features (Centroid, Flux, RMS, RollOff, ZeroX, MFCCs (20 coefficients, 20 filters), Chroma, Kurtosis).
For my 5th feature configuration, I wanted to try using all features besides MFCCs. My 2nd/3rd configurations use exclusively MFCCs, so I was curious to see if the opposite of that could produce better results. I was also curious how far the accuracy would drop from configuration 4's after omitting MFCCs. So I only collected the following features: Centroid, Flux, RMS, RollOff, ZeroX, Chroma, Kurtosis.
Configuration | Dimensionality | Fold 1 Accuracy | Fold 2 Accuracy | Fold 3 Accuracy | Fold 4 Accuracy | Fold 5 Accuracy | Mean Accuracy
---|---|---|---|---|---|---|---
Configuration 1 (Default) | 23 | 0.4520 | 0.4392 | 0.4221 | 0.4529 | 0.4250 | 0.4382 |
Configuration 2 (MFCCs, 10 filters) | 20 | 0.4054 | 0.4010 | 0.3995 | 0.3779 | 0.4123 | 0.3992 |
Configuration 3 (MFCCs, 20 filters) | 20 | 0.4265 | 0.4314 | 0.3838 | 0.4402 | 0.4422 | 0.4248 |
Configuration 4 (Everything) | 58 | 0.4520 | 0.4676 | 0.4789 | 0.4882 | 0.4740 | 0.4721 |
Configuration 5 (Everything - MFCCs) | 18 | 0.3804 | 0.4069 | 0.3917 | 0.3706 | 0.3426 | 0.3784 |
Unsurprisingly, Configuration 4 produced the highest mean accuracy in cross-validation, which was to be expected since its feature vectors contained all available features. However, Configuration 4 also had the slowest feature extraction process, which may make Configurations 1 or 3 more practical for real-time extraction. It is also interesting that Configurations 2 and 3 outperformed Configuration 5: all of the other features combined (Centroid, Flux, RMS, RollOff, ZeroX, Chroma, Kurtosis) still capture less useful information for genre classification than the 20 MFCC coefficients alone. Adding the MFCCs back to the rest of the features raises mean accuracy by roughly 9 percentage points (0.3784 for Configuration 5 vs. 0.4721 for Configuration 4). This shows how important MFCCs are as a feature in genre classification.
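The mean accuracies in the table are simply the averages of the five fold accuracies; a quick Python check of the two configurations being compared confirms the gap:

```python
from statistics import mean

# Per-fold cross-validation accuracies, copied from the table above.
fold_accuracies = {
    "Configuration 4 (Everything)": [0.4520, 0.4676, 0.4789, 0.4882, 0.4740],
    "Configuration 5 (Everything - MFCCs)": [0.3804, 0.4069, 0.3917, 0.3706, 0.3426],
}

for name, accs in fold_accuracies.items():
    print(f"{name}: mean = {mean(accs):.4f}")

# Accuracy gap attributable to adding MFCCs back in:
gap = mean(fold_accuracies["Configuration 4 (Everything)"]) \
    - mean(fold_accuracies["Configuration 5 (Everything - MFCCs)"])
print(f"difference: {gap:.4f}")  # 0.0937, i.e. about 9 percentage points
```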
I would like to be able to trigger samples of different instruments by mimicking those instruments with my voice. For this milestone, I am simply playing back clips of myself mimicking different instruments, but for the final submission I plan on mapping those clips to actual instrument samples (and potentially allowing users to input their own ways of mimicking instruments). I think it would be fun to try building up a song over time by mimicking instruments with your voice (and possibly manual looping of samples).
I used the provided mosaic-extract.ck and mosaic-synth-mic.ck programs to create the demo. In the early stages of testing, I was specifically interested in reproducing drum sounds through beatboxing, but I was having trouble getting the playback timing to work. I ran several experiments adjusting the FFT size and HOP duration, and even tried syncing my beatboxing recordings with the FFT size/hop size, but was still having trouble. So I pivoted to recreating a broader range of sounds by mimicking multiple instruments. I found better results with this approach, since the sounds are more distinct and easily identifiable with KNN (at least with the features I extracted).
This tool provides a musician with 4 possible tracks (A, B, X, and Y) that they can manipulate with an Xbox controller and their voice.
The musician can assign an instrument of their choosing to a track by holding the track button (A/B/X/Y), holding the instrument selection button (LB), and mimicking the instrument they want with their voice into the microphone. Whatever instrument the musician mimics will be assigned to the track whose button was held. For example, if I hold A+LB on the controller and start whistling into the mic, the tool will assign woodwinds to track A.
Additionally, the musician can alternate between 2 sections of a track by holding the track button (A/B/X/Y) and pressing the switch section button (RB). For example, if track X is playing the 1st section of some piano music, holding X and pressing RB will cause track X to start playing the 2nd section of the piano music instead.
Moreover, the musician can mute and unmute tracks on the fly by simply pressing the button for whichever track they want to toggle. All tracks play on a loop, so muting and unmuting effectively enables and disables loops.
This was one of the most complex coding projects I have ever taken on. It involved 3 main components:
Finally, I moved on to instrument detection with microphone input. I was able to repurpose the existing feature extraction code (mosaic-extract.ck) pretty easily, but I had to make significant alterations to the mic synthesis code (mosaic-synth-mic.ck) to integrate it with my project. I take repeated feature-extraction + KNN results from the microphone while the musician holds the instrument selection button (LB), and select the instrument whose corresponding audio file is identified as the nearest neighbor the greatest number of times before the musician releases LB. Merging this process with the control flow for the controller was pretty difficult, and I needed to use ChucK events to prevent race conditions.
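The "greatest number of times" step is a majority vote over the per-frame nearest-neighbor labels collected while LB is held. A minimal Python sketch of that vote (the label names are illustrative, not the actual training data):

```python
from collections import Counter

def select_instrument(knn_results: list) -> str:
    """Majority vote over per-frame nearest-neighbor labels collected
    while the instrument selection button (LB) is held."""
    if not knn_results:
        raise ValueError("no classifications collected while LB was held")
    # most_common(1) returns [(label, count)] for the most frequent label.
    return Counter(knn_results).most_common(1)[0][0]

# e.g. frames classified while whistling into the mic:
frames = ["woodwinds", "woodwinds", "brass", "woodwinds", "guitar"]
print(select_instrument(frames))  # woodwinds
```

Voting over many frames smooths out the occasional misclassified frame, which is why holding LB for a moment gives a more reliable assignment than a single KNN query would.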
As difficult as this project was, it's honestly one of my favorite things that I've ever built. Type 2 fun I guess.
Here's what I said in the milestone: "I still don't know exactly what I want to do for this part, but I think it will involve layering several instrument loops, creating by mimicking those instruments and using keyboard input to loop. I might use ChuGL to display the instruments that the user is mimicking, or the actively playing instruments if there are several layered."
I decided to have looping instruments, but I opted to use an Xbox controller instead of keyboard input as initially planned. This was at first because I was having difficulty getting keyboard input to work on my computer, but I realized that using a controller actually feels more natural too. I also didn't have time to get to ChuGL, given the complexity of the other components of this project.
For my performance, I wanted to showcase all the features of the tool, so I incorporated vocal instrument selection, section switching, and muting/unmuting (including multitrack muting/unmuting). The song I played is a piano piece I have been working on for a few weeks; I also wrote 2 parts for each of 4 additional instruments (drums, guitar, brass, woodwinds) and blended them in ways I thought were interesting throughout the performance. I hope you like the result as much as I do.