Well, I had to completely restart my Unity project (for the second time) because OneDrive offloading corrupted the core Chunity library, among other assets. That was okay, though, because it gave me the chance to reevaluate the mission of my project from the barest of bones. I wanted to build some kind of real-time sequencer or sample pad driven by the keyboard, but I decided that both BPM-matching loop boundaries and mapping loops to keyboard triggers would be too tough in the limited time I had. I ended up ditching the webcam from the milestone because I felt it would have distracted from the immersive, all-screen experience of the final product.

The control scheme is simple: hold down 1 through 4 on the keyboard to record a sample for AI Ge to try his best at replicating. The catches are that you must wait until all four KNNs are trained (look for “READY!!” in the console), and that a bank accepts no further user input once its loop begins (too hard).

The transformation of the UGen topology in the backend was kind of funny, as well as illuminating. The live sampling is (of course) accomplished with instances of LiSa, but initially I had the ADC recording directly into the LiSas, which would play back raw audio live through the KNNs. This proved entirely too intensive to do in four-way parallel, especially with the live spatialization. The quick fix ended up moving the LiSa all the way to the end of the chain, after the KNN-driven SndBuf array, so we cache the AI's output and loop that instead. That decision alone saved my project. Thank you for your time and attention! :3
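The revised topology can be sketched in ChucK for a single bank. This is an illustrative sketch, not the project's actual code: the names, the loop length, and the `special:dope` stand-in for the KNN-matched audio are all assumptions.

```chuck
// one bank of the revised chain (sketch). the KNN-matched audio plays
// through a SndBuf, and a LiSa at the END of the chain caches that
// output, so the loop doesn't re-run the KNN on every pass.
SndBuf buf => LiSa cache => dac;
"special:dope" => buf.read;   // stand-in for the AI's matched output

2::second => dur LOOP_LEN;    // illustrative loop length
LOOP_LEN => cache.duration;   // allocate the LiSa's buffer

// record one pass of the AI's output into the LiSa...
1 => cache.record;
LOOP_LEN => now;
0 => cache.record;

// ...then loop the cached audio from here on out
1 => cache.loop;
1 => cache.play;
4 * LOOP_LEN => now;          // let it loop for a while
```

In the original topology the LiSa sat at the front (adc => LiSa), so every playback pass had to flow through the KNN stage again; moving the cache to the end means the expensive matching happens only once per recording.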
Also, I did not create a .app build for this project, mainly because the core functionality of the game doesn't work in builds for some reason.
Boy, was this one kind of challenging. For days I didn't know where to take my project in Phase 2, but all of a sudden inspiration hit me: I wanted to mimic being on a Zoom call, using the KNN embeddings as a sort of call and response. To create more intrigue, I started with an apartment living room environment and adapted Andrew and Ge's ChAI boilerplate code into it. Since I love a little meta, I ripped the recording of the Phase 2 Zoom tutorial and made that the basis for the feature vector training. It took hours of tuning hyperparameters to make the model's output speechlike. Sadly, at midnight on the due date, my Unity project file was corrupted and I basically had to remake it from scratch! Thankfully, the code and data survived, or I would have cried more than I already did. I'm still thinking about where I want to take the final product; maybe I should write a poem?
I did not embed the relevant files for this milestone because the Unity project folder takes up 18 GB of space on my SSD. In addition, the webcam currently doesn't work in the built executable, and even if it did, it would still be an 8 GB download. I will reserve these stresses upon the CCRMA web filesystem for the final deliverables. On another note, with this project I hope to challenge the proliferation of AI a little bit by making a proof-of-concept Ge deepfake, if he will allow it. Rice was always meant to go on top of beans, anyway!
I would like to thank the following resources for making my Unity project possible so far:
Through hours of experimentation, I was able to curate a feature collection that performed 20% better on cross-validation than the base collection. I initially wanted to go off in my own direction, but quickly hit pitfalls. For example, a chain built just out of ZeroX, Chroma, SFM, and MFCC scored about 0.1 on cross-validation, where the baseline was about 0.4. So instead I decided to reinforce the existing set of Centroid, RMS, Flux, and MFCC by adding a few more features that performed well (Chroma, SFM, RollOff), and I tweaked the parameters of the relevant unit analyzers until I was happy, ending up with a cross-validation score just above 0.5.
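In ChucK, the curated chain looks roughly like this. A sketch only: the FFT size and MFCC coefficient count are illustrative assumptions, while the feature set and the lowered rolloff percentage are the ones discussed above.

```chuck
// curated feature chain (sketch): the baseline Centroid/RMS/Flux/MFCC,
// reinforced with Chroma, SFM, and RollOff, all upchucked into one
// FeatureCollector so they arrive as a single combined feature vector
adc => FFT fft =^ Centroid centroid =^ FeatureCollector combo => blackhole;
fft =^ RMS rms =^ combo;
fft =^ Flux flux =^ combo;
fft =^ MFCC mfcc =^ combo;
fft =^ Chroma chroma =^ combo;
fft =^ SFM sfm =^ combo;
fft =^ RollOff rolloff =^ combo;

// tweaked analyzer parameters (values here are illustrative,
// except the rolloff percent, which I lowered from .8 to .5)
4096 => fft.size;
Windowing.hann(fft.size()) => fft.window;
20 => mfcc.numCoeffs;
.5 => rolloff.percent;

// extract one combined feature vector (one analysis frame)
fft.size()::samp => now;
combo.upchuck() @=> UAnaBlob blob;
blob.fvals() @=> float features[];
```

Each `=^` branch hangs off the same FFT, so adding a feature costs one extra analyzer pass per frame rather than a whole new spectral analysis.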
This experience of trying to tune audio features to optimally identify and classify genres made me verbalize my gut feeling of wanting to stay away from engineering AIs in my career. There is an utter lack of transparency about why one configuration is less accurate than another, and I feel so far removed from the music on which I'm trying to gain a new perspective. I found it really troubling to be tasked with improving the accuracy of a feature collection, because the framings I'm presented with are not how I think about genres, or about music at all. I would like somebody to explain to me intuitively how changing the rolloff percentage feature from 80% to 50% slightly increased my fold accuracy from about 0.5 to 0.53. I guess I can call myself a software engineer at this point. As one, I really like to be in touch and in tune with the systems I'm operating on. If I see a line of code I don't recognize or understand, there's always a human who wrote it, or who references it often and could explain it to me. I don't know what will happen if we enter an era of AI-generated code: what happens to accountability? I think accountability and transparency are two sides of the same coin. Perhaps this music genre classifier is a powerful yet graspable enough experience to expose me to these deep-seated feelings and anxieties.