B R O [ K N N ] E C H O E S
a vaporwAIve piece
Click here to watch
I present this project as an interactive audiovisual tool and as a showcase of how it might be used creatively. In the last few weeks, we've discussed ethical AI, creative property, and generative art, to name a few topics. Even outside of class, cutting-edge research at the intersection of music and AI seemed inescapable; the release of Google's MusicLM model and the recent discussion panel with TikTok's SAMI team prompted a lot of critical reflection that influenced my project design. Given that our assignment was to create a "mosaic" tool that used concatenative synthesis to generate sound, I started to question the nature of whatever kind of "art" would result. Training on and re-synthesizing data consisting of the existing creative work of others felt weird. However, I was very inspired by the milestone demos from others in this class: despite being inherently "derivative" creations, everyone's projects manipulated and recombined the audio data in truly creative, transformative ways. After all, there are plenty of non-AI-related art forms, like sampling in hip-hop or collage in visual arts, that leave room for aesthetically valuable and novel creations. The musical and graphical aesthetic of vaporwave is another great example: I have been a huge fan and follower of this niche artistic genre that has existed primarily on the internet in the last decade. For those unfamiliar with the genre, I will give a brief tour of vaporwave's aesthetic features.
Vaporwave relies heavily, if not entirely, on repurposing the sounds and visuals of decades past. So much so that fans of vaporwave often quip that the genre is "just slowed-down Diana Ross." But even beyond leaning on retro styles and appeals to nostalgia, there are some additional layers. Vaporwave often heavily features blatant and unabashed images of consumerism: a slew of brand names with no apparent relevance, or excerpts from advertisements that, without context, just speak about how tasty and exceptional an unnamed product is. This combination of retro setting and embrace of hyper-commercialism, as well as a strong appreciation for technology and futurism, is what leads vaporwave artists to draw heavily from American and East Asian media of the 80s-90s in their creations.
While I could go further into the characteristics and theory of vaporwave, suffice it to say that its key tenet of explicitly re-sampling and recycling the creative material of years past made it particularly appropriate as an aesthetic environment for an AI-fueled musical mosaic. I think it is especially relevant that so much of the current conversation around ethics and AI-generated art involves the shameless use of this new technology to push hyper-capitalistic interests and encourage the consumption of food, entertainment, and so on. In a way, the modern landscape of AI-in-art represents the next stage of vaporwave's evolution.
Features & Usage:
With this aesthetic framework in mind, I wanted to build a creative tool that would allow for the sounds and images of vaporwave to be created in real-time. Most media in the vaporwave genre exists in the form of individual users' YouTube videos, often getting copyright strikes for blatant, low-effort lifting of 80s pop tunes. That life cycle is compressed to every few seconds in this piece, with the user sort of "live remixing" music videos and commercials from the 80s-90s at their own pace, processing and modifying both audio and visual elements on the fly.
The user creates sounds and images primarily by interacting with an object in the 3D virtual space. I first extracted audio features from several hours of Japanese City Pop, MTV hits, live concerts, 90s action movies, and other typical vaporwave source material. The resulting feature data was reduced to 3 dimensions using PCA. Then, the 3D position vector of a player-controlled object is used as the query for a KNN search of the PCA-reduced vector space. The results of that search are then used to resynthesize a number of audio windows (and accompanying video frames) in real time. So, the player is shown a cube that represents the 3D vector space, and controls a 3D object within this cube, moving it through the vector space to drive changes in the resynthesized audio/video.
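As a rough illustration of the pipeline described above, here is a minimal Python sketch of reducing feature vectors to 3 dimensions with PCA and then querying the reduced space with a 3D position. It is not the ChucK code used in the project: the random stand-in data, the 20-dimensional feature size, the function names, and the brute-force search are all assumptions made for the example.

```python
import numpy as np

def fit_pca(features, n_components=3):
    """Return (mean, components) for projecting onto the top principal axes."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def project(features, mean, components):
    """Project feature vectors into the reduced space."""
    return (features - mean) @ components.T

def knn_indices(points, query, k):
    """Brute-force k-nearest-neighbor search; indices are sorted nearest first."""
    dists = np.linalg.norm(points - query, axis=1)
    return np.argsort(dists)[:k]

# Hypothetical stand-in for extracted audio features:
# 500 audio windows, 20 feature values per window.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 20))

mean, components = fit_pca(features)
reduced = project(features, mean, components)  # shape (500, 3)

# Hypothetical player position, placed very close to window 42 in the reduced space.
player_position = reduced[42] + 0.001

# Indices of the K audio windows whose features lie closest to the player.
nearest = knn_indices(reduced, player_position, k=5)
```

In a real-time setting, the indices in `nearest` would select which stored audio windows (and their matching video frames) get played back; how the player's position inside the cube is scaled to the range of the reduced feature space is a separate design choice not shown here.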
Additional key controls allow the user to modify various aspects of the audio and resynthesis parameters:
There were a few design features I wished I had implemented, but couldn't get to work in time. Right now, the particle system in the cube is just a visual representation of what is going on with KNN search. I originally wanted this to be an actual representation of the individual audio windows in the feature space. I wanted to "highlight" the exact points in the feature space that were returned by KNN, and play around with having the particle behavior affect the audio resynthesis. I also would have liked to implement the kind of beat-tracking and envelope follower systems that some others in the class discussed or used. It would have been nice to have the output a bit clearer and potentially beat-aligned, so that it was less abstract and more song-like. Lastly, my more ambitious idea, related to the previous one, was to introduce a sort of loop-pedal system where the user could "record" a loop by selecting a point in feature space and then assign it to one of the prop objects in the scene. They could then "play" these loop fragments in real time to create more structured musical phrases and grooves.
I felt that I was able to realize my core objectives, however, and I am happy with how my project turned out functionally and aesthetically. It feels cohesive, and flexible enough to allow for creative use and further development. It would be cool to make this VR-capable, and to implement some of the ideas mentioned above. The ultimate design feature to add would be to allow for users to upload their own videos and music as input, and perform extraction as well. Maybe I will work towards these additional goals in the future!
I could not have completed this project without help from Ge, Yikai, Andrew, Terry, Celeste, Victoria, Nick, and everyone in the class who sent late-night Discord advice or inspired me with their incredible ideas. This was one of the most enjoyable and fulfilling projects I've done since coming to Stanford, and the conceptual challenges relating to generative art have forced me to really think about what, how, and why I want to create music and art.
(previous milestones below)
I had fun with phase 1, and was surprised at how simple and effective (most of) the new features in ChucK are. I enjoyed playing with the genre classifier, and experimented with various combinations of extracted features. I did notice that more features do not necessarily mean better performance. The features that seemed to matter the most were RMS and the MFCCs. I was also getting some decent results with Kurtosis, and so I included that in my phase 2 feature extraction. I was trying to use Chroma, SFM, and Zero-crossing, but it seems there are some issues under the hood. I imagine some of those would also carry a lot of information about the sonic signature of these songs, and I was particularly interested in SFM. Maybe if it gets fixed I will try to include it in my phase 2/3 methods.
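To make two of the features mentioned above more concrete, here is a small Python sketch that computes RMS and kurtosis for a single frame of audio samples. This is only an illustration: it works directly on time-domain samples and uses the non-excess definition of kurtosis, both of which are assumptions and may differ from how ChucK's analyzers compute these values. The test frame (a sine wave) and the function names are also hypothetical.

```python
import numpy as np

def frame_rms(frame):
    """Root-mean-square level of a frame of samples."""
    return float(np.sqrt(np.mean(np.square(frame))))

def frame_kurtosis(frame):
    """Non-excess kurtosis: fourth central moment divided by squared variance."""
    centered = frame - np.mean(frame)
    variance = np.mean(centered ** 2)
    if variance == 0:
        return 0.0  # A constant (e.g. silent) frame has no defined kurtosis.
    return float(np.mean(centered ** 4) / variance ** 2)

# Hypothetical test frame: 4 full cycles of a unit-amplitude sine in 1024 samples.
t = np.arange(1024)
frame = np.sin(2 * np.pi * 4 * t / 1024)

# One possible per-frame feature vector built from the two values.
feature_vector = np.array([frame_rms(frame), frame_kurtosis(frame)])
```

For a pure sine wave like this one, the RMS is about 0.707 and the kurtosis is 1.5; more percussive or "spiky" frames produce larger kurtosis values, which may be part of why it adds information beyond loudness alone.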
Click here to watch phase 2 milestone
I experimented with a LOT of ideas in phase 2, most of which didn't go very far before I ended up rethinking my approach. My original plan was to train my model on a large dataset of songs that have influenced me over my lifetime (I called it an "audiobiography"). I went through a lot of trouble compiling this dataset, but in addition to some technical issues, I also wasn't really getting results that felt meaningful. Since the source material was so diverse and expansive, any kind of similarity retrieval would pull from all over the place with little sense of continuity. I had originally wanted to feed in audio files of my own original songs as input, to see how they might be "reconstructed" using snippets of the songs that have influenced me. This didn't really work. Perhaps with more tuning it might be workable, but I felt like I wanted to do something more creatively satisfying and responsive in real time.
I also played with using speech audio datasets to create a sort of vocal transformer. I had great input data, and the re-synthesis sounded surprisingly clear. However, the output did not correspond to what I spoke into the mic in the way I wanted. Basically, it would turn my speech into babble, and I think having different speakers (many female samples) in my training set made it sort of bad at matching my vocal tone.
I struggled a lot to think of a concept for this project that felt like a fun creative tool I would actually use. I noticed during my experimentation that sometimes the collage-like sound of the re-synthesis reminded me of the kind of sampling and remixing done in vaporwave music. The aesthetic values in vaporwave line up nicely with this idea of recycling and transforming older, existing audio content, usually in the form of taking American and East Asian music and audio from the 80s-90s and changing playback speed, adding reverb and other effects, and pairing it with visual scenes filled with anachronistic elements, blatant commercialism, and a distinct color palette. By moving this project into the audiovisual domain, I felt that I might be able to create an interactive space that suits vaporwave's aesthetic principles perfectly by using this kind of similarity retrieval.
Keyboard commands currently operate mostly on the ChucK side to manipulate various aspects of the synthesis. I built on the keyboard-osc example code, borrowing the "Freeze" and "Use Closest Window" modes, as well as using keys 1-9 to change K. I also encapsulated some parts of the code to allow for on-the-fly modifications to things like the NUM_FRAMES/EXTRACT_TIME, playback speed, and effects like reverb and chorus. Mic input is off by default and is only active while the user is holding Space. I envision a sort of live sampling tool where users can activate mic input for a short period of time, and the resulting resynthesized audio is looped and saved, able to be triggered within the virtual world. Within the virtual world, I want to build upon my final project from 256A, using the video render textures to create TV-like objects that float around and can be selected and interacted with by the user. Each one will represent a different source video/audio. I compiled a couple of hours' worth of material: Japanese city pop, American and British pop music videos from the 80s and 90s, as well as anime and movie clips from those eras. I also downloaded a lot of 3D assets to populate my virtual space and make it vaporwave-y.
Sources & Acknowledgements
Download ChucK files
Training data sources: