Project 2

My initial inspiration for this project came from feeling nostalgic about a recent trip to Hong Kong. Yet what started out as just a tool for exploring sounds and songs I heard during the trip ended up becoming a full-blown audiovisual narrative / tribute of sorts to the Hong Kong film “Chungking Express”. Although unexpected, the final result follows from my attempt to deliver a coherent musical statement above all else.

From the start, I thought that K-nearest neighbors did a good job of blending together similar sounds and creating interesting soundscapes. Even after reducing the feature space down to 2 dimensions with PCA, different audio clips seemed to map cleanly to different regions. While I initially included jarring street and subway noises, my priority of delivering a musical statement led me to discard them (despite their being the more recognizable sounds of the city). As in my milestone, I wanted to contrast the chaos of K-nearest neighbors audio blending with the low-stakes feel of regular playback, as well as navigate the sounds in a 2D space using PCA. Per my milestone feedback, I played around with parameters such as window size, and I eventually found that varying the playback speed (i.e., playing sounds in reverse) achieved a satisfying effect.
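A minimal sketch of this blending idea in Python, assuming a folder of short clips loaded with librosa; the feature choice (mean MFCCs), the clip paths, and the blend details here are placeholders rather than my exact setup:

```python
# Sketch: blend sounds by sampling a clip's K nearest neighbors in a
# PCA-reduced feature space, optionally reversing playback.
import glob
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

SR = 22050
paths = sorted(glob.glob("clips/*.wav"))            # placeholder clip folder
clips = [librosa.load(p, sr=SR)[0] for p in paths]

# One feature vector per clip: mean MFCCs (stand-in for the full feature set).
feats = np.array([librosa.feature.mfcc(y=y, sr=SR, n_mfcc=20).mean(axis=1)
                  for y in clips])

coords = PCA(n_components=2).fit_transform(feats)    # 2D "map" of the clips
knn = NearestNeighbors(n_neighbors=4).fit(coords)

def blend_at(point, reverse=False):
    """Mix the clips nearest to a 2D point; optionally play them reversed."""
    _, idx = knn.kneighbors(np.atleast_2d(point))
    neighbors = [clips[i] for i in idx[0]]
    n = min(len(y) for y in neighbors)
    mix = sum(y[:n] for y in neighbors) / len(neighbors)
    return mix[::-1] if reverse else mix

out = blend_at(coords[0], reverse=True)              # e.g., blend around clip 0
```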

For the visual component, I curated movie clips from the Hong Kong film “Chungking Express” and manually mapped them to my audio clips. I used Processing to handle video playback. I also wanted to play certain video clips in reverse, but it seems like the built-in Processing video library doesn’t support this.

Overall, I found K-nearest neighbors much more enjoyable to work with than Word2Vec. I felt like I had to go out of my way to incorporate Word2Vec into my process, whereas, at least to me, K-nearest neighbors (and more generally, leveraging audio features / latent spaces) seems like a useful sound design tool in itself (along the lines of AI as a "mirror of the past" / AI for data analysis).

Final

Phase 2 -- The sounds I chose are ones I remember from a recent trip to Hong Kong, whether walking through the streets, sitting in a cafe, or riding the subway. I wanted to explore the idea of navigating through sounds in a 3D space, analogous to someone walking through a city. Currently, my system has two modes: concatenate mode (the default), where sounds are sampled using KNN on a PCA-reduced feature space, and playback mode, where the audio is played through unaltered. The idea is to create a sense of calm and, hopefully, timelessness as the listener stays still, contrasted with the chaos as the listener navigates the 3D space.
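A rough sketch of how the two modes might fit together, reusing a fitted KNN model and clip list as in the earlier sketch (but with a 3-component PCA); the grain length and stillness threshold are placeholder assumptions:

```python
# Sketch of the two-mode behavior: "concatenate" while the listener moves,
# plain "playback" while they stay still.
import numpy as np

def step(listener_pos, prev_pos, knn, clips, still_eps=1e-3):
    """Return audio for one step of the walk through the 3D sound space."""
    moved = np.linalg.norm(np.asarray(listener_pos) - np.asarray(prev_pos)) > still_eps
    _, idx = knn.kneighbors(np.atleast_2d(listener_pos))
    if moved:
        # Concatenate mode: string together short grains from the K neighbors.
        grain = 2048
        return np.concatenate([clips[i][:grain] for i in idx[0]])
    # Playback mode: play the single nearest clip through unaltered.
    return clips[idx[0][0]]
```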

Phase 1 -- I tried the following feature extractors: (1) Default (Centroid, Flux, RMS, MFCC), (2) All listed features (added RollOff, ZeroX, Kurtosis, Chroma), (3) RMS and MFCC only, (4) Default, but with 32 MFCC coefficients (instead of 20), and (5) All features except Chroma. In general, I found that the more features I used, the better the classifier performed (albeit with diminishing returns). The default features seemed to perform well enough right off the bat, and removing any one of them seemed to hurt performance considerably (which surprised me, considering that Centroid, Flux, and RMS are only 1-dimensional). In the end, I thought that using all features except Chroma was the optimal tradeoff between performance and feature size. (Extraction, Classification, Validation)
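A hedged sketch of how this kind of comparison could be set up in Python with librosa and scikit-learn; the per-class folder layout, the choice of a kNN classifier, and the onset-strength stand-in for spectral flux are all placeholder assumptions:

```python
# Sketch: compare feature subsets by cross-validated classifier accuracy.
import glob, os
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

SR = 22050

def extract(y, names):
    """Mean-pool a chosen subset of frame-level features into one vector."""
    fns = {
        "centroid": lambda: librosa.feature.spectral_centroid(y=y, sr=SR),
        "rms":      lambda: librosa.feature.rms(y=y),
        "rolloff":  lambda: librosa.feature.spectral_rolloff(y=y, sr=SR),
        "zerox":    lambda: librosa.feature.zero_crossing_rate(y),
        "mfcc":     lambda: librosa.feature.mfcc(y=y, sr=SR, n_mfcc=20),
        "chroma":   lambda: librosa.feature.chroma_stft(y=y, sr=SR),
        # Onset strength used here as a rough stand-in for spectral flux.
        "flux":     lambda: librosa.onset.onset_strength(y=y, sr=SR)[None, :],
    }
    return np.concatenate([fns[n]().mean(axis=1) for n in names])

paths = sorted(glob.glob("data/*/*.wav"))                     # placeholder layout
labels = [os.path.basename(os.path.dirname(p)) for p in paths]
audio = [librosa.load(p, sr=SR)[0] for p in paths]

for subset in [("centroid", "flux", "rms", "mfcc"),           # "default"
               ("rms", "mfcc"),
               ("centroid", "flux", "rms", "mfcc", "rolloff", "zerox", "chroma")]:
    X = np.array([extract(y, subset) for y in audio])
    acc = cross_val_score(KNeighborsClassifier(), X, labels, cv=3).mean()
    print(subset, round(acc, 3))
```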

Milestone


Reading Response

As AI becomes increasingly widespread in everyday life, interactive machine learning seems to have emerged as the preferred form of AI for most users. As Amershi et al. mention in their 2014 study, not only can cleverly designed interactive AI systems facilitate more enjoyable and trustworthy user experiences, but by empowering the human in the loop, they can even outperform more traditional “AI-as-oracle” approaches in certain cases (e.g., ManiMatrix enabling AI novices to achieve classifier accuracy comparable to that of AI experts manually fine-tuning decision boundaries, and in less time).

While the case for human-in-the-loop AI is clear, I wonder if the distinction between more traditional offline machine learning versus interactive AI systems fundamentally boils down to a distinction between theory versus practice. As Amershi et al. mention, one constraint with interactive AI systems is that they require the model to run close to real-time, which limits the model’s architecture and likewise its capabilities. So although human-in-the-loop AI systems often achieve performance comparable to (if not better than) that of their fully-automated counterparts using simpler models, I imagine significant effort must have gone into determining which techniques/architectures were suited for the task in the first place, along with navigating the tradeoff between performance and efficiency. I am not an AI researcher myself, but as a researcher making simulation tools for animation artists, I think every computer-based tool (including but not limited to AI) ought to eventually be designed as an interactive tool. Unfortunately, I often find myself worrying more about whether certain simulations are even possible in the first place, rather than how to make them faster or what the user interaction would look like.

At the same time, I can see why such a mindset can be dangerous, especially if AI research continues to push the boundaries of what is possible without giving the public enough time to figure out how to use or live with these capabilities. It seems like there’s an equilibrium point between traditional “what-if” machine learning research and human-centric AI systems design, perhaps even a middle ground where the machine learning researchers themselves play a larger role in figuring out how their methods can be adapted to better suit human preferences in the real world. Yet I’ve often heard that research is about specialization. With how competitive the current machine learning research landscape is – where seemingly every idea is a race between who can implement and iterate on it the fastest – how would you convince researchers to essentially “slow down”? It feels like any meaningful change begins with redefining what constitutes valuable machine learning research in the first place.

Given these considerations, I think an interactive AI approach/mindset is best suited for any task where the underlying theory and space of techniques have already been reasonably developed, rather than necessarily being attached to a certain domain or application. More broadly, I think “AI-as-oracle” and human-in-the-loop methods should be developed in tandem and at a mutually beneficial pace. Thus, here are 10 activities that I think could benefit from an interactive AI approach/mindset:

1.) Recommender system for "filling in chord voicings/drums/backing harmonies/etc." that adapts to the user's own musical style: Generative music transformers already exist for completing musical tracks, so an interactive system would mainly require making them faster and incorporating ways to weight a user's evaluations (e.g., whether they choose to keep the generated content, and what they replace it with).
2.) "__" image classifier: Machine learning has outperformed humans on ImageNet classification accuracy for years now, so designing interactive GUIs for experimenting with and customizing these models for user-defined tasks becomes more important than any fine-tuning of the model architectures themselves.
3.) Potato peeling: I would argue that tasks that "ought to be automated" should ideally be interactive as well, letting users easily customize the level of potato-peeledness, even if it means constantly prompting the user for feedback (the user can always ignore and go with the default setting).
4.) Barebones website designer: Instead of my reading responses being just .html files, I would much appreciate a tool that packages them into basic webpage templates that stay consistent across different structures (number of headings, paragraph size, etc.) and reflect my user-directed aesthetic preferences.
5.) Automated light dimmer: a simple physical light-switch + sensor combo that tracks its on/off history and is able to automatically fade to dark when it predicts you're sleeping (can correct by turning brightness up again)
6.) Spotify autoplay with more controls: being able to predict when I want to listen to familiar songs, versus when I'm in a more exploratory mood, as well as allowing me to override it by specifying a slider value.
7.) Automated equipment lists for music practice: having a database of instrumentation for every song/practice setup and being able to generate a draft of the equipment list, while adapting to year-to-year variations in personnel and inventory when the user manually makes corrections.
8.) Semantic document auto-organization system: a large pre-trained language model that's able to extract features from my documents (e.g., research papers) and automatically cluster them into different semantic categories, which I can then interactively explore, assign labels to, and possibly re-organize / use to guide the model (see the sketch after this list).
9.) Dynamic text message notification classifier: a lightweight program on your laptop/phone that determines the urgency/relevancy of a text and whether/when/how to notify you (as opposed to having your device on "Do Not Disturb" all the time), responding to user feedback on whether you choose to mute or un-mute notifications.
10.) Interactive piano duet system: similar to the first idea, but more constrained and in a live setting; feedback is based on whether the user goes along with the current part (indicating a strong, positive match) or abruptly changes / simplifies their playing.
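For idea 8, here is a minimal sketch of what the non-interactive core might look like, assuming the sentence-transformers and scikit-learn libraries; the model name, file layout, and cluster count are placeholder assumptions, and the interactive labeling / re-organization loop is left out:

```python
# Sketch for idea 8: embed documents with a pre-trained language model and
# cluster them into semantic groups that a user could then label and refine.
import glob
from sentence_transformers import SentenceTransformer   # pip install sentence-transformers
from sklearn.cluster import KMeans

paths = sorted(glob.glob("docs/*.txt"))                  # placeholder document folder
texts = [open(p, encoding="utf-8").read() for p in paths]

model = SentenceTransformer("all-MiniLM-L6-v2")          # small pre-trained embedder
embeddings = model.encode(texts)                         # one vector per document

clusters = KMeans(n_clusters=5, random_state=0).fit_predict(embeddings)
for path, c in zip(paths, clusters):
    print(c, path)                                       # starting point for manual labels
```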