CS470 Project #2: "Featured Artist"

  • Stanford Vision and Learning Lab

  • Stanford University

Phase One: Extract, Classify, Validate

All possible combinations of the 8 features (centroid, flux, rms, mfcc, rolloff, zerox, chroma, kurtosis) were evaluated through cross-validation. The average classification accuracy for each of the 256 feature combinations is shown in the figure below (best viewed zoomed in).

[Figure: average cross-validation classification accuracy for all 256 feature combinations]
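The exhaustive search over feature subsets can be sketched roughly as follows. This is an illustrative Python toy, not the course's actual extraction or validation code: the feature names, data, and the use of leave-one-out 1-NN are all assumptions for demonstration.

```python
import itertools
import math
import random

def loo_1nn_accuracy(X, y, dims):
    """Leave-one-out accuracy of a 1-NN classifier using only the
    feature columns listed in `dims`."""
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = float("inf"), None
        for j in range(len(X)):
            if i == j:
                continue
            d = math.dist([X[i][k] for k in dims], [X[j][k] for k in dims])
            if d < best_dist:
                best_dist, best_label = d, y[j]
        correct += (best_label == y[i])
    return correct / len(X)

def best_feature_subset(X, y, names):
    """Cross-validate every non-empty subset of features; return the
    best (accuracy, feature names) pair and the number of subsets tried."""
    results = []
    for r in range(1, len(names) + 1):
        for combo in itertools.combinations(range(len(names)), r):
            acc = loo_1nn_accuracy(X, y, combo)
            results.append((acc, [names[k] for k in combo]))
    return max(results), len(results)

# Toy data standing in for extracted audio features (3 features, 2 classes).
random.seed(0)
names = ["centroid", "flux", "rms"]
X = [[random.gauss(c, 1.0) for _ in names] for c in (0, 0, 0, 2, 2, 2)]
y = [0, 0, 0, 1, 1, 1]
(best_acc, best_feats), n_combos = best_feature_subset(X, y, names)
print(n_combos, best_acc, best_feats)
```

With the 8 features above, this loop would evaluate 255 non-empty subsets (256 if the empty set is counted, as in the figure).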

From the figure, we can observe that:

  • The combination (Flux, RMS, MFCC, Rolloff, Kurtosis), with 17-dimensional features, gives the best classification accuracy of ~0.43. Feel free to use this finding in your projects!
  • Runners-up are (Centroid, Flux, RMS, MFCC, Kurtosis) and (Centroid, Flux, RMS, MFCC, Rolloff, Kurtosis).
  • The most effective features overall are: Centroid, Flux, RMS, MFCC, Rolloff, and Kurtosis.
  • MFCC on its own already achieves ~0.36 classification accuracy, making it the single most effective feature.
  • ZeroX is the least effective feature. Don't use ZeroX.
Phase Two: Designing an Audio Mosaic Tool

    Want to find out the similarities between PSY (singer of "Gangnam Style") and Steve Jobs (former CEO and co-founder of Apple)?
    Features & Instructions:

  • Run chuck our-osc.ck:jobs.txt:gangnam.txt to load features and start the transmission process. Meanwhile, run phase2.pde via Processing to start the game UI.
  • Left-clicking on the screen plays a piece of query audio.
  • Use the 'A' and 'D' keys to move the character left or right.
  • There are three doors, each linked to a video snippet; when the character is close to a door, its video plays automatically.
  • The goal of the program is to find the video whose audio is most similar to the query audio.
  • Go find the best-matching video, and press 'SPACE' at its door.
  • If you are correct, you will score one point. If you are wrong, you will lose one point.
  • After each try, the query audio as well as the videos will be refreshed.
  • Have fun :)
Phase Three: Make a Musical Mosaic!

    Creative statement:
    To make this form of interaction more creative, I will polish the prototype into a more complete game. It will present a new way of making a musical mosaic and a new mode of human/music interaction. More specifically, the sampling algorithm may be redesigned, and the difficulty will change once a player scores a certain number of points. The entire visual representation will be significantly polished to be as 'artful' as possible.



    Features & Instructions:

  • Run chuck our-osc.ck:jobs.txt:feifei.txt:ge.txt:gangnam.txt to load features and start the transmission process. Meanwhile, run phase3.pde via Processing to start the game UI. (remember to set the path to videos properly)
  • Left-clicking on the screen plays a piece of query audio.
  • Use the 'A' and 'D' keys to move the character left or right.
  • There are 10 flag-separated regions; a video plays automatically when the character is within a region.
  • The goal of the program is to find the video whose audio is most similar to the query audio.
  • Go find the best-matching video, and press 'SPACE' within its region.
  • If you are correct, you will move to the next level!
  • After each try, the query audio as well as the videos will be refreshed.
  • Have fun :)
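The mapping from the character's position to one of the 10 regions could be done roughly as follows. This is a hypothetical sketch, not the actual phase3.pde code; the function name and the screen width of 800 are assumptions.

```python
def region_index(x, screen_width, n_regions=10):
    """Map a character's x coordinate to one of n equal-width regions,
    clamping at the screen edges."""
    if screen_width <= 0:
        raise ValueError("screen_width must be positive")
    idx = int(x / (screen_width / n_regions))
    return max(0, min(n_regions - 1, idx))

print(region_index(0, 800))    # leftmost edge  -> region 0
print(region_index(799, 800))  # rightmost edge -> region 9
print(region_index(400, 800))  # middle         -> region 5
```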
Reflections: It is meaningful and fun to dive deep into musical mosaics. The task has two major components: feature extraction and feature matching. For feature extraction, although I have studied many combinations of features derived from FFT audio spectra, it is still hard to claim which combination is the best to use. Also, most of the features are specialised for musical audio analysis and may not be appropriate for describing spoken language, so I wonder whether automatic feature engineering might be a better approach. As for feature matching, although KNN is a common approach for similarity-based feature retrieval (with good properties), it is not tolerant to noise or large variance.
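The noise sensitivity of nearest-neighbor matching mentioned above can be shown with a toy sketch. The 1-D "feature" values and the "jobs"/"gangnam" labels below are made up for illustration; the point is that with k=1 a single noisy sample flips the retrieved source, while a majority vote over k=5 neighbors recovers the right answer.

```python
from collections import Counter

def knn_predict(points, labels, query, k):
    """Classify `query` by majority vote over its k nearest neighbors
    (1-D Euclidean distance, for simplicity)."""
    ranked = sorted(range(len(points)), key=lambda i: abs(points[i] - query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy feature values: "jobs" clusters near 0, "gangnam" near 10,
# with one noisy sample at 0.5 wrongly carrying the "gangnam" label.
points = [0.0, 0.2, 0.4, 0.6, 0.5, 10.0, 10.2, 10.4]
labels = ["jobs", "jobs", "jobs", "jobs", "gangnam",
          "gangnam", "gangnam", "gangnam"]

query = 0.48
print(knn_predict(points, labels, query, k=1))  # "gangnam": the noisy neighbor wins
print(knn_predict(points, labels, query, k=5))  # "jobs": the majority vote recovers
```

Averaging over more neighbors trades some precision for robustness, which is one simple mitigation for the variance problem noted above.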

    Besides these reflections on the core of the musical mosaic, I also acknowledge some limitations of my current implementation. The biggest limitation is its inability to support an unbounded number of levels with varying difficulties: in the current implementation, three levels are pre-defined with fixed difficulties. Ideally, users should have access to a custom setting where they can change the game's difficulty and navigate to any level they want. Another limitation is the oscP5 communication mechanism. There is a way to transmit information from ChucK to Processing, but I still cannot find a way to do the same in the other direction. As a result, most of the information has to be sent from ChucK to Processing repeatedly in an infinite loop, which is extremely inefficient and should be improved in the future.

    In general, I like the concept of creating a music mosaic through KNN-based feature matching to find coherent music pieces from different sources, and I enjoyed this project a lot. By extracting and analyzing specific features such as tempo, rhythm, pitch, and timbre from various songs, KNN can identify and cluster similar musical segments. This process enables the generation of a music mosaic, where segments from different tracks are seamlessly stitched together based on their similarities, becoming a new representation of music. This technique not only showcases the potential for innovative audio creations but also highlights the power of AI/ML algorithms in understanding and manipulating complex patterns within music. The ability to merge diverse musical elements into a cohesive whole opens up new avenues for creative expression and exploration within the music industry, offering listeners a unique auditory journey through familiar yet distinctly new soundscapes.

    Acknowledgements

    Both the ChucK and Processing scripts borrow heavily from the sample code provided in this course. The official Processing documentation was of great help. Tiange thanks Ge Wang and Andrew for their help.