d_Jai - DJing as an AI astronaut

Nick Shaheed

nshaheed@ccrma.stanford.edu

d_Jai is a machine learning-powered audiovisual DJing tool that reinterprets DJing's core technique of mixing as the combination and manipulation of musical features instead of waveforms. Trained on 9 hours of downtempo house music, the model driving d_Jai breaks down each deck into its salient components and embeds them in a high-dimensional intermediary representation called latent space. This tool provides a richly visual and multimodal means of interacting with this complex latent space, enabling a new way of DJing that lets you combine features of three separate tracks, smoothly manipulate style, and more!

Using a custom ChuGin to render models trained with the Realtime Audio Variational autoEncoder (RAVE) framework, this tool generates audio in real time and offers several different interfaces for manipulating latent space in unique and rich ways.

Usage

d_Jai has five components: the latent space visualizer, the interpolation grid, the style sphere, noise, and the exaggeration slider.

The Latent Space Visualizer shows the current point in latent space, which is an 8-dimensional vector. This vector is encoded from the incoming audio data, manipulated using d_Jai, and then decoded back into audio. As you interact with d_Jai, you'll see the shape of the latent space morph depending on what you are doing.

The Interpolation Grid is your main mode of interaction. It allows you to interpolate between three different tracks. As you move the dot through the square, different amounts of the three tracks will be incorporated into the output. You change this balance dynamically to mix tracks.
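As a rough illustration of how a dot position could turn into a three-track mix, here is a minimal sketch. It assumes each deck is pinned to a point in a unit square and weights the decks by inverse distance to the dot; the anchor positions, the weighting scheme, and the names `ANCHORS`, `blend_weights`, and `blend_latents` are all hypothetical and not taken from d_Jai itself.

```python
import math

# Hypothetical anchor positions for the three decks inside a unit square.
ANCHORS = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]


def blend_weights(x, y, anchors=ANCHORS, eps=1e-9):
    """Return one weight per deck (summing to 1) for a dot at (x, y).

    Assumption: inverse-distance weighting, so the closer the dot is
    to a deck's anchor, the more that deck contributes.
    """
    dists = [math.hypot(x - ax, y - ay) for ax, ay in anchors]
    # If the dot sits exactly on an anchor, that deck gets all the weight.
    for i, d in enumerate(dists):
        if d < eps:
            return [1.0 if j == i else 0.0 for j in range(len(anchors))]
    inverse = [1.0 / d for d in dists]
    total = sum(inverse)
    return [w / total for w in inverse]


def blend_latents(latents, weights):
    """Weighted sum of equal-length latent vectors, one vector per deck."""
    size = len(latents[0])
    return [sum(w * z[k] for w, z in zip(weights, latents)) for k in range(size)]
```

With 8-dimensional latents, `blend_latents([z1, z2, z3], blend_weights(x, y))` would give the single vector handed to the decoder; any other mapping that yields three weights summing to 1 would slot in the same way.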

The Style Sphere lets you navigate style! Hold left shift and move the mouse to navigate the 3D sphere, and use the scroll wheel to change the magnitude of the change. This shifts the style of the output by adding values to the latent space.

The Exaggeration Slider lets you control intensity. This slider multiplies the latent values by a scalar. At the center the scalar is 1 and there is no change. Moving left takes it toward zero, where the audio qualities are subdued. Moving right exaggerates the latent points, flying off into far-off distances in latent space.
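The slider behaviour can be sketched in a few lines. This is only an illustration: the normalized slider coordinate in [0, 1] and the function names are assumptions, and the upper bound of 2 follows the range given in the Milestone 2 notes rather than anything stated here.

```python
def exaggeration_factor(slider):
    """Map a slider position in [0, 1] to a scalar in [0, 2].

    Assumed mapping: 0.0 (far left) -> 0, 0.5 (center) -> 1, 1.0 (far right) -> 2.
    """
    slider = min(1.0, max(0.0, slider))  # clamp out-of-range positions
    return 2.0 * slider


def exaggerate(latent, slider):
    """Scale every latent value by the factor for this slider position."""
    factor = exaggeration_factor(slider)
    return [factor * value for value in latent]
```

At the center the latent vector passes through unchanged, which matches the "no change" behaviour described above.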

Noise! Hold n for noise!

Starting d_Jai

This tool is for Windows.

After downloading d_Jai, you will need to launch two programs: DJai.exe in the root directory and a ChucK program (to generate the audio) in the DJai_release/ directory.

To start ChucK, open PowerShell and cd to ./ChucK/chugins/. Then run the following command:

../chuck.exe --srate44100 --chugin:./rave.chug ../../DJai_Data/StreamingAssets/dj.ck

You should now be hearing sound and can start manipulating latent space!

Downloads

Release

Unity Project

Attribution

Thank you Angela for the sonic curation in the process of making this.

The tracks used:

Rey & Kjavik - Baba City (Rkadash Version)

Feathered Sun - Hmm hm hmmm

Budakid - Matahari

Milestone 3

Lots of changes for this milestone. This week's work mostly focused on the visuals: what are the different ways to interface with the latent space on a computer screen?

Milestone 2

The main workflow here is to take two tracks, break them down into their latent values, and then manipulate these latent values to do mixing and DJing! This milestone has a very basic working example with two types of manipulation: interpolation & exaggeration!

Interpolation refers to interpolating between points in latent space: The model takes an audio input and converts it to latent space, essentially a set of sliders for different features of the model. If you have two tracks, you can interpolate between these features, essentially crossfading their features instead of their audio.
Exaggeration is multiplying the latent values by a scalar: The scalar goes from 0 to 2, so the latent features are exaggerated when the slider is to the right and minimized when it's to the left.
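The two-track interpolation described above amounts to a linear crossfade applied to latent values rather than samples. The sketch below is a hedged illustration, not the actual d_Jai code: the function name `crossfade_latents` and the crossfader position `t` in [0, 1] are assumptions.

```python
def crossfade_latents(z_a, z_b, t):
    """Linearly interpolate between two latent vectors.

    t = 0.0 returns track A's latent values, t = 1.0 returns track B's,
    and values in between mix the two feature by feature.
    """
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same length")
    t = min(1.0, max(0.0, t))  # clamp the crossfader position
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]
```

The interpolated vector is what would then be decoded back into audio, so the crossfade happens on features instead of on the waveforms themselves.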

The visual interface is very bare-bones right now. Most of my work for this milestone was on the backend:

I've been making a chugin for using the RAVE VAE. This needed to be updated to include features such as being able to change its input/output structure (i.e. audio to audio, audio to latent space, latent space to audio). Everything is basically running, but there's still a lot of work to do on this. In particular, there is a bug with the returned latent space values that's resulting in reduced audio quality.
Training a model on ~9 hours of curated downtempo house (thanks Angela for the setlist!)