In /usr/ccrma/media/databases/hiphop-gene/ are the following files:
artists.json: A list of each artist in the dataset. Rather than being extracted from tags in the mp3 files, artist names are hand-entered via categorize.py to ensure correct normalization.
compressed/: Loosely organized directory of mp3/m4a/etc. files for the base data set. New data set examples go in here, to be sorted out by categorize.py.
genres.json: List of each possible genre in the dataset. Handwritten and used by categorize.py for manual genre entry.
meta.json: The main catalogue of metadata associated with each WAV file. Currently includes genre and artist(s) info, in addition to the file paths of the compressed and WAV versions of the audio data.
wav/: Directory of uncompressed audio data files. Automatically populated by decompress.py.
decompress.py: Convert files in compressed/ to WAV format and place them in wav/.
Add any new artists in meta.json to artists.json (normally not necessary as categorize.py should do this automatically).
categorize.py: Search for new files in compressed/ and request genre and artist information for each. Stores the results in meta.json.
Export meta.json data into a format convenient for use in MATLAB. Write file paths to files.dat and genre + artist info to meta.dat. Each row in these files is one training example. Column 1 of meta.dat is the genre (an index into the list of genres in genres.json) and the subsequent columns indicate the presence or absence of a particular artist on that song (where column N is the N+1-th artist in artists.json).
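The export layout described above can be sketched roughly as follows. This is only an illustration, not the actual export script: the structure assumed for meta.json (a list of entries with "wav", "genre" and "artists" keys), the 1-based genre index, the space-separated output, and all sample names are assumptions made for the example.

```python
def build_rows(meta, genres, artists):
    """Build one files.dat row and one meta.dat row per training example.

    Assumed (hypothetical) entry layout:
        {"wav": <path>, "genre": <name>, "artists": [<name>, ...]}
    """
    # Assumption: genre indices are 1-based, which is convenient in MATLAB.
    genre_index = {name: i + 1 for i, name in enumerate(genres)}
    file_rows, meta_rows = [], []
    for entry in meta:
        present = set(entry["artists"])
        # One 0/1 indicator per artist, in the order given by the artist list.
        indicators = [1 if name in present else 0 for name in artists]
        file_rows.append(entry["wav"])
        meta_rows.append([genre_index[entry["genre"]]] + indicators)
    return file_rows, meta_rows


def format_meta(meta_rows):
    """Render rows as space-separated text, one training example per line."""
    return "\n".join(" ".join(str(v) for v in row) for row in meta_rows)


# Hypothetical sample data standing in for genres.json, artists.json, meta.json.
genres = ["boom bap", "trap"]
artists = ["Artist A", "Artist B", "Artist C"]
meta = [
    {"wav": "wav/song1.wav", "genre": "trap", "artists": ["Artist A", "Artist C"]},
    {"wav": "wav/song2.wav", "genre": "boom bap", "artists": ["Artist B"]},
]

file_rows, meta_rows = build_rows(meta, genres, artists)
print(format_meta(meta_rows))
```

In this sketch, row i of the file list and row i of the metadata table describe the same song, so the two outputs can be read side by side.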