Audio Diary

Final Project for Music 256b


Roy Fejgin

CCRMA, Stanford University




Motivation and Vision


This project was created to enable people to easily document the audio in their lives, and to explore the sounds others choose to share with them. It is envisioned as an application that runs continuously, perhaps all day, requiring very little interaction by the user. The application keeps a snapshot of recent audio from the microphone, so whenever the user realizes something interesting has just happened, they still have access to it and can save it. A tapping-based auditory interface allows starting and stopping the recording without even unlocking the phone.

The recorded audio clips can be saved to a server "in the cloud", along with information about them such as location, user ID, time of day, etc. Later on, the user can use the app to explore those clips, and clips created by others.


Interaction Design

The application uses a tab bar to represent its four modes of use: recording, uploading, searching, and browsing. The following sections describe those modes in detail.


The recording interface allows easy capturing of audio events, even after they have occurred. To achieve that, the phone constantly monitors the microphone input and stores in memory a "window" containing recent audio (currently set to one minute). Audio that is older than the window size is discarded.


When the user notices an interesting audio event, they can tap on the microphone a number of times, which causes the app to switch to 'recording' mode. The buffered window of recent audio is saved to disk, as is all subsequent audio. Later, the user can stop the recording with another tapping sequence. Every time a tapping gesture is detected, a short audio cue is played to give the user feedback.





The recording process results in the creation of a collection of audio clips saved on the user's phone. The upload tab consists of a list of audio clips, sorted by their recording date and time. Touching a clip causes it to be uploaded to the server, from which it can later be retrieved or shared with others.



The search tab is for exploring audio clips residing on the server. It includes two search methods: by recording time (most recent first), and by distance from the current location (not yet implemented). Pressing one of the search buttons sends a query to the server. When the response arrives, the view automatically changes to the browse tab.






The browse tab shows a list of clips available on the server. Touching a clip causes it to be downloaded and played.



Software Architecture

Client Side

View Controllers:

Four view controllers were implemented, one for each tab. The controllers communicate using NSNotifications. The Record and Upload controllers also share a ClipManager object, which maintains a data structure representing the recorded files and their metadata.


Audio recording:

To implement the continuous monitoring and recording functionality, the AudioManager maintains two circular buffers in a double-buffering arrangement. While merely monitoring, only one buffer is used.
The circular buffer data structure is well suited to the monitoring application since the overwriting of old audio occurs naturally.

Once the user initiates recording to disk, the AudioManager switches to the double-buffering scheme: when a buffer fills up (or is nearly full), recording switches to the other buffer, and a new thread is spawned to write the full buffer to disk without delaying the audio thread.
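The fill/swap/flush cycle described above can be sketched as follows. This is a simplified illustration, not the app's actual code: names are invented, samples arrive one at a time rather than via an audio callback, and the raw-float file format stands in for whatever the app really writes.

```cpp
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

// Two fixed-size buffers: the audio thread fills one while a background
// thread flushes the other to disk, so disk I/O never blocks audio.
class DoubleBufferWriter {
public:
    DoubleBufferWriter(size_t bufSize, const char* path)
        : active_(0), fill_(0), file_(std::fopen(path, "wb")) {
        buffers_[0].resize(bufSize);
        buffers_[1].resize(bufSize);
    }

    ~DoubleBufferWriter() {
        if (writer_.joinable()) writer_.join();
        if (file_) std::fclose(file_);
    }

    // Called from the audio thread with each new sample.
    void push(float sample) {
        std::vector<float>& buf = buffers_[active_];
        buf[fill_++] = sample;
        if (fill_ == buf.size()) {          // buffer full: swap and flush
            int full = active_;
            active_ = 1 - active_;
            fill_ = 0;
            if (writer_.joinable()) writer_.join();  // prior flush done
            writer_ = std::thread([this, full]() {
                std::fwrite(buffers_[full].data(), sizeof(float),
                            buffers_[full].size(), file_);
            });
        }
    }

private:
    std::vector<float> buffers_[2];
    int active_;
    size_t fill_;
    std::FILE* file_;
    std::thread writer_;
};
```

The key design point is the same as in the report: the audio thread only copies samples and swaps indices; the potentially slow `fwrite` happens on the spawned thread.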


Tap Analysis:

Tap detection makes use of an STK one-pole filter to do amplitude envelope tracking. When the filter's output exceeds a threshold, a tap is detected. Further tap detections are suppressed for 40 milliseconds in order to avoid duplicate detections.
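A self-contained sketch of this envelope-follower scheme is shown below. The one-pole lowpass here stands in for the STK filter the app actually uses, and the class name, pole value, and threshold are illustrative assumptions; the refractory period corresponds to the 40 ms suppression described above (converted to samples at the audio rate).

```cpp
#include <cmath>

// Envelope-follower tap detector: a one-pole lowpass smooths the
// rectified signal; a tap is reported when the envelope crosses a
// threshold, and further detections are suppressed for a refractory
// period to avoid duplicates.
class TapDetector {
public:
    TapDetector(float pole, float threshold, int refractorySamples)
        : pole_(pole), threshold_(threshold),
          refractory_(refractorySamples), env_(0.0f), holdoff_(0) {}

    // Process one sample; returns true if a tap is detected.
    bool tick(float sample) {
        // One-pole lowpass of the absolute value = amplitude envelope.
        env_ = pole_ * env_ + (1.0f - pole_) * std::fabs(sample);
        if (holdoff_ > 0) { --holdoff_; return false; }
        if (env_ > threshold_) {
            holdoff_ = refractory_;   // suppress duplicate detections
            return true;
        }
        return false;
    }

private:
    float pole_, threshold_;
    int refractory_;
    float env_;
    int holdoff_;
};
```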


Gesture detection is currently very simple. A gesture is any sequence of four taps. The gesture is interpreted as 'start' or 'stop' based on the state of the system.


Server Side

The server side makes use of a few PHP scripts backed by a MySQL database. The database tables and most of the PHP are closely based on examples shown to us in the tutorial.


Vision for Future Versions

There are a few improvements I would like to implement in future versions:

  • Add more tap gestures. The gestures would be based on tapping a short rhythm. To detect the gestures while allowing for variations in tempo and timing, it would probably be useful to examine the ratios of the inter-tap times to the total gesture time. We could then calculate the Euclidean distance between those ratio vectors and the ratio vectors representing the various rhythms, and select the rhythm whose distance from the actual taps is smallest. Thanks to Joachim Ganesman for this idea!
  • Additional ways to search for audio clips: by user, by time-and-location, by audio features.
  • Interesting interfaces for browsing clips. For example, using a map to show clips recorded nearby.
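The tempo-invariant rhythm matching proposed in the first bullet could be sketched as follows. This is only an illustration of the idea, not implemented code: the function names and template rhythms are invented for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Normalize inter-tap intervals by the total gesture duration, giving
// a ratio vector that is invariant to the overall tapping tempo.
std::vector<double> ratios(const std::vector<double>& tapTimes) {
    std::vector<double> r;
    double total = tapTimes.back() - tapTimes.front();
    for (size_t i = 1; i < tapTimes.size(); ++i)
        r.push_back((tapTimes[i] - tapTimes[i - 1]) / total);
    return r;
}

// Euclidean distance between two equal-length ratio vectors.
double dist(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i)
        s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// Return the index of the template rhythm closest to the performed
// gesture (templates must have as many intervals as the gesture).
size_t classify(const std::vector<double>& taps,
                const std::vector<std::vector<double>>& templates) {
    std::vector<double> r = ratios(taps);
    size_t best = 0;
    double bestDist = dist(r, templates[0]);
    for (size_t i = 1; i < templates.size(); ++i) {
        double d = dist(r, templates[i]);
        if (d < bestDist) { bestDist = d; best = i; }
    }
    return best;
}
```

Because the intervals are normalized before comparison, tapping the same rhythm twice as fast produces the same ratio vector and still matches the same template.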



This project uses the MOMU API, the Synthesis Toolkit (STK), and ASIHttpRequest.



The project is available here.