My final project for Music 270B was an exploration of neural networks and their use in the field of music, mostly for the purpose of giving me a better understanding of how they worked. Keep reading for a description of what I did, taken from my original project write-up.
For many years I have heard the term “neural networks” thrown around in discussions about machine learning, but its meaning remained a mystery to me. I would like to have a better understanding of a number of machine learning–related topics, including neural networks, support vector machines, and genetic programming, so I jumped at the opportunity to explore neural networks in 270B. The primary goal of my project was to explore neural networks beyond the theoretical and conceptual terms in which Andy Clark’s Mindware discussed them and to gain a better understanding of how they are actually implemented and used. The implementation side appeals to my engineering nature, and the usage side appeals to my musical nature, making this an ideal way for me to approach such a complex topic.
Neural networks are arranged in an attempt to simulate the way that signals (representing information) are transmitted from one neuron in the human brain to another based on the strength of the connections between the neurons. A neural network therefore consists of one or more neuron “units” and connections between those units. Each connection has an associated weight representing the strength of the connection, and each neuron-unit may have any number of connections leading to it and/or from it. Neuron-units receive information signals from other units in the form of numbers. Each neuron-unit sums all of the inputs it receives, passes that sum through a limiting function, and then passes its output down each of the connections leading away from it. The next neuron-unit in the chain receives that output value scaled by the weight of the connection between them. Neuron-units representing the input of the network are, not surprisingly, called “input units”. In a basic feed-forward network, input units do not receive input from other neuron-units, but in networks containing feedback such a scenario is quite possible. Similarly, there are “output units” representing the output of the network, which can act as the source of feedback in such a network. The somewhat mysterious part of a network, however (mysterious to me, at least), is the presence of “hidden units”, which lie between the input and output units and affect how the input signal is modified before it reaches the output. Another initially mysterious point for me was that hidden units and output units can also receive input from biases, which are essentially input units that always output the value “1”. As far as I can tell, the weight on a bias connection lets a unit shift the threshold of its limiting function, which is apparently quite useful!
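In code, a single neuron-unit boils down to exactly this recipe: a weighted sum of the inputs, plus the bias weight times its constant input of 1, passed through a limiting function. Here is a minimal sketch in Java – the names are my own, not from any particular library, and I use the sigmoid as the limiting function:

```java
// Minimal sketch of a single neuron-unit (illustrative class and field
// names): sum the weighted inputs, add the bias contribution, and pass
// the result through a limiting function.
public class NeuronUnit {
    double[] weights;  // one weight per incoming connection
    double biasWeight; // weight on the bias input, which is always 1

    NeuronUnit(double[] weights, double biasWeight) {
        this.weights = weights;
        this.biasWeight = biasWeight;
    }

    // Sigmoid limiting function: squashes any sum into the range (0, 1).
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    double output(double[] inputs) {
        double sum = biasWeight * 1.0; // the bias shifts the unit's threshold
        for (int i = 0; i < inputs.length; i++) {
            sum += weights[i] * inputs[i];
        }
        return sigmoid(sum);
    }
}
```

Varying biasWeight slides the sigmoid left or right along its input axis, which is the threshold-shifting role the bias units play.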
Training a neural network is a repetitive process that can be extremely time-consuming. During each training iteration, sample inputs (“training examples”) are fed to the network, and the weights connecting the various neuron-units are adjusted based on the error between the network’s computed output and the expected or desired “correct” output for a given sample. Ideally, each training iteration results in a small improvement in the network’s overall ability to correctly process each training example, but the level of success often depends on a certain element of luck. In particular, the randomly selected initial connection weights can have a significant impact on how quickly the network converges on a set of weights providing minimum error (or whether it converges on the correct output values at all!).
I began my exploration of neural nets by looking through various engineering-oriented texts on machine learning from the library. Those books proved (for the most part) to be too advanced for my purposes – they contained proofs of why various learning techniques work, or why some problems can never be solved with certain kinds of networks, etc., but they weren’t helping me understand the actual order of computations a network goes through as it learns, or even just as it processes an input to produce an output. I then turned my attention to the book Music and Connectionism, which is primarily a collection of articles from two special issues of the Computer Music Journal released in 1989. Having been written for an audience with little or no experience in neural networks, these articles were a very helpful introduction to the field. In fact, the article by Mark Dolson titled “Machine Tongues XII: Neural Networks” reads very much like a high-level tutorial, so my main project focus shifted to following this tutorial and attempting to reproduce Dolson’s results.
The first network Dolson describes is not related to a musical topic at all, but it gives a general sense of how neural networks function. To make sure that I understood what he was saying, I attempted to implement this network in Java. The network is a perceptron – the most basic kind of neural network, containing only a single neuron – and this perceptron’s task is to determine whether one number (represented by one input) is at least twice as large as another number (represented by a second input). Using only what Dolson had written proved not to be quite enough information to allow me to implement a real perceptron – I was having trouble understanding the learning process and exactly how the connection weights were handled. I found a neural network tutorial website which fortunately filled in all of those gaps for me. I successfully trained my perceptron to solve the logical OR, AND, and NOT functions as described in the tutorial website, and I was subsequently able to train it to complete Dolson’s perceptron task. Interestingly enough, even this simple task of implementing Dolson’s “one number at least twice as large as another” perceptron pointed out an important aspect of neural networks: there isn’t always only one right answer! This perceptron had two inputs (x1 and x2) and therefore two weights (w1 and w2). The output of the network is f(w1*x1 + w2*x2), where f(x)=1 for x>=0 and f(x)=-1 for x<0. As long as w2=-2*w1 (and w1 is positive), this network will give the correct output, and in my implementation the weights tended to converge to w1=0.11 and w2=-0.22 instead of the most obvious (to a human, at least) answer of w1=1 and w2=-2. I suspect that this ability to find perfectly valid solutions which are not the most obvious ones to a human is part of the appeal of neural networks in the first place.
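A perceptron with this structure and the classic perceptron learning rule can be sketched as follows. To be clear, the learning rate, input ranges, and random seed here are illustrative choices of mine, not values from Dolson or from my original code:

```java
import java.util.Random;

// Sketch of the two-input perceptron: output f(w1*x1 + w2*x2), trained
// with the perceptron rule (nudge each weight by rate * error * input).
// Learning rate and training ranges are illustrative assumptions.
public class TwicePerceptron {
    double w1, w2;
    static final double RATE = 0.01; // assumed learning rate

    TwicePerceptron(Random rng) {
        // Small random starting weights, as described in the write-up.
        w1 = rng.nextDouble() - 0.5;
        w2 = rng.nextDouble() - 0.5;
    }

    // Hard-limiting function: f(x) = 1 for x >= 0, -1 for x < 0.
    static int f(double x) {
        return x >= 0 ? 1 : -1;
    }

    int output(double x1, double x2) {
        return f(w1 * x1 + w2 * x2);
    }

    // One training step: error is 0 (no change), +2, or -2.
    void train(double x1, double x2, int target) {
        int error = target - output(x1, x2);
        w1 += RATE * error * x1;
        w2 += RATE * error * x2;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        TwicePerceptron p = new TwicePerceptron(rng);
        // Target: 1 if x1 is at least twice x2, else -1.
        for (int iter = 0; iter < 10000; iter++) {
            double x1 = rng.nextDouble() * 10 - 5;
            double x2 = rng.nextDouble() * 10 - 5;
            p.train(x1, x2, x1 >= 2 * x2 ? 1 : -1);
        }
        System.out.printf("w1=%.3f w2=%.3f%n", p.w1, p.w2);
    }
}
```

Running this repeatedly shows the “many right answers” point: the final w1 and w2 vary with the seed, but their ratio settles near -2.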
After I felt like I had a fairly good grasp of how perceptrons worked, I moved on in Dolson’s tutorial, hoping to use a pre-existing library of neural network code to replicate some of the music-related experiments he had performed. My intention was to be able to reproduce some of the more complicated network structures used without having to write all of the necessary code myself. I investigated the Java-based Joone (Java Object Oriented Neural Engine) library and was initially excited to find that the creators had included a GUI tool for graphically building, training, and testing many kinds of networks. Following the Joone documentation, I successfully created a feed-forward network with the Joone GUI that solved the XOR problem (an important problem because, unlike OR, AND, and NOT, it cannot be solved by a single perceptron). I hoped that I could then modify that network to reproduce Dolson’s experiments, but I was unable to make it work. I don’t know if I had simply set some parameter incorrectly or if my network was fundamentally built incorrectly (Joone had many undocumented parameters which probably would have made sense to someone with neural network experience but which just served to confuse and complicate things for me). Either way, it didn’t work.
At this point, I was somewhat inspired by the success I’d had with my perceptron code, so I left Joone behind and began implementing my own more advanced neural network code. Fortunately, I had finally (after several weeks’ wait) been able to get a copy of a helpful, more introductory machine learning textbook from the library, and I found that it contained a very nice description of the back-propagation learning algorithm that is commonly used in feed-forward neural networks. I also came across a website describing some object-oriented neural network code, which served as an inspiration for the OOP design of my own feed-forward network.
The goal of the feed-forward network rhythm classification experiment was to train a network to learn the difference between two categories of rhythms (Dolson describes this as teaching the network to identify rhythms we like vs. rhythms we don’t like). The experiment is set up as follows: There are eight inputs to the network, each representing a beat of time (for example, think of eight eighth-notes making up a measure of 4/4 time). Each beat can be either an attack (represented by an input value of “1”) or a rest (represented by an input value of “-1”). There is one output from the network, which takes on the value “1” if the network decides that an input rhythm is “good” and “0” if it decides that the rhythm is bad. Dolson describes using “1” and “-1” as the two possible output values, but I chose “1” and “0” instead because the sigmoid limiting function I was using to compute the output of my neuron-units gives output values in the range (0, 1) and not (-1, 1). The criteria for a “good” rhythm vs. a “bad” rhythm could be anything, but for the purposes of his experiment Dolson sets the rule that a “good” rhythm must have an attack on beat one and either attacks on both beat two (the third eighth note) and beat four (the seventh eighth note) or rests on both of those beats. Because there are eight binary inputs to the network, there are 2^8=256 possible distinct input rhythms. To train the network, all 256 input rhythms can be used, or just a subset of them. Theoretically, it is not necessary to train the network using all possible inputs in order to guarantee a 100% success rate; in fact, Dolson apparently achieved success with only 40 training examples (12 “good” and 28 “bad”).
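Dolson’s “good rhythm” rule is easy to state in code. Here is a sketch, using zero-based beat indices (my convention, not Dolson’s), which also enumerates all 256 rhythms – a natural way to generate the training set:

```java
// The "good rhythm" rule from Dolson's feed-forward experiment. A rhythm
// is eight beats, each 1 (attack) or -1 (rest); with zero-based indices,
// beat two of the measure (the third eighth note) is index 2 and beat
// four (the seventh eighth note) is index 6.
public class RhythmRule {
    static boolean isGood(int[] r) {
        // Good: attack on beat one, and beats two and four agree
        // (both attacks or both rests).
        return r[0] == 1 && r[2] == r[6];
    }

    // Enumerate all 2^8 = 256 rhythms and count the good ones,
    // e.g. when building training examples.
    static int countGood() {
        int good = 0;
        for (int bits = 0; bits < 256; bits++) {
            int[] r = new int[8];
            for (int i = 0; i < 8; i++) {
                r[i] = ((bits >> i) & 1) == 1 ? 1 : -1;
            }
            if (isGood(r)) good++;
        }
        return good;
    }
}
```

Counting shows that exactly 64 of the 256 rhythms are “good” under this rule: the attack on beat one halves the space, and the beats-two-and-four agreement halves it again.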
Because I had no idea what criteria should be used to select the optimal combination of good and bad training examples in order to train the network with as few samples as possible, I simply chose random subsets of the 256 possible input rhythms and fed those to the network. I occasionally achieved 100% success using 128 random inputs, but quite often, instead of converging to the desired 0 vs. 1 output values, my network converged to 0 and 0.5 instead. Also, even when using all possible inputs as training examples, and with a million training iterations (where each iteration fed each of the training examples to the network once), my network didn’t always converge in the desired fashion. Perhaps my criterion for correct vs. incorrect output needs to be relaxed: I only considered an answer correct if it was less than 0.1 away from the target, but this was a somewhat arbitrary threshold, and it may be legitimate to say that anything under 0.5 is a correct answer when the target is zero, while anything over 0.5 is a correct answer when the target is one. This would be an interesting area for future exploration.
There are multiple limitations to the problem of rhythm classification when viewed in the context of a purely feed-forward network. One is the restriction that the network must always take the same number of inputs. For example, what if instead of a measure of 4/4 we suddenly encountered a measure of 3/4? The network would not be able to accept this 3/4 measure unless we treated it as a measure of 4/4 and padded the inputs with some kind of value for the seventh and eighth notes. Of course, even that doesn’t make much sense, because our rule for “good” vs. “bad” rhythms dealt explicitly with 4/4 measures in the first place. Another restriction of the feed-forward network is the lack of any kind of history or memory in the network. At any given time, we are limited to looking only at the current rhythm input, and there is no means to compare it to past inputs. Consider, for example, the scenario where we wanted the network to say that a rhythm was good only when it was different from the rhythm that preceded it. A feed-forward network cannot learn such a task, and this is where recurrent networks become useful. A recurrent network adds the dimension of time to a feed-forward network by incorporating feedback. Feedback can consist of storing past outputs of the network to be used as future inputs, but it can also consist of storing the past values of the network’s hidden units. Supposedly, the same back-propagation learning algorithm that is used in feed-forward networks can also be applied to recurrent networks, but Dolson used a different learning algorithm, called RTRL (Real-Time Recurrent Learning). Unfortunately, he doesn’t describe this algorithm in any amount of useful detail, so instead I used the basic back-propagation learning algorithm that I had already implemented for the feed-forward network. From there, turning my feed-forward network into a recurrent one was very straightforward.
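The order of computation in such a hidden-unit feedback loop can be sketched like this. This is an Elman-style arrangement of my own devising – the weights are just random placeholders and no learning is shown – but it illustrates how last step’s hidden values become extra inputs on the next step:

```java
import java.util.Random;

// Sketch of hidden-unit feedback: at each time step, the hidden units
// see the current inputs plus the hidden values from the previous step.
// Weights are random placeholders; no learning algorithm is included.
public class RecurrentStep {
    final int nHidden;
    final double[][] inWeights; // [hidden][input]
    final double[][] fbWeights; // [hidden][hidden] feedback connections
    final double[] outWeights;  // a single output unit
    double[] prevHidden;        // the network's "memory"

    RecurrentStep(int nInputs, int nHidden, Random rng) {
        this.nHidden = nHidden;
        inWeights = randomMatrix(nHidden, nInputs, rng);
        fbWeights = randomMatrix(nHidden, nHidden, rng);
        outWeights = randomMatrix(1, nHidden, rng)[0];
        prevHidden = new double[nHidden]; // history starts at all zeros
    }

    static double[][] randomMatrix(int rows, int cols, Random rng) {
        double[][] m = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                m[i][j] = rng.nextDouble() - 0.5;
        return m;
    }

    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // One time step: compute new hidden values from the current inputs
    // and the stored hidden values, then compute the output.
    double step(double[] inputs) {
        double[] hidden = new double[nHidden];
        for (int h = 0; h < nHidden; h++) {
            double sum = 0;
            for (int i = 0; i < inputs.length; i++)
                sum += inWeights[h][i] * inputs[i];
            for (int j = 0; j < nHidden; j++)
                sum += fbWeights[h][j] * prevHidden[j];
            hidden[h] = sigmoid(sum);
        }
        prevHidden = hidden; // remember for the next time step
        double out = 0;
        for (int h = 0; h < nHidden; h++)
            out += outWeights[h] * hidden[h];
        return sigmoid(out);
    }
}
```

The only structural change from a feed-forward pass is the `prevHidden` array, which is exactly why converting my feed-forward code was so painless.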
Next, I attempted to re-create Dolson’s rhythm experiment that used a recurrent network. In this example, there is only one current input to the network at a given time, but Dolson uses four hidden units (instead of the two he uses in the feed-forward network) and adds connections from those hidden units to four additional inputs which serve as a feedback loop. As a result, at any given time, the network receives one new input (one beat of the rhythm) and four inputs representing the values of the hidden units the last time the network was run. In this new, feedback-based experiment, an input value of “1” still indicated an attack and “-1” still indicated a rest. The output values for “good” and “bad” were also the same as in the feed-forward case, but the rules for what made a rhythm good or bad changed: any rhythm with more than 3 consecutive attacks or more than 5 consecutive rests was bad, and everything else was good. Once those rules were defined, I had to figure out what data to send the network to train it. I settled on sending a random string of “1”s and “-1”s and informing the network at each step whether the history it had received up to that point constituted a good rhythm or a bad one. I sent the network a very long training string (thousands of beats long), and then I tested it with a string 100 beats long to see how well it did. The network occasionally missed certain patterns (such as mistakenly saying that (-1, 1, -1, -1, -1) was bad, when in fact it was good because the “1” in the middle meant that there were only three consecutive rests rather than five). However, the results were generally good (especially considering that my network was configured slightly differently from the one Dolson used and I used a presumably less effective learning algorithm).
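The good/bad rule for this experiment can be written as a simple scan over the beats seen so far – this is a sketch of the labeling function used to generate the training targets (the class and method names are mine):

```java
// The rule from the recurrent experiment: a rhythm history is bad once
// it contains more than 3 consecutive attacks or more than 5 consecutive
// rests; anything else is good. Beats: 1 = attack, -1 = rest.
public class StreamRule {
    static boolean isGood(int[] beats) {
        int attacks = 0, rests = 0;
        for (int b : beats) {
            if (b == 1) { attacks++; rests = 0; }
            else        { rests++;  attacks = 0; }
            if (attacks > 3 || rests > 5) return false;
        }
        return true;
    }
}
```

Note that this labels (-1, 1, -1, -1, -1) as good – the pattern my network occasionally got wrong – since the attack resets the run of consecutive rests.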
One thing that wasn’t clear to me while working with the recurrent network was the best way to feed it training examples. For example, would it be better to feed it a limited-length string (say, 100 or 1000 beats), then reset the network’s history and repeat the process? Or was it better to do what I did and never reset the network’s history? I’m left wondering because it seems that the network could get into some funky history state where it was completely confused, in which case starting fresh would get it back on track. It would be interesting to see which approach to training is more effective. I’m sure there are also plenty of other considerations with recurrent networks which I essentially ignored while running my “proof of concept” experiments, but it was exciting to see the network succeed reasonably well despite the fact that I kept as many things as possible the same from the feed-forward network to the recurrent one.
One of the supposed benefits of working with neural networks is the ability to solve challenging problems without having to find a symbolic representation for all aspects of the problem ahead of time. This benefit also happens to be the source of one of my biggest frustrations with neural networks: they are practically impossible to debug in the “traditional” programming sense. When your network fails to give the desired output, or doesn’t seem to be learning the way it was intended to, you can’t just open up the network, look at the weights of all of the connections, and use that data to track down the source of your problem. The weights are too abstract a representation of the solution to be “read” the way I would read more symbolic forms of representation in a computer program. Perhaps those who spend significant time working with neural networks eventually develop an intuition for where the source of a problem might lie, but someone like me who is new to the field is pretty much out of luck. I experienced this challenge first-hand while implementing my primitive perceptron and my more powerful feed-forward network. When I didn’t get the results I expected, I had very little way to tell whether I had a bug in my program, or the random weights chosen to seed the network were just “unlucky”, or I had configured my network incorrectly (with the wrong number of hidden units or biases, etc.). It was very frustrating!
Another challenge that comes up very quickly when playing with neural networks is simply that it can take a very long time to train a network well. The larger the network and the more connections it has, the longer it takes to learn a new problem. I wasted more time than I’d like to admit worrying that I had a bug in my program because my network’s output values were not converging to the desired values, when in fact I simply had not allowed the learning process to repeat enough times for the error to become small. Almost the inverse of that situation occurred as well: I told the network to stop learning when its mean-squared error went below a certain value, let it run all night, and it never converged to a reasonable level of error. I then trained the network again (with a new random set of starting weights), and it converged in a matter of minutes. In other words, sometimes you get lucky with your starting weights and sometimes you don’t, and when you don’t, it’s hard to know for sure that “bad” starting weights were the source of your problem. I suppose I’m the kind of person who prefers more reliable, consistent, and predictable tools to work with (although perhaps with more sophisticated implementations, neural networks can be all of those things).
Having now implemented several neural networks along with the well-known back-propagation learning algorithm, I have a much better understanding of how neural networks function and what their strengths and weaknesses are. I hadn’t originally intended to do so much of my own programming, but if I had not implemented these networks on my own (and gone through the process of debugging them!), I probably would not understand them as well as I now do. Of course, all of my playing around has only scratched the surface of what neural networks are capable of doing, and I made no attempt to optimize my code in any significant way, so there is a lot of room for improvement if I decide to pursue this further. Right now, I’m not convinced that I like working with neural networks enough to continue playing with them in the near future, but in case the opportunity arises, here are some paths I would like to explore:
In his article on algorithmic composition using neural networks, Peter Todd advocates for the benefits of using multiple interconnected networks to solve a single problem. I would be very interested in learning how I could build such a multi-level network.
Marvin Minsky and Seymour Papert, in the prologue to the “Expanded Edition” of their book Perceptrons, emphasize that many of the problems of working with neural networks – and the reason for the field’s lack of significant progress over a fairly large number of years – stem from the difficulty of determining good representations for the data that neural networks are intended to act upon. When it comes to working with melodies in neural networks, it would be interesting to learn how effective absolute representations of pitch are vs. relative representations. Relative representations would be useful when it’s the shape of the melody that matters, while absolute representations are useful if one cares about distinguishing between similarly-shaped melodies in different keys.
This is something I had hoped to better understand by the time I completed this project, but I didn’t have time to investigate it.
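As a tiny illustration of the absolute-vs.-relative distinction (my own sketch, using MIDI note numbers): a transposed melody changes every absolute pitch but leaves the relative representation – the sequence of intervals between consecutive notes – untouched.

```java
// Convert an absolute pitch representation (e.g. MIDI note numbers)
// into a relative one: the interval, in semitones, from each note to
// the next. Two melodies that are transpositions of each other have
// different absolute pitches but identical intervals.
public class PitchRepresentation {
    static int[] toIntervals(int[] midiPitches) {
        int[] intervals = new int[midiPitches.length - 1];
        for (int i = 1; i < midiPitches.length; i++)
            intervals[i - 1] = midiPitches[i] - midiPitches[i - 1];
        return intervals;
    }
}
```

A network fed intervals would treat the two transpositions as the same melody; a network fed absolute pitches would see two different ones.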
Joone’s GUI seemed like a good idea, but it didn’t work well for me. I don’t know that I need anything quite as fancy as what it promises, but it would be nice to be able to easily tweak experiment parameters and view results in some sort of user-friendly GUI.
The code I wrote for this project was intended almost entirely for my own experimental purposes and not for others to use or even necessarily read. However, it could easily be extended and cleaned up to allow easier experimentation. In the interest of not having “dirty” code lying around, I’ll probably do this housecleaning to some extent no matter what.