Daniel Steele

220c Final Project




A Perceptual Study of Sound Annoyance


Annoyance is clearly something that needs to be measured, because so many aspects of our lives are dictated by the perceived annoyance of a given source. The rest of this project focuses specifically on the annoyance brought on by sound stimuli, but the analogy to haptic sensation will often be useful. Surprisingly, measures to change annoyance aren't always directed at reducing it. The next few examples demonstrate how annoyance is addressed in some real-world situations.


1) Property values: It is common for property values to decrease with increasing proximity to airports, highways, and other sources deemed annoying. Physical acousticians have various measurement standards for addressing this, such as dB60 and dB(half-day), in which the time integral of the sound power level is taken. These measurements average the sound over time, which is useful for sound fields that tend to have many high peaks (Rosenberg, 2005).



2) Work productivity: It is well understood that work productivity decreases with increasing distraction. An understanding of this principle led to a corporate workplace revolution with the help of companies like Muzak, whose goal was to introduce sounds into the workplace that would desensitize workers to the usual office distractions (Thompson, 2002). In fact, the change in work productivity is so apparent that some researchers use elapsed-time measurements under various sound conditions to measure how annoyed subjects are (e.g., Chafe, Berger, et al., 2006).



3) Choosing a ringtone: In an interview I conducted with Jean-Claude Risset (2007), he told me a story about some of the early research at Bell Labs on the standard telephone ringtone. In early tests, subjects were reluctant to answer their telephones because they found the ring quite pleasant and did not want to interrupt the sound. As a result, people were missing important communications. Significant effort went into determining that a dissonant bell tone in an on-off-on-off pattern was effective at alerting people to an incoming communication.



Hypothesis: Factors in Determining Annoyance

It is generally understood that noise spectral power is directly correlated with annoyance. Previous studies (Miedema, 2004, etc.) have assumed that noise level is a strict gauge of annoyance, but it is reasonable to suggest that the influence of level is more subtle; annoyance is, after all, a subjective and highly personal characteristic. One study shows that subjects perceive annoyance differently based on their ability to influence it (Maris, Stallen, Vermunt, and Steensma, 2007). My claim is that loudness defines the general trend for annoyance, but that other factors can alter that judgment significantly. Consider, for example, a noise that serves as a feature in a composition, giving the listener a sense of envelopment; imagine then that this same noise is softened and used as a masker. Given a choice, a subject might well rate the sound they perceive as a masker as more annoying than the one they perceive as a piece of art, even if the masker is significantly quieter than the composition. I also propose a new algorithm for testing annoyance and make efforts to demonstrate its effectiveness.



Measurements for Annoyance


In my research, I found two methods for measuring annoyance. The first, as discussed previously, is to set up a task and observe how the subject's productivity changes under varying conditions. This test is very good for obtaining data that the subject does not consciously influence, because the subject is not answering questions about their perception. The time changes can be put on an absolute scale, and comparing values across tests is easy. For a good implementation of this procedure, see Chafe, Berger, et al. (2006). Unfortunately, this test might not target annoyance specifically, if annoyance is indeed the variable under scrutiny. For example, pleasantness can also slow the progress of a task, so sounds that are pleasant and sounds that are annoying will lie in the same data range, and separating such results would be difficult both graphically and semantically.

Another common method for annoyance testing is the use of a rating scale. Various scales I have seen measure annoyance from 0 to 100: subjects are prompted for an annoyance value, and their responses are averaged with those of other subjects. Averages above 72 count as highly annoyed (%HA), above 50 as annoyed (%A), and above 28 as a little annoyed (%LA) (Schultz, 1978; EC/DG Environment, 2002b). There were also some less intuitive scales, including one that spanned zero. One major advantage of testing with a scale is that the results measure annoyance directly, because that is the quantity being prompted from the subject. The disadvantages, however, far outweigh the benefits. First, the response is not obtained by forced choice; the subject is prompted for a subjective value and therefore has no mechanism for ensuring accuracy or minimizing competitiveness (I should find a resource on why competitiveness is a problem). Second, a poor choice of scale can be detrimental to the procedure: a scale that is too complicated, or one that includes inconceivable quantities, can confuse the subject. For instance, a study with negative, zero, and positive values for a measure like pain or annoyance would be confusing. Lastly, scales that are too fine lose meaning in their inner regions: how does one distinguish something that is 61% annoying from something that is 62% annoying?
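As a small illustration of the scale method, the category assignment from an averaged 0-100 rating might look like the sketch below (the function name and the "not annoyed" label are mine; the cutoffs are the ones cited above):

```python
def annoyance_category(mean_score):
    """Map an averaged 0-100 annoyance rating to the standard
    categories: >72 highly annoyed, >50 annoyed, >28 a little annoyed
    (Schultz, 1978; EC/DG Environment, 2002b)."""
    if mean_score > 72:
        return "highly annoyed (%HA)"
    if mean_score > 50:
        return "annoyed (%A)"
    if mean_score > 28:
        return "a little annoyed (%LA)"
    return "not annoyed"
```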

In order to reduce the risks involved in the methods above, I decided to try a ranking system based on forced-choice results; the method is described in detail in the next section. The advantages of a forced-choice ranking system include low-error feedback from subjects and a specific task that is easy to convey. The final results should also converge on accurate values after a significant number of iterations. The downside of the ranking system is that the experiment could be asking the subject to perform a task that does not make sense: if there is actually no preference, then the subject could be giving arbitrary responses. I will return to this in the Discussion section. Finally, the ranking system does not put the sounds on an absolute scale; the annoyance of one sound can only be compared to that of another.



Methods


The experiment was set up with a 2I-2AFC (two-interval, two-alternative forced-choice) protocol. For each subject, a 4x4 zero matrix is established, with noise level along one dimension and stimulus type along the other. The experimental algorithm chooses two random elements from the matrix and presents the corresponding stimuli to the subject, who has the option to repeat the pair once. When the subject chooses one stimulus as more annoying than the other, the algorithm adds one point to the more annoying element and subtracts one point from the less annoying element. After a significant number of trials, the preference scores in the matrix should approach a limit, at which point the credibility of the hypothesis can be assessed. I assume a fundamental principle, the transitive property (transitivity): if A > B and B > C, then A > C. If this is not true, then more than just the results of this experiment are at stake.
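A minimal sketch of this scoring procedure follows (function and parameter names are mine, and the default "subject" is a hypothetical loudness-only judge standing in for a human listener):

```python
import random

def run_ranking_experiment(n_types=4, n_levels=4, n_trials=100,
                           choose_more_annoying=None, rng=random):
    """Forced-choice ranking: draw two random cells from the
    type-by-level matrix, ask which is more annoying, then add one
    point to the winner and subtract one from the loser."""
    if choose_more_annoying is None:
        # Hypothetical stand-in subject: the louder stimulus
        # (higher level index) is always judged more annoying.
        choose_more_annoying = lambda a, b: a if a[1] > b[1] else b
    scores = [[0] * n_levels for _ in range(n_types)]
    cells = [(t, l) for t in range(n_types) for l in range(n_levels)]
    for _ in range(n_trials):
        a, b = rng.sample(cells, 2)        # 2I-2AFC: two stimuli per trial
        winner = choose_more_annoying(a, b)
        loser = b if winner == a else a
        scores[winner[0]][winner[1]] += 1  # more annoying: +1
        scores[loser[0]][loser[1]] -= 1    # less annoying: -1
    return scores
```

Every trial gives and takes exactly one point, so the matrix always sums to zero, which is a useful invariant to check against the recorded data.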

There were four types of stimuli presented to the subject. First, pink noise. Second, a narrow tone cluster of five random-frequency, random-phase sinusoids spanning approximately one critical band (<200 Hz) centered at 1500 Hz. Third, a wide tone cluster of 40 random-frequency, random-phase sinusoids over a 1000 Hz range, centered at 1500 Hz. Fourth, a narrow-band noise, bandwidth 1000 Hz, centered at 1500 Hz, made from filtered Gaussian white noise. Each was presented at 50, 60, 70, and 80 dB SPL. I also used a dB meter to obtain dBA values for the sounds, in case that would be enlightening. Here are the values:







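The tone-cluster and filtered-noise stimuli described above can be sketched in code. The sample rate, duration, normalization, and brick-wall FFT filtering below are my assumptions, not the original synthesis procedure, and pink noise is omitted for brevity:

```python
import numpy as np

FS = 44100  # sample rate in Hz (assumed)

def tone_cluster(n_tones, bandwidth_hz, center_hz=1500.0, dur_s=1.0, rng=None):
    """Sum of random-frequency, random-phase sinusoids spread uniformly
    over `bandwidth_hz` around `center_hz` (narrow cluster: n_tones=5,
    bandwidth < 200 Hz; wide cluster: n_tones=40, bandwidth = 1000 Hz)."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(int(dur_s * FS)) / FS
    freqs = rng.uniform(center_hz - bandwidth_hz / 2,
                        center_hz + bandwidth_hz / 2, n_tones)
    phases = rng.uniform(0, 2 * np.pi, n_tones)
    x = np.sin(2 * np.pi * freqs[:, None] * t + phases[:, None]).sum(axis=0)
    return x / np.max(np.abs(x))  # normalize; playback level is set separately

def narrowband_noise(bandwidth_hz=1000.0, center_hz=1500.0, dur_s=1.0, rng=None):
    """Gaussian white noise band-limited around `center_hz` by zeroing
    FFT bins outside the band (a simple brick-wall filter)."""
    rng = np.random.default_rng() if rng is None else rng
    n = int(dur_s * FS)
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, 1 / FS)
    band = (freqs > center_hz - bandwidth_hz / 2) & \
           (freqs < center_hz + bandwidth_hz / 2)
    spectrum[~band] = 0.0
    x = np.fft.irfft(spectrum, n)
    return x / np.max(np.abs(x))
```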
Results


The following were the raw data obtained from each of seven subjects after performing 100 iterations of the forced-choice algorithm:


Subject 1

-16 -6 -3 11

-13 -6 3 3

-3 2 6 11

-4 0 12 3

Subject 2

-11 -6 -4 13

-10 -4 1 11

-5 2 8 13

-7 -7 3 3


Subject 3

-17 -6 -6 1

-6 -2 9 13

-1 8 8 9

-7 -9 1 5




Subject 4

-13 -4 0 15

-6 -2 3 13

-3 0 4 9

-6 -11 -2 3


Subject 5

-12 -4 -7 5

-6 -6 7 15

-3 2 6 5

-7 3 -3 5

Subject 6

-17 -6 -4 9

-4 -2 5 13

-1 2 4 13

-7 -7 -3 5


Subject 7

-17 -4 -2 17

-10 -4 3 9

-3 2 2 13

-7 -5 3 3


In general, subjects gave verbal feedback indicating that they tended to choose the louder noise as the more annoying one. It is important to note that, though loudness was their main criterion, every subject had at least one discrepancy from a pure-loudness ordering (the numbers in the matrices should be strictly increasing from left to right). Some of these discrepancies are surprisingly large.

As it turns out, the discrepancies in the data can be attributed to personal differences, a welcome and expected result for a subjective test. The best evidence that personal difference is the dominant factor lies in the sum of all of the matrices, shown below: in the sum matrix there are no discrepancies, and the trend of increasing annoyance with increasing sound level is preserved from left to right.


Sum Matrix (all subjects)

-103 -36 -26 71

-55 -26 31 77

-19 18 38 73

-45 -36 11 27
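The discrepancy count and the matrix summation above can be computed mechanically. A small sketch (function names are mine; a "discrepancy" is any row entry that fails to increase strictly left to right):

```python
def count_discrepancies(matrix):
    """Number of adjacent left-to-right pairs that violate the
    'louder is more annoying' ordering (each row should be
    strictly increasing)."""
    return sum(1
               for row in matrix
               for a, b in zip(row, row[1:])
               if a >= b)

def sum_matrices(matrices):
    """Element-wise sum of the per-subject preference matrices."""
    n_rows, n_cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) for j in range(n_cols)]
            for i in range(n_rows)]
```

Running `count_discrepancies` on each subject matrix and on the sum matrix reproduces the observation that every subject has at least one discrepancy while the sum has none.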



Discussion


Returning to the original hypothesis, it is time to compare the relative loudness of each sound to its ranking. Both the ranking matrix and the relative dBA matrix of the stimuli are presented below; the dBA matrix has been normalized to facilitate discussion.


Ranking (increasing loudness -->)

(Pink Noise) 1 4 6 14

(Narrow Cluster) 2 6 12 16

(Wide Cluster) 8 10 13 15

(Narrow-band) 3 4 9 11


Relative dBA

(Pink Noise) 4 14 23 33

(Narrow Cluster) 2 12 23 33

(Wide Cluster) 2 12 23 33

(Narrow-band) 0 10 20 29


The first important distinction pertaining to the original hypothesis is illuminated by the four corners of these matrices. The quietest pink noise is a substantial 4 dB louder (according to the dBA meter) than the quietest narrow-band noise, yet the quiet pink noise was selected as a good deal less annoying. The loudest pink noise, however, was judged significantly more annoying than the loudest narrow-band noise, while still being only 4 dB louder. This switch is a strong indication that there are effects more subtle than loudness influencing annoyance. I will not speculate on exactly what these effects are, as I am sure it is a delicate topic; besides, this is science, and we need to ask yes-or-no questions of our data.

It is also worth pointing out that the quietest wide tone cluster is significantly more annoying than the other types of sounds at the same level. The fact that the wide cluster is fairly consistently ranked as more annoying than sounds of similar levels leads to the conclusion that sounds in this "family" are capable of inducing more annoyance than sounds of some other types (note: this does not imply that it is anywhere near the most annoying type of sound possible). I suspect beating is to blame for this phenomenon, but that will need more investigation.

The fact that each subject showed some discrepancies while the sum matrix showed none is extremely relevant. The data also proved to have some surprising consistencies. For example, every subject determined that the quietest pink noise was the least annoying stimulus. This is interesting because it is not the quietest sound on the dBA scale; in fact, it is the loudest in its column, yet every subject was consistent in acknowledging its low annoyance.

Next, there were two tied scores in the sum matrix, meaning that one quarter of the stimuli were not given a unique ranking. The relevance of this point is subtle, but the ties demonstrate that a no-preference outcome was preserved even though the trials were all two-alternative forced choice. Rather than having failed to approach a unique ranking, some noises were identified as indistinguishably annoying from one another. The analog of this was described earlier in the section outlining the scaling method: it is difficult to imagine how an annoyance of 61 could differ from 62, but with ranking, the distinguishability of the annoyance of two sounds is irrelevant; the algorithm tends to group together sounds that are less than a just-noticeable difference (JND) apart. For an objective quantity, this loss of resolution would be a weakness of the ranking system but, since we are dealing with a subjective quantity, it is acceptable or even preferable.

Finally, one needs to question the consistency of this data. It is an understandable fear that the task being asked of the subject is nonsensical. In the case of personal-preference measures, nonsensical tasks are a real danger. It is conceivable that a subject, having been put to the task, would listen to a series of sounds and determine that none is actually more annoying than another; however, this would then reflect itself in the data. A subject who has no preference or does not understand the task will choose randomly between the two options when prompted for a response, which will eventually lead to a matrix full of zeros (or numbers close to zero).

At this point, I will introduce a term I call 'displacement,' defined as the sum of the absolute values of all elements in a matrix. The displacement of a matrix built from randomly chosen preferences will be low (each element performs a random walk about zero, so its expected score is zero, though the displacement itself will sit somewhat above zero). The maximum possible displacement equals the total number of points awarded plus the number of points taken during the experiment. I performed 100 trials per subject, so 100 points were given and 100 were taken, for a maximum of 200. A displacement of 200 should not be taken as the mark of perfect consistency, however, since reaching that number is impossible.

To see why, imagine a matrix with three elements ([0, 0, 0]) and a predetermined ranking of those elements ([1, 2, 3]). The number of trials needed to present every pair is 3 choose 2 (in general, n choose 2; see any introductory probability textbook), or 3, so 6 points are given and taken in total. Since the ranking is predetermined and every choice is perfectly consistent with it, the final matrix is [-2, 0, 2], for a displacement of 4 out of 6 possible points, a ratio of 2/3. A 4-element matrix behaves similarly: 4 choose 2 = 6 trials, 12 points, a final matrix of [-3, -1, 1, 3], and a displacement of 8 out of 12, again 2/3.

The 2/3 ratio does not, however, carry over to larger matrices. For n elements presented round-robin, the perfectly consistent final matrix is [-(n-1), -(n-3), ..., (n-1)], which gives a displacement-to-points ratio of (n+1)/2n for odd n and n/(2(n-1)) for even n; both happen to equal 2/3 at n = 3 and n = 4, but the ratio falls toward 1/2 as n grows. For the 16-element matrix used here, a full round-robin (120 trials, 240 points) would yield a displacement of 2(1 + 3 + ... + 15) = 128, a ratio of 8/15. Since my 100 trials sample the pairs uniformly, the expected score of each element is simply 100/120 of its round-robin value, so a perfectly consistent subject should show a displacement of at least (5/6)(128) ≈ 107, and in practice only somewhat more. Here are the displacement values for all 7 subjects:


S1 102

S2 108

S3 108

S4 94

S5 96

S6 102

S7 104


Comparing the observed values (mean 102) to the consistent-subject benchmark of roughly 107, the average subject in this experiment was on the order of 90-95% consistent, which is remarkable. This is a strong indication, first, that the task was clear and, more importantly, that a forced-choice ranking algorithm is a useful annoyance measure.

How do I resolve the fact that the no-preference property and the consistency property hold at the same time? It seems like having my cake and eating it too. The no-preference property requires that inconsistent choices be made, so that no stimulus advances over its non-preferred brethren, while the consistency property that held in this experiment (roughly 90%) requires that few such choices be made, since each one lowers the displacement of the matrix. I suspect I might find some help in a game theory textbook. The study of this conflict, and some other pressing questions, will direct my future work, elaborated in the next section.
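The displacement statistic and its benchmarks can be sanity-checked by simulating ideal subjects directly. The sketch below (names are mine; the "subjects" are hypothetical ideal responders, one perfectly consistent and one purely random) uses the same 16-element, 100-trial conditions as the experiment:

```python
import random

def displacement(matrix):
    """Sum of the absolute values of all matrix elements."""
    return sum(abs(v) for row in matrix for v in row)

def benchmark(n_items=16, n_trials=100, n_runs=2000, consistent=True, seed=0):
    """Monte-Carlo estimate of the expected displacement for an ideal
    subject: if `consistent`, items have a fixed annoyance order and the
    higher-ranked item always wins; otherwise the winner is random."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_runs):
        scores = [0] * n_items
        for _ in range(n_trials):
            a, b = rng.sample(range(n_items), 2)
            if consistent:
                win, lose = (a, b) if a > b else (b, a)  # larger index wins
            else:
                win, lose = (a, b) if rng.random() < 0.5 else (b, a)
            scores[win] += 1   # judged more annoying
            scores[lose] -= 1  # judged less annoying
        total += sum(abs(s) for s in scores)
    return total / n_runs
```

Under these conditions the consistent benchmark comes out between roughly 107 and 120, with the random responder near 45, so the observed per-subject displacements of 94-108 sit much closer to the consistent end of the range.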



Future Work


Musical training: how will it fit in? Will it make a difference? Some have mentioned that certain listeners (especially at CCRMA) strive to make noises and might feel less inclined to call them annoying. The issue of how to define musical training is explored in an article by Andrew Oxenham.

I will also be careful to address the issue of beating. I am aware that, especially in the case of the wide tone cluster, there was significant beating that almost certainly added to the perceived annoyance, since beating induces time variance and, therefore, a higher disturbance. I am only guessing about this time-variance claim for now, so I would like to find a resource on it soon.