Homework 1

Written deliverables are due by 9am on Friday, January 20. We will discuss the readings and datasets in class on January 20.

Part A: Readings

1. Read the following articles:

   1. (skim) Serra X et al. (2013). Roadmap for Music Information ReSearch. Geoffroy Peeters, ed. MIReS Consortium. [pdf]
   2. Bertin-Mahieux T, Ellis DPW, Whitman B, and Lamere P (2011). The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference. [pdf]
   3. Bryan NJ and Wang G (2011). Musical influence network analysis and rank of sample-based music. In Proceedings of the 12th International Society for Music Information Retrieval Conference. [pdf]
   4. Fell M and Sporleder C (2014). Lyrics-based analysis and classification of music. In Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers. [pdf]
   5. Moore JL, Joachims T, and Turnbull D (2014). Taste space versus the world: An embedding analysis of listening habits and geography. In Proceedings of the 15th International Society for Music Information Retrieval Conference. [pdf]
   6. Fuller J, Hubener L, Kim YS, and Lee JH (2016). Elucidating user behavior in music services through persona and gender. In Proceedings of the 17th International Society for Music Information Retrieval Conference. [pdf]

2. Choose two papers from readings 2–5 (be prepared to discuss all four in class) and provide written responses to the following questions for each:

Summarize the paper in one paragraph. State the motivation for the study (e.g., unsolved problem or open topic); the current approach; main finding(s); how the authors evaluated their results; and implications.
To the best of your ability, summarize the technical details of the data analysis.
What were the advantages of taking a data-driven approach to the topic of the paper? How critical were the size and scope of the dataset? Could the research objective have been achieved in a smaller-scale study? Could it be achieved in a controlled study with human subjects?
Are all shortcomings of the dataset and approach considered by the authors? Are there any other limitations to the approach or the data that are not stated explicitly in the paper? Are the conclusions contingent on any unstated assumptions?
Overall, do you feel the paper met its research objectives? Why or why not? Is there anything you would have done differently?

3. Provide written responses to the following questions for reading 6:

Summarize the paper in one paragraph. State the motivation for the study (e.g., unsolved problem or open topic); the current approach; main finding(s); how the authors evaluated their results; and implications.
In what ways is the approach taken here advantageous? What are the possible confounds?
Propose a more data-driven approach to addressing the authors' research question, using the definition of 'data-driven' from class. What would be some advantages and disadvantages of your proposed approach?

Part B: Dataset exploration

Everybody will choose two datasets to present to the class. Please choose from the preselected list of datasets; the link to the signup spreadsheet will be sent out via email and Canvas Announcements. The datasets include industrial and academic data, and span a variety of sizes, topics, and access modes. Please browse a few of the datasets before signing up so that your choices match both your interests and your technical expertise.

For most datasets, it should be sufficient to read the documentation and download the data (if it is small) or browse some of its web pages. If you sign up for an API, please try it out in order to evaluate its ease of use.

Note: Choosing a dataset for this assignment does not commit you to working with that dataset for your final project.

Be prepared to discuss the following points about your datasets in class:

Provide a one-sentence summary of the dataset.
General characteristics of the dataset. Where did the dataset come from? Is it academic or industrial? Is it already in usable form or does it need to be aggregated? Do the data need to be cleaned?
Size and format of the dataset. How large is it storage-wise? How many observations does it contain? What are the data attributes? What file format(s) are used?
Documentation. How well documented is the dataset? Is there a README, accompanying paper (see next question), informational materials, or code/results examples? How much is left to the user to figure out?
Literature. Is there a paper that accompanied the initial release of the dataset? If so, does the paper mainly describe a dataset intended for general use, or were the data published without an explicit objective of future use? Has the dataset led to subsequent papers (hint: look for citations of the dataset or its initial paper, if one exists)? If so, do the authors of these papers come from a variety of research groups, or are they always from the same group?
Propose two novel research questions that could be explored with this dataset.
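
For a dataset that is small enough to download, the size and format questions above can be answered with a short script. The following is only a minimal sketch under stated assumptions: it assumes the dataset has been unpacked into a local folder (the name my_dataset is a placeholder) and that at least one file is a UTF-8 CSV with a header row. The function names are hypothetical and are not part of any dataset's tooling.

```python
import csv
from collections import Counter
from pathlib import Path


def summarize_directory(root):
    """Return total size in bytes and a count of files per extension."""
    total_bytes = 0
    formats = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            total_bytes += path.stat().st_size
            # Files without an extension are grouped under "(none)".
            formats[path.suffix.lower() or "(none)"] += 1
    return total_bytes, dict(formats)


def summarize_csv(csv_path):
    """Return the column names and the number of data rows in a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # first row is assumed to be the header
        n_rows = sum(1 for _ in reader)
    return header, n_rows


if __name__ == "__main__":
    # "my_dataset" is a placeholder folder name, not a real dataset.
    if Path("my_dataset").is_dir():
        size, formats = summarize_directory("my_dataset")
        print(f"{size / 1e6:.1f} MB on disk; files by extension: {formats}")
```

Calling summarize_csv on one of the dataset's CSV files gives its attributes (column names) and its number of observations (rows). Datasets distributed in other formats, or accessed through an API, would need a different approach.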

Deliverables

Submit your written responses to questions 2 and 3 of Part A to Canvas by 9am, Friday January 20. If you will be missing class on January 20, submit written responses to the questions for Part B as well.

Music 364

Data-Driven Research in Music Cognition

CCRMA, Stanford University