The Sound of Music

Team Members

David Moon (dmoon8), Johnny Ren (jren22), Clara Guo (cguo21)

Introduction

Classification and Model Interpretability

After learning about interpretable CNN models in class, we were interested in applying one to a dataset where interpretable CNNs have not been used extensively. Specifically, we became interested in how such a model could be applied to spectrograms of sound files. We were motivated to explore this topic for two reasons: For one, we are interested in the parallels between image analysis and spectrogram audio analysis. For another, as music lovers, we find the idea of audio and genre classification interesting.

This paper outlines an interpretable CNN framework that we are looking to modify for our project. The paper, which was written for image classification, aims to develop CNNs whose filters identify human-interpretable areas of feature detection/activation. The key difference between the feature maps of a traditional CNN and the one outlined in this paper lies in how loss between filter layers is calculated. In the interpretable framework, the loss is modified so that filters are incentivized to learn more interpretable parameters, which leads to more discernible objects and features being identified in an image. An additional strength of this paper is that no extra annotations of object parts/text are needed, under the assumption that repetitive shapes in various regions represent low-level textures.

We chose this paper because it provides a fundamental method for encouraging interpretable behavior in CNNs, which we can adapt to spectrogram classification. Our project ultimately seeks to develop an interpretable CNN model for classifying music of different genres using the GTZAN dataset.

(Update for Week of 11/30) We have implemented many of the primary components of the original paper, including an effective CNN model and an attempted integration of the paper's custom loss. However, we ultimately decided to explore the work of a different paper as well. This paper outlines methods for applying class activation maps (CAMs) and class-selective relevance maps (CRMs) for model interpretability to medical data, which we are adapting to our spectrograms.

While the previous paper focuses on training to make model activations interpretable, this new paper focuses on modifying the model architecture so that filter activations will be interpretable after combining them through different weighted combinations.

Related Work

This link demonstrates that using pre-trained weights (on ImageNet images) results in higher accuracies on the GTZAN dataset. We hope to use this implementation of transfer learning when passing our mel-spectrograms through our Conv2D layers. While our primary focus will still be model interpretability (and not necessarily achieving the highest possible accuracy), we intend to reference this paper to boost model accuracy.

Public Implementations (Updated as more are found): This link, which includes the Kaggle notebooks associated with our dataset, describes a CNN architecture that is used for classification. However, it does not include any attempt at model interpretability, which is a specific focus of our model.

(Updated on 11/15) This link includes a link to a public implementation of the paper we are citing. We will most likely be referencing this implementation but will follow the project guidelines: a) we will be using a different framework, Keras (whereas the paper's implementation is written in TensorFlow 1), and b) we will be applying the model to a different dataset (spectrograms).

(Updated on 12/30) This repository, based on a paper on music classification, includes an implementation of deconvolution that we drew on for the deconvolution component of the deconvolve_and_interpret method in preprocess.py. Our implementation is written in TensorFlow, whereas the repository's is written entirely in NumPy. The only code we used from it is the filter manipulation that can be seen in get_deconvolve of auralise.py.

Data

The data we plan to use includes 1000 wav audio files of 30 seconds each. It is known as the GTZAN dataset and is fairly common for audio analysis. Preprocessing will include converting each audio file into a spectrogram and potentially splitting the audio files into shorter segments (3 seconds).
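The preprocessing described above can be sketched roughly as follows. This is a minimal, NumPy-only illustration rather than our actual pipeline: it computes a plain log-magnitude STFT instead of a mel-spectrogram, uses a synthetic tone in place of a loaded wav file, and the sample rate, n_fft, and hop values are assumptions chosen for the example.

```python
import numpy as np

def split_into_segments(signal, sample_rate, segment_seconds=3.0):
    """Cut a 1-D waveform into equal, non-overlapping segments, dropping any remainder."""
    seg_len = int(round(segment_seconds * sample_rate))
    n_segments = len(signal) // seg_len
    return signal[: n_segments * seg_len].reshape(n_segments, seg_len)

def log_spectrogram(segment, n_fft=1024, hop=512):
    """Log-magnitude STFT of one segment, shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(segment) - n_fft) // hop
    frames = np.stack(
        [segment[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft // 2 + 1)
    return np.log1p(magnitude).T                     # frequency on axis 0, time on axis 1

# Synthetic stand-in for one 30-second clip (a 440 Hz tone); no file is read here.
sample_rate = 22050  # assumed sample rate for this example
t = np.arange(30 * sample_rate) / sample_rate
clip = np.sin(2 * np.pi * 440.0 * t)

segments = split_into_segments(clip, sample_rate)           # ten 3-second segments
spectrograms = np.stack([log_spectrogram(s) for s in segments])
print(segments.shape, spectrograms.shape)
```

Splitting each 30-second clip into 3-second segments multiplies the number of examples by ten, at the cost of giving the model less context per example.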

Methodology

Tentatively, we are hoping to experiment with a 15-15-70 validation-test-training split. However, given that there are 10 genres of music with 100 30-second clips each, we may consider tuning these ratios to allow for more training data.
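One way to realize the 15-15-70 split is to split within each genre, so that every genre keeps the same proportions in all three sets. The sketch below assumes integer genre labels and a fixed random seed; the function name stratified_split is hypothetical and not taken from our code.

```python
import numpy as np

def stratified_split(labels, val_frac=0.15, test_frac=0.15, seed=0):
    """Return (train, val, test) index arrays with the same fractions inside every class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train, val, test = [], [], []
    for genre in np.unique(labels):
        # Shuffle the indices of this genre only, then slice them into three parts.
        idx = rng.permutation(np.flatnonzero(labels == genre))
        n_val = int(round(val_frac * len(idx)))
        n_test = int(round(test_frac * len(idx)))
        val.extend(idx[:n_val])
        test.extend(idx[n_val : n_val + n_test])
        train.extend(idx[n_val + n_test :])
    return np.array(train), np.array(val), np.array(test)

# 10 genres with 100 clips each, as in GTZAN.
labels = np.repeat(np.arange(10), 100)
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

If clips are later cut into shorter segments, applying this split at the clip level keeps segments of the same song from landing in both the training and evaluation sets.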

As for our specific model architecture, we have an iterative process in mind:

1. We will begin with a vanilla CNN architecture: Spectrogram -> Convolution Layer 1 -> Convolution Layer 2 -> Convolution Layer 3 -> Output classification
2. Next, we develop the modifications outlined in the paper, namely, adapting the loss so that our convolution filters are incentivized to learn meaningful features.
3. After verifying that our interpretability is acceptable, we improve accuracy:
   a. We will consider experimenting with standard ways of improving performance (e.g. changing the number of filters, filter sizes, and hyperparameters generally)
   b. We will also consider specific CNN architectures that have been successful (such as the one outlined in Related Work)
4. Iterate on our model design until we reach our desired accuracy threshold and have an interpretable model.
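To make the tensor shapes in the vanilla architecture concrete, here is a framework-free forward pass through three convolution layers and a classifier. It is a sketch only: the weights are random, the filter counts, 3x3 kernel size, 64x64 input, and the global average pooling before the output layer are all assumptions, and the real model is trained in Keras rather than written by hand.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_relu(x, kernels):
    """'Valid' 2-D convolution (cross-correlation) followed by ReLU.

    x: (H, W, C_in); kernels: (kh, kw, C_in, C_out) -> (H-kh+1, W-kw+1, C_out).
    """
    kh, kw = kernels.shape[:2]
    # windows has shape (H-kh+1, W-kw+1, C_in, kh, kw)
    windows = sliding_window_view(x, (kh, kw), axis=(0, 1))
    out = np.einsum("hwcij,ijco->hwo", windows, kernels)
    return np.maximum(out, 0.0)

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
n_genres = 10

# Randomly initialised parameters (assumed sizes, untrained).
k1 = rng.normal(scale=0.1, size=(3, 3, 1, 8))
k2 = rng.normal(scale=0.1, size=(3, 3, 8, 16))
k3 = rng.normal(scale=0.1, size=(3, 3, 16, 32))
dense_w = rng.normal(scale=0.1, size=(32, n_genres))

spectrogram = rng.random((64, 64, 1))  # stand-in for one single-channel spectrogram

h = conv_relu(spectrogram, k1)  # Convolution Layer 1 -> (62, 62, 8)
h = conv_relu(h, k2)            # Convolution Layer 2 -> (60, 60, 16)
h = conv_relu(h, k3)            # Convolution Layer 3 -> (58, 58, 32)
pooled = h.mean(axis=(0, 1))    # global average pooling -> (32,)
probs = softmax(pooled @ dense_w)  # output classification over 10 genres
print(h.shape, probs.shape)
```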

The hardest part of this project will most likely be adapting the modified loss function to our model. Additionally, the actual verification of interpretability (e.g. conducting qualitative/quantitative analyses of activated regions in our spectrograms) will likely be a cumbersome process. More broadly, balancing interpretability with performance could be tricky, since papers that have performed well on spectrogram analysis typically don't train for interpretability.

(Update for Week of 11/30) While we are experimenting with the second paper, our methodology will be fairly similar. We will be experimenting with progressively deeper and more complex CNNs to achieve high accuracy. However, with the CAM and CRM models, our model architecture is constrained by the fact that there can only be one dense layer at the end (in order to allow for model interpretability).
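The single-dense-layer constraint can be seen in a small sketch of how a class activation map is computed. It assumes the last convolution layer is followed by global average pooling and one bias-free dense layer. The feature maps and weights below are random placeholders, not outputs of our trained model.

```python
import numpy as np

def class_activation_map(feature_maps, dense_weights, class_index):
    """Weight each feature map by the dense weight that links it to one class.

    feature_maps: (H, W, K) output of the last convolution layer.
    dense_weights: (K, n_classes) weights of the single dense layer.
    Returns an (H, W) map of how much each location contributes to the class score.
    """
    return feature_maps @ dense_weights[:, class_index]

rng = np.random.default_rng(0)
feature_maps = rng.random((12, 16, 32))     # placeholder activations
dense_weights = rng.normal(size=(32, 10))   # placeholder weights for 10 genres

# Class scores of the model: global average pooling, then the dense layer.
logits = feature_maps.mean(axis=(0, 1)) @ dense_weights
predicted = int(np.argmax(logits))

cam = class_activation_map(feature_maps, dense_weights, predicted)
print(cam.shape, np.isclose(cam.mean(), logits[predicted]))
```

Because the class score is a linear function of the pooled feature maps, averaging the map over all locations recovers the score for that class exactly. A second dense layer with a nonlinearity in between would break this direct link between locations and class scores.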

Backup Ideas
(Note: Even though we are primarily implementing a paper, we have some backup modifications to make our adaptation easier). One concern with this project is that there may not be enough data to learn an effective model. As such, we are considering breaking up our training samples in different ways to boost the number of examples. Moreover, since the original data comes with 10 classes, we are also considering combining certain classes (similar genres) to make classification more feasible.

(Update for Week of 11/30) Having achieved fairly good accuracy with all 10 classes (~70 percent validation accuracy), we have not had to combine classes. However, we have explored a different form of interpretability as well due to some challenges implementing the loss for the original paper.

Metrics

Experiments
Our model seeks to be both a good classifier and interpretable. For our first experiment, we will measure accuracy in a standard format, namely through the usage of validation and test set data unseen during training. Accuracy is appropriate since our project is a classification problem. Our second experiment will involve the assessment of interpretability. Once the CNN is hitting sufficiently high accuracies, we plan on converting the feature maps back from spectrograms into wav files and seeing if we can decipher what audio makes a genre identifiable (e.g. what “rock” sounds like). Additionally, we plan on developing activation maps to qualitatively assess what causes convolution filter activation. If time permits, we are also hoping to compare the difference between our spectrogram CNN model and a more typical feedforward neural network that uses generated features from the spectrograms.
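For the accuracy experiment, a confusion matrix gives overall accuracy and also shows which genres are mistaken for one another. The sketch below uses a tiny made-up set of three classes and six predictions purely for illustration; these are not results from our model.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # count each (true, predicted) pair
    return cm

# Illustrative labels only (three classes, six held-out examples).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = np.trace(cm) / cm.sum()               # fraction of correct predictions
per_class_recall = np.diag(cm) / cm.sum(axis=1)  # accuracy within each true class
print(cm)
print(accuracy, per_class_recall)
```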

In the paper we are applying, accuracy is measured in the same way as for our model. For interpretability, the researchers use a parts interpretability measurement for images. Since this form of quantitative measurement is not available for our problem, as we are missing a "parts" dataset, our measurement of interpretability will be more exploratory. It could be valuable to generate summary statistics about spectrogram activation and see what features correspond to particular genres. And as stated before, qualitative analysis (e.g. generating sounds) could also be a useful, if somewhat imprecise, way of measuring interpretability.

(Update for Week of 11/30) For the CAM and CRM models, the experiments are much the same: test for acceptable accuracy and conduct qualitative analyses.

Goals
Base: get the CNN working to at least around 75% classification accuracy and create class activation maps
Target: Create a network that results in activation maps that make sense audibly. Improve accuracy (maybe to 85%)
Stretch: Compare network performance of CNN on spectrograms to fully connected network using song parameters or explore different strategies for model interpretability (global average pooling). If time permits, maybe create a demo of our model (choose new songs outside our training set and try to classify them).

Ethics

What broader societal issues are relevant to your chosen problem space? While the specific problem of music classification does not have major ramifications for society, the issue of model interpretability does. As deep learning models become more prevalent in our society, it will be more and more crucial for us to recognize how exactly they work (lest they become unknowingly dangerous). We hope to explore a couple questions concerning interpretability in our project. For one, how effective will our model be at making truly interpretable features for spectrograms? Additionally, how will a focus on interpretability affect performance? In answering these questions, we hope to further illuminate the prospect of applying interpretable models to more problems in the future.

Why is Deep Learning a good approach to this problem? Deep learning is a good approach to this problem because spectrograms, audio data, and genre classification in general have a level of complexity that is well suited to CNNs and neural networks. In other words, simpler models would likely be unable to achieve reasonable performance on our problem. The data was also obtained in an ethical manner and frankly does not contain any information that could unknowingly discriminate against particular demographics. That said, an obvious goal of our project is to increase our understanding of how CNN models think, which can hopefully be used to combat bias in other contexts.

Division of Labor

Our group has decided that the best way to divide labor fairly is to have regular meetings and assign responsibilities bit by bit.

For example, for our first meeting after the TA check-in, Clara will be in charge of setting up the WAV-to-spectrogram code, David will be in charge of setting up our environment, and Johnny will be in charge of providing a summary/guide for how to implement the initial steps of the paper. We plan to follow a similar process throughout the project and to work together over Zoom whenever possible.

Reflection 1 (11/23)

Here is the link to our first reflection.

Update (Week of 11/30)

After running into a couple of obstacles with our original paper implementation, we are also exploring an alternative form of model interpretability. The main challenge we faced was implementing the paper's custom loss function (which is written at the bottom of the model.py script on GitHub for the curious). The paper we referenced was missing many implementation details, especially about how the new gradient is calculated (the formula it gives is complex and very computationally expensive). As such, we have decided to explore a different method of model interpretability as well, namely CAM and CRM (the latter if time permits), and have updated the Devpost accordingly.