We implement the approach of the paper Music Genre Classification using RNN-LSTM, along with additional statistical- and metadata-processing components. The objective of the paper is to use an LSTM to classify the genre of music audio files more accurately than the standard convolutional neural network approach. In addition to raw audio features, however, we also train on track metadata. We both enjoy the satisfaction that comes with classification problems and are avid consumers of music, so implementing a network that solves a classification problem for music is a well-suited choice for our project. Furthermore, the idea of using a new approach, in particular an LSTM, to tackle the problem is an exciting challenge.
The model takes as input data from approximately 8,000 tracks extracted from the Free Music Archive (FMA) dataset. The dataset contains statistical, nominal, and signal feature data. Audio signal features such as MFCCs, mel spectrograms, chroma, and zero-crossing rate are fed into an LSTM. Statistical data, namely the number of listens, comments, and favorites per track, serve as inputs as well; these are passed through dense layers and combined with the LSTM output. Finally, the nominal data, including track and artist names, are fed into a character-level GRU layer, which builds a meaningful representation of the nominal data that our model can utilize.
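As a minimal sketch of how the nominal data can be prepared for the character-level GRU, the title and artist name are flattened into a fixed-length sequence of character indices. The vocabulary, padding scheme, and maximum length here are illustrative assumptions, not our exact configuration:

```python
import numpy as np

def encode_text(title, artist, max_len=64):
    """Encode title + artist as a padded sequence of character indices for the
    char-level GRU. ASCII codes serve as the vocabulary; max_len is illustrative."""
    text = f"{title} {artist}"[:max_len]
    ids = [min(ord(c), 127) for c in text]   # clamp non-ASCII to a single index
    ids += [0] * (max_len - len(ids))        # right-pad with 0 (blank)
    return np.array(ids, dtype=np.int32)
```

Each index is then looked up in a trainable embedding table before being consumed by the GRU.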
The signal feature data was extracted from raw .mp3 files using the Librosa Python library. Features were extracted once and stored in .npy files so that they persist across runs of the program. Statistical data associated with each .mp3 file in the FMA dataset is fetched and stored alongside the audio data. In total, our dataset contains 8,000 unique .mp3 tracks spanning eight musical genres: experimental, electronic, rock, instrumental, pop, folk, hip-hop, and international. Labels are encoded as one-hot vectors.
The overall composition of our model is as follows. The audio signal features are passed through two LSTM layers; LSTM layers suit this data because the features form a time sequence. To create an embedding for each track title and artist name, we break the nominal data into sequences of individual characters, embed each character, and feed the character embeddings through our character-level GRU, whose output serves as the representation of each title/name. The statistical data is passed through dense layers to produce a representation of each track's statistics. The three outputs are concatenated along axis 1 and passed through a final set of dense layers with leaky ReLU activations. Finally, a softmax layer yields the genre probabilities. The model is trained with a categorical cross-entropy loss and the Adam optimizer with a learning rate of 0.0001.
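The three-branch architecture can be sketched in Keras as below. The layer widths, sequence lengths, and embedding sizes are illustrative assumptions; only the overall wiring (two LSTMs, char-level GRU, dense statistics branch, axis-1 concatenation, leaky ReLU head, softmax, cross-entropy with Adam at 1e-4) follows our description:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N_FRAMES, N_FEATS = 128, 20   # illustrative: time steps x features per frame
VOCAB, MAX_CHARS = 128, 64    # illustrative: char vocabulary, padded text length
N_STATS, N_GENRES = 3, 8      # listens/comments/favorites; eight genres

# Audio branch: two stacked LSTM layers over the frame sequence
audio_in = layers.Input(shape=(N_FRAMES, N_FEATS))
a = layers.LSTM(128, return_sequences=True)(audio_in)
a = layers.LSTM(128)(a)

# Nominal branch: character embeddings fed through a GRU
text_in = layers.Input(shape=(MAX_CHARS,))
t = layers.Embedding(VOCAB, 32)(text_in)
t = layers.GRU(64)(t)

# Statistical branch: dense layers over listens/comments/favorites
stats_in = layers.Input(shape=(N_STATS,))
s = layers.Dense(32, activation="relu")(stats_in)

# Concatenate along axis 1, dense + leaky ReLU head, softmax output
h = layers.Concatenate(axis=1)([a, t, s])
h = layers.Dense(128)(h)
h = layers.LeakyReLU()(h)
out = layers.Dense(N_GENRES, activation="softmax")(h)

model = Model([audio_in, text_in, stats_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
```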
After running the model for various numbers of epochs and with different hyperparameters, we ultimately achieved a maximum accuracy of around 40%, well above the 12.5% random-guessing baseline for eight balanced classes, indicating that the model learns a substantial amount. However, this fell short of our target of around 60%. We speculate that the result could be pushed further with a larger dataset and more processing power. Interestingly, we found that combining the char2char textual metadata model with the audio signal feature LSTM did not improve accuracy but rather slightly diminished it: the model using only an RNN on audio signals reached a maximum accuracy of 45%, while in the combined model the loss also showed greater variance as it descended. We suspect that, given a larger dataset, our model could have built more distinguishable representations of each track from the statistical and nominal data; our conclusion, however, is that the current dataset does not provide enough meaningful data per track to improve the model's accuracy.
We encountered several challenges along the way. The first was parsing and formatting the dataset. We spent a substantial amount of time re-extracting features, reformatting the data, and removing ambiguous data points to get the dataset into a satisfactory form. The dataset posed another challenge later in the process, when we discovered that our model was failing to learn accurate genre predictions because the larger 25,000-track dataset was unbalanced and therefore biased the model. As a result, we settled on the balanced 8,000-track dataset.
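The imbalance that bit us is easy to diagnose up front by tallying tracks per genre before training. A minimal sketch (the helper name and label format are our own for illustration):

```python
from collections import Counter

def genre_balance(labels):
    """Report per-genre track counts; a skewed distribution biases training
    toward the majority genre. `labels` is one genre string per track."""
    counts = Counter(labels)
    total = len(labels)
    for genre, n in counts.most_common():
        print(f"{genre:15s} {n:6d} ({100 * n / total:.1f}%)")
    return counts
```

Running this on the 25,000-track subset would have flagged the dominant genre immediately, before any training time was spent.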
Furthermore, selecting and processing the raw audio features was challenging. We initially planned to use the audio feature data included in the FMA metadata dataset, but found it better to hand-select the audio features ourselves. At first we used mel spectrograms, zero-crossing rate, and chroma features, but the results were unsatisfactory; after pivoting to MFCC data alone, we saw better accuracy. Finally, since our model was rather large and combined a variety of input sources and layers, experimenting with hyperparameters was very time-consuming: in the end, a single run of our model took approximately 45 minutes.
Ultimately, we are very satisfied with how the model turned out. For a team of two, we feel that we accomplished a lot and achieved satisfactory results. The model did not work entirely as expected, in that the combined model did not produce improved results, but we still obtained meaningful results and learned a great deal about developing deep learning models on our own. Our approach changed considerably over time. While we initially wanted to use a variety of raw audio features, including mel spectrograms, we ended up achieving better results using only MFCC signals; reducing the feature sizes also improved runtime. Additionally, our initial dataset led the model to over-predict one genre that was excessively prevalent in the data, so we switched to a smaller, balanced dataset to avoid this problem. If we could repeat the process, we would begin with the balanced dataset and MFCC features. With more time, there are many improvements we would make: we would add audio features on top of MFCC to see whether they improve accuracy, include more textual metadata, and experiment further with hyperparameters to reach the 60% accuracy we initially set out to achieve.
This project has offered us many valuable insights and takeaways. Most importantly, it has given us the opportunity to develop our own model completely from scratch. While the assignments introduced us to various architectures, we were generally working under the framework of stencil code provided by the course instructors. In this project we had a blank slate to devise our model. This experience is invaluable to our future pursuits in deep learning. We now feel that we are fully equipped with the knowledge and skills needed to continue writing our own neural networks in the future.