ZimmeRNN
Mason Burke, David Lauerman, Serdar Sungun
Introduction
The goal of our project was to generate classical music from a database of classical music transcriptions. This was less about solving a problem and more a creative pursuit.
Methodology
We initially wanted to use a GAN for this task, which may ultimately have worked, but we opted for an RNN because of the simplicity of the model. Much like the MNIST dataset of images we used in class assignments, there is a MusicNet library available online. MusicNet contains 330 song files from Baroque/Romantic-period composers, primarily Bach and Chopin. While useful, these files contain a wealth of information that required a lot of paring down for our project: at each point in time, a song has a list of the notes currently sounding, along with all of their pertinent information, such as duration, instrument, and volume.

We decided that the best way to process this data for a deep neural network was to extract only the note values, so that each sampled timestep becomes a multi-hot vector of the notes playing at that moment. We then took windows of these samples and trained the network to predict the next timestep given that sequence of notes. After training, we fed the fully trained RNN a sequence of notes taken from somewhere in one of our input songs, in array form, to act as a seed for generation. We then had the RNN predict timesteps until it had produced a minute's worth of music, giving us a multi-hot vector for each generated timestep.
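To make the pipeline concrete, here is a minimal sketch of the preprocessing and generation steps described above. The pitch range, window length, sampling rate, and the `model` object (assumed to be a Keras-style model mapping a window of timesteps to per-note probabilities) are illustrative assumptions, not the exact values or APIs we used.

```python
import numpy as np

N_PITCHES = 128          # assumed MIDI-style pitch range
WINDOW = 64              # assumed input window length (timesteps)
STEPS_PER_SEC = 16       # assumed sampling rate of the piano roll

def to_multi_hot(notes, n_steps):
    """Build an (n_steps, N_PITCHES) piano roll from (start_step, end_step, pitch) tuples."""
    roll = np.zeros((n_steps, N_PITCHES), dtype=np.float32)
    for start, end, pitch in notes:
        roll[start:end, pitch] = 1.0
    return roll

def make_windows(roll):
    """Slice a piano roll into (window, next-step) training pairs."""
    X, y = [], []
    for t in range(len(roll) - WINDOW):
        X.append(roll[t:t + WINDOW])
        y.append(roll[t + WINDOW])
    return np.stack(X), np.stack(y)

def generate(model, seed, seconds=60):
    """Autoregressively extend a seed window into `seconds` of new music."""
    window = seed.copy()                        # shape (WINDOW, N_PITCHES)
    output = []
    for _ in range(seconds * STEPS_PER_SEC):
        probs = model.predict(window[None, :, :], verbose=0)[0]
        step = (probs > 0.5).astype(np.float32)  # naive thresholding, just for the sketch
        output.append(step)
        window = np.vstack([window[1:], step])   # slide the window forward one step
    return np.stack(output)
```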
Results
So, we have our output array of notes at each timestep. What now? Our first attempt was to iterate through these arrays, adding each note individually to a MIDI file with one library, and then playing the resulting MIDI file with another. Somewhere in this chain, a major miscommunication occurred: when we played the resulting MIDI file, all we heard was an audio blip lasting about 0.1 seconds. We later found that this was an issue with the MIDI-playing library, and that our MIDI file was actually being created correctly. Unfortunately, when we plotted the loss for each batch, the results were not good. The model wasn't really learning, and we didn't have an effective loss function.
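The write-up doesn't name the MIDI libraries we used, but a note-by-note export along the lines described above could look like the following sketch using the pretty_midi package (the timing and velocity constants are illustrative assumptions).

```python
import numpy as np
import pretty_midi

STEP_SECONDS = 1.0 / 16  # assumed duration of one timestep

def roll_to_midi(roll, path="output.mid"):
    """Write a (timesteps, 128) multi-hot piano roll to a MIDI file, one note per active cell."""
    pm = pretty_midi.PrettyMIDI()
    piano = pretty_midi.Instrument(program=0)  # acoustic grand piano
    for t, step in enumerate(roll):
        for pitch in np.flatnonzero(step):
            piano.notes.append(pretty_midi.Note(
                velocity=80,
                pitch=int(pitch),
                start=t * STEP_SECONDS,
                end=(t + 1) * STEP_SECONDS,
            ))
    pm.instruments.append(piano)
    pm.write(path)
```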
Our best guess is that there is a programming issue somewhere in our code, combined with the fact that we never found a ready-to-use loss function suited to the way we structured our input data.
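In retrospect, one standard choice for multi-hot targets is element-wise binary cross-entropy, which treats each of the 128 notes as an independent on/off prediction. A minimal sketch of what that could look like in Keras (the layer sizes and architecture are assumptions, not what we actually built):

```python
import tensorflow as tf

N_PITCHES = 128
WINDOW = 64

# Sigmoid outputs plus binary cross-entropy treat every note as its own
# on/off prediction, which matches a multi-hot target vector.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, N_PITCHES)),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dense(N_PITCHES, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```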
Challenges
The difficulty in this project was concentrated in the preprocessing of our input data and the postprocessing of our output song. Preprocessing took the majority of the project's time and required the most attention. It was also quite difficult to get the MIDI output working, and we ended up abandoning that approach after it failed. Overall, the project would have been much simpler if we had started from a dataset already structured around the way we ended up implementing it; that would have given us a better chance of choosing an output format that mapped cleanly onto MIDI, or onto whatever format that new input suggested.
Reflection
Our main takeaway from this project is that music generation and processing is inherently a difficult process, full of necessary sacrifices. A potentially better approach would have been to start from songs already in the form of tokenized MIDI files. Every time a data structure is passed through some transformation, there is a substantial risk of data loss, as we found with our first approach. We also had a few ideas we hoped to implement but could not finish for lack of time. One was to have a second network learn the number of notes to be played at each timestep. Our output at each timestep was not strictly ones and zeros but floats between 0 and 1, so we needed some way of deciding which notes to play, and how many. We settled on simply playing the top 3 notes at each timestep (or fewer, if fewer notes were available). A network that predicts the number of notes to select at each timestep, working in tandem with the main network, would let us output notes more faithfully, rather than a predetermined, fixed number.
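For reference, the top-3 selection we settled on amounts to something like the following sketch. The cutoff threshold is an assumption added for illustration; the write-up only says we fell back to fewer notes when fewer were available.

```python
import numpy as np

def pick_notes(probs, max_notes=3, threshold=0.1):
    """Turn per-note probabilities into a multi-hot step, keeping at most `max_notes` notes."""
    candidates = np.argsort(probs)[::-1][:max_notes]     # highest-scoring pitches first
    chosen = [p for p in candidates if probs[p] > threshold]
    step = np.zeros_like(probs)
    step[chosen] = 1.0
    return step
```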
Another potential area of improvement for our model would be to learn separate instruments, or perhaps to focus on monophonic, single-instrument music. Different instruments play different roles, and collapsing every instrument in the training set into a single instrument output may muddy the note structure the model learns.
On the whole, our project was a success in terms of data processing, but our ability to produce an audio output file was limited by the difficulty of interfacing with available libraries.
This brings us to our biggest takeaway from this project: the value of in-depth planning. We took on a hard project, so we needed to pick a route and commit to it in order to finish in time. With more time, however, we could have planned further ahead, uncovered deeper problems earlier, and made key pivots. That said, we are moderately happy with how it turned out, and we gained a lot of familiarity with RNNs and with data handling in Python.