Generating tonal piano music with an LSTM neural network
Final project for MAIS 202: Accelerated Introduction to Machine Learning bootcamp, hosted by the McGill Artificial Intelligence Society.
As neural networks get more and more powerful and versatile, it remains to be seen whether they can really accomplish everything that a human can accomplish. I believe one of the most interesting areas in which this question remains unanswered is artistic creation. Can a neural network truly learn to create original art, even after having been trained with data from countless existing pieces of art?
I have played the piano and composed classical music since the age of 4, so for my first machine learning project I decided to combine my two passions and train a neural network to generate original tonal piano music.
Examples of generated music
Examples can be found here, or in the "examples" folder of my project's GitHub repository. Each example starts with the starting sequence that was used to generate it, then a very high note indicates when the music generated by the model begins.
I used data from here, which was the largest collection of classical piano MIDI files that I could find.
I cleaned the data so that each message was represented using 3 values:
- The pitch of the note, represented by an integer between 21 and 108 (the possible notes on a piano)
- Whether the message is a note_on message, representing the start of a note, or a note_off message, representing the end of a note (note_on = 1, note_off = 0)
- The time that has passed between that message and the previous message, represented by an integer between 0 and 720 (0 meaning that the notes were simultaneous and thus formed a chord), which I'll refer to as the duration
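The three-value encoding above can be sketched as follows. This is a minimal illustration and not the project's actual cleaning code: the input format (tuples of message type, pitch and time delta), the function name `encode_messages`, and the choices to clamp durations at 720 and to fold the time of skipped messages into the next kept message are all assumptions.

```python
from typing import List, Tuple

MIN_PITCH, MAX_PITCH = 21, 108  # piano range given in the list above
MAX_DURATION = 720              # largest duration value given in the list above


def encode_messages(raw: List[Tuple[str, int, int]]) -> List[Tuple[int, int, int]]:
    """Turn (type, pitch, delta_time) tuples into (pitch, is_note_on, duration)."""
    encoded = []
    pending = 0  # time accumulated from messages that were skipped
    for msg_type, pitch, delta in raw:
        pending += delta
        is_note = msg_type in ("note_on", "note_off")
        if not is_note or not MIN_PITCH <= pitch <= MAX_PITCH:
            continue  # assumption: drop non-note messages but keep their time
        is_on = 1 if msg_type == "note_on" else 0
        duration = min(pending, MAX_DURATION)  # assumption: clamp long gaps
        encoded.append((pitch, is_on, duration))
        pending = 0
    return encoded


raw = [
    ("note_on", 60, 0),
    ("note_on", 64, 0),          # duration 0: forms a chord with the note above
    ("control_change", 64, 10),  # not a note message, so it is skipped
    ("note_off", 60, 480),
    ("note_off", 64, 1000),      # longer than 720, so it is clamped
]
encoded = encode_messages(raw)
print(encoded)
```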
The network takes as input a sequence of 150 messages and predicts the message that will follow it. It first passes the sequence through a self-attention layer, which generates a new representation of the sequence containing information about how each message is linked to the other messages in the sequence.
It then passes this representation through an LSTM layer with a tanh activation function and a sigmoid recurrent activation function.
The last hidden state from this LSTM layer is then used as the query for an additive attention layer, with all the hidden states from the LSTM serving as keys and values. The goal of this attention layer is for the network to learn which of the 150 hidden states output by the LSTM it should pay the most attention to when predicting the next message.
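To make the idea concrete, here is a small pure-Python sketch of additive attention with the last hidden state as the query. It is a simplified illustration rather than the trained layer: the `scale` vector stands in for learned weights (here fixed to ones as an assumption), and the three 2-dimensional hidden states are toy values, not the 150 states of the real model.

```python
import math
from typing import List, Optional, Tuple


def additive_attention(
    query: List[float],
    states: List[List[float]],
    scale: Optional[List[float]] = None,
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector) for one query."""
    dim = len(query)
    scale = scale or [1.0] * dim  # assumption: stands in for learned weights
    # Additive score for each state: sum over dimensions of scale * tanh(q + k).
    scores = [
        sum(s * math.tanh(q + k) for s, q, k in zip(scale, query, state))
        for state in states
    ]
    # Numerically stable softmax turns the scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of the states (keys and values are the same).
    context = [
        sum(w * state[d] for w, state in zip(weights, states)) for d in range(dim)
    ]
    return weights, context


states = [[0.1, -0.3], [0.5, 0.2], [0.9, 0.7]]  # toy hidden states
query = states[-1]                              # last hidden state as the query
weights, context = additive_attention(query, states)
print(weights, context)
```

The context vector returned here plays the role of the attention output that is concatenated to the last hidden state in the next step.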
The output from this attention layer is then concatenated to the last hidden state of the LSTM. This concatenated vector, which we'll call the preliminary output vector, is then passed through 3 different dense layers to produce 3 outputs, each predicting one aspect of the next message in the sequence.
The first is a dense layer with only 2 outputs and a softmax activation function, to predict whether the message is a note_on or note_off message. The loss function I used for this output layer was Binary Cross-Entropy, since it is a basic binary classification problem.
The output from this dense layer is then concatenated to the preliminary output vector and passed through another dense layer with only one output and a sigmoid activation function to predict the duration of the next message. The loss function I used for this output layer was Mean Squared Error. This output is then scaled up and rounded to one of the possible durations for a note in a classical piece.
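The scaling and rounding step could look something like the sketch below. The list `ALLOWED_DURATIONS` is hypothetical, since the actual set of permitted durations is not given here, and scaling by the maximum duration of 720 is likewise an assumption.

```python
MAX_DURATION = 720
# Hypothetical set of permitted durations; the real list is not given here.
ALLOWED_DURATIONS = [0, 30, 60, 120, 180, 240, 360, 480, 720]


def decode_duration(sigmoid_output: float) -> int:
    """Map a sigmoid output in [0, 1] to the nearest permitted duration."""
    scaled = sigmoid_output * MAX_DURATION  # assumption: scale by the maximum
    return min(ALLOWED_DURATIONS, key=lambda d: abs(d - scaled))


print(decode_duration(0.17))  # 0.17 * 720 = 122.4, nearest permitted value is 120
```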
The outputs of both previous dense layers are then concatenated to the preliminary output vector and passed through another dense layer with 88 outputs and a softmax activation function to predict the pitch of the next message. The loss function I used for this output layer was Categorical Cross-Entropy.
The combination of the outputs from these 3 layers forms the prediction for the next message in the sequence.
I decided not to use a validation or test set, as a high accuracy on the validation/test sets wouldn't really indicate anything about whether the model was capable of generating original music, since whether a piece of music is "good" or not is a very subjective matter. I therefore decided to trust my musical instincts to see whether each new iteration of my model was learning to generate music well or not. For each model, I would generate new pieces with a length of 1000 messages and listen to them. The elements I used to evaluate the model's performance were elements I deemed to be indicative of a good piece of tonal piano music:
- Whether the harmonies were consonant or dissonant
- Whether the rhythmic patterns seemed organized or not
- Whether melodic patterns were repeated
I know this is far from an objective way to evaluate a model's performance, but it was the best way I could think of due to the subjective nature of music in general.
When generating new music, the model takes a random sequence of 150 messages from the dataset and predicts the next message in the sequence. It then discards the first message in the sequence and appends the message it predicted to the end of the sequence, forming a new sequence of 150 messages which it uses to predict another message. It continues to do this until it has generated a specified number of messages.
I found it necessary to split my dataset into a "training set" and a "generation set" because, when the model used an initial sequence that came from data it had been trained on, it just recomposed the exact same piece that was in the training set. I therefore kept the generation set separate from the training set so that the model could have some unseen data from which to take its initial sequence, thus serving as a sort of "test set".
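The generation loop described above amounts to a sliding window over the sequence. The sketch below shows that loop in isolation; `predict_next` is a placeholder for the trained model, and `dummy_predict` is a made-up stand-in used only so the example runs.

```python
from collections import deque
from typing import Callable, List, Sequence, Tuple

Message = Tuple[int, int, int]  # (pitch, is_note_on, duration)


def generate(
    predict_next: Callable[[Sequence[Message]], Message],
    seed: Sequence[Message],
    num_messages: int,
    window: int = 150,
) -> List[Message]:
    """Repeatedly predict the next message and slide the window forward."""
    if len(seed) != window:
        raise ValueError(f"seed must contain exactly {window} messages")
    # With maxlen set, appending a message drops the oldest one automatically.
    context = deque(seed, maxlen=window)
    generated = []
    for _ in range(num_messages):
        message = predict_next(list(context))
        context.append(message)
        generated.append(message)
    return generated


def dummy_predict(sequence: Sequence[Message]) -> Message:
    """Stand-in for the model: step the pitch up and flip the message type."""
    pitch, is_on, duration = sequence[-1]
    return (min(pitch + 1, 108), 1 - is_on, duration)


seed = [(60, 1, 120)] * 150
piece = generate(dummy_predict, seed, num_messages=5)
print(piece)
```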
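One simple way to make such a split is at the level of whole pieces, so that no piece contributes to both sets. The sketch below assumes that approach; the function name, the file names and the 10% generation fraction are illustrative choices, not the values used in the project.

```python
import random
from typing import List, Sequence, Tuple


def split_pieces(
    piece_names: Sequence[str],
    generation_fraction: float = 0.1,  # assumption: the real ratio is not stated
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Split whole pieces into a training set and a held-out generation set."""
    names = sorted(piece_names)  # sort first so the split is reproducible
    random.Random(seed).shuffle(names)
    n_generation = max(1, round(len(names) * generation_fraction))
    return names[n_generation:], names[:n_generation]


pieces = [f"piece_{i:02d}.mid" for i in range(20)]  # hypothetical file names
train_set, generation_set = split_pieces(pieces)
print(len(train_set), len(generation_set))
```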
I implemented and trained my model using Keras. I integrated my model into a simple Flask/HTML web-app, which allows the user to specify the length of the piece they want to generate, as well as the composer they want the initial sequence to be taken from. The web-app then generates a new piece and gives the user the option to listen to it directly on the webpage, or download it as a MIDI/WAV file.
My initial model had a single LSTM layer that predicted the duration, type, and pitch of a note all at once. It therefore had to choose among roughly 1500 classes, which represented the number of different combinations of duration, type and pitch for a note. Apart from the fact that this was an unreasonable number of classes, it was also problematic because, although some of the classes were more closely related than others (for example, two classes that shared the same pitch with a slightly different duration), the model had no way of knowing or understanding these relations.
I therefore reframed the problem so that my model would predict the 3 attributes of a message separately. This not only reduced the number of classes it had to choose among, but also allowed the model to "see" when it had predicted one attribute correctly and the others incorrectly, which made it easier for it to learn. I also decided to chain the predictions: the output of the message-type classifier is passed as an input to the layer predicting the duration, and both the type and duration outputs are passed to the layer predicting the pitch, so that each prediction can take the earlier predictions for the same message into account and therefore be more accurate. Finally, I added attention layers to allow the model to better understand the connections between the messages in the input sequence, and between the message being predicted and all the other messages in the input sequence.
This final model is the model described in more detail in the "Architecture" section.
One of the major challenges I ran into was a lack of training time. Since I didn't have any kind of validation set, I had no way of knowing whether the model was actually learning something before it completed its training, so I had to train it completely every time, which usually took around 8 hours. Then, after 8 hours, I would sometimes find that the model hadn't learned anything and that I had therefore wasted those 8 hours of training time.
This lack of training time was accentuated by the fact that I encountered many strange bugs with Keras and CUDA which would sometimes make my training loop crash while my model was training overnight and therefore waste even more training time.
What I learned
During this project, I learned a lot about how to clean data correctly so that it can be used for a machine learning model.
I also learned how to use Keras/TensorFlow to implement a neural network, and how to experiment with different architectures to see which would give better results.
I also learned that, when training a machine learning model, the way you frame your problem can have just as big an impact on your results as your model's architecture or the size of your dataset.
Finally, I learned how LSTM and Attention layers work, and how to implement them using Keras/TensorFlow. I was also able to see with my own eyes the power of Attention layers by seeing how much my model improved after I had added Attention layers to it.
In the future, I would like to experiment with transformers, as they have been shown to be very effective in sequence prediction problems in recent years.
I would also like to find more data for my model to train on, as there are countless other classical compositions I could include in my dataset if I were just able to find MIDI files for them.
I would also like to try augmenting my dataset by uniformly increasing/decreasing the pitch/duration of each note in a piece, which would not change the overall musicality of the piece, but would provide the model more data to train on.
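A sketch of what this augmentation might look like on the (pitch, type, duration) encoding is shown below. It is only an illustration of the idea: the function names are hypothetical, and discarding transpositions that leave the piano range and capping stretched durations at 720 are assumptions.

```python
from typing import List, Optional, Tuple

Message = Tuple[int, int, int]  # (pitch, is_note_on, duration)


def transpose(piece: List[Message], semitones: int) -> Optional[List[Message]]:
    """Shift every pitch by the same interval; None if a note leaves the piano."""
    shifted = [(pitch + semitones, on, dur) for pitch, on, dur in piece]
    if any(not 21 <= pitch <= 108 for pitch, _, _ in shifted):
        return None  # assumption: discard transpositions that do not fit
    return shifted


def stretch(piece: List[Message], factor: float, max_duration: int = 720) -> List[Message]:
    """Scale every duration by the same factor, capped at the maximum duration."""
    return [
        (pitch, on, min(max_duration, round(dur * factor)))
        for pitch, on, dur in piece
    ]


piece = [(60, 1, 0), (64, 1, 0), (60, 0, 480), (64, 0, 0)]
up = transpose(piece, 2)         # whole piece moved up a major second
too_high = transpose(piece, 60)  # would exceed pitch 108, so it is rejected
slow = stretch(piece, 1.5)       # every duration made 1.5 times longer
print(up, too_high, slow)
```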
I would also like to try using the MAESTRO dataset, which contains MIDI files of human performances and therefore has varying velocities (volumes) for each note. I would then frame the velocity prediction as a regression problem instead of a classification problem in order to train the model to emulate human performance.
Finally, I would like to experiment with an LSTM layer that takes inputs of variable length, so that the model would not require an initial sequence of 150 messages to start generating music, but could take a random starting note and generate music directly from there by adding each note it predicts to its input sequence, thus increasing the length of its input sequence every time.