MusicVAE: MIDI Music Generation

Project Members

Minh Quan Do (mdo3)
Sam Maffa (smaffa)
Sean Yamamoto (syamamo1)

Final Project Deliverables

PROJECT PAPER
PROJECT POSTER - Google Slide Version
PROJECT POSTER - PDF Version
PROJECT ORAL PRESENTATION
MODEL GENERATED MUSIC (GDRIVE FOLDER). We recommend listening in order: 25, 50, then 100 epochs.

DevPost #1

Introduction:

We aim to explore deep learning models for generating novel music clips. This task is a structured prediction problem, where we want to learn latent features of input music data to train a generative model that can produce realistic music. Towards this end, we are implementing the MusicVAE model outlined in “A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music” (Roberts et al., 2018). The original paper sought to improve upon existing recurrent VAE models for long sequences. At the time, VAEs were not commonly applied to sequential data and were limited in their ability to capture long-term structure. By introducing a novel hierarchical recurrent decoder into the VAE architecture, the authors were able to apply a generative modeling approach to music generation.

Other studies (review paper, Towards Data Science article) of deep learning models for music generation have experimented with CNNs (WaveNet), RNNs (DeepBach), GANs (MidiNet, MuseGAN), and transformers (MusicAutobot, MuseNet), among other architectures. Furthermore, Magenta is an open-source Python library built on TensorFlow, which features several research projects on music and image generation using various machine learning models, including MusicVAE.

Data:

We will be using a library of piano songs from the Lakh Pianoroll Dataset (LPD). This dataset is a collection of 174,154 multitrack pianorolls derived from the Lakh MIDI Dataset. Pianoroll is a music storage format that represents a song as a matrix in which one axis indexes time steps and the other indexes pitches. This dataset uses 24 time steps per beat, which allows for fine-grained temporal patterns. The pitch axis has 128 entries, corresponding to the 128 MIDI note numbers. Since our data is also multitrack, an entire song is stored in a [(24 * number of beats in song) x 128 x N] tensor, where N is the number of tracks (instruments) in the song. We will reduce this data to just the piano track of each song to limit the complexity of our model. We can do this by taking the first slice along the track axis of the song tensor instead of all N.
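As a minimal sketch of the representation described above, the NumPy snippet below builds a toy multitrack pianoroll and keeps a single track. The 24 steps per beat and 128 pitches come from the dataset description; the toy song length, the note placement, the helper names, and the assumption that the track of interest sits at index 0 of the last axis are illustrative only.

```python
import numpy as np

STEPS_PER_BEAT = 24   # temporal resolution stated for the dataset
NUM_PITCHES = 128     # MIDI note numbers 0-127


def empty_song(num_beats, num_tracks):
    """Return an all-silent tensor of shape (time steps, pitches, tracks)."""
    shape = (STEPS_PER_BEAT * num_beats, NUM_PITCHES, num_tracks)
    return np.zeros(shape, dtype=bool)


def select_track(song, track_index=0):
    """Keep one track, giving a 2-D (time steps, pitches) pianoroll."""
    return song[:, :, track_index]


# Toy example: a 4-beat song with 5 tracks (hypothetical sizes).
song = empty_song(num_beats=4, num_tracks=5)
# Hold middle C (MIDI note 60) for the first beat on track 0.
song[0:STEPS_PER_BEAT, 60, 0] = True

piano = select_track(song)
print(song.shape)   # (96, 128, 5)
print(piano.shape)  # (96, 128)
```

Slicing rather than copying keeps the single-track view cheap, which matters when the same operation is applied across the whole dataset.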

These songs are stored in MIDI files, which total ~4.2GB in size. Conveniently, there are Python packages that can be used to manage this type of file. Mido, in particular, can be used to read MIDI files so that they can be converted into NumPy arrays, which can then be passed into our TensorFlow model. Some of the MIDI files are corrupt, so we will have to remove these from our dataset.
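The conversion step can be sketched without committing to a particular MIDI parser. The function below assumes the notes have already been read out of a file (for example with Mido) into hypothetical `(pitch, start_step, end_step)` tuples, and fills a binary pianoroll from them. The tuple format and the function name are assumptions for illustration, not part of any library.

```python
import numpy as np


def notes_to_pianoroll(notes, num_steps, num_pitches=128):
    """Fill a binary (time steps, pitches) matrix from note tuples.

    `notes` is an iterable of (pitch, start_step, end_step) tuples, a
    hypothetical intermediate format; end_step is exclusive.
    """
    roll = np.zeros((num_steps, num_pitches), dtype=np.uint8)
    for pitch, start, end in notes:
        if not 0 <= pitch < num_pitches:
            continue  # ignore pitches outside the representable range
        start = max(start, 0)
        end = min(end, num_steps)
        if start < end:
            roll[start:end, pitch] = 1
    return roll


# One beat of C, then one beat of E and G together (24 steps per beat).
notes = [(60, 0, 24), (64, 24, 48), (67, 24, 48)]
roll = notes_to_pianoroll(notes, num_steps=48)
print(roll.shape)  # (48, 128)
```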

Methodology:

The overall architecture of the MusicVAE model is: Encoder RNN → Conductor RNN → Decoder RNN

We will be using recurrent variational autoencoders based on the paper linked above. The paper splits the model into two main components: the “bidirectional encoder” and the “hierarchical decoder,” which consists of a conductor and a bottom-level decoder.

Encoder

A two-layer bidirectional LSTM network
We receive two final states from the LSTM, one from passing the inputs in each direction. These are concatenated and fed into two fully connected layers to produce the mean and variance of the latent distribution.

Conductor

A two-layer unidirectional LSTM network
The initial state of the RNN is the latent vector from the encoder. Non-overlapping subsequences of the input are then fed into the conductor RNN to produce an embedding for each subsequence.

Decoder

A two-layer LSTM network
Embeddings from the conductor serve as initial states for the decoder. Inputs to the decoder are the previous RNN outputs concatenated with the conductor embedding of the current subsequence.
The decoder outputs probabilities over output tokens.

Most challenging part about paper

Constructing the hierarchical decoder with two RNN components seems challenging, as there will be many inputs from different sources to keep track of and manipulate: latent codes, embeddings from the conductor, subsequences of the input, etc.
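The encoder's last step, turning the two concatenated LSTM states into a latent sample, can be sketched in NumPy as follows. This is an illustration under assumptions rather than the project's code: the layer sizes, the random weights, the log-variance parameterization, and the function names are all hypothetical, and the LSTM itself is replaced by random stand-in states.

```python
import numpy as np

rng = np.random.default_rng(0)
BATCH, HIDDEN, LATENT = 4, 16, 8   # hypothetical sizes


def sample_latent(h_forward, h_backward, params, rng):
    """Concatenate both final states and sample z = mu + sigma * eps."""
    h = np.concatenate([h_forward, h_backward], axis=-1)
    mu = h @ params["w_mu"] + params["b_mu"]        # first dense layer
    log_var = h @ params["w_lv"] + params["b_lv"]   # second dense layer
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps
    return z, mu, log_var


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, sigma^2) to N(0, I), one value per example."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)


params = {
    "w_mu": rng.normal(scale=0.1, size=(2 * HIDDEN, LATENT)),
    "b_mu": np.zeros(LATENT),
    "w_lv": rng.normal(scale=0.1, size=(2 * HIDDEN, LATENT)),
    "b_lv": np.zeros(LATENT),
}
# Stand-ins for the final forward and backward LSTM states.
h_fwd = rng.normal(size=(BATCH, HIDDEN))
h_bwd = rng.normal(size=(BATCH, HIDDEN))

z, mu, log_var = sample_latent(h_fwd, h_bwd, params, rng)
kl = kl_to_standard_normal(mu, log_var)
print(z.shape, kl.shape)  # (4, 8) (4,)
```

Predicting the log of the variance instead of the variance itself is a common choice because it keeps the variance positive without a constraint on the layer output.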

Metrics:

The authors of the original MusicVAE project used two methods for evaluation. The first was a Turing test, a subjective measure of performance in which the authors asked 192 survey participants to vote on which of two audio clips they thought was computer-generated and which was composed by a human. The second was a reconstruction test, where the authors measured how much of the original input could be reproduced when using teacher forcing and when simply sampling from the latent space.
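One simple way to score reconstruction, assuming the input and the reconstruction are both equal-length sequences of integer tokens, is the fraction of time steps that match. The token encoding and the function name below are assumptions for illustration; the original paper's exact procedure may differ.

```python
import numpy as np


def reconstruction_accuracy(original, reconstructed):
    """Fraction of positions where the two token sequences agree."""
    original = np.asarray(original)
    reconstructed = np.asarray(reconstructed)
    if original.shape != reconstructed.shape:
        raise ValueError("sequences must have the same shape")
    return float(np.mean(original == reconstructed))


# Hypothetical token sequences; two of the eight steps differ.
original = [60, 60, 62, 64, 64, 0, 0, 67]
reconstructed = [60, 60, 62, 65, 64, 0, 1, 67]
accuracy = reconstruction_accuracy(original, reconstructed)
print(accuracy)  # 0.75
```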

There are other objective measurements that can be used to assess music generation by comparing the distributions of specific features of the input data and the generated data. For example, we could compute the pitch counts, note length distribution, the pitch class transition matrix, or the distribution of distances between successive notes, and we could calculate Kullback-Leibler Divergence or overlapping area under the distributions as a measure of similarity or distance between our generated audio clips and real audio clips (Yang and Lerch, 2020).
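To make the distribution-based comparison concrete, the sketch below computes a pitch-class histogram for two pianorolls and compares them with KL divergence and overlapping area. It is a simplified illustration that works on raw histograms with a small smoothing constant; the function names and toy clips are assumptions, and it is not the exact procedure of Yang and Lerch.

```python
import numpy as np


def pitch_class_histogram(pianoroll):
    """Normalized count of active steps for each of the 12 pitch classes."""
    roll = np.asarray(pianoroll, dtype=float)
    steps_per_pitch = roll.sum(axis=0)                # shape: (pitches,)
    pitch_classes = np.arange(roll.shape[1]) % 12     # C=0, C#=1, ...
    hist = np.bincount(pitch_classes, weights=steps_per_pitch, minlength=12)
    total = hist.sum()
    if total == 0:
        return np.full(12, 1.0 / 12)                  # silent clip: uniform
    return hist / total


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) after smoothing both histograms to avoid log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def overlap_area(p, q):
    """Shared probability mass of two histograms (1.0 means identical)."""
    return float(np.minimum(p, q).sum())


# Toy clips: "real" plays C then E, "generated" plays C then G.
real = np.zeros((48, 128))
real[:24, 60] = 1
real[24:, 64] = 1
generated = np.zeros((48, 128))
generated[:24, 60] = 1
generated[24:, 67] = 1

p = pitch_class_histogram(real)
q = pitch_class_histogram(generated)
print(overlap_area(p, q))  # 0.5
print(kl_divergence(p, q) > kl_divergence(p, p))  # True
```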

For our project, however, we are setting less rigorous baseline goals to begin with, and we hope to use more robust measurements once we have a reasonably working model. Our base goal will be for our model to generate audio that sounds like “real” music, in our own opinions. If we achieve that, our target goals will be to achieve either 20% of votes for our generated music on an informal Turing test or 25% reconstruction accuracy on a teacher-forcing test. Our stretch goals will be higher percentages on these metrics: upwards of 50% on a Turing test or 50% reconstruction accuracy.

Ethics:

Why is Deep Learning a good approach to this problem? Deep learning is a good approach to this problem because music can be represented in a neat, quantifiable way, using a matrix of pitches by time steps. Music creation is also very complex, since each song is unique; however, we hope our model can identify similarities between songs and produce songs of its own. RNN layers are well suited to this problem because music is time-dependent and uses complex tonal and rhythmic patterns, which LSTM cells can learn.
What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain? Our data comes from the Lakh Pianoroll dataset. Because we are generating music based on input data, a dataset that contains mostly samples from a certain genre or culture of music is likely to bias the model toward generating only that music. Although that is not necessarily a problem in and of itself given our limited use case, it means we have to be careful about describing what kind of music the model will produce. We can be upfront and say that this is a music generator for x types of music, as it was trained on that music. This is still a concern even with piano music, as the piano is predominantly a Western instrument, so there may be some inherent biases from using it as our main musical medium. The concern here is that we do not know much about the composition of the dataset in terms of genre; we just know that we have around 174,000 songs.
Other concerns: In terms of stakeholders, technology like what we are building can lead to systematic biases in music if widely adopted. If developers are not careful and sample music only from dominant cultures, those biases would persist. There are also economic implications, in terms of jobs and the music industry. Could this lead to automation of more structured music types, such as jingles, corporate music, and elevator music? Would this put many musicians out of work and cause the field to lose ‘creativity’, which is one of the hardest things to teach a model like the one we are trying to build?

Division of Labor:

Data acquisition and preprocessing - Sean
VAE model implementation: Encoder, Conductor, Decoder - Sam
Training and testing functions; parameter tuning; integration - Minh Quan

We expect the model implementation to be the most difficult part, so all three of us will work together on the model once we complete our other tasks.

DevPost #2

Introduction:

Challenges:

Data processing was challenging because we had to learn a Python package to process MID/MIDI files, which took a considerable amount of time. Once we imported the MIDI file data into Python, we also had to convert it into a format (np.array([timesteps, notes being played])) that could be used in our RNN encoder-decoder model. This was tricky because of the way MIDI files are formatted.

When implementing the model, one of our biggest challenges was setting up the hierarchical decoder structure, in which each predicted output is concatenated with the next input. The decoder consists of two successive LSTM blocks, and each note in a given input sequence must pass fully through both decoder layers before its output can be joined to the next input in the sequence. This presents a considerable obstacle for runtime, as we have had to loop through the sequence manually rather than let TensorFlow process the sequence internally in the LSTM call.
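The manual decoding loop can be sketched in NumPy. To keep the example short, a single tanh recurrent cell stands in for the two stacked LSTM layers, and all sizes, weights, and names are hypothetical. The point is only the data flow: each step's input is the previous output concatenated with the conductor embedding of the current subsequence, so step t cannot start until step t-1 has finished.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMBED, HIDDEN = 10, 6, 8   # hypothetical sizes

# Randomly initialized weights for a stand-in recurrent cell.
W_init = rng.normal(scale=0.1, size=(EMBED, HIDDEN))
W_in = rng.normal(scale=0.1, size=(VOCAB + EMBED, HIDDEN))
W_h = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))
W_out = rng.normal(scale=0.1, size=(HIDDEN, VOCAB))


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def decode_subsequence(embedding, steps):
    """Decode one subsequence step by step, feeding outputs back in."""
    h = np.tanh(embedding @ W_init)   # the embedding sets the initial state
    prev = np.zeros(VOCAB)            # no previous output at the first step
    probs_per_step = []
    for _ in range(steps):
        x = np.concatenate([prev, embedding])    # previous output + embedding
        h = np.tanh(x @ W_in + h @ W_h)
        probs = softmax(h @ W_out)
        probs_per_step.append(probs)
        prev = np.eye(VOCAB)[np.argmax(probs)]   # one-hot of the greedy choice
    return np.stack(probs_per_step)


def decode(embeddings, steps_per_subsequence):
    """Run the step-wise decoder once per conductor embedding."""
    parts = [decode_subsequence(c, steps_per_subsequence) for c in embeddings]
    return np.concatenate(parts)


# Four stand-in conductor embeddings, 16 steps each.
embeddings = rng.normal(size=(4, EMBED))
probs = decode(embeddings, steps_per_subsequence=16)
print(probs.shape)  # (64, 10)
```

The Python-level `for` loop is what makes this slow compared with handing the whole sequence to a single recurrent layer call.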

Insights:

We have done some initial testing locally on small batches of songs, and we’ve found that our model stalls at the autodifferentiation step, suggesting that we either need to upgrade our computational resources or decrease the number of parameters in our model. We do, however, know that both the MIDI conversions and the model work independently, as we have been able to process the data into compatible matrix representations for our model, and the model can run on random inputs, though those results are not interpretable.

Plan:

Moving forward, we need to ensure that our model runs on the real MIDI data. This may take some parameter tuning, or adjusting the number of trainable weights so that our model runs in a reasonable amount of time. Afterward, we can try to optimize our hyperparameters to make our output music sound better. We might also play around with our dataset. For example, we could limit our song dataset to certain games, increase or decrease the number of measures we sample from each input song, or decrease the resolution of each beat (right now we have a resolution of 64 ticks per beat, meaning the shortest note our model can represent lasts 1/64 of a beat).
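Lowering the time resolution could be done directly on the pianoroll matrices. The sketch below is one possible approach under assumptions: it merges groups of consecutive ticks and marks a pitch as active if it sounded anywhere in the group. The function name, the merge rule, and the factor of 16 are illustrative choices.

```python
import numpy as np


def downsample_pianoroll(roll, factor):
    """Merge every `factor` consecutive time steps into one.

    A pitch is active in the merged step if it was active in any of the
    original steps. Trailing steps that do not fill a group are dropped.
    """
    roll = np.asarray(roll)
    usable = (roll.shape[0] // factor) * factor
    grouped = roll[:usable].reshape(-1, factor, roll.shape[1])
    return grouped.max(axis=1)


# Two beats at 64 ticks per beat: a long C and a very short D.
roll = np.zeros((128, 128), dtype=np.uint8)
roll[0:64, 60] = 1
roll[70:72, 62] = 1

coarse = downsample_pianoroll(roll, factor=16)   # 64 -> 4 ticks per beat
print(coarse.shape)  # (8, 128)
```

With this merge rule, very short notes survive downsampling but are stretched to a full coarse step, which is a trade-off to weigh against the smaller sequence length.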