Relevant Links:

  • Final Report
  • Presentation Slides

Who

  • Eric Chen (echen58)
  • Mahdi Boulila
  • Pratham Rathi
  • Evan Meyer

Introduction

Autoencoders can "compress" data into a lower-dimensional representation called a latent space embedding. The model learns to encode the relevant features of the input data in this new space. We are especially interested in applying this method to music. We aim to use autoencoders (and possibly variational autoencoders) to learn latent representations of MIDI-format music. Once this representation is learned, we can interpolate between samples in the latent space to create interesting mixtures of different songs. Our stretch goal is to combine this with a lyrics autoencoder and associate interpolated lyrics with the interpolated songs.
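As a rough illustration of what we mean by interpolating in the latent space, the sketch below linearly blends two latent vectors; the function name and step count are our own placeholders, not part of any existing codebase.

```python
# Hypothetical illustration of latent-space interpolation: given the latent
# codes of two encoded songs, produce a sequence of intermediate codes that
# the decoder can turn back into "in-between" songs.
import numpy as np

def interpolate_latents(z_a, z_b, n_steps=8):
    """Linearly interpolate between two latent vectors z_a and z_b."""
    alphas = np.linspace(0.0, 1.0, n_steps)
    return [(1.0 - a) * z_a + a * z_b for a in alphas]
```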

Papers

The following papers all solve a similar problem to ours:

In particular, we were inspired by the interactive interpolation tools developed in https://ceur-ws.org/Vol-2068/milc7.pdf. While we aim to focus on the deep learning side rather than the interactive side, we still hope to achieve (non-interactive) results similar to those in this paper's demos. Our novel extension of this work is to add lyric interpolation on top of the MIDI interpolation.

What kind of problem is this?

Our base goal is an unsupervised learning task, since our data has no labels and we are trying to learn the underlying structure of the data. If we add in the lyrics, we could frame that as a supervised learning task in which the lyrics act as labels for the songs.

Related Work:

Are you aware of any, or is there any prior work that you drew on to do your project? Please read and briefly summarize (no more than one paragraph) at least one paper/article/blog relevant to your topic beyond the paper you are re-implementing/novel idea you are researching. In this section, also include URLs to any public implementations you find of the paper you’re trying to implement. Please keep this as a “living list”: if you stumble across a new implementation later down the line, add it to this list.

https://arxiv.org/pdf/1803.05428.pdf

This paper from researchers at Google details the use of variational autoencoders to interpolate between one-instrument melodies after learning their latent space representations. The paper uses BiLSTMs for its encoder and decoder architectures, as they help capture the sequential nature of the musical inputs, which are sourced from MIDI files. To better learn the sequential form of the melodies, the paper introduces a hierarchical decoder, which features a "conductor" layer. This layer treats the latent code as a set of subsections and produces an embedding for each. These embeddings are then passed into a final decoder layer to create the output music matrix, which is converted back to a MIDI file for listening. The paper structures its inputs as three instrument tracks (melody, drums, and bass) and creates a pitch-by-time matrix for each.
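To make the "conductor" idea concrete, here is a heavily simplified sketch of how such a hierarchical decoder could look; the layer types, sizes, and subsection counts below are our own illustrative choices, not the exact configuration from the paper.

```python
# Very simplified sketch of a hierarchical ("conductor") decoder as we understand
# the idea: a conductor RNN emits one embedding per subsection of the latent code,
# and a lower-level decoder RNN expands each embedding into that subsection's notes.
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    def __init__(self, latent_dim=256, cond_dim=512, out_dim=128,
                 n_subsections=16, steps_per_sub=16):
        super().__init__()
        self.n_subsections = n_subsections
        self.steps_per_sub = steps_per_sub
        self.conductor = nn.GRU(latent_dim, cond_dim, batch_first=True)
        self.decoder = nn.GRU(cond_dim, cond_dim, batch_first=True)
        self.out = nn.Linear(cond_dim, out_dim)

    def forward(self, z):
        # Repeat z once per subsection and run the conductor to get one embedding each.
        z_seq = z.unsqueeze(1).repeat(1, self.n_subsections, 1)
        sub_embs, _ = self.conductor(z_seq)          # (batch, n_subsections, cond_dim)
        outputs = []
        for i in range(self.n_subsections):
            emb = sub_embs[:, i:i + 1, :].repeat(1, self.steps_per_sub, 1)
            dec_out, _ = self.decoder(emb)           # decode each subsection separately
            outputs.append(self.out(dec_out))
        return torch.cat(outputs, dim=1)             # (batch, total time steps, out_dim)
```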

Data: What data are you using (if any)?

We will use the Lakh MIDI Dataset v0.1. We are filtering the dataset to keep only songs that have lyrics associated with them (23,570 songs after filtering). The data is relatively large, so we will cache the results of our preprocessing steps. To go from the raw MIDI files to training data, we need to (a rough code sketch follows the list):

  • Select a fixed subset of k instruments
  • Get the "piano roll" representation with n fixed time steps
  • Reshape into a tensor of shape (k instruments, 128 pitches, n time steps)
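A minimal sketch of this preprocessing, assuming the pretty_midi library; the instrument keywords, number of time steps, and sampling rate below are placeholder values rather than our final choices.

```python
# Hypothetical preprocessing sketch using the pretty_midi library.
# K_INSTRUMENT_KEYWORDS, N_STEPS, and FS are placeholder values for illustration.
import numpy as np
import pretty_midi

K_INSTRUMENT_KEYWORDS = ["piano", "bass", "guitar"]  # example fixed subset of k instruments
N_STEPS = 256                                        # fixed number of time steps
FS = 8                                               # piano-roll frames per second

def midi_to_tensor(path):
    """Load a MIDI file and return a (k, 128, N_STEPS) piano-roll tensor."""
    pm = pretty_midi.PrettyMIDI(path)
    rolls = []
    for keyword in K_INSTRUMENT_KEYWORDS:
        # Pick the first non-drum instrument whose program name contains the keyword.
        match = next(
            (inst for inst in pm.instruments
             if not inst.is_drum
             and keyword in pretty_midi.program_to_instrument_name(inst.program).lower()),
            None,
        )
        roll = match.get_piano_roll(fs=FS) if match is not None else np.zeros((128, 0))
        # Pad or truncate each roll to exactly N_STEPS time steps.
        fixed = np.zeros((128, N_STEPS))
        width = min(N_STEPS, roll.shape[1])
        fixed[:, :width] = roll[:, :width]
        rolls.append(fixed)
    return np.stack(rolls)  # shape: (k instruments, 128 pitches, N_STEPS time steps)
```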

Methodology:

What is the architecture of your model?

We will be using a variational autoencoder to learn the latent space representations of the music. In particular, we will run experiments with multiple VAE setups: one VAE that concatenates the inputs before they are fed into a single encoder, and a second VAE that has two encoders and one decoder. For the latter setup, we will concatenate the latent space representations of the two input forms before feeding them into the decoder. For the encoder and decoder, we will likely use either transformers or RNNs.
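A rough sketch of the second setup (two encoders, one decoder), assuming PyTorch; the module names, layer sizes, and use of plain linear layers are placeholders, since the real encoders and decoders will likely be transformers or RNNs.

```python
# Illustrative two-encoder / one-decoder VAE skeleton (PyTorch). The linear layers
# stand in for whatever sequence models (RNNs/transformers) we end up using.
import torch
import torch.nn as nn

class DualEncoderVAE(nn.Module):
    def __init__(self, midi_dim, lyric_dim, latent_dim=128):
        super().__init__()
        self.midi_encoder = nn.Linear(midi_dim, 2 * latent_dim)    # outputs mu and logvar
        self.lyric_encoder = nn.Linear(lyric_dim, 2 * latent_dim)  # outputs mu and logvar
        # The decoder consumes the concatenation of both latent codes.
        self.decoder = nn.Linear(2 * latent_dim, midi_dim + lyric_dim)

    @staticmethod
    def reparameterize(mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, midi_x, lyric_x):
        midi_mu, midi_logvar = self.midi_encoder(midi_x).chunk(2, dim=-1)
        lyric_mu, lyric_logvar = self.lyric_encoder(lyric_x).chunk(2, dim=-1)
        z_midi = self.reparameterize(midi_mu, midi_logvar)
        z_lyric = self.reparameterize(lyric_mu, lyric_logvar)
        z = torch.cat([z_midi, z_lyric], dim=-1)  # joint latent code
        recon = self.decoder(z)
        return recon, (midi_mu, midi_logvar), (lyric_mu, lyric_logvar)
```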

How are you training the model?

We are training the model on MIDI files and their associated lyrics. Our VAE's main goal is to learn an efficient representation of the songs while minimizing reconstruction loss.
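For reference, a standard VAE training objective (reconstruction term plus KL divergence) might look like the sketch below; the beta weight is an optional knob we may or may not use.

```python
# Standard VAE objective: reconstruction loss plus the KL divergence of the
# approximate posterior from a unit Gaussian prior. Shown for illustration only.
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, logvar, beta=1.0):
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```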

If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here. If you are doing something new, justify your design. Also note some backup ideas you may have to experiment with if you run into issues.

Our novelty is the introduction of lyrics and associating them with different parts of the song. To account for this, we propose concatenating either the music and text inputs themselves or their latent space representations. We believe this will let the model learn the relationships between the notes and the lyrics.

Metrics:

What constitutes “success?” What experiments do you plan to run? For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate? If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model. If you are doing something new, explain how you will assess your model’s performance. What are your base, target, and stretch goals?

For the overall end-to-end interpolation goal, we will likely use qualitative evaluation to judge whether the newly created songs are truly in between the endpoints they were sourced from. We could also use musical similarity algorithms to decide this more objectively. For the isolated model, we will pay attention to the reconstruction loss and accuracy. Furthermore, for the lyrical novelty of the project, we could use perplexity to check whether the generated lyrics make sense.
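As a concrete example of the perplexity metric, the helper below computes perplexity from per-token log-probabilities; the function and its inputs are hypothetical and only illustrate the calculation.

```python
# Perplexity = exp(average negative log-likelihood) over generated lyric tokens.
import math

def perplexity(token_log_probs):
    """token_log_probs: list of log-probabilities assigned to each generated token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Example: three tokens each assigned probability 0.25 gives perplexity 4.
print(perplexity([math.log(0.25)] * 3))
```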

Ethics:

Choose 2 of the following bullet points to discuss; not all questions will be relevant to all projects, so try to pick questions where there’s interesting engagement with your project. (Remember that there’s not necessarily an ethical/unethical binary; rather, we want to encourage you to think critically about your problem setup.)

Why is Deep Learning a good approach to this problem?

Learning the latent representations of the MIDI files gives us a much more practical space in which to perform calculations and interpolations than working with the raw files. Deep learning, through autoencoders, offers a proven way to learn these lower-dimensional representations.

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?

Our dataset is a subset of the Lakh MIDI Dataset. The data is a collection of MIDI files of songs that were scraped from the internet. The sourcing was done by Colin Raffel, who created the dataset for use in his thesis. The web-scraping procedure could raise some concerns, as the songs were not collected from a standardized source but gathered from all over the web. Furthermore, the dataset mostly contains more popular songs and may not have an equal distribution of genres.


Division of labor:

Briefly outline who will be responsible for which part(s) of the project:

  • Eric: MIDI pre-processing, model 1
  • Mahdi: MIDI pre-processing, model 1
  • Evan: Lyrics extraction, model 2
  • Pratham: Lyrics extraction, data collection, model 2

We are collectively working on architecture design and training.
