Who

Louis Geer (ljgeer), Kyle Lee (klee175)

Github Link

https://github.com/l-ouis/KAL-Duet/tree/final

Final writeup

https://docs.google.com/document/d/1T-cnnK_ZQZ-wTluHNxljhv5yOEX00picgrS9QaOencg/edit

Introduction

The problem we are trying to solve is generating a piano accompaniment for a given melody. We are reimplementing this paper: https://mct-master.github.io/machine-learning/2022/05/20/kriswent-generating-piano-accompaniments-using-fourier-transforms.html because it provides a good high-level overview of existing deep-learning approaches to piano accompaniment generation (seq2seq, LSTMs, Transformers, etc.). The paper's objective is to use a Transformer's self-attention mechanism to learn the relationships between notes and to capture long-term, high-level structure in music. This is essentially a generative sequence task, except the tokens are musical notes rather than words.

Related Work

Papers (living list): https://mct-master.github.io/machine-learning/2022/05/20/kriswent-generating-piano-accompaniments-using-fourier-transforms.html

https://arxiv.org/pdf/2002.03082.pdf

https://www.duo.uio.no/bitstream/handle/10852/95694/1/UiO_Master_Thesis_benjamas.pdf

https://magenta.tensorflow.org/performance-rnn

In the first paper, Kristian Wentzel builds a Transformer model to generate accompaniments for piano melodies. Wentzel also builds a model dubbed "FNet", a variant of the Transformer that replaces the attention layer with a Fourier transform layer. Wentzel tokenizes the MIDI files in the fashion described by Google Magenta's Performance RNN. In the end, Wentzel generates somewhat convincing accompaniments: parts of them are well-structured and harmonically sound, but some individual sections are off-tempo and sound slightly "confused".
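To make the tokenization concrete, below is a rough sketch of a Performance-RNN-style event encoding using the pretty_midi library. The vocabulary here is simplified (we omit velocity events, and the time-shift binning is our own placeholder choice rather than Magenta's exact scheme), and the file name is hypothetical.

```python
# Simplified Performance-RNN-style event tokenization (our own vocabulary).
import pretty_midi

TIME_STEP = 0.01          # 10 ms resolution for TIME_SHIFT events (placeholder)
MAX_SHIFT_STEPS = 100     # cap a single TIME_SHIFT event at 1 second

def midi_track_to_events(instrument):
    """Convert one pretty_midi Instrument into a list of event strings."""
    # Build (time, kind, pitch) tuples for every note-on and note-off boundary.
    boundaries = []
    for note in instrument.notes:
        boundaries.append((note.start, "NOTE_ON", note.pitch))
        boundaries.append((note.end, "NOTE_OFF", note.pitch))
    boundaries.sort()

    events, current_time = [], 0.0
    for time, kind, pitch in boundaries:
        # Emit TIME_SHIFT events to advance the clock to this boundary.
        steps = int(round((time - current_time) / TIME_STEP))
        while steps > 0:
            shift = min(steps, MAX_SHIFT_STEPS)
            events.append(f"TIME_SHIFT_{shift}")
            steps -= shift
        current_time = time
        events.append(f"{kind}_{pitch}")
    return events

pm = pretty_midi.PrettyMIDI("example.mid")   # hypothetical file name
melody_events = midi_track_to_events(pm.instruments[0])
```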

Data

We will be using raw MIDI files from multiple online sources, primarily the Lakh MIDI Dataset v0.1 (https://colinraffel.com/projects/lmd/). If this does not suffice, we may supplement it with other online MIDI databases. Since the MIDI format is standardized, we will not have to worry much about normalizing the files into one form. The dataset contains 176,581 MIDI files, 45,129 of which are matched and aligned to entries in the Million Song Dataset; we believe this is more than enough data to train a substantial model. Even so, we will have to do significant preprocessing. We must sift through the data to ensure each file can be used in a "melody and accompaniment" manner, which means making sure the MIDI file is not simply a single channel. Most piano MIDI files have two channels: the first for the right hand and the second for the left hand. We may also apply slightly more complex heuristics, such as counting the number of chords in each channel to algorithmically determine which channel is the melody and which is the accompaniment; a sketch of this step follows.
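As a rough illustration of the preprocessing described above, the sketch below (using pretty_midi; the function names and the 20 ms onset tolerance are our own placeholder choices) keeps only files with exactly two non-drum tracks and guesses that the track with fewer simultaneous-note onsets is the melody.

```python
# Hedged sketch of the two-track filter and chord-counting heuristic.
import pretty_midi

def count_chord_onsets(instrument, tolerance=0.02):
    """Count onsets where two or more notes start within `tolerance` seconds."""
    starts = sorted(note.start for note in instrument.notes)
    chords, i = 0, 0
    while i < len(starts):
        j = i
        while j + 1 < len(starts) and starts[j + 1] - starts[i] <= tolerance:
            j += 1
        if j > i:
            chords += 1
        i = j + 1
    return chords

def split_melody_accompaniment(path):
    """Return (melody_track, accompaniment_track), or None if the file doesn't fit."""
    try:
        pm = pretty_midi.PrettyMIDI(path)
    except Exception:
        return None                      # skip corrupt files in the dataset
    tracks = [inst for inst in pm.instruments if not inst.is_drum]
    if len(tracks) != 2:
        return None                      # we only target two-channel piano files
    a, b = tracks
    # Heuristic: the accompaniment tends to contain more chords than the melody.
    return (a, b) if count_chord_onsets(a) <= count_chord_onsets(b) else (b, a)
```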

Methodology

We plan on building upon the Transformer model, which has been found to perform better on piano accompaniment generation than LSTM-based models. We plan on using the standard attention layer, although alternatives exist, such as a Fourier transform layer, which we can substitute in if our model is too costly to train. During training, the melody will be the input to the encoder and the accompaniment will be passed to the decoder. The encoder's output will feed the decoder's multi-head cross-attention layer, and the decoder's output will be the generated accompaniment for the melody. At test time, the decoder will instead be fed its own previously generated tokens autoregressively, one step at a time. The hardest part of implementing the existing paper will likely be preprocessing the input data, because the Transformer model itself will be similar to what we used in class (encoder + decoder blocks); a rough sketch of the architecture follows this paragraph. The paper leaves a large degree of freedom in how the music is tokenized, so we are unsure whether the approach mentioned above will work. Since we are not implementing a brand-new idea, our code will be written from scratch and the model will likely differ from the paper's. If we run into issues with the model itself, we will look at our homework 5 implementation for relevant pieces to reuse. One bottleneck mentioned in the paper is how long the model takes to train, even with GPU power, so a backup idea is to start from a pretrained model to establish a solid baseline.
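Below is a minimal sketch of the encoder-decoder setup described above, written with PyTorch's built-in nn.Transformer. The framework choice, hyperparameters, and shared melody/accompaniment vocabulary are placeholder assumptions on our part, positional encodings are omitted for brevity, and this is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AccompanimentTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=3):
        super().__init__()
        # Shared embedding table for melody and accompaniment tokens (an assumption);
        # positional encodings are omitted here for brevity.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, melody_ids, accomp_ids):
        # Causal mask so each accompaniment position attends only to its past.
        tgt_mask = self.transformer.generate_square_subsequent_mask(accomp_ids.size(1))
        hidden = self.transformer(
            self.embed(melody_ids),      # encoder input: melody tokens
            self.embed(accomp_ids),      # decoder input: shifted accompaniment tokens
            tgt_mask=tgt_mask,
        )
        return self.out(hidden)          # (batch, tgt_len, vocab_size) next-token logits

# Tiny smoke test with random token ids.
model = AccompanimentTransformer(vocab_size=512)
melody = torch.randint(0, 512, (2, 64))
accomp = torch.randint(0, 512, (2, 64))
logits = model(melody, accomp)           # torch.Size([2, 64, 512])
```

During training we would feed the ground-truth accompaniment shifted right (teacher forcing); at test time the same forward pass would be run repeatedly on whatever tokens have been generated so far.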

Metrics

Once we have a working model, we plan to manually listen to the generated accompaniments alongside the input melodies to judge the quality of our model's music generation. In this case, "success" would be an accompaniment that plays in the same key as the melody and has consistent timing and rhythm. The notion of "accuracy" is less clear-cut for our project because the task is piano accompaniment generation and the tokens are notes. However, one metric we are considering is the dynamic time warping (DTW) distance, which measures the alignment between the generated accompaniment and the input melody; a small sketch of this metric follows. We can also explore chord and rhythm accuracy metrics that measure how far off the accompaniment's timing is relative to the melody. In the paper, the author does not discuss any specific metric used to quantify the model's results. Rather, he played the generated accompaniments with the melody and observed specific attributes, such as whether the key was the same, the timing matched, and the rhythm was consistent. The author notes that he was not concerned with accuracy because he was working with a generative model.
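As a sketch of the DTW metric we are considering, the function below computes a plain NumPy alignment cost between two pitch sequences; in practice an existing implementation (e.g., librosa's or fastdtw's) could be used instead, and the choice of per-step pitch values as the feature is our own assumption.

```python
# Minimal dynamic time warping (DTW) over 1-D pitch sequences (lower is better).
import numpy as np

def dtw_distance(melody_pitches, accomp_pitches):
    """DTW alignment cost between two 1-D pitch sequences."""
    x = np.asarray(melody_pitches, dtype=float)
    y = np.asarray(accomp_pitches, dtype=float)
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])              # local distance between pitches
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[n, m]

print(dtw_distance([60, 62, 64, 65], [60, 60, 62, 64, 65]))  # tiny example
```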

Our base, target, and stretch goals are as follows:

Base: Get reasonable results using a pretrained model to evaluate how best to preprocess the data.

Target: Get results comparable to the paper's using a working Transformer model that generates reasonable-sounding accompaniments and does well on the metrics mentioned above.

Stretch: Utilize Fourier transform layers in a working Transformer model (i.e., project the input melodic sequence into the frequency domain) to train the model faster; a rough sketch of such a layer follows.
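For the stretch goal, below is a hedged sketch of an FNet-style mixing block in PyTorch, in which the self-attention sublayer is replaced by a parameter-free 2-D Fourier transform whose real part is kept; the layer sizes are placeholders and this is not Wentzel's exact FNet implementation.

```python
import torch
import torch.nn as nn

class FourierMixingBlock(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(d_ff, d_model),
        )

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        # Mix tokens with a 2-D FFT over the sequence and hidden dimensions,
        # keeping only the real part, instead of computing attention weights.
        mixed = torch.fft.fft2(x).real
        x = self.norm1(x + mixed)
        return self.norm2(x + self.ff(x))

block = FourierMixingBlock()
out = block(torch.randn(2, 64, 256))           # torch.Size([2, 64, 256])
```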

Ethics

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?

Our intended dataset is the Lakh MIDI Dataset v0.1. This dataset is largely composed of Western songs transcribed into MIDI form (most songs in MIDI form are Western, partly because the MIDI format aligns with Western music theory). This alone is fine, since Western music is prevalent in many of the communities we are part of, but it is important to note that we will be building our model on Western music and feeding it data drawn from Western music. Music in MIDI form also differs from natural language in that it is harder for someone to incorporate bias directly into it.

Will the way in which we do preprocessing introduce any potential bias? We will be making several assumptions during preprocessing. We will be targeting MIDI files with only two channels and assuming the first channel is the right hand (melody) and the second channel is the harmony (accompaniment). This relies on the assumption that the right hand usually plays the melody; while that is mostly true in modern Western music, it is not necessarily true in classical piano or other forms of piano music.

Division of labor

Louis will focus on gathering the input data and preprocessing it. Kyle will work on constructing the Transformer model and tuning hyperparameters. We will train the model together and help each other with whatever part either of us gets stuck on.

Project check-in 3

https://docs.google.com/document/d/1uE05XRxG0zzCsYWiP1q2sS7aSa0qGLoBb8Jz0axnGNQ/edit?usp=sharing
