Amanda Lee (zlee8), James Cai (jcai20), Kate Nelson (knelson9), Inho David Lee (ilee26)
The motivation for our idea came from watching and listening to mashups of popular songs on YouTube. Whether it be Ed Sheeran x Taylor Swift or BTS x Lauv, many of these mashups are fun to listen to and can work surprisingly well. We wondered whether it might be possible to train a neural network to create these mashups automatically. The goal of our project is thus to build a deep learning model that creates such a musical mashup given two songs in mp3 format as inputs. This is a supervised learning problem, since we will use premade music mashups from YouTube as our training labels. We want our model to create mashups that are similar to and include components of the input songs, but are also their own unique entities. We arrived at this idea by considering how we could apply deep learning to music. At first we considered recreating a music recommender, similar to what Spotify does to create Discover Weekly playlists, but we pivoted because we wanted to pose our own deep learning problem, one that had not been studied as thoroughly.
The most relevant work to our project is prior research done on generating vector embeddings for music. One article we looked closely at focused on generating these embeddings specifically for the task of genre classification. The full article can be found at the first link in the list that we provide below.
This article was particularly relevant to our project for two reasons. First, the authors used spectrograms as the input to their models, which is the same type of data we will be using (described further in the data section). Second, the paper presents multiple architectures, each with its own strengths and weaknesses, which served as a source of inspiration when we brainstormed our own architecture.
The primary model of interest was their convolution-based autoencoder with supervised genre labels. The encoder consisted of four convolution and max-pooling layers whose output, after being passed through a dense layer, produced the encoding vector. This encoding vector was then used to predict the genre of the song after passing through a few more fully connected layers, and was also fed into a decoder that reversed the encoder's convolution layers; the reconstruction loss was computed as the mean squared error over each pixel of the reconstructed and original images. Also of note was the dataset they used, the Free Music Archive, which we could potentially use to train a variant of their encoding network and tailor it to our purposes.
Other relevant works include a WaveNet architecture that generated embeddings from raw audio data, a paper describing how two images could be reconstructed into one, and an article describing the generation of newly synthesized audio samples. The sources for all of these papers are listed below.
The label data consists of mp3 files of mashed-up songs from YouTube. For instance, one label could be an mp3 clip of a BTS x Lauv mashup. The files will be downloaded using PyTube, a Python library. Entire YouTube channels host hundreds of mashup songs, which expedites ingesting a high volume of labels.
The inputs corresponding to a label are two songs, also mp3 files originating from YouTube: the individual songs from which the mashup (the label) was made. For the BTS x Lauv mashup label, the corresponding inputs are the original BTS and Lauv songs. The titles of these songs will be parsed from the label's title, and the songs will be downloaded in the same fashion.
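The exact parsing logic will depend on each channel's title conventions, but assuming titles follow a common pattern like "Artist1 - Song1 x Artist2 - Song2 (Mashup)", a minimal sketch of the parser might look like the following (the function name and the title convention are our assumptions, not a fixed spec):

```python
import re

def parse_mashup_title(title):
    """Split a mashup video title into its two source-song titles.

    Assumes the common "A x B" title convention; real channels will
    need per-channel rules layered on top of this.
    """
    # Drop trailing qualifiers such as "(Mashup)" or "[MASHUP]".
    cleaned = re.sub(r"[\(\[][^)\]]*mashup[^)\]]*[\)\]]", "", title,
                     flags=re.IGNORECASE).strip()
    # Split on the " x " separator between the two songs.
    parts = re.split(r"\s+[xX]\s+", cleaned)
    if len(parts) != 2:
        raise ValueError(f"Unrecognized mashup title format: {title!r}")
    return parts[0].strip(), parts[1].strip()
```

The two returned titles can then be fed to YouTube search and downloaded with PyTube in the same way as the labels.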
We aim to have tens of thousands of input songs and high thousands to tens of thousands of labels.
In addition to formatting the data so that two input songs are paired with one label, the audio will be converted into a representation more conducive to use in a neural network. Each mp3 file will be converted into a spectrogram, which captures signal frequency and volume with respect to time. The ARSS Python library will be used to convert mp3 files to spectrograms during preprocessing.
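ARSS will handle the actual conversion, but conceptually a spectrogram is a short-time Fourier transform (STFT) of the waveform. A minimal numpy sketch of the idea (the frame and hop sizes here are illustrative, not ARSS's actual parameters):

```python
import numpy as np

def spectrogram(signal, frame_size=256, hop_size=128):
    """Magnitude spectrogram via a short-time Fourier transform.

    Returns an array of shape (num_frames, frame_size // 2 + 1):
    one row of frequency-bin magnitudes per time step.
    """
    window = np.hanning(frame_size)
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop_size):
        frame = signal[start:start + frame_size] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)
```

For example, a 440 Hz sine wave sampled at 8 kHz should concentrate its energy near frequency bin 440 / 8000 * 256 ≈ 14.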
The architecture of our model will be based on the transformer encoder-decoder model from the seq2seq example we saw in class. Assuming we are able to collect enough training data, we will write the encoder ourselves. Imagining the spectrogram for a single song to be of size 1000 by 120 (as an example), we will divide the spectrogram into, say, 50 by 120 chunks to represent it as time-series data. We will then pass this sliced spectrogram into an encoder block containing a self-attention layer followed by add & normalize and fully connected layers, producing an encoding of the song for each spectrogram slice.
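Using the illustrative 1000 by 120 spectrogram and 50 by 120 chunks, the slicing step amounts to reshaping the time axis into a sequence of fixed-length segments (the numbers are placeholders, not final hyperparameters):

```python
import numpy as np

def slice_spectrogram(spec, chunk_len=50):
    """Split a (time, frequency) spectrogram into a sequence of chunks.

    A 1000 x 120 spectrogram with chunk_len=50 becomes a sequence of
    20 chunks, each 50 x 120 -- the "tokens" fed to the encoder.
    """
    num_chunks = spec.shape[0] // chunk_len
    # Drop any trailing frames that don't fill a complete chunk.
    trimmed = spec[:num_chunks * chunk_len]
    return trimmed.reshape(num_chunks, chunk_len, spec.shape[1])
```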
The intuition is that, just as the seq2seq model learned to capture the “meaning” of a word from the relative importance of its surrounding words, our model might capture the relationship of a segment of a song to the segments around it. What comes before and after a given segment is highly relevant to its context, which is why an attention-based encoder makes sense here. Alternatively, we could pass the spectrogram data through a series of LSTMs to produce a state for each song, and other architectural tweaks are possible. If we simply do not have enough data, we could use a larger database of songs such as the Free Music Archive to train just the encoder network, or use a pre-trained model like those documented in the literature above to generate our song encodings.
Regardless of how we generate the song encodings, we will pass the encodings of the two songs to be merged into the decoder module. The decoder will take the labeled mashed-up songs as its input and first compute self-attention over the sliced versions of these training labels. The output of this layer will then be fed into an encoder-decoder attention layer alongside the outputs of the encoders for the two input songs. This poses a problem because there are now two K and V vectors (one for each input song), but it may be resolved by increasing the number of features of all the vectors in the decoder module to account for the two input songs (there could even be more than two). After addition and normalization, the decoder outputs will pass through a few more feed-forward layers, with the last layer generating a 20 by 120 output segment of the model-generated mashed song. All the output slices will be stitched back together to form the final mashed-song spectrogram, which can then be converted back into an mp3 file using ARSS.
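One simple alternative to widening the feature dimension is to concatenate both encoders' outputs along the sequence axis before the encoder-decoder attention, so each decoder position can attend over segments of either input song. A numpy sketch of that single attention step, with illustrative dimensions and no learned projection matrices:

```python
import numpy as np

def attention(queries, keys, values):
    """Scaled dot-product attention."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

def encoder_decoder_attention(decoder_states, enc_a, enc_b):
    """Attend from decoder positions over BOTH input songs' encodings.

    enc_a, enc_b: (seq_len, d) encoder outputs for the two songs,
    concatenated so keys/values cover segments of either song.
    """
    memory = np.concatenate([enc_a, enc_b], axis=0)
    return attention(decoder_states, memory, memory)
```

This keeps the decoder's feature size unchanged and extends naturally to more than two input songs.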
There are both qualitative and quantitative ways to evaluate the performance of our model. Qualitatively, we can simply listen to the mp3s we generate and judge whether the output resembles comprehensible sound at all, and whether it has indeed taken aspects of both input songs and mashed them into something pleasant to listen to.
However, we need a quantitative measure in order to train our network. The naive approach is a pixel-by-pixel comparison between the spectrograms of our model's output mashup and the training-label mashup. This would be a simple way to measure the “accuracy” of our model, but it fails to account for the fact that an output exactly matching the training label, only shifted by a time step, should still count as a success. To account for such situations, we will instead evaluate our model by computing the cross-correlation between the two spectrograms, which is well established in the literature as a metric for spectrogram similarity.
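A shift-tolerant similarity can be computed as the maximum normalized cross-correlation over time lags: slide one spectrogram past the other and take the best alignment. A numpy sketch (the lag range is an assumption, and a real implementation would likely use FFT-based correlation for speed):

```python
import numpy as np

def max_cross_correlation(spec_a, spec_b, max_lag=10):
    """Maximum normalized cross-correlation between two (time, freq)
    spectrograms over time shifts of up to +/- max_lag frames.

    Returns a value in [-1, 1]; 1.0 means a perfect match at some lag.
    """
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        # Align the overlapping portions of the two spectrograms.
        if lag >= 0:
            a, b = spec_a[lag:], spec_b[:spec_b.shape[0] - lag]
        else:
            a, b = spec_a[:spec_a.shape[0] + lag], spec_b[-lag:]
        n = min(a.shape[0], b.shape[0])
        a, b = a[:n].ravel(), b[:n].ravel()
        a = a - a.mean()
        b = b - b.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 0:
            best = max(best, float(a @ b / denom))
    return best
```

A spectrogram compared with itself scores 1.0, and a copy shifted by a few frames still scores near 1.0 at the matching lag, which is exactly the tolerance the pixel-by-pixel metric lacks.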
Taken together, this project will be fairly difficult to execute properly. Our base goal is thus for our model to generate some sort of sound, whether pleasant to listen to or basically just noise; producing any result at all would be somewhat of an achievement for our architecture. Our target goal is for the model to generate song mashups that evidently show elements of both input songs. We would not expect these to be coherent or much like the videos we find on YouTube, but it would be great to see the model actually pulling elements of both input songs into a final output. Finally, our stretch goal is for the model-generated songs to resemble the original mashed-up songs from YouTube used to train the model.
Our dataset consists of songs from YouTube and the mashed-up versions of those songs. There are no real concerns about how it was collected, since we are only using songs legally posted by artists on YouTube. The mashups may nonetheless contain biases: some genres have many more mashup examples than others, and genres often correspond to particular groups of artists. This could introduce racial bias, since there appear to be fewer mashups of songs by underrepresented minorities in the music industry than by more represented groups. For example, we can find many mashups between Taylor Swift and Katy Perry, both of whom are highly represented in the music industry, but very few mashups of songs by underrepresented artists such as Jay Chou, a Chinese artist, or Aretha Franklin, an African-American artist. As a result, our model would likely not work as well on songs by underrepresented artists as on songs by mainstream artists.
Deep learning seems well suited to this problem because we have many examples of what is considered “good” mashup music. Since the inputs are the songs to be mashed together, we can build a model that takes these inputs and trains against the existing mashups. We believe a deep learning model can surface patterns in good mashup music that would be hard for a person to identify when creating mashups by hand.
Division of Labor
Amanda: Data ingestion and parsing (see Data section)
James: Preprocessing and architecture design
Inho (David): Model architecture and overall workflow
Kate: Research into pretrained models and metric evaluation