Final Reflection

https://docs.google.com/document/d/18IQAUf1nMDma98sr_XBVs7PJDYLDBruOghuFmFN3jkQ/edit?usp=sharing

Team Members

Amanda Lee (zlee8), James Cai (jcai20), Kate Nelson (knelson9), Inho David Lee (ilee26)

Introduction

The motivation for our idea came from watching and listening to mashups of popular songs on YouTube. Whether it is Ed Sheeran x Taylor Swift or BTS x Lauv, many of these mashups are fun to listen to, and the songs can work surprisingly well together. We wondered whether it might be possible to train a neural network to learn to create these mashups automatically. The goal of our project is thus to create a deep learning model that produces one of these musical mashups given two songs in mp3 format as inputs. This is a supervised learning problem, since we will use premade music mashups from YouTube as our training labels. We want our model's mashups to be similar to and include components of the input songs, while also being their own unique entities. We arrived at this idea by considering how we could apply deep learning to music. At first we considered recreating a music recommender, similar to what Spotify does to create Discover Weekly playlists, but we pivoted because we wanted to come up with our own deep learning problem that had not been studied as thoroughly.

Related Work

The most relevant work to our project is prior research done on generating vector embeddings for music. One article we looked closely at focused on generating these embeddings specifically for the task of genre classification. The full article can be found at the first link in the list that we provide below.

This article was particularly relevant to our project because the authors used spectrograms as the input to their models, which is the same type of data we will be using (described further in the Data section). The paper also presents multiple architectures that the authors experimented with, each with its own strengths and weaknesses, which served as a source of inspiration when we brainstormed how to design our own architecture.

The primary model of interest was their convolution-based autoencoder with supervised genre labels. The encoder consists of four convolution and max-pooling layers whose output, after being passed through a dense layer, produces the encoding vector. This encoding vector is then passed through a few more fully connected layers to predict the genre of the song, and also into a decoder that reverses the convolution layers of the encoder and computes the reconstruction loss as the mean squared error over each pixel of the reconstructed and original images. Also of note was the dataset they used, the Free Music Archive, which we could potentially use to train a variant of their encoding network and tailor it to our purposes.
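For concreteness, below is a rough Keras sketch of this kind of architecture, written from the description above rather than the paper's code; the input shape, filter counts, embedding size, and number of genres are placeholder assumptions of ours, not the paper's exact settings.

```python
# A rough Keras sketch of a convolutional autoencoder with a supervised genre
# head, as described above. All dimensions are placeholder assumptions.
import numpy as np
from tensorflow.keras import layers, Model

def build_genre_autoencoder(input_shape=(128, 128, 1), embed_dim=256, num_genres=8):
    spec = layers.Input(shape=input_shape)

    # Encoder: four convolution + max-pooling blocks, then a dense encoding vector.
    x = spec
    for filters in (16, 32, 64, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    pre_flatten_shape = tuple(x.shape[1:])
    encoding = layers.Dense(embed_dim, activation="relu")(layers.Flatten()(x))

    # Genre head: a few fully connected layers on top of the encoding vector.
    g = layers.Dense(128, activation="relu")(encoding)
    genre = layers.Dense(num_genres, activation="softmax", name="genre")(g)

    # Decoder: mirror the encoder with upsampling; reconstruction loss is
    # pixel-wise mean squared error against the original spectrogram.
    y = layers.Dense(int(np.prod(pre_flatten_shape)), activation="relu")(encoding)
    y = layers.Reshape(pre_flatten_shape)(y)
    for filters in (128, 64, 32, 16):
        y = layers.UpSampling2D(2)(y)
        y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    recon = layers.Conv2D(1, 3, padding="same", name="recon")(y)

    model = Model(spec, [genre, recon])
    model.compile(optimizer="adam",
                  loss={"genre": "sparse_categorical_crossentropy", "recon": "mse"})
    return model
```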

Also of note were a WaveNet architecture that generated embeddings from raw audio data, a paper describing how two images can be combined into one, and an article describing the generation of newly synthesized audio samples. Links to all of these sources are listed below.

https://medium.com/@rajatheb/music2vec-generating-vector-embedding-for-genre-classification-task-411187a20820
https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
https://towardsdatascience.com/neuralfunk-combining-deep-learning-with-sound-design-91935759d628
https://arxiv.org/pdf/1508.06576.pdf

Data

Labels

The label data consists of mp3 files of mashed-up songs from YouTube. For instance, one label could be an mp3 clip of a BTS x Lauv mashup. The files will be downloaded using pytube, a Python library. There are entire channels on YouTube that host hundreds of mashup songs, which makes it easier to ingest a large volume of labels.
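As a minimal sketch, downloading one label's audio with pytube might look like the snippet below; the URL and output directory are placeholders, and the conversion step to mp3 is our own assumption (YouTube audio-only streams typically arrive as mp4/webm).

```python
# Minimal pytube sketch for grabbing the audio track of one mashup label.
# URL and output directory are placeholders; the downloaded audio-only stream
# is usually mp4/webm and may still need conversion to mp3.
from pytube import YouTube

def download_label_audio(url, out_dir="labels"):
    stream = YouTube(url).streams.filter(only_audio=True).first()
    return stream.download(output_path=out_dir)

# Example (placeholder video id):
# download_label_audio("https://www.youtube.com/watch?v=<mashup_video_id>")
```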

Inputs

The inputs corresponding to a label are two songs, also mp3 files downloaded from YouTube. These inputs are the individual songs from which the mashup (the label) was made. In the example of the BTS x Lauv mashup label, the corresponding inputs are the original BTS and Lauv songs. The titles of these songs will be parsed from the label's title, and the songs will be downloaded in the same fashion.

Size

We aim to collect tens of thousands of input songs, and from the high thousands up to tens of thousands of labels.

Preprocessing

In addition to formatting the data so that two input songs are paired with one label, the audio data will be transformed into a form more conducive to use in a neural network. Each mp3 file will be converted into a spectrogram, which encodes the signal's frequency content and volume over time. We plan to use ARSS (the Analysis & Resynthesis Sound Spectrograph) to convert mp3 files to spectrograms during preprocessing.
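As a stand-in while we set up ARSS, the conversion might look something like the librosa-based sketch below. librosa is our substitution for illustration only (it needs an ffmpeg/audioread backend to decode mp3), and the sample rate and STFT settings are placeholder choices.

```python
# Illustrative mp3 -> spectrogram conversion using librosa instead of ARSS
# (a substitution for sketch purposes only).
import numpy as np
import librosa

def mp3_to_spectrogram(path, sr=22050, n_fft=2048, hop_length=512):
    """Return a log-magnitude spectrogram (frequency x time) of an mp3 file."""
    signal, _ = librosa.load(path, sr=sr, mono=True)
    stft = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)
    # Magnitude in dB captures both frequency content and loudness over time.
    return librosa.amplitude_to_db(np.abs(stft), ref=np.max)
```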

Methodology

The architecture of our model will be based on the transformer-based encoder-decoder model from the seq2seq example we saw in class. Assuming we are able to collect enough training data, we will write the encoder portion of our model ourselves. Imagining the spectrogram for a single song to be of size 1000 by 120 (as an example), we will divide the spectrogram into, say, 50 by 120 chunks to represent it as time-series data. We will then pass this sliced spectrogram into an encoder block containing a self-attention layer followed by an add-and-normalize step and fully connected layers, producing an encoding for each spectrogram slice.

The intuition is that, just as the seq2seq model learned to capture the “meaning” of a word from the relative importance of its surrounding words, our model might be able to capture the relationship of a segment of a song to the rest of the song around it. What comes before and after a given segment is highly relevant to its context, which is why an attention-based encoder makes sense here. We could also pass the spectrogram data through a series of LSTMs to produce a state for each song instead, and there are other architectural tweaks that could be made. If we simply do not have enough data, we could either use a larger database of songs such as the Free Music Archive just to train the encoder network, or use a pre-trained model like the ones documented in the literature above to generate our song encodings.
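A minimal sketch of the slicing step and one encoder block is shown below. The TensorFlow/Keras framework, the chunk size, and the attention dimensions are all our own placeholder assumptions, and the spectrogram is assumed to be oriented as (time x frequency).

```python
# Sketch of slicing a (time x frequency) spectrogram into fixed-size chunks
# and running the resulting sequence through one transformer-style encoder
# block. Framework and all dimensions are placeholder assumptions.
from tensorflow.keras import layers

def slice_spectrogram(spec, chunk_len=50):
    """Turn e.g. a 1000 x 120 spectrogram into 20 chunks of 50 x 120,
    flattened to shape (num_chunks, chunk_len * num_freq_bins)."""
    num_chunks = spec.shape[0] // chunk_len          # drop any ragged tail
    return spec[: num_chunks * chunk_len].reshape(num_chunks, -1)

def encoder_block(x, num_heads=4, key_dim=64, ff_dim=512):
    """Self-attention, add & normalize, then fully connected layers."""
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
    x = layers.LayerNormalization()(x + attn)
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(x.shape[-1])(ff)
    return layers.LayerNormalization()(x + ff)
```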

Regardless of how we generate our song encodings, once we have them we will pass the encodings of the two songs to be merged into the decoder module. The decoder will use the labeled mashed-up songs as its input, first computing self-attention over the sliced versions of these training labels. The output of this layer will then be fed into an encoder-decoder attention layer alongside the encoder outputs for both input songs. This may pose a problem because there are now two K and V vectors (one for each input song), but it could be resolved by increasing the number of features of all the vectors in the decoder module to account for the two input songs (there could even be more than two). After addition and normalization, the decoder outputs will be passed through a few more feed-forward layers, with the final layer generating a 20 by 120 output segment of the model-generated mashed song. All of the output slices will then be stitched back together to form the final mashed-song spectrogram, which can be converted back into an mp3 file using ARSS.
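A corresponding sketch of one decoder block is below. Here the two song encodings are merged by concatenating along the feature axis, which is one way to realize the "more features" idea above and assumes both songs yield the same number of slices; the dimensions are again placeholders.

```python
# Sketch of one decoder block: self-attention over the target (label) slices,
# encoder-decoder attention over the merged encodings of the two input songs,
# then feed-forward layers emitting one flattened 20 x 120 output slice per
# position. Merging by feature-axis concatenation assumes equal slice counts.
from tensorflow.keras import layers

def decoder_block(target, enc_song_a, enc_song_b,
                  num_heads=4, key_dim=64, ff_dim=512, out_dim=20 * 120):
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(target, target)
    x = layers.LayerNormalization()(target + attn)

    merged = layers.Concatenate(axis=-1)([enc_song_a, enc_song_b])  # shared K and V
    cross = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, merged)
    x = layers.LayerNormalization()(x + cross)

    ff = layers.Dense(ff_dim, activation="relu")(x)
    return layers.Dense(out_dim)(ff)   # one 20 x 120 spectrogram slice, flattened
```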

Metrics

There are both qualitative and quantitative ways to evaluate the performance of our model. Qualitatively, we can simply listen to the mp3s we generate and judge whether the output resembles any comprehensible sound at all, and whether it has indeed taken aspects of both input songs and combined them into something pleasant to listen to.

However, we also need a quantitative way to measure the performance of our model in order to train the network. The naive approach is a pixel-by-pixel comparison between the spectrograms of our model's output mashup and the training-label mashup, which would be a simple way to measure the “accuracy” of our model. However, this metric does not account for the fact that if we produced an output exactly matching the training label but shifted by a time step, we would still want to count the result as a success. To account for such situations, we will instead evaluate our model by computing the cross-correlation between the two spectrograms, which is well established in the literature as a metric for spectrogram similarity.
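As a sketch of the kind of metric we have in mind (the maximum lag and the normalization scheme are our own choices, not a fixed decision):

```python
# Sketch of a shift-tolerant similarity metric: normalized cross-correlation
# between two (frequency x time) spectrograms over a range of time lags.
import numpy as np

def spectrogram_similarity(pred, label, max_lag=50):
    pred = (pred - pred.mean()) / (pred.std() + 1e-8)
    label = (label - label.mean()) / (label.std() + 1e-8)
    best = -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = pred[:, lag:], label[:, : label.shape[1] - lag]
        else:
            a, b = pred[:, :lag], label[:, -lag:]
        n = min(a.shape[1], b.shape[1])
        if n > 0:
            # Mean elementwise product of z-scored values ~ correlation at this lag.
            best = max(best, float(np.mean(a[:, :n] * b[:, :n])))
    return best
```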

All things considered, this project will be fairly difficult to execute properly. Our base goal is therefore for our model to generate some sort of sound, whether it is pleasant to listen to or essentially just noise; producing any result at all would be something of an achievement for our architecture. Our target goal is for the model to generate mashups that evidently show elements of both input songs. We would not expect the output to be coherent or particularly similar to the videos we find on YouTube, but it would be great to see the model actually pulling elements of both inputs into the final output. Finally, our stretch goal is for the model-generated songs to resemble the original mashed-up songs from YouTube used to train the model.

Ethics

Our dataset consists of songs from YouTube and the mashed-up versions of those songs. There are no real concerns about how it was collected, since we are using legally posted songs by artists on YouTube. There may, however, be biases in these mashups: there are different genres of music, and there may be more examples of mashups in certain genres than others. Genres often correspond to particular groups of people writing them, which could lead to racial bias, since there appear to be fewer mashups of songs by underrepresented minorities in the music industry than by more represented groups. For example, we can find many mashups of Taylor Swift and Katy Perry, both highly represented in the music industry, but very few mashups of songs by underrepresented artists such as Jay Chou, a Chinese artist, or Aretha Franklin, an African-American artist. This creates a problem where our model would not work as well on songs by underrepresented artists as it would on songs by mainstream artists.

Deep learning seems well suited to this problem, since we have many examples of what is considered “good” mashup music. Because we also have the inputs, namely the songs to be mashed together, we can build a model that takes in these inputs and trains against the existing mashups. We believe a deep learning model can pick up on patterns in good mashup music that would be hard for a person to identify when creating mashups by hand.

Division of Labor

Amanda: Data ingestion and parsing (see Data section)

James: Preprocessing and architecture design

Inho (David): Model architecture and overall workflow

Kate: Research into pretrained models and metric evaluation

Updates

Data collection

As described in our previous outlines, we are using the Python package pytube to download the input and label data. For the mashup (label) data, we used the package's Playlist functionality to scrape entire playlists of mashup songs that have been created manually by YouTube content creators. Among these playlists are nearly 1,000 songs from KJ Mixes, Cageman MashUps, and others. Many of these downloads include both an audio and a video track, but we only need the audio for our purposes. We have also done some data clean-up: standardizing the file names and filtering for the mp3 data.
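The playlist scraping and file-name standardization look roughly like the sketch below; the playlist URL and the naming scheme are placeholders of ours.

```python
# Sketch of scraping an entire mashup playlist with pytube and standardizing
# the resulting file names. Playlist URL and naming scheme are placeholders.
import re
from pytube import Playlist, YouTube

def download_playlist_audio(playlist_url, out_dir="mashups"):
    for url in Playlist(playlist_url).video_urls:
        video = YouTube(url)
        safe_title = re.sub(r"[^\w\- ]", "", video.title).strip().replace(" ", "_")
        stream = video.streams.filter(only_audio=True).first()
        stream.download(output_path=out_dir, filename=f"{safe_title}.mp4")

# Example (placeholder playlist id):
# download_playlist_audio("https://www.youtube.com/playlist?list=<playlist_id>")
```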

Kate and Amanda worked on this process and found it to be fairly smooth sailing. The next steps are downloading more songs: at least 3,000 mashups and roughly 6,000 individual songs. As we download more data, we will also explore the Google Drive API so that we can store the data in the cloud rather than locally.

Preprocessing

James worked on preprocessing the audio mp3 data into spectrograms. He is experimenting with ARSS, another package, to do this, but has encountered challenges with its documentation. The package has not been well maintained, so we may need to pivot; we will consult the relevant literature to see how others have done this.

We have also decided to trim each song to 2 minutes to make the data more uniform.
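A small sketch of the trim step, assuming pydub for mp3 handling (our illustrative choice, not a settled tool; it requires ffmpeg):

```python
# Sketch of trimming each song to its first 2 minutes using pydub
# (an illustrative choice; requires ffmpeg for mp3 decoding/encoding).
from pydub import AudioSegment

def trim_to_two_minutes(in_path, out_path):
    clip = AudioSegment.from_mp3(in_path)[: 2 * 60 * 1000]   # pydub slices in ms
    clip.export(out_path, format="mp3")
```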

Model

David has created a skeleton for the model we will use, based on a seq2seq architecture. Instead of taking in a sequence of word embeddings, the model takes in a sequence of small spectrogram slices, but still passes them through a self-attention transformer to produce an embedding for each song. He will then take the embeddings of the two input songs' spectrograms and send them through the decoder, where the new spectrogram of the model-generated mashup will be created.

This is where the bulk of our future work will be. There are two major considerations going forward. First, we are contemplating whether to make more significant modifications to the seq2seq model so that it makes more intuitive sense; for example, the concept of positional encodings does not necessarily align directly with spectrogram slices, given that the spectrograms represent audio. Second, we will need to decide whether we have enough data to feasibly train our model; if not, we will need to consider using a much larger dataset of songs to at least pretrain our encoder.
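For reference while we weigh the positional-encoding question, the standard sinusoidal encoding we would be adding to the slice embeddings looks like the sketch below; the dimensions are placeholders.

```python
# Standard sinusoidal positional encoding (as in the original transformer),
# which would be added to the (num_slices, embed_dim) slice embeddings if we
# keep positional encodings in the model.
import numpy as np

def positional_encoding(num_slices, embed_dim):
    pos = np.arange(num_slices)[:, None]
    i = np.arange(embed_dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / embed_dim)
    enc = np.zeros((num_slices, embed_dim))
    enc[:, 0::2] = np.sin(angles[:, 0::2])
    enc[:, 1::2] = np.cos(angles[:, 1::2])
    return enc
```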
