
Updates

Introduction

The motivation for our idea came from watching and listening to mashups of popular songs on YouTube. Whether it be Ed Sheeran x Taylor Swift or BTS x Lauv, many of these mashups are fun to listen to and can work surprisingly well together. We were wondering whether it might be possible to train a neural network to learn to create these mashups automatically. The goal of our project is thus to create a deep learning model that produces one of these musical mashups given two songs in mp3 format as inputs. This is a supervised learning problem, since we will use the premade music mashups on YouTube as our training labels. We want the mashups our model creates to be similar to and include components of the input songs, while also being their own unique entities.

Data collection

As described in our previous outlines, we are using the Python package pytube to download the input and label data. For the label data (the mashups), we used the package's Playlist functionality to scrape entire playlists of mashup songs that have been created manually by YouTube content creators. These playlists contain nearly 1,000 songs from KJ Mixes, Cageman MashUps, and others. Many of these downloads include both an audio and a video track, but we only need the audio data for our purposes. We have also done some data clean-up, standardizing the file names and filtering for the mp3 data.
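For reference, the download step looks roughly like the sketch below. The playlist URL, output directory, and file-naming scheme are placeholders rather than our exact setup, but the pytube calls (Playlist, audio-only stream filtering, download) are the ones we rely on:

```python
from pathlib import Path
from pytube import Playlist

# Placeholder playlist URL and output directory
PLAYLIST_URL = "https://www.youtube.com/playlist?list=PLACEHOLDER"
OUT_DIR = Path("data/mashups")
OUT_DIR.mkdir(parents=True, exist_ok=True)

playlist = Playlist(PLAYLIST_URL)
for video in playlist.videos:
    # Keep only the audio track; we don't need the video stream
    stream = video.streams.filter(only_audio=True).first()
    if stream is None:
        continue
    # Standardize the file name while downloading
    safe_title = "".join(c for c in video.title if c.isalnum() or c in " -_").strip()
    stream.download(output_path=str(OUT_DIR), filename=f"{safe_title}.mp3")
```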

Kate and Amanda worked on this process and have found it pretty smooth sailing. Next steps are to download more songs: at least 3,000 mashups and ~6,000 individual songs. Furthermore, as we download more data, we will explore the Google Drive API so we can store it in the cloud rather than locally.

Preprocessing

James worked on the preprocessing of the audio mp3 data into spectrograms. He is experimenting with ARSS, another package, to do this, but has encountered challenges with the documentation. The package has not been well maintained, so we may need to pivot. We will consult the relevant literature to see how others have handled this step.

We have also decided to trim each song to 2 minutes to make the data more uniform.
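To illustrate what this preprocessing step would produce, here is a minimal sketch using librosa as a stand-in for ARSS. The library choice and all parameter values here are assumptions for illustration, not final decisions; it loads the first 2 minutes of an mp3 and converts it to a log-scaled mel spectrogram:

```python
import librosa
import numpy as np

def mp3_to_spectrogram(path, sr=22050, duration=120.0, n_mels=128):
    """Load the first `duration` seconds of an mp3 and return a log mel spectrogram."""
    # Trim every song to the same length (2 minutes) so the data is uniform
    y, sr = librosa.load(path, sr=sr, duration=duration)
    # Pad shorter clips so every example has exactly the same number of samples
    target_len = int(sr * duration)
    if len(y) < target_len:
        y = np.pad(y, (0, target_len - len(y)))
    # Mel spectrogram, then convert power to decibels for a usable dynamic range
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)
```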

Model

David has created a skeleton for the model we will use, based on a seq2seq architecture. Instead of taking in a sequence of word embeddings, the model takes in a sequence of small spectrogram slices and, similarly, passes them through a self-attention transformer to produce an embedding for each song. The embeddings for the two input songs' spectrograms will then be sent through the decoder, which generates the spectrogram of the model-produced mashup.
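A rough sketch of what such a skeleton might look like in PyTorch is below. The layer sizes, the use of nn.Transformer modules, and the way the two songs' embeddings are combined (simple concatenation along the time axis) are all assumptions for illustration, not David's actual code:

```python
import torch
import torch.nn as nn

class MashupSeq2Seq(nn.Module):
    """Encode two songs' spectrogram slices, then decode a mashup spectrogram."""

    def __init__(self, n_mels=128, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        # Project each spectrogram slice (one time frame of n_mels bins) into d_model
        self.input_proj = nn.Linear(n_mels, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # Project decoder outputs back to spectrogram bins
        self.output_proj = nn.Linear(d_model, n_mels)

    def forward(self, song_a, song_b, target):
        # song_a, song_b, target: (batch, time, n_mels) spectrograms
        memory_a = self.encoder(self.input_proj(song_a))
        memory_b = self.encoder(self.input_proj(song_b))
        # Combine the two songs' encodings by concatenating along the time axis
        memory = torch.cat([memory_a, memory_b], dim=1)
        out = self.decoder(self.input_proj(target), memory)
        return self.output_proj(out)

# Toy example with a shortened time axis for illustration
a = torch.randn(1, 200, 128)
b = torch.randn(1, 200, 128)
tgt = torch.randn(1, 200, 128)
model = MashupSeq2Seq()
mashup_spec = model(a, b, tgt)  # shape: (1, 200, 128)
```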

This is where the bulk of our future work will be. There are two major considerations. First, we are contemplating whether to make more significant modifications to the seq2seq model so that it makes more intuitive sense; for example, the concept of positional encodings doesn't necessarily align directly with spectrograms, given that spectrograms represent audio. Second, we will need to decide whether we have enough data to feasibly train our model; if not, we will need to consider using a much larger dataset of songs to, at the very least, pretrain our encoder.
