Vocal Isolation (VISO)
Who: Aaron Jennis (aj62), Ben Hurd (bhurd1)
Introduction
While humans are quite adept at separating different sounds from a musical signal (vocals, individual instruments, etc.), computers have a much harder time doing so effectively. The goal of our final project is to create a model that, given a song or recording, can isolate, extract, and output the vocals of the song with as little interference from background noise as possible. We chose this task because the extraction of specific components from a song, such as vocals, can be used by artists to create other music. Artists can use this model to extract the vocals from a song and then creatively sample or alter this snippet to create their own original music. Taking popular vocals and overlaying them on an original accompaniment or beat is a very common way of making music (especially in genres such as lofi and rap), so having a way of effectively extracting vocals from songs would be very helpful for many artists. Also, listening to accurate isolated stems can be a fascinating educational tool for performers, producers, and mixers, as well as for music appreciators in general!
The paper that we will be implementing is “Singing Voice Separation Using a Deep Convolutional Neural Network Trained by Ideal Binary Mask and Cross Entropy”. The objective of this paper was to create a neural network that can extract a singing voice from its musical accompaniment. We chose this paper because it takes a novel approach to the vocal-extraction problem. Many researchers have tried to solve this problem using recurrent neural networks in order to capture temporal changes in a song. The paper we are implementing is interesting because it instead uses a convolutional neural network (an architecture traditionally associated with analyzing visual imagery) to extract vocal samples. This is possible through the use of spectrograms: visual representations of the spectrum of frequencies of an audio signal as it varies with time. By converting songs into spectrograms, a CNN can learn spatial patterns in them and use those patterns to identify the components that make up human vocals. This gives rise to a classification subproblem: training our model to decide which time-frequency regions of a spectrogram are dominated by human vocals. The overall problem, however, is a regression problem, as our model needs to predict a continuous sequence of audio composed strictly of the human vocals.
Related Work
An important piece of prior work that we will be drawing on has to do with preprocessing, specifically how to transform the time-domain signal of a music sample into a spectrogram. Getting the spectrogram of a music sample is important because it exposes the structure of the human voice in a sample and lets us use the CNN architecture described above. Since this has been shown to be effective in previous research, we will apply a Short-Time Fourier Transform (STFT) to our input signals in order to generate their corresponding spectrograms.
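To make this concrete, below is a minimal sketch of the kind of preprocessing we have in mind, using TensorFlow's tf.signal.stft. The frame length and hop size are placeholder values rather than final choices.

```python
import tensorflow as tf

def to_spectrogram(waveform, frame_length=1024, frame_step=256):
    """Convert a mono time-domain signal into magnitude and phase spectrograms.

    waveform: 1-D float32 tensor of audio samples.
    frame_length and frame_step are illustrative values, not tuned ones.
    """
    stft = tf.signal.stft(waveform, frame_length=frame_length,
                          frame_step=frame_step)
    magnitude = tf.abs(stft)       # what the CNN will see
    phase = tf.math.angle(stft)    # kept around for reconstruction later
    return magnitude, phase
```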
Another important piece of prior work that we will be drawing on is “Improving music source separation based on deep neural networks through data augmentation and network blending”, a previous attempt at solving this problem using feed-forward and recurrent neural networks. In that work, the output magnitude spectrograms required a post-processing step in which soft masks were calculated and multiplied with the output spectrograms to recreate the estimated signal. These soft masks were shown to give better separation quality than directly using the network output to synthesize the final signal. We will draw on this result: our implementation will attempt to skip the Wiener-filter post-processing step and design a network that learns the soft mask directly.
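As a rough illustration of how a directly learned soft mask would be used, here is a hedged sketch (the function name and STFT parameters are our own placeholders) of applying a predicted mask to the mixture spectrogram and resynthesizing the vocal estimate with the mixture's phase:

```python
import tensorflow as tf

def apply_soft_mask(mix_magnitude, mix_phase, soft_mask,
                    frame_length=1024, frame_step=256):
    """Multiply the mixture magnitude by a predicted soft mask (values in [0, 1])
    and invert the STFT using the mixture's phase to get a time-domain estimate."""
    vocal_magnitude = mix_magnitude * soft_mask
    # Recombine the masked magnitude with the mixture phase into a complex spectrogram.
    vocal_stft = tf.cast(vocal_magnitude, tf.complex64) * tf.exp(
        tf.complex(tf.zeros_like(mix_phase), mix_phase))
    return tf.signal.inverse_stft(vocal_stft, frame_length=frame_length,
                                  frame_step=frame_step)
```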
Public Implementations: link
Data
The dataset we will be using is musdb18, a standard dataset that contains over 10 hours of full-length music tracks along with their isolated stems (drums, bass, vocals, etc.). As mentioned before, we will need to do significant preprocessing in order to convert these music tracks and their corresponding stems into magnitude spectrograms that we can feed into our CNN. This will involve applying a Short-Time Fourier Transform (STFT) to the tracks in order to get their magnitude and phase spectrograms. These spectrograms will then be divided into relatively small windows so that the model has temporal context about a given slice of music; these windows are what we will feed into the model.
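As a sketch of this step (assuming the musdb Python package for loading the stems, a local path placeholder, and an arbitrary window size of 25 frames), the loading and windowing might look roughly like this:

```python
import numpy as np
import musdb  # pip install musdb; assumes musdb18 has been downloaded locally

def spectrogram_windows(magnitude, window_frames=25):
    """Split a [frames, bins] magnitude spectrogram into
    non-overlapping [window_frames, bins] slices for the CNN."""
    n_windows = magnitude.shape[0] // window_frames
    trimmed = magnitude[:n_windows * window_frames]
    return trimmed.reshape(n_windows, window_frames, magnitude.shape[1])

mus = musdb.DB(root="path/to/musdb18", subsets="train")  # path is a placeholder
for track in mus.tracks:
    mixture = track.audio                    # [samples, 2] stereo mixture
    vocals = track.targets["vocals"].audio   # [samples, 2] isolated vocal stem
    # ...STFT each channel, then window the magnitudes with spectrogram_windows(...)
```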
Methodology
Our model will be a convolutional neural network with four 2-D convolution layers followed by two fully connected layers (in addition to max pooling, activation functions, and dropout). The model will be fed pre-processed data (spectrograms of the audio files, taking the stereo channels into account) and produce an output that represents the isolated vocals’ spectrogram. The output will then be post-processed in order to recreate the vocal-only audio. We will train the model on the majority of our dataset, running the audio through the pipeline and calculating the loss based on the difference between the predicted spectrogram values and the actual values (our dataset contains the actual vocal stems for each song) in order to learn a soft mask, trained against the ideal binary mask, that is capable of extracting vocal information. The hardest part of implementing the model will be pre-processing the data into the format that our initial convolution layer can read. Specifically, going from audio files to spectrograms to tensors will be our biggest challenge. Similarly, we will have to do the inverse of this process in order to transform our model’s output into an audio file that can be played back, all while maintaining the stereo information for each song.
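A hedged Keras sketch of this architecture is below; the filter counts, kernel sizes, window dimensions, and dropout rates are placeholders rather than the paper's exact hyperparameters.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(window_frames=25, freq_bins=513, channels=2):
    """Sketch of the planned CNN: four 2-D convolution layers (with max pooling
    and dropout) followed by two fully connected layers, ending in a sigmoid
    soft mask the same size as the input spectrogram window."""
    inputs = layers.Input(shape=(window_frames, freq_bins, channels))
    x = inputs
    for filters in (32, 32, 64, 64):                     # four convolution layers
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(1, 2))(x)     # pool along frequency only
        x = layers.Dropout(0.25)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)          # first fully connected layer
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(window_frames * freq_bins, activation="sigmoid")(x)
    outputs = layers.Reshape((window_frames, freq_bins))(x)
    model = models.Model(inputs, outputs)
    # Cross-entropy against the ideal binary mask, as in the paper we follow.
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```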
Metrics
Success for our group will be to create a working CNN model that can isolate the vocal stem from a stereo mix. We will devise a metric for determining the accuracy of our model by comparing the predicted vocal stems to the actual vocal stems. This could be done by either subtracting the two spectrograms or flipping the phase of one signal, summing the two, and measuring the remaining signal. Once the model produces vocal stems that meet a minimum standard of quality, we will also perform listening tests with many individuals, as ultimately how well the results hold up for music listeners and creators is paramount. Our base goal is to create a CNN model that takes in a spectrogram from an audio sample and produces a spectrogram of just the vocal stem. Our target goal is to create a fully functioning model, from pre-processing through post-processing, that takes in a mixed audio sample and outputs just the vocals with high fidelity. Our stretch goal is to output the instrumental (the entire mix except the vocals) as a listenable asset as well.
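To make the comparison step concrete, below is a sketch of two simple checks along these lines (the helper names are our own, and neither is a standard separation metric such as SDR): a spectrogram-level error and a time-domain residual measurement.

```python
import numpy as np

def spectrogram_error(predicted_mag, reference_mag):
    """Mean absolute difference between predicted and reference
    vocal magnitude spectrograms (lower is better)."""
    return float(np.mean(np.abs(predicted_mag - reference_mag)))

def residual_energy_db(predicted, reference):
    """Time-domain check: subtracting the reference is equivalent to flipping its
    phase and summing; report the leftover energy in dB relative to the reference."""
    n = min(len(predicted), len(reference))
    residual = predicted[:n] - reference[:n]
    ref_power = np.sum(reference[:n] ** 2) + 1e-10
    res_power = np.sum(residual ** 2) + 1e-10
    return 10.0 * np.log10(res_power / ref_power)
```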
Ethics
The dataset we are using is biased towards some genres (pop/rock, heavy metal, etc.) more than others (jazz, reggae, latin, etc.). This bias will likely be reflected in poorer results when the model is tested and run on songs from less-represented genres. In addition, this model could be used to extract vocal stems from copyrighted songs, which could lead to unauthorized remixes. De-mixing falls into a similar category to sampling in music, where samples must be cleared with the original artist before being used in any commercial release.
Division of Labor
Aaron: pre-processing, post-processing
Ben: CNN model architecture, testing
Final Writeup/Reflection: https://docs.google.com/document/d/1X8irpEf4EPFFiCxNdHCpGAwCXpeNs5Qt09VURuetTSM/edit?usp=sharing
Built With
- tensorflow