Vocal Isolation (VISO)
Who: Aaron Jennis (aj62), Ben Hurd (bhurd1)
Introduction
While humans are quite adept at separating different sounds from a musical signal (vocals, individual instruments, etc.), computers have a much harder time doing so effectively. The goal of our final project is to create a model that, given a song or recording, can isolate, extract, and output the vocals of the song with as little interference from background noise as possible. We chose this task because the extraction of specific components from a song, such as vocals, can be used by artists to create other music. Artists can use this model to extract the vocals from a song and then creatively sample or alter this snippet to create their own original music. Taking popular vocals and overlaying them on an original accompaniment or beat is a very common way of making music (especially in genres such as lofi and rap), so having a way of effectively extracting vocals from songs would be very helpful for many artists. Also, listening to accurate isolated stems can be a fascinating educational tool for performers, producers, and mixers, as well as for music appreciators in general!
The paper that we will be implementing is “Singing Voice Separation Using a Deep Convolutional Neural Network Trained by Ideal Binary Mask and Cross Entropy”. The objective of this paper was to create a neural network that can extract a singing voice from its musical accompaniment. We chose this paper because it takes a novel approach to the vocal-extraction problem. Many researchers have tried to solve this problem using recurrent neural networks in order to capture temporal changes in a song. The paper we are implementing is interesting because it instead uses a convolutional neural network (an architecture traditionally associated with analyzing visual imagery) to extract vocal samples. This is possible through the use of spectrograms: visual representations of the spectrum of frequencies of an audio signal as it varies with time. By converting songs into spectrograms, a CNN can learn spatial patterns in them and use those patterns to identify the components that make up human vocals. This gives rise to a classification subproblem: training our model to decide which time-frequency regions of a spectrogram are dominated by human vocals. The overall problem, however, is a regression problem, as our model needs to predict a continuous sequence of audio composed strictly of the human vocals.
Related Work
An important piece of prior work that we will be drawing on has to do with preprocessing, specifically how to transform the time-domain signal of a music sample into a spectrogram. Getting the spectrogram of a music sample is important because it exposes the structure of the human voice in a sample and lets us use the CNN architecture described above. Since this has been shown to be effective in previous research, we will apply a Short-Time Fourier Transform (STFT) to our input signals in order to generate their corresponding spectrograms.
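To make this concrete, below is a minimal sketch of the kind of preprocessing we have in mind, using TensorFlow's tf.signal.stft. The frame length and hop size are placeholder values rather than final choices.

```python
import tensorflow as tf

def to_spectrogram(waveform, frame_length=1024, frame_step=256):
    """Convert a mono time-domain signal into magnitude and phase spectrograms.

    waveform: 1-D float32 tensor of audio samples.
    frame_length and frame_step are illustrative values, not tuned ones.
    """
    stft = tf.signal.stft(waveform, frame_length=frame_length,
                          frame_step=frame_step)
    magnitude = tf.abs(stft)       # what the CNN will see
    phase = tf.math.angle(stft)    # kept around for reconstruction later
    return magnitude, phase
```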
Another important piece of prior work that we will be drawing on is “Improving music source separation based on deep neural networks through data augmentation and network blending”, a previous attempt at solving this problem using feed-forward and recurrent neural networks. In that work, the output magnitude spectrograms required a post-processing step in which soft masks were calculated and multiplied with the output spectrograms to recreate the estimated signal. These soft masks were shown to give better separation quality than directly using the network output to synthesize the final signal. We will draw on this result: our implementation will attempt to skip the Wiener-filter post-processing step and design a network that learns the soft mask directly.
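As a rough illustration of how a directly learned soft mask would be used, here is a hedged sketch (the function name and STFT parameters are our own placeholders) of applying a predicted mask to the mixture spectrogram and resynthesizing the vocal estimate with the mixture's phase:

```python
import tensorflow as tf

def apply_soft_mask(mix_magnitude, mix_phase, soft_mask,
                    frame_length=1024, frame_step=256):
    """Multiply the mixture magnitude by a predicted soft mask (values in [0, 1])
    and invert the STFT using the mixture's phase to get a time-domain estimate."""
    vocal_magnitude = mix_magnitude * soft_mask
    # Recombine the masked magnitude with the mixture phase into a complex spectrogram.
    vocal_stft = tf.cast(vocal_magnitude, tf.complex64) * tf.exp(
        tf.complex(tf.zeros_like(mix_phase), mix_phase))
    return tf.signal.inverse_stft(vocal_stft, frame_length=frame_length,
                                  frame_step=frame_step)
```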
Public Implementations: link
Data
The dataset we will be using is musdb18, a standard dataset that contains over 10 hours of full-length music tracks along with their isolated stems (drums, bass, vocals, etc.). As mentioned before, we will need to do significant preprocessing in order to convert these music tracks and their corresponding stems into magnitude spectrograms that we can feed into our CNN. This will involve applying a Short-Time Fourier Transform (STFT) to the tracks in order to get their magnitude and phase spectrograms. These spectrograms will then be divided into relatively small windows so that the model has temporal context about a given slice of music; these windows are what we will feed into the model.
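As a sketch of this step (assuming the musdb Python package for loading the stems, a local path placeholder, and an arbitrary window size of 25 frames), the loading and windowing might look roughly like this:

```python
import numpy as np
import musdb  # pip install musdb; assumes musdb18 has been downloaded locally

def spectrogram_windows(magnitude, window_frames=25):
    """Split a [frames, bins] magnitude spectrogram into
    non-overlapping [window_frames, bins] slices for the CNN."""
    n_windows = magnitude.shape[0] // window_frames
    trimmed = magnitude[:n_windows * window_frames]
    return trimmed.reshape(n_windows, window_frames, magnitude.shape[1])

mus = musdb.DB(root="path/to/musdb18", subsets="train")  # path is a placeholder
for track in mus.tracks:
    mixture = track.audio                    # [samples, 2] stereo mixture
    vocals = track.targets["vocals"].audio   # [samples, 2] isolated vocal stem
    # ...STFT each channel, then window the magnitudes with spectrogram_windows(...)
```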
Methodology
Our model will be a convolutional neural network with four 2-D convolution layers followed by two fully connected layers (in addition to max pooling, activation functions, and dropout). The model will be fed pre-processed data (spectrograms of the audio files, taking the stereo channels into account) and produce an output that represents the isolated vocals’ spectrogram. The output will then be post-processed in order to recreate the vocal-only audio. We will train the model on the majority of our dataset, running the audio through the pipeline and calculating the loss based on the difference between the predicted spectrogram values and the actual values (our dataset contains the actual vocal stems for each song) in order to learn a soft mask, trained against the ideal binary mask, that is capable of extracting vocal information. The hardest part of implementing the model will be pre-processing the data into the format that our initial convolution layer can read. Specifically, going from audio files to spectrograms to tensors will be our biggest challenge. Similarly, we will have to do the inverse of this process in order to transform our model’s output into an audio file that can be played back, all while maintaining the stereo information for each song.
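A hedged Keras sketch of this architecture is below; the filter counts, kernel sizes, window dimensions, and dropout rates are placeholders rather than the paper's exact hyperparameters.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(window_frames=25, freq_bins=513, channels=2):
    """Sketch of the planned CNN: four 2-D convolution layers (with max pooling
    and dropout) followed by two fully connected layers, ending in a sigmoid
    soft mask the same size as the input spectrogram window."""
    inputs = layers.Input(shape=(window_frames, freq_bins, channels))
    x = inputs
    for filters in (32, 32, 64, 64):                     # four convolution layers
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(1, 2))(x)     # pool along frequency only
        x = layers.Dropout(0.25)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)          # first fully connected layer
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(window_frames * freq_bins, activation="sigmoid")(x)
    outputs = layers.Reshape((window_frames, freq_bins))(x)
    model = models.Model(inputs, outputs)
    # Cross-entropy against the ideal binary mask, as in the paper we follow.
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```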
Metrics
Success for our group will be to create a working CNN model that can isolate the vocal stem from a stereo mix. We will devise a metric for determining the accuracy of our model by comparing the predicted vocal stems to the actual vocal stems. This could be done by either subtracting the two spectrograms or flipping the phase of one signal, summing the two, and measuring the remaining signal. Once the model produces vocal stems that meet a minimum standard of quality, we will also perform listening tests with many individuals, as ultimately how well the results hold up for music listeners and creators is paramount. Our base goal is to create a CNN model that takes in a spectrogram from an audio sample and produces a spectrogram of just the vocal stem. Our target goal is to create a fully functioning model, from pre-processing through post-processing, that takes in a mixed audio sample and outputs just the vocals with high fidelity. Our stretch goal is to output the instrumental (the entire mix except the vocals) as a listenable asset as well.
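To make the comparison step concrete, below is a sketch of two simple checks along these lines (the helper names are our own, and neither is a standard separation metric such as SDR): a spectrogram-level error and a time-domain residual measurement.

```python
import numpy as np

def spectrogram_error(predicted_mag, reference_mag):
    """Mean absolute difference between predicted and reference
    vocal magnitude spectrograms (lower is better)."""
    return float(np.mean(np.abs(predicted_mag - reference_mag)))

def residual_energy_db(predicted, reference):
    """Time-domain check: subtracting the reference is equivalent to flipping its
    phase and summing; report the leftover energy in dB relative to the reference."""
    n = min(len(predicted), len(reference))
    residual = predicted[:n] - reference[:n]
    ref_power = np.sum(reference[:n] ** 2) + 1e-10
    res_power = np.sum(residual ** 2) + 1e-10
    return 10.0 * np.log10(res_power / ref_power)
```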
Ethics
The dataset we are using is biased towards some genres (pop/rock, heavy metal, etc.) more than others (jazz, reggae, latin, etc.). This bias will likely be reflected in poorer results when the model is tested and run on songs from less-represented genres. In addition, this model could be used to extract vocal stems from copyrighted songs, which could lead to unauthorized remixes. De-mixing falls into a similar category to sampling in music, where samples must be cleared with the original artist before being used in any commercial release.
Division of Labor
Aaron: pre-processing, post-processing
Ben: CNN model architecture, testing
Final Writeup/Reflection: https://docs.google.com/document/d/1X8irpEf4EPFFiCxNdHCpGAwCXpeNs5Qt09VURuetTSM/edit?usp=sharing
Built With
- tensorflow