Introduction
While humans are quite adept at separating the different sounds within a musical signal (vocals, individual instruments, etc.), computers have a much harder time doing so effectively. The goal of our final project is to create a model that, given a song or recording, can isolate, extract, and output the song's vocals with as little interference from background noise as possible. We chose this problem because extracting specific components from a song, such as its vocals, can help artists create new music: an artist can pull the vocals from a track and then creatively sample or alter that snippet in an original piece. Taking popular vocals and overlaying them on an original accompaniment or beat is a very common way of making music (especially in genres such as lofi and rap), so a tool that effectively extracts vocals from songs would be genuinely useful to many artists. Listening to accurate isolated stems can also be a fascinating educational tool for performers, producers, and mixers, as well as for music appreciators in general!
The paper that we will be implementing is “Singing Voice Separation Using a Deep Convolutional Neural Network Trained by Ideal Binary Mask and Cross Entropy”. The objective of this paper was to create a neural network that can extract a singing voice from its musical accompaniment. We chose this paper because it takes a novel approach to the vocal-extraction problem. Many researchers have attacked this problem with recurrent neural networks in order to capture temporal changes in a song; the paper we are implementing instead uses a convolutional neural network (something traditionally associated with analyzing visual imagery) to extract vocal samples. This is possible through the use of spectrograms: visual representations of an audio signal's spectrum of frequencies as it varies with time. By converting songs into spectrograms, a CNN can learn spatial patterns in them and use those patterns to identify the components that make up the human voice. This classification task, deciding whether a given region of audio contains human vocals, is a subproblem of our project. The overall problem is technically a regression problem, as our model needs to predict a continuous sequence of audio composed strictly of the human vocals.
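To make the “Ideal Binary Mask” in the paper's title concrete, here is a minimal sketch of how such a training target is commonly computed; the function name, FFT parameters, and exact comparison are our assumptions, not necessarily the paper's. Each time-frequency bin of the mask is 1 where the vocal stem's magnitude exceeds the accompaniment's, and 0 otherwise:

```python
import numpy as np
import librosa

def ideal_binary_mask(vocals, accompaniment, n_fft=1024, hop_length=256):
    """Per time-frequency bin: 1 if the vocal stem dominates, else 0.

    A hypothetical sketch of the standard IBM construction; the paper's
    exact STFT settings may differ.
    """
    vocal_mag = np.abs(librosa.stft(vocals, n_fft=n_fft, hop_length=hop_length))
    accomp_mag = np.abs(librosa.stft(accompaniment, n_fft=n_fft, hop_length=hop_length))
    return (vocal_mag > accomp_mag).astype(np.float32)
```

Under this framing, the CNN is trained with a cross-entropy loss to predict the mask from the mixture spectrogram; at inference time, the predicted mask can be multiplied with the mixture's magnitude spectrogram and the result inverted back to audio.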
Challenges
What has been the hardest part of the project you’ve encountered so far?
At this point, the most difficult part of the project has been preprocessing. This is as expected: working with audio presents its own unique set of challenges, especially since we are moving between audio and images. In preprocessing, we have been working on reading our audio data in and converting it into proper float arrays of samples representing both the full mix and the ground-truth vocal stem. Figuring out how to handle 2-channel stereo audio has been a somewhat challenging part of this. The other difficult component of preprocessing is creating detailed spectrogram images for each audio clip. We have found an audio-processing library, Librosa (https://librosa.org/doc/latest/index.html), that has been rather helpful here, and we are currently experimenting with both TensorFlow’s and Librosa’s short-time Fourier transform functions, as sketched below.
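For illustration, a minimal sketch of this loading step with Librosa (file names, sample rate, and STFT parameters are placeholder assumptions): stereo files come back as a (2, n_samples) array, which we collapse to mono before taking the short-time Fourier transform.

```python
import numpy as np
import librosa

SR = 22050  # assumed target sample rate

# Hypothetical file names for a full mix and its ground-truth vocal stem.
mix, _ = librosa.load("mixture.wav", sr=SR, mono=False)
vocals, _ = librosa.load("vocals.wav", sr=SR, mono=False)

# Stereo audio arrives as (2, n_samples); average the channels to get mono.
mix_mono = np.mean(mix, axis=0) if mix.ndim == 2 else mix
vocals_mono = np.mean(vocals, axis=0) if vocals.ndim == 2 else vocals

# Short-time Fourier transform -> complex array of (freq_bins, frames).
mix_stft = librosa.stft(mix_mono, n_fft=1024, hop_length=256)
mix_magnitude = np.abs(mix_stft)

# The TensorFlow counterpart we are comparing against would be roughly:
#   tf.signal.stft(mix_mono, frame_length=1024, frame_step=256)
```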
Insights
Are there any concrete results you can show at this point? How is your model performing compared with expectations?
We do not currently have concrete results from our model, although we have some results from our preprocessing pipeline, which converts raw audio samples into spectrogram form. This is an important step, as our model needs to train on spectrograms in order to effectively isolate vocal samples. In addition, our pipeline can convert raw audio not only into regular spectrograms but also into “melspectrograms” (depending on the arguments given by the user). A melspectrogram is a spectrogram whose frequency axis is warped onto the mel scale, a roughly logarithmic scale that approximates human pitch perception, and this pipeline will make it easy to compare the two data scalings, as sketched below. Seeing as we have been focused on preprocessing, we have not yet worked much on the model itself. We have a skeleton of the model’s structure and methods, but do not yet have insight into how well it will perform.
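A minimal sketch of what such a flag-driven conversion might look like (the function name, defaults, and decibel scaling are our own illustrative choices, not a fixed part of our pipeline):

```python
import numpy as np
import librosa

def audio_to_spectrogram(samples, sr=22050, use_mel=False,
                         n_fft=1024, hop_length=256, n_mels=128):
    """Convert a mono float array to a (mel)spectrogram in decibels."""
    if use_mel:
        # Mel-scaled power spectrogram, then converted to dB.
        mel = librosa.feature.melspectrogram(y=samples, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length, n_mels=n_mels)
        return librosa.power_to_db(mel, ref=np.max)
    # Regular magnitude spectrogram, also in dB for a comparable dynamic range.
    mag = np.abs(librosa.stft(samples, n_fft=n_fft, hop_length=hop_length))
    return librosa.amplitude_to_db(mag, ref=np.max)
```

Comparing the two representations during training then only requires flipping the use_mel flag.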
Plans
Are you on track with your project? What do you need to dedicate more time to? What are you thinking of changing, if anything?
We are relatively on track with our project. At the initial check-in, our plan was to focus primarily on preprocessing and to have the beginnings of a model by this checkpoint. We have been working on preprocessing the audio into its spectrogram form, something we knew would be unfamiliar territory and a significant component of this project. However, we have not yet had time to finish this pipeline and build out the model itself. Once our preprocessing is fully functional, we need to dedicate more time to building the model, experimenting with its parameters, and seeing how well it trains and tests. One change we are considering is training our model on melspectrograms instead of regular spectrograms, as they could provide information more relevant to human hearing. We will experiment with various (mel)spectrogram designs to assess which gives us the best results for isolating vocal samples. Additionally, we are thinking of extending the scope of our project to include instrumental isolation as well.