Vocaloids sound better than people, because you don't have to talk to them, and you do have to talk to people.

What it does

First, it separates the original song into vocal and instrumental tracks. Next, it transcribes the vocals to text with timestamps, and extracts the frequencies and notes using a deep-learning homophonic transcription method. We then feed the lyrics to Google Text-to-Speech to generate a new voice, use the timestamps to align the generated vocals with the original, and shift the pitch to match the melody — resulting in newly layered, completely customizable vocals for any song.
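The "frequency and notes" step above boils down to mapping each detected vocal frequency to the nearest musical note, then computing how far the synthesized voice has to be shifted to land on it. A minimal sketch of that math (the function names here are illustrative, not from our codebase):

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def freq_to_midi(freq_hz: float) -> int:
    """Nearest MIDI note number for a detected frequency (A4 = 440 Hz = 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))

def midi_to_name(midi: int) -> str:
    """Human-readable note name, e.g. 69 -> 'A4'."""
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"

def semitone_shift(detected_hz: float, target_midi: int) -> float:
    """Semitones to shift the generated voice so it hits the target note."""
    return target_midi - (69 + 12 * math.log2(detected_hz / 440.0))
```

So a TTS syllable that comes out near 440 Hz but belongs on B4 (MIDI 71) needs a +2 semitone shift.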

How we built it

We chose Python as our primary language. We used the Google Cloud Speech-to-Text and Text-to-Speech APIs to extract the lyrics and generate the new voice. We used PyTorch to build a U-Net for vocal/instrumental separation, TensorFlow and a convolutional neural network for homophonic transcription, and librosa plus signal processing to shift frequencies. Sorry for all the buzzwords.
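To see why the timestamps matter for the pitch-shifting step, here is a deliberately naive sketch (NumPy only, not our librosa code): shifting pitch by plain resampling also changes the clip's duration, so every shifted segment has to be re-aligned against the original word timestamps afterwards.

```python
import numpy as np

def naive_pitch_shift(samples: np.ndarray, semitones: float) -> np.ndarray:
    """Raise/lower pitch by reading the samples faster/slower.
    Side effect: the output is shorter/longer than the input, which is
    exactly why the speech-to-text timestamps are needed for re-alignment."""
    rate = 2 ** (semitones / 12)                 # frequency ratio of the shift
    old_idx = np.arange(len(samples))
    new_idx = np.arange(0, len(samples), rate)   # fractional read positions
    return np.interp(new_idx, old_idx, samples)  # linear resampling

# One second of a 220 Hz sine; shifting up an octave halves its length.
sr = 8000
tone = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
shifted = naive_pitch_shift(tone, 12.0)
```

A proper implementation (e.g. a phase vocoder, as librosa uses) shifts pitch without changing duration, but the alignment problem between generated and original vocals remains.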

Challenges we ran into

It didn't work.

Accomplishments that we're proud of

It kind of sort of works.

What we learned

How to make it work

What's next for Vocaloid

To take over the world

Built With

python, pytorch, tensorflow, librosa, google-cloud