Everyone is inside and depressed right now. What could be something that would make them chuckle? Music is a phenomenon that connects everyone, so why not produce an interesting take on it with AWS AI products?

There's a high barrier to entry for kids getting into technology. Maybe this project could eventually become something that helps kids who are interested in music get introduced to the ML technology of AWS.

What it does

Short Answer: Takes a song, interprets it, produces a cover of that song.

'Longer' Answer: Takes a song and uses the 'Quantphi Sound Separation' package from the AWS Marketplace to isolate the vocals. The isolated vocals are then interpreted by Amazon Transcribe. The generated transcription is then fed into Amazon Polly to generate speech from that interpretation. The resulting mp3 is then mixed with the accompaniment, producing the final product: a cover of the original song.
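
The pipeline above can be sketched as a composition of four steps. This is only an outline, not the project's actual code: every function here is a placeholder supplied by the caller, and all the names are mine.

```python
# High-level sketch of the cover pipeline. Each step is injected by the
# caller (in the real project these wrap AWS services); names are illustrative.
def make_cover(song_mp3, separate, transcribe, synthesize, mix):
    vocals, accompaniment = separate(song_mp3)   # Quantphi source separation
    lyrics = transcribe(vocals)                  # Amazon Transcribe
    new_vocals = synthesize(lyrics)              # Amazon Polly
    return mix(new_vocals, accompaniment)        # final cover mp3
```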

How I built it

This project was built entirely in Python 3.

Amazon Sagemaker:

  • Used for accessing all AWS APIs

Quantphi: Source Separation:

  • Used to take the mp3 of a song and isolate the vocals

Amazon Transcribe:

  • Used to transcribe the isolated vocals of the song

Amazon Polly:

  • Used to generate the vocals for AI Cover Musician.
  • This is done word by word to allow flexibility with timing: an mp3 is generated for each word.

  • Each mp3 file is then concatenated with the others and mixed with the accompaniment. As the mp3 files are concatenated, each one is checked and adjusted for pitch and timing.

  • Pitch detection is done using the deep learning Python library Crepe.
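
The word-by-word Polly step might look roughly like this. It's a minimal sketch assuming boto3's Polly client; the helper name and voice choice are my own, not necessarily what the project uses.

```python
# Sketch of per-word speech synthesis with Amazon Polly (via boto3).
# Generating one mp3 per word lets each clip's pitch and timing be
# adjusted independently before concatenation.
def synthesize_words(polly_client, words, voice_id="Joanna"):
    clips = []
    for word in words:
        resp = polly_client.synthesize_speech(
            Text=word,
            OutputFormat="mp3",   # one mp3 per word
            VoiceId=voice_id,
        )
        clips.append(resp["AudioStream"].read())
    return clips
```

In the real notebook the client would come from `boto3.client("polly")`; here it is passed in so the function can be exercised without AWS credentials.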

Challenges I ran into

While Getting it to Work:

  • Audio files were too large: The first challenge was that the API for the 'Quantphi Sound Separation' package would throw a timeout/error for mp3 files longer than 30 seconds. To work around this, the song is split into 30-second segments and fed into the model package's API as a batch.

  • Dependencies: ffmpeg is a library that's pretty much required if you're going to be working with audio files in software. I couldn't figure out how to install ffmpeg in the Sagemaker environment, as 'yum' doesn't have all the correct dependencies to install it. I eventually figured out that Sagemaker is built on the Amazon Linux distro, which let me piece together how to fulfill this dependency of the project.

  • Interpreting Documentation: Figuring out how to get Amazon Transcribe and Amazon Polly to work through their Python SDKs took effort. I wish Amazon Polly especially had more thorough documentation on how to query the API.
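
The 30-second batching workaround in the first bullet can be sketched as a small helper that computes segment boundaries; the names and the millisecond units are my own assumptions:

```python
# Sketch of splitting a song into <=30 s chunks so no single call to the
# source-separation API exceeds its limit.
def segment_bounds(duration_ms, segment_ms=30_000):
    """Return (start, end) millisecond ranges covering the whole song."""
    bounds = []
    start = 0
    while start < duration_ms:
        end = min(start + segment_ms, duration_ms)
        bounds.append((start, end))
        start = end
    return bounds
```

Each range would then be sliced out of the audio (e.g. with a library like pydub) and submitted as one item of the batch.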

While Getting it to Sound Less Like Garbage:

  • Amazon Transcribe's start and end times: When transcribing music, Amazon Transcribe tends to produce very conservative start and end times, i.e., the timestamps it generates for words tend to be too short. Since Amazon Polly's output was timed and sized using those timestamps, the pronunciation of the song lyrics became very choppy and at some points unintelligible.

  • Fixing the timing of the song: The vocals tend to run faster than the backing track, since Amazon Polly only bounds the maximum length of spoken text. To compensate, the expected start time of each word was tracked using the Amazon Transcribe timestamps, and pauses were inserted where the vocals ran ahead.

  • Pitch shifting the vocals: For some reason this never went according to plan :/

  • One limitation is that Amazon Transcribe is not designed to recognize words in song. Words sung in riffs are not recognized, words in quickly delivered rap bars are not recognized, and complicated lyrics are not transcribed properly.

Accomplishments that I'm proud of

Coming up with an end solution that didn't give me a headache immediately was a really big accomplishment.

What I learned

I started out with no experience working with AWS Sagemaker or any of the other products (AWS Marketplace, Amazon Polly, Amazon Transcribe, or any of the Amazon APIs). After working on this project I feel much more confident building stuff with those products now.

I also had no experience working on music (or sound, for that matter) with code at all. Learning about the Python libraries that allow you to manipulate sound was very interesting.

What's next for AI Cover Musician

TODO List:

  • Supporting songs in multiple languages
  • Interpreting the accompaniment (into MIDI) and generating it again
  • Pitch shifting without changing duration
  • Improving pitch correction

AI Cover Musician's goal is to become world famous because of music.

Follow AI Cover Musician on SoundCloud.



posted an update

  • Pitch Variation now leverages Amazon Polly

  • Bug fixes in main notebook: The notebook can now create multiple songs without needing to manually clear S3 locations. The Amazon Polly query has been made modular for different credentials.


posted an update

  • More songs available for demo.

  • Updated pitch detection algorithm: Uses machine learning (TensorFlow) to estimate the frequency of small intervals (~10 ms) in each audio clip corresponding to a word in the transcription. After truncating and converting the frequencies to semitone units (musical notes), the most frequent note is returned as the pitch of that clip.

  • Optimized dependencies for the Sagemaker environment: Reduced the execution time for fulfilling dependencies by migrating to the TensorFlow notebook. This will make the demo much quicker. Local environment support, in addition to Sagemaker, is coming soon.
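
The pitch vote described in the update above can be sketched like this (a reading of the described algorithm, not the project's code; the A4 = 440 Hz reference and MIDI note numbering are my assumptions):

```python
import math
from collections import Counter

# Sketch of the clip-pitch vote: convert each ~10 ms frame's estimated
# frequency to a semitone (MIDI note) number, then return the most
# common note as the pitch of the whole clip.
def clip_pitch(frame_freqs_hz):
    notes = [
        round(69 + 12 * math.log2(f / 440.0))  # 440 Hz (A4) -> MIDI 69
        for f in frame_freqs_hz
        if f > 0  # skip unvoiced/silent frames
    ]
    return Counter(notes).most_common(1)[0][0]
```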


posted an update

Updated to fulfill a dependency for executing audio processing in the Sagemaker environment.

New Demo Video:

New Demo Instructions (the demo is now simplified):

  1. Clone this repository as a notebook instance in AWS Sagemaker.
  2. Go through the main.ipynb file in the repository and execute all the code blocks.
  3. The last code block of the notebook allows you to listen to the generated song.

Version 2.1 of AI Cover Musician now uses a TensorFlow model to detect pitch in both the generated audio and the isolated vocals. This allows pitch modification to try to match the original song.
