Inspiration

Recent progress in generative AI has all but solved conditional image generation. Music generation, however, lags behind: current systems can generate snazzy tunes, but the lyrics come out as eerie mumbling.

What it does

We take any prompt describing audio, e.g. "Country music where two people sing about boots", and generate a matching song with intelligible lyrics.

How we built it

We first send our music prompt to the OpenAI API to generate lyrics, then convert those lyrics to robotic-sounding speech with Google TTS. Finally, we adjust the variance schedule of the diffusion process in the open-source model Riffusion to bias it toward generating content-preserving lyrics in the style of the intended song.
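The variance-schedule adjustment can be sketched roughly as follows. This is a toy illustration, not our actual code: the function names and the `damp` parameter are hypothetical, and the idea shown is simply damping the later betas of a standard DDPM linear schedule so the reverse process injects less noise late in sampling, leaving more of the TTS lyric content intact.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Standard DDPM linear variance schedule over T timesteps."""
    return np.linspace(beta_start, beta_end, T)

def content_biased_schedule(T, beta_start=1e-4, beta_end=0.02, damp=0.5):
    """Hypothetical content-biased variant: progressively scale down
    the later betas (damp < 1 flattens the tail of the schedule),
    so late denoising steps disturb the lyric content less."""
    betas = linear_beta_schedule(T, beta_start, beta_end)
    ramp = np.linspace(1.0, damp, T)  # 1.0 at step 0, damp at step T-1
    return betas * ramp
```

The damped schedule matches the standard one at the start and injects strictly less variance toward the end of sampling.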

Challenges we ran into

The crux of this project was that we assumed the diffusion latent space of our model would be easy to use off the bat. This was naively not the case, and we had to introduce a new type of content-based guidance to our model.
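A minimal sketch of what content-based guidance looks like, assuming a classifier-guidance-style setup: after each denoising update, the latent is nudged by the gradient of a content loss against the TTS reference spectrogram. Everything here is illustrative: `denoise_fn`, the L2 loss, and the scale parameters are stand-ins (a real system would compare learned content features, not raw L2 distance).

```python
import numpy as np

def content_guidance_step(latent, tts_ref, denoise_fn,
                          step_size=0.1, guidance_scale=0.5):
    """One reverse-diffusion step with toy content guidance:
    take the model's denoising update, then nudge the result
    toward the TTS reference using the gradient of an L2 loss."""
    x = denoise_fn(latent)            # model's proposed denoised latent
    grad = 2.0 * (x - tts_ref)        # gradient of ||x - tts_ref||^2
    return x - guidance_scale * step_size * grad
```

Iterating this step pulls the latent toward the reference content while the denoiser keeps it on the model's manifold; the guidance scale trades off lyric fidelity against musical style.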

Accomplishments that we're proud of

We bootstrap on top of the open-source Riffusion and achieve more intelligible lyrics than the larger closed-source Google systems Noise2Music and MusicLM, both released in the past month.

What we learned

We learned the intricate details of the diffusion process and how to manipulate its latents.

What's next for Noise to Lyrics

Expanding to longer form audio and incorporating higher quality diffusion models.
