Inspiration
Recent progress in generative AI has all but solved conditional image generation. Music generation lags behind: current systems can produce snazzy tunes, but the lyrics come out as eerie mumbling.
What it does
We take any prompt describing audio, such as "Country music where two people sing about boots", and generate matching music with intelligible lyrics.
How we built it
We first send the music prompt to the OpenAI API to generate lyrics. We convert those lyrics to robotic-sounding speech with Google TTS. We then adjust the variance schedule of the diffusion process in the open-source model Riffusion to bias the model toward generating content-preserving lyrics in the style of the intended song.
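The variance-schedule adjustment can be sketched roughly as follows. This is a toy illustration, not Riffusion's actual code: the linear beta schedule is the standard DDPM formulation, while the `bias` factor and the choice to damp only the late (low-noise) steps are hypothetical stand-ins for the tuning described above.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    # Standard linear variance (beta) schedule used by many diffusion models.
    return np.linspace(beta_start, beta_end, num_steps)

def biased_beta_schedule(num_steps, bias=0.5, beta_start=1e-4, beta_end=0.02):
    # Hypothetical adjustment: shrink the betas in the second half of the
    # schedule so the sampler injects less noise late in denoising and
    # stays closer to the conditioning content (the TTS lyrics).
    betas = linear_beta_schedule(num_steps, beta_start, beta_end)
    half = num_steps // 2
    betas[half:] *= bias
    return betas
```

The intuition: early, high-noise steps set the overall musical style, while late steps refine fine detail such as vocal articulation, so damping only the late steps trades a little diversity for clearer lyrics.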
Challenges we ran into
The crux of this project was that we assumed the diffusion latent space of our model would be easy to work with right off the bat. This was not the case, and we had to introduce a new type of content-based guidance to our model.
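Content-based guidance in diffusion sampling generally means nudging each denoising step along the gradient of a content score. The sketch below shows the generic pattern on a toy problem; the quadratic `content_loss_grad`, the step structure, and all parameters are illustrative assumptions, not the project's actual guidance function.

```python
import numpy as np

rng = np.random.default_rng(0)

def content_loss_grad(x, target):
    # Gradient of a toy content loss ||x - target||^2 / 2. In a real system
    # this gradient would come from a model scoring lyric intelligibility.
    return x - target

def guided_step(x, target, noise_scale, guidance_scale=0.5):
    # One illustrative reverse-diffusion step: move against the content
    # gradient, then re-inject a little noise.
    x = x - guidance_scale * content_loss_grad(x, target)
    return x + noise_scale * rng.standard_normal(x.shape)

def sample(target, steps=50):
    # Start from pure noise and anneal the injected noise toward zero.
    x = rng.standard_normal(target.shape)
    for t in range(steps):
        x = guided_step(x, target, noise_scale=0.1 * (1 - t / steps))
    return x
```

Because the guidance term pulls every step toward the content target, the final sample lands near it even though each step also adds noise.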
Accomplishments that we're proud of
We bootstrap on top of the open-source Riffusion and get more intelligible lyrics than the larger closed-source Google systems Noise2Music and MusicLM, both released in the past month.
What we learned
We learned more intricate detail about the diffusion process and how to manipulate its latents.
What's next for Noise to Lyrics
Expanding to longer form audio and incorporating higher quality diffusion models.
Built With
- google-cloud
- openai
- python
- riffusion