Track
Generative AI
Inspiration
This project is an extension of my previous project, music2image, which extracts musical features such as emotion and genre using deep learning models like CNNs and feeds them as text prompts into a Stable Diffusion model. However, the generated text prompts were very standardised; the only differences between them were the extracted musical features, such as happy, sad, or emotional. Insage aims to alleviate this problem by utilising more modern techniques to generate richer, more descriptive text prompts from audio inputs, drawing on Vision Transformer-inspired architectures such as Music2Cap.
What it does
The application takes an audio input from the user and generates a descriptive text prompt such as "This music is instrumental. The tempo is slow with a melancholic piano melody with no other instrumentation. The song is emotional and passionate. The audio quality is poor." This text prompt is then fed into a Stable Diffusion model that was fine-tuned with LoRA on a public dataset sourced from the MelFusion paper.
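As a rough illustration, the two-stage flow looks something like the sketch below. Here `generate_caption` is a hypothetical stand-in for the music-to-caption model, and the base model ID and LoRA weight path are placeholders rather than the project's actual artifacts:

```python
import torch
from diffusers import StableDiffusionPipeline

def generate_caption(audio_path: str) -> str:
    """Hypothetical wrapper around the music-to-caption model:
    takes a path to an audio file and returns a descriptive prompt."""
    raise NotImplementedError

# Load a Stable Diffusion pipeline and apply LoRA weights
# (model ID and weight path are placeholders).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/melfusion_lora")

prompt = generate_caption("input_song.wav")
image = pipe(prompt).images[0]
image.save("inspiration.png")
```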
How we built it
For the models, the music-to-caption model uses the audio encoder from the MusicCaps paper. The image model uses the Hugging Face diffusers package and its Stable Diffusion pipeline to load both the base model and the LoRA weights. The front-end user interface is built with the streamlit package, and the demo application is hosted on a Hugging Face Space.
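A minimal Streamlit front end tying the two models together might look like the sketch below; it reuses the `pipe` and `generate_caption` names from the earlier sketch, and writing the upload to a temporary file is an assumption about how the audio is handed to the caption model:

```python
import tempfile

import streamlit as st

st.title("Insage")
audio_file = st.file_uploader("Upload your music", type=["wav", "mp3"])

if audio_file is not None:
    st.audio(audio_file)
    # Persist the upload to a temporary file so the caption model
    # can read it from disk (assumed interface).
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(audio_file.read())
        audio_path = tmp.name

    with st.spinner("Captioning audio and generating image..."):
        prompt = generate_caption(audio_path)  # hypothetical helper from above
        image = pipe(prompt).images[0]

    st.caption(prompt)
    st.image(image)
```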
Challenges we ran into
There was an attempt to extend image generation to video generation to provide users with more room for inspiration, but two main roadblocks emerged. First, video generation inference took far too long to be usable. Second, sourcing a suitable dataset to fine-tune the video model would have taken too much time.
What we learned
The main learning point for me in developing this application was learning to read through source code and papers to identify the most suitable model for my use case, rather than just using open-source models hosted on Hugging Face.
What's next for Insage
To provide users with more inspiration, Insage could return a video that plays the user's music over the generated image or video as a backdrop, letting them judge how well the two fit together. Beyond that, music similarity features against other songs, along with the thought processes and stories behind those songs, could prove useful for music creators.
Built With
- diffusers
- huggingface
- python
- streamlit
- tensorflow