Image2Music - On Inspiring Creativity with Generative AI
Inspiration
This project is inspired by TikTok influencers who make dance videos set to great music. Most short videos on TikTok are paired with viral songs to make them more appealing. That made me wonder: could we automatically retrieve songs that suit a given video or image? My first idea was to scrape the songs in the TikTok music library, use a GenAI model to generate a description of the image, and find the most suitable music in the database. However, those songs are copyrighted, and it is hard to determine a good match based on the categories in the TikTok music library. But with state-of-the-art models that can generate music from text, why not automatically generate music for the image? And so Image2Music was born.
What it does
The project takes an image, or the URL of an image, as input and produces a unique song based on it. This lets music producers and enthusiasts draw inspiration from pictures and colors, and it makes video-making easier by providing non-copyrighted songs. Users can also customize the result by choosing the length of the audio, the genre of the music, the mood of the song, and other specifications, which helps the model produce songs that better match what the user wants.
Features
- Upload an image from your device
- Upload an image URL
- Preview the image in the browser
- Settings
  - Prompt for the Llava model
  - Number of songs generated
  - Prompt length and song length
  - Music genre and mood
  - Custom music specification
- Audio
  - Listen to the audio in the browser
  - Download the audio to your device
How we built it
Model Pipeline
This project uses two models: Large Language and Vision Assistant (Llava) and Musicgen. Llava is trained on images and text and can describe or reason about a given image and prompt. By running inference with an image and a prompt such as "Describe the music that best suits this picture in a sentence", Llava produces a detailed description of music that fits the picture. Musicgen, in turn, is trained on songs and text and can generate music from a text description. Feeding the description produced by Llava into Musicgen yields a unique piece of music tailored to the image.
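The two-stage pipeline above can be sketched with the Hugging Face `pipeline` API. This is a minimal illustration, not the project's exact code: the checkpoint names (`llava-hf/llava-1.5-7b-hf`, `facebook/musicgen-small`), the chat-template wording, and the file paths are assumptions.

```python
# Minimal sketch of the two-stage pipeline (image -> description -> music).
# Checkpoint names, prompt wording, and file paths are illustrative.

def build_llava_prompt(instruction: str) -> str:
    """Wrap an instruction in LLaVA's chat template so the image is attended to."""
    return f"USER: <image>\n{instruction} ASSISTANT:"

def image_to_music(image_path: str, out_path: str = "song.wav") -> str:
    from transformers import pipeline
    import scipy.io.wavfile

    # Stage 1: ask Llava to describe music that suits the picture.
    captioner = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")
    prompt = build_llava_prompt(
        "Describe the music that best suits this picture in a sentence."
    )
    description = captioner(image_path, prompt=prompt)[0]["generated_text"]

    # Stage 2: feed that description to Musicgen to synthesize audio.
    synthesiser = pipeline("text-to-audio", model="facebook/musicgen-small")
    music = synthesiser(description, forward_params={"do_sample": True})
    scipy.io.wavfile.write(out_path, rate=music["sampling_rate"], data=music["audio"])
    return description
```

Keeping the two stages behind one function makes it easy to swap either checkpoint later without touching the frontend.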
Frontend and Deployment
The frontend website is built with the Python package Gradio. Since a hackathon period is very short, Gradio lets developers build a fast, clean frontend UI with only a few lines of code, even if learning a new library takes some effort. For the models, Hugging Face hosts pretrained weights that can easily be downloaded and used. The only problem was getting a server with enough GPU memory for the LLM, so I rented one from Vast.ai, which offers servers equipped with powerful GPUs at relatively low prices. The Gradio app is mounted on a FastAPI application, providing stable routing and TCP connections, and I used the package uvicorn to host the FastAPI app. When users connect to the website, they can upload images to the server; the server converts each image to a suitable data type and runs the models. The resulting audio is then displayed on the website, where the user can listen to it and download it to their local device.
Public Domain Name
To let the public try the project, I bought a custom domain name from Cloudflare and use a Cloudflare Tunnel as a proxy server for traffic handling and secure connections. The tunnel redirects any traffic to the domain on to the Vast.ai server.
Libraries
- Website: Gradio, Fastapi, Uvicorn
- ML: huggingface, transformers
Models
- Llava
- Musicgen
Challenges we ran into
I ran into several challenges:
How to quickly build a full-stack application within the hackathon period? After some research, I found that many LLM demos are built with the Gradio or Streamlit packages, both of which provide a quick, simple frontend UI.
How to get a server with a GPU that has enough memory for the models? Llava and Musicgen are both fairly large and cannot run on my local machine, so getting a server with a powerful GPU and enough memory was essential. My first solution was Modelbit, a service that lets developers deploy machine learning models to the cloud and run inference by sending requests. However, Modelbit's pricing was more than I could afford, so I turned to a second solution: renting a server on Vast.ai. With that server, I can host the website and run inference locally, saving the time and money of sending requests to the cloud.
The URL served by uvicorn is not secured and could be vulnerable to attack. The server rented on Vast.ai comes with a public IPv4 address, but that address should not be exposed to the public directly; a proxy server is needed to handle incoming traffic from outside. So I bought a domain name from Cloudflare and used a Cloudflare Tunnel to handle secure connections.
Accomplishments that we're proud of
This project achieved several accomplishments:
Custom settings for the Image2Music model. Most Image2Music models on the web don't let users supply a custom prompt for text-to-music generation, but my website allows custom modifications, including music genre and mood.
Image URL input. Users can provide the URL of an image for music generation, which broadens the range of input images: there is no need to download an image from another website first, as long as the image is reachable over the internet.
Multiple songs at once. Users can specify how many songs (up to 5) to generate each time, so they can get a variety of songs, compare them, and select the one they like best.
Public domain name for traffic handling and secure connections, using a Cloudflare Tunnel.
What we learned
Through this hackathon, I learned a lot:
- Developing a quick demo for a machine learning model with Gradio and deploying the website on a server.
- Combining multiple generative models into a multimodal pipeline.
- Using large generative models, especially LLMs, is costly; you first have to find an affordable server to run them.
- Buying a domain name on Cloudflare and using it with a tunnel as a proxy for secure connections.
What's next for Image2Music
There are several future steps for Image2Music.
- Incorporate more image-to-text and text-to-music models for more diverse results.
- Let users edit or generate new images with a diffusion model and generate music based on the new image.
- Accept video as input and generate songs based on the video.
- Fine-tune the models on a custom training dataset.
Github Repo:
Link to the project repo: Image2Music
Built With
- cloudflare
- fastapi
- gradio
- huggingface
- python
- transformers
- vast.ai