Inspiration

  • Personally, I have dabbled in music composition and have tried to base my compositions on short stories or experiences. In the process I ran into two issues: first, listeners told me that the feelings my pieces evoked were not what I intended, and second, at some stage of composing I would not know how to continue. So I wanted to create an application that addresses both issues: detecting the emotion of a piece and providing inspiration. I chose images as the source of inspiration, with video as a possible future extension.

What it does

  • The application takes in an audio snippet, such as a short piece from a music creator, and returns to the creator the emotion the piece evokes as well as an image inspired by it.

How I built it

  • Three main models are used in this application: a Music Emotion Recognition (MER) model, a Music Genre Classification (MGC) model, and a text-to-image generation model. The MER model uses the base A2E architecture, while the MGC model is a multi-layer CNN implemented in TensorFlow and trained on Google Colab. The text-to-image model is a pre-trained Stable Diffusion model fine-tuned with LoRA via Hugging Face. The front end is built with the Streamlit Python package. Rough sketches of the genre classifier and the front-end wiring follow.
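As a rough illustration, below is a minimal sketch of a multi-layer CNN genre classifier in TensorFlow/Keras. The mel-spectrogram input shape, layer sizes, and 10-genre output are assumptions for illustration, not the exact architecture used.

```python
from tensorflow.keras import layers, models

# Minimal multi-layer CNN over mel-spectrogram patches.
# Input shape (128x128x1) and 10 genre classes are assumptions.
def build_genre_classifier(input_shape=(128, 128, 1), n_genres=10):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(n_genres, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_genre_classifier()
model.summary()
```

And a minimal sketch of how the Streamlit front end could tie the pieces together with a LoRA-fine-tuned Stable Diffusion pipeline from Hugging Face diffusers. The base model ID, the LoRA weight path, and the predict_emotion placeholder are hypothetical stand-ins, not the project's actual code.

```python
import streamlit as st
import torch
from diffusers import StableDiffusionPipeline

def predict_emotion(audio_bytes: bytes) -> str:
    """Placeholder for the MER model described above."""
    return "calm"

@st.cache_resource
def load_pipeline():
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    # LoRA weights fine-tuned on music-related imagery; path is hypothetical.
    pipe.load_lora_weights("path/to/music2image-lora")
    return pipe

st.title("Music2Image")
audio = st.file_uploader("Upload an audio snippet", type=["wav", "mp3"])
if audio is not None:
    emotion = predict_emotion(audio.read())
    st.write(f"Detected emotion: {emotion}")
    pipe = load_pipeline()
    image = pipe(f"an evocative scene expressing a {emotion} mood").images[0]
    st.image(image, caption=f"Inspiration for a {emotion} piece")
```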

Challenges I ran into

  • The challenges I faced fall into 2 categories: dataset availability and model performance. The issues below are representative rather than exhaustive.
  • Dataset availability: quality datasets are available for genre classification, but much less so for emotion detection, and likewise for text-to-image generation in a musical context.
  • Model performance: the emotion recognition dataset currently covers audio from only 4 genres, while the genre classification dataset covers 10, so the MER model performs poorly on some genres. Next, the pre-processing of the emotion recognition dataset may have been an over-simplification, which directly affects model performance: the dataset has multiple users giving emotion scores to each track, and I labeled each track with the emotion that had the highest average score, which can erase nuances in the data (a minimal sketch of this labeling step follows the list). Lastly, the images used for LoRA fine-tuning were mostly of artists and instruments, which was not what I intended but served as a good starting point.
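For concreteness, here is a minimal sketch of that labeling step; the emotion columns and scores are toy data, not the actual dataset.

```python
import pandas as pd

# Toy annotations: several raters score each track on a few emotions (0-10).
ratings = pd.DataFrame([
    {"track_id": "t1", "rater": "a", "happy": 7, "sad": 2, "tense": 4, "calm": 3},
    {"track_id": "t1", "rater": "b", "happy": 6, "sad": 1, "tense": 5, "calm": 2},
    {"track_id": "t2", "rater": "a", "happy": 2, "sad": 8, "tense": 3, "calm": 6},
    {"track_id": "t2", "rater": "b", "happy": 1, "sad": 7, "tense": 2, "calm": 7},
])

emotions = ["happy", "sad", "tense", "calm"]

# Average each emotion's score per track, then keep the top-scoring emotion
# as the single label -- the simplification described above.
mean_scores = ratings.groupby("track_id")[emotions].mean()
labels = mean_scores.idxmax(axis=1)
print(labels)  # t1 -> happy, t2 -> sad
```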

Accomplishments that I am proud of

  • I am proud that I was able to apply what I have learnt about audio signal processing to create a proof-of-concept product that I can actually use. Furthermore, seeing a functioning product, while far from perfect, is very satisfying.

What I learned

  • I have learnt that in machine learning projects it is always better to have an ordered plan, to minimize confusion down the line: first, think of an idea and its usability; next, source usable datasets; and lastly, think about the models and how to iteratively improve them.
  • I have also learnt how to filter and better understand papers through this project. I found it easier to always start at a paper's data-processing stage and then move on to the model architecture, as opposed to jumping straight in.
  • Lastly, I have learned to make use of online resources like Google Colab to train my models.

What's next for Music2Image

  • Next steps are to iteratively improve each model's performance, either by sourcing or building better datasets and improving feature extraction during pre-processing (data-centric), or by finding better models and considering alternative workflows such as video or story generation (model-centric), and lastly to deploy and monitor the application. Some specific actions:
  • Better datasets: look for audio datasets covering other genres for MER, and build a custom image dataset for LoRA fine-tuning.
  • Better data pre-processing: look into data augmentation for audio and spectrograms (see the sketch after this list).
  • Better models: look into alternative emotion recognition models, and into extracting other, more useful or open-ended musical features, since BPM and genre can be input by users directly.
  • Alternative workflows: look into story-generation models from audio, or direct audio-to-image models.
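As a starting point for the augmentation idea above, here is a minimal sketch using librosa and NumPy; the noise level, stretch range, pitch-shift range, and mask widths are all assumed parameters.

```python
import numpy as np
import librosa

def augment_waveform(y: np.ndarray, sr: int) -> np.ndarray:
    """Waveform-level augmentation: additive noise, time-stretch, pitch-shift."""
    y = y + 0.005 * np.random.randn(len(y))
    y = librosa.effects.time_stretch(y, rate=np.random.uniform(0.9, 1.1))
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=np.random.randint(-2, 3))
    return y

def spec_augment(mel: np.ndarray, freq_width: int = 8, time_width: int = 16) -> np.ndarray:
    """SpecAugment-style masking: blank one frequency band and one time band."""
    mel = mel.copy()
    f0 = np.random.randint(0, mel.shape[0] - freq_width)
    t0 = np.random.randint(0, mel.shape[1] - time_width)
    mel[f0:f0 + freq_width, :] = mel.min()
    mel[:, t0:t0 + time_width] = mel.min()
    return mel

# Stand-in audio clip so the sketch runs without external files.
sr = 22050
y = librosa.tone(440, sr=sr, duration=2.0)
mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=augment_waveform(y, sr), sr=sr)
)
augmented = spec_augment(mel)
```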

Built With

  • huggingface
  • python
  • streamlit
  • tensorflow