CSCI 2470 Final Project: Text To Music

Title

Text Sentiment Analysis and Emotion-Based Music Generation

Team Members

Xiaoyan Liu (xliu252), Yuhong Zhang (yzhan709), Ziyu Wang (zwang569)

Introduction

Inspired by the power of modern AI, we are interested in both text analysis and music generation, and we decided to explore the connection between them. Our motivation is to exploit the interplay between the sentiment extracted from text and the corresponding emotional cues in music, with the aim of changing how people create, experience, and interact with music. The task can be framed primarily as a combination of classification (text sentiment analysis) and unsupervised learning (generative model development). By using a large language model for emotion recognition and transformer-based deep networks, we hope to open up new forms of musical expression and personalization, bridging the gap between human emotion and musical creation.

Related Work

Text Understanding: In "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", the authors presented a method of pre-training language representations that conditions on both left and right context in all layers. This approach, known as BERT (Bidirectional Encoder Representations from Transformers), allows the model to capture a richer understanding of language context than previous models. The researchers used a diverse set of pre-training and fine-tuning tasks, including question answering and language inference, to demonstrate the model's strong performance across a variety of language understanding benchmarks. They explored multiple model sizes and settings, illustrating BERT's adaptability and its impact on the state of the art in natural language processing.
Music Generation: In "EMOPIA: A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation", the authors manually segmented songs into emotionally consistent clips, ensuring a balanced representation of emotions. They analyzed note density, length, velocity, and key distribution in the dataset to understand how these features correlate with emotion. They also used EMOPIA to train and evaluate models for music emotion recognition and emotion-conditioned music generation, experimenting with different representations and generation models, including Transformer and LSTM, to show the dataset's potential for generation tasks.
Music Emotion Recognition: In "Music Emotion Recognition Based on a Neural Network with an Inception-GRU Residual Structure", the authors introduced a model that identifies the emotion of music. After reading their paper, we decided to look for an online music analyzer to check our generated music; this is discussed in the Metrics section.

Data

Text:

ZuCo 1.0 Task 1 is a lightweight text dataset of movie-review sentences labeled with negative, neutral, and positive sentiment. Our preprocessing removes the control sentences, which do not carry any of the four sentiment labels, and organizes the text file in the format 'ID, sentence, sentiment labels'. We then loop through all the sentences with OpenAI's API to produce the ground-truth labels. The resulting set comprises 500 sentences in total, with balanced labels.

Music:

The EMOPIA dataset is a specially curated collection designed for tasks related to music emotion recognition, with a particular focus on piano performances. The dataset encompasses key features as follows:

1) Emotion Labels: It classifies music pieces according to their emotional content, often employing models such as the valence-arousal framework. This framework interprets emotions along two dimensions: valence, ranging from positive to negative feelings, and arousal, varying from calm to excited states.

2) Piano Music in MIDI Format: The dataset comprises MIDI (Musical Instrument Digital Interface) files. MIDI serves as a standard protocol for recording and playing back musical performances, encapsulating details like note sequences, timing, and dynamics in a compact, precise manner. This attribute makes MIDI exceptionally conducive to computational analysis and synthesis, as it encodes musical instructions rather than audio waveforms. Despite this, waveforms in both the time and frequency domains remain crucial modalities explored in the feature space.
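
The valence-arousal framework from item 1 can be illustrated with a small function. This is a minimal sketch, assuming both values are centered at zero and using the quadrant numbering of Russell's circumplex model as adopted by EMOPIA (Q1: high valence, high arousal; Q2: low valence, high arousal; Q3: low valence, low arousal; Q4: high valence, low arousal). The function name and the treatment of values exactly at zero are illustrative choices, not part of the dataset.

```python
def emotion_quadrant(valence: float, arousal: float) -> int:
    """Map a (valence, arousal) pair to a quadrant label 1-4.

    Assumes both values are centered at 0. Values exactly at 0 are
    treated as "high", which is an arbitrary choice for this sketch.
    """
    high_valence = valence >= 0
    high_arousal = arousal >= 0
    if high_arousal:
        return 1 if high_valence else 2
    return 4 if high_valence else 3


# One example point per quadrant.
examples = {
    "excited": emotion_quadrant(0.8, 0.6),
    "tense": emotion_quadrant(-0.7, 0.9),
    "sad": emotion_quadrant(-0.5, -0.4),
    "calm": emotion_quadrant(0.6, -0.3),
}
```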

The dataset contains a total of 1,079 piano pieces, distributed across four emotional quadrants as follows: 250, 265, 253, and 310 pieces, respectively. The lengths of these music pieces range from 31.9 to 40.6 seconds, offering a broad spectrum for analysis in our final project.

The main information in each MIDI message is the note, the velocity, and the time. Notes correspond to piano keys and range from 21 to 108. Velocity indicates volume, with values between 0 and 127. Time is the interval between the previous note and the current note, ranging from 0 to 350.

Methodology

Text Sentiment Analysis: We use a large language model, specifically through the ChatGPT API, to assign sentiment labels to a text corpus. Our approach involves carefully crafting prompts that categorize each sentence into one of four distinct quadrants. This lets us use GPT's capabilities for sentiment analysis while refining the process through prompt engineering, so that the predicted sentiment aligns with our predefined emotional quadrants.
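
The prompt-based labeling loop might look roughly like the sketch below. Everything here is hypothetical: the prompt wording, the quadrant descriptions, and the helper names are illustrations, and the API call is abstracted as a callable `ask` (prompt in, reply text out) rather than tied to any particular client library.

```python
from typing import Callable, Iterable, List, Optional

# Assumed quadrant descriptions; the wording used in the project is not documented here.
QUADRANTS = {
    1: "high valence, high arousal",
    2: "low valence, high arousal",
    3: "low valence, low arousal",
    4: "high valence, low arousal",
}


def build_prompt(sentence: str) -> str:
    """Build a classification prompt that asks for a single digit."""
    options = "\n".join(f"{k}: {v}" for k, v in QUADRANTS.items())
    return (
        "Classify the sentiment of the sentence into exactly one quadrant.\n"
        f"{options}\n"
        f"Sentence: {sentence}\n"
        "Answer with a single digit (1-4)."
    )


def parse_label(reply: str) -> Optional[int]:
    """Return the first digit 1-4 found in the reply, or None if there is none."""
    for ch in reply:
        if ch in "1234":
            return int(ch)
    return None


def label_sentences(sentences: Iterable[str], ask: Callable[[str], str]) -> List[Optional[int]]:
    """Label each sentence using `ask`, a stand-in for the real API call."""
    return [parse_label(ask(build_prompt(s))) for s in sentences]
```

Keeping the reply parser separate from the API call makes it easy to test the parsing offline and to retry sentences whose reply could not be parsed.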

The pre-trained BERT model converts each token in a sentence into a 732-by-1 vector, and we concatenate these vectors directly into a matrix that represents the whole sentence. These matrices are then fed into three models that map them to sentiment labels: 1) two fully connected layers, 2) four fully connected layers, and 3) four fully connected layers with batch normalization.
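
The three classifier variants could be sketched in PyTorch as follows. This is an illustration under several assumptions: the hidden width, the ReLU activations, the flattening of each sentence matrix, and a fixed (padded) number of tokens per sentence are all our choices for the example, and the demo uses small sizes rather than the embedding width reported above.

```python
import torch
from torch import nn


def make_classifier(in_features: int, variant: str,
                    hidden: int = 256, num_classes: int = 4) -> nn.Sequential:
    """Build one of three heads: 'fc2', 'fc4', or 'fc4_bn' (hypothetical names)."""
    if variant == "fc2":
        layers = [nn.Linear(in_features, hidden), nn.ReLU(),
                  nn.Linear(hidden, num_classes)]
    elif variant in ("fc4", "fc4_bn"):
        layers, width = [], in_features
        for _ in range(3):  # three hidden layers + one output layer = four
            layers.append(nn.Linear(width, hidden))
            if variant == "fc4_bn":
                layers.append(nn.BatchNorm1d(hidden))
            layers.append(nn.ReLU())
            width = hidden
        layers.append(nn.Linear(hidden, num_classes))
    else:
        raise ValueError(f"unknown variant: {variant}")
    # Flatten each (tokens x embedding) sentence matrix into one vector.
    return nn.Sequential(nn.Flatten(), *layers)


# Demo with small, illustrative sizes: 8 sentences, 16 tokens, 32-dim embeddings.
max_tokens, embed_dim = 16, 32
x = torch.randn(8, max_tokens, embed_dim)
logits = {v: make_classifier(max_tokens * embed_dim, v)(x)
          for v in ("fc2", "fc4", "fc4_bn")}
```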

Emotion-based Generation: To generate listenable melodies that express a given emotion, we train models on the EMOPIA dataset, which contains melodies paired with emotion labels. The dataset is in MIDI format, and we preprocess it using mido. Our strategy is to compare cVAE and cGAN and select the better-performing model, which is then used to generate melodies that reflect the target emotion while remaining pleasant to listen to. Since all of us are new to generative models, we are open to any new methods that can improve and optimize our process.

MIDI to array: Initially, we attempted to map each note to one of the 88 piano keys while compressing the time and velocity data. However, the resulting dataset was excessively large, and it was hard to discern patterns in it. Consequently, we opted to parse the MIDI messages directly, extracting the note, velocity, and timing information as our features.
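
Parsing messages into (note, velocity, time) features might look like the sketch below. It is written against any message-like object with `type`, `note`, `velocity`, and `time` attributes, which is what mido messages provide, so the demo runs without a MIDI file. Keeping only `note_on` messages with positive velocity and carrying over the delta times of skipped messages are assumptions of this sketch, not a description of our exact pipeline.

```python
from types import SimpleNamespace


def messages_to_features(messages):
    """Collect (note, velocity, time) triples from note_on messages.

    The delta times of skipped messages are accumulated, so each triple's
    time is the interval since the previous kept note.
    """
    features = []
    pending = 0
    for msg in messages:
        pending += msg.time
        # A note_on with velocity 0 is conventionally a note-off, so skip it.
        if msg.type == "note_on" and msg.velocity > 0:
            features.append((msg.note, msg.velocity, pending))
            pending = 0
    return features


# Stand-in messages for the demo; real ones would come from a mido track.
demo = [
    SimpleNamespace(type="set_tempo", time=0),
    SimpleNamespace(type="note_on", note=60, velocity=80, time=0),
    SimpleNamespace(type="note_off", note=60, velocity=0, time=120),
    SimpleNamespace(type="note_on", note=64, velocity=72, time=30),
]
features = messages_to_features(demo)
```

With mido installed, the same function can be applied to a track from `mido.MidiFile(path).tracks`, where message times are in ticks; iterating over the `MidiFile` object itself yields times in seconds instead.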

The cVAE model is built on a standard VAE architecture implemented in PyTorch, to which we add conditional label supervision. We extend the loss function with separate terms for each feature (note, velocity, and time). Reconstructing the mean (mu) and variance (var) separately in the encoder helps prevent gradient explosion. By exploiting the distinct characteristics of each feature, we aim to generate more accurate representations of the data. In the decoder, these separate representations are combined to produce the new MIDI message.
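
A loss with separate terms per feature could be sketched as follows. This is not our exact loss: the use of mean squared error, the per-feature weights, the `beta` factor, and the log-variance parameterization are assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def cvae_loss(recon, target, mu, logvar, weights=(1.0, 1.0, 1.0), beta=1.0):
    """Per-feature reconstruction loss plus a KL term.

    recon, target: (batch, seq_len, 3), last axis = (note, velocity, time).
    mu, logvar:    (batch, latent_dim) outputs of the encoder.
    """
    # One reconstruction term per feature so each can be weighted separately.
    per_feature = [F.mse_loss(recon[..., i], target[..., i]) for i in range(3)]
    recon_loss = sum(w * term for w, term in zip(weights, per_feature))
    # KL divergence between N(mu, exp(logvar)) and the standard normal prior.
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return recon_loss + beta * kl, per_feature, kl


# Sanity check: perfect reconstruction and a posterior equal to the prior.
target = torch.rand(4, 10, 3)
mu, logvar = torch.zeros(4, 8), torch.zeros(4, 8)
loss, per_feature, kl = cvae_loss(target.clone(), target, mu, logvar)
```

Returning the per-feature terms alongside the total makes it possible to monitor note, velocity, and time reconstruction separately during training.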

For the cGAN, we use a standard GAN with conditional label supervision, implemented in PyTorch. We also add more non-linearity and musical-note noise to both the discriminator and the generator to encourage diversity and creativity in the generated music. With batch_size=256 and epochs=400, training took about 90 minutes, and generating a new piano piece of 40-60 seconds took less than 3 minutes.
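
One common way to condition a GAN on a label is to concatenate a one-hot label vector to the generator's noise input. The sketch below shows that idea only; the noise dimension, the zero-based label indices, and the helper name are assumptions, and it does not reproduce our network.

```python
import torch
import torch.nn.functional as F


def generator_input(labels: torch.Tensor, noise_dim: int = 100,
                    num_classes: int = 4) -> torch.Tensor:
    """Concatenate random noise with a one-hot emotion label.

    labels: integer tensor of shape (batch,), with values in [0, num_classes).
    """
    noise = torch.randn(labels.shape[0], noise_dim)
    one_hot = F.one_hot(labels, num_classes).float()
    return torch.cat([noise, one_hot], dim=1)


# One sample per emotion label (labels 1-4 in the text map to indices 0-3 here).
labels = torch.tensor([0, 1, 2, 3])
z = generator_input(labels)
```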

Metrics

We compare different models for text sentiment analysis. Directly using a large language model achieves the best accuracy of all the models, reaching up to 100%. The other models use an 80-20 train-test split and, after 100 epochs, still show promising results that are far better than the 25% chance level. The two-layer fully connected network gives the best accuracy of the three; surprisingly, adding more layers and normalization did not provide a significant improvement.
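
The 80-20 split and the 25% chance level mentioned above can be made concrete with a short sketch. The helper names, the fixed seed, and the use of the standard library instead of a machine learning toolkit are choices made for this illustration.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle the items reproducibly and split them into train and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Fraction of predictions that match the reference labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# 500 sentence IDs split 80-20; four balanced classes give a 1-in-4 chance level.
train_ids, test_ids = train_test_split(range(500))
chance_level = 1 / 4
```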

We trained the generative models and experimented with emotion-based generation. However, there is still no single accuracy metric that comprehensively measures the performance of a generative model. One possible measurement during model construction is perplexity, but it does not define generation "success". We therefore looked for metrics for generative models in papers and open-source projects. After this exploration, we found that the following metrics could be useful:

Collecting feedback from listeners: since "good" music needs the opinion of the public, we sent out a Google survey with generated music and asked for feedback. The results are in our slides, and the survey link is available from the authors. Overall, listeners clearly identified labels 1 and 4, but could not identify music from labels 2 and 3 very well.
Uploading music to a music sentiment analyzer to verify the emotion: as described in Related Work, we found some existing music analysis models and wished to try them. Due to time constraints in the final week, we did not re-implement or deploy a deep learning music verification model; instead we used a similar online music AI, Sonoteller, to verify the music. We submitted our 80 most successful pieces and checked the mood output. The accuracy is as follows:

            Label 1   Label 2   Label 3   Label 4
Correct     9/20      8/20      7/20      11/20
Accuracy    45%       40%       35%       55%

Feeding generated music into emotion models to check the output: abandoned, because the emotion models we found require text input but our output is audio.
Peak Signal-to-Noise Ratio: a standard measure for comparing audio or spectra against a reference. However, it is hard to define ground-truth music for each label. The feeling associated with each label is subjective, so we cannot set a single standard for each emotion, even with the same keywords such as "happy", "sad", or "angry".
Future exploration of music measurements: we are still looking for anything that can give quantitative results on the generated music. One music-specific option is to examine the notes and chords to evaluate how reasonable the arrangement and rhythm are. None of us is an expert in music, so we may need professional musical knowledge for this in the future.
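
The Sonoteller counts reported above can be checked with a few lines of arithmetic. This sketch only recomputes the per-label percentages and adds the overall figure (35 of 80 pieces, about 44%, compared with a 25% chance level for four labels); the variable names are illustrative.

```python
# Correctly identified pieces per label, out of 20 pieces each.
correct = {1: 9, 2: 8, 3: 7, 4: 11}
pieces_per_label = 20

per_label_accuracy = {label: n / pieces_per_label for label, n in correct.items()}
total_correct = sum(correct.values())
overall_accuracy = total_correct / (pieces_per_label * len(correct))
```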

Base Goal: identify text sentiment, and then generate a melody based on that emotion: done.
Target Goal: identify text sentiment with higher accuracy and efficiency, and then generate "pleasing" music: largely done. Text accuracy is over 95%, and the generated music was shown at Deep Learning Day and received a good amount of feedback.
Stretch Goal: identify text sentiment, and then generate an intricate melody with multiple tracks, different instruments, and artistic merit: in progress. We are limited by computational resources and data complexity, and we plan to try larger datasets with more instruments.

Ethics

Why is Deep Learning a good approach to this?

Deep learning models are capable of automatic feature extraction from raw data. They can also generate music by learning from a large corpus of features, capturing underlying structures, styles, and emotional expressions, and creating new compositions that are both innovative and emotionally resonant. Additionally, these models can be adapted and fine-tuned to specific domains or types of text, improving generation performance.

Will AI generation replace humans in music composition?

It is hard to predict where AI development will lead. However, human composers bring a level of emotional depth, experience, and nuance to music that is difficult for AI to replicate. The human touch in storytelling, expression, and emotional connection is unique and valued. Besides, music is deeply tied to human culture. The role of a composer involves not only creating music but also understanding and expressing ideas, something AI lacks. Human composers are capable of innovative thinking and can lead the evolution of music in ways that AI, which often learns from existing data, might not predict or initiate. Even if we cannot foresee the final state of AI, we are confident that talented human composers will still be able to create unique and unparalleled works in the near future.

Does this procedure violate someone's interest?

We believe this project is most useful for people who are completely new to music and its expression. Once they write a few words, we analyze the sentiment of the text and then generate music to match. This is a non-commercial application of deep learning, so we expect that mainly students will draw ideas from our project. We hope others will develop future improvements and contribute to the music world! Peace :)

Division of Labor

Yuhong Zhang: GPT prompt engineering for text analysis; BERT model
Xiaoyan Liu: Music data preprocessing; LSTM model (deprecated); VAE model
Ziyu Wang: Related writing work and paper investigation; GitHub maintenance; GAN model
Special thanks to TA Emily Wang for all her help