Introduction
Emotions are at the heart of human experience, and central to many types of artistic expression. Art forms as distinct as music and the written word can both reach back to the same universal emotional landscape. For our final project, we have created a tool to select music which is emotionally consonant with a written work. Using this tool, a reader would be provided with music which resonates with the mood of the work being read. Many people enjoy listening to music while reading, and matching the emotionality of music to the text might make that experience more pleasant.
Our motivation to choose this project stemmed from the observation that some of the most interesting projects in deep learning address problems which are simple and relatable, but whose solutions may be multifaceted and complex. We arrived at the idea organically in a moment of inspiration during one of our brainstorming sessions. After some research, we learned that other groups have attempted to solve similar problems, although no single team has attempted to solve exactly this problem. We decided to expand upon their research to make our own functional tool.
In this project, we have built a system which takes in a user-provided sentence and returns a 10-second music clip containing the same emotion as the original sentence. This system connects two separate internal models based on neural networks, one for the text and the other for music, through the use of a shared embedding space.
Related Work
The most relevant papers to this project include Won et al. (2021), Zhao et al. (2020) and Koh & Dubnov (2021). Two of these papers address the “translation” of emotions from a different modality to music, and the third focuses on the process of categorizing emotionality in music. Won et al. (2021) focus on the cross-modal matching of text to music. Both the text and music datasets are annotated with emotion labels, and an aligned, multimodal embedding space is used to retrieve music that is emotionally consonant with the text.
On the other hand, Zhao et al. (2020) focus on connecting image and audio data. They use cross-modal deep continuous metric learning (CDCML) to learn a shared latent embedding space that contains cross-modal emotional similarity, and it is refined to a vector in VA (valence-arousal) space. Valence and arousal, along with dominance, are a well-established basis for the emotional embeddings.
Audio data is inherently hard to use due to its high dimensionality. However, Koh and Dubnov (2021) have demonstrated that – at least for the emotion-recognition problem – complicated operations on raw audio data are not required. In fact, they showed that applying image-processing techniques to the audio’s corresponding spectrogram is sufficient for making accurate emotional predictions.
Methodology
Dataset
Emotion-tagged Text Data
The GoEmotions dataset (Demszky et al., 2020) is a collection of more than 50,000 Reddit comments in English, each manually classified into one of 27 emotional categories or the neutral category. The dataset is available through TensorFlow Datasets (https://www.tensorflow.org/datasets/catalog/goemotions), from which it can be conveniently loaded and preprocessed for further analysis.
Emotion-tagged Music Data
Google AudioSet (Gemmeke et al., 2017) is a list of URL links to a large number of manually labeled audio segments from YouTube videos. The audio segments are drawn from 2,084,320 uploaded videos, and are all 10 seconds long, or the entire length of the video if it is less than 10 seconds. These audio segments are classified into 527 labels, and each audio segment can have one or more labels. For example, an audio clip can have both "Happy music" and "Christmas music" labels. The labels are also hierarchically structured, under what is called the AudioSet Ontology. For example, under the "Vehicle" class, there are "Boat, Water vehicle", "Motor vehicle (road)", "Rail transport", "Aircraft", and "Non-motorized land vehicle" classes.
Instead of actual audio excerpts from the YouTube videos, AudioSet offers 128-dimensional embeddings from their VGGish model. VGGish is a repurposed version of the original VGG model (Simonyan and Zisserman, 2014). However, instead of image data, it takes as input a Mel-spectrogram of size 64 by 96, representing 64 frequency bins and 96 ten-millisecond time windows from a 960 ms audio clip, as in the convolution-based models of Hershey et al. (2017). After processing the input through its 11 layers, which involve multiple convolution and pooling operations, the model outputs 128-dimensional embeddings, which can then be used for further audio classification tasks.
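The input sizes quoted above can be checked with a few lines of arithmetic. The sketch below only restates the numbers in the text; the assumption that consecutive 960 ms patches do not overlap is ours, and the variable names are illustrative.

```python
# Illustrative arithmetic only; the constants restate the numbers in the text.
FRAME_HOP_MS = 10        # each spectrogram column covers a 10 ms time window
FRAMES_PER_PATCH = 96    # 96 columns per VGGish input patch
MEL_BINS = 64            # 64 frequency bins per column
EMBEDDING_DIM = 128      # size of one VGGish output vector
CLIP_MS = 10_000         # AudioSet segments are at most 10 seconds long

# Duration of audio covered by one VGGish input patch.
patch_ms = FRAME_HOP_MS * FRAMES_PER_PATCH

# Assumption: patches do not overlap, so only whole patches are counted.
patches_per_clip = CLIP_MS // patch_ms

# One embedding per patch gives the per-clip feature array.
clip_feature_shape = (patches_per_clip, EMBEDDING_DIM)

print(patch_ms)            # 960
print(patches_per_clip)    # 10
print(clip_feature_shape)  # (10, 128)
```

Under that assumption, a full 10-second segment is summarized by a 10 by 128 array of embeddings.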
The NRC-VAD Lexicon
Mohammad (2018) has mapped about 20,000 English words to the Valence, Arousal, and Dominance space according to the emotions associated with each word.
The license forbids redistribution of the lexicon, but non-commercial use for academic and research purposes is allowed. In our project, we quote only a tiny portion of the lexicon and hard-code it directly into the Python code. A more conventional approach would be to look up words in a text file containing either the entire lexicon or a relevant portion of it. However, we purposefully do not keep any portion of the original lexicon as a separate file, since it could, against our intentions, be easily redistributed.
Model Architecture
Overall Architecture
Our approach largely follows that of Won et al. (2021). We intend to employ the idea of a shared latent embedding space as proposed by Zhao et al. (2020) to build a model that connects two separate datasets of different modalities. Won et al. have experimented with both the direct use of the valence-arousal space as a shared embedding space and with the construction of a brand-new embedding space to bridge the text and music data.
However, we should note that our goal is to build our own novel architecture to map the text and music data to the shared embedding space, and we do not intend to replicate the exact details of Won et al.'s models. Rather than using the pretrained BERT model for text analysis and spending significant time building a customized ResNet music model with the MagnaTagATune, Million Song Dataset, and MTG-Jamendo datasets, we seek to design a more size-efficient model that can be trained with a smaller dataset.
Similar to Won et al.’s model, our model also has two sub-models, one for text and the other for music. The text sub-model maps a text input into a vector in the VAD emotional space, and the music sub-model does the same for a music input. The two sub-models are trained in a similar fashion to other sentiment analysis models. After the training is finished, the two sub-models are combined into a single model, which takes in a text input and matches it with the closest music in the shared VAD space. We can measure the accuracy of the text-music matching by feeding the text and music data from the training set into the model and checking whether the data were correctly matched by examining their original emotion labels. Since the original emotion labels for the text and music data are based on different taxonomies, a separate dictionary was created to convert a text label to its equivalent label in the music dataset. The equivalent labels are based on the Euclidean distance in the VAD space.
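As a minimal sketch of how such a label-conversion dictionary could be derived, the snippet below assigns each text label to the music label with the nearest VAD point. The coordinates are placeholders invented for illustration; they are not values from the NRC-VAD Lexicon, and the names `TEXT_VAD`, `MUSIC_VAD`, and `nearest_label` are hypothetical.

```python
import math

# Placeholder VAD coordinates in [0, 1]; invented for illustration,
# NOT taken from the NRC-VAD Lexicon.
TEXT_VAD = {
    "joy": (0.95, 0.70, 0.80),
    "sadness": (0.10, 0.30, 0.20),
    "anger": (0.15, 0.85, 0.60),
}
MUSIC_VAD = {
    "happy": (0.90, 0.65, 0.75),
    "sad": (0.15, 0.25, 0.25),
    "angry": (0.10, 0.90, 0.65),
    "tender": (0.75, 0.25, 0.45),
}


def nearest_label(point, candidates):
    """Return the candidate label whose VAD point is closest (Euclidean)."""
    return min(candidates, key=lambda label: math.dist(point, candidates[label]))


# Text label -> equivalent music label, by nearest neighbour in VAD space.
text_to_music = {
    text_label: nearest_label(vad, MUSIC_VAD)
    for text_label, vad in TEXT_VAD.items()
}
print(text_to_music)
```

With these placeholder coordinates, each text label lands on the music label one would intuitively expect; with real lexicon values the mapping may of course differ.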
Text Model
To build the text model, the text dataset had to be tokenized first. So, we have built a dictionary by keeping only the words that have appeared at least 15 times in the training set. Out of the total 25,638 distinct words in the training set, 2,678 words were retained in the dictionary. Then, any words that were not included in the dictionary were replaced with the *unk* token, and all the Reddit comments were padded to have a uniform window size of 30 words. After this, the proportion of the *unk* token in the tokenized comments was about 4.578% in the train set and 4.688% in the test set. A window size of 30 words was chosen because there were fewer than 10 comments longer than 30 words in the training set.
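The preprocessing described above can be sketched in plain Python as follows. This is an illustration under assumptions: whitespace splitting, truncation of over-long comments, and the padding token name `*pad*` are our choices, and the toy corpus uses a much smaller threshold and window than the 15 occurrences and 30 words used in the project.

```python
from collections import Counter

UNK = "*unk*"
PAD = "*pad*"  # hypothetical padding token name


def build_vocab(comments, min_count):
    """Keep words that occur at least min_count times across all comments."""
    counts = Counter(word for comment in comments for word in comment.split())
    return {word for word, n in counts.items() if n >= min_count}


def tokenize(comment, vocab, window):
    """Replace out-of-vocabulary words with UNK, then truncate or pad to window."""
    tokens = [w if w in vocab else UNK for w in comment.split()][:window]
    return tokens + [PAD] * (window - len(tokens))


# Toy corpus; the report uses min_count=15 and window=30 on GoEmotions.
comments = ["i love this", "i love cats", "this is rare"]
vocab = build_vocab(comments, min_count=2)
print(tokenize("i love rare cats", vocab, window=6))
# ['i', 'love', '*unk*', '*unk*', '*pad*', '*pad*']
```

In a real pipeline each token would then be mapped to an integer index before being passed to the embedding layer.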
The text sub-model first maps word tokens to vectors in the word embedding space. The word embeddings are then fed into a bidirectional LSTM layer. The word embedding size is 48 and the LSTM size is 24 in this model. The LSTM output is then fed into a dense layer of size 24 with a leaky ReLU activation function, followed by a final dense layer of size 3 with a sigmoid activation function. Thus, the final output is a 3-dimensional coordinate with each component between 0 and 1, which can be interpreted as a VAD coordinate.
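To give a feel for how small this network is, the sketch below estimates its number of trainable parameters from the layer sizes in the text. Several details are assumptions rather than facts from the project: the vocabulary is taken to be the 2,678 retained words plus two special tokens, the LSTM is a standard one with a single bias vector per gate, and the forward and backward outputs are concatenated before the dense layer.

```python
def lstm_params(input_dim, units):
    """Parameters of one standard LSTM direction: 4 gates, each with
    input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus bias of a fully connected layer."""
    return input_dim * units + units


VOCAB_SIZE = 2678 + 2   # assumption: retained words plus *unk* and a padding token
EMBED_DIM, LSTM_UNITS, HIDDEN, VAD_DIM = 48, 24, 24, 3

embedding = VOCAB_SIZE * EMBED_DIM
bilstm = 2 * lstm_params(EMBED_DIM, LSTM_UNITS)   # forward + backward directions
# Assumption: both directions are concatenated, giving 2 * LSTM_UNITS features.
hidden = dense_params(2 * LSTM_UNITS, HIDDEN)
output = dense_params(HIDDEN, VAD_DIM)

total = embedding + bilstm + hidden + output
print(embedding, bilstm, hidden, output, total)
```

Under these assumptions the embedding table dominates the count, and the whole sub-model stays well under a million parameters.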
Music Model
The input data for the music sub-model are VGGish embeddings from AudioSet. Each 10-second clip is represented by ten 128-dimensional VGGish embeddings, roughly one per second of audio, which can be interpreted as an image of pixel size 10 by 128. In the model, the VGGish input is fed into three convolutional layers.
Then, the output from the convolutional layers is fed into a dense layer of size 32 with a leaky ReLU activation function, and a final dense layer of size 3 with a Sigmoid activation function. Again, the final output is a 3-dimensional coordinate between 0 and 1, which can be interpreted as a VAD coordinate.
Metrics
The model will be judged by its ability to appropriately match text to music by mood.
A continuous measure that can also serve as the loss function is the Euclidean distance in VAD space, or the cosine similarity in a custom-made embedding space. The distance between a matching pair of text and music should be significantly shorter than the distance between a random pair of text and music without any emotional correlation.
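Both measures are simple to compute. The sketch below shows them side by side, together with a nearest-neighbour lookup that picks the music clip closest to a text prediction. The clip names and coordinates are hypothetical stand-ins for model outputs, not results from the project.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.dist(a, b)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_clip(text_vad, clips):
    """Return the id of the clip whose VAD point is nearest to text_vad."""
    return min(clips, key=lambda clip_id: euclidean(text_vad, clips[clip_id]))


# Hypothetical predicted VAD points for three music clips and one sentence.
clips = {
    "clip_a": (0.90, 0.70, 0.70),
    "clip_b": (0.20, 0.30, 0.30),
    "clip_c": (0.20, 0.90, 0.60),
}
text_vad = (0.85, 0.65, 0.75)

best = closest_clip(text_vad, clips)
print(best, round(euclidean(text_vad, clips[best]), 3))
```

A matching pair such as the sentence and `clip_a` here sits much closer together than the sentence and either of the other clips, which is the behaviour the metric is meant to reward.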
One of the evaluation metrics can be categorical classification accuracy. In the AudioSet data, there are only seven different categories of musical emotion (happy, funny, sad, tender, exciting, angry, scary). However, since the emotions “exciting” and “happy” are very close to each other in the VAD space, we decided to lump them together into the “happy” category. Similarly, “angry” and “scary” were lumped together into the “angry” category.
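The lumping of music labels amounts to a small lookup table. A minimal sketch, using only the two merges stated above (the function and variable names are illustrative):

```python
# Merges described in the text; every other label maps to itself.
MUSIC_LABEL_MERGES = {"exciting": "happy", "scary": "angry"}


def lump_music_label(label):
    """Map an original AudioSet emotion label to its lumped category."""
    return MUSIC_LABEL_MERGES.get(label, label)


original = ["happy", "funny", "sad", "tender", "exciting", "angry", "scary"]
lumped = sorted({lump_music_label(label) for label in original})
print(lumped)  # ['angry', 'funny', 'happy', 'sad', 'tender']
```

The seven original labels collapse into five categories.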
Distinguishing between all 28 emotion categories for the GoEmotions dataset can be a confusing task, as the VAD space gets very crowded. Thus, we decided to lump similar emotions together by their Ekman classification, as was done in the original paper by Demszky et al. (2020).
We then calculated the micro- and macro-averaged precisions over the lumped emotion categories with the same methodology as used in the Won et al. (2021) and Demszky et al. (2020) papers for the Google AudioSet and the GoEmotions dataset.
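For readers unfamiliar with the two averages, here is a minimal sketch of how micro- and macro-averaged precision can be computed for single-label predictions. It is a generic textbook formulation and may differ in detail from the exact procedures in the cited papers; the toy labels are invented, and counting a never-predicted class as zero precision is a convention chosen for this sketch.

```python
from collections import Counter


def averaged_precisions(y_true, y_pred):
    """Return (micro, macro) precision for single-label predictions."""
    true_pos = Counter()   # correct predictions per predicted class
    predicted = Counter()  # all predictions per predicted class
    for truth, pred in zip(y_true, y_pred):
        predicted[pred] += 1
        if truth == pred:
            true_pos[pred] += 1

    labels = sorted(set(y_true) | set(y_pred))
    # Micro: pool all predictions across classes before dividing.
    micro = sum(true_pos.values()) / sum(predicted.values())
    # Macro: unweighted mean of per-class precision; a class that is
    # never predicted counts as 0.0 here (a convention for this sketch).
    per_class = [
        true_pos[label] / predicted[label] if predicted[label] else 0.0
        for label in labels
    ]
    macro = sum(per_class) / len(per_class)
    return micro, macro


# Toy, imbalanced example: "fear" is rare and never predicted.
y_true = ["joy", "joy", "joy", "fear", "sadness", "sadness"]
y_pred = ["joy", "joy", "sadness", "joy", "sadness", "sadness"]
micro, macro = averaged_precisions(y_true, y_pred)
print(round(micro, 3), round(macro, 3))
```

The toy example shows why the two numbers can diverge: a rare class that the model handles badly pulls the macro average down while barely affecting the micro average.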
Results
Text Model
Our text model is capable of predicting the Ekman emotional category from the text dataset with about 55% accuracy. To obtain a more reliable measure of the accuracy, we initialized and trained the model ten times, because the model performance varies slightly from run to run due to randomness in the weight initialization and the stochastic training process. Over the ten runs, the micro-averaged precision was 0.548 with a standard deviation of 0.014, and the macro-averaged precision was 0.554 with a standard deviation of 0.010.
These micro- and macro-averaged precisions look reasonable at face value, especially when compared to the results from the much more complicated model by Demszky et al. (2020). However, our model suffers from a severe class imbalance: it is good at predicting the “joy” and “neutral” categories, which have more training data, but bad at predicting the “fear” category, which has less.
Music Model
Our music model is capable of predicting the emotional label from the AudioSet with about 50% accuracy. To control for the randomness, we have re-trained the model ten times, and the micro-averaged precision was 0.559 with a standard deviation of 0.004 and the macro-averaged precision was 0.408 with a standard deviation of 0.028. The model performance is better for the more intense emotions of "happy" or "angry" than for the less intense emotion of "tender".
Text-Music Connection
When we connect the text model and the music model, the connected model is capable of making accurate predictions about one third of the time. After re-training the entire model ten times, the micro-averaged precision was 0.365 with a standard deviation of 0.057 and the macro-averaged precision was 0.422 with a standard deviation of 0.036.
Again, the model is not as good at predicting the "tender" category as it is at predicting the more intense emotions of "happy" and "angry".
Our model does not perform as well as most of Won et al.’s models. However, our model has a much simpler architecture with significantly fewer parameters.
Challenges
The progress so far on this project can be roughly divided into three steps: 1. Obtaining and preprocessing data, 2. Building the models, and 3. Evaluating model performance. Even though the second step – where we actually built the model – was the most important, the first and last steps were more difficult.
Obtaining and Preprocessing the Data
Our primary problem was that we did not have enough time to create our own datasets, so we had to make do with whichever datasets were available. There was no publicly available dataset in which text and music are paired together. As a workaround, we had to find an emotion-labeled text dataset and an emotion-labeled music dataset and connect them together through the emotion labels.
Finding the text dataset was relatively easy. There are not many datasets with sentence-to-emotion matching, but the GoEmotions dataset, which is included in the TensorFlow Datasets package, fits this description perfectly. In contrast, finding an emotion-labeled music dataset was quite difficult. Music sentiment analysis has not been as widely researched and reported on as text sentiment analysis. It was only the serendipitous discovery of the Google AudioSet that saved the project. Google AudioSet was not designed specifically for sentiment analysis, but is intended for the far broader task of automatically labeling the audio content of a YouTube video.
The next challenge was to download the AudioSet and preprocess it into a format that is useful for our project. We couldn’t download all the YouTube videos in the dataset and extract their audio tracks; YouTube has made it purposefully difficult to do so, and we would have run the risk of having our IP addresses banned by YouTube. Instead, we had to download AudioSet’s own encodings of the audio tracks, called the VGGish encodings, and extract the music-related portions. We had to learn the idiosyncratic way that VGGish encodings are organized, which is hardly straightforward. We also had to teach ourselves the Protocol Buffers (protobuf) format, a serialization format developed by Google for exchanging structured data between programs written in different languages.
Building the Models
This part was relatively easy, probably thanks to the high-level syntax of TensorFlow and Keras. It was amazing to see how much a few convolution, LSTM, and dense layers could achieve.
One way of interpreting the unweighted loss is as a root mean squared distance between the true and predicted coordinates in the 3D VAD space. Even with a single bi-directional LSTM layer, the music model could bring this RMS distance to around 0.3, which is enough to distinguish between the five major emotion categories used in the Won et al. paper (“fun”, “happy”, “sad”, “angry”, and “tender”).
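The RMS distance mentioned above can be written out explicitly. The sketch below computes it for a couple of made-up points and shows how it relates to a mean squared error averaged over both samples and coordinates. Whether the project's loss was averaged in exactly this way is an assumption; the function names and sample values are illustrative.

```python
import math


def rms_distance(true_points, pred_points):
    """Root of the mean squared Euclidean distance over all samples."""
    squared = [math.dist(t, p) ** 2 for t, p in zip(true_points, pred_points)]
    return math.sqrt(sum(squared) / len(squared))


def mse_per_coordinate(true_points, pred_points):
    """Mean squared error averaged over samples AND coordinates."""
    errors = [
        (t_i - p_i) ** 2
        for t, p in zip(true_points, pred_points)
        for t_i, p_i in zip(t, p)
    ]
    return sum(errors) / len(errors)


# Made-up VAD targets and predictions for two samples.
true_points = [(0.5, 0.5, 0.5), (0.2, 0.8, 0.4)]
pred_points = [(0.6, 0.5, 0.5), (0.2, 0.8, 0.1)]

rms = rms_distance(true_points, pred_points)
mse = mse_per_coordinate(true_points, pred_points)
print(round(rms, 4), round(math.sqrt(3 * mse), 4))
```

If the loss is averaged over the three VAD coordinates, the RMS distance is the square root of three times that loss, so the two printed values agree.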
Evaluating Model Performance
After building the models, we had to evaluate their performance so as to make a direct comparison with other models previously reported in the literature. However, this process was less straightforward than expected. First, it required another round of data wrangling; second, every paper has its own way of evaluating models. Eventually, we concluded that the most logical metrics to use are those used by Won et al. and Demszky et al. for the Google AudioSet and the GoEmotions dataset, respectively.
Reflection
Ethics
As with most topics in Deep Learning, it is prudent to make a survey of the ethical implications of this technology. Two potential concerns are the displacement of human artists, and the potential for emotional manipulation. We will address these issues in order.
The selection of music to serve as an accompaniment to other media is a deeply creative task that requires skill and artistic insight. Although matching music to text is not necessarily a common undertaking in our current cultural landscape, we can imagine this type of multimodal emotional mapping could be more widely applicable to tasks such as setting music to video, a very common task undertaken by humans. Automating these tasks risks depriving human creative professionals of work. If this technology is very successful, the displacement of human artists is a potential consequence.
Furthermore, we should acknowledge the potential for this technology to be used for emotional manipulation of the audience. Filmmakers have long understood the power of music to set a mood – music can be used to great effect in films to evoke specific emotions. However, the evocation of an emotional response is not always for simple artistic effect. Propaganda will often play on the audience’s emotions, and harmful, controlling groups such as cults are known for making emotional appeals to their followers. By providing a tool for making more emotionally resonant media, we could potentially facilitate this type of emotional control over audiences. Great care should be taken to avoid the use of this technology in circumstances where a person’s agency could be compromised.
Insights
The music model can correctly identify the emotion in a piece of music with 54-56% macro-averaged accuracy. To be more specific, this accuracy measure is the macro-averaged precision over five major emotion categories, which is exactly the accuracy metric that Won et al. have used.
Also, the text model’s accuracy is around 54-56% micro-averaged, but only 35-40% macro-averaged for the Ekman emotion categories. The reason for the discrepancy is the class imbalance. There are about 40 times more Reddit comments in the “Joy” category than in the “Fear” category, for example.
When the text and music models are combined, the micro-averaged precision is around 36.5% with a standard deviation of 5.7%. Even so, it should be emphasized that these accuracy numbers are higher than we had expected. The mean precision that Won et al. have reported was around 0.4 to 0.5. Furthermore, we should note that Won et al.’s model is doing the entire process of converting text to music, while our music model is simply doing a sentiment analysis on music. However, not only is Won et al.’s model much larger than ours, but it also has a more complicated architecture, and was trained with much more computing power.
We were able to build two models which share a single Valence-Arousal-Dominance space for emotional embedding. Embeddings of more familiar emotions, such as "joy," are translated into the VAD space via hard-coded embeddings taken from Saif Mohammad's 2018 NRC-VAD Lexicon. Future work may focus on replacing the VAD space with a custom embedding space, and/or allowing emotional embeddings to be a learned parameter.
References
- Demszky, D., Movshovitz-Attias, D., Ko, J., Cowen, A., Nemade, G., & Ravi, S. (2020). GoEmotions: A dataset of fine-grained emotions. arXiv preprint arXiv:2005.00547. https://doi.org/10.48550/arXiv.2005.00547
- Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., ... & Ritter, M. (2017, March). Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 776-780). IEEE. https://doi.org/10.1109/ICASSP.2017.7952261
- Hershey, S., Chaudhuri, S., Ellis, D. P., Gemmeke, J. F., Jansen, A., Moore, R. C., ... & Wilson, K. (2017, March). CNN architectures for large-scale audio classification. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 131-135). IEEE. https://doi.org/10.1109/ICASSP.2017.7952132
- Koh, E., & Dubnov, S. (2021). Comparison and analysis of deep audio embeddings for music emotion recognition. arXiv preprint arXiv:2104.06517. https://doi.org/10.48550/arXiv.2104.06517
- Mohammad, S. (2018, July). Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 174-184). https://doi.org/10.18653/v1/P18-1017
- Won, M., Salamon, J., Bryan, N. J., Mysore, G. J., & Serra, X. (2021). Emotion Embedding Spaces for Matching Music to Stories. arXiv preprint arXiv:2111.13468. https://doi.org/10.48550/arXiv.2111.13468
- Zhao, S., Li, Y., Yao, X., Nie, W., Xu, P., Yang, J., & Keutzer, K. (2020, October). Emotion-based end-to-end matching between image and music in valence-arousal space. In Proceedings of the 28th ACM International Conference on Multimedia (pp. 2945-2954). https://doi.org/10.1145/3394171.3413776
GitHub Repo: https://github.com/yeununchoo/AmbianceNet
Final Report: https://docs.google.com/document/d/1xMr1ItA2e8WMfBUx--ujDjfWEsMwcEavpcnO8MUNUWY/edit?usp=sharing
Presentation: https://docs.google.com/presentation/d/1o4vLiuArIAh38u0PHQkzOfihuz0LImBdc1sSqboYdnE/edit?usp=sharing
Poster: https://docs.google.com/presentation/d/1iRcjiRDMbhgkKYCu195MJ2r7qaTdyEIO2d_aBsqF19g/edit?usp=sharing