Introduction

Our project involves building a model that generates rich captions, or descriptions, for music clips. It started out as a reimplementation of the MusicLM paper, which generates music from text descriptions, but we reversed the direction of the task: generating text from music was more feasible within our timeline, and we thought it would be interesting to extend the image-captioning ideas from previous class projects to audio. Our model takes in music clips (as WAV files) and generates captions describing the music. The architecture is an adaptation of the one we used in Homework 5, since audio captioning is similar to image captioning, though with significantly different pre-processing requirements and with data that behaves quite differently. We implemented the model in TensorFlow and trained it on the MusicCaps dataset, which was released alongside our reference paper (MusicLM).

Related Work

We looked at a paper called “MusCaps: Generating Captions for Music Audio”. Unlike the transformer-based model we are implementing, MusCaps uses an encoder-decoder architecture built from CNNs and LSTMs: a multimodal CNN-LSTM encoder paired with an LSTM decoder. MusCaps still uses attention, as has become standard for captioning models, but it does not rely on attention alone the way a Transformer does.

Data

We're using MusicCaps, which was curated by the authors of the MusicLM paper to evaluate their model; we split it and use it for both training and evaluation. The dataset stores YouTube IDs for music clips, each paired with a corresponding caption. We obtained the audio from those clips, converted it into acoustic tokens, and converted the captions into text tokens. The authors provide example scripts for the individual steps, which we stitched together to use the data.
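
To make the pairing concrete, here is a minimal sketch of reading the metadata. The file name and column names (ytid, start_s, end_s, caption) reflect the Kaggle release as we understand it and should be treated as assumptions, not a definitive description of the data loading we used.

```python
import pandas as pd

# Hypothetical file/column names for the Kaggle release of MusicCaps.
df = pd.read_csv("musiccaps-public.csv")

row = df.iloc[0]
print(row["ytid"], row["start_s"], row["end_s"])  # YouTube ID and 10-second window
print(row["caption"])                              # free-text caption to be tokenized

# Each row points at a 10-second YouTube segment; the audio itself still has to
# be downloaded and trimmed before tokenization (see Methodology).
```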

Methodology

We trained the model on the MusicCaps dataset, which stores YouTube URLs for music clips paired with corresponding text-based captions. We had to preprocess the data to extract the audio samples from the YouTube URLs (using yt-dlp) and then split our audio segments into acoustic tokens and our captions into text tokens, as sketched below. Much of our preprocessing code was inspired by the scripts on the dataset's Kaggle page, written by the authors of the MusicLM paper.
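
A rough sketch of the download-and-trim step is below. It uses yt-dlp's Python API and ffmpeg directly; our actual pipeline followed the authors' Kaggle scripts, and the helper name and option values here are illustrative only.

```python
import subprocess
from yt_dlp import YoutubeDL

def fetch_clip(ytid: str, start_s: float, end_s: float, out_path: str) -> None:
    """Download a video's audio as WAV with yt-dlp, then trim the captioned window with ffmpeg."""
    ydl_opts = {
        "format": "bestaudio/best",
        "outtmpl": f"{ytid}.%(ext)s",
        "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "wav"}],
        "quiet": True,
    }
    with YoutubeDL(ydl_opts) as ydl:
        ydl.download([f"https://www.youtube.com/watch?v={ytid}"])

    # Cut out the captioned 10-second segment from the full-length audio.
    subprocess.run(
        ["ffmpeg", "-y", "-i", f"{ytid}.wav",
         "-ss", str(start_s), "-to", str(end_s), out_path],
        check=True,
    )
```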

Our model follows the structure of the model from Homework 5: a decoder-only transformer with an audio embedding layer, relative positional encodings for the captions, a single transformer block with a hidden size of 256 and one attention head, and a dense classification layer. Our window size was 40 and our learning rate was 0.001. We erred on the side of simplicity with this model to make sure we had a solid baseline to build on in the future, and to ensure we could iterate quickly on model changes.
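
Concretely, the decoder looks roughly like the following Keras sketch. Only the hidden size of 256, the single attention head, the window size of 40, and the learning rate of 0.001 come from the setup above; the vocabulary sizes, the audio-token sequence length, the learned (rather than relative) positional embeddings, and the exact residual/cross-attention wiring are assumptions made for illustration.

```python
import tensorflow as tf

WINDOW_SIZE = 40     # caption window size from our training setup
HIDDEN_SIZE = 256    # transformer hidden size
VOCAB_SIZE = 10000   # hypothetical caption vocabulary size
AUDIO_LEN = 500      # hypothetical number of acoustic tokens per clip
AUDIO_VOCAB = 1024   # hypothetical acoustic-token codebook size

# Inputs: the clip's acoustic tokens and the (shifted) caption window.
audio_tokens = tf.keras.Input(shape=(AUDIO_LEN,), dtype=tf.int32)
caption_tokens = tf.keras.Input(shape=(WINDOW_SIZE,), dtype=tf.int32)

# Audio embedding layer: projects acoustic tokens into the hidden size.
audio_emb = tf.keras.layers.Embedding(AUDIO_VOCAB, HIDDEN_SIZE)(audio_tokens)

# Caption embeddings plus learned positional embeddings (standing in here for
# the relative positional encodings mentioned above).
cap_emb = tf.keras.layers.Embedding(VOCAB_SIZE, HIDDEN_SIZE)(caption_tokens)
positions = tf.range(WINDOW_SIZE)
cap_emb = cap_emb + tf.keras.layers.Embedding(WINDOW_SIZE, HIDDEN_SIZE)(positions)

# One transformer block: causal self-attention over the caption, cross-attention
# to the audio embeddings, then a feed-forward layer, each with residual + norm.
self_attn = tf.keras.layers.MultiHeadAttention(num_heads=1, key_dim=HIDDEN_SIZE)(
    cap_emb, cap_emb, use_causal_mask=True)  # use_causal_mask needs TF >= 2.10
x = tf.keras.layers.LayerNormalization()(cap_emb + self_attn)
cross_attn = tf.keras.layers.MultiHeadAttention(num_heads=1, key_dim=HIDDEN_SIZE)(
    x, audio_emb)
x = tf.keras.layers.LayerNormalization()(x + cross_attn)
ff = tf.keras.layers.Dense(HIDDEN_SIZE, activation="relu")(x)
x = tf.keras.layers.LayerNormalization()(x + ff)

# Dense classification layer over the caption vocabulary.
logits = tf.keras.layers.Dense(VOCAB_SIZE)(x)

model = tf.keras.Model([audio_tokens, caption_tokens], logits)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```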

Metrics

Since our model tackles a captioning task, similar to image captioning, we want to evaluate it with one of the many standard captioning metrics. After some consideration, we decided to use ROUGE-1, since it is relatively uncomplicated and gives a good sense of how closely a generated caption mirrors the reference caption written by a human. Our base goal is a ROUGE-1 score of around 0.35; as target and stretch goals, we would also evaluate with higher-order variants such as ROUGE-2 or ROUGE-3, and aim for scores of 0.5 or higher.
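
For reference, ROUGE-1 reduces to clipped unigram overlap between the generated and reference captions. The snippet below is a minimal standalone version (in practice a library such as rouge-score could be used instead); the caption pair in the example is made up.

```python
from collections import Counter

def rouge_1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: unigram overlap between a reference caption and a generated one."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in the reference.
    overlap = sum(min(ref_counts[w], cand_counts[w]) for w in cand_counts)
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref_counts.values())
    precision = overlap / sum(cand_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge_1_f1(
    "a slow acoustic guitar melody with soft vocals",
    "slow guitar melody with gentle vocals",
))  # about 0.71 for this toy pair
```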

Ethics

  • Our dataset is the MusicCaps dataset that the authors of our reference paper have curated. It contains over 5000 music examples, each of which is labeled with a list of aspects or tags and a free-text caption written by musicians the authors hired. The underlying music itself comes from the music examples in Google’s AudioSet dataset, which are 10-second YouTube clips of non-copyrighted music.
  • It’s encouraging that the audio clips were tagged by paid human annotators, and that the free-text captions were written specifically by musicians. It’s also nice that the music chosen for these datasets doesn’t appear to be copyrighted to begin with, so no music is being used for training that isn’t already in the public domain. There is, however, the question of data provenance and stewardship, and of whether the people whose music is being used are even aware that Google might be profiting from their work.
  • Representation is also a plausible issue here. The captions and tags that the paper deals with are all in English, making it harder for non-English speakers to engage with the model. This could also lead to geographical bias: because the annotators are all English speakers, the kinds of music the model is exposed to are limited, as is the vocabulary it uses to describe music and the range of music it can describe. For example, Carnatic or Hindustani classical music may conceive of rhythm and melody differently from Western classical music, but the way the annotators were recruited may mean this diversity goes unrecognized.

Stakeholders + Consequences of Mistakes

Major stakeholders in this problem may include:

  • Record labels
  • Music writers
  • Advertisers
  • The general public
  • People with hearing impairments who may want rich captions for music
  • Poor performance could mislead hearing-impaired users, who are the most likely to rely on a music captioning tool. If a caption is inaccurate, their impression of the music would differ from everyone else's, causing confusion when they discuss the music with others.

Division of Labor

We mostly worked together, but the division of labor roughly boiled down to:

  • Anushka: data preprocessing
  • Ahad: bridging paper and implementation
  • Swetabh + Nasir: writing implementation
