Title
Classical Music Generation
Team Members
Zhuoyang Lyu(zlyu12), Yixiang Sun(ysun133), Zixuan Guo(zguo47)
Introduction
We are a group of music lovers and musicians, so naturally we thought about making music with what we learned about deep learning this semester. We noticed a strong resemblance between language generation and music generation: replace individual words with musical notes and the task looks very similar. We therefore decided to build a model that can learn from given music data and generate more music. This is a structured prediction problem similar to classic NLP tasks.
Data
We found a publicly available dataset on Kaggle: MusicNet, a curated collection of labeled classical music (https://www.kaggle.com/datasets/imsparsh/musicnet-dataset). “MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition. The labels are acquired from musical scores aligned to recordings by dynamic time warping. The labels are verified by trained musicians; a labeling error rate of 4% has been estimated. The MusicNet labels are offered to the machine learning and music communities as a resource for training models and a common benchmark for comparing results.” We plan to use only the annotated notes as our data, not the raw audio files. We will need to embed each note into an embedding space that captures not only its pitch but also its length and possibly its instrument.
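To make the embedding concrete, here is a minimal sketch of turning a MusicNet label file into per-note feature vectors of pitch, duration, and instrument; the column names follow the MusicNet label CSVs, and the example file path is only illustrative.

```python
# Sketch: build (pitch, duration, instrument) feature vectors from a MusicNet label CSV.
# Column names follow the MusicNet label format; the path below is illustrative.
import numpy as np
import pandas as pd

def load_note_features(csv_path):
    labels = pd.read_csv(csv_path)
    pitch = labels["note"].to_numpy()                                  # MIDI note number
    duration = (labels["end_time"] - labels["start_time"]).to_numpy()  # length in the dataset's time units
    instrument = labels["instrument"].to_numpy()                       # instrument id
    order = labels["start_time"].argsort().to_numpy()                  # sort notes by onset
    return np.stack([pitch, duration, instrument], axis=1)[order]

notes = load_note_features("musicnet/train_labels/1727.csv")  # shape: (num_notes, 3)
```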
Methodology
Because of the resemblance between music data and natural language data, we decided that a recurrent neural network would be a natural fit, so we approached the problem with RNNs. We experimented with and trained multiple LSTMs with different setups on different datasets and compared their outputs. We used not only the midi-track data but also the note feature embedding dataset, in both simplified and unsimplified forms. The midi-track data can directly produce midi output that is easy to evaluate and contains much more information. The note feature embedding, on the other hand, is easier to interpret but harder to train on and evaluate: information like composer and instrument, which does not affect the music itself as much as the notes do, is weighted as heavily as essential features like pitch and timing, making the dataset much more varied and complex. The result, a list of generated notes with features, cannot be played and listened to directly; we can only evaluate it quantitatively through model loss, not by its tonal quality, and we have to manually transcribe the music onto a sheet to hear it. Therefore, we also prepared a simplified dataset that discards all the unnecessary information and keeps only the essentials, so the output is a sequence of notes with only pitch and time. Transcribing that melody is much more straightforward.
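As a rough illustration, the next-note LSTMs we trained look something like the Keras sketch below; the vocabulary size, sequence length, and layer sizes are placeholders rather than our final settings.

```python
# Rough sketch of a next-note LSTM in Keras; sizes below are placeholders.
from tensorflow import keras

VOCAB_SIZE = 128   # e.g. one token per MIDI pitch
SEQ_LEN = 64       # number of previous notes the model conditions on

model = keras.Sequential([
    keras.layers.Embedding(VOCAB_SIZE, 96),          # note-token embedding
    keras.layers.LSTM(256),                           # summarize the note history
    keras.layers.Dense(VOCAB_SIZE, activation="softmax"),  # distribution over the next note
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# x: (num_sequences, SEQ_LEN) integer note tokens, y: (num_sequences,) next-note tokens
# model.fit(x, y, batch_size=64, epochs=10)
```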
Metrics
We will first train the model and experiment with generating music from scratch, and also with generating music given a starting line or measure from different genres. Since accuracy does not apply to our task, we will use perplexity while building our model, comparing performance across different hyperparameters. But perplexity is not very interpretable in real life, so we will also manually review the quality of the generated music. We will also attempt to include reviews from students taking music classes or concentrating in music, and comments from professors in the Music department on the generated pieces. Our base goal is a model that trains and is able to generate notes; our target is to generate music that has some tonality or musical value; our stretch goal is to generate complete pieces that could be performed in concerts.
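Because we train with a cross-entropy loss, perplexity can be read directly off the per-note loss; a small sketch of that conversion (the loss value here is just an example):

```python
# Perplexity is exp(mean cross-entropy in nats); the loss value is illustrative.
import math

mean_cross_entropy = 2.1                      # example per-note loss from model.evaluate
perplexity = math.exp(mean_cross_entropy)
print(f"perplexity = {perplexity:.2f}")       # ~8.17: the model is about as uncertain
                                              # as choosing uniformly among ~8 notes
```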
Ethics
This problem mainly concerns the creativity of human intelligence. If programs can someday learn written music as well as humans and produce music as well as human composers, how do we define creativity? Who should claim credit for those works? And how will the next generation be influenced if a large portion of what they take in is not made by humans? Although these questions sound far off, they are already happening around us: many songs today sample or remix earlier work, and conversational bots are widely used by companies for customer service. Beyond these social and humanistic concerns, we can also consider the stakeholders of this problem. Musicians and the people who support their creative process, such as sound engineers and instrument retailers, are certainly affected, as are the computer scientists who participate in the creation of music. Audiences will also be impacted, since their aesthetics will be shaped by computers' ability to rapidly generate large numbers of similar songs. Once such generation technology is widely used, humanity's ideas of art, value, and culture will be called into question.
Nov 30 Check In Update
Our implementation has two parts. One uses the midi tracks of the dataset and trains a model on midi data directly. The other extracts information from each note and trains a neural network on those extracted features. The first part, which uses midi data, was fairly straightforward because we referenced an online notebook doing a similar task and used a Python library focused on working with midi data. Encoding notes directly, on the other hand, is much more involved: we have to inspect the features offered by the dataset, do feature extraction, engineering, and preprocessing to encode the information of each note into a NumPy array to feed our neural network, and we also need to work out the reverse, a way to decode the generated result. We also need to tune our RNN to fit the data. It is still unclear how to judge the quality of the generated notes, since they are a collection of features that cannot be played directly; we may need to transcribe them into actual sheet music and play them to review their quality. Despite the uncertainty, we were able to look closely at our data and plan our experiments around its variety. We will test and compare the performance of models trained with different feature engineering and gain insights about the features of each note. We can also use the music generated by the midi model as a baseline and experiment with how the two models interact, feeding the output of one into the other.
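For the simplified pitch-and-time output, one way to make the generated notes playable is to decode them back into a MIDI file; the sketch below uses pretty_midi as an assumed library choice, not necessarily the one our midi-track pipeline relies on.

```python
# Hedged sketch: decode generated (pitch, start, duration) triples into a playable MIDI file.
# pretty_midi is one possible library choice here, used for illustration.
import pretty_midi

def notes_to_midi(generated_notes, out_path="generated.mid"):
    pm = pretty_midi.PrettyMIDI()
    piano = pretty_midi.Instrument(program=0)  # program 0 = Acoustic Grand Piano
    for pitch, start, duration in generated_notes:
        piano.notes.append(pretty_midi.Note(
            velocity=80, pitch=int(pitch),
            start=float(start), end=float(start + duration)))
    pm.instruments.append(piano)
    pm.write(out_path)

# e.g. a short ascending line: (MIDI pitch, onset in seconds, duration in seconds)
notes_to_midi([(60, 0.0, 0.5), (62, 0.5, 0.5), (64, 1.0, 0.5)])
```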
Final Reflection
https://docs.google.com/document/d/1daYwwbywP92c4sYgnFYvo8Y_zi51Qs40d0QSnb3ySUs/edit?usp=sharing
Built With
- cnn
- keras
- python
- rnn
- tensorflow