BACHPropagation

Project Title

BACHPropagation

Group Members:

Luke Eller (leller1)
Xinru Li (xli115)
Kim Brown (kbrown18)
Carter Cobb (ccobb1)

Final Writeup

https://docs.google.com/document/d/1ivkVeF145PM3V48ImKEVPc0imCtunmMKpXZT5suCxYw/edit?usp=sharing

Introduction

We are trying to solve the problem of generating art -- specifically music -- using deep learning. We are implementing the “Deep Learning for Music” paper by Allen Huang and Raymond Wu, which trained an RNN on a musical dataset to see whether melody and harmony can be generated the same way one would generate language. We plan to try something new in preprocessing. In the papers we have looked at, the models are fairly similar (mostly variations on an LSTM), while the way the data is represented differs. Musical data usually comes in the form of MIDI, and the challenge is representing harmony, melody, and timing together in a dataset a model can learn from. Some papers set all pieces to the same tempo, some set all pieces to the same key, and some represented their notes as tensors, or piano rolls. We are looking to innovate on these concepts for our model. We treat music generation as a supervised sequence prediction task.

Related Work

We drew inspiration from several papers and online articles on music generation with deep learning. The most significant is the Allen H. paper cited above, which can be found here: https://cs224d.stanford.edu/reports/allenh.pdf. We also looked closely at the paper “Bach 2.0” (https://www.sciencedirect.com/science/article/pii/S1877050919313444), which gives a lot of detail about different model architectures for this task, for instance LSTM vs GRU. That paper also discusses a different representation of musical notation that comes from the “NoteWorthy Composer” music software. We also looked at Medium articles and similar tutorials on generating music with deep learning, for instance this piece about generating techno music: https://sswanalytics.com/2019/08/09/how-to-generate-techno-music-using-deep-learning/

Additional Articles:
https://medium.com/artists-and-machine-intelligence/neural-nets-for-generating-music-f46dffac21c0 Discusses how choosing LSTMs over plain RNNs improved the length of the model’s output: with RNNs, the song’s quality tapered off quickly.
https://blog.floydhub.com/generating-classical-music-with-neural-networks/ Describes Clara, an LSTM neural network created by Christine McLeavey Payne.

Public Implementations: https://paperswithcode.com/paper/deep-learning-for-music#code

Data

We will mostly take piano MIDI from open sources such as MuseData, which includes music by composers such as Bach, Beethoven, and Vivaldi. Because these composers come from such different backgrounds, we will most likely focus on Bach in the beginning, like the other papers we found, then try generating music from other composers as well. This is partly to verify that our model works, and partly to keep training times manageable. MIDI files contain the duration and pitch of each note, but the papers we read mentioned that RNNs have trouble processing notes that sound simultaneously, as in chords and harmony. One approach is to turn MIDI files into piano rolls, which sample which notes are sounding at each timestep. It currently looks like a lot of preprocessing is required before our model can use the data. The papers we looked at each took a slightly different preprocessing approach, and we are thinking about coming up with our own representation. Either way, these datasets will need at least 7 hours to train, so we will start with a small set of Bach and work towards other genres as our stretch goal.

We are looking to use the following datasets: MuseData http://www-etud.iro.umontreal.ca/~boulanni/icml2012

One way to work with less data, and so decrease training time, is to transpose all our pieces into the same key. That way, our model would not have to learn the same musical patterns separately in each key. Transposing could be challenging, though there are examples online of how it might be done: https://gist.github.com/aldous-rey/68c6c43450517aa47474 https://web.mit.edu/music21/doc/usersGuide/usersGuide_15_key.html

We are also considering using the music21 API to help with visualizing our data: http://web.mit.edu/music21/doc/index.html
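The transposition step described above can be sketched without any MIDI library. This is a minimal illustration, not our final preprocessing code: it assumes the key of each piece has already been detected (for instance with the music21 key tools linked above), and the function names and the (pitch, start, duration) tuple layout are hypothetical.

```python
# Pitch classes under the usual MIDI convention: pitch % 12, with C = 0.
PITCH_CLASSES = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
                 "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}


def semitones_to_c(tonic):
    """Smallest signed shift (in semitones) that moves `tonic` onto C."""
    pc = PITCH_CLASSES[tonic]
    # Shift down for tonics up to F# (6 semitones), up otherwise,
    # so no piece moves by more than a tritone.
    return -pc if pc <= 6 else 12 - pc


def transpose_notes(notes, tonic):
    """Transpose (pitch, start, duration) tuples so the tonic becomes C.

    Pitches are MIDI note numbers; the tuple layout is an assumption.
    """
    shift = semitones_to_c(tonic)
    return [(pitch + shift, start, dur) for pitch, start, dur in notes]


# A G major triad (G4, B4, D5) becomes a C major triad after shifting up 5.
g_major = [(67, 0, 1), (71, 0, 1), (74, 0, 1)]
c_major = transpose_notes(g_major, "G")
```

Applied to a minor-key piece, the same shift would land on C minor rather than C major, so major and minor pieces would still need to be handled or labeled separately.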

Methodology

2-layer LSTM architecture
Hyperparameters might include the number of LSTM units, number/size of dense layers applied afterward, hidden unit size, sequence length, batch size, and learning rate.
Gradient clipping
Learning rate annealing (might not be necessary depending on the optimizer we choose)
Depending on how easy the paper architecture ends up being to implement, we could experiment with other models as well (GRUs, transformers).

How are you training the model?
Train via the usual supervised learning process we used with RNNs on hw3 and hw4.

Data preprocessing might be the most difficult part, as well as finding a sufficiently effective method of representing the data. Tuning hyperparameters might also be a challenge, since the dataset is so different from natural language.
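Gradient clipping and learning rate annealing, both listed above, are simple enough to show in plain Python. The sketch below assumes clipping by global norm and an exponential decay schedule; these are common choices but not necessarily the ones the paper or our final model uses, and the function names and example values are illustrative. In TensorFlow the same ideas are available as `tf.clip_by_global_norm` and `tf.keras.optimizers.schedules.ExponentialDecay`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is <= max_norm.

    `grads` is a list of flat lists of floats, one per parameter tensor.
    Returns the (possibly rescaled) gradients and the original global norm.
    """
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= max_norm:
        return grads, global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in grads], global_norm


def annealed_learning_rate(initial_lr, decay_rate, step, decay_steps):
    """Exponential decay: initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)


# Two hypothetical gradient tensors whose joint norm is 5.0.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)

# Halve the learning rate every 1000 steps; evaluated here at step 2000.
lr = annealed_learning_rate(0.01, 0.5, step=2000, decay_steps=1000)
```

Clipping by the global norm keeps the direction of the overall update unchanged, which is why it is often preferred over clipping each value independently when training recurrent models.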

Metrics

There are 3 major ways in which we are planning to evaluate the model. The first is to set aside some data for testing and compute the percent accuracy on the symbol-prediction task over the test set, similar to what we did in the language models assignment. We can compare our accuracy to the various models tested on the JSB chorales in this paper: https://arxiv.org/pdf/1206.6392.pdf (these data were also used as the reference in the Allen H. paper). Similarly, we can check that our loss goes down during training to verify that the model is learning. The second way is to visualize the embedding representations of the tokens in our data set after training the model. These tokens will represent notes and chords, so we can try to discern any musical meaning from the relative positions of the vectors. For instance, we would hope to see the notes distributed in a way that corresponds to their pitch. The third way is to listen to the music and see if it is pleasing. The Allen H. paper has a group of people listen to and compare the music samples, and asks them whether they think the samples sound better than complete random noise.

Accuracy Notion
Since we are training a model to predict the next note or chord in a sequence, we can compute a numeric accuracy over the test set, just as we did in the HW3 language modelling assignment. However, we saw from the Montreal paper that even when the model achieves a reasonably high accuracy, the music it generates can sound totally discordant. So, a more appropriate metric may be a subjective rating of how pleasant the music actually sounds.

If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model.
In the Allen H. paper, apart from measuring accuracy and visualizing embeddings to assess performance, the authors surveyed 26 listeners to rate the music generated by their models. Listeners rated the samples on a scale of 1-10, 1 being “sounds like random noise” and 10 being “composed by a composer.” Through these ratings, the authors hoped to determine whether their model was actually producing musically plausible melodies.

If you are doing something new, explain how you will assess your model’s performance.
We will use the three methods described above to evaluate the model’s performance: computing accuracy & loss, visualizing the embeddings, and listening to the music generated by the model.

What are your base, target, and stretch goals?
Base: Train a model on the Bach corpus, using a music representation described in one of our papers (i.e. NoteWorthy, piano roll, or flattened MIDI). Obtain a prediction accuracy over the test set which is better than a totally random player and produce music that is aesthetically better than random.
Target: Use our own original representation of the music data and obtain an accuracy that is comparable to the RNN tested in the U Montreal paper (ie ~30%) and produce short, aesthetically pleasing piano melodies.
Stretch: Produce longer melodies which exhibit higher-level musical structure such as harmony. Possible stretch ideas include: train on data from multiple musicians in order to create a hybrid style that sounds similar to both, etc.
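The accuracy metric and the “better than a totally random player” baseline from the goals above can be made concrete with a small sketch. The token ids and the vocabulary size of 88 (one token per piano key) are assumptions for illustration; a representation that also encodes chords would have a larger vocabulary and a lower random baseline.

```python
def next_symbol_accuracy(predicted, targets):
    """Fraction of timesteps where the predicted symbol equals the target."""
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(p == t for p, t in zip(predicted, targets))
    return correct / len(targets)


def uniform_random_baseline(vocab_size):
    """Expected accuracy of a player that guesses uniformly at random."""
    return 1.0 / vocab_size


# Hypothetical token ids for a held-out sequence and a model's predictions.
targets = [60, 62, 64, 65, 67, 65, 64, 62]
predicted = [60, 62, 64, 64, 67, 65, 62, 62]

accuracy = next_symbol_accuracy(predicted, targets)
beats_random = accuracy > uniform_random_baseline(vocab_size=88)
```

A stronger baseline than uniform guessing would be a player that always predicts the most frequent symbol in the training set, which is worth reporting alongside the random one.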

Ethics

It is important to consider the ethical implications and uncertainties of creating this type of neural network. If a deep learning model is trained on one musician’s data and generates music similar to theirs, does that musician have an intellectual property claim on it? Is that musician entitled to any of the profits? It may also simply be considered unethical to create music which imitates the sound of an artist: this can be seen as a type of impersonation, or something that dilutes and undermines the artist’s real work. There is also the question of whether companies using AI to produce music are rendering real artists obsolete, or absorbing the profits in the music industry. A possible implication is that companies may come to completely dominate mainstream music, and we will no longer be able to enjoy individual creativity and authenticity (though in some ways, this is already the case with record labels). Finally, it is important to consider which music these models are trained on. If they are only trained on the work of classical Western European composers, it may perpetuate the idea that this is the epitome of music and sweep the musical traditions of other cultures even further under the rug.

Why is Deep Learning a good approach to this problem?
Deep learning is a good approach to complex problems where the model needs to learn the important features itself, rather than relying on lots of labeled data and human expertise. A music-generating neural network is able to take in tons of data, learn the features and patterns in it, and reproduce those patterns in the new music it creates. Additionally, there is plenty of music data available, and deep learning generally outperforms other techniques, such as classical machine learning, when there is sufficient data.

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?
Our dataset is not very representative, considering it highlights the work of one composer in a very overrepresented genre.

Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?
Major stakeholders in this problem include record labels, musicians, and the general public. If the model becomes commercialized, it can have profound ramifications in society which change the way music is consumed and produced. This will affect everyone, but especially music artists. The consequences of mistakes made by the algorithm are not high, because there is technically no “wrong answer”: the model is simply producing music which may or may not sound pleasing. On the other hand, there are many potential consequences of the algorithm working “too well”, some of which are discussed at the start of this section.

Division of labor:

Data Pre-processing:
Reading in MIDI files - Kim
Transposing data - Xinru
Creating piano rolls or other music representations - Xinru & Carter

Building the Model:
Model call function - Carter
Accuracy & loss functions - Luke
Writing training and testing loops - Carter

Testing “Accuracy”:
Compare accuracies with other papers and compare to random noise - Luke
Listening tests with other people - Xinru
Looking at musical structures - Xinru

Visualization -- generating sheet music and visualizing embedding vectors - Luke

Built With

python
tensorflow

Updates

Carter Cobb posted an update — Nov 24, 2020 02:49 AM EST

Introduction: [copied] We are trying to tackle the problem of generating art -- specifically music -- using deep learning. We are implementing the “Deep Learning for Music” paper by Allen Huang and Raymond Wu, which trained an RNN on a musical dataset to see whether melody and harmony can be generated the same way one would generate language.

Challenges: Transforming MIDI data files into a piano roll representation has been the most challenging aspect of the project so far. The notes of the piece are indicated by either “note-on” or “note-off” messages, which are annotated with the note (1-88) and a time stamp. The unit of time is “ticks”, and meta-data messages indicate the number of ticks per beat. These times are all relative: each one is the delta time since the previous message. Therefore, many conversions are needed to interpret the MIDI files correctly. We also need a data structure that holds all the notes sounding at any given timestep, so that we can align multiple tracks and sample at every eighth note. Another challenge is normalizing the MIDI files as we read them so that our data is more consistent. We are trying to put all our files in the same key and at a similar tempo, and this is proving difficult because the source files are so inconsistent.
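The conversion described above (relative delta times to absolute ticks, then sampling the sounding notes at every eighth note) can be sketched as follows. This is an illustration under stated assumptions, not our actual preprocessing code: the (kind, note, delta_ticks) tuple layout and the function name are hypothetical, a beat is assumed to be a quarter note, and only a single track is handled.

```python
def messages_to_piano_roll(messages, ticks_per_beat):
    """Convert delta-timed note messages into a piano roll sampled every eighth note.

    `messages` is a list of (kind, note, delta_ticks) tuples with kind
    "note_on" or "note_off"; this tuple layout is hypothetical.
    Returns a list of sorted note lists, one per eighth-note step.
    """
    step = ticks_per_beat // 2  # an eighth note, assuming a quarter-note beat

    # 1. Turn relative delta times into absolute tick positions.
    now = 0
    events = []
    for kind, note, delta in messages:
        now += delta
        events.append((now, kind, note))

    # 2. Walk the sample grid, applying every event at or before each sample.
    roll, active, i = [], set(), 0
    for tick in range(0, now, step):
        while i < len(events) and events[i][0] <= tick:
            _, kind, note = events[i]
            if kind == "note_on":
                active.add(note)
            else:
                active.discard(note)
            i += 1
        roll.append(sorted(active))
    return roll


# Notes 60 and 64 held for one beat, then note 67 for an eighth note.
messages = [
    ("note_on", 60, 0), ("note_on", 64, 0),
    ("note_off", 60, 480), ("note_off", 64, 0),
    ("note_on", 67, 0), ("note_off", 67, 240),
]
roll = messages_to_piano_roll(messages, ticks_per_beat=480)
```

To align multiple tracks, one option is to compute absolute ticks per track, merge all events into one list sorted by tick, and then run the same sampling loop. Real files may also encode a note-off as a note-on with velocity 0, which this sketch does not handle.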

At this time it seems like pre-processing will be the most challenging aspect of the project.

Insights: We do not have any concrete results at the moment, as we are still finishing up pre-processing.

Plan: We are reasonably on track with the project, but we will need to dedicate more time to implementing the model over the next week or so. Ideally we should have ample time to tweak hyper-parameters and adjust the architecture so that we can optimize the music being generated. Evaluating the model’s performance will be significant work, and may involve generating audio samples from the music output by the model. All in all, we need at least a few days to evaluate the model before the project is due. So far, we are not planning to change anything, although it may not be feasible for us to normalize the tempos and time signatures of the pieces as we had previously planned.


Luke Eller started this project — Nov 13, 2020 01:06 PM EST
