RachmaninoffNN: Style-Specific Music Generation with Biaxial LSTM

Team Members

  • Xin Lian (xlian1)
  • Yanyu Tao (ytao5)
  • Yezhi Pan (ypan34)
  • Yongjeong Kim (ykim235)

Introduction

Deep learning has proven highly effective for generative tasks in natural language processing and computer vision, producing articles and pictures that are realistic and comparable to human work. Although deep learning has also been applied to automatic music generation in recent years, generating realistic and aesthetically pleasing music remains challenging. Most existing neural network music generation algorithms specialize in a particular genre, and few offer a tunable ability that lets users choose their desired musical style, such as genre, composer’s style, or mood. In the paper by Mao et al. [1], the authors aim to create a model capable of composing music in a specific style or a mixture of styles. They believe that such a model can help customize generated music for people in the music and film industries. They build upon the previously introduced genre-agnostic algorithm, Biaxial LSTM, and incorporate new methods to learn music dynamics. In our project, we will implement the deep learning model presented in the paper, DeepJ, using a different dataset with a mixed-style piano repertoire from the 17th to the early 20th century.

Related Work

One of the most notable related works, and the one our target paper compares against, is "Generating Polyphonic Music Using Tied Parallel Networks" by Daniel D. Johnson [2]. This work is unusual in that it can compose music of multiple genres, whereas previous works mainly focus on composing music of a specific genre. The author uses Biaxial LSTM and Tied-Parallel LSTM-NADE models for polyphonic music generation and prediction; these models handle multiple simultaneous notes at each time step by modeling their joint probability distribution. As noted by both papers [1, 2], this related work achieves high prediction accuracy and generates music of multiple genres but fails to ensure stylistic consistency within a single output (i.e., each output belonging to one particular genre). Our target paper aims to address this shortcoming. The link to the DeepJ code can be found in the References section.

Data

We will implement the model using the MAESTRO Dataset V3.0.0 [3], distributed by Google’s Magenta program. Specifically, we will use the MIDI portion of the data, which comprises around 200 hours of performances from the International Piano-e-Competition, 2004-2018. We will convert the MIDI files to a piano roll representation of notes, matching the training data representation in the original paper, using the get_piano_roll() function from the pretty_midi library. We will also assign a music style (classical, baroque, or romantic) to each MIDI track based on its composer, following the style assignment mechanism in the original manuscript.
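The composer-based style assignment described above could look roughly like the following sketch. The composer table is purely illustrative (it is our assumption, not the list used in the original manuscript), and the helper names assign_style and one_hot_style are hypothetical. We assume the composer name is read from the dataset's metadata, for example its canonical_composer field.

```python
from typing import Dict, List, Optional

# Order of the style dimensions in the one-hot vector.
STYLES: List[str] = ["baroque", "classical", "romantic"]

# Illustrative composer-to-style table (an assumption for this sketch,
# not the mapping from the original manuscript).
COMPOSER_STYLE: Dict[str, str] = {
    "Johann Sebastian Bach": "baroque",
    "Domenico Scarlatti": "baroque",
    "Joseph Haydn": "classical",
    "Wolfgang Amadeus Mozart": "classical",
    "Franz Schubert": "romantic",
    "Franz Liszt": "romantic",
    "Sergei Rachmaninoff": "romantic",
}


def assign_style(composer: str) -> Optional[str]:
    """Return the style label for a composer, or None if the composer is unmapped."""
    return COMPOSER_STYLE.get(composer.strip())


def one_hot_style(style: str) -> List[float]:
    """Encode a style label as a one-hot vector ordered like STYLES."""
    if style not in STYLES:
        raise ValueError(f"unknown style: {style!r}")
    return [1.0 if s == style else 0.0 for s in STYLES]
```

Tracks whose composer is not in the table return None, so they can be skipped or reviewed by hand rather than silently mislabeled.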

Methodology

Our implementation is primarily based on DeepJ's original implementation, but we use PyTorch and a different dataset. The original implementation contains hard-coded logic that assumes a particular data layout (i.e., the dataset is split by genre) and fixed constant values. We plan to build our own preprocessing logic to convert the new dataset into a form that works with our model. After preprocessing, we will feed the data into our model to generate music, whose quality must then be judged by human listeners.

The model is a Biaxial LSTM with a time-axis module and a note-axis module. The architecture uses these LSTMs and the conditional probabilities of concurrently sounding notes at each time step to generate music. A convolution layer extracts note features, which are then passed into the two LSTMs with their appropriate contexts. The first LSTM is the time module, and the second is the note module. We need to reimplement the two types of hidden layers that DeepJ [1] uses: 1) a linear hidden layer that produces the style embedding, and 2) for each LSTM, another hidden layer with tanh activation that creates a non-linear version of the embedding.

The following is a brief summary of the general model architecture described in DeepJ [1]. The model starts with a convolution layer: a 1-dimensional convolution is applied to the note input to extract note features. These note features are then concatenated with contextual inputs, and style information is incorporated by adding a non-linear representation of the style, before the result is fed into the first LSTM layer (Time-Axis Module). The outputs of this LSTM layer are concatenated with the chosen notes, summed component-wise with the style representation, and fed into the next LSTM layer (Note-Axis Module). Finally, the outputs of this layer pass through a sigmoid probability layer to produce the predictions. A key weakness of this model is that it relies on human input for the final evaluation.
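The style conditioning and output layer summarized above can be written compactly as follows. This is our reading of the description, not a formula quoted from the paper. All symbol names are our own assumptions: s is the style vector (one-hot or a mixture), e is the linear style embedding, c is the non-linear style representation computed separately for each LSTM module, x is the input to that module before conditioning, and h is the output of the note-axis module. Whether each layer carries a bias term is also an assumption.

```latex
\begin{align}
  e &= W_s\, s \\
  c^{(\ell)} &= \tanh\!\left(W^{(\ell)} e + b^{(\ell)}\right),
      \qquad \ell \in \{\text{time}, \text{note}\} \\
  \tilde{x}^{(\ell)} &= x^{(\ell)} + c^{(\ell)} \\
  \hat{y} &= \sigma\!\left(W_o\, h^{(\text{note})} + b_o\right)
\end{align}
```

Because the style representation is added component-wise, each c must have the same dimensionality as the input of the LSTM module it conditions.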

While reimplementing the model in a different framework sounds like a simple task, understanding the existing model without detailed code documentation can be challenging. Also, simply swapping the dataset and converting the TensorFlow version to a PyTorch version does not necessarily make our model better. For instance, we will need to ensure that our implementation scales well with our GPU accelerator environment and the size of the dataset (e.g., we are given the option to choose the number of accelerators). Furthermore, creating a development environment that works on every team member's local machine is crucial if we decide to implement and test the initial code locally and then deploy it to a cloud instance.

In summary, these are the challenges we have identified so far: 1) ensuring that we have a proper development environment that works for everyone (e.g., a container for the initial experiment or an appropriate cloud instance setting), 2) modifying the preprocessing logic to ensure compatibility with our dataset, 3) modifying the code to make full use of the GPU accelerator (e.g., less branch-blocking and hyperparameter tuning suited to our rich GPU environment), and 4) scaling to the large dataset with careful management of coding style.

Metrics

We plan to train our model on the MAESTRO dataset, labeled with three specific styles: classical, baroque, and romantic. Following the original manuscript, we will also truncate the pitch range in the MIDI files to reduce the input dimensionality. Hyperparameters, including the number of units for the two axes in the model, the number of filters, and the dimensions of the embedding space, will be chosen through tuning. The model will be trained using stochastic gradient descent with the Adam optimizer. Detailed training steps can be found in the Methodology section.
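The pitch-range truncation mentioned above can be sketched in plain Python. This is a simplified stand-in for calling get_piano_roll() and then slicing the pitch axis, not the pretty_midi implementation itself. The default pitch range (MIDI 36 up to, but not including, 84) and the resolution of 4 steps per second are assumed values for illustration, and the function name piano_roll is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# A note is (start_seconds, end_seconds, midi_pitch).
Note = Tuple[float, float, int]


def piano_roll(
    notes: Sequence[Note],
    steps_per_second: int = 4,   # assumed time resolution
    min_pitch: int = 36,         # assumed lower bound (inclusive)
    max_pitch: int = 84,         # assumed upper bound (exclusive)
) -> List[List[int]]:
    """Build a binary piano roll with one row per time step and one column per kept pitch."""
    width = max_pitch - min_pitch
    spans = []
    for start, end, pitch in notes:
        if not (min_pitch <= pitch < max_pitch):
            continue  # drop notes outside the truncated range
        first = int(math.floor(start * steps_per_second))
        # Every kept note occupies at least one time step.
        last = max(first + 1, int(math.ceil(end * steps_per_second)))
        spans.append((first, last, pitch - min_pitch))

    n_steps = max((last for _, last, _ in spans), default=0)
    roll = [[0] * width for _ in range(n_steps)]
    for first, last, col in spans:
        for t in range(first, last):
            roll[t][col] = 1
    return roll
```

With these assumed defaults, each time step has 48 columns instead of the full 128 MIDI pitches, which is the kind of dimensionality reduction the truncation is meant to achieve.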

Given that music evaluation can be personal, the authors run subjective experiments to evaluate the generated music in two respects: quality and style. For the quality analysis, general users are asked to choose the better piece from pairs consisting of one piece generated by DeepJ and one generated by the original Biaxial model. For the style analysis, users with music backgrounds are divided into two groups. One group is asked to classify the style of pieces generated by DeepJ, and the other to classify the style of authentic pieces. A hypothesis test is then conducted on their responses to check for a difference in the identifiability of musical style between the two groups. We will also conduct our own user study to evaluate the performance of our model. Still, due to the time limitation of the project, we may not find as many users as in the original paper, so we will also calculate objective metrics such as perplexity as a supplementary measure of our model.
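One possible way to compute the perplexity score mentioned above is sketched below. It assumes the model outputs one independent probability per note cell and takes perplexity to be the exponential of the mean binary cross-entropy. This definition is our assumption, chosen because it is a common one; the function name and the eps clamp are likewise illustrative.

```python
import math
from typing import Sequence


def perplexity(
    probs: Sequence[float],
    targets: Sequence[int],
    eps: float = 1e-12,
) -> float:
    """Return exp(mean binary cross-entropy) over flattened (time step, note) cells.

    probs[i] is the predicted probability that cell i is "on";
    targets[i] is 1 if the cell is on in the reference piece, else 0.
    """
    if len(probs) != len(targets) or len(probs) == 0:
        raise ValueError("probs and targets must be non-empty and equally long")

    total_nll = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total_nll -= math.log(p) if y == 1 else math.log(1.0 - p)
    return math.exp(total_nll / len(probs))
```

Under this definition a model that always predicts 0.5 scores exactly 2, and lower values indicate predictions closer to the reference notes.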

In this project, our base goal is to successfully reimplement the DeepJ model and train it on the MAESTRO dataset. Our target goal is to generate discernible, realistic music comparable to human work, and our stretch goal is to create aesthetically pleasing music with styles that are identifiable to some extent.

Ethics

Since our project is not concerned with lyrics or any natural language-related elements in the music generation process, the primary ethical complication involved is intellectual property and copyright. One broader societal issue relevant to our chosen problem space is how credit should be distributed when AI or DL models take part in the creative process. To what extent is it ethical to make AI or DL models "learn" from other people's work? There are no agreements or legal terms that specify whether it is appropriate to generate music from copyrighted works, so we need to consider this from an ethical perspective. As a result, we will only train the model on properly licensed data and will not attempt to generate music from any web-scraped or unlicensed MIDI files.

The primary stakeholders of music generation deep learning models are the composers and songwriters of the music in the training dataset and the audience who listen to the music generated by the model. Secondary stakeholders include music labels and content creators who use the generated music. The quality of our model's results is entirely subjective, so the stakeholders would be affected not by mistakes made by our model but by its deployment, should it ever be commercialized. Future discussions should include topics such as compensating the individuals or companies who own the copyright of the training music for any profit made from the generated music, as well as the potential decrease in creative labor costs and the resulting attrition in the creative industry.

Division of Labor

Implementation

We understand that the implementation work cannot be evenly divided among members, since it is difficult to predict the time and effort required for each project segment. Instead, each member will work on a separate git branch, and we will hold weekly or bi-weekly meetings to synchronize progress and conduct code reviews. We will then merge into the main branch on an ongoing basis to keep the code in working order.

The tentative distribution of work is as follows:

  • Data preprocessing (data cleaning, style assignment, compatibility modification, etc.) - Yezhi Pan
  • Model architecture - Yongjeong Kim
  • Test and main (calculate complexity, call model, etc.) - Yanyu Tao
  • Visualization and other utility functions - Xin Lian

Survey

As noted in the target paper [1], the essential way to evaluate the DeepJ framework is to survey a diverse group of people, including but not limited to one group with a deep learning background and another without such expertise.

The tentative distribution of work is as follows:

  • Create survey templates and links - Yezhi Pan and Yanyu Tao
  • Distribute surveys - all
  • Analyze survey results and generate relevant charts - Yongjeong Kim and Xin Lian

Final Report and Posting

Preparing the final report and posting is a crucial part of the final project, as it demonstrates the bottlenecks of our implementation through evaluation and description. We plan to hold weekly or daily meetings to ensure that everyone participates in summarizing and documenting the project and shares their interpretation and understanding of the framework from the perspective of their role.

References

[1] H. H. Mao, T. Shin and G. Cottrell, "DeepJ: Style-Specific Music Generation," 2018 IEEE 12th International Conference on Semantic Computing (ICSC), 2018, pp. 377-382, doi: 10.1109/ICSC.2018.00077.

[2] D. D. Johnson, "Generating Polyphonic Music Using Tied Parallel Networks," Computational Intelligence in Music, Sound, Art and Design, 2017, pp. 128-143.

[3] Magenta, "The MAESTRO Dataset," magenta. Updated October 29, 2018. [Website].

Checkin 3: Reflection

Challenges

Documentation and compatibility issues:

We encountered debugging and implementation challenges due to the lack of documentation in the original repository and in the PyTorch setup instructions. Some libraries that the original program depends on failed to run correctly and caused system-level failures on our Google Cloud Platform VM instance (i.e., an instance equipped with an Nvidia Tesla K80 accelerator and the Ubuntu 20.04 LTS operating system). We had to select PyTorch version 1.11 to ensure compatibility with the Python packages. However, PyTorch consistently failed to detect the accelerator device because of incompatibilities with the virtual machine's pre-installed Linux headers and driver versions. This failure was difficult to diagnose because the official PyTorch setup instructions do not describe these compatibility issues at a sufficiently low level. After investigating, we created a bare Ubuntu instance and configured it manually (i.e., CUDA 11.4, Nvidia driver 470.103.01, and the required Linux headers/libraries).

Furthermore, the DeepJ repository does not sufficiently describe the environment setup or the memory requirements of its implementation, and our VM instance crashed frequently as a result. Initially, it was challenging to find the root cause of the failures because they were not evident from the system metrics reported by the observability tool, which could not monitor CPU utilization and memory usage in real time at the granularity needed to catch the bugs causing the crashes. We analyzed the code alongside the high-level metrics provided by the tool, which indicated short-lived bursts of heavy resource utilization consistent with what the code implied. We modified the parameter values for CPU multi-threading to reduce the run-time memory requirement, and applied state-of-the-art software techniques to reduce the amount of data the framework has to render in real time. We are debugging this further to configure on-demand memory pulling for the cloud instance.

Compatibility was one of the main challenges we faced during the translation process. The original code is based on the TensorFlow framework, whose Keras layers automatically handle much of the back-end processing involved in defining and using layers. However, PyTorch lacks some layer definitions available in Keras that the original implementation uses frequently. We had to either find an open-source implementation or define our own class. Furthermore, PyTorch requires more explicit layer declarations, and given the documentation gaps described above, finding the proper parameters and shape values was challenging.

Initial environment setup and cloud reliability:

At the early implementation stage, we used local environments equipped with lower-end GPUs (Nvidia GeForce series with less than 100 GB memory) and CPUs with 1-2 cores. Despite the low efficiency, we were able to use CPU resources for computation and preprocessing. Later, however, as we progressed toward the model verification and training stages, we ran into issues because our GPU accelerator did not have enough memory to hold the host data transferred to the device at run time. Our local memory architecture, with its PCIe bus, also could not keep up with the required data transfer bandwidth.

Cloud support was also not fully reliable (we are not using the auto-scaling feature). We observed unstable ingress bandwidth on our instance, and the ssh connection was repeatedly lost during large batch data transfers. The monitoring tool provides little insight into the ssh failures or the network traffic, owing to the naive nature of the virtualization. The instance is also not fault-tolerant: data in transit is lost whenever the connection drops. We are currently investigating this issue further.

Insights

Due to the challenges discussed above, we are still working on model training and troubleshooting. We are fixing the layer implementation issues caused by the lack of native API support. At this point, we do not have concrete results yet, but we have successfully implemented our preprocessing pipeline for the specific data files we are using and are wrapping up our training implementation. Moreover, because PyTorch requires extra parameters and shape values to be specified explicitly when translating layers, we believe our final model will be more stable and scalable compared to the original model in TensorFlow. We are confident that our translated model will achieve performance comparable to that reported in the original manuscript once everything is fixed. However, we do expect training to take a fair amount of time given the size of the data.

Plan

By now we have found a suitable dataset for our model and finished preprocessing based on the specifications in the original manuscript. We converted the model from TensorFlow to PyTorch and deployed it to Google Cloud Platform (GCP) for training. We will dedicate more time to addressing the aforementioned compatibility issues encountered in cloud computation, identifying APIs that are more stable than the ones we are currently using, and eventually fine-tuning the hyperparameters. Furthermore, our preprocessed data (the piano roll representation of the Musical Instrument Digital Interface files) takes up more than 100 GB of disk space, but the published paper does not comment on conventional computational problems such as scarce system memory and/or disk space, which can potentially lead to VM failures in the GCP deployment. We plan to explore options to parse the data and find methods to load it dynamically, in order to scale with the limited resources available.
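As a rough illustration of the dynamic loading we have in mind, the generator below reads one preprocessed file at a time and yields fixed-length training windows, so only a single piano roll needs to be in memory at once. The function name iter_windows and the load_roll callback are hypothetical placeholders, and the window and stride values are left to the caller. In PyTorch, a generator like this could be wrapped in an iterable-style dataset, but that wiring is not shown here.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

# A piano roll is a sequence of time steps, each a sequence of note values.
Roll = Sequence[Sequence[int]]


def iter_windows(
    paths: Iterable[str],
    load_roll: Callable[[str], Roll],  # hypothetical loader for one preprocessed file
    window: int,
    stride: int,
) -> Iterator[List[Sequence[int]]]:
    """Lazily yield fixed-length windows, loading one file at a time."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    for path in paths:
        roll = load_roll(path)  # only this file's roll is resident in memory
        # Files shorter than one window produce no samples.
        for start in range(0, len(roll) - window + 1, stride):
            yield list(roll[start:start + window])
```

Because the windows are produced on demand, peak memory is bounded by the largest single file rather than by the full dataset of more than 100 GB.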
