Project Overview
Deep neural networks have become a powerful tool for style and domain transfer, and have been especially successful with media such as images. We wanted to bring these techniques to music: our goal was to take piano music as input and generate the same song in a different musical genre. We mainly referenced “Symbolic Music Genre Transfer with CycleGAN” for the implementation of our model. Using two GAN models in a cyclic fashion allowed us to pass samples from one model to the other and transfer styles efficiently, without the need for paired data.
Methodology:
Preprocessing:
We extracted MIDI files from MIDI datasets in the jazz, pop-rock, and classical genres to use for training and testing. We then processed them with various Python packages: we split them into 8-second intervals, added augmentations such as shifting the key and increasing/decreasing the tempo, removed the drums, and added start and end tokens.

Rather than using a piano-roll representation (a 2D array) of the MIDI files and CNNs like the original paper, we used a timeshift representation, inspired in part by the Magenta Project’s “Performance RNN”. We avoided the piano-roll representation because the majority of its values are 0 (at any given moment, only a few notes are being played), which makes learning with a CNN more difficult and computationally expensive. Timeshift representations are 1D, with each element taking a value from 0 to 387. They also let us encode the velocity of each note (essentially how loudly it is played), which is important for musicality and expressiveness. This diverges from the original paper, which capped velocities at a single value, resulting in output songs that feel very mechanical. Once we generated the timeshift representations, we treated each possible value as a class, one-hot encoded the arrays, and passed them as inputs to our model. We also padded or truncated each sequence to be 402 elements long.
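As a rough illustration, here is a minimal sketch of the one-hot encoding and padding step. The special-token ids (START, END, PAD) and the helper name are hypothetical; our actual pipeline may assign them differently:

```python
import torch

VOCAB_SIZE = 388          # timeshift event ids 0-387
SEQ_LEN = 402             # fixed length after padding/truncation
# Hypothetical ids for the special tokens; the real pipeline may differ.
START, END, PAD = 388, 389, 390

def prepare_sequence(events):
    """Wrap a timeshift event list with start/end tokens, pad or truncate
    to SEQ_LEN, and one-hot encode it for the model."""
    seq = [START] + list(events) + [END]
    seq = seq[:SEQ_LEN] + [PAD] * max(0, SEQ_LEN - len(seq))
    ids = torch.tensor(seq, dtype=torch.long)
    return torch.nn.functional.one_hot(ids, num_classes=VOCAB_SIZE + 3).float()

# A short event sequence becomes a (402, 391)-shaped one-hot matrix.
x = prepare_sequence([60, 355, 200])
print(x.shape)  # torch.Size([402, 391])
```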
Model Architecture:
Using PyTorch, we designed the GAN architecture of our model, including a discriminator and a generator. We created the CycleGAN architecture by using two GAN models and training them in unison, forming a cyclic structure. Both the generators and the discriminators were built as sequence models, using embedding layers and Gated Recurrent Units (GRUs). We applied a temperature to the logits before the softmax to raise the entropy of the output distribution, pushing it toward uniform for more randomness. This allowed us to overcome the issue of the same note being generated repeatedly, since scaling the logits before the softmax produced variability in our outputs. We can adjust the temperature when generating songs, which has led to interesting results we discuss in the Results section.
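As a rough sketch of this design (the layer sizes and names here are illustrative assumptions, not our exact configuration), an embedding-plus-GRU generator with temperature-scaled sampling might look like:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of a generator: embedding -> GRU -> per-timestep logits."""
    def __init__(self, vocab_size=391, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        emb = self.embed(tokens)                # (batch, seq, embed_dim)
        output, hidden = self.gru(emb, hidden)  # (batch, seq, hidden_dim)
        return self.out(output), hidden         # logits over the event classes

def sample_step(logits, temperature=1.0):
    """Scale logits by 1/temperature before the softmax: a high temperature
    flattens the distribution (more entropy/randomness), a low one sharpens
    it toward the most likely token."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```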
We then created a genre classifier to measure how well our model changed the style. We originally built a CNN classifier, but switched to an LSTM classifier due to the sequential nature of the timeshift data. Further, we found that the CNN classifier had trouble learning, likely because it was convolving over a one-hot encoded array in which the bulk of the values were 0 and the rest were 1. Across all three genres (pop, classical, and jazz), our LSTM classifier reached 90% accuracy on training data and 79% accuracy on testing data. We also ran the LSTM classifier on our model’s output songs. However, we found that most of the output songs were classified as “classical”, even though to us they sounded significantly jazzier. That said, the jazzified outputs on their own (when not compared to the input songs) were not immediately classifiable as any one genre: the rhythm was jazzier, as were the dynamics, but the melodies sounded more classical, and the notes were slurred in a way more characteristic of classical compositions than of jazz.
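A minimal sketch of such an LSTM classifier, under the same vocabulary assumptions as above (the hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

class GenreClassifier(nn.Module):
    """Sketch of an LSTM genre classifier; sizes here are illustrative."""
    def __init__(self, vocab_size=391, embed_dim=128, hidden_dim=256, num_genres=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_genres)

    def forward(self, tokens):
        emb = self.embed(tokens)       # (batch, seq, embed_dim)
        _, (h_n, _) = self.lstm(emb)   # final hidden state summarizes the song
        return self.fc(h_n[-1])        # logits for pop / classical / jazz
```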
Attached is our poster, which also includes a simplified graphical representation of our model.
Results:
We found that our CycleGAN model was able to produce songs in which various intervals were quite musical; however, the songs lacked long-term structure, and the style transfer overpowered the recognizability of the input song. This was despite pre-training for 79 epochs to help our model learn to output similar-sounding songs. Thus, while our output songs do sound “jazzier”, they lack a close resemblance to the original input song.
When generating songs with our saved CycleGAN models, we can specify the temperature to use, with a higher temperature producing more randomness and a lower temperature less. We found that an extremely low temperature (0.001) would, for some song inputs, output a timeshift representation consisting largely of 355s (the token corresponding to a note-off event), which, when read into GarageBand as a MIDI file, was essentially an empty song. Other outputs at very low temperature sounded rather similar to the input, but contained 355-filled “note-off” chunks that produced long silent intervals in the middle of the song. In Figure 1 below, we show a visual representation of the input classical song (Kleisen08-12.MID) and the classical-jazz model’s output songs. The middle song (jazzy_pop_trained) uses a higher temperature than the bottom song (jazzy_pop_trained_lower). We can see that the bottom song bears a bit more resemblance to the original song. This makes sense: with a lower temperature there is less randomness, and the most likely tokens, including “note-off” events, dominate the output.
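To make the temperature effect concrete with made-up numbers (these logits are a toy example, not values from our model):

```python
import torch

# Toy logits for three candidate tokens; pretend index 0 is the
# note-off token (355) and that the model slightly favors it.
logits = torch.tensor([2.0, 1.5, 1.0])

for temperature in (1.0, 0.5, 0.001):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(temperature, probs.tolist())
# T=1.0   -> ~[0.51, 0.31, 0.19]  a real mix of tokens
# T=0.001 -> ~[1.0, 0.0, 0.0]     the favored token wins almost every step
```

At a temperature of 0.001, the softmax collapses onto the single most likely token, so whenever the model’s most probable event at a step is note-off (355), it emits 355 almost deterministically.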