Our main target group for this project is musicians. When improvising over a jazz track, musicians are faced with two key tasks: 1) identifying the underlying chord progression and 2) generating a melody that fits the harmonic environment defined by the chords. We will tackle each of these steps individually to construct an end-to-end system that takes raw jazz audio as input and outputs a novel, creative solo.
Even in isolation, both phases are directly useful for musicians of all backgrounds. Chord transcription is a challenging and time-consuming process even for experienced human experts, and beginner musicians often struggle to identify progressions in music they want to play along to. While online services such as Chordify attempt to address this need, transcription accuracy is poor, especially for the more complicated harmonies present in jazz; we aim to apply machine learning to improve performance for end users. Solo generation is more directly useful to jazz musicians looking for a jamming partner. The improvisational dynamic between musicians is a defining feature of jazz, with both performances and casual play featuring call-and-response and trading fours. In times when meeting in person is difficult, an automated jazz improviser could be a valuable practice buddy for jazz musicians across many instruments and styles.
Jazz-O-Matic will perform signal processing on a given audio file to extract tempo and harmony, and then use an AI to create novel improvisation to accompany the music. There are four main aspects to our project:
Tempo extraction is the first step in the classification process. The tempo of a music sample is its number of beats per minute (BPM). This information is crucial because the generative model must match the tempo to create a complementary solo. To obtain it, we need software that takes a raw audio stream as input and outputs the BPM. The task seems simple to humans because our ears and brains are extremely good at audio analysis, but a computer sees only the raw bit stream digitized from the microphone. Jazz, like most music, layers many elements on top of one another, producing an audio signal that looks nothing like its constituent parts. Our current plan is to take the Fourier transform of the input waveform to move into the frequency domain, in an attempt to unmix the fused elements of the time-domain signal. Next, we will either train a recurrent neural network or implement spectral analysis techniques to predict beat positioning. Then we will use a resonating comb filter to detect periodicity in the signal and locate its maxima, i.e., the BPM. This plan is subject to change as our research continues, but some variation of the above process will most likely be implemented. We will evaluate the model's accuracy on comprehensive datasets and measure success by comparing our algorithm's performance against industry-standard ones.
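To make this pipeline concrete, here is a minimal sketch, assuming NumPy; all function names and parameters are our own. It substitutes autocorrelation of a spectral-flux onset envelope for the resonating comb filter, since both peak at the beat period:

```python
import numpy as np

def estimate_bpm(signal, sr=44100, frame=1024, hop=512):
    """Estimate BPM from raw audio via a spectral-flux onset envelope.

    A simplified stand-in for the comb-filter approach: the
    autocorrelation of the onset envelope peaks at the beat period.
    """
    # Frame the signal and move each frame into the frequency domain.
    n_frames = (len(signal) - frame) // hop
    window = np.hanning(frame)
    spectra = np.array([
        np.abs(np.fft.rfft(window * signal[i * hop:i * hop + frame]))
        for i in range(n_frames)
    ])
    # Onset envelope: half-wave rectified spectral flux between frames.
    flux = np.maximum(np.diff(spectra, axis=0), 0).sum(axis=1)
    flux -= flux.mean()
    # Autocorrelate and search only lags corresponding to 40-240 BPM.
    ac = np.correlate(flux, flux, mode="full")[len(flux) - 1:]
    fps = sr / hop  # onset-envelope frames per second
    lags = np.arange(int(fps * 60 / 240), int(fps * 60 / 40))
    best = lags[np.argmax(ac[lags])]
    return 60.0 * fps / best
```

On a synthetic click track the lag search recovers the click rate directly; real recordings would need the averaging and outlier handling described below.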
After looking into the two classification techniques, we settled on spectral analysis because it is more computationally efficient and does not require a pretrained model with specific weights and biases to estimate the BPM. Our main concern was that its generality would lead to poor BPM estimates, but that has not been the case. It works well for most simple beat patterns and even does quite well with many layered instruments and vocals over the underlying beat. At times it outputs relatively inaccurate estimates due to the windowing scheme that sweeps over the entire song, but with proper averaging and outlier elimination, the final BPM estimate is remarkably accurate. Performance drops on extremely complex samples with many underlying beats of similar amplitude, but we do not expect this to be a major problem: the estimate remains in the right ballpark, and the average user is unlikely to upload such complicated samples.
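The averaging and outlier elimination can be as simple as median-absolute-deviation filtering over the per-window estimates. A minimal sketch, assuming NumPy; the function name and threshold are our own:

```python
import numpy as np

def aggregate_bpm(window_estimates, mad_threshold=2.5):
    """Combine per-window BPM estimates, discarding outliers.

    `window_estimates` is assumed to come from sweeping a BPM
    estimator across the song; windows that disagree strongly with
    the median (in median-absolute-deviation units) are dropped
    before averaging.
    """
    est = np.asarray(window_estimates, dtype=float)
    med = np.median(est)
    mad = np.median(np.abs(est - med)) or 1.0  # guard against MAD = 0
    keep = np.abs(est - med) / mad <= mad_threshold
    return float(est[keep].mean())
```

A single window that locks onto a double-time pattern (e.g. 240 where the rest report ~120) is rejected before it can skew the mean.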
To identify harmonic notes, we analyze our audio signals in the frequency domain with the discrete cosine transform (DCT). The DCT spectrum as a function of time forms a spectrogram. Plotting frequency on a log scale is natural because every octave is a doubling of frequency, so notes in twelve-tone equal temperament are spaced linearly on a log scale. Within every octave we sum the energy around each note to make the final determination of which notes are present. This algorithm works quite nicely for extracting chords.
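A sketch of this octave-folding step, assuming NumPy and a magnitude spectrum from whichever transform we settle on; the function name and the A-based pitch-class indexing are our own:

```python
import numpy as np

def pitch_class_energy(spectrum, sr, n_fft, ref=440.0):
    """Fold a magnitude spectrum into 12 pitch-class energies.

    Each bin frequency is mapped to its nearest equal-tempered note
    (semitones relative to A4 = `ref` on a log2 scale), then energy
    is summed per pitch class across all octaves.
    """
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    energy = np.zeros(12)
    for f, mag in zip(freqs, spectrum):
        if f < 27.5:  # ignore bins below A0
            continue
        # Semitones above A4, rounded to the nearest note.
        semis = int(round(12 * np.log2(f / ref)))
        energy[semis % 12] += mag ** 2
    return energy  # index 0 = A, 1 = A#/Bb, ..., 11 = G#/Ab
```

A pure 440 Hz tone concentrates its energy in bin 0 (A), with spectral leakage into neighboring FFT bins still rounding to the same note.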
The note breakdown, synchronized with the extracted tempo of the music, will be enough for the neural network to improvise a consonant solo. However, proper chord names would be a very useful feature for a human user. This can be attempted deterministically, by simply mapping the detected notes to the chord that best explains their presence; nonidealities and musical subtleties, however, make this easier said than done. A potential solution is training a neural network on audio with pre-transcribed chords. Such a network could identify the most likely chord name given the current notes and the notes in the previous time window. Our primary metrics will be note identification accuracy and chord identification accuracy. For specifications, we hope to achieve performance at least comparable to existing systems such as Chordify on jazz audio tracks with known transcriptions.
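The deterministic mapping could score the 12 pitch-class energies against binary chord templates and pick the best match. A hypothetical sketch, assuming NumPy; the names, the template set (major and minor triads only), and the A-based indexing are our own:

```python
import numpy as np

NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

def chord_templates():
    """Binary 12-bin templates for all major and minor triads."""
    templates = {}
    for root in range(12):
        major = np.zeros(12)
        major[[root, (root + 4) % 12, (root + 7) % 12]] = 1.0
        minor = np.zeros(12)
        minor[[root, (root + 3) % 12, (root + 7) % 12]] = 1.0
        templates[NOTE_NAMES[root]] = major
        templates[NOTE_NAMES[root] + "m"] = minor
    return templates

def name_chord(energy):
    """Return the chord whose template best correlates with the energies."""
    e = np.asarray(energy, dtype=float)
    e = e / (np.linalg.norm(e) or 1.0)
    best, best_score = None, -np.inf
    for name, tpl in chord_templates().items():
        score = e @ (tpl / np.linalg.norm(tpl))
        if score > best_score:
            best, best_score = name, score
    return best
```

This is exactly where the subtleties bite: extensions, inversions, and voice-leading notes produce energies that partially match several templates, which motivates the learned alternative described above.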
Neural Networks for Generating Novel Solos
For the improvisational system, we started by using an LSTM model to generate notes given an underlying chord progression. Our dataset consists of over 500 jazz tunes from the Weimar Jazz Database, which contain chord, melody, and tempo information as well as metadata such as musical style. For simplicity, we began by removing tempo information and assuming all notes are evenly spaced quarter notes with no rests. We train an LSTM over this corpus and observe that adjusting the probability with which it selects "risky" notes yields a range of improvisational styles. Furthermore, ablation testing suggests that supplying the LSTM with information about the underlying chord progression does in fact help it write better solos.
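One common way to implement such a "risky note" knob is temperature sampling over the model's output distribution. The sketch below, assuming NumPy, is illustrative of the sampling step only, not our full model; the function name is our own:

```python
import numpy as np

def sample_note(logits, temperature=1.0, rng=None):
    """Sample a note index from model output logits.

    Higher temperature flattens the distribution, making
    low-probability ("risky") notes more likely to be chosen;
    temperature near zero approaches a plain argmax.
    """
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / max(temperature, 1e-6)
    z -= z.max()  # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(p), p=p))
```

At low temperature the sampler reliably picks the model's top note; raising the temperature trades consonance for surprise, which is the stylistic range we observed.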
Our next step is to add tempo information and rests to the network. Once this is in place, we hope to find a way to measure the creativity of our generated solos. One approach would be to borrow ideas from Generative Adversarial Networks and train a discriminator to tell human-composed and machine-composed solos apart; a better composer would fool the discriminator more often. We would also send recordings of our output to human musicians and gather feedback on the quality and originality of the compositions.
To tie everything together, we plan to develop a self-contained web application for users to interact with our service. The web application will consist of a backend server that contains our generation logic and a client-side interface. We plan to develop this using modern web technologies such as Node.js/Express, React, and MongoDB. Depending on the computational demands of training and running our models, we will most likely use cloud computing services like AWS to perform both the transcription and the jazz generation. Our full-fledged web app will host the entire end-to-end user experience, from uploading audio files all the way to generating, downloading, and sharing music. We will also keep scalability and concurrency in mind, especially for simultaneous users of the service.