Our main target group for this project is musicians. When improvising over a jazz track, musicians face two key tasks: 1) identifying the underlying chord progression and 2) generating a melody that matches the harmonic environment defined by the chords. We will tackle each of these steps individually to construct an end-to-end system that takes raw jazz audio as input and outputs a novel creative solo.
Even in isolation, both phases are directly useful to musicians of all backgrounds. Chord transcription is a challenging, time-consuming process even for experienced human experts, and beginner musicians often struggle to identify progressions in music they want to play along to. Online services such as Chordify attempt to address this need, but their transcription accuracy is poor, especially for the more complicated harmonies present in jazz; we aim to experiment with applying machine learning to improve performance for end users. Solo generation is more directly useful to jazz musicians looking for a jamming partner. The improvisational dynamic between musicians is a defining feature of jazz, with both performances and casual play featuring call-and-response and trading fours. When meeting in person is difficult, an automated jazz improviser could be a valuable practice buddy for jazz musicians across many instruments and styles.
Jazz-O-Matic will perform signal processing on a given audio file to extract tempo and harmony, and then use an AI model to generate a novel improvisation that accompanies the music. There are four main aspects to our project:
Tempo extraction is the first step in the classification process. The tempo of a music sample is its number of beats per minute (BPM). This information is crucial because the generative model must match the tempo to create a complementary solo. To extract it, we need software that takes a raw audio stream as input and outputs the BPM. This task seems simple for humans because our ears and brains are extremely good at audio sampling and classification, but a computer sees only the raw bit stream digitized from the microphone. Jazz, like most music, layers many elements on top of one another, creating an audio signal that looks nothing like its constituent parts. Our current plan is to take the Fourier transform of the input waveform to move into the frequency domain, in an attempt to unmix the fused elements of the time-domain signal. Next, we will either train a recurrent neural network or implement spectral analysis techniques to predict beat positions. Then, we will use a resonating comb filter to detect periodicity in the signal and find its maximum, i.e., the BPM. This plan is subject to change as our research continues, but some variation of the above process will most likely be implemented. We will evaluate the model's accuracy on comprehensive datasets and measure success by comparing our algorithm's performance against industry-standard ones.
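As a rough illustration of this pipeline, the sketch below estimates BPM from an onset-strength envelope (spectral flux between FFT frames) followed by an autocorrelation search for the dominant beat period, which plays the role of the comb filter. All frame sizes, tempo bounds, and function names here are our own placeholder choices, not a finalized design:

```python
import numpy as np

def estimate_bpm(signal, sr, frame=1024, hop=512):
    """Sketch: spectral-flux onset envelope + autocorrelation tempo search.

    A simplified stand-in for the FFT + comb-filter pipeline described
    above; parameters are illustrative, not tuned.
    """
    # Short-time Fourier transform magnitudes, one row per frame
    n_frames = 1 + (len(signal) - frame) // hop
    window = np.hanning(frame)
    mags = np.array([
        np.abs(np.fft.rfft(signal[i * hop:i * hop + frame] * window))
        for i in range(n_frames)
    ])
    # Spectral flux: positive magnitude change between consecutive frames
    flux = np.maximum(np.diff(mags, axis=0), 0).sum(axis=1)
    flux -= flux.mean()
    # Autocorrelation of the onset envelope: peaks mark beat periodicity
    ac = np.correlate(flux, flux, mode="full")[len(flux) - 1:]
    fps = sr / hop                      # onset-envelope frames per second
    lags = np.arange(len(ac))
    # Restrict the search to plausible tempi (40-240 BPM)
    valid = (lags >= fps * 60 / 240) & (lags <= fps * 60 / 40)
    best_lag = lags[valid][np.argmax(ac[valid])]
    return 60.0 * fps / best_lag
```

On a synthetic click track the strongest autocorrelation lag lands near the true beat period, though the hop size quantizes the answer; a real implementation would interpolate around the peak and weight lags against octave errors (half/double tempo).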
We will apply signal processing methods to handle harmony extraction. Spectrograms will be constructed from FFTs of discrete time windows throughout the song. Filtering and averaging methods will be used to diminish non-harmonic frequencies (percussion, various noises, melodies). Because the notes of Western music repeat every octave across the frequency spectrum, we can fold frequencies into pitch classes and determine how the signal energy breaks down by note. This note breakdown, synchronized with the extracted tempo, will be enough for the neural network to improvise a consonant solo. However, proper chord names would be a very useful feature for a human user. This can be attempted deterministically, by simply mapping the note names to the chord that best explains their presence; however, nonidealities and musical subtleties will make this easier said than done. A potential solution is to train a neural network on audio with pre-transcribed chords, which could then identify the most likely chord name given the current notes and the notes in the previous time window. Our simple metrics will be note identification accuracy and chord identification accuracy. For specifications, we hope to achieve at least comparable performance to existing systems such as Chordify on jazz audio tracks with known transcriptions.
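The note-energy breakdown above can be sketched as a chroma vector: each FFT bin is mapped to its nearest equal-tempered pitch, folded modulo 12 into a pitch class, and the magnitudes are accumulated. The frequency cutoffs below are our own illustrative choices:

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chroma_vector(frame, sr):
    """Fold one FFT magnitude spectrum into 12 pitch classes.

    A minimal sketch of the note-energy breakdown described above;
    the band limits (roughly the piano range) are placeholder choices.
    """
    mags = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    chroma = np.zeros(12)
    for f, m in zip(freqs[1:], mags[1:]):      # skip the DC bin
        if f < 27.5 or f > 4200:               # ignore out-of-range bins
            continue
        # MIDI note number: 69 = A4 = 440 Hz; modulo 12 gives pitch class
        midi = 69 + 12 * np.log2(f / 440.0)
        chroma[int(round(midi)) % 12] += m
    return chroma / (chroma.sum() + 1e-9)
```

For example, a pure 440 Hz tone concentrates nearly all its chroma energy in pitch class "A"; a C major chord would spread energy over C, E, and G, which is the representation a deterministic or learned chord labeler would consume.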
Neural Networks for Generating Novel Solos
For the improvisational system, we hope to adapt and compare various leading models from text generation research. We will start with a basic Markov model that learns simple probabilistic distributions over successor notes conditioned on the last note. Notable challenges here include learning distributions over different chord types and interfacing this with the harmony extraction component. Later, we hope to move on to LSTM and Transformer models. Since the output of this stage is creative, it is difficult to develop a metric for which of these approaches is most “successful.” We plan to consider two separate validation approaches. The first borrows ideas from Generative Adversarial Networks: we train a discriminator to tell human-composed and machine-composed solos apart, and a better composer is one that fools the discriminator more often. The second is to send recordings of our output to human musicians and gather feedback on the quality and originality of the compositions.
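The baseline Markov model amounts to counting note-to-note transitions in a training corpus and sampling a walk through those counts. A minimal sketch (melodies as lists of MIDI note numbers; all names are our own):

```python
import random
from collections import defaultdict

def train_markov(melodies):
    """First-order Markov model: successor-note counts for each note."""
    counts = defaultdict(lambda: defaultdict(int))
    for melody in melodies:
        for a, b in zip(melody, melody[1:]):
            counts[a][b] += 1
    return counts

def generate(counts, start, length, seed=0):
    """Sample a note sequence by walking the transition counts."""
    rng = random.Random(seed)
    seq = [start]
    for _ in range(length - 1):
        succ = counts.get(seq[-1])
        if not succ:                       # dead end: no observed successor
            break
        notes, weights = zip(*succ.items())
        seq.append(rng.choices(notes, weights=weights)[0])
    return seq
```

Conditioning these transition tables on the current chord (one table per chord quality) is one simple way to interface this baseline with the harmony extraction component before moving to LSTM and Transformer models.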
To tie everything together, we plan to develop a self-contained web application for users to interact with our service. The application will consist of a backend server containing our generation logic and a client-side interface, built with modern web technologies such as Node.js/Express, React, and MongoDB. Depending on the computational demands of training our models, we will most likely use cloud computing services such as AWS to perform both the transcription and the jazz generation. The full-fledged web app will host the entire end-to-end user experience, from uploading audio files through generating, downloading, and sharing music. We will also keep scalability and concurrency in mind, especially for simultaneous users of the service.