Inspiration

Inspired by experimental sound design and circuit bending, we set out to explore manipulating the encoded representations of sounds with the help of an audio/music autoencoder such as EnCodec.

What it does

This model generates a song remix by manipulating the song's latent representation in two ways: first it inverts the normalized latents, then it reverses them along the time dimension. We found that combining these two operations produces a distinctive transformation for a new remix.
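The two operations can be sketched in PyTorch. This is a minimal illustration, not the project's actual code: the latent shape `(batch, channels, time)` and the per-channel mean/std normalization are assumptions about the EnCodec-style encoder output.

```python
import torch

def remix_latents(z: torch.Tensor) -> torch.Tensor:
    """Invert normalized latents, then reverse them along time.

    z is assumed to have shape (batch, channels, time); the exact
    normalization scheme here is illustrative.
    """
    mu = z.mean(dim=-1, keepdim=True)
    sigma = z.std(dim=-1, keepdim=True)
    z_norm = (z - mu) / (sigma + 1e-8)    # normalize per channel over time
    z_inv = -z_norm                       # 1) invert the normalized latents
    z_rev = torch.flip(z_inv, dims=[-1])  # 2) reverse the time dimension
    return z_rev * sigma + mu             # map back to the latent scale
```

Because the mean and std over the time axis are unchanged by flipping, applying the transform twice recovers (approximately) the original latents.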

How we built it

We started from the EnCodec model with Meta's 48 kHz pre-trained weights, running on PyTorch 2.0. The backend, a Python Flask server, is hosted on a Google Cloud GPU instance, and the web UI is built in React.

Challenges we ran into

Our original idea involved more latent-space modifications, but we found that truncating or repeating the latent vectors is not well supported by the decoder: it often produces a harsh buzzing artifact (probably at a frequency that is a multiple of the sample rate) that drowns out any real signal. Because of that, we had to narrow the scope of latent transformations to those that still produced good output.
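For concreteness, the kinds of edits that destabilized the decoder look like the following. This is an illustrative sketch (the `(batch, channels, time)` latent shape is an assumption), not the project's code:

```python
import torch

def truncate_latents(z: torch.Tensor, keep: int) -> torch.Tensor:
    # Drop the tail of the time axis -- the decoder handled this poorly.
    return z[..., :keep]

def repeat_latents(z: torch.Tensor, times: int) -> torch.Tensor:
    # Tile the latents along the time axis -- same problem.
    return z.repeat(1, 1, times)
```

Both operations change the length of the latent sequence, which the decoder frequently turned into the buzzing artifact described above.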

Accomplishments that we're proud of

We made an interesting new sound effect that has (likely) never been used or heard before!

What we learned

Most pre-trained EnCodec decoders are not resilient to corrupted dimensions in the latent space, and they also struggle to encode/decode silence without generating buzzing frequencies. This could likely be addressed by fine-tuning with dropout and more targeted training examples.

What's next for R.A.C.K. Neural Audio Inverter

Building a VST/AU plugin for DJs and music producers that integrates these neural transformations (and further neural operations) in real time.
