Inspiration

Some of our team members are hobbyist DJs, playing shows for college parties and other local events. As smaller DJs, we've always dreamed of having amazing live visuals for our sets, at the same quality as the largest EDM festivals. However, the cost, effort, and skill required were clearly out of reach for us, and out of reach for any musicians but the higher-profile ones with the budget to commission lengthy visualizations and animations from third-party digital artists. That was a few years ago. Recently, generative vision models have reached a remarkably high quality of output, and open-source models (namely, stable diffusion) allow for intricate, transformative engineering at the level of model internals. We realized our live-visuals dream had become possible.

We've come up with an app that takes a song and a prompt and generates, in real time, a music video perfectly synced to the music. The app lets DJs and other musicians have professional-level visuals for their live shows without expensive commissions or complex technical setups, giving any musician access to stunning visuals that take their performances to the next level.

What it does

Mixingjays is an innovative video editing tool powered by machine learning that's designed specifically for creating stunning visual accompaniments to music. Our app allows clients to upload their music tracks and specify various parameters such as textures, patterns, objects, colors, and effects for different sections of the song.

Our system then takes these specifications and translates them into prompts for our advanced ML stable diffusion model, which generates a unique video tailored to the uploaded music track. Once the video is generated, the user can easily edit it using our intuitive and user-friendly interface. Our app provides users with a range of editing options, from basic graphical edits (similar to popular photo editing apps like Instagram and VSCO) to advanced generative edits that are handled by our ML pipeline.

The user experience of Mixingjays is similar to that of well-known video editing software like Adobe Premiere or Final Cut, but with our machine learning technology driving the video outputs. The app lets users create professional-quality music videos with ease and without extensive technical knowledge. Whether you're a professional musician or an amateur DJ, Mixingjays can help you create stunning visuals that take your music to the next level while keeping full creative control over the video in your hands.

How we built it

One of our baseline innovations is generating video from stable diffusion by engineering on its model internals. As a standalone system, stable diffusion outputs static images from natural language prompts. Given a list of images that stable diffusion outputs from several prompts, we can generate video between those images by interpolating in the model's latent space. A key detail is that stable diffusion actually computes its output in a latent space, a compressed coordinate system that the model uses to abstractly represent images and text; the latent vector is then decoded into the image we see. This latent space is well-behaved under Euclidean operations, so we can generate new image outputs, and smooth transitions between images strung into a video, by linearly interpolating between the latent vectors of two images and decoding every interpolated vector. If the two images we interpolate between come from two different prompts with the same seed, or the same prompt across two different seeds, the resulting interpolations and video end up semantically coherent and appealing.[1] A nice augmentation to our interpolation technique is perturbing the latent vector in coordinate space, producing a new latent vector anywhere within a small Lp-norm ball around the original. (This resembles the setting of adversarial robustness in computer vision research, although our application is not adversarial.) As a result, given a list of stable diffusion images and their underlying latent vectors, we can generate video by latent-space-interpolating between the images in order. A minimal sketch of this idea follows.
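To make this concrete, here is a minimal sketch of the interpolation step using the Hugging Face diffusers library. The checkpoint name, prompts, frame count, and the slerp helper are our illustrative assumptions, not a fixed part of our pipeline:

```python
# Sketch: spherical interpolation between two stable diffusion latents,
# decoding each intermediate latent into a video frame. Illustrative only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Two prompts, same seed: ask the pipeline for raw latents instead of images.
gen = torch.Generator("cuda").manual_seed(42)
latent_a = pipe("neon jellyfish, hd, 4k", generator=gen, output_type="latent").images
gen = torch.Generator("cuda").manual_seed(42)
latent_b = pipe("neon forest, hd, 4k", generator=gen, output_type="latent").images

def slerp(t, v0, v1):
    """Spherical interpolation: behaves better than lerp for Gaussian-ish latents."""
    dot = (v0.flatten() * v1.flatten()).sum() / (v0.norm() * v1.norm())
    theta = torch.acos(dot.clamp(-1.0, 1.0))
    return (torch.sin((1 - t) * theta) * v0 + torch.sin(t * theta) * v1) / torch.sin(theta)

@torch.no_grad()
def frames_between(latent_a, latent_b, n=24):
    frames = []
    for i in range(n):
        z = slerp(i / (n - 1), latent_a, latent_b)
        # Optional: a small random perturbation keeps z inside an L2 ball
        # around the trajectory, adding texture without breaking coherence.
        # z = z + 0.05 * torch.randn_like(z)
        img = pipe.vae.decode(z / pipe.vae.config.scaling_factor).sample
        frames.append(img)
    return frames
```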

We generate videos directly from the music, which relies on a suite of algorithms for music analysis. The demucs model from Facebook AI Research (FAIR) performs "stem separation," isolating individual instruments/elements in a music track; we pass our track into the model and get back four new tracks of the same length, containing only the vocals, drums, bass, and melodic/harmonic instruments, respectively. With OpenAI Whisper, we extract all lyrics from the isolated vocal track and bucket them into corresponding timestamps at regular intervals. With more classical music analysis algorithms (a sketch of the full pipeline follows this list), we're able to:

  • convert our drum track and bass track into timestamps for the rhythm/groove of the track
  • convert our melody/harmony track into a timestamped chord progression, essentially the formal musical notation of the track's harmonic information.
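
A rough sketch of that analysis chain, assuming demucs's command-line interface, OpenAI Whisper's Python API, and librosa for beat tracking and chroma features (the file paths follow demucs's default output layout, which is an assumption here):

```python
# Rough sketch of the analysis chain; stem paths follow demucs's default
# "<out>/<model>/<track>/<stem>.wav" layout, which is an assumption here.
import subprocess
import librosa
import whisper

# 1. Stem separation: demucs writes vocals/drums/bass/other stems to disk.
subprocess.run(["demucs", "track.mp3", "-o", "stems"], check=True)

# 2. Timestamped lyrics from the isolated vocal stem via Whisper.
asr = whisper.load_model("base")
segments = asr.transcribe("stems/htdemucs/track/vocals.wav")["segments"]
lyrics = [(s["start"], s["end"], s["text"]) for s in segments]

# 3. Rhythm/groove timestamps from the drum stem.
y, sr = librosa.load("stems/htdemucs/track/drums.wav")
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)

# 4. Harmonic content from the melodic stem; chroma features are the
# usual starting point for template-based chord estimation.
y_h, sr_h = librosa.load("stems/htdemucs/track/other.wav")
chroma = librosa.feature.chroma_cqt(y=y_h, sr=sr_h)
```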

These aspects of music analysis feed directly into our stable diffusion and latent-space-interpolation pipeline. In practice, the natural language prompts we give stable diffusion usually consist of long lists of keywords specifying:

  • objects, textures, themes, backgrounds
  • physical details
  • adjectives influencing tone of the picture
  • photographic specs: lighting, shadows, colors, saturation/intensity, image quality ("hd", "4k", etc)
  • artistic styles

Many of these keywords are exposed to the client at the UI level of our application, as small text inputs, sliders, knobs, dropdown menus, and more. As a result, the client retains many options for influencing the video output, independently of the music. These keywords are concatenated with natural language encapsulations of musical aspects:
  • Lyrics are directly added to the prompts.
  • Melodic/harmonic information and chord progressions map to different colors and themes, based on different chords' "feel" in music theory.
  • Rhythm and groove feed into the interpolation system! We speed up, slow down, and alter our trajectory of movement through latent space in time with the rhythm.

The result is a high-quality visualizer that incorporates both the user's specifications/edits and the diverse analyzed aspects of the music track. A sketch of the prompt assembly and beat-synced scheduling follows.
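Here is an illustrative sketch of both steps: assembling a prompt from user keywords, a lyric line, and a chord-to-feel mapping, then warping the interpolation schedule so latent-space motion lands on the beat. The chord table, easing constants, and all names here are hypothetical:

```python
# Hypothetical sketch: prompt assembly plus a beat-synced interpolation
# schedule. The chord-to-feel table and easing constants are invented.
CHORD_FEEL = {"Cmaj": "warm golden light", "Amin": "cool blue shadows"}

def build_prompt(user_keywords, lyric_line, chord):
    """Concatenate UI keywords, the current lyric, and the chord's 'feel'."""
    parts = user_keywords + [lyric_line, CHORD_FEEL.get(chord, "")]
    return ", ".join(p for p in parts if p)

# build_prompt(["liquid chrome", "hd", "4k"], "city lights calling", "Amin")
# -> "liquid chrome, hd, 4k, city lights calling, cool blue shadows"

def interp_schedule(n_beats, frames_per_beat=12, accent=0.35):
    """Per-frame interpolation parameters t in [0, 1] that accelerate into
    each beat, so motion through latent space lands on the rhythm.
    (A real version would also account for uneven beat spacing.)"""
    ts = []
    for b in range(n_beats):
        for i in range(frames_per_beat):
            u = (i / frames_per_beat) ** (1 + accent)  # speed up into the beat
            ts.append((b + u) / n_beats)
    return ts
```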

Theoretically, with good engineering, we'd be able to run our video generation and editing pipelines very fast, essentially in real time! Because interpolation occurs at the very last layer of the model internals, video output depends only on a lightweight decoder rather than the entire stable diffusion model; video generation from interpolation is therefore very fast, and the runtime bottleneck is the initial stable diffusion image generation itself. We generate only one or two dozen base images from stable diffusion per video, so by generating each image in parallel on a wide GPU array, as well as incorporating stable diffusion speedups, we could generate the entire video for a music track in a few seconds! The sketch below illustrates why the per-frame cost is so low.
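To show why frames are cheap once the base images exist, here is a sketch of batch-decoding interpolated latents with only the VAE decoder, reusing the `pipe` object from the earlier interpolation sketch:

```python
import torch

@torch.no_grad()
def decode_frames(pipe, latents):
    """latents: (N, 4, 64, 64) stack of interpolated latent vectors.
    One decoder pass per batch: no U-Net denoising loop is involved."""
    images = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
    return images  # (N, 3, 512, 512) frames, ready to write out as video
```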

Challenges we ran into

Prompt generation: Generating good-quality prompts is hard. We tried different ways of prompting the model to see the effect on the quality of generated images and output video, and discovered we needed more description of the song in addition to the manually entered prompt. This motivated us to look into music analysis.

Music analysis: Our music analysis required pulling together many disparate libraries and systems for digital signal processing that aren't typically used together, or popular enough to be well supported. The challenge was wrangling these libraries into a single system without the pipeline crumbling under obscure bugs.

Stable Diffusion: The main challenge in using stable diffusion for multiple image generation is the overhead cost in compute and time. Naive implementations of stable diffusion took around 30 seconds to generate an image, which was a problem since we need to generate many images for each song and prompt pair. The Modal workshop at TreeHacks was very helpful in navigating this issue. We used Modal containers and stubs to bake our functions into container images, so that GPUs are claimed only while a function runs and are freed for other functions otherwise. Modal also helped us parallelize our code, which made things much faster. However, since the platform is new, it was difficult to get it to recognize local files, add type annotations to function methods, and so on. A sketch of the pattern we used follows.
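A hedged sketch of the Modal pattern we're describing, using the 2023-era stub API; the pip packages, GPU type, and the placeholder function body are illustrative assumptions:

```python
# Hedged sketch of the 2023-era Modal pattern: functions baked into a
# container image, a GPU attached per call, and .map() for parallel fan-out.
import modal

stub = modal.Stub("mixingjays")
image = modal.Image.debian_slim().pip_install("torch", "diffusers", "transformers")

@stub.function(image=image, gpu="A10G")
def generate_image(prompt: str):
    # Placeholder body: the real function loads the stable diffusion
    # pipeline (cached in the container) and renders one image.
    return f"rendered:{prompt}"

@stub.local_entrypoint()
def main():
    prompts = ["neon jellyfish, hd", "neon forest, hd", "city lights, hd"]
    images = list(generate_image.map(prompts))  # fans out to parallel containers
```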

Interpolation: This was by far the hardest part. Frame interpolation is not only slow but also hard to implement or reproduce. After extensive research and trials with different libraries, we used Google Research's FILM frame-interpolation model. It is implemented in TensorFlow, in contrast to the rest of our code, which was PyTorch-heavy. It is also slow, and its cost climbs steeply with frame count: each round of recursive interpolation doubles the number of frames and model calls, and a 1-minute video took 10 minutes to generate. Given the 36-hour time limit, we had to generate the videos for our app in advance, but there is tons of scope to make it faster. A sketch of the approach follows.
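For reference, here is a sketch of calling FILM through TensorFlow Hub; we believe this matches the published module's interface (inputs x0, x1, and a fractional time), but treat the details as assumptions:

```python
# Sketch of calling FILM through TensorFlow Hub; interface details assumed.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

film = hub.load("https://tfhub.dev/google/film/1")

def midpoint(frame_a: np.ndarray, frame_b: np.ndarray) -> np.ndarray:
    """Synthesize the frame halfway between two (H, W, 3) float32 RGB images."""
    inputs = {
        "x0": tf.expand_dims(frame_a, 0),
        "x1": tf.expand_dims(frame_b, 0),
        "time": tf.constant([[0.5]], dtype=tf.float32),
    }
    return film(inputs)["image"][0].numpy()

# Recursive doubling: k rounds of midpoint() turn N frames into ~N * 2**k,
# which is why render time climbs so quickly with frame count.
```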

Integration: The biggest bottlenecks came when trying to generate visuals for arbitrary input mp3 files. Because of this, we had to scale back our expectation of on-the-fly, real-time rendering of any input file to a set of input files that we rendered offline. Render times ranged anywhere from 20 seconds to many minutes, which means we're still somewhat removed from shipping this as a real application for users around the world to play with. Real-time video editing capabilities also proved difficult to implement given the lack of time for building substantial infrastructure, but we see this as a next step in building out a fully functional product that anyone can use!

Accomplishments that we're proud of

We believe we've made significant progress towards a finalized, professional-grade application for our purposes. We showed that the interpolation idea produces video output that is high-quality, semantically coherent, granularly editable, and closely linked to the concurrent musical aspects of rhythm, melody, harmony, and lyrics. This makes our video offerings genuinely usable in professional live performances. Our hackathon result is a more rudimentary proof-of-concept, but our results lead us to believe that continuing this project in a longer-term setting would lead to a robust, fully-featured offering that directly plugs into the creative workflow of hobbyist and professional DJs and other musical live performers.

What we learned

We found ourselves learning a ton about the challenges of implementing cutting-edge AI/ML techniques in the context of a novel product. Given our team members' diverse backgrounds, we also learned a great deal through cross-team collaboration between scientists, engineers, and designers.

What's next for Mixingjays

Moving forward, we have several exciting plans for improving and expanding our app's capabilities. Here are a few of our top priorities:

Improving speed and latency: We're committed to making our app as fast and responsive as possible. That means improving the speed and latency of our image-to-video interpolation process, as well as optimizing other parts of the app for speed and reliability.

Video editing: We want to give our users more control over their videos. To that end, we're developing an interactive video editing tool that will allow users to regenerate portions of their videos and choose which generation they want to keep. Additionally, we're exploring the use of computer vision to enable users to select, change, and modify objects in their generated videos.

Script-based visualization: We see a lot of potential in using our app to create visualizations based on scripts, particularly for plays and music videos. By providing a script, our app could generate visualizations that offer guidance on stage direction, camera angles, lighting, and other creative elements. This would be a powerful tool for creators looking to visualize their ideas in a quick and efficient manner.

Overall, we're excited to continue developing and refining our app to meet the needs of a wide range of creators and musicians. With our commitment to innovation and our focus on user experience, we're confident that we can create a truly powerful and unique tool for music video creation.

References

[1] Seth Forsgren and Hayk Martiros, "Riffusion - Stable diffusion for real-time music generation," 2022. https://riffusion.com/about
