Inspiration

Every creator hits the same wall: you've made a great video, but finding music that actually fits it is miserable. Licensed tracks are expensive, royalty-free libraries are generic and overused, and real scoring tools have a brutal learning curve. We're musicians and editors ourselves, and we were tired of scrubbing through stock-music sites trying to force a pre-made loop onto a moment it was never written for. The track that's "close enough" never lands the emotional beat, and clearing rights for anything better is its own headache. We wanted music composed for your specific video, original, license-free, and shaped to the story, without needing to be a producer or pay a sync-licensing fee.

What it does

BananaMOV turns any video into its own original soundtrack. You upload a clip, and the AI watches it scene by scene, reading its mood, energy, pacing, color, camera movement, and even the existing audio, then builds an emotional timeline of the footage. From that timeline it composes a custom score whose structure rises and falls with the video, so the music swells where the video swells and settles where it settles. The result is yours and royalty-free. You can preview the score against your video with synced playback, split it into stems (drums, bass, melody, and vocals), and then remix everything in a built-in browser DAW with a full mixer, insert effects, tempo control, and timeline editing. Everything runs in the browser with no external software and no licensing headaches.

How we built it

BananaMOV is a Next.js 16 and TypeScript app built around a provider/factory architecture, so each AI service is swappable without touching the rest of the code. Google Gemini 2.5 Flash performs the video analysis, returning a structured emotional timeline plus sound-only musical direction (key, tempo, dynamics, instrumentation, and per-segment transitions). We map that timeline into an ElevenLabs Music composition plan, one music section per arc segment, so the generated score's structure aligns with the video rather than playing as a generic loop underneath it. Demucs handles local stem separation into four tracks, ffmpeg extracts the original audio for synced preview, and a custom Web Audio API engine powers the DAW with pitch-preserved tempo changes, effects routing, and offline WAV rendering. Saved projects persist on Neon Postgres with Drizzle and Vercel Blob storage, with optional Clerk authentication.
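To make the provider/factory idea and the one-section-per-segment mapping concrete, here is a minimal sketch. Every name in it (ArcSegment, PlanSection, MusicProvider, StubMusicProvider, createProviderRegistry) is hypothetical and chosen for illustration; the field names are not BananaMOV's actual types or the ElevenLabs request schema.

```typescript
// Hypothetical types: field names are illustrative only.
interface ArcSegment {
  startSec: number;
  endSec: number;
  mood: string;
  direction: string; // sound-only musical direction for this segment
}

interface PlanSection {
  name: string;
  durationMs: number;
  direction: string;
}

// The single interface the rest of the app would depend on.
interface MusicProvider {
  buildPlan(segments: ArcSegment[]): PlanSection[];
}

// One plan section per arc segment, kept in timeline order.
function planFromTimeline(segments: ArcSegment[]): PlanSection[] {
  return segments.map((seg, i) => ({
    name: `section-${i + 1}-${seg.mood}`,
    durationMs: Math.round((seg.endSec - seg.startSec) * 1000),
    direction: seg.direction,
  }));
}

// Stand-in provider; a real one would also call the music service.
class StubMusicProvider implements MusicProvider {
  buildPlan(segments: ArcSegment[]): PlanSection[] {
    return planFromTimeline(segments);
  }
}

// Generic registry: callers ask for a provider by name and never import
// a concrete implementation directly.
function createProviderRegistry<T>() {
  const factories = new Map<string, () => T>();
  return {
    register(name: string, factory: () => T): void {
      factories.set(name, factory);
    },
    create(name: string): T {
      const factory = factories.get(name);
      if (!factory) throw new Error(`Unknown provider: ${name}`);
      return factory();
    },
  };
}

const musicProviders = createProviderRegistry<MusicProvider>();
musicProviders.register("stub", () => new StubMusicProvider());
```

Swapping models under this sketch means registering a different factory under the same name; routes, hooks, and components that only see MusicProvider stay unchanged.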

Challenges we ran into

Aligning generated music to a video's timeline was the hardest part. ElevenLabs caps each section between 3 and 120 seconds, so we had to merge tiny scene cuts into their neighbors and split long sustained moods into musically coherent blocks while keeping the section durations honest. Getting Gemini to describe sound instead of pictures took heavy prompt engineering plus a "visual-leak" filter that strips out any language referencing what is seen. Browser audio fought us too. Many uploaded clips use codecs the browser cannot decode, so the video plays silently, which meant we had to extract the original audio with ffmpeg and play it as a separate slaved track that drift-corrects against the video clock. Building a real-time DAW clock that keeps the playhead, the audio scheduler, and the pitch-preserving time-stretch all in sync off a single time source was its own rabbit hole.
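The merge-and-split step can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not our production logic: the Section type and normalizeSections function are hypothetical, short sections are folded into the previous neighbor (or the next one when they lead the timeline), and long sections are split into equal blocks, whereas a musically aware version would presumably snap splits to bar boundaries.

```typescript
const MIN_SEC = 3;   // per-section lower bound described above
const MAX_SEC = 120; // per-section upper bound described above

// Hypothetical shape for one planned music section.
interface Section {
  mood: string;
  durationSec: number;
}

function normalizeSections(input: Section[]): Section[] {
  // Pass 1: fold too-short sections into a neighbor so no time is lost.
  const merged: Section[] = [];
  let carry = 0; // leading short sections waiting for the first long one
  for (const s of input) {
    if (s.durationSec < MIN_SEC) {
      const last = merged[merged.length - 1];
      if (last) last.durationSec += s.durationSec;
      else carry += s.durationSec;
    } else {
      merged.push({ mood: s.mood, durationSec: s.durationSec + carry });
      carry = 0;
    }
  }
  // Edge case: every section was short. Emit one combined section; it can
  // still be under MIN_SEC if the whole clip is shorter than the minimum.
  if (merged.length === 0 && carry > 0) {
    merged.push({ mood: input[0]?.mood ?? "unknown", durationSec: carry });
  }

  // Pass 2: split too-long sections into equal blocks of at most MAX_SEC.
  const result: Section[] = [];
  for (const s of merged) {
    if (s.durationSec <= MAX_SEC) {
      result.push(s);
      continue;
    }
    const parts = Math.ceil(s.durationSec / MAX_SEC);
    const each = s.durationSec / parts;
    for (let i = 0; i < parts; i++) {
      result.push({ mood: s.mood, durationSec: each });
    }
  }
  return result;
}
```

Because merging adds durations and splitting divides them evenly, the total length of the plan matches the input (up to floating-point rounding), which is what keeps the section durations honest against the video.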

Accomplishments that we're proud of

A score that follows the video instead of looping generically beneath it, with section changes that track the actual emotional arc. A guaranteed layered arrangement on every track, with a real bass layer, a melodic lead, harmonic texture, and drums or wordless vocals added when the analysis calls for them. And a complete, working browser DAW with stems, an FL-style mixer, insert effects including a tempo-synced filter envelope, clip trim and split, and WAV export, all running on the Web Audio API with no native plugins. We also kept the whole pipeline resilient, with validators and fallbacks at every step so a bad response from one model never crashes the flow.
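As an illustration of the validator-plus-fallback idea, here is a minimal sketch for one piece of model output. The MusicalDirection shape, the tempoBpm field name, the fallback values, and the 40 to 200 BPM clamp are all assumptions made for the example, not the values the app actually uses.

```typescript
// Hypothetical subset of the musical direction returned by the analysis model.
interface MusicalDirection {
  key: string;
  tempoBpm: number;
}

// Assumed safe defaults used whenever the model output is unusable.
const FALLBACK_DIRECTION: MusicalDirection = { key: "C major", tempoBpm: 100 };

function parseDirection(raw: string): MusicalDirection {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    // The model returned prose or truncated JSON: use defaults, do not throw.
    return { ...FALLBACK_DIRECTION };
  }
  if (typeof data !== "object" || data === null) {
    return { ...FALLBACK_DIRECTION };
  }
  const rec = data as Record<string, unknown>;

  // Validate each field independently so one bad value does not discard the rest.
  const key =
    typeof rec.key === "string" && rec.key.trim() !== ""
      ? rec.key.trim()
      : FALLBACK_DIRECTION.key;
  const tempoBpm =
    typeof rec.tempoBpm === "number" && Number.isFinite(rec.tempoBpm)
      ? Math.min(200, Math.max(40, rec.tempoBpm)) // clamp to an assumed sane range
      : FALLBACK_DIRECTION.tempoBpm;

  return { key, tempoBpm };
}
```

The point of the design is that every stage always hands the next one a well-formed value, so a malformed response degrades the score's specificity rather than crashing the flow.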

What we learned

How to translate a visual emotional arc into concrete musical direction, mapping mood and pacing onto tempo, key, dynamics, and section transitions. A lot about the realities of browser audio, including codec decoding limits, sample-accurate scheduling, and pitch-preserving time-stretch. How much careful prompt design matters when you need structured, reliable JSON out of a model instead of prose. And how a strict provider/factory pattern pays off, letting us swap analysis and music models behind a single interface without rewriting routes, hooks, or components.

What's next for BananaMOV

Frame-accurate "hit point" syncing so the music lands on specific cuts and beats. User-steerable generation, so you can pick the genre and instruments or regenerate a single section without redoing the whole track. Real-time collaborative mixing in the DAW. Longer-form support beyond the current three-minute cap. And direct export back into an edited video with the score and mix baked in, so you leave BananaMOV with a finished, shareable clip.
