Inspiration
I watch a lot of Netflix and Prime Video. Like, a lot. And the ads drive me insane. You're completely locked into a scene, emotionally invested, and then a random 30-second ad for car insurance pulls you right out of it. I don't have the budget to upgrade to ad-free tiers on every platform I use, and honestly it's frustrating that some platforms started sneaking ads into paid subscriptions too.
But here's the thing. I don't want to use ad blockers either. YouTube needs ad revenue. Netflix needs ad revenue. These platforms employ thousands of people and fund the shows I love watching. Blocking ads entirely feels like a lose-lose. Creators lose income, platforms lose revenue, and eventually the content gets worse for everyone.
So I started thinking about a middle ground. What if ads didn't have to be annoying? What if instead of interrupting your show with a random pre-roll, the ad was actually part of the show? A product that naturally appears in a scene, or a short branded clip that flows between scenes so smoothly you barely notice the transition. The advertiser gets their product on screen, the platform gets paid, and the viewer doesn't get ripped out of the story.
That's Mirage. An AI system that analyzes video content, finds the perfect moment, generates a branded video clip that matches the show's visual style, and splices it in seamlessly.
How I built it
The pipeline has two phases.
Preprocessing happens once per video. FFmpeg pulls out key frames at scene cuts. Instead of looking at all 86,000+ frames in an episode, I get maybe 50 to 80 that actually matter. While that's happening, ElevenLabs Scribe v2 transcribes the audio with word-level timestamps and flags events like laughter, silence, and music. AWS Rekognition runs object detection on each key frame so I know what's in every scene. Tables, cups, restaurants, laptops, whatever is there.
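To make the key frame step concrete, here is a minimal sketch of how an FFmpeg scene-cut extraction command could be built from Python. The function name, the output file pattern, and the 0.3 threshold are assumptions for illustration, not the values Mirage actually uses. The `select` filter with the `scene` score is standard FFmpeg; `-fps_mode vfr` needs a recent FFmpeg build (older builds use `-vsync vfr` instead).

```python
from pathlib import Path


def keyframe_command(video: str, out_dir: str, threshold: float = 0.3) -> list[str]:
    """Build an FFmpeg command that saves one JPEG per detected scene cut.

    The threshold is a hypothetical starting value; lower values keep more frames.
    """
    pattern = str(Path(out_dir) / "frame_%04d.jpg")
    return [
        "ffmpeg",
        "-i", video,
        # Keep only frames whose scene-change score exceeds the threshold.
        "-vf", f"select='gt(scene,{threshold})'",
        # Write frames as they are selected instead of padding to a constant rate.
        "-fps_mode", "vfr",
        pattern,
    ]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(...) to execute it.
    print(" ".join(keyframe_command("episode.mp4", "frames")))
```

For scale, a 60-minute episode at 24 fps is 24 × 60 × 60 = 86,400 frames, which lines up with the 86,000+ figure above.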
Ad creation is where it gets fun. The advertiser opens a chat interface, uploads a product image, and describes what they want. All the preprocessed data (frames, transcript, Rekognition labels) gets fed to Claude via OpenRouter, which picks the best insertion point. Usually a natural scene transition with a moment of silence. It returns the timestamp, its reasoning, and generation prompts.
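The "scene transition with a moment of silence" pattern can also be checked locally before anything goes to the LLM. The sketch below is a hypothetical pre-filter, not how Mirage picks insertion points (Claude makes that call). The `AudioEvent` shape is an assumption for illustration and is not the actual Scribe response format.

```python
from dataclasses import dataclass


@dataclass
class AudioEvent:
    kind: str     # e.g. "silence", "laughter", "music" (assumed labels)
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video


def quiet_cuts(
    scene_cuts: list[float],
    events: list[AudioEvent],
    min_silence: float = 0.5,
) -> list[float]:
    """Return scene-cut timestamps that fall inside a long enough silence."""
    silences = [
        e for e in events
        if e.kind == "silence" and e.end - e.start >= min_silence
    ]
    return [t for t in scene_cuts if any(s.start <= t <= s.end for s in silences)]


if __name__ == "__main__":
    cuts = [12.0, 95.5, 240.0]
    events = [
        AudioEvent("laughter", 11.5, 13.0),
        AudioEvent("silence", 95.0, 96.2),
        AudioEvent("silence", 239.9, 240.1),  # too short to count
    ]
    print(quiet_cuts(cuts, events))
```

A filter like this only narrows the candidate list. It knows nothing about narrative pacing, so the final choice still belongs to the model.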
Then I generate. Gemini helps craft the scene description, and Fal's Nano Banana Pro creates 3 product showcase image variants showing the product in a setting that matches the show's visual style. Fal's Kling 3 Pro then generates two 4-second video clips with audio. Clip A transitions from the last movie frame into the product shot. Clip B transitions from the product back into the next movie frame. FFmpeg stitches it all together: [movie] + [Clip A] + [Clip B] + [movie continues]. That gives you an 8-second branded moment that feels like part of the episode.
I tested across four pieces of content to make sure it generalizes: Business Proposal (K-drama romance), Squid Game (dark thriller), Stranger Things (80s American sci-fi), and a Vietnamese clip to test multilingual support. Different languages, different moods, different visual styles. Mirage handles all of them.
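The [movie] + [Clip A] + [Clip B] + [movie continues] layout maps naturally onto FFmpeg's concat demuxer. This is a rough sketch under assumptions: the file names are hypothetical, the movie is assumed to be already split at the insertion timestamp, and all four parts are assumed to be normalized to the same codec, resolution, and frame rate (the concat demuxer requires matching streams). It is not necessarily how Mirage does its stitching.

```python
def concat_list(parts: list[str]) -> str:
    """Render the text file that FFmpeg's concat demuxer reads, one clip per line."""
    return "".join(f"file '{p}'\n" for p in parts)


def stitch_command(list_file: str, output: str) -> list[str]:
    """Build the FFmpeg command that joins the listed clips in order."""
    return [
        "ffmpeg",
        "-f", "concat",
        # Allow paths in the list file that FFmpeg would otherwise reject as unsafe.
        "-safe", "0",
        "-i", list_file,
        output,
    ]


if __name__ == "__main__":
    # Hypothetical file names for the four segments.
    parts = ["movie_head.mp4", "clip_a.mp4", "clip_b.mp4", "movie_tail.mp4"]
    print(concat_list(parts), end="")
    print(" ".join(stitch_command("parts.txt", "episode_with_ad.mp4")))
```

Write the output of `concat_list` to `parts.txt`, then run the command from `stitch_command` to produce the final file.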
What I learned
Cost engineering turned out to be just as important as the AI itself. Sending every frame through a vision model would run $20 to $50 per episode. The funnel approach I built (local FFmpeg processing narrows it down, cheap parallel analysis identifies candidates, expensive LLM only touches a handful of frames) brings that under $0.15. That's the kind of number that makes the business case real.
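A back-of-envelope check shows why the funnel pays off. The sketch below assumes vision-model pricing scales linearly per frame and that an episode is 60 minutes at 24 fps; both are simplifying assumptions. It only estimates the vision-model share for the key frames and leaves out transcription, Rekognition, and generation costs.

```python
# Assumption: 60-minute episode at 24 fps, matching the "86,000+ frames" figure.
TOTAL_FRAMES = 24 * 60 * 60


def keyframe_vision_cost(
    episode_cost: float,
    key_frames: int,
    total_frames: int = TOTAL_FRAMES,
) -> float:
    """Scale a full-episode vision cost down to a handful of key frames.

    Assumes cost is proportional to the number of frames sent.
    """
    return key_frames * episode_cost / total_frames


if __name__ == "__main__":
    low = keyframe_vision_cost(episode_cost=20.0, key_frames=50)
    high = keyframe_vision_cost(episode_cost=50.0, key_frames=80)
    print(f"key frame vision cost: ${low:.3f} to ${high:.3f}")
```

Under these assumptions the vision share for 50 to 80 key frames comes out to a few cents, which leaves room for the other stages inside a $0.15 budget.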
The transition between real footage and generated content is where the magic happens or doesn't. Early attempts looked obviously fake because the lighting and color grading didn't match. Using Kling 3 to generate smooth morphs between actual movie frames and product images was the key. It handles the style matching way better than trying to composite static images.
Building the Business Proposal demo was fun because K-dramas have so many natural product placement opportunities. Café scenes, office desks, restaurant dates. It also pushed me to make sure the pipeline works across languages, not just English content, since I'm presenting at LotusHack in HCMC.
Challenges I faced
Video generation speed was the main pain point. Kling 3 isn't instant and generating clips takes real time. I precomputed results for my demo clips and kept a live analysis flow for short segments so judges can see the tech actually running.
Getting the audio right was harder than I expected. You can't just splice a silent ad into a scene because the audio gap is immediately noticeable. I had to generate ambient SFX through ElevenLabs that match the surrounding scene's audio environment.
The LLM kept picking technically valid but emotionally terrible insertion points early on. It would drop an ad right after a character cries or during a tense moment. I iterated on the scene selection prompt a lot to factor in narrative pacing and emotional beats, not just visual opportunity.
Built With
- amazon-web-services
- elevenlabs
- fal
- fastapi
- openrouter
- react
- rekognition