Story Labs

Inspiration

Video editing hasn't fundamentally changed in 20 years. You still drag clips, scrub timelines, and manually adjust every parameter. We asked: what if your editor could just listen?

With Gemini's Live API enabling real-time multimodal interaction, we saw the opportunity to build something genuinely new - an agent that hears your intent, sees your current edit, generates creative assets, and applies changes directly to the timeline.

The app is deployed and fully functional at http://gemininychackathon.vercel.app. Right now it supports only one video per email address, so please do not abuse it.

What it does

StoryLab is a Gemini Live-powered video creation and editing agent with three modes:

Live Edit Mode - Talk to your video. Scout, our editing agent, runs on the Gemini Live API with full interruption support. Ask it to add text overlays, swap music, apply visual effects, trim scenes, or fill in YouTube metadata - all by voice, in one persistent session.

Screen-Aware Mode - Scout watches your editor at 1 frame/second. When you say "edit this image" or "fix this scene," it knows exactly what's on screen. No clicking, no selecting, no explaining context.

Creative Director Mode - Gemini's interleaved generation produces storyboards, scene images, and narrative direction in one co-created output stream. Not stitched together after the fact - generated together.
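
To make Live Edit Mode's voice-driven tools more concrete, here is a minimal sketch of how model-issued tool calls could be routed to timeline edits. It is not StoryLab's actual code: the registry, the tool names `add_text_overlay` and `trim_scene`, the `{"name": ..., "args": {...}}` call shape, and the dict-based timeline are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: maps a tool name to the function that applies the edit.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {}


def tool(name: str):
    """Register a handler under the name the model would use in a tool call."""
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator


@tool("add_text_overlay")  # illustrative tool name, not taken from the project
def add_text_overlay(timeline: dict, text: str, start_s: float, end_s: float) -> dict:
    overlays = timeline.setdefault("overlays", [])
    overlays.append({"text": text, "start_s": start_s, "end_s": end_s})
    return {"ok": True, "overlay_count": len(overlays)}


@tool("trim_scene")  # illustrative tool name, not taken from the project
def trim_scene(timeline: dict, scene_index: int, new_duration_s: float) -> dict:
    scenes = timeline.get("scenes", [])
    if not 0 <= scene_index < len(scenes):
        return {"ok": False, "error": f"no scene at index {scene_index}"}
    scenes[scene_index]["duration_s"] = new_duration_s
    return {"ok": True}


def dispatch(timeline: dict, call: dict) -> dict:
    """Run one tool call and return a result that can be sent back to the model."""
    handler = TOOLS.get(call.get("name", ""))
    if handler is None:
        return {"ok": False, "error": f"unknown tool: {call.get('name')}"}
    try:
        return handler(timeline, **call.get("args", {}))
    except TypeError as exc:
        # Missing or unexpected arguments: report instead of crashing the session.
        return {"ok": False, "error": str(exc)}
```

Returning an error dict rather than raising keeps a long-running voice session alive when the model sends a malformed call, and gives the agent something to read back and correct.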

StoryLab also includes a full video generation pipeline: voice/text/PDF → Gemini 2.5 Pro script agent → TTS voiceover → image generation with visual QA → FFmpeg composition → Twick render → publish to YouTube Shorts, Instagram, and TikTok.

How we built it

Backend: FastAPI (Python) on Google Cloud Run, split into two services - voicevid-api (512 MB memory, 60 s timeout) and voicevid-worker (4 GB memory, 900 s timeout, internal-only)
Live Agent: Gemini Live API WebSocket session with 25+ edit tools - music, effects, captions, image generation, timeline operations
Script Agent: Gemini 2.5 Pro ReAct agent with 14-turn reasoning loop, quality scoring (threshold ≥ 70/100), and async Cloud Tasks queue
Video Pipeline: 7-stage async pipeline - TTS (gemini-2.5-flash-tts), image generation (Gemini Image 3.1 Flash), visual QA (gemini-2.5-flash), FFmpeg composition, Twick renderer (Node.js/Puppeteer on Cloud Run)
Creative Director: Gemini 2.0 Flash interleaved text+image generation
Frontend: Next.js 15 App Router, Firebase Auth, Twick editor SDK
Infrastructure: Cloud Tasks, Cloud Firestore, Cloud Storage, Secret Manager, Cloud Build, Artifact Registry, Vertex AI
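
The Script Agent's turn budget and quality gate can be read as a bounded draft-score-revise loop. The sketch below is one plausible reading under stated assumptions, not the project's implementation: `refine_script`, `draft_fn`, and `score_fn` are hypothetical names, and the model calls are left as injected callables so the control flow can be exercised without any API.

```python
from typing import Callable, Optional, Tuple

MAX_TURNS = 14          # turn budget from the Script Agent description
QUALITY_THRESHOLD = 70  # minimum acceptable score out of 100


def refine_script(
    brief: str,
    draft_fn: Callable[[str, Optional[str], Optional[str]], str],
    score_fn: Callable[[str], Tuple[int, str]],
) -> Tuple[str, int, int]:
    """Draft, score, and revise until the score passes or the turns run out.

    draft_fn(brief, previous_script, feedback) returns a new script (hypothetical).
    score_fn(script) returns (score_0_to_100, feedback) (hypothetical).
    Returns (best_script, best_score, turns_used).
    """
    best_script, best_score = "", -1
    script: Optional[str] = None
    feedback: Optional[str] = None
    for turn in range(1, MAX_TURNS + 1):
        script = draft_fn(brief, script, feedback)
        score, feedback = score_fn(script)
        if score > best_score:
            # Keep the best draft so a late regression cannot lose earlier progress.
            best_script, best_score = script, score
        if score >= QUALITY_THRESHOLD:
            return best_script, best_score, turn
    # Budget exhausted: hand back the best attempt so the caller can decide what to do.
    return best_script, best_score, MAX_TURNS
```

Returning the score alongside the script lets the caller distinguish a passing draft from a best-effort one when the loop runs out of turns.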

Challenges we ran into

Managing Gemini Live WebSocket state across interruptions and mode switches while keeping tool execution in sync with the editor
Building screen-aware context injection at 1fps without overloading the Live session with redundant frames
Veo 3 was too expensive for fully animated motion video, so we shifted to motion-picture style outputs instead
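
For the redundant-frame challenge above, one simple approach is to gate captured frames before they reach the Live session: enforce the 1 frame/second rate and skip frames identical to the last one sent. This is a sketch under assumptions rather than what StoryLab ships. `FrameGate`, `should_send`, and the 10-second `max_silence_s` keep-alive are hypothetical, and an exact SHA-256 match only catches byte-identical frames, so real screen captures with encoding noise would likely need a perceptual comparison instead.

```python
import hashlib
from typing import Optional


class FrameGate:
    """Decides whether a captured frame is worth sending to the live session."""

    def __init__(self, min_interval_s: float = 1.0, max_silence_s: float = 10.0):
        self.min_interval_s = min_interval_s  # roughly 1 frame/second
        self.max_silence_s = max_silence_s    # hypothetical: resend an unchanged frame after this long
        self._last_digest: Optional[str] = None
        self._last_sent_at: Optional[float] = None

    def should_send(self, frame_bytes: bytes, now_s: float) -> bool:
        digest = hashlib.sha256(frame_bytes).hexdigest()
        if self._last_sent_at is not None:
            elapsed = now_s - self._last_sent_at
            if elapsed < self.min_interval_s:
                return False  # too soon after the previous frame
            if digest == self._last_digest and elapsed < self.max_silence_s:
                return False  # nothing changed on screen
        self._last_digest = digest
        self._last_sent_at = now_s
        return True
```

A capture loop would call `should_send(frame, time.monotonic())` on every grab and forward only the frames that pass, so a static editor view costs the session almost nothing.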

Accomplishments that we're proud of

A truly interruption-friendly live agent that maintains edit context across the full session
25+ edit tools working reliably via voice with zero manual UI interaction
End-to-end video generation from voice input to published YouTube Short in one flow
Screen-aware editing - the agent understands what's visible without DOM access or APIs

What we learned

The Gemini Live API's bidirectional streaming is well suited to stateful, long-running creative sessions - not just Q&A
Interleaved generation changes how creative workflows feel - assets emerge with the narrative, not after it
Building reliable async pipelines on Cloud Run requires careful concurrency design (containerConcurrency: 1 on the worker was critical)