Inspiration

We started from a clear product gap: real-time assistants are often reactive but shallow, while offline content generators are powerful but disconnected from live context. Our goal was to merge both modes into a single agent that can listen and respond naturally in live conversation, then instantly switch into a creative director role to produce coherent visual storytelling.

The core idea is "Conversation to Cinema":

  • Dialogue creates intent.
  • Intent becomes storyboard.
  • Storyboard becomes preview.
  • Preview becomes final video.

What it does

Diorama is a production-ready Gemini system that combines two hackathon tracks, Live Agents and Creative Storyteller, into one coherent user experience. The flow is intentionally staged for quality and control:

  1. Directing: The user talks to the agent in real time via an interruptible voice interface.
  2. Planning: Gemini generates a structured cinematic scenario.
  3. Storyboarding: Imagen 3 creates per-shot preview images.
  4. Validation: The user explicitly approves the storyboard via a UI button.
  5. Production: Veo 3.1 Fast generates shot videos based on those previews.
  6. Final Cut: ffmpeg stitches the individual shots into the final film.

This gives users both responsiveness and confidence: natural conversation first, controllable high-quality media output second.
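The sketch below shows one way this staged flow can be wired over a single WebSocket session: each stage emits a status event, the storyboard approval acts as a hard gate before any Veo call, and ffmpeg's concat demuxer assembles the final cut. It assumes a FastAPI WebSocket endpoint; the helper names (plan_scenario, render_previews, render_shots) are illustrative placeholders, not our exact internal API.

```python
# Minimal sketch of the staged pipeline over a FastAPI WebSocket.
# plan_scenario / render_previews / render_shots are hypothetical wrappers
# around the Gemini, Imagen, and Veo calls.
import asyncio
import subprocess

async def run_pipeline(ws, user_intent: str) -> None:
    await ws.send_json({"stage": "planning", "status": "started"})
    scenario = await plan_scenario(user_intent)          # Gemini 2.5 Pro, structured output

    await ws.send_json({"stage": "storyboarding", "status": "started"})
    previews = await render_previews(scenario)           # Imagen 3, one preview per shot
    await ws.send_json({"stage": "validation", "previews": previews})

    # Hard gate: nothing is sent to Veo until the user approves in the UI.
    approval = await ws.receive_json()
    if not approval.get("approved"):
        await ws.send_json({"stage": "cancelled"})
        return

    await ws.send_json({"stage": "production", "status": "started"})
    shot_files = await render_shots(scenario, previews)  # Veo 3.1 Fast, per shot

    # Final cut: the concat demuxer avoids re-encoding when all shots share a codec.
    with open("shots.txt", "w") as f:
        f.writelines(f"file '{path}'\n" for path in shot_files)
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "shots.txt", "-c", "copy", "final.mp4"],
        check=True,
    )
    await ws.send_json({"stage": "final_cut", "file": "final.mp4"})
```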

How we built it

We implemented a Google Cloud-native architecture consisting of three core services:

  • gemini-ws-server: Handles live WebSocket sessions, intent orchestration, and real-time status events.
  • gemini-visualization-api: Serves as the retrieval layer for preview and final assets.
  • gemini-front: A custom React + Vite interface managing the interactive user control flow.

The generation pipeline uses Gemini 2.5 Pro for structured scenarist output, Imagen 3 for storyboard previews, and Veo 3.1 Fast for video. We deploy with a canary-first strategy on Cloud Run, running smoke tests before shifting traffic and relying on Cloud Logging for request traceability.
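As a rough illustration of the scenarist step, the snippet below uses the google-genai SDK's structured-output path to force the plan into a validated schema. The schema fields (Shot, Scenario, visual_anchors, etc.) are examples for this sketch, not our exact manifest.

```python
# Sketch of structured scenarist output with the google-genai SDK and Pydantic.
# Schema fields are illustrative, not the project's exact manifest.
from google import genai
from google.genai import types
from pydantic import BaseModel

class Shot(BaseModel):
    index: int
    description: str            # narrative action for this shot
    camera: str                 # e.g. "slow dolly-in, 35mm"
    visual_anchors: list[str]   # recurring elements that must stay consistent

class Scenario(BaseModel):
    title: str
    style: str                  # global look shared by every shot
    shots: list[Shot]

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Turn this conversation into a 4-shot cinematic scenario: ...",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=Scenario,   # forces a parseable, validated plan
    ),
)
scenario = response.parsed  # Scenario instance, ready for Imagen/Veo prompting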

To ensure shot continuity, we use the structured output to maintain a consistent visual manifest $M$. For any shot $S_n$, the prompt conditioning is defined as $$P(S_n) = f(M, \text{Context}_{n-1})$$ where $f$ is our orchestration logic that balances new narrative intent with existing visual anchors.
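One way to express that conditioning function is sketched below: it merges the fixed manifest $M$ with the previous shot's context to build the prompt for shot $n$. The field names mirror the illustrative schema above and are not our exact implementation.

```python
# Sketch of f(M, Context_{n-1}): merge the global manifest with the previous
# shot's context to condition the prompt for shot n. Field names are illustrative.
def build_shot_prompt(manifest: dict, prev_context: dict | None, shot: dict) -> str:
    anchors = ", ".join(manifest.get("visual_anchors", []))
    parts = [
        f"Style: {manifest['style']}.",
        f"Recurring elements to keep identical: {anchors}.",
        f"Action: {shot['description']}. Camera: {shot['camera']}.",
    ]
    if prev_context:
        # Carry forward lighting and placement so shot n visually continues shot n-1.
        parts.append(f"Continue directly from: {prev_context['last_frame_summary']}.")
    return " ".join(parts)
```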

Challenges we ran into

  • UX Synchronization: Preserving real-time UX while running heavy media generation required precise asynchronous event handling.
  • Resource Management: Handling quota pressure, specifically Imagen 429 RESOURCE_EXHAUSTED errors, without breaking the conversational flow (see the backoff sketch after this list).
  • State Reliability: Preventing false "ready" states before final assets were fully available in Google Cloud Storage.
  • Coherence: Maintaining scene continuity across multiple generated shots.
  • DevOps: Stabilizing canary/stable deployment operations and cleaning legacy fallback logic without regressions.
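For the quota issue above, a minimal sketch of our approach is exponential backoff with jitter plus a status event, so the voice session never appears frozen while we wait out the quota. The wrapper and callback names are hypothetical, and the exact exception type depends on the client library.

```python
# Sketch: retry a quota-limited generation call with exponential backoff and
# jitter, while keeping the live session informed. Names are illustrative.
import asyncio
import random

async def generate_with_backoff(generate_fn, notify, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return await generate_fn()
        except Exception as exc:
            # Exact exception class depends on the client library; only retry 429s.
            if "RESOURCE_EXHAUSTED" not in str(exc) and "429" not in str(exc):
                raise
            delay = min(2 ** attempt + random.random(), 30)
            # Tell the live session we are waiting on quota instead of going silent.
            await notify({"stage": "storyboarding", "status": "retrying", "delay_s": delay})
            await asyncio.sleep(delay)
    raise RuntimeError("Preview generation failed after repeated quota errors")
```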

Accomplishments that we're proud of

  • Production-Grade Stability: Shipping canary deployments and E2E verification rather than stopping at a simple prototype.
  • User Trust: Implementing the Preview-first flow, which significantly improved user confidence in the AI's creative direction.
  • Seamless Orchestration: Successfully chaining four distinct generative models into a single, fluid user journey that feels like a conversation with a human director.

What we learned

  • Control is Key: Explicit approval gates reduce wasted compute and improve user satisfaction.
  • Structure over Chaos: Structured scenarist output is critical for maintaining shot continuity across multimodal outputs.
  • Truthful Signaling: Transparent status signaling is mandatory for a production-ready UX.
  • System Design: Multimodal quality depends as much on the orchestration as it does on the underlying model capability.
  • Ops Discipline: The Canary + Smoke + Promote discipline drastically reduces production risks.

What's next for Diorama

Moving forward, we plan to implement:

  • Real-time Re-shoots: Allowing the user to interrupt the video generation to tweak specific shots while keeping the rest of the film intact.
  • Character Consistency Tuning: Deepening the conditioning between Imagen and Veo to ensure perfect actor persistence.
  • Spatial Audio Integration: Using Gemini to generate synchronized soundscapes and dialogue tracks to match the 4K video output.

Built With

  • fastapi
  • ffmpeg
  • gemini-2.5-pro
  • gemini-live-api
  • google-cloud
  • google-cloud-build
  • google-cloud-logging
  • google-cloud-run
  • google-genai-sdk
  • imagen-3
  • python
  • react+vite
  • veo-3.1-fast
  • websocket