Inspiration

We started from a clear product gap: real-time assistants are often reactive but shallow, while offline content generators are powerful but disconnected from live context. Our goal was to merge both modes into a single agent that can listen and respond naturally in live conversation, then instantly switch into a creative director role to produce coherent visual storytelling.

The core idea is "Conversation to Cinema":

  • Dialogue creates intent.
  • Intent becomes storyboard.
  • Storyboard becomes preview.
  • Preview becomes final video.

What it does

Diorama is a production-ready Gemini system that combines two hackathon tracks, Live Agents and Creative Storyteller, into one coherent user experience. The flow is intentionally staged for quality and control:

  1. Directing: The user talks to the agent in real time via an interruptible voice interface.
  2. Planning: Gemini generates a structured cinematic scenario.
  3. Storyboarding: Imagen 3 creates per-shot preview images.
  4. Validation: The user explicitly approves the storyboard via a UI button.
  5. Production: Veo 3.1 Fast generates shot videos based on those previews.
  6. Final Cut: ffmpeg stitches the individual shots into the final film.

This gives users both responsiveness and confidence: natural conversation first, controllable high-quality media output second.
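The sketch below shows one way this staged flow can be wired over a single WebSocket session: each stage emits a status event, the storyboard approval acts as a hard gate before any Veo call, and ffmpeg's concat demuxer assembles the final cut. It assumes a FastAPI WebSocket endpoint; the helper names (plan_scenario, render_previews, render_shots) are illustrative placeholders, not our exact internal API.

```python
# Minimal sketch of the staged pipeline over a FastAPI WebSocket.
# plan_scenario / render_previews / render_shots are hypothetical wrappers
# around the Gemini, Imagen, and Veo calls.
import asyncio
import subprocess

async def run_pipeline(ws, user_intent: str) -> None:
    await ws.send_json({"stage": "planning", "status": "started"})
    scenario = await plan_scenario(user_intent)          # Gemini 2.5 Pro, structured output

    await ws.send_json({"stage": "storyboarding", "status": "started"})
    previews = await render_previews(scenario)           # Imagen 3, one preview per shot
    await ws.send_json({"stage": "validation", "previews": previews})

    # Hard gate: nothing is sent to Veo until the user approves in the UI.
    approval = await ws.receive_json()
    if not approval.get("approved"):
        await ws.send_json({"stage": "cancelled"})
        return

    await ws.send_json({"stage": "production", "status": "started"})
    shot_files = await render_shots(scenario, previews)  # Veo 3.1 Fast, per shot

    # Final cut: the concat demuxer avoids re-encoding when all shots share a codec.
    with open("shots.txt", "w") as f:
        f.writelines(f"file '{path}'\n" for path in shot_files)
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "shots.txt", "-c", "copy", "final.mp4"],
        check=True,
    )
    await ws.send_json({"stage": "final_cut", "file": "final.mp4"})
```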

How we built it

We implemented a Google Cloud-native architecture consisting of three core services:

  • gemini-ws-server: Handles live WebSocket sessions, intent orchestration, and real-time status events.
  • gemini-visualization-api: Serves as the retrieval layer for preview and final assets.
  • gemini-front: A custom React + Vite interface managing the interactive user control flow.

The generation pipeline uses Gemini 2.5 Pro for structured scenarist output, Imagen 3 for storyboard previews, and Veo 3.1 Fast for video. We deploy with a canary-first strategy on Cloud Run, running smoke tests before shifting traffic and relying on Cloud Logging for request traceability.
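As a rough illustration of the scenarist step, the snippet below uses the google-genai SDK's structured-output path to force the plan into a validated schema. The schema fields (Shot, Scenario, visual_anchors, etc.) are examples for this sketch, not our exact manifest.

```python
# Sketch of structured scenarist output with the google-genai SDK and Pydantic.
# Schema fields are illustrative, not the project's exact manifest.
from google import genai
from google.genai import types
from pydantic import BaseModel

class Shot(BaseModel):
    index: int
    description: str            # narrative action for this shot
    camera: str                 # e.g. "slow dolly-in, 35mm"
    visual_anchors: list[str]   # recurring elements that must stay consistent

class Scenario(BaseModel):
    title: str
    style: str                  # global look shared by every shot
    shots: list[Shot]

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Turn this conversation into a 4-shot cinematic scenario: ...",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=Scenario,   # forces a parseable, validated plan
    ),
)
scenario = response.parsed  # Scenario instance, ready for Imagen/Veo prompting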

To ensure shot continuity, we use the structured output to maintain a consistent visual manifest $M$. For any shot $S_n$, the prompt conditioning is defined as $$P(S_n) = f(M, \text{Context}_{n-1})$$ where $f$ is our orchestration logic that balances new narrative intent with existing visual anchors.
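One way to express that conditioning function is sketched below: it merges the fixed manifest $M$ with the previous shot's context to build the prompt for shot $n$. The field names mirror the illustrative schema above and are not our exact implementation.

```python
# Sketch of f(M, Context_{n-1}): merge the global manifest with the previous
# shot's context to condition the prompt for shot n. Field names are illustrative.
def build_shot_prompt(manifest: dict, prev_context: dict | None, shot: dict) -> str:
    anchors = ", ".join(manifest.get("visual_anchors", []))
    parts = [
        f"Style: {manifest['style']}.",
        f"Recurring elements to keep identical: {anchors}.",
        f"Action: {shot['description']}. Camera: {shot['camera']}.",
    ]
    if prev_context:
        # Carry forward lighting and placement so shot n visually continues shot n-1.
        parts.append(f"Continue directly from: {prev_context['last_frame_summary']}.")
    return " ".join(parts)
```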

Challenges we ran into

  • UX Synchronization: Preserving real-time UX while running heavy media generation required precise asynchronous event handling.
  • Resource Management: Handling quota pressure, specifically Imagen 429 RESOURCE_EXHAUSTED errors, without breaking the conversational flow (see the backoff sketch after this list).
  • State Reliability: Preventing false "ready" states before final assets were fully available in Google Cloud Storage.
  • Coherence: Maintaining scene continuity across multiple generated shots.
  • DevOps: Stabilizing canary/stable deployment operations and cleaning legacy fallback logic without regressions.
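For the quota issue above, a minimal sketch of our approach is exponential backoff with jitter plus a status event, so the voice session never appears frozen while we wait out the quota. The wrapper and callback names are hypothetical, and the exact exception type depends on the client library.

```python
# Sketch: retry a quota-limited generation call with exponential backoff and
# jitter, while keeping the live session informed. Names are illustrative.
import asyncio
import random

async def generate_with_backoff(generate_fn, notify, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return await generate_fn()
        except Exception as exc:
            # Exact exception class depends on the client library; only retry 429s.
            if "RESOURCE_EXHAUSTED" not in str(exc) and "429" not in str(exc):
                raise
            delay = min(2 ** attempt + random.random(), 30)
            # Tell the live session we are waiting on quota instead of going silent.
            await notify({"stage": "storyboarding", "status": "retrying", "delay_s": delay})
            await asyncio.sleep(delay)
    raise RuntimeError("Preview generation failed after repeated quota errors")
```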

Accomplishments that we're proud of

  • Production-Grade Stability: Shipping canary deployments and E2E verification rather than stopping at a simple prototype.
  • User Trust: Implementing the Preview-first flow, which significantly improved user confidence in the AI's creative direction.
  • Seamless Orchestration: Successfully chaining four distinct generative models into a single, fluid user journey that feels like a conversation with a human director.

What we learned

  • Control is Key: Explicit approval gates reduce wasted compute and improve user satisfaction.
  • Structure over Chaos: Structured scenarist output is critical for maintaining shot continuity across multimodal outputs.
  • Truthful Signaling: Transparent status signaling is mandatory for a production-ready UX.
  • System Design: Multimodal quality depends as much on the orchestration as it does on the underlying model capability.
  • Ops Discipline: The Canary + Smoke + Promote discipline drastically reduces production risks.

What's next for Diorama

Moving forward, we plan to implement:

  • Real-time Re-shoots: Allowing the user to interrupt the video generation to tweak specific shots while keeping the rest of the film intact.
  • Character Consistency Tuning: Deepening the conditioning between Imagen and Veo to ensure perfect actor persistence.
  • Spatial Audio Integration: Using Gemini to generate synchronized soundscapes and dialogue tracks to match the 4K video output.

Built With

  • fastapi
  • ffmpeg
  • gemini-2.5-pro
  • gemini-live-api
  • google-cloud
  • google-cloud-build
  • google-cloud-logging
  • google-cloud-run
  • google-genai-sdk
  • imagen-3
  • python
  • react+vite
  • veo-3.1-fast
  • websocket