Inspiration
We started from a clear product gap: real-time assistants are often reactive but shallow, while offline content generators are powerful but disconnected from live context. Our goal was to merge both modes into a single agent that can listen and respond naturally in live conversation, then instantly switch into a creative director role to produce coherent visual storytelling.
The core idea is "Conversation to Cinema":
- Dialogue creates intent.
- Intent becomes storyboard.
- Storyboard becomes preview.
- Preview becomes final video.
What it does
Diorama is a production-ready Gemini system that combines two hackathon tracks, Live Agents and Creative Storyteller, into one coherent user experience. The flow is intentionally staged for quality and control:
- Directing: The user talks to the agent in real time via an interruptible voice interface.
- Planning: Gemini generates a structured cinematic scenario.
- Storyboarding: Imagen 3 creates per-shot preview images.
- Validation: The user explicitly approves the storyboard via a UI button.
- Production: Veo 3.1 Fast generates shot videos based on those previews.
- Final Cut: ffmpeg stitches the individual shots into the final film.
This gives users both responsiveness and confidence: natural conversation first, controllable high-quality media output second.
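The final-cut step above can be sketched with ffmpeg's concat demuxer. This is a minimal illustration, not our production code: the shot filenames, list-file path, and output name are placeholder assumptions, and the function only builds the command rather than executing it.

```python
# Sketch: stitch per-shot videos into one film via ffmpeg's concat demuxer.
# File names below are illustrative placeholders.
from pathlib import Path


def build_concat_command(shots: list[str], list_path: str, output: str) -> list[str]:
    """Write a concat list file and return the ffmpeg command to stitch shots."""
    # The concat demuxer reads "file '<name>'" lines from a list file.
    Path(list_path).write_text("".join(f"file '{s}'\n" for s in shots))
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",  # concat demuxer over the list file
        "-i", list_path,
        "-c", "copy",                  # stream copy: no re-encode between shots
        output,
    ]
```

In practice the returned command would be run with `subprocess.run`; stream copy keeps stitching fast since the shots already share one encoding.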
How we built it
We implemented a Google Cloud-native architecture consisting of three core services:
- gemini-ws-server: Handles live WebSocket sessions, intent orchestration, and real-time status events.
- gemini-visualization-api: Serves as the retrieval layer for preview and final assets.
- gemini-front: A custom React + Vite interface managing the interactive user control flow.
The generation pipeline leverages Gemini 2.5 Pro for structured scenarist output, Imagen 3 for previews, and Veo 3.1 Fast for video. We used a Canary-first deployment strategy on Cloud Run, utilizing smoke tests before traffic shifts and Cloud Logging for request traceability.
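To make the real-time status events concrete, here is a minimal sketch of how the WebSocket server might serialize them. The stage names, field names, and wire format are our assumptions for illustration, not the actual protocol.

```python
# Sketch: JSON status events emitted over the live WebSocket session.
# Stage names and payload fields are illustrative assumptions.
import json
import time
from dataclasses import asdict, dataclass

STAGES = ("directing", "planning", "storyboarding", "validation", "production", "final_cut")


@dataclass
class StatusEvent:
    stage: str
    state: str   # e.g. "started" | "progress" | "done"
    detail: str
    ts: float


def make_event(stage: str, state: str, detail: str = "") -> str:
    """Build one serialized status event for the front end."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return json.dumps(asdict(StatusEvent(stage, state, detail, time.time())))
```

Emitting a "done" event only after the asset is confirmed in storage is what keeps the UI's state truthful (see the State Reliability challenge below).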
To ensure shot continuity, we use the structured output to maintain a consistent visual manifest $M$. For any shot $S_n$, the prompt conditioning is defined as: $$P(S_n) = f(M, \text{Context}_{n-1})$$ where $f$ is our orchestration logic that balances new narrative intent with existing visual anchors.
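A toy version of $f$ can clarify the idea: the manifest's visual anchors are prepended to every shot prompt so they persist, while the previous shot's context and the new intent vary per shot. The field names and prompt wording here are illustrative assumptions, not the production schema.

```python
# Sketch of f(M, Context_{n-1}): merge the persistent visual manifest with
# the previous shot's context and the new narrative intent into one prompt.
# Manifest keys and phrasing are illustrative assumptions.

def condition_prompt(manifest: dict[str, str], prev_context: str, intent: str) -> str:
    """P(S_n) = f(M, Context_{n-1}): anchor new intent to existing visuals."""
    anchors = "; ".join(f"{k}: {v}" for k, v in manifest.items())
    return (
        f"Visual anchors (keep consistent): {anchors}. "
        f"Previous shot: {prev_context}. "
        f"New shot intent: {intent}."
    )
```

Because $M$ is carried unchanged across shots, characters and settings stay visually stable even as the per-shot intent changes.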
Challenges we ran into
- UX Synchronization: Preserving real-time UX while running heavy media generation required precise asynchronous event handling.
- Resource Management: Handling quota pressure, specifically Imagen 429 RESOURCE_EXHAUSTED errors, without breaking the conversational flow.
- State Reliability: Preventing false "ready" states before final assets were fully available in Google Cloud Storage.
- Coherence: Maintaining scene continuity across multiple generated shots.
- DevOps: Stabilizing canary/stable deployment operations and cleaning legacy fallback logic without regressions.
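The quota-pressure challenge above is typically absorbed with exponential backoff, so the conversational loop never surfaces a raw 429. This is a hedged sketch: the exception class and the `generate` callable are placeholders standing in for the real API error and model call.

```python
# Sketch: retry generation on 429 quota errors with exponential backoff + jitter.
# ResourceExhausted and generate() are placeholders for the real API pieces.
import random
import time


class ResourceExhausted(Exception):
    """Placeholder for the API's 429 RESOURCE_EXHAUSTED error."""


def with_backoff(generate, retries: int = 5, base: float = 1.0, sleep=time.sleep):
    """Call generate(), doubling the wait after each quota error."""
    for attempt in range(retries):
        try:
            return generate()
        except ResourceExhausted:
            if attempt == retries - 1:
                raise  # out of budget: let the orchestrator degrade gracefully
            sleep(base * (2 ** attempt) + random.uniform(0, 0.5))  # jitter
```

Running the retries asynchronously (and streaming interim status events) is what lets the voice conversation continue while generation quietly recovers.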
Accomplishments that we're proud of
- Production-Grade Stability: Achieving a system that uses Canary deployments and E2E verification, moving beyond a simple prototype.
- User Trust: Implementing the Preview-first flow, which significantly improved user confidence in the AI's creative direction.
- Seamless Orchestration: Successfully chaining four distinct generative models into a single, fluid user journey that feels like a conversation with a human director.
What we learned
- Control is Key: Explicit approval gates reduce wasted compute and improve user satisfaction.
- Structure over Chaos: Structured scenarist output is critical for maintaining shot continuity across multimodal outputs.
- Truthful Signaling: Transparent status signaling is mandatory for a production-ready UX.
- System Design: Multimodal quality depends as much on the orchestration as it does on the underlying model capability.
- Ops Discipline: The Canary + Smoke + Promote discipline drastically reduces production risks.
What's next for Diorama
Moving forward, we plan to implement:
- Real-time Re-shoots: Allowing the user to interrupt the video generation to tweak specific shots while keeping the rest of the film intact.
- Character Consistency Tuning: Deepening the conditioning between Imagen and Veo to ensure perfect actor persistence.
- Spatial Audio Integration: Using Gemini to generate synchronized soundscapes and dialogue tracks to match the 4K video output.
Built With
- fastapi
- ffmpeg
- gemini-2.5-pro
- gemini-live-api
- google-cloud
- google-cloud-build
- google-cloud-logging
- google-cloud-run
- google-genai-sdk
- imagen-3
- python
- react+vite
- veo-3.1-fast
- websocket