Inspiration ✨
Stories are the oldest form of human connection, yet the way we consume them digitally has remained largely static. We were inspired by the magic of childhood pop-up books and oral storytelling—where every page turn is a surprise and the narrator reacts to your gasps and choices. With the announcement of The Gemini Live Agent Challenge, we realized we had the exact technological missing link to make this a reality: the ability to process real-time voice and interleaved multimodal outputs. We wanted to build an AI that doesn't just "chat," but directs a living, breathing cinematic universe.
What it does 🎬
Lunar Storyteller is a next-generation interactive storybook and creative AI director. It seamlessly weaves together a fluid, mixed-media experience. You don't type; you talk directly to the Chubby Bunny, our low-latency Voice Agent who acts as your energetic Game Master. Once you verbally agree on a narrative path, the system generates a rich, interleaved stream of:
- Bilingual, emotive text narration.
- Generative 3D-styled illustrations visually representing the exact scene.
- Cinematic video renders stitched on-the-fly from the generated assets, complete with spatial auto-generated background music. It is an educational and interactive journey tailored entirely by the user's voice.
How we built it 🏗️
We designed a modern Dual-Agent Routing Architecture hosted on Google Cloud Run to satisfy the interleaved output requirements:
- The Receptionist (Live Audio WebSocket): We used the `gemini-2.5-flash-native-audio-latest` model via the `v1alpha` Live API. This agent handles the bidirectional audio stream. It uses Function Calling to evaluate when the user has made a conclusive choice, triggering the `commit_story_step` tool without breaking conversation.
- The Studio Engine (REST API): The backend catches the tool's payload and routes it to `gemini-2.5-flash`. We provided Gemini with a custom Domain-Specific Language (tags) and passed the historical images as base64 context.
- The Interleaved Mixed Output: The engine computes the narrative and outputs a structured formatting block: `[TEXT]`, `[IMAGE_PROMPT]`, `[CHOICE]`, and `[VIDEO_STORYBOARD]`.
- Asynchronous Video Assembly: Our Python backend parses the `[VIDEO_STORYBOARD]` tag and fires off an isolated Node.js thread using `FFCreatorLite` and FFmpeg to render a 16:9 MP4 from the Imagen 4.0-generated frames, injecting an audio track, all while the user is already reading the text on the React/Three.js frontend.
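To illustrate the tag-based DSL described above, here is a minimal parsing sketch. The tag names come from the write-up; the regex and the `parse_story_block` function are our own illustration, not the project's actual code.

```python
import re

# The four directorial tags emitted by the Studio Engine.
TAGS = ("TEXT", "IMAGE_PROMPT", "CHOICE", "VIDEO_STORYBOARD")

# Matches e.g. "[IMAGE_PROMPT] a bunny on the moon" up to the next tag or end of text.
TAG_RE = re.compile(
    r"\[(" + "|".join(TAGS) + r")\]\s*(.*?)(?=\[(?:" + "|".join(TAGS) + r")\]|\Z)",
    re.DOTALL,
)

def parse_story_block(raw: str) -> list[tuple[str, str]]:
    """Split one model response into ordered (tag, payload) segments."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_RE.finditer(raw)]

segments = parse_story_block(
    "[TEXT] The bunny hops onto the moon. "
    "[IMAGE_PROMPT] 3D-styled bunny on a glowing moon "
    "[CHOICE] Explore the crater | Build a rocket"
)
# segments preserves the interleaved order, so the backend can route each
# payload to the right consumer (text -> UI, image prompt -> renderer, ...).
```

Keeping the segments ordered is what lets the backend trigger image and video rendering mid-stream while the text is already being displayed.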
To formalize the routing logic we built, the probability of rendering a coherent mixed-media narrative arc $M_t$ at step $t$ can be expressed with the following mathematical factorization:
$$ P(M_t) = P_{Live}(s_t \mid A_t, \theta_{Gemini2.0}) \times P_{Studio}(T_t, I_t, V_t \mid s_t, I_{<t}, \theta_{Gemini2.5}) $$
Where $A_t$ is the user's streaming audio, $s_t$ is the committed story state, and $(T_t, I_t, V_t)$ represent the interleaved text, image prompts, and video storyboard instructions derived from the historical visual context $I_{<t}$.
Challenges we ran into 🧩
- The "Silent Video" Rendering Race Condition: Rendering video takes time. Initially, our frontend would try to fetch the `.mp4` the exact millisecond the Node.js process created the file shell, resulting in an unplayable 0-byte video block. We solved this by using Python `subprocess` monitoring, implementing an artificial `_done.txt` quarantine lock, and running an automatic FFmpeg `-movflags faststart` pass so the video could stream instantly on Chrome/Safari.
- Quota Exhaustion Management: Processing ultra-high-quality images with the `imagen-4.0-ultra` model quickly hit our hackathon-tier quota limits. We had to build a fallback interception script that uses PIL (Pillow) to grab lower-res pre-rendered fallback emotion cards, then resize and center-crop them to exactly $1024 \times 1024$ pixels via Lanczos resampling so the video builder wouldn't crash on dimension mismatches.
- Variable Shadowing in WebSocket Threads: Managing Python's asynchronous memory pools for the Live API WebSocket while juggling global API authorization scopes caused several internal loop crashes that required heavy debugging to isolate the offending variables.
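The quarantine-lock idea above can be sketched in a few lines of stdlib Python. The file names, marker suffix, and polling interval here are illustrative assumptions, not the project's exact code; the point is that the marker file is written only after the render fully finishes, so the serving side never exposes a half-written MP4.

```python
import time
from pathlib import Path

DONE_SUFFIX = "_done.txt"  # marker created only after FFmpeg has exited

def mark_render_complete(video_path: Path) -> None:
    """Called by the render pipeline *after* the MP4 is fully written."""
    video_path.with_name(video_path.stem + DONE_SUFFIX).write_text("ok")

def wait_for_video(video_path: Path, timeout: float = 60.0, poll: float = 0.25) -> bool:
    """Block until the quarantine marker exists, so we never serve a 0-byte file."""
    marker = video_path.with_name(video_path.stem + DONE_SUFFIX)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if marker.exists():
            return True
        time.sleep(poll)
    return False
```

Checking a separate marker instead of the MP4 itself sidesteps the race entirely: the video file existing says nothing about whether FFmpeg is done writing it.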
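The dimension-normalization step for the fallback cards can likewise be sketched with Pillow. The function name and target constant are our own; the idea is the one described: resize with Lanczos so the short edge hits 1024, then center-crop to a square so every frame matches the video builder's expectations.

```python
from PIL import Image

TARGET = 1024  # the video builder expects square 1024x1024 frames

def normalize_frame(img: Image.Image, size: int = TARGET) -> Image.Image:
    """Resize with Lanczos so the short edge equals `size`, then center-crop."""
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))
```

Scaling by the short edge before cropping guarantees both output dimensions are at least `size`, so the crop never pads or distorts the image.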
Accomplishments that we're proud of 🏆
- True Interleaved Output: We didn't just build a text bot with images glued on. The fact that the Gemini 2.5 engine outputs directorial commands (`[IMAGE_PROMPT]`) interwoven with text, and that the backend parses those mid-stream to trigger external rendering engines, feels incredibly powerful.
- The Zero-Click Interface: Achieving a workflow where a child (or adult) could theoretically play the entire game just by speaking into their microphone, watching a full video get synthesized dynamically, is magical.
What we learned 📚
- The `v1alpha` Gemini Live API is unbelievably fast, but designing the system prompt so the agent waits for user confirmation before firing a tool requires very strict behavioral constraints.
- Managing file I/O across different sub-processes (Python calling Node.js calling FFmpeg) is treacherous without robust state-locking mechanisms.
- The difference between `imagen-4.0-fast` and `ultra` is notable, but clever prompting and standardizing aspect ratios across the board can hide lower-tier flaws effectively.
What's next for Lunar Storyteller 🚀
- Persistent Character Memory: Implementing a vector database (like Pinecone) so Chubby Bunny remembers choices you made in entirely different storytelling sessions.
- Dynamic 3D Generation: Swapping the static 2D image generator for native `.glb` or `.gltf` 3D model generation APIs, allowing the user to rotate the generated scene instantly in the Three.js viewport instead of relying on a pre-rendered video.
- Multi-Agent Voice Cast: Hooking up `gemini-2.5-flash-native-audio-latest` to multiple voice profiles so that when different characters speak in the story, the text-to-speech engine dynamically switches audio profiles based on an assigned `[SPEAKER]` tag.
Built With
- antigravity
- docker
- fastapi
- ffcreatorlite
- ffmpeg
- gemini-2.0-flash-exp
- gemini-2.5-flash
- google-cloud-run
- imagen-4.0-fast
- imagen-4.0-ultra
- node.js
- python-3.11
- react
- tailwind-css
- three.js
- typescript
- uvicorn
- vite
- websockets