Inspiration ✨
Stories are the oldest form of human connection, yet the way we consume them digitally has remained largely static. We were inspired by the magic of childhood pop-up books and oral storytelling—where every page turn is a surprise and the narrator reacts to your gasps and choices. With the announcement of The Gemini Live Agent Challenge, we realized we had the exact technological missing link to make this a reality: the ability to process real-time voice and interleaved multimodal outputs. We wanted to build an AI that doesn't just "chat," but directs a living, breathing cinematic universe.
What it does 🎬
Lunar Storyteller is a next-generation interactive storybook and creative AI director. It seamlessly weaves together a fluid, mixed-media experience. You don't type; you talk directly to the Chubby Bunny, our low-latency Voice Agent who acts as your energetic Game Master. Once you verbally agree on a narrative path, the system generates a rich, interleaved stream of:
- Bilingual, emotive text narration.
- Generative 3D-styled illustrations visually representing the exact scene.
- Cinematic video renders stitched on-the-fly from the generated assets, complete with spatial auto-generated background music. It is an educational and interactive journey tailored entirely by the user's voice.
How we built it 🏗️
We designed a modern Dual-Agent Routing Architecture hosted on Google Cloud Run to satisfy the interleaved output requirements:
- The Receptionist (Live Audio WebSocket): We used the `gemini-2.5-flash-native-audio-latest` model via the `v1alpha` Live API. This agent handles the bidirectional audio stream. It uses Function Calling to evaluate when the user has made a conclusive choice, triggering the `commit_story_step` tool without breaking conversation.
- The Studio Engine (REST API): The backend catches the tool's payload and routes it to `gemini-2.5-flash`. We provided Gemini with a custom Domain-Specific Language (tags) and passed the historical images as base64 context.
- The Interleaved Mixed Output: The engine computes the narrative and outputs a structured formatting block: `[TEXT]`, `[IMAGE_PROMPT]`, `[CHOICE]`, and `[VIDEO_STORYBOARD]`.
- Asynchronous Video Assembly: Our Python backend parses the `[VIDEO_STORYBOARD]` tag and fires off an isolated Node.js thread using `FFCreatorLite` and FFmpeg to render a 16:9 MP4 from the Imagen 4.0-generated frames, injecting an audio track, all while the user is already reading the text on the React/Three.js frontend.
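To illustrate the tag-based DSL described above, here is a minimal parsing sketch. The tag names come from the write-up; the regex and the `parse_story_block` function are our own illustration, not the project's actual code.

```python
import re

# The four directorial tags emitted by the Studio Engine.
TAGS = ("TEXT", "IMAGE_PROMPT", "CHOICE", "VIDEO_STORYBOARD")

# Matches e.g. "[IMAGE_PROMPT] a bunny on the moon" up to the next tag or end of text.
TAG_RE = re.compile(
    r"\[(" + "|".join(TAGS) + r")\]\s*(.*?)(?=\[(?:" + "|".join(TAGS) + r")\]|\Z)",
    re.DOTALL,
)

def parse_story_block(raw: str) -> list[tuple[str, str]]:
    """Split one model response into ordered (tag, payload) segments."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_RE.finditer(raw)]

segments = parse_story_block(
    "[TEXT] The bunny hops onto the moon. "
    "[IMAGE_PROMPT] 3D-styled bunny on a glowing moon "
    "[CHOICE] Explore the crater | Build a rocket"
)
# segments preserves the interleaved order, so the backend can route each
# payload to the right consumer (text -> UI, image prompt -> renderer, ...).
```

Keeping the segments ordered is what lets the backend trigger image and video rendering mid-stream while the text is already being displayed.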
To formalize the routing logic we built, the probability of rendering a coherent mixed-media narrative arc $M_t$ at step $t$ can be expressed with the following mathematical factorization:
$$ P(M_t) = P_{Live}(s_t \mid A_t, \theta_{Gemini2.0}) \times P_{Studio}(T_t, I_t, V_t \mid s_t, I_{<t}, \theta_{Gemini2.5}) $$
Where $A_t$ is the user's streaming audio, $s_t$ is the committed story state, and $(T_t, I_t, V_t)$ represent the interleaved text, image prompts, and video storyboard instructions derived from the historical visual context $I_{<t}$.
Challenges we ran into 🧩
- The "Silent Video" Rendering Race Condition: Rendering video takes time. Initially, our frontend would try to fetch the `.mp4` the exact millisecond the Node.js process created the file shell, resulting in an unplayable 0-byte video block. We solved this by using Python `subprocess` monitoring, implementing an artificial `_done.txt` quarantine lock, and running an automatic FFmpeg `-movflags faststart` pass so the video could stream instantly on Chrome/Safari.
- Quota Exhaustion Management: Processing ultra-high-quality images with the `imagen-4.0-ultra` model quickly hit our hackathon-tier quota limits. We had to build a fallback interception script that uses PIL (Pillow) to grab lower-res pre-rendered fallback emotion cards, then resize and center-crop them to exactly $1024 \times 1024$ pixels via Lanczos resampling so the video builder wouldn't crash on dimension mismatches.
- Variable Shadowing in WebSocket Threads: Managing Python's asynchronous memory pools for the Live API WebSocket while juggling global API authorization scopes caused several internal loop crashes that required heavy debugging to isolate the offending variables.
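The quarantine-lock idea above can be sketched in a few lines of stdlib Python. The file names, marker suffix, and polling interval here are illustrative assumptions, not the project's exact code; the point is that the marker file is written only after the render fully finishes, so the serving side never exposes a half-written MP4.

```python
import time
from pathlib import Path

DONE_SUFFIX = "_done.txt"  # marker created only after FFmpeg has exited

def mark_render_complete(video_path: Path) -> None:
    """Called by the render pipeline *after* the MP4 is fully written."""
    video_path.with_name(video_path.stem + DONE_SUFFIX).write_text("ok")

def wait_for_video(video_path: Path, timeout: float = 60.0, poll: float = 0.25) -> bool:
    """Block until the quarantine marker exists, so we never serve a 0-byte file."""
    marker = video_path.with_name(video_path.stem + DONE_SUFFIX)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if marker.exists():
            return True
        time.sleep(poll)
    return False
```

Checking a separate marker instead of the MP4 itself sidesteps the race entirely: the video file existing says nothing about whether FFmpeg is done writing it.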
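The dimension-normalization step for the fallback cards can likewise be sketched with Pillow. The function name and target constant are our own; the idea is the one described: resize with Lanczos so the short edge hits 1024, then center-crop to a square so every frame matches the video builder's expectations.

```python
from PIL import Image

TARGET = 1024  # the video builder expects square 1024x1024 frames

def normalize_frame(img: Image.Image, size: int = TARGET) -> Image.Image:
    """Resize with Lanczos so the short edge equals `size`, then center-crop."""
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))
```

Scaling by the short edge before cropping guarantees both output dimensions are at least `size`, so the crop never pads or distorts the image.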
Accomplishments that we're proud of 🏆
- True Interleaved Output: We didn't just build a text bot with images glued on. The fact that the Gemini 2.5 engine outputs directorial commands (`[IMAGE_PROMPT]`) interwoven with text, and that the backend parses those mid-stream to trigger external rendering engines, feels incredibly powerful.
- The Zero-Click Interface: Achieving a workflow where a child (or adult) could theoretically play the entire game just by speaking into their microphone, watching a full video get synthesized dynamically, is magical.
What we learned 📚
- The `v1alpha` Gemini Live API is unbelievably fast, but designing the system prompt so the agent waits for user confirmation before firing a tool requires very strict behavioral constraints.
- Managing file I/O across different sub-processes (Python calling Node.js calling FFmpeg) is treacherous without robust state-locking mechanisms.
- The difference between `imagen-4.0-fast` and `ultra` is notable, but clever prompting and standardizing aspect ratios across the board can hide lower-tier flaws effectively.
What's next for Lunar Storyteller 🚀
- Persistent Character Memory: Implementing a vector database (like Pinecone) so Chubby Bunny remembers choices you made in entirely different storytelling sessions.
- Dynamic 3D Generation: Swapping the static 2D image generator for native `.glb` or `.gltf` 3D model generation APIs, allowing the user to rotate the generated scene instantly in the Three.js viewport instead of relying on a pre-rendered video.
- Multi-Agent Voice Cast: Hooking up `gemini-2.5-flash-native-audio-latest` to multiple voice profiles so that when different characters speak in the story, the text-to-speech engine dynamically switches audio profiles based on an assigned `[SPEAKER]` tag.
Built With
- antigravity
- docker
- fastapi
- ffcreatorlite
- ffmpeg
- gemini-2.0-flash-exp
- gemini-2.5-flash
- google-cloud-run
- imagen-4.0-fast
- imagen-4.0-ultra
- node.js
- python-3.11
- react
- tailwind-css
- three.js
- typescript
- uvicorn
- vite
- websockets