Questory — Gemini Live Agent Challenge Submission


Inspiration

We kept coming back to a simple frustration: kids learn best when they are inside the story, not just reading it. A textbook can say "the Amazon rainforest has four layers" — but what if a child could speak that fact back to a living AI guide, watch a comic panel illustrate it in real time, and then answer for it before the next chapter unlocks? That is the gap Questory was built to close.

We were also drawn to the raw potential of the Gemini Live API. The ability to hold a real, interruptible voice conversation while simultaneously generating images felt like a genuine leap — not a feature, but a new interaction paradigm. We wanted to push that paradigm into the hands of a six-year-old.


What it does

Questory is an AI-powered interactive comic adventure for children. A child talks to Dora, a 3D animated voice guide, and tells her what they want to learn — by speaking, typing, pasting a YouTube URL, or uploading a PDF. From there:

  1. Dora proposes three heroes themed around the topic, and AI-generated portraits for all three stream in simultaneously.
  2. The child picks a hero. While this happens, the first six to ten comic panels are silently pre-generated in the background so there is zero wait when the story begins.
  3. The story unfolds panel by panel. Dora narrates live over audio, a fresh AI-illustrated panel appears for each scene, and a quiz interrupts the flow to test what was just taught.
  4. The child's spoken answers shape the story. Correct answers unlock the next chapter; wrong answers prompt Dora to gently revisit the concept before continuing.
  5. Completed stories are saved to a personal library the child can revisit.

The entire experience is multimodal: real-time bidirectional audio, live AI image generation, and natural voice interaction — no tap-through menus, no walls of text.


How we built it

Voice and Intelligence — Gemini Live API

The heart of Questory is a bidirectional WebSocket proxy (gemini_live.py) that streams raw PCM audio between the browser and gemini-2.5-flash-native-audio-preview. Dora lives entirely in this audio stream — she can be interrupted mid-sentence, ask follow-up questions, and adapt to the child's spoken responses in real time. She uses function calling to signal the backend when to generate heroes, start the comic builder, or present a quiz — keeping all orchestration server-side and invisible to the user.
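One small but essential step in that proxy is audio format conversion: browsers capture microphone audio as 32-bit floats, while the upstream session expects 16-bit PCM. A minimal sketch of that conversion (the function name is illustrative, not taken from gemini_live.py):

```python
import struct

def float32_to_pcm16(samples: list[float]) -> bytes:
    """Clamp float samples to [-1, 1] and pack as little-endian 16-bit PCM."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))   # clamp to the valid float range
        ints.append(int(s * 32767))  # scale to the int16 range
    return struct.pack(f"<{len(ints)}h", *ints)
```

The proxy forwards these bytes upstream as they arrive and streams the model's audio back over the same WebSocket, which is what makes mid-sentence interruption possible.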

Image Generation — Gemini Image Model

Hero portraits and comic panels are generated using gemini-3.1-flash-image-preview via the Google GenAI SDK. Hero images are fired in parallel so all three portraits render simultaneously. Panel art is generated on demand with a prompt that captures the current scene, the chosen hero, and the art style.
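The parallel firing amounts to gathering all three requests at once rather than awaiting them in sequence. A sketch of that orchestration, with a stand-in generator in place of the real GenAI SDK call:

```python
import asyncio

async def generate_hero_portraits(prompts, generate_image):
    """Fire all portrait requests concurrently; results keep prompt order."""
    return await asyncio.gather(*(generate_image(p) for p in prompts))

# Stand-in for the real image call, just to show the flow:
async def _fake_generate(prompt):
    await asyncio.sleep(0)
    return f"portrait:{prompt}"

heroes = asyncio.run(generate_hero_portraits(
    ["jungle explorer", "river botanist", "canopy pilot"], _fake_generate))
```

Because `asyncio.gather` preserves input order, each portrait can be matched back to its hero card as it renders.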

Story Pre-generation — Flash Lite Headstart

When a hero is selected, story_headstart.py immediately kicks off an async job using gemini-3.1-flash-lite-preview to pre-write and pre-generate the first block of panels. By the time the child reaches the story page, content is already waiting — eliminating the cold-start delay that would otherwise break the experience.
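The fire-and-forget shape of that job can be sketched as a background task appending panels into a shared store as they finish (names and store layout here are illustrative, not lifted from story_headstart.py):

```python
import asyncio

HEADSTART: dict[str, list[str]] = {}  # session_id -> finished panels

async def _pregenerate(session_id: str, scenes: list[str], render_panel) -> None:
    """Append panels as they finish so the story page can consume early."""
    HEADSTART[session_id] = []
    for scene in scenes:
        HEADSTART[session_id].append(await render_panel(scene))

def start_headstart(session_id: str, scenes: list[str], render_panel):
    """Fire-and-forget: runs while the child is still picking a hero."""
    return asyncio.create_task(_pregenerate(session_id, scenes, render_panel))

async def _demo():
    async def render(scene):          # stand-in for the real pipeline
        return f"panel:{scene}"
    task = start_headstart("s1", ["arrival", "first clue"], render)
    await task                        # in production nothing awaits this

asyncio.run(_demo())
```

Appending per panel rather than storing the whole block at once means the story page can start showing art before the last panel is done.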

Comic Builder Agent

comic_builder.py is a second Gemini Live session that takes over after setup. It orchestrates the panel delivery loop: calling present_prebuilt_panel for pre-generated content, add_comic_panel for live generation, and ask_quiz at pedagogically appropriate moments. Quiz answers flow back as text tool results, and Dora's audio response adapts accordingly.
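The delivery loop is essentially a dispatch table from tool-call names to handlers. A sketch using the tool names above (handler signatures and return shapes are assumptions, not the real ones):

```python
import asyncio

async def present_prebuilt_panel(args: dict) -> dict:
    return {"status": "shown", "index": args["index"]}

async def add_comic_panel(args: dict) -> dict:
    return {"status": "generated", "scene": args["scene"]}

async def ask_quiz(args: dict) -> dict:
    return {"status": "asked", "question": args["question"]}

TOOL_HANDLERS = {
    "present_prebuilt_panel": present_prebuilt_panel,
    "add_comic_panel": add_comic_panel,
    "ask_quiz": ask_quiz,
}

async def dispatch(tool_call: dict) -> dict:
    """Route a model tool call to its handler; the result goes back as text."""
    return await TOOL_HANDLERS[tool_call["name"]](tool_call.get("args", {}))

result = asyncio.run(dispatch({"name": "ask_quiz",
                               "args": {"question": "How many rainforest layers?"}}))
```

Keeping the handlers behind one dispatch point is what lets the model, rather than a hard-coded UI flow, decide when a quiz or a new panel happens.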

Frontend — React and Three.js

The frontend is built with React 18, Vite, and TypeScript. Dora's avatar is rendered in Three.js via @react-three/fiber — a circular textured plane that swaps between a listening and speaking texture in sync with real-time audio volume from a Web Audio API AnalyserNode. Sparkles, ambient glow, and subtle scale-pulse give her a magical, alive quality. Background scenery cycles through AI-generated environments.
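Raw AnalyserNode readings jitter from frame to frame, so the volume signal is smoothed before it drives texture swaps and scale pulses. The real code runs in TypeScript; the smoothing logic itself is language-agnostic and is sketched here in Python:

```python
def smooth_volume(raw_levels: list[float], alpha: float = 0.2) -> list[float]:
    """Exponential moving average over per-frame volume levels.

    `alpha` near 1 tracks the audio tightly; near 0 gives a calmer pulse.
    Smoothing prevents the avatar from flickering on noisy frames.
    """
    smoothed, level = [], 0.0
    for v in raw_levels:
        level = alpha * v + (1 - alpha) * level
        smoothed.append(level)
    return smoothed
```

The particular `alpha` is a tuning knob, not a value from the Questory source; too low and the mouth lags the voice, too high and it jitters.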

Cloud Deployment

The FastAPI backend is containerized with Docker and deployed to Google Cloud Run, giving us fully managed, scalable serving with automatic HTTPS — critical for WebSocket reliability.


Challenges we ran into

Latency is the enemy of magic. Young children have zero patience for loading spinners in the middle of a story. We had to architect the headstart pre-generation system carefully — kicking it off the moment a hero is selected and surfacing pre-built panels before any live-generated ones.

WebSocket session handoff. Transferring context from the setup agent to the comic builder agent without the child noticing a seam required careful session state management and a shared session store that survives across two WebSocket connections.
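The shared store can be as simple as a module-level dict keyed by a session id that both connections carry (field names here are illustrative):

```python
import time

SESSIONS: dict[str, dict] = {}   # keyed by an id both WebSockets share

def save_session(session_id: str, **state) -> None:
    """Called by the setup handler as topic, hero, and style are decided."""
    SESSIONS.setdefault(session_id, {"created": time.time()}).update(state)

def load_session(session_id: str) -> dict:
    """Called by the comic-builder handler when its WebSocket connects."""
    return SESSIONS.get(session_id, {})
```

An in-process dict only works while both connections land on the same instance; under Cloud Run scale-out this state would need to move to an external store such as Redis or Firestore.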

Real-time lip-sync. Making Dora feel alive required driving Three.js animations at 60fps from a Web Audio API AnalyserNode, with texture swaps and scale pulses tuned to feel natural rather than mechanical.

Keeping Dora on-script. A live language model in conversation with a child has a natural tendency to go broad. We spent significant time on system prompt design — teaching Dora appropriate brevity, keeping her focused on the educational objective, and ensuring quizzes loop back to the content rather than accepting tangential answers.

Image model constraints. Working with a preview image model meant managing rate limits and handling occasional generation failures gracefully — retrying silently so errors never surface mid-story.
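The silent-retry pattern amounts to jittered exponential backoff that only raises once every attempt is exhausted. A sketch (function name and defaults are illustrative):

```python
import asyncio
import random

async def generate_with_retry(generate, prompt, attempts=3, base_delay=0.5):
    """Retry flaky image generation with jittered exponential backoff.

    Failures are swallowed until the final attempt, so transient errors
    never surface mid-story; only a fully exhausted retry raises.
    """
    for i in range(attempts):
        try:
            return await generate(prompt)
        except Exception:
            if i == attempts - 1:
                raise
            # Jitter spreads retries out so they don't re-hit a rate limit together.
            await asyncio.sleep(base_delay * 2 ** i + random.random() * 0.1)
```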


Accomplishments that we're proud of

  • A fully interruptible AI storyteller that a child can actually talk to. Not a chatbot with a text box — a warm, animated guide you speak to naturally.
  • Sub-second panel transitions. The headstart system means the first block of panels is ready before the child even reaches the story page.
  • An end-to-end multimodal pipeline in a single session: voice input, AI comprehension, tool-called image generation, voice narration, quiz assessment, and adaptive story continuation — orchestrated in real time with audio never cutting out.
  • A 3D avatar driven entirely by live audio data. Dora's lip-sync and pulse animations use no pre-recorded animation — every motion is a live function of the actual audio volume stream.
  • Multimodal input for topic selection. Children, or their parents, can speak a topic, type it, paste a YouTube video URL, or upload a PDF. Dora understands all four and builds a coherent story concept from any of them.

What we learned

  • The Gemini Live API's function-calling mechanism is a powerful orchestration primitive. Letting the model decide when to trigger a hero proposal or panel generation — rather than hard-coding a UI flow — made the experience feel genuinely conversational rather than scripted.
  • Audio latency compounds in browser contexts. PCM streaming, Web Audio buffering, and WebSocket round-trips each add delay that is imperceptible in isolation but noticeable in aggregate. Buffer sizes required aggressive tuning.
  • Prompt engineering for children's content is its own discipline. The vocabulary, pacing, and encouragement style Dora uses required many iterations to feel age-appropriate without being condescending.
  • Parallel async generation changes the UX calculus entirely. Firing three hero portrait requests simultaneously instead of sequentially turned a potential 15-second wait into a staggered 5-second reveal that felt exciting rather than slow.
  • Cloud Run and WebSockets work well together, but require explicit timeout configuration to keep long-lived connections alive.
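The buffer-latency point above is easy to quantify: each buffer in the chain contributes its length divided by the sample rate. A back-of-envelope sketch (buffer sizes and the 16 kHz capture / 24 kHz playback rates are assumptions, not measured Questory values):

```python
def buffer_latency_ms(buffer_samples: int, sample_rate_hz: int) -> float:
    """Delay contributed by one audio buffer, in milliseconds."""
    return buffer_samples / sample_rate_hz * 1000

capture_ms  = buffer_latency_ms(2048, 16_000)   # 128 ms at the mic
playback_ms = buffer_latency_ms(4096, 24_000)   # ~171 ms at playback
```

Two buffers that each feel instant already sum to roughly 300 ms before any network round-trip, which is why shrinking them mattered so much.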

What's next for Questory

  • Persistent profiles and progress tracking — so children build a library of adventures and Dora remembers them between sessions.
  • Parent and teacher dashboards — showing topics explored, quiz scores, and knowledge gaps flagged for follow-up.
  • Curriculum alignment — mapping story topics to Common Core standards so teachers can assign Questory adventures as supplementary material.
  • PDF and video comprehension — deepening multimodal input so a child can upload a school textbook chapter and Questory builds an adventure directly from that content.
  • Multiplayer adventures — two children in the same story, taking turns answering quizzes and shaping the narrative together.
  • Voice-customized characters — letting children choose their hero's voice to make the story feel truly personal.

Built With

  • Gemini Live API and Gemini image models via the Google GenAI SDK
  • FastAPI (Python backend)
  • React 18, TypeScript, and Vite
  • Three.js via @react-three/fiber
  • Docker and Google Cloud Run
