Inspiration

  • Children learn best through play, curiosity, and immediate feedback.
  • Existing educational apps are passive, screen-locked, and disconnected from the physical world. They reduce learning to tapping buttons on a screen.
  • This demands a fundamentally multimodal interaction: an AI that can see, hear, and speak simultaneously in real time. Text-based interfaces cannot deliver this.
  • The Gemini Live API enables a new paradigm: real-time bidirectional conversation with an AI that processes video frames, hears speech, and responds with streaming voice, all with sub-second latency over a single WebSocket connection.
  • Combined with MediaPipe's on-device hand tracking, a child's pointing finger becomes the input device, the physical world becomes the learning material, and an AI companion turns every discovery into personalized play.
  • Vision: what if a child could point at anything and have an AI friend explain it, play with it, and remember it? No typing. No tapping. Just point, speak, and learn.

What it does

  • The entire experience is live, multimodal, and context-aware with no text boxes, no menus, no typing.
  • A child points at real-world objects through the device camera.
  • Spark (AI companion powered by Gemini 2.5 Flash Native Audio) responds instantly with streaming voice through the Gemini Live API.
  • Discoveries are saved as "treasures" with AI-generated illustrations (Nano Banana 2).
  • Background content enrichment via Gemini 3 Flash with Google Search grounding provides age-appropriate facts sourced from trusted educational content rather than the model's parametric knowledge, avoiding hallucination.
  • Semantic similarity detection using Gemini Embedding 2 enables smart treasure deduplication and contextual play (a dedup sketch follows this list).
  • Treasures fuel personalized stories, I Spy games, find-something-new challenges, category sorting, building challenges, magic wand drawing, repeat-after-me, and recall quests. Each activity leverages the child's own discoveries.
  • Parent dashboard with session review, treasure gallery, story history, and content controls.
  • Memory consolidation across sessions so Spark remembers what the child discovered last time.
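
A minimal sketch of the dedup check, assuming a server-side @google/genai embedContent call; the model id, similarity threshold, and the isDuplicateTreasure helper are illustrative, not the app's exact code:

```ts
import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const SIMILARITY_THRESHOLD = 0.85; // assumed tuning value

// Embed a new treasure description and compare it against stored vectors;
// returns the id of a near-duplicate treasure, or null if it is genuinely new.
async function isDuplicateTreasure(
  description: string,
  existing: { id: string; vector: number[] }[],
): Promise<string | null> {
  const res = await ai.models.embedContent({
    model: 'gemini-embedding-001', // placeholder model id
    contents: description,
  });
  const vector = res.embeddings![0].values!;
  const match = existing.find(
    (t) => cosineSimilarity(vector, t.vector) >= SIMILARITY_THRESHOLD,
  );
  return match?.id ?? null;
}
```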

How we built it

  • Firebase and Google Cloud as the backbone: Firebase powers every layer of the backend, including Auth (email/password, Google, guest), Cloud Firestore (treasures, stories, sessions, user profiles), Cloud Storage (camera frames, focus crops, illustrations), Hosting (global CDN deployment), Analytics (session engagement tracking), and App Check with reCAPTCHA Enterprise (request verification and bot prevention). GCP Secret Manager secures the Gemini API key, keeping it out of client bundles entirely.
  • 5 Cloud Function endpoints (Node.js 22, TypeScript) form the server-side AI pipeline with independent scaling pools: mintLiveToken (ephemeral token minting), contentComposer (grounded follow-ups, idle prompts, story scaffolds, memory consolidation), contentMedia (illustration generation via Nano Banana 2, separate instance pool to prevent starving text requests), contentEmbed (multimodal embeddings), and playCooldown (play mode rate limiting).
  • All Gemini API calls use the official Google GenAI SDK (@google/genai), on the client for the Live API WebSocket connection and on the server (Cloud Functions) for Standard API and Embeddings calls.
  • 9 AI agent roles across 4 Gemini models: Gemini 2.5 Flash Native Audio handles real-time voice and vision through the Live API. Gemini 3 Flash powers the Content Composer with Google Search grounding for hallucination-free facts. Nano Banana 2 (Gemini 3.1 Flash Image) generates treasure illustrations. Gemini Embedding 2 produces multimodal embeddings for semantic similarity.
  • Direct browser-to-Gemini via ephemeral tokens: The browser connects directly to the Gemini Live API over WebSocket using short-lived ephemeral tokens minted by a Cloud Function. No proxy server sits between the child and Spark, minimizing latency. The liveConnectConstraints field locks the full Gemini config (system instruction, tools, safety settings, voice preset) server-side, preventing client-side tampering (sketched after this list).
  • Robust reconnection and error handling: goAway messages trigger proactive reconnection at 80% of the remaining token lifetime, background pre-minting refreshes tokens 5 minutes before expiry, and exponential backoff (max 5 attempts) handles transient failures, all invisible to the child (see the scheduling sketch below).
  • Gemini function calling in the Live API enables structured actions triggered from natural voice conversation: addToTreasure, startISpy, suggestActivity, celebrateDiscovery, askForRepoint, startStory, and nextStoryBeat (a declaration sketch follows this list).
  • Blackboard play queue (NOW/NEXT/LATER) orchestrates content delivery to solve the synchronization problem between real-time and non-real-time calls. Children need instant feedback (< 1 second), but enriched content takes 2-10 seconds. Immediate responses go to NOW, enriched content queues in NEXT with priority decay and diversity bonuses, and idle engagement fills LATER. When the child interrupts, stale items are cleared and the new interaction takes priority immediately (see the queue sketch below).
  • MediaPipe Gesture Recognizer runs locally in the browser for hand detection and fingertip tracking. Gemini handles all authoritative pointing classification via function calling. MediaPipe's role is deliberately limited to gating the frame streaming rate, not classifying what the child is pointing at.
  • Adaptive frame streaming driven by MediaPipe hand presence: MediaPipe acts as a local gatekeeper for the vision budget. When no hand is detected, the frame streamer sends background frames at ~0.2 fps (one every 5 seconds), just enough to keep Gemini's visual context fresh. When MediaPipe detects a hand, the rate jumps to ~1 fps with fingertip coordinates sent as text context. This 5x rate reduction during idle periods saves significant vision token budget (1 fps x 258 tokens/frame x 60s = 15,480 tokens/min vs ~3,096 at 0.2 fps), enabling longer sessions within the context window. The Live API's built-in contextWindowCompression is configured with a slidingWindow target of 12k tokens, so Gemini automatically summarizes and discards older context (earlier frames, audio, and conversation turns) once the window fills, keeping only the most recent and relevant context alive. This lets play sessions run indefinitely without hitting context limits (see the streaming sketch below).
  • Structured JSON output with responseSchema ensures reliable, typed responses from the Standard API, eliminating parsing failures (sketched below).
  • Deferred write pattern for crash resilience: Zustand + localStorage during play, batch persist to Firestore/Storage at session end. If the app crashes mid-session, the next load recovers from localStorage and retries the upload (sketched below).
  • 4-tier testing strategy: unit (Vitest), integration (real services), UI (Playwright, services disabled), E2E (Playwright, full stack). 1,500+ Vitest tests across 122 test files, plus 16 Playwright test files.
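
The sketches below pair with the bullets above. First, the ephemeral-token flow: a condensed version of mintLiveToken plus the browser connect. The model id, token lifetime, and the SPARK_SYSTEM_PROMPT / sparkTools names are assumptions; authTokens.create with liveConnectConstraints and live.connect are the @google/genai surfaces the bullets describe:

```ts
import { GoogleGenAI, Modality, type FunctionDeclaration } from '@google/genai';

declare const SPARK_SYSTEM_PROMPT: string;       // locked server-side
declare const sparkTools: FunctionDeclaration[]; // addToTreasure, startISpy, ...

// Server (Cloud Function): mint a single-use token with the full config baked in,
// so the client cannot tamper with the system instruction, tools, or safety settings.
export async function mintLiveToken(): Promise<string> {
  const server = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
  const token = await server.authTokens.create({
    config: {
      uses: 1,
      expireTime: new Date(Date.now() + 30 * 60_000).toISOString(), // assumed lifetime
      liveConnectConstraints: {
        model: 'gemini-2.5-flash-native-audio-preview', // placeholder model id
        config: {
          responseModalities: [Modality.AUDIO],
          systemInstruction: SPARK_SYSTEM_PROMPT,
          tools: [{ functionDeclarations: sparkTools }],
          // Sliding-window compression keeps long sessions inside the context window.
          contextWindowCompression: { slidingWindow: { targetTokens: '12000' } },
        },
      },
      httpOptions: { apiVersion: 'v1alpha' },
    },
  });
  return token.name!; // opaque handle the browser uses as its API key
}

// Client (browser): connect straight to the Live API with the ephemeral token.
export async function connectAsSpark(tokenName: string) {
  const client = new GoogleGenAI({
    apiKey: tokenName,
    httpOptions: { apiVersion: 'v1alpha' },
  });
  return client.live.connect({
    model: 'gemini-2.5-flash-native-audio-preview', // must match the constraint
    callbacks: {
      onmessage: (msg) => { /* audio chunks, tool calls, goAway, ... */ },
    },
  });
}
```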
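Next, the reconnection scheduling, using the constants from the bullet above (80% of the remaining window, a 5-minute pre-mint lead, 5 backoff attempts); mintToken and connectLive are hypothetical stand-ins for the app's session plumbing:

```ts
declare function mintToken(): Promise<string>;               // hypothetical helper
declare function connectLive(token: string): Promise<void>;  // hypothetical helper

const MAX_ATTEMPTS = 5;
const PREMINT_LEAD_MS = 5 * 60_000; // refresh tokens 5 minutes before expiry

// On a goAway message, reconnect proactively at 80% of the remaining window
// instead of waiting for the socket to drop.
function onGoAway(timeLeftMs: number) {
  setTimeout(() => void reconnect(1), timeLeftMs * 0.8);
}

// Background pre-minting so a fresh token is always ready before expiry.
function schedulePremint(tokenExpiresAt: number) {
  const delay = Math.max(0, tokenExpiresAt - Date.now() - PREMINT_LEAD_MS);
  setTimeout(() => void mintToken(), delay);
}

// Exponential backoff for transient failures: 1s, 2s, 4s, 8s, 16s.
async function reconnect(attempt: number): Promise<void> {
  try {
    await connectLive(await mintToken());
  } catch (err) {
    if (attempt >= MAX_ATTEMPTS) throw err;
    setTimeout(() => void reconnect(attempt + 1), 1000 * 2 ** (attempt - 1));
  }
}
```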
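How one of the voice-triggered tools might be declared and answered. The parameter schema and the dispatch helper are illustrative assumptions; the toolCall / sendToolResponse shapes follow the @google/genai Live API:

```ts
import {
  Type,
  type FunctionDeclaration,
  type LiveServerMessage,
  type Session,
} from '@google/genai';

declare const session: Session;                                  // the live connection
declare function dispatch(name: string, args: unknown): unknown; // app-side handler

// Declaration for one voice-triggered action; the schema is illustrative.
const addToTreasure: FunctionDeclaration = {
  name: 'addToTreasure',
  description: 'Save the object the child is pointing at as a treasure.',
  parameters: {
    type: Type.OBJECT,
    properties: {
      label: { type: Type.STRING, description: 'Short name for the object' },
      category: { type: Type.STRING, description: 'e.g. animal, plant, toy' },
    },
    required: ['label'],
  },
};

// Inside the Live onmessage callback: answer tool calls with typed results.
function handleServerMessage(msg: LiveServerMessage) {
  for (const call of msg.toolCall?.functionCalls ?? []) {
    const result = dispatch(call.name!, call.args);
    session.sendToolResponse({
      functionResponses: [{ id: call.id, name: call.name, response: { result } }],
    });
  }
}
```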
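A toy version of the blackboard's NOW/NEXT/LATER scheduling rules; the decay and bonus constants are made-up tuning values, not the app's real ones:

```ts
type Lane = 'NOW' | 'NEXT' | 'LATER';

interface PlayItem {
  id: string;
  lane: Lane;
  priority: number; // decays while the item waits
  topic: string;    // used for the diversity bonus
  createdAt: number;
}

class PlayQueue {
  private items: PlayItem[] = [];

  push(item: PlayItem) {
    this.items.push(item);
  }

  // Child interrupted: drop stale queued content so the new interaction wins.
  clearStale(now = Date.now(), maxAgeMs = 15_000) {
    this.items = this.items.filter(
      (i) => i.lane === 'NOW' || now - i.createdAt < maxAgeMs,
    );
  }

  next(recentTopics: string[]): PlayItem | undefined {
    const score = (i: PlayItem) => {
      const laneBoost = i.lane === 'NOW' ? 100 : i.lane === 'NEXT' ? 10 : 0;
      const ageDecay = ((Date.now() - i.createdAt) / 1000) * 0.1;
      const diversity = recentTopics.includes(i.topic) ? 0 : 5; // reward variety
      return laneBoost + i.priority - ageDecay + diversity;
    };
    this.items.sort((a, b) => score(b) - score(a));
    return this.items.shift();
  }
}
```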
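The MediaPipe-gated frame streamer, sketched with the 0.2 fps / 1 fps rates from the bullet above. The wasm URL, model asset path, and sendFrame helper are assumptions; the GestureRecognizer calls follow @mediapipe/tasks-vision:

```ts
import { FilesetResolver, GestureRecognizer } from '@mediapipe/tasks-vision';

const IDLE_INTERVAL_MS = 5000;   // ~0.2 fps when no hand is visible
const ACTIVE_INTERVAL_MS = 1000; // ~1 fps while a hand is detected

// sendFrame is a hypothetical helper that JPEG-encodes the frame and streams
// it to the Live session, optionally with the fingertip position as text.
declare function sendFrame(
  video: HTMLVideoElement,
  fingertip?: { x: number; y: number },
): void;

async function createRecognizer(): Promise<GestureRecognizer> {
  const vision = await FilesetResolver.forVisionTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@latest/wasm',
  );
  return GestureRecognizer.createFromOptions(vision, {
    baseOptions: { modelAssetPath: 'gesture_recognizer.task' }, // assumed asset path
    runningMode: 'VIDEO',
    numHands: 1,
  });
}

function streamAdaptively(video: HTMLVideoElement, recognizer: GestureRecognizer) {
  let lastSent = 0;
  const loop = () => {
    const result = recognizer.recognizeForVideo(video, performance.now());
    const hand = result.landmarks?.[0];
    const interval = hand ? ACTIVE_INTERVAL_MS : IDLE_INTERVAL_MS;
    if (performance.now() - lastSent >= interval) {
      // Landmark 8 is the index fingertip in MediaPipe's hand model.
      sendFrame(video, hand ? { x: hand[8].x, y: hand[8].y } : undefined);
      lastSent = performance.now();
    }
    requestAnimationFrame(loop);
  };
  requestAnimationFrame(loop);
}
```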
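A minimal structured-output call with responseSchema; the schema fields and model id are illustrative:

```ts
import { GoogleGenAI, Type } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Because the response is schema-constrained JSON, parsing cannot fail on
// free-form prose.
async function composeFact(topic: string) {
  const res = await ai.models.generateContent({
    model: 'gemini-flash-latest', // placeholder model id
    contents: `One friendly, true fact about ${topic} for a 6-year-old.`,
    config: {
      responseMimeType: 'application/json',
      responseSchema: {
        type: Type.OBJECT,
        properties: {
          fact: { type: Type.STRING },
          followUpQuestion: { type: Type.STRING },
        },
        required: ['fact'],
      },
    },
  });
  return JSON.parse(res.text!) as { fact: string; followUpQuestion?: string };
}
```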
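And the deferred-write shape, sketched with Zustand's persist middleware; persistToFirestore and the store fields are illustrative:

```ts
import { create } from 'zustand';
import { persist } from 'zustand/middleware';

interface Treasure { id: string; label: string; imagePath?: string }

interface SessionState {
  treasures: Treasure[];
  addTreasure: (t: Treasure) => void;
  clear: () => void;
}

// During play everything lives in the store, mirrored to localStorage by the
// persist middleware; nothing touches the network until the session ends.
export const useSession = create<SessionState>()(
  persist(
    (set) => ({
      treasures: [],
      addTreasure: (t) => set((s) => ({ treasures: [...s.treasures, t] })),
      clear: () => set({ treasures: [] }),
    }),
    { name: 'play-session' }, // localStorage key
  ),
);

// At session end (or on the next load after a crash), batch-persist and clear.
// persistToFirestore is a hypothetical uploader.
declare function persistToFirestore(treasures: Treasure[]): Promise<void>;

export async function endSession() {
  await persistToFirestore(useSession.getState().treasures);
  useSession.getState().clear();
}
```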

Challenges we ran into

  • Acoustic echo cancellation (AEC): Getting real-time bidirectional audio to work smoothly across different devices required a dual-routing audio architecture. An AudioWorklet routes Spark's output through a MediaStreamAudioDestinationNode to give the browser's AEC pipeline a reference signal, with an adaptive buffer (100-200 ms) and an optional mic-mute-during-playback fallback (see the sketch after this list).
  • Ephemeral token security: Working around the system instruction bug with liveConnectConstraints to lock the full Gemini config server-side, preventing client-side prompt injection while still enabling direct WebSocket connections.
  • Context window management: Long video + audio sessions can generate over 100k tokens. Solved with sliding window compression at 12k tokens plus a discovery context prefix to keep sessions running indefinitely.
  • Pointing detection for children: Small hands and unpredictable movements made traditional gesture recognition unreliable. Solved by using MediaPipe only for hand presence detection and letting Gemini handle all pointing classification through vision.
  • Balancing immediacy with enrichment: Children lose interest in seconds, but enriched content takes 2-10 seconds. The Blackboard NOW/NEXT/LATER architecture decouples immediate response from background processing entirely.
  • Frame budget optimization: Minimizing frame rate is critical for staying within the context window. The adaptive streaming approach based on MediaPipe hand detection was the breakthrough — dropping to 0.2 fps when idle and only ramping to 1 fps when a hand is detected.
  • Hallucination avoidance for children: A children's app cannot tolerate made-up facts. Google Search grounding on every content enrichment call ensures facts come from trusted educational sources, not the model's parametric knowledge (see the grounding sketch below).
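
A stripped-down sketch of the dual-routing AEC trick from the first challenge above (the worklet and adaptive buffering are omitted); the wiring uses standard Web Audio APIs:

```ts
// Route Spark's output through a MediaStreamAudioDestinationNode and play it
// via an <audio> element, so the browser's echo canceller treats it as a
// render path it can subtract from the mic signal.
async function initSparkAudio(sparkOutput: AudioNode, ctx: AudioContext) {
  const aecDestination = ctx.createMediaStreamDestination();
  sparkOutput.connect(aecDestination); // instead of ctx.destination

  const speaker = new Audio();
  speaker.srcObject = aecDestination.stream;
  await speaker.play();

  // Mic capture with echoCancellation on; the browser now sees both sides.
  return navigator.mediaDevices.getUserMedia({
    audio: { echoCancellation: true, noiseSuppression: true },
  });
}
```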
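And what a grounded enrichment call might look like; the model id and prompt are placeholders, while the googleSearch tool and groundingMetadata fields follow the @google/genai SDK:

```ts
import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Every enrichment call carries the googleSearch tool, so answers are built
// from retrieved sources rather than the model's parametric memory.
async function groundedFact(objectLabel: string): Promise<string> {
  const res = await ai.models.generateContent({
    model: 'gemini-flash-latest', // placeholder model id
    contents: `Two fun, true facts about "${objectLabel}" for a 6-year-old.`,
    config: { tools: [{ googleSearch: {} }] },
  });
  const sources = res.candidates?.[0]?.groundingMetadata?.groundingChunks ?? [];
  console.log(`Grounded in ${sources.length} sources`);
  return res.text ?? '';
}
```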

Accomplishments that we're proud of

  • Fully real-time multimodal experience: point at something and get a voice response in under 1 second, with no proxy server in the path.
  • Direct browser-to-Gemini architecture with ephemeral tokens and locked configuration that is secure, low-latency, and serverless.
  • 9 AI agent roles working together seamlessly across 4 Gemini models, orchestrated through the Blackboard.
  • Comprehensive 5-layer safety architecture protecting children: App Check, Gemini safety filters, client-side regex, fallback replacement, and age-appropriate grounding.
  • Adaptive vision budget that reduces token consumption by 5x during idle periods, enabling longer play sessions.
  • Photo-to-illustration pipeline with Nano Banana 2 transforms camera captures into child-friendly artwork.
  • Deferred write pattern with crash recovery ensures no discoveries are lost.
  • Memory consolidation so Spark remembers discoveries across sessions and builds on them.
  • 1,500+ passing Vitest tests with a 4-tier testing strategy.

What we learned

  • Gemini's multimodal Live API enables genuinely new interaction paradigms that transcend text-based interfaces. Simultaneous vision + voice + function calling in a single WebSocket makes real-time AI companions feasible in a browser.
  • Ephemeral tokens with liveConnectConstraints are a powerful mechanism for securing direct browser-to-API connections without a proxy, closing off client-side prompt injection while preserving the low-latency direct path.
  • Function calling in the Live API bridges natural conversation and structured actions. A child says "add this to my treasures" and Gemini triggers a typed function call, no parsing required.
  • Google Search grounding transforms AI from creative to authoritative. For a children's app, the difference between "butterflies might live for a few weeks" and "monarch butterflies migrate up to 3,000 miles" (sourced from educational content) is the difference between toy and tool.
  • MediaPipe + Gemini is a powerful local/cloud AI combination: on-device detection speed (30 fps hand tracking) paired with cloud AI intelligence (scene-level pointing classification) gives the best of both worlds.
  • The Blackboard pattern elegantly solves the real-time/async synchronization problem. Decoupling immediate response from background enrichment avoids both stale content and response delays.
  • Separating Cloud Function instance pools by workload type prevents resource contention. Heavy image generation on a shared pool would spike P99 latency for lightweight text requests.

What's next for Point. Play. Learn.

  • Multi-language support leveraging Gemini's native multilingual capabilities.
  • Collaborative play mode for siblings discovering together.
  • Advanced gesture interactions (drawing in the air, counting with fingers).
  • Deeper story mode with chapter progression, branching narratives, and illustrations, all woven together with educational content.
  • Parent insights powered by session analytics and learning trajectory tracking.
  • Outdoor exploration mode with GPS-aware context for parks, museums, and nature walks.
  • Native mobile wrappers for App Store and Play Store distribution.
