Inspiration

Timelines don't stick. I've watched people memorize dates and forget them a week later. I wanted to see what happens when you actually talk to the people who lived it.

The phone call felt right. It's the oldest real-time voice interface. You dial, someone picks up, you talk. I built on that.

What it does

You type a topic - or speak it, or point your camera at a textbook. Flash figures out what you're studying and offers you people who were there. You call one, and they answer.

Constantine XI tells you the walls are falling and wants your advice. Gene Kranz has 25 seconds of fuel. Jamukha is deciding whether to fight Genghis Khan or walk away. At some point, the character gives you 2-3 choices as tappable cards. You pick. They react with what actually happened.

When you hang up, you get a call log — key facts, what happened after, and a farewell message from the character. No quizzes. No wrong answers.

How we built it

Three Gemini models work together per session:

  1. Flash (gemini-3-flash-preview) resolves the topic into a character, setting, stakes, color palette, and voice. Returns structured JSON.
  2. Image (gemini-3.1-flash-image-preview) generates the scene banner and character portrait at preview time, so both are cached before the call starts. When Live calls show_scene mid-conversation, the image is already there and displays with no generation wait.
  3. Live (gemini-2.5-flash-native-audio-preview-12-2025) runs the actual voice session. Affective dialog, function calling, context window compression. The character speaks, you interrupt, they stop and respond.

Post-call, Flash turns the transcript into key facts and a farewell.
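
The pre-generation step in item 2 could look roughly like this. It is a minimal sketch, not the project's code: SceneImageCache, GenerateImage, and sceneId are hypothetical names, and the generator is passed in so the cache does not depend on any particular image API.

```typescript
// Hypothetical names throughout; this is an illustration, not the project's implementation.
type GenerateImage = (prompt: string) => Promise<string>; // resolves to an image URL

class SceneImageCache {
  private readonly jobs = new Map<string, Promise<string>>();
  private readonly generate: GenerateImage;

  constructor(generate: GenerateImage) {
    this.generate = generate;
  }

  // Preview time: start generation once per scene and keep the in-flight promise.
  prefetch(sceneId: string, prompt: string): Promise<string> {
    const existing = this.jobs.get(sceneId);
    if (existing) return existing;

    const job = this.generate(prompt);
    this.jobs.set(sceneId, job);
    // If generation fails, forget the job so a later prefetch can retry.
    job.catch(() => {
      if (this.jobs.get(sceneId) === job) this.jobs.delete(sceneId);
    });
    return job;
  }

  // Call time (for example inside a show_scene handler): never starts a new generation.
  lookup(sceneId: string): Promise<string> | undefined {
    return this.jobs.get(sceneId);
  }
}
```

Storing the promise rather than the finished image means a show_scene call that arrives while generation is still running waits on the same job instead of starting a second one.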

Backend is Hono on Cloud Run - a WebSocket relay between the browser and Gemini Live. Firestore stores student profiles and session history. Returning students get recognized in-character: "Back again? Last time you let the harbor fall."

Frontend is Astro 5 + Svelte 5 on Cloudflare Workers. The call UI looks like a phone - timer counting up, live transcript, mute/speaker/hangup.

Challenges we ran into

Tool calling with native audio crashes the session. It's tracked in GitHub issue #843: 43+ reactions, open since May 2025. I removed googleSearch entirely after it kept killing sessions. Rule I landed on: one tool per turn, keep the set small.
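
The write-up doesn't say whether the one-tool-per-turn rule lives in the prompt or in the relay. If it were enforced on the relay, a guard might look like this sketch. The FunctionCall shape and planToolCalls are assumptions. The tool names are the ones mentioned elsewhere in this write-up.

```typescript
// Hypothetical relay-side guard; the message shape (id, name, args) is an assumption.
interface FunctionCall {
  id: string;
  name: string;
  args?: Record<string, unknown>;
}

// Tool names taken from this write-up; the real set may differ.
const ALLOWED_TOOLS: ReadonlySet<string> = new Set([
  "show_scene",
  "switch_speaker",
  "end_session",
]);

interface TurnPlan {
  run: FunctionCall | null; // the single call to execute this turn
  skipped: FunctionCall[]; // everything else, to be answered without executing
}

// Keep the first allowed call in a turn and skip the rest.
function planToolCalls(calls: FunctionCall[]): TurnPlan {
  const run = calls.find((c) => ALLOWED_TOOLS.has(c.name)) ?? null;
  const skipped = calls.filter((c) => c !== run);
  return { run, skipped };
}
```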

Image generation latency is a queue problem, not a prompt problem. I ran a 5×5 benchmark: 5 prompt styles, 5 runs each. Latency swung by 12-15 seconds between runs of the same prompt, but by less than 1 second between prompt styles. That points to GPU queue position as the bottleneck. Pre-generating at preview hides this completely.
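
The comparison behind that conclusion can be sketched as below, taking the run-to-run and prompt-to-prompt figures as max-minus-min spread (an assumption about how they were measured). The function names are hypothetical, and no real benchmark data is reproduced here.

```typescript
// Prompt style -> latencies in seconds, one entry per run.
type Benchmark = Record<string, number[]>;

function spread(xs: number[]): number {
  return Math.max(...xs) - Math.min(...xs);
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

function summarize(bench: Benchmark): { acrossRuns: number; acrossPrompts: number } {
  const runs = Object.values(bench);
  return {
    // Worst-case spread between runs of the same prompt style.
    acrossRuns: Math.max(...runs.map(spread)),
    // Spread between the average latencies of the prompt styles.
    acrossPrompts: spread(runs.map(mean)),
  };
}
```

If acrossRuns dwarfs acrossPrompts, rewording the prompt can't buy back much latency, which is the case for pre-generating instead.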

sessionResumption.transparent doesn't exist. Passing it crashes the connection. Took me a while to figure out — it's not in the docs, but examples online include it.

System prompt order matters more than wording. Persona first, rules second, guardrails last. If you define the persona last, earlier instructions override it. Google's docs mention that "unmistakably" outperforms MUST/NEVER/ALWAYS for voice models. That checked out.
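
One way to keep that order from drifting is to assemble the prompt in code instead of by hand. A minimal sketch, where PromptParts, buildSystemPrompt, and the section labels are all hypothetical:

```typescript
// Hypothetical structure; the section labels are illustrative, not the project's wording.
interface PromptParts {
  persona: string;
  rules: string[];
  guardrails: string[];
}

function bulletList(items: string[]): string {
  return items.map((item) => `- ${item}`).join("\n");
}

// Always emits persona first, rules second, guardrails last.
function buildSystemPrompt(parts: PromptParts): string {
  return [
    parts.persona.trim(),
    "Rules:\n" + bulletList(parts.rules),
    "Guardrails:\n" + bulletList(parts.guardrails),
  ].join("\n\n");
}
```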

Audio + video sessions cap at about 2 minutes. Audio-only runs to 15. So I limited camera input to textbook scanning: no video during calls.

Accomplishments that we're proud of

The interruption works. You talk while the character is speaking, the audio queue clears, they stop mid-sentence and respond to what you just said. That's the whole point of a phone call.

"No wrong answers" turned out to be the highest-impact line. I tested it against 6 user personas (ages 13-42). One — an exchange student from a shame-avoidant culture — read it three times before proceeding.

Every word comes from the character. No narrator voice. switch_speaker brings in other characters, not a system voice.

Emotional boundaries are baked into the system prompt: "You existed before this call and will continue after. End every call with a positive observation. Never make the student feel guilty for hanging up."

Blocked callers stay in metaphor. Request a perpetrator and you get "This number is not in service" with 3 alternative witnesses from the same event.

What we learned

v1alpha is required for enableAffectiveDialog. Standard API versions reject it silently — no error, just no emotional tone.
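
Because the failure is silent, a local check can make it loud. The sketch below uses plain objects whose shapes follow my understanding of the @google/genai client options (httpOptions.apiVersion) and Live config (enableAffectiveDialog). Treat those field names as assumptions to verify against the SDK version in use. checkAffectiveDialog is a hypothetical helper.

```typescript
// Shapes are assumptions modeled on the @google/genai SDK; verify before relying on them.
interface ClientOptions {
  apiKey: string;
  httpOptions: { apiVersion: string };
}

interface LiveConfig {
  responseModalities: string[];
  enableAffectiveDialog?: boolean;
}

// Throws locally instead of letting the API drop the setting without an error.
function checkAffectiveDialog(client: ClientOptions, config: LiveConfig): void {
  if (config.enableAffectiveDialog && client.httpOptions.apiVersion !== "v1alpha") {
    throw new Error("enableAffectiveDialog needs apiVersion v1alpha");
  }
}

const clientOptions: ClientOptions = {
  apiKey: "YOUR_API_KEY", // placeholder
  httpOptions: { apiVersion: "v1alpha" },
};

const liveConfig: LiveConfig = {
  responseModalities: ["AUDIO"],
  enableAffectiveDialog: true,
};

checkAffectiveDialog(clientOptions, liveConfig); // passes: version and flag agree
```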

All function declarations need NON_BLOCKING. If end_session blocks, the character goes silent mid-farewell. Breaks the illusion.
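
A small helper can stamp the behavior onto every declaration so none is missed. This is a sketch: the ToolDeclaration shape is simplified, the behavior field and its "NON_BLOCKING" value follow my reading of the Gemini function-declaration schema, and allNonBlocking is a hypothetical name.

```typescript
// Simplified declaration shape; the behavior field is an assumption to check against the API reference.
interface ToolDeclaration {
  name: string;
  description: string;
  behavior?: "BLOCKING" | "NON_BLOCKING";
}

// Returns copies with behavior forced to NON_BLOCKING, leaving the inputs untouched.
function allNonBlocking(decls: ToolDeclaration[]): ToolDeclaration[] {
  return decls.map((d) => ({ ...d, behavior: "NON_BLOCKING" as const }));
}

const tools = allNonBlocking([
  { name: "show_scene", description: "Show a pre-generated scene image." },
  { name: "end_session", description: "End the call after the farewell." },
]);
```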

Bounded audio queues (maxSize=10) with backpressure prevent memory issues in long sessions. When the interrupted signal arrives, clear the queue immediately.
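
A minimal sketch of such a queue follows. The source doesn't say whether a full queue blocks, drops, or awaits. This version signals backpressure by refusing the chunk and leaving the wait to the producer. BoundedAudioQueue and its method names are hypothetical.

```typescript
// Hypothetical queue; backpressure here means offer() returns false when full.
class BoundedAudioQueue {
  private readonly chunks: Uint8Array[] = [];
  private readonly maxSize: number;

  constructor(maxSize = 10) {
    this.maxSize = maxSize;
  }

  get size(): number {
    return this.chunks.length;
  }

  // Returns false when the queue is full so the producer can pause and retry.
  offer(chunk: Uint8Array): boolean {
    if (this.chunks.length >= this.maxSize) return false;
    this.chunks.push(chunk);
    return true;
  }

  // Next chunk for playback, or undefined when the queue is empty.
  take(): Uint8Array | undefined {
    return this.chunks.shift();
  }

  // Call when the model reports an interruption; returns how many chunks were dropped.
  clear(): number {
    const dropped = this.chunks.length;
    this.chunks.length = 0;
    return dropped;
  }
}
```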

WebSocket 1008 errors are transient — retry with exponential backoff handles them. Auth errors (401/403) should skip retry.
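
The retry decision and the delay schedule can be sketched as two pure functions. The Failure type, the 500 ms base, and the 8 s cap are illustrative assumptions. Treating every failure other than a 1008 close as non-retryable is also an assumption, since only 1008 and 401/403 are named above.

```typescript
// Hypothetical failure shape: a WebSocket close code or an HTTP status.
type Failure =
  | { kind: "ws-close"; code: number }
  | { kind: "http"; status: number };

function shouldRetry(failure: Failure): boolean {
  if (failure.kind === "http") {
    // Auth errors (401/403) never retry; this sketch skips other HTTP failures too.
    return false;
  }
  // 1008 is treated as transient; other close codes are not retried here.
  return failure.code === 1008;
}

// Exponential backoff: base, 2x base, 4x base, ... up to the cap. Attempt is zero-based.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

Adding random jitter to the delay is a common refinement when many clients reconnect at once. It is left out here to keep the schedule easy to read.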

Web Speech API running in parallel gives a better student transcript than Gemini's built-in inputAudioTranscription, which has no config options.

What's next for Past, Live

Characters referencing past calls from Firestore history. A larger pool of preset characters that rotates. Multi-language support — characters speaking in their native language. Content safety blocklist with automatic redirect to witnesses and resistors.

Built With

  • astro
  • firestore
  • gemini-3.1-image
  • gemini-flash
  • gemini-live-api
  • google-cloud-run
  • hono
  • svelte
  • typescript
  • websocket
  • workers

Updates


Post-submission update (March 17)

The submission text above was written in a rush at the deadline. Here's the version I'd have written if I had more than 15 minutes.

I built Past, Live in four days from Santa Marta, Colombia. Sitting on the floor, laptop on a chair, no AC (bought a 220v unit in a 110v country). I pivoted four times -- gamified quiz, stressful roleplay, rigid script, and finally the "bag of sticks" architecture where Flash generates a bag of material and Live pulls from it freely. The breakthrough came at 3am when I found a leftover directive that said the model cannot make jokes. I flipped it. Cleopatra asked if my call was a pyramid scheme.

Since submission I've fixed a Firestore crash affecting anonymous users (undefined fields rejected by Firestore -- stripped them before write), deployed the Bolivar portrait that was missing from the frontend, and rewritten the README from scratch in first person. The architecture, prompts, and functionality are the same -- just documentation and a one-line backend fix.
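
The undefined-field fix can be sketched as a small recursive helper. stripUndefined is a hypothetical name, and leaving arrays and non-plain objects (dates, class instances) untouched is an assumption of this sketch, not a description of the actual one-line fix.

```typescript
function isPlainObject(value: unknown): value is Record<string, unknown> {
  return (
    typeof value === "object" &&
    value !== null &&
    Object.getPrototypeOf(value) === Object.prototype
  );
}

// Returns a copy without undefined fields, recursing into nested plain objects.
function stripUndefined(data: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(data)) {
    if (value === undefined) continue; // drop the field entirely
    out[key] = isPlainObject(value) ? stripUndefined(value) : value;
  }
  return out;
}
```

For example, an anonymous profile such as { displayName: undefined, sessions: 1 } becomes { sessions: 1 } before the write. null values are kept.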

What makes this different from other voice AI apps:

  • Natural interruption -- VAD tuned with low start sensitivity, high end sensitivity, 500ms silence. You cut them off, they adapt.
  • Characters remember you across calls via Firestore profile injection into every system prompt.
  • Re-anchoring every 4 model turns -- without it, characters drift, start lecturing, forget they're on a phone call.
  • 3 Gemini models collaborating -- Flash writes the personality, Image generates the visuals, Live performs. Each does what it's best at.
  • The character voice lives in one file (server/src/character-voice.ts): be the funniest person at a dinner party who happens to have lived through something insane.
  • Curated art direction -- crosshatch engravings inspired by USD paper money, monochrome on vibrant orange, 30% orange at focal points in scene images.
  • Graceful error handling for Gemini Live's ~40% crash rate -- auto-reconnect, context replay, "Signal Lost" screen.
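
The re-anchoring bullet above boils down to a turn counter. A minimal sketch, with ReAnchor and its reminder text as hypothetical names. How the reminder is actually sent back into the Live session isn't described here, so the class only decides when one is due.

```typescript
// Hypothetical helper: decides when a re-anchoring reminder is due.
class ReAnchor {
  private modelTurns = 0;
  private readonly reminder: string;
  private readonly every: number;

  constructor(reminder: string, every = 4) {
    this.reminder = reminder;
    this.every = every;
  }

  // Call once per completed model turn; returns the reminder on every Nth turn, else null.
  onModelTurnComplete(): string | null {
    this.modelTurns += 1;
    return this.modelTurns % this.every === 0 ? this.reminder : null;
  }
}

const anchor = new ReAnchor("You are on a phone call. Stay in character and keep it short.");
```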

Content created for this hackathon:

The live version is genuinely better than the demo video. Try it: https://past-live.ngoquochuy.com

GeminiLiveAgentChallenge
