"I cooked a restaurant-quality dish and never typed on my phone once — my AI sous-chef watched, listened, and coached me the whole way."

Inspiration

Cooking while following a recipe on your phone is a miserable experience. Your hands are covered in garlic. The oil is about to smoke. You just need to know if the pan is hot enough — but your phone is locked, across the counter, and you're not touching it with raw chicken fingers.

Recipe apps are fundamentally broken for actual cooking. They're one-way information delivery: you read, you scroll, you set timers manually, and the app has no idea what's actually happening in your kitchen. If you make a mistake, it won't tell you. If you're about to burn something, it's silent.

I wanted to build the opposite: an AI that is present in your kitchen with you — watching, listening, and speaking up when it matters.

What It Does

SousChef Live is a real-time AI sous-chef that works entirely hands-free. You tap one button, prop your phone on the counter, and then you just cook. The chef does the rest.

What it sees, hears, and says:

  • Watches your kitchen through your camera at 1 frame per second — it sees your ingredients, your pan, your technique, your browning
  • Listens to both your voice and your cooking sounds — it hears sizzle intensity, crackling butter, and silence that means a cold pan
  • Speaks back in a natural voice (Aoede) with short, practical kitchen instructions

What makes it a real agent, not a chatbot:

  • Proactive interruption — if it sees unsafe knife grip, smoking oil, or a cold pan with chicken in it, it interrupts you without being asked
  • Automatic timers — the chef says "Starting a 2-minute sear timer" and a countdown appears on screen. You never say "set a timer"
  • Barge-in handling — interrupt the chef mid-sentence and it stops immediately, answers your question, and resumes
  • Step tracking — the app knows whether you're in prep, heating, searing, basting, or resting, and monitors accordingly
  • Substitution help — say "I don't have thyme" and get an immediate practical answer ("Use rosemary — half the amount")
  • Session memory — if the connection drops, the chef picks up exactly where it left off

How I Built It

Architecture

Browser (JS + AudioWorklets)
    │
    │  PCM16 audio (binary, 16kHz)
    │  JPEG frames (1 FPS, JSON)
    │  Text input + control events
    │
    ▼
FastAPI on Google Cloud Run
    │
    │  Bidirectional Live session
    │  (gemini-2.5-flash-native-audio-latest)
    │
    ▼
Gemini Live API

The key architectural decision was using raw binary PCM16 audio over WebSocket rather than base64-encoded JSON. This keeps round-trip audio latency low and matches how a real-time voice system should work. Video frames are JPEG at 1 FPS — enough for the chef to see pan heat, browning, and hand position without overwhelming the connection.

The backend is a thin FastAPI bridge. It doesn't try to be smart — it forwards audio and video to Gemini, executes tool calls, manages session state, and fans out events to the browser. Gemini does the actual reasoning.
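
A minimal sketch of that bridge loop, with the Gemini-side calls stubbed out (the message shapes and the forward_* helper names here are illustrative, not the actual protocol):

```python
# bridge_sketch.py: minimal shape of the browser <-> Gemini relay.
# Binary WebSocket frames carry raw PCM16 audio; text frames carry JSON
# (JPEG camera frames plus control events). The forward_* helpers stand in
# for the google-genai Live session calls and are not the real code.
import json
from fastapi import FastAPI, WebSocket

app = FastAPI()

async def forward_audio_to_gemini(pcm16: bytes) -> None:
    ...  # placeholder: stream the audio chunk into the Live session

async def forward_frame_to_gemini(jpeg_b64: str) -> None:
    ...  # placeholder: attach the 1 FPS camera frame to the session

async def forward_text_to_gemini(text: str) -> None:
    ...  # placeholder: typed input / control text

@app.websocket("/ws/session")
async def session_ws(ws: WebSocket):
    await ws.accept()
    while True:
        message = await ws.receive()                 # raw Starlette message
        if message["type"] == "websocket.disconnect":
            break                                    # session memory allows re-priming later
        if message.get("bytes") is not None:
            # Raw PCM16 @ 16 kHz straight from the browser AudioWorklet.
            await forward_audio_to_gemini(message["bytes"])
        elif message.get("text") is not None:
            event = json.loads(message["text"])
            if event["type"] == "video_frame":
                await forward_frame_to_gemini(event["jpeg_b64"])
            elif event["type"] == "text_input":
                await forward_text_to_gemini(event["text"])
```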

The Proactive System

The hardest engineering challenge was getting the chef to speak up at the right moments without becoming annoying. A naive implementation fires constantly and users stop trusting it. The system I built has three layers:

  1. Direct proactive path (urgent) — The system instruction tells Gemini: "If you see danger or an obvious imminent mistake, interrupt immediately." Unsafe knife grip, burning food, smoking oil. This fires through Gemini's own reasoning based on the video it sees.

  2. Passive evaluator — A background loop (every 12 seconds) sends the latest camera frame to gemini-2.0-flash-lite with a structured prompt that returns urgency, confidence, and a reason code. Results require 2 consecutive confirmations before a non-urgent candidate is promoted (persistence gating).

  3. Proactive coordinator — A session-owned dispatcher manages candidate lifecycle, phase gating (prep stays in urgent-only mode), quiet-gap detection, cooldowns, and deduplication. All server-originated speech flows through this coordinator — no other code path injects proactive speech.

The result: the chef is silent when you're cooking correctly, and speaks up when it actually matters.
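
In simplified form, the non-urgent gating policy looks roughly like this (threshold values, names, and phases are illustrative, not the production code):

```python
# proactive_gate.py: simplified version of the promotion/cooldown policy.
import time
from dataclasses import dataclass, field

@dataclass
class ProactiveGate:
    cooldown_s: float = 30.0          # minimum gap between non-urgent nudges
    confirmations_needed: int = 2     # persistence gating
    _streaks: dict = field(default_factory=dict)
    _last_spoke: float = 0.0

    def should_speak(self, reason: str, urgent: bool, phase: str) -> bool:
        now = time.monotonic()
        if urgent:
            self._last_spoke = now
            return True                    # safety issues always interrupt
        if phase == "prep":
            return False                   # prep stays in urgent-only mode
        # Persistence gating: require N consecutive evaluator hits
        # for the same reason code before promoting it to speech.
        self._streaks[reason] = self._streaks.get(reason, 0) + 1
        if self._streaks[reason] < self.confirmations_needed:
            return False
        if now - self._last_spoke < self.cooldown_s:
            return False                   # still in cooldown
        self._streaks[reason] = 0
        self._last_spoke = now
        return True

    def clear(self, reason: str) -> None:
        # Called when the evaluator stops flagging a reason; breaks the streak.
        self._streaks.pop(reason, None)
```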

Timer System

Timers are first-class citizens. When the chef calls set_timer, a background task runs a real countdown. At 80% elapsed, the chef delivers a pre-alert ("Almost time — don't move it yet"). At expiry, it fires the flip/rest instruction. These are injected as system messages into the Gemini session so the chef's own voice delivers them, not a separate notification tone.

A Demo 10x mode compresses all timers by 10x — a 2-minute sear becomes 12 seconds — so the full cooking flow fits in a 4-minute demo.
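
A stripped-down sketch of the countdown task (the `speak` callback stands in for injecting a system message into the Gemini session; the `speed` parameter is the same idea as the 10x demo mode):

```python
# timer_sketch.py: shape of the background countdown task (simplified).
import asyncio
from typing import Awaitable, Callable

async def run_timer(
    label: str,
    seconds: float,
    speak: Callable[[str], Awaitable[None]],
    speed: float = 1.0,                 # demo mode passes speed=10
) -> None:
    duration = seconds / speed
    # Pre-alert at 80% elapsed ("Almost time, don't move it yet").
    await asyncio.sleep(duration * 0.8)
    await speak(f"[timer] {label}: 80% elapsed, almost time.")
    # Remaining 20%, then the flip/rest instruction.
    await asyncio.sleep(duration * 0.2)
    await speak(f"[timer] {label}: time is up.")

# Started when the chef calls set_timer, e.g.:
# asyncio.create_task(run_timer("sear", 120, inject_system_message))
```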

Session Memory

Conversation history is compacted on a rolling basis using a structured memory format: preferences, substitutions, observations, and decisions. On reconnect, the server builds a rich primer text and re-primes the Gemini session so the chef picks up with full context.
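
Conceptually, the memory and the reconnect primer look something like this (a simplified sketch; the real format is richer):

```python
# memory_sketch.py: illustrative shape of the rolling session memory.
from dataclasses import dataclass, field

@dataclass
class SessionMemory:
    preferences: list[str] = field(default_factory=list)
    substitutions: list[str] = field(default_factory=list)
    observations: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)

    def build_primer(self, current_step: str) -> str:
        """Text used to re-prime a fresh Gemini session after a reconnect."""
        sections = [
            ("Preferences", self.preferences),
            ("Substitutions", self.substitutions),
            ("Observations", self.observations),
            ("Decisions", self.decisions),
        ]
        lines = [f"Resuming session. Current step: {current_step}."]
        for title, items in sections:
            if items:
                lines.append(f"{title}: " + "; ".join(items))
        return "\n".join(lines)
```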

Technology Stack

Layer          Technology
Frontend       Vanilla JS, AudioWorklets, Web Audio API, Vite
Backend        Python FastAPI, google-genai SDK
AI             gemini-2.5-flash-native-audio-latest (Live API)
Passive eval   gemini-2.0-flash-lite
Deployment     Google Cloud Run (us-central1 + europe-west1)
Audio          Raw PCM16 binary frames (16 kHz in, 24 kHz out)
Video          1 FPS JPEG frames

I chose the raw google-genai SDK over ADK because raw binary audio keeps latency lower, tool execution is simpler to debug, and the hackathon timeline benefited from staying close to the metal.

Challenges

Getting proactive behavior right without making it annoying. The default LLM tendency is to comment on everything it sees. The system instruction, phase gating, persistence thresholds, and cooldowns were all tuned iteratively until the chef felt genuinely useful rather than intrusive.

Barge-in with clean audio. The browser needs to stop playback the moment the user starts speaking, and the server needs to flush the Gemini session's output buffer. Getting this to feel natural — where the chef actually stops when you interrupt rather than continuing over you — required careful handling of the interrupted event from Gemini and immediate drain of the playback queue.
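
Server-side, the core of that handling is small (a sketch; `notify_browser` is a placeholder for the WebSocket fan-out):

```python
# barge_in_sketch.py: what happens when Gemini reports an interruption.
import asyncio

async def handle_interrupted(
    playback_queue: asyncio.Queue,
    notify_browser,
) -> None:
    # Drop any queued-but-unsent chef audio immediately.
    while not playback_queue.empty():
        try:
            playback_queue.get_nowait()
        except asyncio.QueueEmpty:
            break
    # Tell the client to stop playback and clear its own buffer,
    # so the chef audibly stops the moment the user speaks.
    await notify_browser({"type": "flush_audio"})
```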

Transcription latency on fast turns. Gemini's outputTranscription arrives cumulatively, often trailing the actual audio turn by a few hundred milliseconds. Several test failures early on were from reading the transcript too early and finding empty strings. The fix was a 3-second drain window after turnComplete.
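
The fix amounts to something like this (simplified; event fields are shown as plain dict keys rather than the SDK's actual message types):

```python
# transcript_drain_sketch.py: keep reading for a fixed window after turn end,
# because cumulative outputTranscription can trail the audio turn.
import asyncio
import time

async def collect_transcript(events, drain_window_s: float = 3.0) -> str:
    """events: async iterator of (simplified) Gemini Live server events."""
    transcript = ""
    it = events.__aiter__()
    drain_until = None
    while True:
        timeout = None
        if drain_until is not None:
            timeout = max(0.0, drain_until - time.monotonic())
        try:
            event = await asyncio.wait_for(it.__anext__(), timeout)
        except (asyncio.TimeoutError, StopAsyncIteration):
            break
        if event.get("output_transcription"):
            transcript = event["output_transcription"]  # cumulative, keep latest
        if event.get("turn_complete"):
            drain_until = time.monotonic() + drain_window_s
    return transcript
```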

Timer milestones during live cooking. Timer alerts need to fire even when no one is talking. This meant the timer system couldn't depend on the normal input-response loop — it injects directly into the Gemini input queue as a system message, bypassing the conversational turn structure.

Accomplishments

  • Full hands-free demo — ingredient recognition → recipe → knife correction → pan readiness → automatic timer → doneness check → substitution, all in one uninterrupted session with no screen touches
  • Proactive safety — "Pause — curl your fingertips for safety" fires reliably from a single camera frame of poor knife grip
  • Real multimodal — the chef references both what it sees ("I'm not seeing shimmer yet") and what it hears ("the sizzle is weak")
  • 14/14 live tests pass against the deployed Cloud Run service, including negative controls (session stays silent during idle and safe-looking frames)

What I Learned

The Gemini Live API is genuinely different from a chat API. The bidirectional nature — where the model can be interrupted, can speak up unprompted, and handles continuous audio and video simultaneously — opens up interaction patterns that are impossible in a request-response model. The hard part is not getting Gemini to do these things (it wants to), but building the infrastructure around it that makes the behavior reliable and not annoying.

System instruction quality matters enormously for tool-use reliability. The update_recipe tool wasn't being called consistently until the instruction explicitly said "call this in the same turn you suggest a dish, do not wait for explicit acceptance" with a concrete example. That one line change moved the tool-call rate from ~60% to ~95%.
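
For illustration, the decisive fragment of the instruction reads roughly along these lines (a reconstruction, not the exact production text):

```python
# Illustrative fragment of the system instruction (reconstructed wording).
UPDATE_RECIPE_RULE = """
When you suggest a dish, call update_recipe in the SAME turn.
Do not wait for the user to explicitly accept the suggestion.
Example: the user asks "what can I make with chicken thighs and butter?"
-> suggest pan-seared chicken thighs AND call update_recipe for that dish.
"""
```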

Proactive AI is a product design problem as much as an engineering one. The question isn't "can the model notice this?" — it usually can. The question is "should it say something right now, given what the user is doing, how recently it last spoke, and how confident it is?" That's a policy problem, and getting the policy right took as much work as the underlying model integration.

What's Next

  • Expanded recipe support — the chef can handle any dish, but the step-tracking state machine would benefit from recipe-specific step graphs
  • Multi-camera angle — a second camera at counter level for better ingredient recognition and browning assessment
  • Post-session report — "Your sear temperature was good; your rest time was a little short"
  • Offline fallback — a lite mode that handles network drops more gracefully for real kitchen use

Links

Built for the Gemini Live Agent Challenge — Live Agents category.
