🍳 Kitchen Copilot — Hackathon Journey

Category: Live Agents 🗣️


Inspiration

Cooking should be enjoyable, not a juggling act between a hot stove and a fragile screen.

We've all been there - hands covered in flour, trying to scroll to the next step of a recipe with an elbow, or shouting "Hey, how long do I bake this?" at no one in particular. We watched friends prop their phones against jars, smear grease across their screens, and lose their place in recipes mid-stir. The reality is that existing cooking apps are designed for clean hands and full attention - two things you never have while cooking.

When we discovered Google's Gemini Live API - real-time AI that can see, hear, and speak simultaneously - the idea clicked instantly: What if your kitchen had its own copilot? Not a chatbot you type at. Not a timer app you tap. A voice-first, vision-enabled companion that watches your kitchen counter through a camera, speaks to you like a patient friend, and handles everything - recipes, timers, step tracking - so you never have to touch a screen while cooking.

We wanted to build something that genuinely breaks the text-box paradigm and proves that the most powerful AI interactions happen when you forget you're talking to AI at all.


What It Does

Kitchen Copilot is a real-time, hands-free AI cooking assistant powered by Gemini's Live API. It sees your kitchen, speaks naturally, and guides you through entire recipes without you ever touching the screen.

🎙️ See, Hear, and Speak — All at Once

The experience is truly multimodal and simultaneous. Gemini watches your kitchen through the camera, listens to your voice, and responds with natural speech - all streaming in real time over a single WebSocket connection. It's not turn-based; you can interrupt it mid-sentence, and it adapts.

🧑‍🍳 Distinct Persona: The Patient Kitchen Friend

Kitchen Copilot isn't just an assistant - it has personality. It's calm, warm, and encouraging. When you accidentally burn the onions, it doesn't judge. It says "No worries, let's work with what we've got" and suggests a fix. It keeps responses to 1–2 sentences because it knows your hands are messy and you need quick answers.

🍽️ Context-Aware Recipe Discovery

Tell Copilot what's in your fridge - "I have eggs, oats, and milk" - and it generates 3 practical, home-cooking recipes tailored to your ingredients using Gemini 2.5 Flash. Each recipe card shows estimated cook time and how many extras you'd need to buy, so you can make an informed choice at a glance.

📋 Intelligent Step-by-Step Guidance

Once you pick a recipe, Copilot walks you through it one step at a time. It reads each step aloud, watches your progress through the camera, and only advances when you say "done" or "next." The sidebar visually tracks your progress with step locking - you can't accidentally skip ahead or re-do completed steps.

⏱️ Proactive Concurrent Timers

The AI doesn't wait for you to ask - when a step says "bake for 20 minutes," it proactively offers to set a timer. Multiple timers run simultaneously with live countdown widgets that stack, expand on hover, and pulse when completed. You can manage them entirely by voice: "pause the pasta timer," "how much time left on the chicken?"

🛡️ Safety & Session Management

Inactivity detection checks on you if you've been quiet too long. Recipe completion triggers a congratulatory message with an auto-end countdown. And the voice-triggered "Stop session" command shows a confirmation popup so you never accidentally disconnect mid-cook.


How We Built It

🏗️ Architecture — Real-Time Multimodal Bridge

The core technical achievement is a real-time multimodal bridge that connects the user's browser directly to Gemini's Live API, streaming audio, video, and tool responses simultaneously.

┌──────────────────────┐    Binary Multiplexed WebSocket    ┌──────────────────────┐
│    React Frontend    │ ◄─────────────────────────────────► │   FastAPI Backend    │
│                      │                                     │                      │
│  📷 Live Camera      │   0x00: Audio PCM (16-bit, 16kHz)   │  🔐 Session Auth     │
│  🎤 Mic Capture      │   0x01: Video JPEG Frames           │  🤖 Gemini Live API  │
│  🍽️ Recipe Sidebar   │   Text: JSON commands               │  ⏱️ Timer Manager    │
│  ⏱️ Timer Widgets    │                                     │  🔧 4 Custom Tools   │
│  🎨 Glassmorphic UI  │                                     │  🔥 Firestore DB     │
└──────────────────────┘                                     └──────────┬───────────┘
                                                                        │
                                                                ┌───────▼────────┐
                                                                │  Gemini 2.5    │
                                                                │  Flash Native  │
                                                                │  Audio (Live)  │
                                                                │  + Tool Calls  │
                                                                └────────────────┘

We built a custom binary multiplexing protocol over a single WebSocket: a one-byte header (0x00 = audio, 0x01 = video) for binary frames, and text frames for JSON. This keeps latency low and eliminates the need for multiple connections.

🛠️ Google Technology Stack

  • Gemini 2.5 Flash (Native Audio): Core AI brain - real-time voice conversation with vision and function calling via the Live API
  • Google GenAI SDK (google-genai): All Gemini interactions - Live session management, tool registration, content generation
  • Google Cloud Firestore: Async recipe caching with the google-cloud-firestore SDK
  • Google Cloud Run: Serverless, auto-scaling deployment of the containerized backend
  • Google Cloud Build: Automated CI/CD pipeline - push to deploy with cloudbuild.yaml
  • Artifact Registry: Docker image storage for Cloud Run deployments

🤖 Gemini Tool System (Function Calling)

We registered 4 custom tools with the Gemini Live session, giving the AI agency to control the entire application:

  • search_recipes: Takes ingredients → Gemini generates 3 practical recipes with estimated time and ingredient needs
  • search_recipe_by_name: Direct recipe lookup by name - bypasses the picker and loads the sidebar immediately
  • timer: Full timer lifecycle - create, start, pause, stop, query - all by voice
  • ui_command: 7 actions - show/hide sidebar, advance steps, toggle mute, select recipe, focus timer, stop session

Gemini decides when and which tools to call based on the conversation context. The tool results are returned to Gemini, which then summarizes them naturally in speech - e.g., after search_recipes returns, Gemini says "I found 3 recipes! Take a look and tell me which one you'd like to try" rather than reading raw data.
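The backend side of that round trip boils down to a dispatcher: look up the requested tool, run it, hand a structured result back for Gemini to narrate. A minimal sketch (handler bodies are stubbed here; the real handlers hit Firestore, the recipe generator, and the timer manager):

```python
# Sketch of the tool-dispatch loop that services Gemini's function calls.
import asyncio

async def search_recipes(ingredients: list[str]) -> dict:
    # Real version prompts Gemini 2.5 Flash and caches results in Firestore.
    return {"recipes": [], "count": 0}

async def timer(action: str, label: str, seconds: int = 0) -> dict:
    # Real version drives the asyncio timer manager.
    return {"status": "ok", "action": action, "label": label}

TOOL_HANDLERS = {"search_recipes": search_recipes, "timer": timer}

async def dispatch_tool_call(name: str, args: dict) -> dict:
    """Run a tool requested by Gemini; the structured result goes back
    into the Live session for the model to summarize in speech."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    return await handler(**args)
```

Keeping the results structured (not pre-formatted prose) is what lets Gemini phrase them naturally in its own voice.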

🔐 Security

Browser WebSocket clients can't attach Authorization headers, so we implemented a one-time-use token handshake:

  1. Frontend calls POST /session/token (HTTP)
  2. Backend generates a cryptographic token → stores it in memory
  3. Frontend opens WebSocket with ?token=abc123
  4. Backend validates and deletes the token (single use)
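The store itself is tiny. A minimal sketch of steps 2–4 (in-memory dict and function names are illustrative; the real version sits behind our FastAPI routes):

```python
# Single-use WebSocket auth tokens: minted over HTTP, consumed exactly once.
import secrets
import time

_TOKENS: dict[str, float] = {}   # token -> expiry timestamp
TTL_SECONDS = 30

def issue_token() -> str:
    """Step 2: POST /session/token mints a token the client passes as ?token=..."""
    token = secrets.token_urlsafe(32)
    _TOKENS[token] = time.monotonic() + TTL_SECONDS
    return token

def consume_token(token: str) -> bool:
    """Step 4: valid exactly once - pop() deletes it even as it validates."""
    expiry = _TOKENS.pop(token, None)
    return expiry is not None and time.monotonic() < expiry
```

Because `pop()` removes the token atomically with the lookup, a replayed `?token=` value fails even if the attacker races the first connection.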

🎨 Frontend — Premium Glassmorphic UI

The React frontend is designed to feel premium and alive:

  • Glassmorphic panels with backdrop blur and subtle gradients
  • Micro-animations on every interaction - slide-ins, scale bounces, pulse on completion
  • Three responsive layouts from a single codebase: desktop (side panel), portrait mobile (bottom sheet), landscape mobile (compact side panel)
  • Floating timer widgets that stack, expand on hover, and pulse when done

☁️ Cloud Deployment (Bonus Points)

Our deployment is fully automated via infrastructure-as-code:

  • infra/Dockerfile - Python 3.12 slim container with Uvicorn
  • infra/cloudbuild.yaml - Automated build → push → deploy pipeline to Cloud Run
  • Single git push triggers the entire CI/CD flow
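As a sketch, the pipeline in infra/cloudbuild.yaml has roughly this shape (image, service, and region names here are illustrative placeholders, not our exact config):

```yaml
# Build the backend image, push it to Artifact Registry, deploy to Cloud Run.
steps:
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', '${_IMAGE}', '-f', 'infra/Dockerfile', '.']
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', '${_IMAGE}']
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: gcloud
    args: ['run', 'deploy', 'kitchen-copilot',
           '--image', '${_IMAGE}', '--region', 'us-central1']
substitutions:
  _IMAGE: 'us-central1-docker.pkg.dev/$PROJECT_ID/kitchen-copilot/backend:$SHORT_SHA'
images: ['${_IMAGE}']
```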

Challenges We Ran Into

🔊 Binary Audio Multiplexing

Streaming three data types (PCM audio, JPEG frames, JSON) over a single WebSocket required a custom protocol. Getting the audio format conversion right - Float32 → Int16 at 16kHz outbound, Int16 → Float32 at 24kHz inbound - took significant debugging to achieve gap-free playback using the Web Audio API's scheduling system.
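The conversion itself is simple once the clamping and scaling are right. It runs in JavaScript in the browser; here is the same logic sketched in Python for clarity (function names are ours):

```python
# Sample-format conversions between Web Audio's Float32 [-1, 1] range
# and the Int16 PCM that the Live API speaks.
import array

def float32_to_int16(samples: list[float]) -> bytes:
    """Outbound mic audio: Float32 → Int16 PCM, clamped to avoid wraparound."""
    ints = array.array("h", (
        max(-32768, min(32767, round(s * 32767))) for s in samples
    ))
    return ints.tobytes()

def int16_to_float32(pcm: bytes) -> list[float]:
    """Inbound Gemini audio (24 kHz): Int16 PCM → Float32 for playback."""
    return [s / 32768.0 for s in array.array("h", pcm)]
```

The clamp matters: without it, a sample slightly above 1.0 wraps to a large negative value and produces an audible click.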

🪟 Cross-Platform Console Stability

The Gemini SDK prints Unicode characters (✓) on connection; on Windows consoles using legacy code pages, those characters raise UnicodeEncodeError and kill the server. We built a monkey-patched print() shield that silently swallows encoding errors, keeping the server alive regardless of what the SDK outputs.

🎤 Voice Activity Detection Tuning

Early audio amplification (15×) picked up kitchen background noise and broke Gemini's built-in silence detection - the AI constantly interrupted itself. We tuned to a gentle 1.5× gain and relied on the browser's autoGainControl for the rest.

⏱️ Async Timer State

Managing multiple concurrent async timers with pause/resume/complete states that survive across tool calls and broadcast to the frontend in real time required careful lifecycle management with asyncio tasks.
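The pause/resume core reduces to an asyncio.Event gate inside the countdown loop. A condensed sketch (the real manager also broadcasts remaining time to the frontend over the WebSocket):

```python
# Pausable countdown timer built on an asyncio.Event gate.
import asyncio

class KitchenTimer:
    def __init__(self, seconds: float):
        self.remaining = seconds
        self.done = False
        self._running = asyncio.Event()
        self._running.set()                 # start unpaused

    def pause(self) -> None:
        self._running.clear()

    def resume(self) -> None:
        self._running.set()

    async def run(self, tick: float = 1.0) -> None:
        while self.remaining > 0:
            await self._running.wait()      # blocks here while paused
            await asyncio.sleep(tick)
            self.remaining -= tick
        self.done = True
```

Each live timer is an `asyncio.create_task(timer.run())`, so "pause the pasta timer" is just a dictionary lookup plus `pause()` - no thread juggling.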

📱 Three Layouts, One Component

Making the recipe sidebar render as a fixed side panel (desktop), a slide-up bottom sheet (portrait mobile), and a compact side panel (landscape mobile) from the same React component with pure CSS media queries was a significant layout challenge.


Accomplishments We're Proud Of

🏆 Breaking the Text Box

Kitchen Copilot is a zero-typing experience. From the moment you start a session to the moment you plate your food, you never touch the screen. Voice in, voice out, camera watching - this is what "beyond the text box" looks like in practice.

🎯 Seamless Multimodal Integration

Audio, video, and tool calling all stream simultaneously through a single connection. The AI sees your kitchen, hears your voice, speaks back, and controls the UI - all in real time, all interruptible. It doesn't feel disjointed or turn-based; it feels like a conversation with someone standing next to you.

🧑‍🍳 A Persona You'd Actually Want in Your Kitchen

The AI isn't robotic. It's warm, patient, and brief. It proactively suggests timers, celebrates when you finish a recipe, and checks on you if you've been quiet. It feels less like software and more like a friend who happens to know every recipe.

☁️ Production-Grade Architecture

Token-based WebSocket security, Firestore persistence, Docker containerization, and a full Cloud Build → Cloud Run CI/CD pipeline. This isn't a hackathon prototype - it's deployable infrastructure.


What We Learned

🧠 Gemini Live API Is a Game-Changer

The ability to have a real-time, bidirectional conversation with an AI that simultaneously processes audio, sees through a camera, and calls functions - all in a single streaming session - is genuinely transformative. Learning to work with its native audio streaming, handle tool call interruptions mid-speech, and manage the conversation flow was the most rewarding part of this project.

🔧 Tool Calling in Live Context

Using function calling in a live, streaming context is fundamentally different from batch API calls. The AI can decide to call a tool mid-sentence, and you need to execute it, return the result, and let Gemini continue speaking - all without breaking the conversational flow. We learned to design tools that are fast, return structured data, and let Gemini summarize naturally.

☁️ Google Cloud Ecosystem

Firestore's async client made our data layer invisible - write and read without infrastructure concerns. Cloud Run with Cloud Build gave us production-grade deployment from a single YAML file. The ecosystem just fits together.


What's Next for Kitchen Copilot

🔮 Immediate

  • Ingredient Recognition: Use Gemini's vision to identify ingredients on the counter automatically
  • Shopping List Export: Generate a list of missing ingredients and send it to Google Keep
  • Multi-Language Support: Guide users in their preferred language
  • Meal Planning: Suggest a week of meals based on dietary preferences

🚀 Long-Term

  • Smart Kitchen Integration: Connect to smart ovens and thermometers for automated alerts
  • Family Profiles: Remember dietary restrictions and taste preferences per household member
  • Recipe History: Track what you've cooked, rate recipes, and build a personalized cookbook
  • Smart Display & Wearable Support: Adapt for Google Nest Hub and smartwatches

Bonuses

Automated Cloud Deployment: Our infra/Dockerfile and infra/cloudbuild.yaml fully automate the build → push → deploy pipeline to Google Cloud Run. This code is included in our public repository.


🍳 Built with passion - transforming messy-hands cooking into a hands-free, AI-guided experience!

#GeminiLiveAgentChallenge

Built With

  • docker
  • fastapi
  • google-cloud-build
  • google-cloud-firestore
  • google-cloud-run
  • google-genai
  • react
  • uvicorn
  • vite
  • web-audio-api