SousChef AI 👨‍🍳

Inspiration

I've always loved cooking, but I hate the friction of using technology in the kitchen. Trying to scroll through a recipe on my phone with dough-covered hands is a recipe for disaster (and a messy phone). Plus, I have a box of my grandmother's handwritten recipe cards that are precious but impossible to search. I wanted a way to digitize these analog treasures and interact with them in a completely hands-free, natural way. That's how SousChef AI was born—a bridge between the tactile joy of cooking and the power of modern AI, accelerated by Google's Antigravity program.

What it does

SousChef AI is a voice-first cooking assistant. You can upload PDF cookbooks or snap photos of handwritten recipes, and the AI ingests them into a private knowledge base. You then simply talk to it.

"Which recipe is the easiest to make using the ingredients that I have?"
"How many eggs for the cake, I am vegan btw?"
"What's the next step?"
"Set a timer for the pork to cook"

It guides you step-by-step, manages your timers based on the cooking context, and even updates your shopping list, all through a natural, low-latency voice conversation.

How I built it

I built SousChef AI on a current multimodal AI stack, in three parts:

1. The Voice Pipeline

I used LiveKit for the real-time audio infrastructure. The agent orchestrates a high-bandwidth audio stream:

Hearing: Silero VAD (Voice Activity Detection) identifies when the user starts and stops speaking.
Understanding & Speaking: I use Gemini 2.5 Flash (Live API) for its native audio-to-audio capabilities. This means no separate STT or TTS steps—it's incredibly fast, preserves nuance, and makes the conversation feel natural and interruptible.
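To make the "Hearing" step concrete: a VAD such as Silero reports a speech probability for each audio frame, and something has to turn that stream into "user started" and "user stopped" events. The sketch below is one common way to do that, using two thresholds plus a short silence window. It is not the Silero or LiveKit API; `detect_turns`, its threshold values, and the sample probabilities are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def detect_turns(
    probs: Iterable[float],
    start_threshold: float = 0.6,   # assumed value, not a Silero default
    end_threshold: float = 0.35,    # assumed value, not a Silero default
    min_silence_frames: int = 3,    # assumed value
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs, end exclusive, one per speech turn."""
    turns: List[Tuple[int, int]] = []
    start = None       # frame index where the current turn began
    silence_run = 0    # consecutive low-probability frames inside a turn
    count = 0          # total frames seen so far
    for i, p in enumerate(probs):
        count = i + 1
        if start is None:
            # Waiting for speech: require the higher threshold to open a turn.
            if p >= start_threshold:
                start = i
                silence_run = 0
        elif p < end_threshold:
            silence_run += 1
            if silence_run >= min_silence_frames:
                # The turn ended at the first frame of the silent run.
                turns.append((start, i - silence_run + 1))
                start = None
                silence_run = 0
        else:
            # Still speaking (or a brief dip that recovered).
            silence_run = 0
    if start is not None:
        # Stream ended mid-turn: close it, trimming any trailing silence.
        turns.append((start, count - silence_run))
    return turns


frame_probs = [0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.1, 0.05, 0.7]
print(detect_turns(frame_probs))
```

The gap between the two thresholds keeps a single quiet frame (index 3 in the sample) from cutting a sentence in half, which matters in a kitchen full of background noise.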

2. The Brain (RAG Engine & Multi-modal Logic)

To give the AI specific knowledge about your recipes, I built a specialized multimodal pipeline:

Ingestion: LlamaIndex manages my document processing.
Vision (Agentic OCR): For handwritten recipe cards, I use Gemini 3.0 Flash Vision, released only a couple of weeks earlier, to extract text and structure despite stains and scribbles.
Embeddings: I use Gemini Embedding 001 to vectorise recipes, allowing the entire app to run on a single Google API key.
Storage: Chunks are stored in a session-aware index, so your data stays private and scoped to your cooking session.
Retrieval: When you ask a question, I embed it and look up the recipe chunks whose vectors are closest to the question's vector.
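The Storage and Retrieval steps above can be sketched with a toy in-memory index. This is only an illustration: the real project uses LlamaIndex and Gemini Embedding 001, while the `SessionIndex` class, the cosine-similarity ranking, and the hand-written three-dimensional vectors here are assumptions standing in for them.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SessionIndex:
    """Hypothetical session-scoped store: each session only sees its own chunks."""

    def __init__(self) -> None:
        self._chunks: Dict[str, List[Tuple[str, List[float]]]] = defaultdict(list)

    def add(self, session_id: str, text: str, vector: Sequence[float]) -> None:
        self._chunks[session_id].append((text, list(vector)))

    def search(
        self, session_id: str, query_vector: Sequence[float], top_k: int = 2
    ) -> List[Tuple[float, str]]:
        # .get avoids creating an empty entry for unknown sessions.
        chunks = self._chunks.get(session_id, [])
        scored = [(cosine(query_vector, vec), text) for text, vec in chunks]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


index = SessionIndex()
# Stand-in vectors; a real pipeline would get these from the embedding model.
index.add("session-a", "Sponge cake: 3 eggs", [1.0, 0.0, 0.0])
index.add("session-a", "Roast pork: 40 minutes", [0.0, 1.0, 0.0])
index.add("session-b", "Someone else's soup", [1.0, 0.0, 0.0])

best = index.search("session-a", [0.9, 0.1, 0.0], top_k=1)
print(best[0][1])
```

Keying every lookup by session id is what keeps one cook's recipes out of another cook's answers, even when the vectors are identical.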

3. The Frontend

The UI is a modern Next.js application, and it's deployed on AWS Amplify for showcasing.

Challenges I ran into

The Latency War: Voice interfaces live or die by latency. By using the native audio-to-audio capabilities of Gemini 2.5 Flash, I bypassed traditional STT/TTS overhead, cutting response times by over 1.5 seconds.
Multi-modal Context: Feeding visual data (recipe photos) into a live voice stream while maintaining conversation flow was a significant orchestration challenge. I solved this by using a background indexing flow that notifies the user once the "brain" has assimilated the new recipe.
State-of-the-Art Sync: Keeping the visual UI (timers, transcript, shopping list) in sync with a streaming audio response required building a custom event protocol on top of LiveKit's real-time channels.
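As a rough picture of what such an event protocol might look like, here is a minimal sketch of a JSON envelope that the agent could send alongside the audio and the UI could decode. The write-up does not specify the actual protocol, so the event names, the `seq` field, and the `UiEvent` and `EventOrderer` classes are all hypothetical. The encoded bytes are what would travel over the real-time channel; no LiveKit calls are shown.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict

# Hypothetical event names for the three UI surfaces mentioned above.
ALLOWED_TYPES = {"timer.started", "shopping_list.updated", "transcript.appended"}


@dataclass
class UiEvent:
    type: str
    seq: int  # monotonically increasing per session, assigned by the sender
    payload: Dict[str, Any] = field(default_factory=dict)

    def encode(self) -> bytes:
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"unknown event type: {self.type}")
        return json.dumps(asdict(self), separators=(",", ":")).encode("utf-8")

    @staticmethod
    def decode(raw: bytes) -> "UiEvent":
        data = json.loads(raw.decode("utf-8"))
        if data.get("type") not in ALLOWED_TYPES:
            raise ValueError(f"unknown event type: {data.get('type')}")
        return UiEvent(
            type=data["type"], seq=int(data["seq"]), payload=data.get("payload", {})
        )


class EventOrderer:
    """Receiver-side guard that ignores stale or duplicated events."""

    def __init__(self) -> None:
        self.last_seq = -1

    def accept(self, event: UiEvent) -> bool:
        if event.seq <= self.last_seq:
            return False
        self.last_seq = event.seq
        return True


wire = UiEvent("timer.started", seq=0, payload={"label": "pork", "seconds": 2400}).encode()
received = UiEvent.decode(wire)
orderer = EventOrderer()
print(received.type, received.payload["seconds"], orderer.accept(received))
```

A sequence number is a simple way to keep the screen from moving backwards if a late event shows up after a newer one has already been applied.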

Accomplishments that I'm proud of

Single Key Setup: The entire project—from vision to embeddings to live voice—runs on a single Gemini API key.
Analog-to-Digital Bridge: Transforming stained, decades-old recipe cards into searchable, interactive kitchen guides.
True Hands-Free Flow: A dual-pane immersive UI that responds to voice commands, handles kitchen tools, and shows guided videos synchronously.

What I learned

Multimodal is the Future: Combining Vision (seeing the recipe), Voice (talking to the user), and Embeddings (searching knowledge) creates a synergy that traditional text bots can't touch.
Local is Snappy: I learned that for high-performance voice agents, keeping the compute local (or as close to the user as possible) is critical for UX.
RAG tuning: For recipes, precise chunks are better than long ones—users need specific ingredient amounts, not the whole history of the dish.
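To show what "precise chunks" can mean in practice, here is a small sketch that turns one recipe into one chunk per ingredient line and one per step, each tagged with the recipe title so it still makes sense on its own. The input layout (an "Ingredients:" heading followed by a "Steps:" heading) and the `chunk_recipe` helper are assumptions for the example, not the project's actual chunking logic.

```python
from typing import List


def chunk_recipe(title: str, text: str) -> List[str]:
    """Split a recipe into one small chunk per ingredient line and per step."""
    chunks: List[str] = []
    section = None  # becomes "ingredients" or "steps" once a heading is seen
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        lowered = line.lower().rstrip(":")
        if lowered in ("ingredients", "steps"):
            section = lowered
            continue
        if section is None:
            # Intro text before the first heading (the dish's backstory) is skipped.
            continue
        label = "ingredient" if section == "ingredients" else "step"
        chunks.append(f"{title} | {label}: {line}")
    return chunks


card = """Grandma's favourite since 1962.

Ingredients:
3 eggs
200 g flour

Steps:
1. Whisk the eggs.
2. Fold in the flour.
"""

for chunk in chunk_recipe("Sponge Cake", card):
    print(chunk)
```

With chunks this small, a question like "how many eggs?" can match a single line instead of pulling the whole card into the prompt.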

What's next for SousChef AI

Cooking Mode V2: Integrating more vision-based guidance, like identifying if a steak is done via the camera.
Hardware Integration: A dedicated "Kitchen Hub" device running the local agent.
Dietary Personalization: Real-time recipe substitutions for allergies or ingredient availability on the fly.
Adding more tool calls: Additional tools that would make cooking more seamless and hands-free.
Integrating with third-party apps: Connecting to Instacart or a similar service, so users can send their shopping carts or timers to a phone or another device.

Built With

antigravity
flask
gemini
livekit
mcp
nextjs
python

Updates

Pranam Shetty started this project — Feb 09, 2026 07:46 PM EST
