Voice-Powered Family Recipe Assistant
Inspiration
Over the last couple of years I have changed my life by adopting a low-carb/keto diet along with regular cardio and weight training. Along the way I accumulated many recipes built around alternative ingredients like nut flours and sugar substitutes. Our family has also collected more than 600 recipes over the years - everything from bread experiments I've been perfecting, to my grandmother's handwritten brown bread recipe on a stained index card, to my wife's collection of holiday baking favourites bookmarked from blogs that no longer exist. They were scattered across PDFs, photos of handwritten cards, screenshots, and dead URLs. Finding anything was painful: I'd scroll through folders, skim titles, and still end up Googling a recipe I already had saved somewhere.
I wanted a way to just ask for what I needed: "what keto bread recipes do we have?", "what's in grandma's homemade brown bread?", or "what are the macros in that almond flour waffle recipe?" - and get answers from our own collection, with the option to look up accurate nutrition from the USDA database for any recipe or ingredient list.
My first version of an app to handle this, a text-based Family Recipe Assistant, works really well, but typing questions into a frontend isn't very useful when you're actually cooking. Every time I'm in the kitchen with ingredients on my hands, reaching for my phone to type "how long do I bake the banana bread?" feels wrong. I wanted to just ask.
The previous version used the Nova Pro model to process all the raw recipes; it searches my family's recipe collection, calculates nutrition from USDA data, and handles multi-turn conversations through a web UI. It even has a "cooking mode" that reads recipes aloud using Amazon Polly. But listening to a long recipe read start-to-finish by a TTS voice is surprisingly tedious: you can't ask it to slow down, skip ahead, or clarify a step without going back to the screen and typing.
When Amazon Nova Sonic v2 launched with sub-700ms speech-to-speech latency and the Strands Agents SDK added experimental support for bidirectional streaming, I saw an opportunity to turn my recipe assistant into a real conversation partner - one I could talk to hands-free while cooking.
What it does
A hands-free kitchen assistant you can talk to while cooking, powered by your own family recipe collection and real-time nutrition data.
- Search recipes by voice ("find me a quick pasta recipe") - retrieves results from a Bedrock Knowledge Base containing my family's recipe collection
- Set cooking timers ("set a timer for 12 minutes for the pasta") - async timers that run in the background while the conversation continues
- Look up nutrition ("how many calories in a cup of rice?") - real-time queries against the USDA FoodData Central API
- Convert units ("convert 2 cups to milliliters" or "what is 350 fahrenheit in celsius?") - kitchen measurement and temperature conversions
- Handle interruptions - change your mind mid-sentence and the agent adapts naturally
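To make the tool side of this concrete, the unit-conversion feature boils down to a lookup table plus a couple of formulas. This is a hedged sketch in plain Python - the function names and the exact set of units I expose here are illustrative, not lifted from the real app:

```python
# Hypothetical sketch of the kitchen unit-conversion tool.
# Names and the supported units are illustrative assumptions.

# Volume conversion factors, expressed in milliliters per unit (US customary).
ML_PER_UNIT = {
    "cup": 236.588,
    "tablespoon": 14.7868,
    "teaspoon": 4.92892,
    "milliliter": 1.0,
    "liter": 1000.0,
}

def convert_volume(amount: float, from_unit: str, to_unit: str) -> float:
    """Convert between kitchen volume units by going through milliliters."""
    ml = amount * ML_PER_UNIT[from_unit]
    return ml / ML_PER_UNIT[to_unit]

def fahrenheit_to_celsius(f: float) -> float:
    """Convert an oven temperature from Fahrenheit to Celsius."""
    return (f - 32) * 5 / 9

print(round(convert_volume(2, "cup", "milliliter"), 1))  # 473.2
print(round(fahrenheit_to_celsius(350), 1))              # 176.7
```

Routing "convert 2 cups to milliliters" to a function like this is the model's job; the tool itself stays trivially simple and deterministic.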
Tool execution happens concurrently with audio streaming. When you ask "find me a keto muffin recipe," the agent calls the search tool while continuing to listen for follow-up input. No blocking, no silence gap.
The browser-based frontend uses the Web Audio API for echo cancellation, so you can use laptop speakers and microphone directly - no headset required. It works on desktop, tablets, and phones.
How I built it
The core stack is Amazon Nova Sonic v2 for real-time speech-to-speech AI with tool use, the Strands Agents SDK (BidiAgent) for managing the bidirectional WebSocket and concurrent tool execution, a FastAPI WebSocket server bridging browser audio to BidiAgent, and a React 19 + Vite frontend with Web Audio API for mic capture (16kHz) and AudioWorklet playback (24kHz). Recipe search runs against a Bedrock Knowledge Base (Titan Embed V2 + S3 Vectors), and nutrition lookups query the USDA FoodData Central API.
For deployment, the FastAPI server runs as an ARM64 container on AgentCore Runtime with OpenTelemetry instrumentation for CloudWatch logging. The frontend is hosted on CloudFront + S3, authentication goes through Cognito with Identity Pool for temporary AWS credentials, and the browser SigV4-signs the AgentCore WebSocket URL directly using those credentials - no Lambda or API Gateway in the voice path. Everything is provisioned with Terraform.
The BidiAgent abstraction made the agent surprisingly simple - roughly 20 lines of core code. You pass ws.receive_json and ws.send_json as I/O functions, decorate your tools with @tool, and the SDK handles WebSocket management, audio encoding, interruption handling, and concurrent tool execution.
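The cooking-timer tool is the clearest example of why concurrent execution matters: the timer has to keep counting while the conversation continues. Here is a minimal asyncio sketch - the names, signature, and announcement text are my assumptions rather than the app's actual code, and in the real app the function would carry Strands' @tool decorator so the agent can invoke it:

```python
import asyncio

# Hypothetical sketch of the async cooking-timer tool; names and messages
# are illustrative assumptions, not the app's real code.

active_timers: dict[str, asyncio.Task] = {}

async def _run_timer(label: str, seconds: float) -> None:
    """Count down in the background, then announce (placeholder print)."""
    await asyncio.sleep(seconds)
    print(f"Timer done: {label}")
    active_timers.pop(label, None)

def set_timer(label: str, seconds: float) -> str:
    """Start a background timer and return immediately, so the audio
    stream is never blocked while the timer runs."""
    task = asyncio.get_running_loop().create_task(_run_timer(label, seconds))
    active_timers[label] = task
    return f"Timer '{label}' set for {seconds:.0f} seconds."

async def demo() -> None:
    print(set_timer("pasta", 0.1))  # returns instantly, no blocking
    await asyncio.sleep(0.2)        # the conversation keeps flowing here

asyncio.run(demo())
```

Because set_timer returns immediately, the agent can confirm the timer aloud and keep listening while the countdown runs as a background task.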
Challenges I ran into
Echo cancellation - Without echo cancellation, the assistant hears its own output and creates a feedback loop. The Web Audio API's built-in AEC solved this for browsers, but it took experimentation to discover that this was the right layer to handle it (not the model, not the server).
Audio pacing - Nova Sonic generates audio faster than real-time. Without client-side buffering, responses sound like chipmunk audio with no pauses between words. The solution was an AudioWorklet with a 60-second ring buffer that plays audio at the correct 24kHz sample rate regardless of how fast chunks arrive.
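The pacing fix is conceptually simple: decouple the arrival rate from the playback rate with a ring buffer. Here is the idea in plain Python for clarity - the real implementation is an AudioWorklet in JavaScript, and the names and sizes below are illustrative:

```python
class AudioRingBuffer:
    """Fixed-capacity ring buffer: bursty writes in, steady fixed-size reads out.

    Mirrors the AudioWorklet pattern: the network side fills the buffer as
    fast as chunks arrive, while the audio callback drains it at exactly
    the playback rate.
    """

    def __init__(self, capacity: int) -> None:
        self.buf = [0.0] * capacity
        self.capacity = capacity
        self.read_pos = 0
        self.write_pos = 0
        self.available = 0

    def write(self, samples: list[float]) -> int:
        """Append samples, dropping any overflow; return the count written."""
        written = 0
        for s in samples:
            if self.available == self.capacity:
                break  # buffer full: drop rather than overwrite unplayed audio
            self.buf[self.write_pos] = s
            self.write_pos = (self.write_pos + 1) % self.capacity
            self.available += 1
            written += 1
        return written

    def read(self, frame_size: int) -> list[float]:
        """Pull one playback frame; pad with silence on underrun."""
        out = []
        for _ in range(frame_size):
            if self.available == 0:
                out.append(0.0)  # underrun: emit silence instead of glitching
            else:
                out.append(self.buf[self.read_pos])
                self.read_pos = (self.read_pos + 1) % self.capacity
                self.available -= 1
        return out
```

At 24kHz mono, a 60-second buffer is 1,440,000 float samples; that comfortably absorbs a full response arriving faster than real time while the reader drains it at a constant rate.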
Voice ID case sensitivity - Nova Sonic voice IDs must be lowercase ("tiffany", not "Tiffany"). Passing the wrong case results in a ValidationException that silently kills the session. This is not documented anywhere obvious.
SigV4 WebSocket signing - AgentCore only supports IAM auth. Since the browser connects directly to AgentCore (no Lambda proxy for voice), it must SigV4-sign the WebSocket URL using temporary Cognito credentials. This required implementing the full credential exchange and presigning flow in the browser.
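The trickiest piece of browser-side presigning is the SigV4 signing-key derivation, a chain of HMAC-SHA256 operations defined by the SigV4 spec. Sketched in Python for clarity (the real app does this in the browser; the AgentCore service name shown in the comment is my assumption):

```python
import hashlib
import hmac

def sigv4_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the SigV4 signing key: an HMAC chain over date, region, service."""
    def h(key: bytes, msg: str) -> bytes:
        return hmac.new(key, msg.encode(), hashlib.sha256).digest()

    k_date = h(("AWS4" + secret_key).encode(), date)  # date in YYYYMMDD form
    k_region = h(k_date, region)                      # e.g. "us-east-1"
    k_service = h(k_region, service)                  # service name (assumption: "bedrock-agentcore")
    return h(k_service, "aws4_request")

# The final query-string signature is HMAC(signing_key, string_to_sign),
# where string_to_sign embeds the SHA-256 hash of the canonical request
# built from the wss:// URL, its query parameters, and signed headers.
```

With temporary Cognito credentials, the X-Amz-Security-Token session token also has to be included in the signed query string.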
Cognito IAM resource scoping - Scoping the InvokeAgentRuntimeWithWebSocketStream permission to a specific runtime ARN silently fails, because the action evaluates against an ARN that includes session and qualifier components. Resource: "*" is currently the only option that works (matching the official AWS sample).
AgentCore container logging - Containers produce no CloudWatch logs by default. You need aws-opentelemetry-distro installed and opentelemetry-instrument as the CMD wrapper for log capture.
Accomplishments that we're proud of
True hands-free cooking - The assistant works with laptop speakers and microphone, no headset needed. You can talk to it while your hands are covered in dough.
$0.07 per cooking session - A typical 5-minute session costs roughly 7 cents. An 8-minute session (the Nova Sonic maximum) costs about $0.11 - approximately 80% cheaper than comparable real-time voice APIs.
Sub-second response latency - Nova Sonic v2 delivers sub-700ms speech-to-speech latency, making the conversation feel natural and responsive.
Concurrent tool execution - Tools run without interrupting the audio stream. The agent can search recipes, look up nutrition, and set timers while continuing to listen for follow-up questions.
Full infrastructure as code - Everything from Cognito to CloudFront to ECR is provisioned with Terraform. A new developer can clone the repo and deploy the complete stack.
What I learned
Voice is closer to text than you think - The gap between a text-based agent and a voice-based agent is smaller than expected. Strands abstracts the hard parts, and the same @tool decorator, docstring-based tool selection, and Bedrock integration work identically. If you have an existing Strands agent, adding voice is closer to a weekend project than a rewrite.
Echo cancellation is a client concern - This is an I/O problem, not a model problem. Modern browsers, iOS, Android, and smart speakers all have AEC built in. Solve it at the client layer and the model never needs to know.
Audio buffering matters more than latency - Getting low latency from the model is important, but without proper client-side buffering, fast audio delivery actually makes things worse. The AudioWorklet ring buffer was the key to natural-sounding playback.
AgentCore container deployment has sharp edges - Unique ECR tags for each deployment (not latest), ARM64-only images, OTEL instrumentation for logging, and IAM resource scoping quirks all required discovery through trial and error rather than documentation.
Application-level cost tracking is essential - AWS does not currently publish CloudWatch metrics for bidirectional streaming invocations. The only way to track per-session costs is to count audio chunks and estimate in your application code.
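A sketch of what that application-level estimation can look like. The per-minute rates below are placeholders chosen for illustration, not actual Nova Sonic pricing (substitute current rates), and the chunk math assumes 16-bit mono PCM:

```python
# Hypothetical per-session cost estimator for bidirectional audio streaming.
# RATE_* values are placeholder prices, NOT actual Nova Sonic pricing;
# look up the current rates and substitute them.

RATE_INPUT_PER_MIN = 0.004   # placeholder $/minute of user speech in
RATE_OUTPUT_PER_MIN = 0.010  # placeholder $/minute of generated speech out

def chunk_seconds(byte_count: int, sample_rate: int) -> float:
    """Duration of a 16-bit mono PCM chunk (2 bytes per sample)."""
    return byte_count / (2 * sample_rate)

def estimate_session_cost(input_bytes: int, output_bytes: int) -> float:
    """Estimate session cost from total audio bytes in each direction."""
    in_min = chunk_seconds(input_bytes, 16_000) / 60   # mic audio at 16kHz
    out_min = chunk_seconds(output_bytes, 24_000) / 60 # model audio at 24kHz
    return in_min * RATE_INPUT_PER_MIN + out_min * RATE_OUTPUT_PER_MIN
```

Summing byte counts as chunks stream through the WebSocket bridge is enough to log a running cost per session without any CloudWatch support.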
What's next for Voice-Powered Family Recipe Assistant
Unified app - Merge the text and voice assistants into a single application with both input modes. Start with text ("what should I make for dinner?"), switch to voice once you start cooking ("what is the next step?"), and go back to text when things get noisy.
Session rotation - Automatically reconnect when the 8-minute Nova Sonic limit is reached, preserving conversation context across sessions.
Meal planning and shopping lists - Extend the tools to plan weekly meals and generate shopping lists with store availability and pricing from Canadian grocers (Loblaws, Walmart, Farm Boy, Metro, Costco).
