SayAndStay

"Just say where. We'll handle the stay."


Inspiration

Travel planning has a friction problem. Not a dramatic, broken-system kind of problem — but the quiet, exhausting kind. The kind where you spend 45 minutes clicking through Booking.com filters, comparing prices, reading reviews, and adjusting dates before you even decide on a hotel. The actual trip hasn't started, yet you're already tired.

We kept coming back to one observation: talking is the most natural interface humans have. You wouldn't hesitate to tell a friend, "Find me a hotel in Da Nang, this Friday to Monday, two adults, nothing over $80 a night." They'd understand instantly. So why can't software?

The release of AWS Nova Sonic 2 — a speech-to-speech model built for real-time, low-latency voice conversations — gave us the missing piece. Combined with Amazon Bedrock Nova Lite 2 for browser-level reasoning, we saw the opportunity to build something genuinely useful: a voice agent that books hotels the way a human assistant would, by listening to you and then doing the work.


How We Built It

SayAndStay is built as a full-stack system with three distinct layers, each responsible for a clear part of the speak-to-book pipeline.

The Voice Layer is powered by AWS Nova Sonic 2 S2S (speech-to-speech). Using the Pipecat framework, src/voice_bot.py establishes a real-time audio pipeline that captures the user's spoken input, sends it to Nova Sonic 2, and receives a natural language response — all with minimal latency. This is what gives SayAndStay its conversational feel: the agent doesn't just transcribe words, it understands intent and responds naturally.
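The shape of that pipeline can be sketched as a chain of async stages that a frame flows through. This is a conceptual illustration only: real Pipecat processors and the Nova Sonic service bindings have different APIs, and every function name below is a hypothetical stand-in.

```python
import asyncio

# Conceptual sketch of a voice pipeline: a frame (dict) flows through
# stages in order. All stage names here are illustrative stand-ins, not
# the Pipecat or Nova Sonic APIs.

async def run_pipeline(frame, stages):
    """Pass a frame through each stage in sequence."""
    for stage in stages:
        frame = await stage(frame)
    return frame

async def capture_audio(frame):
    # In the real system this would read live microphone audio.
    frame["audio"] = "find me a hotel in da nang"
    return frame

async def speech_to_speech(frame):
    # Stand-in for the Nova Sonic round trip: spoken input in, reply out.
    frame["reply_text"] = f"Searching hotels for: {frame['audio']}"
    return frame

async def play_response(frame):
    # Stand-in for streaming synthesized audio back to the speaker.
    frame["played"] = True
    return frame

frame = asyncio.run(
    run_pipeline({}, [capture_audio, speech_to_speech, play_response])
)
print(frame["reply_text"])
```

The value of the chain structure is that stages can be swapped or instrumented individually, which is the property the real framework provides.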

The Browser Agent Layer is where the real automation happens. Once the voice layer extracts the user's travel intent — destination, dates, guest count, budget — it passes those parameters to src/browser_agent.py, which uses the browser-use library backed by Bedrock Nova Lite 2. This agent opens a headless browser, navigates Booking.com, fills in the search fields, applies filters, and extracts the top matching results. The key design principle here was atomic task decomposition: rather than issuing one sweeping instruction to the LLM, we broke the booking workflow into small, explicit steps — navigate, fill field, click, read results — which dramatically improves reliability.
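The atomic-decomposition idea can be sketched as a list of small named actions executed one at a time. The fake page object and selector strings below are hypothetical so the sketch runs standalone; they are not the browser-use or Playwright API.

```python
# Sketch of atomic task decomposition: each step is one small, observable
# action, so a failure pinpoints exactly which step broke. The FakePage
# class and all selectors are illustrative stand-ins, not real APIs.

class FakePage:
    """Stands in for a browser page so the sketch runs without a browser."""
    def __init__(self):
        self.log = []
    def goto(self, url): self.log.append(("goto", url))
    def fill(self, selector, value): self.log.append(("fill", selector, value))
    def click(self, selector): self.log.append(("click", selector))
    def results(self): return [{"name": "Hotel A", "price": 72}]

def run_booking_steps(page, intent):
    steps = [
        ("navigate", lambda: page.goto("https://www.booking.com")),
        ("fill destination", lambda: page.fill("#destination", intent["city"])),
        ("fill dates", lambda: page.fill("#dates", intent["dates"])),
        ("search", lambda: page.click("#search")),
    ]
    for name, action in steps:
        action()  # in the real agent: retry, log, or screenshot per step
    return page.results()

intent = {"city": "Da Nang", "dates": "2025-06-06..2025-06-09"}
hotels = run_booking_steps(FakePage(), intent)
print(hotels[0]["name"])
```

Because each step is addressable by name, the agent can retry or debug one step without rerunning the whole workflow.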

The Frontend is a React application that serves as the user-facing interface. It displays the voice interaction in real time and renders the hotel results returned by the browser agent, giving users a clean way to review and confirm their booking.

The entire stack — Python backend, React frontend, and both service entrypoints (main_voice.py and main_browser_service.py) — is containerized with Docker, making deployment straightforward across environments.


What We Learned

Building this project taught us as much about voice interfaces as it did about agent architecture.

On the voice side, we learned that natural language for travel is surprisingly ambiguous. Phrases like "next weekend," "a couple of nights," and "not too expensive" require normalization before they can be passed to a browser agent as structured parameters. Nova Sonic 2 handles the speech side beautifully, but the mapping from conversational intent to actionable data took careful prompt engineering to get right.
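One way to picture that normalization step: a fuzzy phrase becomes concrete dates or a budget cap relative to "today." The phrase table and the default budget below are illustrative assumptions, not the project's actual rules.

```python
from datetime import date, timedelta

# Hedged sketch of fuzzy-phrase normalization. The phrase handling and
# the default budget cap are illustrative, not the real prompt logic.

def normalize(phrase, today):
    """Map a fuzzy phrase to (check_in, check_out, max_price_usd)."""
    if phrase == "next weekend":
        # Friday of next week: days until this Friday, plus seven.
        days_ahead = (4 - today.weekday()) % 7 + 7
        check_in = today + timedelta(days=days_ahead)
        return check_in, check_in + timedelta(days=2), None
    if phrase == "a couple of nights":
        return today, today + timedelta(days=2), None
    if phrase == "not too expensive":
        return None, None, 100  # arbitrary illustrative budget cap
    raise ValueError(f"unrecognized phrase: {phrase}")

# June 2, 2025 is a Monday, so "next weekend" resolves to June 13-15.
check_in, check_out, _ = normalize("next weekend", date(2025, 6, 2))
print(check_in, check_out)
```

In the real system the LLM does this mapping via prompting rather than hand-written rules, but the output contract, structured fields a browser agent can consume, is the same.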

On the agent side, we learned that the reliability of a multi-step agent is only as strong as its weakest step. A single flaky DOM selector or a missed page-load event can derail the whole workflow. Using browser-use with Nova Lite 2 gave us more control than a pure end-to-end agent would have, because we could observe, debug, and retry individual steps independently.

We also gained a deep appreciation for Pipecat as a voice pipeline framework. Its modular design made it easy to swap components and tune the real-time audio flow without rebuilding the entire pipeline from scratch.


Challenges

The most persistent challenge was dynamic UI instability on Booking.com. The site's interface shifts depending on locale, active A/B tests, promotional banners, and scroll position. Selectors that worked in one session would silently break in the next. We addressed this by writing resilient, fallback-aware selectors and adding retry logic with exponential backoff for steps that were prone to timing issues.
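The two tactics, fallback-aware selectors and exponential backoff, can be sketched together. The selector strings and the dict-backed page below are hypothetical stand-ins so the example runs without a browser; real code would query Playwright instead.

```python
import time

# Sketch of fallback selectors plus retry with exponential backoff.
# Selectors and the dict-backed "page" are illustrative stand-ins.

def query_with_fallbacks(query, selectors):
    """Try each selector in order instead of trusting a single one."""
    for sel in selectors:
        element = query(sel)
        if element is not None:
            return element
    raise LookupError(f"no selector matched: {selectors}")

def with_backoff(action, attempts=4, base_delay=0.01):
    """Retry a flaky step, doubling the wait between attempts."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Simulated page where the old selector broke after an A/B test.
dom = {"[data-testid='price']": "$74"}
price = with_backoff(
    lambda: query_with_fallbacks(dom.get, [".price-old", "[data-testid='price']"])
)
print(price)
```

Layering the two means a selector that breaks in one session degrades to a fallback rather than derailing the whole workflow.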

Headless browser behavior was another hurdle. Running in headless mode required extra care around lazy-loaded content, scroll-triggered rendering, and the occasional CAPTCHA-adjacent challenge that wouldn't appear in a normal browser session. We spent significant time tuning page-load strategies to make the agent behave as reliably headless as it did headed.

On the voice side, managing WebSocket stability for the Nova Sonic 2 S2S connection over longer sessions required careful handling of reconnection logic and audio buffer management — details that aren't obvious until the pipeline is running under real conditions.
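The reconnection pattern can be sketched independently of any real socket library. The `connect` callable is injected so the example runs without a server; the retry counts and delays are illustrative, not the values used in production.

```python
import asyncio

# Conceptual sketch: keep a long-lived S2S session alive by reconnecting
# with exponential backoff. `connect` is injected so this runs standalone.

async def run_session(connect, max_retries=3):
    delay = 0.01
    for attempt in range(max_retries + 1):
        try:
            return await connect()
        except ConnectionError:
            if attempt == max_retries:
                raise
            await asyncio.sleep(delay)
            delay *= 2  # back off before the next attempt

# Fake connection that fails twice before succeeding.
attempts = {"n": 0}
async def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("socket dropped")
    return "session established"

result = asyncio.run(run_session(flaky_connect))
print(result)
```

The real pipeline also has to preserve and replay buffered audio across a reconnect, which is the part that only surfaces under real conditions.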

Finally, there was the challenge of stitching the two services together cleanly. The voice service and the browser agent service run as separate Python processes. Making them communicate reliably — passing structured intent from one to the other and surfacing results back to the frontend in near real time — required thoughtful API design between main_voice.py and main_browser_service.py.


Built With

  • AWS Nova Sonic 2 S2S — real-time speech-to-speech voice understanding
  • Amazon Bedrock Nova Lite 2 — LLM backbone for browser agent reasoning
  • browser-use — browser automation framework for the agent layer
  • Pipecat — voice pipeline orchestration
  • Playwright — headless browser automation
  • React — frontend user interface
  • Python — backend services and agent orchestration
  • Docker — containerized deployment
  • Booking.com — target travel booking platform
