Inspiration

Travelers spend 5+ hours on average researching and planning a single trip, juggling flight aggregators, hotel sites, weather apps, maps, and review platforms across dozens of browser tabs. We asked: what if you could just talk to someone who handles all of it?

NovaTour was born from the vision of a voice-first travel intelligence — an AI companion that doesn't just answer questions but actively orchestrates your entire trip in real-time conversation. Inspired by the way a seasoned travel concierge works — listening to your preferences, pulling together options, adjusting the plan on the fly — we set out to build the world's first fully voice-driven, end-to-end travel planning and booking agent powered entirely by Amazon Nova.

The launch of Amazon Nova Sonic (bidirectional speech-to-speech) and Nova Act (autonomous browser control) made this vision technically possible for the first time: a single AI system that can hear you, reason about your trip, search real-time travel data, generate a visual itinerary, and book your flights — all in one seamless voice conversation.

What It Does

NovaTour is a full-duplex voice AI travel assistant that:

  1. Listens & Understands — Real-time speech recognition via Amazon Nova Sonic with barge-in support (interrupt the AI mid-sentence, just like a real conversation)
  2. Searches & Reasons — Orchestrates 8 specialized travel tools in real-time: flights, hotels, attractions, routes, weather, and more
  3. Plans & Visualizes — Generates day-by-day itineraries with Amazon Nova Lite, rendered as interactive timelines and maps with route polylines
  4. Books Autonomously — Uses Amazon Nova Act to navigate Google Flights and complete real bookings through browser automation
  5. Adapts Verbosity — A novel Level-of-Detail (LOD) system with 60+ bilingual trigger patterns lets users dynamically control response depth — from quick facts to immersive podcast-style narration

How We Built It

Architecture:

Browser (Next.js 16 + React 19)
  ↕ WebSocket (full-duplex audio + events)
FastAPI Backend (Python 3.13)
  ├── Strands BidiAgent (Nova Sonic wrapper)
  ├── 8 Travel Tools (@tool decorated)
  ├── LOD Adaptive System (60+ patterns)
  └── 3-Tier Resilience Engine
AWS Services
  ├── Amazon Nova Sonic (voice)
  ├── Amazon Nova Lite (reasoning)
  ├── Amazon Nova Act (booking)
  └── DynamoDB + S3 (persistence)

Voice Pipeline: We built a custom bidirectional audio streaming pipeline using the Strands Agents SDK's BidiAgent class. The browser captures microphone audio, resamples it from the device's native rate to 16 kHz PCM, base64-encodes it, and streams it over WebSocket at ~85 ms intervals. The backend feeds this into Nova Sonic and simultaneously streams back 24 kHz audio responses, transcripts, and tool call events. The result is voice interaction with sub-second response latency and full barge-in support.
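To make the chunk sizes concrete, here is a minimal sketch of the resample-and-encode step. The real capture code runs in the browser; this Python version only mirrors the logic. The function names, the linear-interpolation resampler, and the 16-bit little-endian sample format are our illustrative assumptions, not the exact production code.

```python
import base64
import struct

TARGET_RATE = 16_000  # Hz, the input rate described above


def resample_linear(samples: list[float], src_rate: int,
                    dst_rate: int = TARGET_RATE) -> list[float]:
    """Resample mono float samples by linear interpolation.

    Assumption: no anti-aliasing filter is applied; a production
    downsampler would low-pass the signal first.
    """
    if src_rate == dst_rate or not samples:
        return list(samples)
    out_len = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


def encode_chunk(samples: list[float]) -> str:
    """Clamp to [-1, 1], pack as 16-bit little-endian PCM, base64-encode."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    pcm = struct.pack(f"<{len(ints)}h", *ints)
    return base64.b64encode(pcm).decode("ascii")
```

Under these assumptions, an 85 ms chunk captured at 48 kHz holds 4,080 samples, which resample to 1,360 samples at 16 kHz, or 2,720 bytes of PCM before base64 encoding.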

Tool Orchestration: Each of our 8 travel tools is built as a Strands @tool-decorated function with:

  • Primary API integration (Google Places, Google Routes, OpenWeather, Gemini Search, Nova Lite, Nova Act)
  • Automatic mock fallback for resilient demo/testing
  • @retry_api_call() decorator with exponential backoff
  • Error classification (is_recoverable()) for intelligent retry vs. fail-fast decisions
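The pattern above can be sketched roughly as follows. Only the names retry_api_call and is_recoverable come from our codebase; the signatures, the RecoverableError class, the with_mock_fallback helper, and the get_weather example are hypothetical stand-ins for illustration. In the real tools the function would also carry the Strands @tool decorator, which is omitted here so the sketch runs on the standard library alone.

```python
import functools
import time


class RecoverableError(Exception):
    """Hypothetical marker for transient failures (e.g. HTTP 429/5xx)."""


def is_recoverable(exc: Exception) -> bool:
    # Assumed classification: retry transient errors, fail fast on the rest.
    return isinstance(exc, (RecoverableError, TimeoutError, ConnectionError))


def retry_api_call(max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry recoverable errors, doubling the delay after each attempt."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    is_last = attempt == max_attempts - 1
                    if is_last or not is_recoverable(exc):
                        raise
                    sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator


def with_mock_fallback(mock_fn):
    """Return mock data if the real call still fails after retries."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                return mock_fn(*args, **kwargs)
        return wrapper
    return decorator


def mock_weather(city: str) -> dict:
    return {"city": city, "summary": "mild", "source": "mock"}


@with_mock_fallback(mock_weather)
@retry_api_call(max_attempts=3, base_delay=0.01)
def get_weather(city: str) -> dict:
    # Placeholder for the real API call; here it simulates an outage.
    raise RecoverableError("upstream unavailable")
```

In this sketch, get_weather("Lisbon") is attempted three times and then falls back to the mock payload, so the voice session keeps going even when the upstream API is down.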

LOD System: Our most innovative feature — a 3-level verbosity control system:

  • LOD 1 (Brief): 15–40 words, quick answers for time-pressed travelers
  • LOD 2 (Standard): 80–150 words, conversational recommendations
  • LOD 3 (Narrative): 400–800 words, immersive podcast-style storytelling with sensory details

The system uses 60+ bilingual (English/Chinese) trigger patterns with priority-based signal classification (explicit > implicit) and confidence scoring. Users can switch modes naturally: "tell me more" → LOD 3, "keep it short" → LOD 1. System prompts are dynamically interpolated without restarting the voice session.
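A stripped-down sketch of the detection and prompt interpolation follows. The handful of patterns, the confidence values, the threshold, and the prompt template are illustrative assumptions; the real system uses a much larger pattern set. The word ranges are the ones listed above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LodSignal:
    level: int
    confidence: float
    explicit: bool


# (pattern, target level, explicit?, confidence); a tiny illustrative subset.
PATTERNS = [
    (re.compile(r"\b(keep it short|be brief)\b", re.I), 1, True, 0.95),
    (re.compile(r"简短|长话短说"), 1, True, 0.95),
    (re.compile(r"\b(tell me more|more details?)\b", re.I), 3, True, 0.95),
    (re.compile(r"详细|多讲一点"), 3, True, 0.95),
    (re.compile(r"\b(in a hurry|quickly)\b", re.I), 1, False, 0.6),
    (re.compile(r"\b(curious|fascinating)\b", re.I), 3, False, 0.6),
]

WORD_RANGES = {1: (15, 40), 2: (80, 150), 3: (400, 800)}
PROMPT_TEMPLATE = "You are a travel assistant. Answer in {lo}-{hi} words."


def detect_lod(utterance: str, current: int, min_confidence: float = 0.5) -> int:
    """Return the new LOD level, or the current one if nothing matches."""
    signals = [
        LodSignal(level, conf, explicit)
        for pattern, level, explicit, conf in PATTERNS
        if conf >= min_confidence and pattern.search(utterance)
    ]
    if not signals:
        return current
    # Explicit requests outrank implicit cues; confidence breaks ties.
    best = max(signals, key=lambda s: (s.explicit, s.confidence))
    return best.level


def build_prompt(level: int) -> str:
    """Interpolate the word range for the given level into the prompt."""
    lo, hi = WORD_RANGES[level]
    return PROMPT_TEMPLATE.format(lo=lo, hi=hi)
```

For example, "I'm in a hurry but tell me more" matches both an implicit LOD 1 cue and an explicit LOD 3 request, and the explicit request wins.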

3-Tier Resilience: Every component follows our fallback architecture:

  1. Primary path (BidiAgent + real APIs)
  2. Retry with exponential backoff
  3. MockAgent fallback (ensures the app never crashes)

Plus: idle timeout detection (45s), WebSocket auto-reconnect (3 attempts), and TTS sanitization (strips markdown for natural speech).
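As an illustration of the TTS sanitization step, the sketch below strips common markdown constructs before text is spoken. The function name and the exact set of rules are our assumptions for this example; the production sanitizer may cover more cases.

```python
import re


def sanitize_for_tts(text: str) -> str:
    """Remove markdown syntax that would sound unnatural when spoken."""
    # Drop fenced code blocks entirely.
    text = re.sub(r"```.*?```", " ", text, flags=re.S)
    # Replace links and images with their label text.
    text = re.sub(r"!?\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Strip heading markers at the start of a line.
    text = re.sub(r"^\s{0,3}#{1,6}\s+", "", text, flags=re.M)
    # Strip bullet and numbered-list markers at the start of a line.
    text = re.sub(r"^\s*(?:[-*+•]|\d+\.)\s+", "", text, flags=re.M)
    # Unwrap bold, italic, and inline code.
    text = re.sub(r"(\*\*|\*|`)(.+?)\1", r"\2", text)
    # Collapse whitespace so the result is a single spoken sentence stream.
    return re.sub(r"\s+", " ", text).strip()
```

Underscore emphasis is deliberately left alone in this sketch so that identifiers such as search_places are not mangled.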

Frontend: Built with Next.js 16 and React 19 — no external UI libraries. The interface features:

  • Real-time voice transcript display with interim/final states
  • Interactive itinerary timeline with activity photos
  • MapLibre GL map with day-coded markers and route polylines
  • Nova Act booking progress overlay with live screenshots
  • LOD selector for manual verbosity control

Challenges We Faced

  1. Full-Duplex Audio Synchronization: Achieving gapless 24 kHz audio playback while simultaneously streaming 16 kHz input required careful AudioContext scheduling and buffer management. We solved this with scheduled AudioBufferSourceNode chains and a nextStartTime tracker.

  2. Barge-In Handling: When the user interrupts the AI mid-sentence, we have to instantly clear the audio playback buffer, cancel pending responses, and transition the voice state machine. This required a custom VoiceStateMachine with 4 states and validated transitions.

  3. Nova Act Dependency Conflict: Nova Act requires strands-agents ≤1.23.0, but BidiAgent (Nova Sonic) needs ≥1.30.0. We solved this with isolated installations (--no-deps) and runtime ImportError handling.

  4. Tool Result Enrichment: Itineraries generated by Nova Lite lack coordinates for activities. We built a places cache that stores coordinates from search_places calls and automatically injects them into itinerary activities, enabling map visualization.

  5. Bilingual LOD Detection: Supporting both English and Chinese trigger patterns required careful priority ordering and confidence scoring to avoid false positives from ambiguous phrases.
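The barge-in logic from challenge 2 can be sketched as a small state machine. The four state names, the transition table, and the in-memory playback queue are assumptions made for this example; only the VoiceStateMachine name and the idea of 4 states with validated transitions come from our implementation.

```python
from enum import Enum


class VoiceState(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"


# Assumed transition table: each state maps to the states it may move to.
ALLOWED = {
    VoiceState.IDLE: {VoiceState.LISTENING},
    VoiceState.LISTENING: {VoiceState.THINKING, VoiceState.IDLE},
    VoiceState.THINKING: {VoiceState.SPEAKING, VoiceState.LISTENING, VoiceState.IDLE},
    VoiceState.SPEAKING: {VoiceState.LISTENING, VoiceState.IDLE},
}


class InvalidTransition(Exception):
    pass


class VoiceStateMachine:
    def __init__(self) -> None:
        self.state = VoiceState.IDLE
        self.playback_queue: list[bytes] = []

    def transition(self, target: VoiceState) -> None:
        """Move to the target state, rejecting transitions not in the table."""
        if target not in ALLOWED[self.state]:
            raise InvalidTransition(f"{self.state.value} -> {target.value}")
        self.state = target

    def enqueue_audio(self, chunk: bytes) -> None:
        """Queue assistant audio only while the assistant is speaking."""
        if self.state is VoiceState.SPEAKING:
            self.playback_queue.append(chunk)

    def barge_in(self) -> None:
        """User interrupted: drop queued audio and go back to listening."""
        if self.state in (VoiceState.SPEAKING, VoiceState.THINKING):
            self.playback_queue.clear()
            self.transition(VoiceState.LISTENING)
```

Validating every transition against a table makes illegal jumps, such as going from idle straight to speaking, fail loudly instead of leaving the UI in an inconsistent state.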

What We Learned

  • Amazon Nova Sonic's bidirectional streaming is remarkably low-latency — achieving near-human conversational feel
  • The Strands Agents SDK's BidiAgent abstraction elegantly handles the complexity of voice + tool orchestration
  • Adaptive verbosity (LOD) dramatically improves voice UX — users instinctively adjust detail level
  • Resilience engineering is non-negotiable for voice applications — any dropped frame or timeout breaks the conversational illusion

What's Next

  • Multi-language expansion beyond English and Chinese
  • Persistent trip memory using DynamoDB sessions (infrastructure already provisioned)
  • Collaborative planning — multiple travelers in one voice session
  • Nova Multimodal Embeddings for destination-aware recommendations (model configured, integration planned)
