Inspiration

An estimated 250 million people worldwide live with visual impairments, yet most AI tools still assume you can see the interface. I wanted to build something that flips that assumption entirely — an AI companion that works for the user, not the other way around.

The demo scenario that drove every decision: a blind user named Sarah walks into a restaurant alone. She can't read the menu. She doesn't know what's around her. She has no way to check if a dish has allergens. Sight fixes all of that — hands-free, voice-only, no buttons required.

What it does

Sight is a real-time voice-first accessibility companion for visually impaired users. Say "Hey Sight" to activate, then ask anything:

  • "What do you see?" — Sight describes the room, objects, and people in front of you
  • "Read this menu" — points camera at a menu, reads items, prices, and allergens aloud
  • "Any pasta without meat?" — filters menu options by dietary preference
  • "Find Chinese food near me" — searches nearby restaurants using AWS Location Service
  • "How do I get to the nearest pharmacy?" — gives turn-by-turn directions by voice

Everything is spoken back. No screen needed. No tapping. No reading.

How I built it

Voice pipeline: Amazon Nova 2 Sonic on Amazon Bedrock handles all speech input and output in a single speech-to-speech model. Pipecat orchestrates the real-time audio pipeline with WebRTC transport via Daily.co, and Silero VAD detects when the user starts and stops speaking.
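
Here's roughly how that pipeline is wired together, as a minimal sketch: import paths follow recent Pipecat releases and may differ by version, and the Daily room URL is a placeholder.

```python
import asyncio
import os

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.aws_nova_sonic.aws import AWSNovaSonicLLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport

async def main():
    # Daily.co provides the WebRTC transport; Silero VAD segments speech.
    transport = DailyTransport(
        "https://example.daily.co/sight",  # placeholder room URL
        None,
        "Sight",
        DailyParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )

    # One speech-to-speech model covers both speech recognition and synthesis.
    llm = AWSNovaSonicLLMService(
        access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
        secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
        region=os.getenv("AWS_REGION", "us-east-1"),
    )

    # Context aggregators and the handle_query tool are omitted here;
    # the tool wiring is sketched in the Challenges section.
    pipeline = Pipeline([transport.input(), llm, transport.output()])
    await PipelineRunner().run(PipelineTask(pipeline))

if __name__ == "__main__":
    asyncio.run(main())
```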

Vision: When the user asks for a scene or menu description, the phone camera captures a frame and sends it to the backend as base64. Amazon Nova 2 Lite on Amazon Bedrock analyzes the image and returns a structured description.
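
That round trip is a single Bedrock Converse call. A sketch, assuming the standard Nova Lite model ID (substitute the Nova 2 Lite ID available in your region):

```python
import base64

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def describe_frame(frame_b64: str) -> str:
    """Send one camera frame to Nova Lite; return a description to speak aloud."""
    resp = bedrock.converse(
        modelId="amazon.nova-lite-v1:0",  # assumed ID; swap in the Nova 2 Lite one
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "jpeg",
                           "source": {"bytes": base64.b64decode(frame_b64)}}},
                {"text": "Describe this scene for a blind user: "
                         "objects, people, and layout, in speakable prose."},
            ],
        }],
    )
    return resp["output"]["message"]["content"][0]["text"]
```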

Intelligence: A Strands SDK agent backed by Nova 2 Lite handles tool routing — it reasons about the user's request and picks the right tool automatically. Nova Sonic only needs to know about one function (handle_query), and Strands decides whether to call describe_scene, read_menu, find_places, get_directions, or get_current_time.
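
A sketch of that delegation with the Strands SDK (the model ID is assumed, and the imported helper modules are hypothetical stand-ins for Sight's real ones):

```python
from datetime import datetime

from strands import Agent, tool

# Hypothetical modules standing in for Sight's real ones (see the other sketches).
import camera_store                 # thread-safe frame store (Challenges section)
from vision import describe_frame   # the Nova Lite call sketched above

@tool
def get_current_time() -> str:
    """Tell the user the current local time."""
    return datetime.now().strftime("%I:%M %p")

@tool
def describe_scene() -> str:
    """Describe what the phone camera currently sees."""
    return describe_frame(camera_store.get_frame())

agent = Agent(
    model="us.amazon.nova-lite-v1:0",  # assumed ID; swap in Nova 2 Lite's
    tools=[get_current_time, describe_scene],  # + read_menu, find_places, get_directions
    system_prompt="You are Sight's router. Answer every request by calling exactly one tool.",
)

def handle_query(query: str) -> str:
    """The single function Nova Sonic knows about; Strands picks the real tool."""
    return str(agent(query))
```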

Location: Amazon Location Service (geo-places + geo-routes) handles all place search, geocoding, and routing — replacing third-party APIs entirely and keeping everything on AWS.
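
Both APIs have standalone boto3 clients. A sketch with illustrative parameters; the exact response field names are worth verifying against the SDK docs:

```python
import boto3

places = boto3.client("geo-places", region_name="us-east-1")
routes = boto3.client("geo-routes", region_name="us-east-1")

def find_places(query: str, lng: float, lat: float) -> list[str]:
    """Free-text place search biased toward the user's position."""
    resp = places.search_text(
        QueryText=query,           # e.g. "Chinese food"
        BiasPosition=[lng, lat],   # Location APIs use [longitude, latitude]
        MaxResults=5,
    )
    return [item["Title"] for item in resp.get("ResultItems", [])]

def get_directions(origin: list[float], dest: list[float]) -> list[dict]:
    """Walking route between two [lng, lat] points."""
    resp = routes.calculate_routes(
        Origin=origin,
        Destination=dest,
        TravelMode="Pedestrian",
    )
    # Each route's legs carry the per-step instructions Sight reads aloud.
    return resp.get("Routes", [])
```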

Wake word: Porcupine by Picovoice detects "Hey Sight" on the Mac backend, activating the voice pipeline without any button press.
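
The listener is a tight loop over microphone frames. A sketch using pvporcupine and pvrecorder; the access key and .ppn path are placeholders:

```python
import pvporcupine
from pvrecorder import PvRecorder

porcupine = pvporcupine.create(
    access_key="YOUR_PICOVOICE_ACCESS_KEY",  # placeholder
    keyword_paths=["hey-sight_mac.ppn"],     # placeholder custom keyword model
)
recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

try:
    while True:
        # process() returns the keyword index on detection, -1 otherwise.
        if porcupine.process(recorder.read()) >= 0:
            print("Wake word detected")
            # ...signal the voice pipeline to start listening...
except KeyboardInterrupt:
    pass
finally:
    recorder.stop()
    porcupine.delete()
```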

Frontend: A mobile-optimized browser client served over ngrok — no app install needed. The phone camera streams frames to the backend every 3 seconds via POST /api/camera, stored in a thread-safe camera store that the vision tools pull from.
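
The receiving endpoint can be tiny. A sketch assuming FastAPI (the framework isn't named above), with a hypothetical image_b64 field:

```python
from fastapi import FastAPI
from pydantic import BaseModel

import camera_store  # thread-safe frame store, sketched under Challenges

app = FastAPI()

class Frame(BaseModel):
    image_b64: str  # hypothetical field name for the base64-encoded JPEG

@app.post("/api/camera")
async def receive_frame(frame: Frame) -> dict:
    """Called by the phone every ~3 seconds with the latest camera frame."""
    camera_store.set_frame(frame.image_b64)
    return {"ok": True}
```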

Challenges I ran into

  • Nova Sonic tool routing — Getting Nova Sonic to always call handle_query instead of answering from its own knowledge required careful system instruction engineering. The fix was making the tool description say "REQUIRED for ALL requests" and passing tools=tools to both the LLM constructor and the LLMContext (first sketch after this list).

  • Camera architecture — The initial approach used cv2.VideoCapture(0), which only works on a laptop. Switching to a mobile camera required building a thread-safe frame store (camera_store.py) that accepts base64 frames from the phone and serves them to the vision tools (second sketch after this list).

  • Wake word on mobile — The WASM version of Porcupine caused the phone browser to freeze. The solution was running wake word detection on the Mac backend using the native .ppn model instead.

  • Strands + Nova Sonic integration — Connecting the Strands delegated architecture to the Pipecat Nova Sonic pipeline required careful ordering: tool schema must be defined before the LLM constructor, and tools must be passed to both the service and the context.
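
Here's a sketch of that tool wiring, with class names from recent Pipecat releases (where the LLMContext role is played by OpenAILLMContext); whether the service constructor accepts tools= may depend on your version:

```python
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.aws_nova_sonic.aws import AWSNovaSonicLLMService

from sight_agent import handle_query  # hypothetical module from the Strands sketch

# 1. Define the schema BEFORE constructing the LLM service.
handle_query_schema = FunctionSchema(
    name="handle_query",
    description="REQUIRED for ALL requests. Route every user query here.",
    properties={"query": {"type": "string", "description": "the user's request, verbatim"}},
    required=["query"],
)
tools = ToolsSchema(standard_tools=[handle_query_schema])

# 2. Pass tools to BOTH the service constructor and the context.
llm = AWSNovaSonicLLMService(
    access_key_id="...", secret_access_key="...", region="us-east-1",
    tools=tools,
)
context = OpenAILLMContext(messages=[], tools=tools)

# 3. Register the handler that hands the query off to the Strands agent.
async def handle_query_handler(params):
    result = handle_query(params.arguments["query"])
    await params.result_callback(result)

llm.register_function("handle_query", handle_query_handler)
```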
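
And a minimal sketch of what camera_store.py might contain, latest frame wins:

```python
# camera_store.py: a thread-safe, latest-frame-wins store.
import threading

_lock = threading.Lock()
_latest_frame: str | None = None  # base64-encoded JPEG uploaded by the phone

def set_frame(frame_b64: str) -> None:
    """Called by POST /api/camera each time the phone uploads a frame."""
    global _latest_frame
    with _lock:
        _latest_frame = frame_b64

def get_frame() -> str | None:
    """Called by the vision tools; returns the most recent frame, if any."""
    with _lock:
        return _latest_frame
```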

What I learned

  • Nova Sonic is remarkably capable as a speech-to-speech model — the latency is low enough for natural conversation, and its tool-calling works reliably once the schema is wired correctly.
  • The Strands delegated architecture is powerful for open-ended queries — instead of hardcoding tool selection logic, the agent reasons about it dynamically using Nova 2 Lite.
  • Building for accessibility forces you to think differently about UX. Every design decision has to work without a screen — audio feedback, contextual fillers, wake word activation, and automatic connection all matter enormously when your user literally cannot see the interface.

What's next

  • Deploy to AWS ECS for production (currently runs locally with ngrok)
  • Android support
  • Wake word detection directly in the browser using WASM
  • Expand vision capabilities to include object detection, currency recognition, and facial expression reading
  • Multi-language support using Nova Sonic's language capabilities

Demo

Advay is visually impaired and walks into a restaurant alone.

  1. He says "Hey Sight" — a beep confirms activation, no button needed
  2. He points his phone at the menu and says "read this menu" — Sight reads every item, price, and allergen aloud
  3. He asks "any pasta without meat?" — Sight filters and responds
  4. He asks "does the mushroom pasta have nuts?" — Sight checks and warns him
  5. He says "find reviews for this restaurant" — Sight searches nearby

Fully hands-free. No screen. No tapping. Just voice.

Built With

  • amazonlocationservice
  • amazonnova2lite
  • amazonnova2sonic
  • bedrock
  • html/css
  • javascript
  • nova
  • pipecat
  • porcupine
  • python
  • strands
  • webrtc