Inspiration
250 million people worldwide live with visual impairments, yet most AI tools still assume you can see the interface. I wanted to build something that flips that assumption entirely — an AI companion that works for the user, not the other way around.
The demo scenario that drove every decision: a blind user named Sarah walks into a restaurant alone. She can't read the menu. She doesn't know what's around her. She has no way to check if a dish has allergens. Sight fixes all of that — hands-free, voice-only, no buttons required.
What it does
Sight is a real-time voice-first accessibility companion for visually impaired users. Say "Hey Sight" to activate, then ask anything:
- "What do you see?" — Sight describes the room, objects, and people in front of you
- "Read this menu" — points camera at a menu, reads items, prices, and allergens aloud
- "Any pasta without meat?" — filters menu options by dietary preference
- "Find Chinese food near me" — searches nearby restaurants using AWS Location Service
- "How do I get to the nearest pharmacy?" — gives turn-by-turn directions by voice
Everything is spoken back. No screen needed. No tapping. No reading.
How I built it
Voice pipeline: Amazon Nova 2 Sonic via AWS Bedrock handles all speech input and output in a single speech-to-speech model. Pipecat orchestrates the real-time audio pipeline with WebRTC transport via Daily.co, and Silero VAD detects when the user starts and stops speaking.
Vision: When a scene or menu description is requested, the phone camera captures a frame and sends it as base64 to the backend. Amazon Nova 2 Lite via AWS Bedrock analyzes the image and returns a structured description.
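The decode-and-request step can be sketched as below. The prompt text and the commented model ID are assumptions; the Bedrock Converse API itself takes raw image bytes, so the backend decodes the phone's base64 first.

```python
# Sketch: turn a base64 camera frame into a Bedrock Converse request
# for an image description. Prompt and model ID are assumptions.
import base64


def build_describe_request(frame_b64: str,
                           prompt: str = "Describe this scene for a blind user."):
    """Decode the phone's base64 JPEG and build a Converse message payload."""
    image_bytes = base64.b64decode(frame_b64)
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
                {"text": prompt},
            ],
        }]
    }

# Actual call (requires AWS credentials; model ID is a placeholder):
# import boto3
# client = boto3.client("bedrock-runtime")
# resp = client.converse(modelId="amazon.nova-lite-v1:0",
#                        **build_describe_request(frame_b64))
# description = resp["output"]["message"]["content"][0]["text"]
```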
Intelligence: A Strands SDK agent backed by Nova 2 Lite handles tool routing — it reasons about the user's request and picks the right tool automatically. Nova Sonic only needs to know about one function (`handle_query`), and Strands decides whether to call `describe_scene`, `read_menu`, `find_places`, `get_directions`, or `get_current_time`.
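The single-entry-point pattern above can be sketched as follows. The tool registry and the stub router are hypothetical stand-ins — in Sight the Strands agent (shown in the comment, with an assumed API) does the actual reasoning, not a hardcoded lookup.

```python
# Sketch of the single-entry-point pattern: the speech model only sees
# handle_query; a delegated agent chooses the concrete tool. The registry
# and stub router below are illustrative, not the project's code.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {}


def tool(fn):
    """Register a function as a tool the delegated agent may choose."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def describe_scene(query: str) -> str:
    return "scene description"            # placeholder result


@tool
def get_current_time(query: str) -> str:
    return "12:00"                        # placeholder result


def handle_query(query: str) -> str:
    # In Sight, a Strands agent backed by Nova 2 Lite reasons over the
    # registered tools, roughly (assumed API):
    #   from strands import Agent
    #   agent = Agent(tools=list(TOOLS.values()))
    #   return agent(query)
    # Stub: pretend the agent chose get_current_time for this query.
    return TOOLS["get_current_time"](query)
```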
Location: AWS Location Service (geo-places + geo-routes) handles all place search, geocoding, and routing — replacing third-party APIs entirely and keeping everything on AWS.
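A place-search tool for a voice-only interface also has to render results as speech. The helper below is a small illustrative formatter; the commented boto3 call assumes the standalone `geo-places` client and its `search_text` operation, so treat the exact names and response fields as assumptions to verify.

```python
# Sketch: format an Amazon Location place-search hit as a voice-friendly
# sentence. The commented client call assumes the standalone geo-places API.
def speak_place(title: str, distance_m: float) -> str:
    """Render one search hit as a sentence suitable for text-to-speech."""
    if distance_m >= 1000:
        return f"{title}, about {distance_m / 1000:.1f} kilometers away"
    return f"{title}, about {round(distance_m)} meters away"

# Actual search (requires AWS credentials; names are assumptions):
# import boto3
# places = boto3.client("geo-places")
# resp = places.search_text(QueryText="Chinese food",
#                           BiasPosition=[lon, lat])  # [longitude, latitude]
# spoken = [speak_place(r["Title"], r["Distance"]) for r in resp["ResultItems"]]
```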
Wake word: Porcupine by Picovoice detects "Hey Sight" on the Mac backend, activating the voice pipeline without any button press.
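Porcupine consumes fixed-size PCM frames, so the backend loop amounts to chunking the mic stream and calling `process` per frame. The chunking helper below is pure; the commented detection loop uses the real `pvporcupine` API, but the access key, `.ppn` path, and `start_voice_pipeline` hook are placeholders.

```python
# Sketch of backend wake-word detection with Porcupine. The chunking
# helper is pure; key, model path, and pipeline hook are placeholders.
def frames(pcm: list[int], frame_length: int):
    """Split a PCM sample stream into full Porcupine-sized frames."""
    for i in range(0, len(pcm) - frame_length + 1, frame_length):
        yield pcm[i:i + frame_length]

# import pvporcupine
# porcupine = pvporcupine.create(
#     access_key="YOUR_PICOVOICE_KEY",       # placeholder
#     keyword_paths=["hey-sight_mac.ppn"],   # custom "Hey Sight" model
# )
# for frame in frames(mic_samples, porcupine.frame_length):
#     if porcupine.process(frame) >= 0:      # keyword index, or -1
#         start_voice_pipeline()             # hypothetical activation hook
```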
Frontend: A mobile-optimized browser client served over ngrok — no app install needed. The phone camera streams frames to the backend every 3 seconds via `POST /api/camera`, stored in a thread-safe camera store that the vision tools pull from.
Challenges I ran into
- Nova Sonic tool routing — Getting Nova Sonic to always call `handle_query` instead of answering from its own knowledge required careful system-instruction engineering. The fix was making the tool description say "REQUIRED for ALL requests" and passing `tools=tools` to both the LLM constructor and the `LLMContext`.
- Camera architecture — The initial approach used `cv2.VideoCapture(0)`, which only works on a laptop. Switching to a mobile camera required building a thread-safe frame store (`camera_store.py`) that accepts base64 frames from the phone and serves them to the vision tools.
- Wake word on mobile — The WASM version of Porcupine caused the phone browser to freeze. The solution was running wake-word detection on the Mac backend using the native `.ppn` model instead.
- Strands + Nova Sonic integration — Connecting the Strands delegated architecture to the Pipecat Nova Sonic pipeline required careful ordering: the tool schema must be defined before the LLM constructor, and tools must be passed to both the service and the context.
What I learned
- Nova Sonic is remarkably capable as a speech-to-speech model — the latency is low enough for natural conversation, and its tool-calling works reliably once the schema is wired correctly.
- The Strands delegated architecture is powerful for open-ended queries — instead of hardcoding tool selection logic, the agent reasons about it dynamically using Nova 2 Lite.
- Building for accessibility forces you to think differently about UX. Every design decision has to work without a screen — audio feedback, contextual fillers, wake word activation, and automatic connection all matter enormously when your user literally cannot see the interface.
What's next
- Deploy to AWS ECS for production (currently runs locally with ngrok)
- Android support
- Wake word detection directly in the browser using WASM
- Expand vision capabilities to include object detection, currency recognition, and facial expression reading
- Multi-language support using Nova Sonic's language capabilities
Demo
Advay is visually impaired and walks into a restaurant alone.
- He says "Hey Sight" — a beep confirms activation, no button needed
- He points his phone at the menu and says "read this menu" — Sight reads every item, price, and allergen aloud
- He asks "any pasta without meat?" — Sight filters and responds
- He asks "does the mushroom pasta have nuts?" — Sight checks and warns him
- He says "find reviews for this restaurant" — Sight searches nearby
Fully hands-free. No screen. No tapping. Just voice.
Built With
- amazonlocationservice
- amazonnova2lite
- amazonnova2sonic
- bedrock
- html/css
- javascript
- nova
- pipecat
- porcupine
- python
- strands
- webrtc