Inspiration

An estimated 250 million people worldwide live with visual impairments, yet most AI tools still assume you can see the interface. I wanted to build something that flips that assumption entirely — an AI companion that works for the user, not the other way around.

The demo scenario that drove every decision: a blind user named Sarah walks into a restaurant alone. She can't read the menu. She doesn't know what's around her. She has no way to check if a dish has allergens. Sight fixes all of that — hands-free, voice-only, no buttons required.

What it does

Sight is a real-time voice-first accessibility companion for visually impaired users. Say "Hey Sight" to activate, then ask anything:

  • "What do you see?" — Sight describes the room, objects, and people in front of you
  • "Read this menu" — points camera at a menu, reads items, prices, and allergens aloud
  • "Any pasta without meat?" — filters menu options by dietary preference
  • "Find Chinese food near me" — searches nearby restaurants using AWS Location Service
  • "How do I get to the nearest pharmacy?" — gives turn-by-turn directions by voice

Everything is spoken back. No screen needed. No tapping. No reading.

How I built it

Voice pipeline: Amazon Nova 2 Sonic on Amazon Bedrock handles all speech input and output in a single speech-to-speech model. Pipecat orchestrates the real-time audio pipeline with WebRTC transport via Daily.co, and Silero VAD detects when the user starts and stops speaking.
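
Here's roughly how that pipeline is wired together, as a minimal sketch: import paths follow recent Pipecat releases and may differ by version, and the Daily room URL is a placeholder.

```python
import asyncio
import os

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.aws_nova_sonic.aws import AWSNovaSonicLLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport

async def main():
    # Daily.co provides the WebRTC transport; Silero VAD segments speech.
    transport = DailyTransport(
        "https://example.daily.co/sight",  # placeholder room URL
        None,
        "Sight",
        DailyParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )

    # One speech-to-speech model covers both speech recognition and synthesis.
    llm = AWSNovaSonicLLMService(
        access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
        secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
        region=os.getenv("AWS_REGION", "us-east-1"),
    )

    # Context aggregators and the handle_query tool are omitted here;
    # the tool wiring is sketched in the Challenges section.
    pipeline = Pipeline([transport.input(), llm, transport.output()])
    await PipelineRunner().run(PipelineTask(pipeline))

if __name__ == "__main__":
    asyncio.run(main())
```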

Vision: When the user asks for a scene or menu description, the phone camera captures a frame and sends it to the backend as base64. Amazon Nova 2 Lite on Amazon Bedrock analyzes the image and returns a structured description.
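
That round trip is a single Bedrock Converse call. A sketch, assuming the standard Nova Lite model ID (substitute the Nova 2 Lite ID available in your region):

```python
import base64

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def describe_frame(frame_b64: str) -> str:
    """Send one camera frame to Nova Lite; return a description to speak aloud."""
    resp = bedrock.converse(
        modelId="amazon.nova-lite-v1:0",  # assumed ID; swap in the Nova 2 Lite one
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "jpeg",
                           "source": {"bytes": base64.b64decode(frame_b64)}}},
                {"text": "Describe this scene for a blind user: "
                         "objects, people, and layout, in speakable prose."},
            ],
        }],
    )
    return resp["output"]["message"]["content"][0]["text"]
```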

Intelligence: A Strands SDK agent backed by Nova 2 Lite handles tool routing — it reasons about the user's request and picks the right tool automatically. Nova Sonic only needs to know about one function (handle_query), and Strands decides whether to call describe_scene, read_menu, find_places, get_directions, or get_current_time.
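
A sketch of that delegation with the Strands SDK (the model ID is assumed, and the imported helper modules are hypothetical stand-ins for Sight's real ones):

```python
from datetime import datetime

from strands import Agent, tool

# Hypothetical modules standing in for Sight's real ones (see the other sketches).
import camera_store                 # thread-safe frame store (Challenges section)
from vision import describe_frame   # the Nova Lite call sketched above

@tool
def get_current_time() -> str:
    """Tell the user the current local time."""
    return datetime.now().strftime("%I:%M %p")

@tool
def describe_scene() -> str:
    """Describe what the phone camera currently sees."""
    return describe_frame(camera_store.get_frame())

agent = Agent(
    model="us.amazon.nova-lite-v1:0",  # assumed ID; swap in Nova 2 Lite's
    tools=[get_current_time, describe_scene],  # + read_menu, find_places, get_directions
    system_prompt="You are Sight's router. Answer every request by calling exactly one tool.",
)

def handle_query(query: str) -> str:
    """The single function Nova Sonic knows about; Strands picks the real tool."""
    return str(agent(query))
```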

Location: Amazon Location Service (geo-places + geo-routes) handles all place search, geocoding, and routing — replacing third-party APIs entirely and keeping everything on AWS.
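
Both APIs have standalone boto3 clients. A sketch with illustrative parameters; the exact response field names are worth verifying against the SDK docs:

```python
import boto3

places = boto3.client("geo-places", region_name="us-east-1")
routes = boto3.client("geo-routes", region_name="us-east-1")

def find_places(query: str, lng: float, lat: float) -> list[str]:
    """Free-text place search biased toward the user's position."""
    resp = places.search_text(
        QueryText=query,           # e.g. "Chinese food"
        BiasPosition=[lng, lat],   # Location APIs use [longitude, latitude]
        MaxResults=5,
    )
    return [item["Title"] for item in resp.get("ResultItems", [])]

def get_directions(origin: list[float], dest: list[float]) -> list[dict]:
    """Walking route between two [lng, lat] points."""
    resp = routes.calculate_routes(
        Origin=origin,
        Destination=dest,
        TravelMode="Pedestrian",
    )
    # Each route's legs carry the per-step instructions Sight reads aloud.
    return resp.get("Routes", [])
```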

Wake word: Porcupine by Picovoice detects "Hey Sight" on the Mac backend, activating the voice pipeline without any button press.
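
The listener is a tight loop over microphone frames. A sketch using pvporcupine and pvrecorder; the access key and .ppn path are placeholders:

```python
import pvporcupine
from pvrecorder import PvRecorder

porcupine = pvporcupine.create(
    access_key="YOUR_PICOVOICE_ACCESS_KEY",  # placeholder
    keyword_paths=["hey-sight_mac.ppn"],     # placeholder custom keyword model
)
recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

try:
    while True:
        # process() returns the keyword index on detection, -1 otherwise.
        if porcupine.process(recorder.read()) >= 0:
            print("Wake word detected")
            # ...signal the voice pipeline to start listening...
except KeyboardInterrupt:
    pass
finally:
    recorder.stop()
    porcupine.delete()
```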

Frontend: A mobile-optimized browser client served over ngrok — no app install needed. The phone camera streams frames to the backend every 3 seconds via POST /api/camera, stored in a thread-safe camera store that the vision tools pull from.
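
The receiving endpoint can be tiny. A sketch assuming FastAPI (the framework isn't named above), with a hypothetical image_b64 field:

```python
from fastapi import FastAPI
from pydantic import BaseModel

import camera_store  # thread-safe frame store, sketched under Challenges

app = FastAPI()

class Frame(BaseModel):
    image_b64: str  # hypothetical field name for the base64-encoded JPEG

@app.post("/api/camera")
async def receive_frame(frame: Frame) -> dict:
    """Called by the phone every ~3 seconds with the latest camera frame."""
    camera_store.set_frame(frame.image_b64)
    return {"ok": True}
```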

Challenges I ran into

  • Nova Sonic tool routing — Getting Nova Sonic to always call handle_query instead of answering from its own knowledge required careful system instruction engineering. The fix was making the tool description say "REQUIRED for ALL requests" and passing tools=tools to both the LLM constructor and the LLMContext (first sketch after this list).

  • Camera architecture — The initial approach used cv2.VideoCapture(0), which only works on a laptop. Switching to a mobile camera required building a thread-safe frame store (camera_store.py) that accepts base64 frames from the phone and serves them to the vision tools (second sketch after this list).

  • Wake word on mobile — The WASM version of Porcupine caused the phone browser to freeze. The solution was running wake word detection on the Mac backend using the native .ppn model instead.

  • Strands + Nova Sonic integration — Connecting the Strands delegated architecture to the Pipecat Nova Sonic pipeline required careful ordering: tool schema must be defined before the LLM constructor, and tools must be passed to both the service and the context.
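
Here's a sketch of that tool wiring, with class names from recent Pipecat releases (where the LLMContext role is played by OpenAILLMContext); whether the service constructor accepts tools= may depend on your version:

```python
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.aws_nova_sonic.aws import AWSNovaSonicLLMService

from sight_agent import handle_query  # hypothetical module from the Strands sketch

# 1. Define the schema BEFORE constructing the LLM service.
handle_query_schema = FunctionSchema(
    name="handle_query",
    description="REQUIRED for ALL requests. Route every user query here.",
    properties={"query": {"type": "string", "description": "the user's request, verbatim"}},
    required=["query"],
)
tools = ToolsSchema(standard_tools=[handle_query_schema])

# 2. Pass tools to BOTH the service constructor and the context.
llm = AWSNovaSonicLLMService(
    access_key_id="...", secret_access_key="...", region="us-east-1",
    tools=tools,
)
context = OpenAILLMContext(messages=[], tools=tools)

# 3. Register the handler that hands the query off to the Strands agent.
async def handle_query_handler(params):
    result = handle_query(params.arguments["query"])
    await params.result_callback(result)

llm.register_function("handle_query", handle_query_handler)
```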
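
And a minimal sketch of what camera_store.py might contain, latest frame wins:

```python
# camera_store.py: a thread-safe, latest-frame-wins store.
import threading

_lock = threading.Lock()
_latest_frame: str | None = None  # base64-encoded JPEG uploaded by the phone

def set_frame(frame_b64: str) -> None:
    """Called by POST /api/camera each time the phone uploads a frame."""
    global _latest_frame
    with _lock:
        _latest_frame = frame_b64

def get_frame() -> str | None:
    """Called by the vision tools; returns the most recent frame, if any."""
    with _lock:
        return _latest_frame
```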

What I learned

  • Nova Sonic is remarkably capable as a speech-to-speech model — the latency is low enough for natural conversation, and its tool-calling works reliably once the schema is wired correctly.
  • The Strands delegated architecture is powerful for open-ended queries — instead of hardcoding tool selection logic, the agent reasons about it dynamically using Nova 2 Lite.
  • Building for accessibility forces you to think differently about UX. Every design decision has to work without a screen — audio feedback, contextual fillers, wake word activation, and automatic connection all matter enormously when your user literally cannot see the interface.

What's next

  • Deploy to AWS ECS for production (currently runs locally with ngrok)
  • Android support
  • Wake word detection directly in the browser using WASM
  • Expand vision capabilities to include object detection, currency recognition, and facial expression reading
  • Multi-language support using Nova Sonic's language capabilities

Demo

Advay is visually impaired and walks into a restaurant alone.

  1. He says "Hey Sight" — a beep confirms activation, no button needed
  2. He points his phone at the menu and says "read this menu" — Sight reads every item, price, and allergen aloud
  3. He asks "any pasta without meat?" — Sight filters and responds
  4. He asks "does the mushroom pasta have nuts?" — Sight checks and warns him
  5. He says "find reviews for this restaurant" — Sight searches nearby

Fully hands-free. No screen. No tapping. Just voice.

Built With

  • amazonlocationservice
  • amazonnova2lite
  • amazonnova2sonic
  • bedrock
  • html/css
  • javascript
  • nova
  • pipecat
  • porcupine
  • python
  • strands
  • webrtc