Inspiration

Every day, 285 million visually impaired people navigate a world designed for sight. Existing assistive tools offer basic object detection or OCR — but they don't reason. They'll tell you "there's a car," but not "don't cross yet; the car is approaching fast from your left."

We asked: What if your phone could think like a sighted companion? Not just describe, but understand context, assess danger, and give actionable guidance — all through natural voice conversation.

What it does

SonicSight transforms a smartphone into an intelligent spatial assistant:

  • 🎬 Scene Understanding — Continuously analyzes camera frames, identifying objects, reading text, counting people, and detecting hazards
  • 🛡️ Safety Reasoning — Goes beyond detection to contextual danger assessment: "The pedestrian signal is red. Wait at the curb."
  • 🗣️ Voice Conversation — Natural speech Q&A: "What's in front of me?" → Contextual, spoken response
  • 🏷️ Personal Object Recognition — Train it to recognize your keys, medication, or wallet using multimodal embeddings

How we built it

Frontend: React + Vite PWA with a blind-first design philosophy:

  • Full-screen tap zones (no tiny buttons to find)
  • Voice-guided onboarding that speaks instructions aloud
  • Camera auto-starts — zero visual interaction required
  • The entire bottom of the screen is a mic tap zone

Backend: Python FastAPI serving REST endpoints and a WebSocket proxy:

  • Scene analysis via Amazon Nova Lite (multimodal vision)
  • Safety reasoning via Amazon Nova Lite (contextual evaluation)
  • Voice conversation via Amazon Nova Sonic (speech-to-speech streaming)
  • Object embeddings via Amazon Nova Embed (multimodal embeddings with cosine similarity matching)

Key Design Decision: We chose a PWA over a native app — no install friction, instant camera/mic access, and works on any smartphone with a browser.

Challenges we faced

  1. Accessibility-first is hard — Our initial UI had beautiful buttons that a blind user could never find. We had to rethink everything: auto-start camera, speak all instructions, make the entire bottom screen edge a tap target.

  2. Latency budget — Scene analysis → safety reasoning → voice response must complete in under 1 second to feel conversational. We stayed within budget by tuning the capture interval (one frame every 4 seconds, so analyses never queue up behind each other) and streaming responses so speech starts before the full answer is generated.

  3. Stale closure bugs in React — Real-time hooks (camera, voice, API) with multiple async streams kept capturing outdated state in long-lived callbacks. We stored the latest values in refs so async code always reads current state instead of the snapshot it closed over.

  4. Nova Sonic integration — Bridging browser WebSocket audio to Nova Sonic's bidirectional streaming required careful session management and context injection.
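The session-management and context-injection problem from challenge 4 can be sketched as a small per-connection state object on the proxy. This is a simplified model under our own assumptions (the `SonicSession` class, its method names, and the event dicts are hypothetical illustrations, not the real Bedrock bidirectional streaming API): the latest scene analysis is buffered and injected as a context event exactly once, ahead of the next audio chunk forwarded upstream.

```python
import asyncio

class SonicSession:
    """Per-connection proxy state: ordered outbound events with scene-context injection.

    Hypothetical sketch — the real integration speaks Nova Sonic's
    bidirectional streaming protocol; here we only model event ordering.
    """

    def __init__(self) -> None:
        self.outbound: asyncio.Queue = asyncio.Queue()
        self._scene_context = None  # most recent scene analysis, if not yet sent

    def update_scene(self, description: str) -> None:
        # Called whenever a new frame analysis arrives; overwrites stale context.
        self._scene_context = description

    async def send_audio(self, chunk: bytes) -> None:
        # Inject pending context once, so the model hears about the scene
        # before it hears the user's next utterance.
        if self._scene_context is not None:
            await self.outbound.put({"type": "context", "text": self._scene_context})
            self._scene_context = None
        await self.outbound.put({"type": "audio", "bytes": len(chunk)})
```

The key design point is the one-shot injection: context rides along with the audio stream in order, rather than racing it on a separate channel.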

What we learned

  • Accessibility isn't a feature, it's architecture. You can't bolt it on — it changes how you design every interaction.
  • Amazon Nova's multimodal capabilities (vision + reasoning in one model) enable safety reasoning that wasn't possible with detection-only models.
  • PWAs are underrated for accessibility — screen readers, haptic feedback, and speech APIs all work natively.

What's next

  • Haptic feedback — vibration patterns for danger proximity
  • Offline mode — on-device models for areas without connectivity
  • Community object library — share trained object models across users
  • Navigation integration — turn-by-turn walking directions with spatial audio
