SightLine: Your AI Vision Assistant for the Visually Impaired

Inspiration

The idea for SightLine was born from a simple but profound question: What if the smartphone in your pocket could become your eyes?

Over 2.2 billion people globally live with vision impairment. Yet the tools available to them — screen readers, basic apps, white canes — haven't fundamentally changed in decades. We kept asking ourselves: in a world where AI can generate photorealistic images, write code, and hold conversations in real time, why can't it describe a crosswalk, read a handwritten menu, or tell someone if a friend is smiling at them?

That gap — between what AI can do and what it's being used for — is what inspired SightLine.


What We Learned

Building SightLine taught us that the hardest problems in AI aren't technical — they're human.

  • Latency is everything. A 3-second delay is annoying in a chatbot. In a navigation assistant, it's dangerous. We learned to optimize aggressively around the Gemini Live API's real-time streaming to keep response times under two seconds.

  • Voice UX is its own discipline. Sighted users scan. Blind users listen sequentially. Every design decision had to be rethought from the ground up.

  • Multimodal isn't just a feature — it's the product. The moment vision + voice + real-time processing converged, SightLine stopped feeling like an app and started feeling like a companion.


How We Built It

SightLine is built on a real-time multimodal pipeline:

$$\text{Camera Input} \xrightarrow{\text{Gemini Live API}} \text{Scene Understanding} \xrightarrow{\text{TTS}} \text{Voice Output}$$

Core Stack:

  • Gemini Live API — continuous video streaming + real-time natural language response
  • Google Cloud Vision API — object detection, text recognition, facial expression analysis
  • Google Maps Platform — turn-by-turn navigation with landmark awareness
  • TensorFlow Lite — on-device inference for privacy-first processing
  • Firebase — real-time sync for user preferences and emergency contacts

The architecture prioritizes three principles:

  1. Speed — stream-first, never wait for a full frame to process
  2. Privacy — on-device processing wherever possible, zero retention
  3. Simplicity — one tap to activate, entirely voice-driven from there
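The stream-first principle can be illustrated with a minimal, self-contained simulation (not SightLine's actual loop; frame timings and names here are our own assumptions): frames arrive at a fixed interval, and instead of queueing every frame, the loop always jumps to the newest one available, so end-to-end latency stays bounded by per-frame processing time rather than by queue depth.

```python
def stream_first(num_frames, frame_interval, process_time):
    """Simulate a stream-first loop: frame k arrives at k * frame_interval
    seconds, and describing one frame takes process_time seconds. Instead
    of queueing, the loop skips to the newest arrived frame after each
    inference, so frames that went stale while we were busy are dropped."""
    processed = []
    t = 0.0  # simulated wall-clock time
    i = 0    # index of the next frame to process
    while i < num_frames:
        t = max(t, i * frame_interval)  # idle until the frame arrives
        processed.append(i)
        t += process_time               # time spent on inference
        # jump past every frame that arrived while we were processing
        i = max(i + 1, int(t // frame_interval))
    return processed
```

With a 30 fps feed and 100 ms inference, this loop keeps roughly every third frame; a queueing design would instead fall further behind with every frame.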

Challenges We Faced

1. Real-Time Latency Under Load

Maintaining sub-2-second responses while processing live video was our biggest technical challenge. We solved this through frame-sampling optimization:

$$f_{\text{sample}} = \frac{1}{T_{\text{response}}} \cdot \alpha$$

where $\alpha$ is an adaptive confidence threshold that reduces processing load when the scene is static.
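To make the sampling rule concrete, here is a hedged sketch of how $\alpha$ could be computed. The mean-absolute-difference motion estimate, the constants, and the function name are our illustration, not the production code:

```python
def adaptive_sample_rate(t_response, prev_frame, frame,
                         alpha_min=0.2, alpha_max=1.0, motion_thresh=10.0):
    """Return f_sample = alpha / T_response, where alpha scales with scene
    motion: near alpha_min when the scene is static (less processing),
    up to alpha_max under heavy motion. All constants are illustrative."""
    # Cheap motion estimate: mean absolute pixel difference between
    # the previous and current frame (flat lists of pixel values).
    diff = sum(abs(a - b) for a, b in zip(prev_frame, frame)) / len(frame)
    alpha = alpha_min + (alpha_max - alpha_min) * min(diff / motion_thresh, 1.0)
    return alpha / t_response
```

A static scene thus samples at one fifth of the full rate in this sketch, cutting inference load without losing responsiveness when the scene changes.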

2. Describing the World Without Overwhelming the User

Early versions narrated everything — every object, every movement. Users found it exhausting. We built a relevance scoring system that prioritizes:

  • Immediate hazards (obstacles, traffic)
  • User-initiated queries
  • Contextual changes (new person enters room, text appears)
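A minimal sketch of such a relevance scorer might look like the following. The categories mirror the list above, but the weights, the `budget` parameter, and the event shape are illustrative assumptions:

```python
# Illustrative priority weights; the real scorer is tuned from user feedback.
PRIORITY = {"hazard": 3, "user_query": 2, "context_change": 1, "ambient": 0}

def select_narration(events, budget=2):
    """Pick the `budget` most relevant events to speak aloud, hazards
    first. Zero-weight (ambient) events are dropped entirely so the
    user isn't overwhelmed by constant narration."""
    ranked = sorted(
        (e for e in events if PRIORITY.get(e["kind"], 0) > 0),
        key=lambda e: PRIORITY[e["kind"]],
        reverse=True,
    )
    return [e["label"] for e in ranked[:budget]]
```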

3. Designing for Users We're Not

No one on our team is visually impaired. We conducted rapid user interviews and iterated on feedback constantly, learning that what seems intuitive visually often fails completely in audio-only UX.

4. Edge Cases in Scene Description

Ambiguous scenes — low light, motion blur, crowds — produced unreliable descriptions. We implemented a confidence flagging system that signals uncertainty to the user rather than stating incorrect information as fact.
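A simplified version of that flagging behavior is sketched below; the 0.6 threshold and the hedging phrase are illustrative placeholders, not the shipped values:

```python
def flag_uncertain(description, confidence, threshold=0.6):
    """Pass confident descriptions through unchanged; prefix
    low-confidence ones with an explicit hedge so the user knows
    the model is unsure rather than hearing a guess stated as fact."""
    if confidence >= threshold:
        return description
    return "I'm not certain, but this might be: " + description
```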


What's Next

SightLine is not just a hackathon project. The path to real-world deployment is clear:

  • B2C: App Store launch with freemium model
  • B2B: Enterprise accessibility compliance partnerships
  • NGO: Collaboration with blindness organizations globally

"The best technology disappears. It just becomes part of how you live."

SightLine doesn't ask visually impaired users to adapt to technology. It asks technology to finally adapt to them.
