NexHacks_vision : Smartphone Spatial Navigation for Blind & Low‑Vision Users

Inspiration

NexHacks_vision started from a simple moment: talking directly with blind and low-vision students and realizing that the hardest part of navigation isn't "maps"; it's spatial awareness in the last 10–30 meters. GPS can tell you where to go, but it can't tell you what's in your way: chairs, people, construction cones, stairs, cables, cluttered hallways, or a doorway that's slightly off to the side. We wanted to build something that feels less like a robot shouting alerts and more like a calm companion that helps you move safely and confidently using what you already have: a smartphone.

What it does

NexHacks_vision is a smartphone-only spatial intelligence navigation assistant for visually impaired users. It combines:

  • Google Maps route guidance (A → B, turn-by-turn progress)
  • Real-time environment understanding from the phone camera (objects + obstacles + distances)
  • Simple spoken instructions like: “Walk forward 10 steps. Keep slightly right. Turn right in 6 steps.”

Instead of generic beeps, we built an AI system that gives contextual guidance about nearby hazards and keeps the user aligned to the route, especially in dynamic indoor/outdoor environments where GPS alone falls short.

How we built it

We built a running loop that updates every 15 seconds:

  1. Capture a camera frame from the phone.
  2. Run Overshoot's vision model for object detection/recognition, returning structured JSON with labels, risks, and bounding boxes.
  3. Run Depth Anything to estimate a depth map and compute per-object distance in meters (robust median depth inside each bounding box).
  4. Call the Google Routes API to fetch turn-by-turn steps plus the route polyline (latitude/longitude coordinates).
  5. Feed a compact “navigation state packet” into an LLM (via OpenRouter) that outputs strict JSON instructions.
  6. Convert the instruction to speech using TTS (ElevenLabs / OpenAI), and repeat.
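The loop above can be sketched as a single "tick" with each stage injected as a callable. This is a minimal illustration, not our actual module layout; every function name here (`capture_frame`, `fuse_distances`, etc.) is a placeholder for the corresponding real component:

```python
import json

def navigation_tick(capture_frame, detect_objects, estimate_depth,
                    fuse_distances, fetch_route, ask_llm, speak):
    """One pass of the 15-second loop (all stage names are illustrative)."""
    frame = capture_frame()                            # 1. camera frame
    detections = detect_objects(frame)                 # 2. labels, risks, boxes
    depth_map = estimate_depth(frame)                  # 3. metric depth map
    obstacles = fuse_distances(detections, depth_map)  #    per-object meters
    route = fetch_route()                              # 4. steps + polyline
    # 5. compact "navigation state packet" for the LLM
    packet = json.dumps({"obstacles": obstacles, "route": route})
    instruction = ask_llm(packet)                      #    strict-JSON output
    speak(instruction)                                 # 6. TTS, then repeat
    return instruction
```

Injecting the stages this way also made the loop easy to unit-test with stubs, without a phone or API keys in the loop.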

We designed the LLM prompt so it doesn’t do heavy geometry; it focuses on clear language and prioritization, while deterministic code computes distances, progress, and safety thresholds.
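One of those deterministic guardrails can be sketched as follows. The threshold, field names, and spoken phrasing are assumptions for illustration; the key idea is that code, not the LLM, has the final say when a high-risk obstacle is too close:

```python
import json

STOP_DISTANCE_M = 2.0  # assumed safety threshold; tuned per risk class in practice

def apply_guardrails(llm_output: str, obstacles: list) -> dict:
    """Parse the LLM's strict-JSON instruction, then deterministically
    override it with STOP if any high-risk obstacle is within range.
    obstacles: [{"label": str, "risk": "high"|"low", "distance_m": float}]"""
    instruction = json.loads(llm_output)  # raises if the model broke the schema
    for ob in obstacles:
        if ob["risk"] == "high" and ob["distance_m"] < STOP_DISTANCE_M:
            return {"action": "STOP",
                    "speech": f"Stop. {ob['label']} ahead, "
                              f"{ob['distance_m']:.1f} meters."}
    return instruction
```

Because the override returns a fixed schema, the TTS layer never has to handle free-form text, even when the model misbehaves.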

Challenges we ran into

  • Safety + reliability: Navigation is safety-critical, so we needed deterministic guardrails (e.g., force “STOP” if a high-risk obstacle is too close) and strict JSON output for TTS.
  • Fusing route + perception: Maps guidance and camera perception often “disagree” in the last mile; we had to build a clean interface so the system can prioritize immediate hazards without losing the route.
  • Depth robustness: Even with metric depth, real scenes have noise and edge artifacts. Using median depth inside a central crop of each bounding box improved stability.
  • Latency and cost: Running vision + depth + LLM on a loop required tight token budgets and small, reliable model choices with fallbacks.
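The median-depth trick from the depth-robustness point above is simple to write down. This is a sketch under the assumption that the depth map is a metric H×W array and boxes are pixel coordinates; `crop_frac` is an illustrative parameter:

```python
import numpy as np

def object_distance_m(depth_map, box, crop_frac=0.5):
    """Robust per-object distance: median metric depth inside a central
    crop of the bounding box, which discards noisy edges and background
    pixels that bleed into the box. box = (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    # shrink the box around its center by (1 - crop_frac) on each axis
    cx1 = int(x1 + w * (1 - crop_frac) / 2)
    cy1 = int(y1 + h * (1 - crop_frac) / 2)
    cx2 = int(x2 - w * (1 - crop_frac) / 2)
    cy2 = int(y2 - h * (1 - crop_frac) / 2)
    patch = depth_map[cy1:cy2, cx1:cx2]
    return float(np.median(patch))
```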

Accomplishments that we're proud of

  • Built an end-to-end prototype that merges route navigation + real-time obstacle awareness into natural spoken instructions.
  • Implemented metric-distance hazard estimation from Depth Anything and combined it with object detections.
  • Added route progress metrics (next maneuver distance, remaining distance, progress %) using step endpoints and polyline validation.
  • Designed a structured prompting + schema approach that stays stable for TTS and reduces hallucinations.
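The route-progress metrics reduce to great-circle distances over the remaining step endpoints. A minimal sketch, assuming a current GPS fix and a list of `(lat, lon)` endpoints for the steps still ahead (function names are illustrative):

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    R = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def route_progress(position, step_endpoints, total_m):
    """Distance to the next maneuver, remaining distance, and progress %
    from the current fix and the endpoints of the remaining steps."""
    lat, lon = position
    next_m = haversine_m(lat, lon, *step_endpoints[0])
    remaining_m = next_m + sum(
        haversine_m(*step_endpoints[i], *step_endpoints[i + 1])
        for i in range(len(step_endpoints) - 1))
    pct = max(0.0, min(100.0, 100.0 * (1 - remaining_m / total_m)))
    return next_m, remaining_m, pct
```

In practice we also validate the fix against the decoded route polyline before trusting the numbers, since a raw GPS fix can sit well off-route indoors.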

What we learned

  • The “last 10–30 meters” is where navigation tools break and where spatial intelligence matters most.
  • Good assistive UX is not more information; it’s the right instruction at the right time, spoken simply.
  • Safety comes from layering: deterministic thresholds + perception + language, not from an LLM alone.
  • Keeping inputs and outputs structured (JSON) is the difference between a demo and a dependable system.

What's next for NexHacks_vision

  • Increase refresh rate (toward real-time), reduce latency, and improve robustness in crowded indoor spaces.
  • Add better “on-route” tracking with phone motion sensors (VIO/AR) for more accurate progress and rerouting.
  • Run pilots with blind/low-vision users to measure trust, comfort, and real-world reliability.
  • Expand scene understanding to include key landmarks (doors, stairs, crosswalks, signage) and personalization (walking speed, instruction style, preferred safety distance).
