Inspiration

An estimated 2.2 billion people worldwide live with some form of vision impairment (WHO, 2023), and over 43 million are completely blind. While outdoor navigation has been largely solved by GPS and turn-by-turn apps, indoor spaces remain a dead zone: GPS signals drop the moment you walk through a door. For a visually impaired person, navigating a university campus basement, a hospital corridor, or a convention center means relying entirely on memorized layouts or sighted assistance.

Indoor wayfinding is the #1 unmet accessibility need reported by blind and low-vision users in public buildings (Smith-Kettlewell Eye Research Institute). Existing solutions require expensive pre-installed infrastructure (Bluetooth beacons, NFC tags, or Wi-Fi fingerprinting) costing $5,000–$50,000+ per building. Most buildings simply never get retrofitted.

We asked: what if a phone camera and AI could replace all of that?

DWS — Digital Walking Stick was born from that question. No hardware installation. No beacons. Just a phone pointed forward and an AI that sees, understands, and speaks.

What it does

DWS is a voice-first indoor navigation assistant that combines real-time computer vision, AI-powered scene understanding, and intelligent pathfinding to guide visually impaired users through indoor environments using nothing but a smartphone camera.

Core experience:

  1. Voice command — Say "I'm in room 0020, take me to room 0010" and DWS plots the optimal route using Dijkstra's algorithm over a pre-analyzed floor plan graph.

  2. Continuous guidance every 3 seconds — Once navigating, DWS automatically captures a frame from the camera, runs object detection + depth estimation + multimodal AI reasoning, and speaks a step-based instruction via ElevenLabs: "Go 3 steps forward", "Chair 2 steps ahead, take 2 steps to your left", "Path clear, keep walking."

  3. Braille detection — DWS detects braille signage on walls and doors using a trained YOLO model and reads it aloud: "I also detected braille text that reads: Room 0010 Research Lab."

  4. Live floor plan tracking — A real-time SVG map shows the user's simulated position, traversed path (green), remaining path (animated dashed blue), direction arrow, and progress percentage.

  5. Zero infrastructure — No beacons, no pre-installed hardware. Upload a floor plan SVG, and Gemini Vision analyzes it once to extract rooms, hallway waypoints, and connections. DWS builds the navigation graph automatically.
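
The route planning in item 1 can be sketched roughly as follows. This is a minimal illustration rather than the project's code: the toy graph, the hallway labels H1 and H2, the edge weights, and the function names normalize_label and shortest_path are all hypothetical, and the sketch returns labels only.

```python
import heapq

def normalize_label(label: str) -> str:
    """Treat "0020", "020" and "20" as the same room by stripping leading zeros."""
    stripped = label.strip().lstrip("0")
    return stripped or "0"

def shortest_path(graph, start, goal):
    """Dijkstra over a weighted adjacency list: {node: [(neighbor, cost), ...]}.

    Returns the list of node labels from start to goal, or None if unreachable.
    """
    start, goal = normalize_label(start), normalize_label(goal)
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for neighbor, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(heap, (nd, neighbor))
    if goal not in dist:
        return None
    # Walk the predecessor links back from the goal.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical toy graph: two rooms joined by two hallway waypoints.
GRAPH = {
    "20": [("H1", 2.0)],
    "H1": [("20", 2.0), ("H2", 5.0), ("10", 9.0)],
    "H2": [("H1", 5.0), ("10", 3.0)],
    "10": [("H1", 9.0), ("H2", 3.0)],
}

print(shortest_path(GRAPH, "0020", "0010"))  # ['20', 'H1', 'H2', '10']
```

Normalizing labels before the search means a spoken "room 0020" and a stored "20" resolve to the same graph node.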

How we built it

Architecture

DWS is a two-tier system — a Python FastAPI async backend handling all ML inference, and a Next.js 16 frontend serving as the accessible client interface.

Phone Camera → WebSocket → FastAPI Backend → YOLOv8 + MiDaS + Gemini 2.0 → ElevenLabs TTS → Audio to User

Backend (Python / FastAPI)

  • Object Detection: YOLOv8n (nano) via Ultralytics runs every frame at 30–50ms inference, filtered to the top 5 indoor-relevant detections from 80 COCO classes with a 0.4 confidence threshold.

  • Depth Estimation: MiDaS Small (Intel ISL) monocular depth model converts a single 2D frame into a dense depth map. Per-object distances are computed from the median inverse depth of the center 50% of each bounding box, converted to meters with a calibrated scale factor, and capped at 10m. Depth runs on a 1.5-second time-based cooldown to avoid blocking the pipeline; between runs, the cached depth map is reused for fast per-bbox lookups (<1ms).

  • Obstacle Classification: A custom ObstacleClassifier analyzes the walking corridor (center 40% of the frame). It computes bounding-box overlap with the corridor and applies distance-tiered blocking thresholds: 25% coverage at <1m (danger), 35% at 1–2m (warning), 50% at 2–3.5m (caution). It determines the clear side (left/right) by measuring free corridor space and converts all distances to walking steps (1 step ≈ 0.75m) for natural verbal instructions.

  • Scene Reasoning: Gemini 2.0 Flash multimodal receives the raw camera image + structured detection data (labels, step distances, corridor coverage, blocking status). A carefully engineered prompt constrains output to direct, step-based commands under 20 words with no visual language. Falls back to rule-based generation if Gemini times out (12s timeout, 2 retries with exponential backoff).

  • Floor Plan Analysis: Gemini 2.0 Flash Vision ingests an SVG floor plan (rendered to PNG) and extracts room numbers, door coordinates, hallway intersections, and connectivity. The result is cached to JSON, so this is a one-time operation that produces the full navigation graph (~23 rooms, ~34 hallway waypoints, ~73 edges for our test building).

  • Pathfinding: Dijkstra's shortest path algorithm over a weighted adjacency list. Room label normalization handles "0020" / "020" / "20" as identical. Returns an ordered list of {x, y, label} waypoints.

  • Text-to-Speech: ElevenLabs eleven_flash_v2_5 model generates natural MP3 audio (44.1kHz, 128kbps) with a 15-second timeout and exponential backoff retry. The frontend falls back to browser SpeechSynthesis if the API is unavailable.

  • Braille Detection: A trained YOLO braille detection model identifies raised-dot signage on walls and doors, reading room numbers, exit signs, and accessibility labels in real time.
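
The corridor logic from the Obstacle Classification bullet can be illustrated with a short sketch. The thresholds, the 40% corridor, and the 0.75m step come from the description above. The rest is assumed for illustration: coverage is measured as the horizontal fraction of the corridor a box overlaps, the clear side is whichever free gap is wider, and the names Detection, classify, and meters_to_steps are hypothetical, not the real ObstacleClassifier API.

```python
from dataclasses import dataclass

STEP_METERS = 0.75        # from the writeup: 1 step ≈ 0.75 m
CORRIDOR_FRACTION = 0.40  # walking corridor = center 40% of the frame width
# (upper distance bound in meters, minimum coverage to block, severity)
TIERS = [(1.0, 0.25, "danger"), (2.0, 0.35, "warning"), (3.5, 0.50, "caution")]

@dataclass
class Detection:
    label: str
    x1: float          # left edge of the bounding box, pixels
    x2: float          # right edge of the bounding box, pixels
    distance_m: float  # estimated distance from the depth stage

def meters_to_steps(distance_m: float) -> int:
    """Convert meters to whole walking steps, never reporting fewer than 1."""
    return max(1, round(distance_m / STEP_METERS))

def classify(det: Detection, frame_width: float) -> dict:
    left = frame_width * (1 - CORRIDOR_FRACTION) / 2
    right = frame_width - left
    # Assumption: coverage is the horizontal share of the corridor the box overlaps.
    overlap = max(0.0, min(det.x2, right) - max(det.x1, left))
    coverage = overlap / (right - left)

    severity = None
    for max_dist, min_coverage, name in TIERS:
        if det.distance_m < max_dist:
            if coverage >= min_coverage:
                severity = name
            break  # only the tier matching the distance applies

    # Assumption: the clear side is the wider free gap inside the corridor.
    free_left = min(max(det.x1, left), right) - left
    free_right = right - max(min(det.x2, right), left)
    return {
        "label": det.label,
        "coverage": coverage,
        "severity": severity,
        "blocking": severity is not None,
        "clear_side": "left" if free_left >= free_right else "right",
        "steps": meters_to_steps(det.distance_m),
    }

chair = classify(Detection("chair", 250, 400, 1.7), frame_width=640)
print(chair["severity"], chair["clear_side"], chair["steps"])  # warning left 2
```

Tightening the coverage threshold as distance shrinks means a small object only counts as blocking once it is close enough to matter.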

Frontend (Next.js 16 / React 19 / TypeScript)

  • WebSocket Camera Stream: The CameraStream component captures frames at 640x480, downscales to 240px width at 35–40% JPEG quality, and sends base64-encoded frames over WebSocket. A backpressure system ensures no frame is sent until the previous response arrives. An adaptive rate controller measures round-trip time and adjusts the send interval (80–800ms), achieving 5–12 FPS on real devices.

  • Auto-reconnection: Exponential backoff with up to 5 retries on WebSocket disconnect.

  • Voice Input: Web Speech API (SpeechRecognition) for one-shot voice commands, with automatic fallback to manual text input if the browser doesn't support it.

  • Floor Plan Visualization: SVG-based renderer with a requestAnimationFrame animation loop: live position beacon with pulse effect, traversed/remaining path rendering, directional arrow, and percentage progress bar.

  • Code Splitting: Heavy components (FloorPlanMap, CameraStream) are lazy-loaded via next/dynamic to minimize initial bundle size.

  • Authentication: Clerk with protected routes, middleware-based redirects, and Supabase for data persistence.
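
The backpressure and adaptive-rate behavior of the camera stream can be sketched as a small state machine. It is written in Python rather than the client's TypeScript, purely for illustration. The 80–800ms bounds and the RTT × 1.6 rule come from this writeup. The class and method names are hypothetical, and the real controller may smooth the RTT over several samples rather than using only the latest one.

```python
class AdaptiveSender:
    """One frame in flight at a time; send interval follows the measured RTT."""

    MIN_INTERVAL_MS = 80.0
    MAX_INTERVAL_MS = 800.0
    RTT_FACTOR = 1.6

    def __init__(self):
        self.interval_ms = self.MIN_INTERVAL_MS
        self.in_flight = False
        self.last_sent_ms = None

    def can_send(self, now_ms: float) -> bool:
        # Backpressure: never send while a previous frame is unanswered.
        if self.in_flight:
            return False
        if self.last_sent_ms is None:
            return True
        return now_ms - self.last_sent_ms >= self.interval_ms

    def on_send(self, now_ms: float) -> None:
        self.in_flight = True
        self.last_sent_ms = now_ms

    def on_response(self, now_ms: float) -> float:
        # Assumes on_send was called first, so last_sent_ms is set.
        rtt_ms = now_ms - self.last_sent_ms
        self.in_flight = False
        # Clamp RTT * 1.6 into the 80–800 ms window.
        self.interval_ms = min(
            self.MAX_INTERVAL_MS,
            max(self.MIN_INTERVAL_MS, rtt_ms * self.RTT_FACTOR),
        )
        return self.interval_ms
```

With these bounds, a fast connection settles near 80ms between frames (about 12 FPS), while a slow one backs off toward 800ms instead of queueing stale frames.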

Performance Pipeline

The entire frame-to-speech pipeline is optimized for real-time:

| Stage | Time | Strategy |
|---|---|---|
| Frame capture + encode | ~5ms | 240px, JPEG 35% |
| WebSocket transit | ~10ms | Base64, drain stale frames |
| YOLOv8n detection | 30–50ms | Every frame, thread pool |
| MiDaS depth | ~80ms | 1.5s cooldown, cached map |
| Gemini reasoning | 1–3s | Parallel with braille, retry |
| ElevenLabs TTS | 0.5–2s | Streamed MP3 |
| Total (speech) | ~3s | Auto-fires every 3s |

Challenges we ran into

  • Monocular depth is inherently noisy — MiDaS gives relative depth, not absolute. We spent significant time calibrating the inverse-depth-to-meters conversion and settled on sampling the center 50% of each bounding box with a median filter to reject outlier pixels. Even then, distance estimates can swing by ±0.5m, so we report in whole steps rather than precise decimals; rounding errors become invisible.

  • Gemini latency vs. frame rate trade-off — Gemini 2.0 Flash takes 1–3 seconds per multimodal call. Sending every frame would create a massive queue. We solved this with the 3-second auto-announce cycle: the system captures one high-quality frame, runs full analysis, speaks the result, and waits before the next cycle. Meanwhile, the WebSocket stream continues running lightweight YOLOv8 detection on every frame for visual feedback.

  • Audio overlap — With announcements firing every 3 seconds, previous audio could collide with new audio. We built overlap protection at every level: isSpeakingRef prevents new analysis while audio plays, stopAll() kills any running Audio element and cancels browser SpeechSynthesis before starting fresh, and an analyzingRef flag prevents concurrent API calls.

  • Phone camera quality over WebSocket — Sending full-resolution frames was too slow. We tuned down to 240px width at 35% JPEG quality with backpressure control: the client never sends a new frame until the server responds to the previous one. An adaptive interval (RTT × 1.6) keeps throughput high without overwhelming slower connections.

  • Floor plan graph construction — Getting Gemini to consistently output structured JSON with room coordinates, hallway nodes, and connection edges from an SVG floor plan required extensive prompt engineering and a caching strategy (analyze once, cache forever).
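
The depth sampling from the first challenge above can be sketched in plain Python. Several things here are assumptions: "center 50%" is read as the middle half of the box in each dimension, the conversion is modeled as scale divided by inverse depth, and the scale value and the function name bbox_distance_m are hypothetical. The actual calibration is not given in this writeup.

```python
from statistics import median

MAX_DISTANCE_M = 10.0  # from the writeup: distances are capped at 10 m

def bbox_distance_m(inv_depth, bbox, scale):
    """Estimate distance to an object from a relative inverse-depth map.

    inv_depth: 2D list indexed [row][col]; larger values mean closer.
    bbox: (x1, y1, x2, y2) in pixels, x2 and y2 exclusive.
    scale: hypothetical calibration constant, meters = scale / inverse depth.
    """
    x1, y1, x2, y2 = bbox
    w, h = x2 - x1, y2 - y1
    # Keep the middle half of the box in each dimension (assumption).
    cx1, cx2 = x1 + w // 4, x2 - w // 4
    cy1, cy2 = y1 + h // 4, y2 - h // 4
    values = [inv_depth[y][x] for y in range(cy1, cy2) for x in range(cx1, cx2)]
    if not values:
        return MAX_DISTANCE_M
    # The median ignores a few outlier pixels that would skew a mean.
    m = median(values)
    if m <= 0:
        return MAX_DISTANCE_M
    return min(MAX_DISTANCE_M, scale / m)

# 8x8 toy map: uniform inverse depth of 2.0 with one noisy pixel in the center.
depth_map = [[2.0] * 8 for _ in range(8)]
depth_map[4][4] = 100.0
print(bbox_distance_m(depth_map, (0, 0, 8, 8), scale=3.0))  # 1.5
```

In the toy map the single noisy pixel does not move the estimate, which is the reason for taking a median rather than a mean over the box.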

Accomplishments that we're proud of

  • Zero infrastructure requirement — DWS works with any building floor plan SVG and a smartphone. No beacons, no hardware, no installation. Upload a floor plan, and the AI builds the navigation graph.

  • Sub-3-second guidance loop — From camera frame to spoken instruction in under 3 seconds, running continuously and automatically. The user never needs to tap a button during navigation.

  • Step-based natural language — Instead of "obstacle at 1.7 meters", users hear "Chair 2 steps ahead, take 2 steps to your left." This maps directly to physical action.

  • Multimodal AI pipeline — YOLOv8 detection + MiDaS depth + Gemini Vision reasoning + ElevenLabs voice synthesis + braille detection, all running in parallel with graceful fallbacks at every stage.

  • Production-grade resilience — Every external call has timeouts, retries with exponential backoff, and fallbacks. Gemini fails? Rule-based instructions. ElevenLabs down? Browser TTS. WebSocket drops? Auto-reconnect. No single failure breaks the experience.
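
The retry-then-fallback pattern in the last bullet can be sketched as a small helper. This is a synchronous simplification of what an async backend would do. It assumes the primary call raises an exception on timeout or failure, and the helper name call_with_fallback and the 0.5-second base delay are hypothetical.

```python
import time

def call_with_fallback(primary, fallback, retries=2, base_delay_s=0.5, sleep=time.sleep):
    """Try primary up to 1 + retries times, then return fallback().

    Waits base_delay_s, then twice that, and so on between attempts.
    """
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception:
            # Back off only if another attempt is coming.
            if attempt < retries:
                sleep(base_delay_s * 2 ** attempt)
    return fallback()
```

In this shape, primary would be the Gemini or ElevenLabs request and fallback the rule-based instruction or browser speech path. Passing sleep as a parameter lets a test substitute a recorder instead of actually waiting.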

What we learned

  • Prompt engineering is real engineering — The difference between Gemini saying "There appears to be a chair that you might want to avoid" and "Chair ahead. Take 2 steps left." came down to dozens of prompt iterations. Constraining multimodal models to produce concise, action-oriented, non-visual language is harder than it sounds.

  • Perceived latency matters more than actual latency — Users don't mind 3-second cycles if the audio feels immediate and confident. But a 1-second delay that feels like a hang is worse than a 3-second delay with a clear rhythm.

  • Depth estimation from a single camera is a solved-enough problem — MiDaS won't give you millimeter precision, but for "is this 1 step or 3 steps away?", monocular depth is surprisingly effective when paired with smart averaging and calibration.

  • Accessibility-first design changes everything — Building for blind users forced us to rethink every UX assumption. No visual feedback loops. No "tap here." Every piece of information must be speakable, concise, and actionable within 3 seconds.

What's next for DWS

  • Multi-floor support — Elevator and stairwell detection to handle floor transitions, with per-floor navigation graphs.

  • Crowd-sourced floor plans — Allow users to upload floor plans for any building. Gemini analyzes them automatically, building a growing database of navigable indoor spaces.

  • Spatial audio — Using Web Audio API panning to make obstacle warnings come from the direction of the obstacle (left ear = obstacle on left). The architecture is stubbed out and ready.

  • On-device inference — Porting YOLOv8n and MiDaS to ONNX/TFLite for client-side inference, eliminating the WebSocket round-trip entirely and enabling offline navigation.

  • Haptic feedback patterns — Vibration patterns that encode direction and urgency (short pulse = caution, long buzz = stop), allowing guidance without audio in noisy environments.

  • Live indoor positioning — Integrating Visual Inertial Odometry (VIO) via ARKit/ARCore to track the user's actual position on the floor plan, replacing the current simulated tracking.
