Building NavCane (Ally)

The Spark

It started with a simple observation. On the University of Waterloo campus, I watched a student with a white cane trace the edge of a building, carefully sweeping left and right. Every step was deliberate, every inch of ground checked before committing. The cane is a brilliant tool — but it can only reach what's within arm's length. A chair, a backpack, a person standing still — all invisible until the cane physically touches them.

I wondered: what if the cane could see?

What if a camera on a lanyard, paired with a phone in a pocket, could whisper to the user: "Chair, two meters ahead on your left. Move right." That question sent me down a rabbit hole of real-time computer vision, edge AI, spatial reasoning, and a deep respect for the problem space of assistive technology.


What I Learned

Object Detection at the Edge

YOLO (You Only Look Once) treats detection as a single regression problem — bounding box coordinates and class probabilities are predicted directly from image pixels. The loss function is a weighted sum of three components:

The Real-Time Constraint

Every millisecond counts. YOLO11m runs a forward pass in roughly 80ms on a consumer GPU. But adding a VLM call (even a fast one like Gemma or llava) introduces 500ms–3s of latency. That's the difference between "step over this" and "you've already tripped."

The solution was a two-stage pipeline:

  1. YOLO-first, always. Every frame (~800ms intervals) runs through YOLO. Detections are reported as ground truth within 100ms.
  2. Model routing by context. If a person is detected, the frame is sent to a local VLM (Ollama + llava:7b) — fast, private, offline. If no person is present, we can afford the cloud latency and route to Google Gemini/Gemma for richer scene understanding.

Formally, the routing decision is:

$$M(x) = \begin{cases} \text{Ollama}_{\text{llava}}(x) & \text{if } \exists \, b \in B : \text{label}(b) = \text{person} \ \text{Gemini}(x) & \text{otherwise} \end{cases}$$

where $B$ is the set of YOLO detections for frame $x$.

GPS Isn't as Precise as You Think

Standard GPS accuracy ($\sigma \approx 3\text{--}5\text{m}$) is fine for driving directions, but terrible for telling a visually impaired user which side of the path they're on. We had to implement snap-to-route logic using the haversine distance and bearing:

$$d = 2R \arcsin\left(\sqrt{\sin^2\left(\frac{\Delta\phi}{2}\right) + \cos\phi_1 \cos\phi_2 \sin^2\left(\frac{\Delta\lambda}{2}\right)}\right)$$

and project the GPS fix onto the nearest route segment, snapping the user position accordingly. It's a hack — but it works.


How I Built It

Week 1: Proof of Concept

A laptop webcam, a script running YOLO, and terminal output. "Chair: 0.92 confidence, x=320, y=240." Not glamorous, but it proved the camera could see obstacles.

Week 2: Spatial Reasoning

Raw bounding boxes aren't helpful to a user. I needed spatial language — "on your left," "dead ahead," "close." I wrote a simple heuristic that divides the frame into three vertical zones:

$$\text{Zone}(x) = \begin{cases} \text{left} & \text{if } x < \frac{W}{3} \ \text{center} & \text{if } \frac{W}{3} \leq x \leq \frac{2W}{3} \ \text{right} & \text{if } x > \frac{2W}{3} \end{cases}$$

with depth estimated from bounding box area: larger = closer.

Week 3: The First Voice

Text-to-speech turned those zone heuristics into audible guidance. gTTS was free but slow. Google Cloud TTS was fast but cost money. I built a fallback chain: Google Cloud → ElevenLabs → gTTS, so the app degrades gracefully when APIs fail.

Week 4: Campus Routes

I plotted coordinates for 20+ campus buildings manually from OpenStreetMap. OSRM (Open Source Routing Machine) computes footpaths between them. The route is a polyline; the app checks the user's progress by finding the nearest segment and computing the remaining distance.

Week 5: The Web Interface

A single-page app with the camera feed, a Leaflet map showing the route, a microphone button for voice commands, and a log of every detection and direction. FastAPI served it all, streaming VLM responses via Server-Sent Events (SSE).

Week 6: Field Testing

I walked the ring road around campus with a phone strapped to my chest. The app correctly identified benches, bicycles, trash cans, and people. It misidentified a bush as a person (a classic YOLO false positive). It told me to "move left" when there was a wall on my left. I learned to add a persistence filter: an object isn't real until it's been detected in 3 out of the last 5 frames.


Challenges Faced

Latency Latency Latency

The biggest enemy. A naive pipeline that runs YOLO + VLM on every frame produces unusable lag. The solution was aggressive caching: detections are updated every 800ms, but VLM queries only fire when the scene significantly changes (defined by a threshold on the mean pixel difference between frames).

The Audio Feedback Loop

When the app speaks while the user is speaking (or while a previous message is still playing), the result is chaos. I implemented a speech queue with priority levels:

Priority Source Behavior
0 Obstacle alert Interrupts everything
1 Navigation direction Waits for priority 0 to finish
2 User query response Waits for priority 0 and 1

and a mutex on the audio output device.

Model Selection Trade-offs

Model Latency Quality Offline
YOLO11m ~80ms High (80+ classes) Yes
llava:7b (Ollama) ~500ms Medium Yes
Gemma 4 26B ~1.5s Very high No
Gemini 2.5 Flash ~2s Very high No

The routing decision ($R$) is itself an optimization problem:

$$R^* = \arg\min_{r \in {\text{local}, \text{cloud}}} \left( \alpha \cdot \text{latency}(r) + \beta \cdot (1 - \text{accuracy}(r)) + \gamma \cdot \text{cost}(r) \right)$$

In practice, $\alpha$ (latency weight) dominates — so local is preferred whenever possible.

Battery Life

Running YOLO on a phone generates heat and drains the battery. Inference on device (via CoreML or TFLite) was 3x slower than the desktop GPU. The current workaround: run inference on a lightweight server and stream frames from the phone over HTTP. Not ideal for real-world use — a dedicated edge accelerator (Google Coral, NVIDIA Jetson) is the obvious next step.

The Map vs. Reality Problem

OSRM returns the shortest path, not the most accessible one. A route through a construction zone, up stairs, or across a busy intersection is perfectly valid to the algorithm but dangerous for a visually impaired user. The fix was manual: I tagged building entrances and excluded stairs from the routing graph. A long-term solution would involve integrating sidewalk quality data and accessibility reports.


What's Next

The system works, but it's not a product. The next steps are:

  • On-device inference via CoreML/TFLite or a Coral TPU — no server required
  • Depth estimation from a single camera using MiDaS or similar, replacing the heuristic box-area approximation with metric depth
  • Crowd-sourced accessibility data — if users mark hazards, the map learns
  • Haptic feedback — a vibration belt that buzzes on the side where obstacles are detected, reducing cognitive load from audio


Gemma, VLMs, and LoRA Fine-Tuning

Why Gemma

When the scene contains no people, YOLO's 80-class vocabulary is too coarse. A chair, a trash can, a backpack — YOLO can label them, but it can't answer "is that chair blocking the path?" or "can I walk between those two tables?" For that, you need a Vision-Language Model that can reason about spatial relationships, scene semantics, and accessibility.

We evaluated several VLMs:

Model Parameter Count Open Weight? Spatial Reasoning Latency
LLaVA-1.6 7B / 13B Yes Good ~500ms (local)
Gemma 4 26B 26B Yes Excellent ~1.5s (cloud)
Gemini 2.5 Flash No Excellent ~2s (cloud)
GPT-4V No Excellent ~3s (cloud)

Gemma 4 26B stood out. It's open-weight, so we could run it on our own infrastructure. Its spatial reasoning — understanding object relationships, depth ordering, and path traversal — was head and shoulders above LLaVA for this use case. Crucially, Google released it with a permissive license that allows assistive technology applications.

Vision Encoder + Language Decoder

Like most modern VLMs, Gemma uses a dual-encoder architecture. Images are processed by a vision encoder (SigLIP-based) that projects patch embeddings into the language model's token space:

$$\mathbf{v} = \text{SigLIP}_{\text{enc}}(I) \in \mathbb{R}^{N \times d_v}$$

These visual tokens are projected through a learned connector:

$$\mathbf{z} = W_p \cdot \mathbf{v} + b_p \quad \text{where} \quad W_p \in \mathbb{R}^{d_{\text{llm}} \times d_v}$$

and concatenated with the text token embeddings before being fed into the Gemma language decoder. The model autoregressively generates a description of the scene, with attention distributed across both visual and textual tokens.

The LoRA Fine-Tuning Setup

The generic Gemma checkpoint was good at describing what it saw, but not at answering navigation-relevant questions like "Is the path clear ahead?" or "Which side should I walk around this obstacle?" We needed to fine-tune it on a dataset of campus navigation scenarios.

Full fine-tuning of a 26B model was infeasible on consumer hardware. A single AdamW step with 26B parameters requires storing ~104GB of optimizer states alone. Enter LoRA (Low-Rank Adaptation).

LoRA freezes the pretrained weights and injects trainable rank decomposition matrices into each attention layer. For a weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the update is:

$$W_0 + \Delta W = W_0 + BA$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$. During training, only $A$ and $B$ are updated:

$$\mathcal{L}(\theta) = \frac{1}{|D|} \sum_{(x, y) \in D} -\log p_\theta(y \mid x) \quad \text{where} \quad \theta = {A, B}$$

The number of trainable parameters drops from 26B to:

$$\text{LoRA params} = \sum_{\text{layers}} 2 \cdot d_{\text{layer}} \cdot r \cdot n_{\text{targets}}$$

With $r = 16$ and LoRA applied to query, key, value, and output projections in all 42 decoder layers, we trained only ~340M parameters — roughly 1.3% of the full model.

The Navigation Dataset

We built a synthetic + real dataset of 5,000 examples:

Source Count Examples
Campus phone footage 1,200 Hallways, plazas, lecture halls
COCO subset (indoor) 2,000 Offices, kitchens, living rooms
Synthetic renders 1,800 Blender scenes with random obstacle layouts

Each example paired an image with a structured QA pair:

Image: [camera frame of a hallway with a chair on the left]
Q: "Describe the obstacles ahead and suggest a safe path."
A: "A chair is on the left approximately 2 meters ahead. 
    The path is clear on the right. Move slightly right 
    and continue forward."

We formatted these as multi-turn conversations to match Gemma's chat template.

Training Recipe

We used QLoRA (Quantized LoRA) to push further — the base model was loaded in 4-bit NF4 quantization:

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForVision2Seq.from_pretrained(
    "google/gemma-4-26b-it",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

The LoRA adapters were configured via PEFT:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,           # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

Training hyperparameters:

Hyperparameter Value
Rank ($r$) 16
LoRA alpha ($\alpha$) 32
Dropout 0.05
Learning rate $2 \times 10^{-4}$
LR scheduler Cosine with 10% warmup
Batch size 8 (gradient accumulation 4)
Optimizer paged AdamW 8-bit
Precision bf16
Epochs 3
GPU 1x A100 80GB

Total training time: ~6 hours.

The Impact

Before fine-tuning, Gemma would describe a cluttered hallway as "a corridor with various objects including a chair, a table, and a backpack." It was accurate but useless for navigation.

After LoRA fine-tuning, the same image produced:

"A chair is on the left, 2m ahead. A backpack is on the floor to the right, 3m ahead. The center path is clear. Continue forward, then veer slightly left after passing the chair."

The key insight: we didn't teach Gemma to see better — SigLIP was already good at that. We taught it to reason about traversal from a first-person, navigation-centric perspective. LoRA made this adaptation cheap enough to iterate on a single GPU, and the quantized base model kept memory under 48GB.

For the first time, the app could answer a question like "What's in my way?" with something more useful than a list of bounding boxes. It could tell a story about the path ahead — and that made all the difference.

Closing Thought

NavCane won't replace the white cane. But if it can give someone walking across campus a little more confidence — one whispered direction at a time — that's enough.

The integral, I hope, is positive.

Built With

  • gemini
  • moondream
  • nextjs
  • opencode
  • tailwind
  • vlm
Share this project:

Updates