VisionGuide: Real-Time Eyes When You Need Them

Why We Built This

Last semester, I was rushing to class and nearly ran into someone using a white cane at the top of the stairwell. They were standing there, hesitating. Not lost, just... trying to figure out if it was safe to go down. I stopped and asked if they needed help. They said no, they were fine; they were just waiting to make sure no one was coming up, and trying to remember which side the handrail was on. They had an app on their phone that was supposed to help, but all it said was "obstacle detected."

That stuck with me. Here we are building AI that can analyze research papers and generate code, but someone can't safely navigate a staircase they use every day? Nive and I started talking about it. We're both in tech (cybersecurity and data science), but neither of us had really thought about accessibility before. The more we dug in, the more frustrated we got. Seven million Americans are blind or visually impaired, and the "smart" tools available basically just yell "thing ahead!" at them. We figured we could do better.

What We're Actually Building

VisionGuide is pretty straightforward: point your phone camera, and it tells you what's around you in a way that's actually useful. Not "there's a sign" but "wet floor sign, stay to the right." Not "stairs detected" but "five stairs going down, handrail on your left, someone's coming up so wait a sec."

The key difference is context. Existing apps do object detection: they can spot a chair or a door. But they can't tell you that the door is opening toward you, or that the chair is blocking the only path forward, or that there are three people standing in the doorway and you should wait for them to move. That's what Gemini 3 lets us do. It doesn't just see objects; it understands the whole scene, the relationships between things, and what actually matters right now.

How We Built It

We started by mapping out the core problem: existing navigation apps give object labels, but people need spatial context. So we focused on three things: what to describe, how to describe it, and when to describe it.

The tech stack:

- Phone camera (obviously)
- Gemini 3 API for the vision analysis
- Text-to-speech for audio output
- A lot of prompt engineering to get Gemini to describe things the right way (see the sketch below)
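
To make that concrete, here's a minimal sketch of one iteration of the loop: capture a frame, send it to Gemini with the navigation prompt, speak the response. The model name, API key handling, and prompt wording are placeholders, not our production code.

```python
# Minimal sketch of the capture -> Gemini -> speech loop.
# Assumes the google-generativeai and pyttsx3 packages; the model name
# and prompt below are stand-ins, not our exact production values.
import cv2                            # camera capture
import pyttsx3                        # offline text-to-speech
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")             # placeholder key
model = genai.GenerativeModel("gemini-pro-vision")  # stand-in model name

NAV_PROMPT = (
    "You're describing the world to someone who can't see. "
    "Tell them what matters, where it is, and what they should do. "
    "Skip anything that doesn't affect their next 10 steps."
)

def describe_scene(frame) -> str:
    """Send one camera frame to Gemini, return a spoken-style description."""
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV frames are BGR
    response = model.generate_content([NAV_PROMPT, Image.fromarray(rgb)])
    return response.text

def main():
    camera = cv2.VideoCapture(0)
    tts = pyttsx3.init()
    ok, frame = camera.read()
    if ok:
        tts.say(describe_scene(frame))
        tts.runAndWait()
    camera.release()

if __name__ == "__main__":
    main()
```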

The hard parts:

- Latency is brutal. We need real-time feedback, but API calls take 2-3 seconds. Right now we're caching common scenarios and using predictive analysis: if you're approaching stairs, we pre-load the stair detection prompt (sketched below). It's hacky, but it works.
- What to describe and what to ignore. If we describe everything, it's overwhelming. If we describe too little, it's useless. We spent hours tweaking the prompts to filter for "things that matter for navigation right now." Still not perfect.
- We can't test this properly ourselves. Neither of us is blind. We're doing our best with research and existing guidelines, but we know the real test will be getting this in front of actual users. That's scary and exciting.
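
Here's roughly what that pre-loading hack looks like. The scenario names and the coarse_classify() heuristic are hypothetical stand-ins for whatever cheap on-device signal ends up flagging the scenario:

```python
# Sketch of the predictive-prompt hack: a fast, cheap check guesses the
# scenario, and we swap in a prompt tuned for it before the expensive
# Gemini call. Scenario names and coarse_classify() are hypothetical.

SCENARIO_PROMPTS = {
    "stairs": "Count the steps, say which direction they go, locate the "
              "handrail, and warn about anyone on them.",
    "doorway": "Say whether the door is open or closed, which way it "
               "swings, and whether people are blocking it.",
    "default": "Describe only what affects the user's next 10 steps.",
}

def coarse_classify(frame) -> str:
    """Stand-in for a fast on-device heuristic (tiny classifier, edge
    detector, motion cue) that runs before the API call. We haven't
    settled on one yet; this just returns the fallback."""
    return "default"  # placeholder

def pick_prompt(frame) -> str:
    scenario = coarse_classify(frame)
    return SCENARIO_PROMPTS.get(scenario, SCENARIO_PROMPTS["default"])
```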

The Gemini prompting:

Getting Gemini to give useful spatial descriptions took forever. Early attempts gave us stuff like "I see stairs and a person." Useless. We had to teach it to think about:

- Position (left/right/ahead, with distances)
- State (door is open/closed, stairs are wet/dry)
- Motion (person approaching vs. standing still)
- Priority (imminent hazard vs. background detail)
- Action (what should the user do?)

Our prompt is basically: "You're describing the world to someone who can't see. Tell them what matters, where it is, and what they should do. Skip the stuff that doesn't affect their next 10 steps."
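
For the curious, here's roughly what that looks like as an actual prompt string, with the five dimensions spelled out. This is a paraphrase, not our exact production wording:

```python
# Paraphrase of the navigation prompt with the five dimensions we
# drilled into Gemini. Approximate wording, not verbatim.
FULL_NAV_PROMPT = """\
You are describing the world to someone who can't see.
For everything you mention, give:
- Position: left / right / ahead, with a rough distance
- State: e.g. door open or closed, stairs wet or dry
- Motion: approaching, leaving, or standing still
- Priority: lead with imminent hazards, drop background detail
- Action: what the user should do next

Skip anything that doesn't affect their next 10 steps.
Answer in one or two short spoken sentences.
"""
```

Asking for "one or two short spoken sentences" matters because the output goes straight to text-to-speech; a paragraph of description is useless mid-stride.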

What We Learned

Technical stuff:

- Multimodal AI is incredible but finicky. Small prompt changes = completely different outputs.
- Real-time video processing at scale is HARD (thank god for API abstractions).
- Accessibility isn't just a nice-to-have feature; it requires completely rethinking your UX.

Bigger stuff:

- We knew nothing about accessibility tech before this. Now we're mad about how underfunded and overlooked it is.
- Building for people who aren't like you is humbling; you realize how many assumptions you make.
- Most "AI for good" projects are kind of shallow. Actually solving real problems is way harder and way more rewarding.

Challenges we're still figuring out:

- The latency thing is real. 2-3 seconds might not sound like much, but when you're walking it feels like forever. Post-hackathon we'd need to optimize heavily: maybe edge processing, maybe a hybrid model approach.
- Safety is terrifying. What if Gemini misses something? What if it tells someone it's safe to cross when it's not? We're building this as an augmentation to canes and guide dogs, never a replacement, but the responsibility still freaks us out.
- Privacy matters. Constantly filming your surroundings to send to an API is... not great. We're not storing anything, but we need to think harder about local processing and transparency.
- We haven't actually tested with blind users yet. Everything we've built is based on research and best practices, but we know we need real feedback. "Nothing about us without us" is the mantra in disability advocacy, and we take that seriously.

What's Next

If this works (like, really works, not just "cool hackathon demo" works), we want to:

- Get it in front of actual users. Partner with the National Federation of the Blind, local advocacy groups, anyone who can give us real feedback.
- Fix the latency. Optimize prompts, add edge processing, maybe train a lightweight model for common scenarios.
- Add outdoor navigation. Right now it's focused on indoor/immediate surroundings. GPS + Gemini could handle "navigate to the coffee shop 3 blocks away."
- Smart glasses integration. Holding a phone while navigating sucks. Mounting it on glasses or a chest harness makes way more sense.
- Make it free or damn near close. This is assistive tech. Paywall feels wrong.

Why This Matters

Look, we're not naive. This is a hackathon project. It's rough. It needs work. It might not even work that well yet. But here's the thing: the technology exists to give people real-time spatial awareness. It exists right now. And mostly it's being used to generate marketing copy and summarize emails. There are 7 million blind or visually impaired Americans who deserve better than "obstacle detected." They deserve:

"Wet floor ahead, mop bucket on the right, stay left" "Crowded doorway, wait 5 seconds for people to clear" "Stairs going down, handrail on your left, no one coming up you're good to go"

That's not revolutionary technology. That's just using what we have to actually help people. VisionGuide is our attempt to do that. To take Gemini's spatial reasoning and put it to work solving a real problem for real people. We've got 48 hours to prove it can work. Then the real work begins.

Built With

- Gemini 3 Vision API
- Way too much coffee
- A genuine desire to make something that matters

— Harsha & Niveda,
