Inspiration
285 million people worldwide live with visual impairment. That's nearly the population of the United States, unable to see the faces of their loved ones, read a menu at a restaurant, or navigate a new environment without assistance. New technology from Google can help them.
What it does
AccessiVision transforms any smartphone camera into an intelligent visual assistant for blind and visually impaired users. It offers multiple modes:
**Scene Description Mode**: Point your camera anywhere, and AccessiVision provides a comprehensive, spatially aware description: "You're in a coffee shop. There's an empty table to your left, about 3 steps away. The counter is straight ahead. One person is in line, and the barista appears ready to take orders."
**Text Reader Mode**: Instantly reads signs, menus, labels, and documents aloud. No more struggling to find someone to read a prescription bottle or restaurant menu.
**Question & Answer Mode**: Ask natural questions about your environment, such as "Is there an empty seat nearby?" or "What does the sign on the door say?", and get immediate, contextual answers.
**Smart Alerts**: Proactive warnings about obstacles, approaching people, stairs, and other hazards that matter for safe navigation.
The key innovation is context. AccessiVision doesn't just list objects; it understands and describes the environment the way a helpful friend would.
How we built it
```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│ User Device  │ ───▶ │  Cloud Run   │ ───▶ │ Gemini 3 API │
│ (Camera/Mic) │ ◀─── │   Backend    │ ◀─── │ (Multimodal) │
└──────────────┘      └──────────────┘      └──────────────┘
```
- Frontend: Web application with camera access via WebRTC
- Backend: Deployed on Google Cloud Run for scalability
- AI Engine: Gemini 3 Pro API with multimodal capabilities (see the sketch below)
- Speech: Text-to-Speech for audio output
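
Below is a minimal sketch of the backend's call to the model, written for a Node 18+ runtime on Cloud Run. The request shape mirrors the existing generateContent REST API; the `gemini-3-pro` model name, the function name, and the fallback prompt are illustrative assumptions rather than our exact production code.

```typescript
// describeScene.ts (Cloud Run backend, Node 18+, global fetch available).
// Hypothetical sketch: the "gemini-3-pro" model name and this request shape are
// assumed to follow the existing generateContent REST format.

const GEMINI_URL =
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-3-pro:generateContent";

export async function describeScene(
  jpegBase64: string,              // compressed camera frame from the device
  question: string | undefined,    // optional Q&A-mode question from the user
  systemPrompt: string             // the guide prompt (see Challenges below)
): Promise<string> {
  const body = {
    system_instruction: { parts: [{ text: systemPrompt }] },
    contents: [
      {
        role: "user",
        parts: [
          { text: question ?? "Describe what is in front of me." },
          { inline_data: { mime_type: "image/jpeg", data: jpegBase64 } },
        ],
      },
    ],
  };

  const res = await fetch(`${GEMINI_URL}?key=${process.env.GEMINI_API_KEY}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Gemini request failed: ${res.status}`);

  const data = await res.json();
  // Take the first candidate's text; production code should also handle
  // empty or safety-blocked candidates.
  return data?.candidates?.[0]?.content?.parts?.[0]?.text ?? "";
}
```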
Challenges we ran into
**Balancing Detail vs. Speed**: Visually impaired users need comprehensive descriptions, but too much information becomes overwhelming. We iterated on our prompts to find the right balance: detailed enough to be useful, concise enough to be quick.
**Spatial Language**: Translating visual positions into useful verbal directions was tricky. "On the left side of the image" doesn't help someone navigate. We refined our prompts to use user-centric language: "3 steps to your left" instead of "on the left of the frame."
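
An illustrative version of the prompt guidance we converged on, covering both the brevity constraint and the user-centric spatial language (representative wording, not the exact production prompt):

```typescript
// Illustrative guide prompt; representative of the rules above, not the exact
// production wording.
export const GUIDE_PROMPT = `
You are a sighted guide speaking to a blind user through an earpiece.
- Answer in at most three short sentences, and mention hazards first.
- Give positions relative to the user's body, in steps or arm's lengths
  ("about 3 steps to your left"), never relative to the image
  ("on the left side of the frame").
- Skip decorative detail unless the user asks for it.
- If you are unsure about something safety-relevant, say so.
`.trim();
```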
**Real-time Performance**: Achieving low-latency responses was critical for practical use. We optimized image compression and leveraged Gemini 3's improved response times to minimize delays.
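
A sketch of the frame-capture path on the frontend: WebRTC's getUserMedia gives us the camera stream, and downscaling plus JPEG compression keeps the upload small before it goes to the Cloud Run backend. The target width and quality values here are illustrative, not the tuned numbers from the project.

```typescript
// capture.ts (browser): grab a frame from the camera and compress it before upload.

export async function startCamera(video: HTMLVideoElement): Promise<void> {
  // Prefer the rear camera on phones; getUserMedia handles the permission prompt.
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { facingMode: "environment", width: { ideal: 1280 } },
    audio: false,
  });
  video.srcObject = stream;
  await video.play();
}

export async function captureJpegBase64(
  video: HTMLVideoElement,
  maxWidth = 768,   // downscale: smaller payload means lower upload + model latency
  quality = 0.7     // JPEG quality: still enough detail for scene understanding
): Promise<string> {
  // Draw the current video frame onto a downscaled canvas.
  const scale = Math.min(1, maxWidth / video.videoWidth);
  const canvas = document.createElement("canvas");
  canvas.width = Math.round(video.videoWidth * scale);
  canvas.height = Math.round(video.videoHeight * scale);
  canvas.getContext("2d")!.drawImage(video, 0, 0, canvas.width, canvas.height);

  // Encode to JPEG at the chosen quality.
  const blob: Blob = await new Promise((resolve, reject) =>
    canvas.toBlob(b => (b ? resolve(b) : reject(new Error("toBlob failed"))),
      "image/jpeg", quality)
  );

  // Base64-encode for the JSON request body sent to the Cloud Run backend.
  const dataUrl: string = await new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result as string);
    reader.onerror = () => reject(reader.error);
    reader.readAsDataURL(blob);
  });
  return dataUrl.split(",")[1]; // strip the "data:image/jpeg;base64," prefix
}
```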
Accomplishments that we're proud of
- Built a working accessibility tool in under a week that demonstrates real-world impact
- Leveraged Gemini 3's unique multimodal capabilities in a way that wasn't possible with previous AI models
- Created contextual, human-like descriptions that go beyond basic object detection
- Designed with accessibility-first principles that we can apply to future projects
What we learned
- Gemini 3's multimodal reasoning is genuinely transformative: it doesn't just see objects, it understands situations
- The 1M-token context window enables conversational experiences that maintain coherence over extended interactions
- User-centric language (directions relative to the user, not the camera) dramatically improves usefulness
What's next for AccessiVision
- Continuous narration mode: Real-time streaming that only speaks when something important changes (sketched below)
- Voice command integration: Full hands-free operation
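
A rough sketch of how change-gated narration could work in the browser, using the built-in speechSynthesis API for output. The word-overlap check is a naive placeholder for "something important changed"; a real implementation would more likely ask the model itself whether anything meaningful is new.

```typescript
// narration.ts (browser): speak a new description only when it differs meaningfully
// from the last one. The similarity heuristic below is a placeholder, not shipped code.

let lastSpoken = "";

function roughlySame(a: string, b: string): boolean {
  const words = (s: string) =>
    new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const wa = words(a), wb = words(b);
  const overlap = [...wa].filter(w => wb.has(w)).length;
  // Treat descriptions sharing >80% of their words as "nothing important changed".
  return overlap / Math.max(wa.size, wb.size, 1) > 0.8;
}

export function maybeSpeak(description: string): void {
  if (roughlySame(description, lastSpoken)) return; // stay quiet
  lastSpoken = description;
  speechSynthesis.cancel();                         // drop any stale utterance
  speechSynthesis.speak(new SpeechSynthesisUtterance(description));
}
```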
The Dream
We envision a world where visual impairment no longer limits independence. Where anyone can walk into a new building, a foreign country, or an unfamiliar situation with confidence, because their AI companion can see for them.
Gemini 3 brings us one giant step closer to that world.