Inspiration
285 million people worldwide live with visual impairment. That's nearly the population of the United States, unable to see the faces of their loved ones, read a menu at a restaurant, or navigate a new environment without assistance. New technology from Google can help them.
What it does
AccessiVision transforms any smartphone camera into an intelligent visual assistant for blind and visually impaired users. It offers multiple modes:
**Scene Description Mode**: Point your camera anywhere, and AccessiVision provides a comprehensive, spatially aware description: "You're in a coffee shop. There's an empty table to your left, about 3 steps away. The counter is straight ahead. One person is in line, and the barista appears ready to take orders."
**Text Reader Mode**: Instantly reads signs, menus, labels, and documents aloud. No more struggling to find someone to read a prescription bottle or restaurant menu.
**Question & Answer Mode**: Ask natural questions about your environment, such as "Is there an empty seat nearby?" or "What does the sign on the door say?", and get immediate, contextual answers.
**Smart Alerts**: Proactive warnings about obstacles, approaching people, stairs, and other hazards that matter for safe navigation.
The key innovation is context. AccessiVision doesn't just list objects; it understands and describes the environment the way a helpful friend would.
How we built it
```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│ User Device  │ ───▶ │  Cloud Run   │ ───▶ │ Gemini 3 API │
│ (Camera/Mic) │ ◀─── │   Backend    │ ◀─── │ (Multimodal) │
└──────────────┘      └──────────────┘      └──────────────┘
```
- Frontend: Web application with camera access via WebRTC
- Backend: Deployed on Google Cloud Run for scalability
- AI Engine: Gemini 3 Pro API with multimodal capabilities (see the sketch below)
- Speech: Text-to-Speech for audio output
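
Below is a minimal sketch of the backend's call to the model, written for a Node 18+ runtime on Cloud Run. The request shape mirrors the existing generateContent REST API; the `gemini-3-pro` model name, the function name, and the fallback prompt are illustrative assumptions rather than our exact production code.

```typescript
// describeScene.ts (Cloud Run backend, Node 18+, global fetch available).
// Hypothetical sketch: the "gemini-3-pro" model name and this request shape are
// assumed to follow the existing generateContent REST format.

const GEMINI_URL =
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-3-pro:generateContent";

export async function describeScene(
  jpegBase64: string,              // compressed camera frame from the device
  question: string | undefined,    // optional Q&A-mode question from the user
  systemPrompt: string             // the guide prompt (see Challenges below)
): Promise<string> {
  const body = {
    system_instruction: { parts: [{ text: systemPrompt }] },
    contents: [
      {
        role: "user",
        parts: [
          { text: question ?? "Describe what is in front of me." },
          { inline_data: { mime_type: "image/jpeg", data: jpegBase64 } },
        ],
      },
    ],
  };

  const res = await fetch(`${GEMINI_URL}?key=${process.env.GEMINI_API_KEY}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Gemini request failed: ${res.status}`);

  const data = await res.json();
  // Take the first candidate's text; production code should also handle
  // empty or safety-blocked candidates.
  return data?.candidates?.[0]?.content?.parts?.[0]?.text ?? "";
}
```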
Challenges we ran into
**Balancing Detail vs. Speed**: Visually impaired users need comprehensive descriptions, but too much information becomes overwhelming. We iterated on our prompts to find the right balance: detailed enough to be useful, concise enough to be quick.
**Spatial Language**: Translating visual positions into useful verbal directions was tricky. "On the left side of the image" doesn't help someone navigate. We refined our prompts to use user-centric language: "3 steps to your left" instead of "on the left of the frame."
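
An illustrative version of the prompt guidance we converged on, covering both the brevity constraint and the user-centric spatial language (representative wording, not the exact production prompt):

```typescript
// Illustrative guide prompt; representative of the rules above, not the exact
// production wording.
export const GUIDE_PROMPT = `
You are a sighted guide speaking to a blind user through an earpiece.
- Answer in at most three short sentences, and mention hazards first.
- Give positions relative to the user's body, in steps or arm's lengths
  ("about 3 steps to your left"), never relative to the image
  ("on the left side of the frame").
- Skip decorative detail unless the user asks for it.
- If you are unsure about something safety-relevant, say so.
`.trim();
```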
**Real-time Performance**: Achieving low-latency responses was critical for practical use. We optimized image compression and leveraged Gemini 3's improved response times to minimize delays.
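
A sketch of the frame-capture path on the frontend: WebRTC's getUserMedia gives us the camera stream, and downscaling plus JPEG compression keeps the upload small before it goes to the Cloud Run backend. The target width and quality values here are illustrative, not the tuned numbers from the project.

```typescript
// capture.ts (browser): grab a frame from the camera and compress it before upload.

export async function startCamera(video: HTMLVideoElement): Promise<void> {
  // Prefer the rear camera on phones; getUserMedia handles the permission prompt.
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { facingMode: "environment", width: { ideal: 1280 } },
    audio: false,
  });
  video.srcObject = stream;
  await video.play();
}

export async function captureJpegBase64(
  video: HTMLVideoElement,
  maxWidth = 768,   // downscale: smaller payload means lower upload + model latency
  quality = 0.7     // JPEG quality: still enough detail for scene understanding
): Promise<string> {
  // Draw the current video frame onto a downscaled canvas.
  const scale = Math.min(1, maxWidth / video.videoWidth);
  const canvas = document.createElement("canvas");
  canvas.width = Math.round(video.videoWidth * scale);
  canvas.height = Math.round(video.videoHeight * scale);
  canvas.getContext("2d")!.drawImage(video, 0, 0, canvas.width, canvas.height);

  // Encode to JPEG at the chosen quality.
  const blob: Blob = await new Promise((resolve, reject) =>
    canvas.toBlob(b => (b ? resolve(b) : reject(new Error("toBlob failed"))),
      "image/jpeg", quality)
  );

  // Base64-encode for the JSON request body sent to the Cloud Run backend.
  const dataUrl: string = await new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result as string);
    reader.onerror = () => reject(reader.error);
    reader.readAsDataURL(blob);
  });
  return dataUrl.split(",")[1]; // strip the "data:image/jpeg;base64," prefix
}
```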
Accomplishments that we're proud of
- Built a working accessibility tool in under a week that demonstrates real-world impact
- Leveraged Gemini 3's unique multimodal capabilities in a way that wasn't possible with previous AI models
- Created contextual, human-like descriptions that go beyond basic object detection
- Designed with accessibility-first principles that we can apply to future projects
What we learned
- Gemini 3's multimodal reasoning is genuinely transformative: it doesn't just see objects, it understands situations
- The 1M-token context window enables conversational experiences that maintain coherence over extended interactions
- User-centric language (directions relative to the user, not the camera) dramatically improves usefulness
What's next for AccessiVision
- Continuous narration mode: Real-time streaming that only speaks when something important changes (sketched below)
- Voice command integration: Full hands-free operation
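
A rough sketch of how change-gated narration could work in the browser, using the built-in speechSynthesis API for output. The word-overlap check is a naive placeholder for "something important changed"; a real implementation would more likely ask the model itself whether anything meaningful is new.

```typescript
// narration.ts (browser): speak a new description only when it differs meaningfully
// from the last one. The similarity heuristic below is a placeholder, not shipped code.

let lastSpoken = "";

function roughlySame(a: string, b: string): boolean {
  const words = (s: string) =>
    new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const wa = words(a), wb = words(b);
  const overlap = [...wa].filter(w => wb.has(w)).length;
  // Treat descriptions sharing >80% of their words as "nothing important changed".
  return overlap / Math.max(wa.size, wb.size, 1) > 0.8;
}

export function maybeSpeak(description: string): void {
  if (roughlySame(description, lastSpoken)) return; // stay quiet
  lastSpoken = description;
  speechSynthesis.cancel();                         // drop any stale utterance
  speechSynthesis.speak(new SpeechSynthesisUtterance(description));
}
```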
The Dream
We envision a world where visual impairment no longer limits independence. Where anyone can walk into a new building, a foreign country, or an unfamiliar situation with confidence, because their AI companion can see for them.
Gemini 3 brings us one giant step closer to that world.