VisualAID – AAI (AI Accessibility Initiative)

Empowering Independence Through AI Vision

Mission Statement

To make technology that doesn't just assist people—but truly empowers them.

Inspiration

Digital accessibility has often been ignored in technology. The Rehabilitation Act of 1973 established disability rights in federally funded programs, and in 1990 the Americans with Disabilities Act (ADA) improved access in public spaces. But when people think of accessibility, they usually imagine ramps, elevators, or wide doors, not the internet. Coincidentally, 1990 was also the year the first web server and browser were built. Since then, the internet has become the world's largest public space, yet it still leaves millions of people behind.

Once, while traveling on a train in India, I met a young man who was blind. We were about the same age. His parents were constantly watching over him, while I could move around freely. When we talked, he said he had heard the word "internet" but didn't really know what it was. That moment stayed with me. As a software engineer, I had always worked for money—but this time, I wanted to build something with a purpose.

Today, according to the World Health Organization, at least 2.2 billion people worldwide live with a vision impairment. They don't have a choice in how they see the world, but they should still be able to experience it fully. That is why I built VisualAID: to give them independence, awareness, and confidence through technology that sees for them.

VisualAID isn't only for the visually impaired—it can also help anyone who wants to understand their surroundings better and be more efficient.

What It Does

VisualAID is an AI-powered assistant that helps visually impaired users navigate safely and independently. It uses live camera input, voice interaction, and real-time AI analysis to describe what's around the user, identify obstacles, and respond naturally through conversation.

Key Features

Voice Navigation: The user can control the whole app with voice commands like "Be my eye," "Read page," "Stop assistance," or "Navigate me to the signup page." Even signing up can be done entirely through voice or text.
Real-time Vision Analysis: Uses a mix of Gemini 2.0 Flash and GPT-4o mini for object detection, obstacle recognition, motion and depth detection, emotion understanding, and overall scene analysis.
Conversational AI: Built on OpenAI's Realtime API for smooth, context-aware conversations.
Safety Awareness: Alerts users to dangers like fast-approaching vehicles or obstacles in their path.
Session Management: Stores analyzed frames, user preferences, and emergency contact information for continuity and safety.

In short, VisualAID acts like a smart companion that can see, speak, and guide.
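
As a rough illustration of the voice navigation feature, the sketch below maps a speech transcript to an intent. The command phrases come from the feature list above; the `Intent` type, the `normalize` helper, and `parseCommand` are hypothetical names chosen for this example and are not taken from the VisualAID codebase.

```typescript
// Hypothetical intent shapes; only the spoken phrases come from the feature list.
type Intent =
  | { kind: "startAssistance" }
  | { kind: "readPage" }
  | { kind: "stopAssistance" }
  | { kind: "navigate"; target: string }
  | { kind: "unknown"; transcript: string };

// Lowercase the transcript, replace punctuation with spaces, and collapse whitespace.
function normalize(transcript: string): string {
  return transcript
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Match the normalized transcript against the known commands.
function parseCommand(transcript: string): Intent {
  const text = normalize(transcript);
  if (text === "be my eye") return { kind: "startAssistance" };
  if (text === "read page") return { kind: "readPage" };
  if (text === "stop assistance") return { kind: "stopAssistance" };

  // "navigate me to the signup page" -> target "signup"
  const nav = /^navigate me to (?:the )?(.+?)(?: page)?$/.exec(text);
  const target = nav?.[1];
  if (target) return { kind: "navigate", target };

  return { kind: "unknown", transcript };
}
```

Exact phrase matching keeps the sketch short; a real recognizer would likely need to tolerate variations such as "be my eyes" or filler words, which is left out here.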

How We Built It

VisualAID is built using a full-stack architecture that supports real-time communication, AI integration, and accessibility-first design.

Frontend

React 19 with TypeScript 5.9 for a flexible and efficient interface
Web Speech API for speech recognition and text-to-speech
Canvas API for capturing and processing video frames in real time
Socket.io Client for fast communication with the backend
Custom React hooks to manage audio, camera, and AI interactions

Backend

Node.js + Express with Socket.io for real-time video and audio streaming
Gemini 2.0 Flash for AI vision analysis
OpenAI Realtime API for conversational responses
PostgreSQL (via Supabase) for storing sessions, frame data, and AI results

Infrastructure

Frontend Hosting: Netlify
Backend Hosting: Railway
Database: Supabase PostgreSQL

To reduce latency, the system sends compressed video frames over a WebSocket connection and runs multiple AI services on each frame in parallel.
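
One way the parallel processing described above could look is sketched here: every analyzer receives the same frame, each call is bounded by a timeout, and whatever finishes in time is returned. The `FrameAnalysis` and `Analyzer` types, `withTimeout`, and `analyzeInParallel` are assumptions made for illustration; the actual service signatures are not shown in this writeup.

```typescript
// Hypothetical result shape; the real services' return types are not documented here.
interface FrameAnalysis {
  source: string;
  description: string;
}
type Analyzer = (base64Frame: string) => Promise<FrameAnalysis>;

// Reject if the wrapped promise does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Run every analyzer concurrently and keep only the results that arrive in time.
async function analyzeInParallel(
  base64Frame: string,
  analyzers: Analyzer[],
  timeoutMs: number,
): Promise<FrameAnalysis[]> {
  const settled = await Promise.allSettled(
    analyzers.map((analyze) => withTimeout(analyze(base64Frame), timeoutMs)),
  );
  const results: FrameAnalysis[] = [];
  for (const outcome of settled) {
    if (outcome.status === "fulfilled") results.push(outcome.value);
  }
  return results;
}
```

Using `Promise.allSettled` rather than `Promise.all` means one slow or failing service does not block the description from the other.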

Core AI and Service Architecture

A) OpenAI Services (openaiService.js)

Used for visual analysis and scene understanding.

Key Functions:

analyzeFrame(base64Image, metadata, previousAnalysis) – uses an OpenAI vision model (GPT-4o mini, per the Key Features list) to analyze the image and return descriptions, obstacles, and safety alerts
detectCriticalObstacles(analysis) – filters urgent or dangerous obstacles
generateVoiceDescription(analysis, isFirstFrame) – creates simple spoken descriptions for the user

Supports context awareness (it knows if this is the first frame or a later one).
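
To make the filtering and first-frame behavior concrete, here is a minimal sketch of the two helper functions. It reuses the function names from the list above, but the `Analysis` and `Obstacle` shapes, the `severity` and `distanceMeters` fields, and the 2-meter default threshold are all assumptions for illustration, not the project's actual data model.

```typescript
// Hypothetical analysis shape; the fields returned by analyzeFrame are not documented here.
type Severity = "low" | "medium" | "high";
interface Obstacle {
  label: string;
  severity: Severity;
  distanceMeters: number;
}
interface Analysis {
  description: string;
  obstacles: Obstacle[];
}

// Keep obstacles that are high severity or within the threshold, nearest first.
function detectCriticalObstacles(analysis: Analysis, maxDistanceMeters = 2): Obstacle[] {
  return analysis.obstacles
    .filter((o) => o.severity === "high" || o.distanceMeters <= maxDistanceMeters)
    .sort((a, b) => a.distanceMeters - b.distanceMeters);
}

// First frame: full scene description plus warnings.
// Later frames: warnings only, so repeated speech stays short.
function generateVoiceDescription(analysis: Analysis, isFirstFrame: boolean): string {
  const warnings = detectCriticalObstacles(analysis).map((o) => {
    const unit = o.distanceMeters === 1 ? "meter" : "meters";
    return `${o.label} about ${o.distanceMeters} ${unit} ahead`;
  });
  if (isFirstFrame) {
    return [analysis.description, ...warnings.map((w) => `Caution: ${w}.`)].join(" ");
  }
  return warnings.length > 0 ? `Caution: ${warnings.join(", ")}.` : "";
}
```

Returning an empty string on later frames with no critical obstacles lets the caller skip speech entirely instead of repeating the scene.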

B) OpenAI Realtime Service (openaiRealtimeService.js)

Manages real-time voice conversations.

Maintains WebSocket connections for continuous audio exchange
Processes user voice inputs and returns audio responses
Integrates directly with vision analysis to maintain context while talking

C) OpenAI TTS Service (openaiTTSService.js)

Handles text-to-speech conversion.

Key Functions:

textToSpeech(text, options) – converts text into audio using tts-1 or tts-1-hd models
textToSpeechBase64(text, options) – returns audio in Base64 format

Supports multiple voices such as alloy, echo, fable, onyx, nova, shimmer.
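
The sketch below shows one way the options for a text-to-speech call could be validated before a request is sent. The model and voice names come from the description above; `TtsOptions`, `isVoice`, `buildSpeechRequest`, and the defaults (`tts-1`, `alloy`) are assumptions for this example rather than the service's real interface.

```typescript
const VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"] as const;
type Voice = (typeof VOICES)[number];
type TtsModel = "tts-1" | "tts-1-hd";

// Hypothetical option names; the real textToSpeech options are not listed in this writeup.
interface TtsOptions {
  model?: TtsModel;
  voice?: Voice;
}

interface SpeechRequestBody {
  model: TtsModel;
  voice: Voice;
  input: string;
}

// Type guard for voice names that arrive as plain strings (for example from stored preferences).
function isVoice(value: string): value is Voice {
  return (VOICES as readonly string[]).includes(value);
}

// Build the JSON body for a speech request, applying assumed defaults.
function buildSpeechRequest(text: string, options: TtsOptions = {}): SpeechRequestBody {
  const input = text.trim();
  if (input.length === 0) throw new Error("Nothing to speak: text is empty");
  return {
    model: options.model ?? "tts-1",
    voice: options.voice ?? "alloy",
    input,
  };
}
```

A body like this would be sent to OpenAI's speech endpoint (`POST /v1/audio/speech`); the network call itself is omitted so the sketch stays self-contained.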

D) Gemini Service (geminiService.js)

Acts as an alternative vision analyzer.

Uses the Gemini 2.0 Flash Experimental model
Follows a similar process to OpenAI's vision service
Performs image analysis and creates voice descriptions

E) Supporting Services

fastScanService.js: Performs quick safety scans between frames for faster alerts
responseCacheService.js: Caches AI responses to reduce redundant calls
geminiConversationService.js: Handles conversational logic if Gemini chat is enabled
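
As an illustration of the caching idea, here is a minimal time-limited cache. How responseCacheService.js actually keys or expires entries is not described in this writeup, so the `ResponseCache` class, the string key (for example a hash of the frame), the TTL, and the size limit are all assumptions.

```typescript
// Small TTL cache keyed by a caller-supplied string (for example a frame hash).
class ResponseCache<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly maxEntries: number;
  private readonly now: () => number;

  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number, maxEntries: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
    this.now = now;
  }

  // Return the cached value, or undefined if it is missing or expired.
  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  // Store a value, evicting the oldest entry when the cache is full.
  set(key: string, value: T): void {
    this.entries.delete(key); // re-insert so the key moves to the newest position
    if (this.entries.size >= this.maxEntries) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

A short TTL matters here because a cached scene description goes stale quickly once the user starts moving.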

Challenges We Faced

Real-time Speed: Running two AI models (Gemini and OpenAI) while keeping conversations fast required optimization of frame encoding and processing.
Audio Overlap: We built a priority-based audio system to make sure multiple voice outputs never speak over each other.
Voice-Only Design: Creating an app that works completely by voice required rethinking every part of the user experience.
Synchronization: Coordinating camera input, AI responses, and user interaction at the same time was complex.
Cost Optimization: We used a pre-trained YOLOE model to compare consecutive frames and called the AI APIs only when it detected a change. This reduced costs and improved speed.
Immediate Safety Response: When the user is walking, the app must alert them instantly to avoid danger. This required using faster, lightweight AI models.
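
The priority-based audio system mentioned under Audio Overlap could be sketched roughly as follows. The three priority levels, the `SpeechScheduler` class, and the `speak` and `cancel` callbacks are hypothetical; the app's real levels and audio backend are not described in this writeup.

```typescript
// Hypothetical priority levels; lower rank means more urgent.
type Priority = "safety" | "conversation" | "description";
const RANK: Record<Priority, number> = { safety: 0, conversation: 1, description: 2 };

interface Utterance {
  text: string;
  priority: Priority;
}

class SpeechScheduler {
  private queue: Utterance[] = [];
  private current: Utterance | undefined;
  private readonly speak: (u: Utterance) => void;
  private readonly cancel: () => void;

  // `speak` starts playback; `cancel` stops it. Both are supplied by the audio layer.
  constructor(speak: (u: Utterance) => void, cancel: () => void) {
    this.speak = speak;
    this.cancel = cancel;
  }

  enqueue(u: Utterance): void {
    if (this.current && RANK[u.priority] < RANK[this.current.priority]) {
      // A more urgent message interrupts whatever is playing; the interrupted one is dropped.
      this.cancel();
      this.current = undefined;
    }
    this.queue.push(u);
    // Array.prototype.sort is stable, so order within one priority level stays first-in, first-out.
    this.queue.sort((a, b) => RANK[a.priority] - RANK[b.priority]);
    this.playNext();
  }

  // Call this when the audio layer reports that playback ended normally.
  onFinished(): void {
    this.current = undefined;
    this.playNext();
  }

  // Start the most urgent queued utterance unless something is already playing.
  private playNext(): void {
    if (this.current) return;
    const next = this.queue.shift();
    if (!next) return;
    this.current = next;
    this.speak(next);
  }
}
```

Only one utterance is ever marked as current, which is what prevents overlapping speech. The sketch assumes `cancel` does not itself trigger `onFinished`; an audio backend that fires an end event on cancellation would need that event filtered out.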

Accomplishments We're Proud Of

Built a working MVP that connects live camera input, AI vision, and natural conversation into one experience
Created a voice-first interface that allows full hands-free use
Combined Gemini 2.0 Flash and OpenAI Realtime API seamlessly for both seeing and speaking
Designed a React hook and Socket.io system that can scale to mobile and desktop versions
Most importantly, built a tool that can help visually impaired users live more independently

What We Learned

Accessibility is not a feature—it's empathy turned into design.
Real-time AI systems need precise timing between vision, speech, and network events
A good user experience depends on how quickly and clearly the system responds
Testing in real-world conditions like noise and poor lighting taught us more than simulations ever could
Building for accessibility changes how we think about technology—it becomes about inclusion, not innovation alone

What's Next for VisualAID – AAI

The goal of VisualAID goes beyond building an MVP. We want to create a complete AI Accessibility Initiative (AAI) that helps visually impaired users in all aspects of daily life.

Upcoming Features

Navigation Assistance: Integration with mapping APIs for route guidance and obstacle detection
Emergency Alerts: Automatically notify contacts in case of danger
Multi-language Support: Make the system usable globally
Mobile Applications: Native apps for Android and iOS for real-world accessibility
Offline Mode: On-device AI for areas with poor internet connection

Our mission is simple: to make technology that doesn't just assist people—but truly empowers them.

Built With

api
gemini
openai
postgresql
react
typescript

Updates

Shivam Shivam started this project — Oct 26, 2025 01:01 PM EDT
