Inspiration
We wanted to build a tool that helps visually impaired people understand their surroundings in real time. Motivated by everyday accessibility challenges, we combined AI-powered image understanding with voice control and audio feedback to create a fully hands-free assistant.
What it does
VisionAid lets users say "capture" to take a picture from a live camera feed. It then uses AI to generate a description of the image and speaks that description aloud, so visually impaired users can “see” through sound.
How we built it
We used:
- Flask for building the backend API
- Python for backend logic and AI integration
- Gemini’s vision model to generate image descriptions
- MediaDevices API (via JavaScript) to access the webcam
- Web Speech API for speech recognition (to detect "capture") and speech synthesis
- Gunicorn as the production WSGI server for the Flask app
- Render for backend hosting
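
To make the backend step concrete, here is a minimal Python sketch of what could happen when a captured frame reaches the server. It is not our actual code: it assumes the browser sends the frame as a base64 data URL (for example from a canvas), and `decode_data_url`, `handle_capture`, and the `describe` callable are hypothetical names. `describe` stands in for the call to the Gemini vision model, and the returned dictionary is the kind of payload a Flask route could send back as JSON for the browser to speak.

```python
import base64
import binascii
from typing import Callable


def decode_data_url(data_url: str) -> tuple[str, bytes]:
    """Split a base64 data URL into (mime_type, raw_bytes)."""
    header, sep, payload = data_url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URL")
    # Everything between "data:" and ";base64" is the MIME type.
    mime_type = header[len("data:"):-len(";base64")] or "application/octet-stream"
    try:
        raw = base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        raise ValueError("payload is not valid base64") from exc
    return mime_type, raw


def handle_capture(data_url: str, describe: Callable[[bytes, str], str]) -> dict:
    """Decode the frame, ask the (hypothetical) vision helper for a description,
    and return a JSON-serializable payload."""
    mime_type, image_bytes = decode_data_url(data_url)
    if not mime_type.startswith("image/"):
        raise ValueError(f"unsupported content type: {mime_type}")
    # `describe` is an assumption: a wrapper around the vision model call.
    description = describe(image_bytes, mime_type).strip()
    return {"description": description}
```

Passing the model call in as a parameter keeps the decoding and validation logic testable without network access or an API key.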
Challenges we ran into
- Integrating voice input with real-time camera capture
- Ensuring browser permissions for microphone and camera worked reliably across platforms
- Deploying the backend and frontend to work seamlessly together
- Managing API calls to return timely and accurate descriptions
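
One way to keep descriptions timely is to put a deadline on the model call and fall back to a short spoken message when it is missed. The sketch below illustrates that idea with Python's standard library only. The names `describe_with_timeout` and `FALLBACK`, the pool size, and the 8-second default are assumptions for illustration, not values from our deployment.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable

# Hypothetical message the frontend would speak when the model is too slow.
FALLBACK = "Sorry, the description is taking too long. Please say capture again."

# A small shared pool; the size is an arbitrary choice for this sketch.
_executor = ThreadPoolExecutor(max_workers=4)


def describe_with_timeout(
    describe: Callable[[bytes], str],
    image_bytes: bytes,
    timeout_s: float = 8.0,
) -> str:
    """Run `describe` in a worker thread and wait at most `timeout_s` seconds."""
    future = _executor.submit(describe, image_bytes)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only prevents a call that has not started yet; a call that
        # is already running keeps going in the background and its result is
        # discarded.
        future.cancel()
        return FALLBACK
```

Errors raised inside `describe` still propagate to the caller, so a real route would also need to decide what the user hears when the model call fails outright.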
Accomplishments that we're proud of
- Achieving fully hands-free functionality with a single voice command
- Creating a real-time assistive experience using camera, AI, and audio
- Seamless interaction between frontend and backend services
- A usable solution for people who rely on sound over sight
What we learned
- How to integrate voice, vision, and audio feedback into one smooth workflow
- How to handle asynchronous browser APIs like webcam and voice
- Real-world accessibility testing principles
What's next for VisionAid
- Adding OCR to read printed or handwritten text
- Object detection to highlight specific items in the frame
- Packaging as a mobile app for real-world portability
- Multi-language voice support for accessibility across regions
