Inspiration
The inspiration for Vision-Assist-AI came from a desire to empower visually impaired individuals by providing seamless, voice-controlled assistance for everyday tasks. By leveraging advanced video analysis and conversational AI, we aim to enhance independence and accessibility for users.
What it does
Vision-Assist-AI is a multi-functional assistant designed for visually impaired users. It features:
- Real-Time Video Analyzer: Uses live camera feeds and Gemini Flash 2.0 to analyze surroundings in real time and provide audio feedback about the environment.
- Voice-Based Navigation: Allows users to control the app hands-free through voice commands, such as switching between features with phrases like "Go to Scan" or "Go to GPT."
- Voice Assistant (GPT): A conversational assistant powered by GPT, capable of answering user queries, providing information, and assisting with general questions.
How we built it
We developed Vision-Assist-AI using:
- Front-end: Built with HTML, JavaScript, and Tailwind CSS for a clean, responsive, and accessible user interface.
- Back-end: Powered by FastAPI for fast and lightweight server-side processing.
- Voice and Video Integration:
  - Gemini Flash 2.0: Handles real-time video analysis with exceptional speed and accuracy.
  - Web Speech API: Captures voice commands for navigation and synthesizes spoken responses.
- AI Integration: Utilizes GPT to provide conversational assistance and answer user queries.
Features
1. Real-Time Video Analyzer
The real-time video analyzer leverages Gemini Flash 2.0 to:
- Analyze the live camera feed and describe the user's surroundings in real time.
- Provide instant and accurate feedback about the environment, such as "There are three people around you, and a chair in front of you."
- Respond quickly with high precision, ensuring a seamless experience for users.
This feature is designed to give visually impaired individuals greater awareness of their environment through audio feedback.
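The analyzer loop described above can be sketched as follows. This is a minimal, hedged illustration: `analyze_frame` is a hypothetical stand-in for the call to the Gemini Flash 2.0 API, and `frames` stands in for the live camera feed; neither name comes from the actual codebase.

```python
import time


def analyze_frame(frame: bytes) -> str:
    """Hypothetical stand-in for a Gemini Flash 2.0 API call.

    A real implementation would send the encoded camera frame to the
    model and return its scene description.
    """
    return "There are three people around you, and a chair in front of you."


def describe_surroundings(frames, speak, interval_s: float = 1.0) -> None:
    """Poll the camera feed at a fixed interval and voice each description.

    `speak` is a callback that hands the description text to speech
    synthesis on the front end.
    """
    for frame in frames:
        description = analyze_frame(frame)
        speak(description)
        if interval_s:
            time.sleep(interval_s)
```

In practice the interval trades latency for API cost: a shorter `interval_s` gives more responsive feedback but issues more model calls.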
2. Voice-Based Navigation
Using the Web Speech API, Vision-Assist-AI allows users to navigate the app through simple voice commands:
- "Go to Scan": Activates the real-time video analyzer to describe surroundings.
- "Go to GPT": Switches to the voice assistant for answering queries.
The voice-based navigation eliminates the need for physical interaction, making it highly accessible for visually impaired users.
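Once the Web Speech API has recognized a phrase, it yields a plain transcript string that can be routed to the matching feature. A minimal sketch of that routing step (the function and mode names are illustrative, not taken from the actual code):

```python
from typing import Optional

# Recognized phrases mapped to app modes; matching is case-insensitive
# and tolerant of surrounding whitespace.
COMMANDS = {
    "go to scan": "scan",  # real-time video analyzer
    "go to gpt": "gpt",    # conversational voice assistant
}


def route_command(transcript: str) -> Optional[str]:
    """Map a recognized voice phrase to an app mode, or None if unknown."""
    return COMMANDS.get(transcript.strip().lower())
```

Unknown phrases return `None`, which lets the app fall back to a spoken "command not recognized" prompt instead of silently ignoring the user.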
3. Voice Assistant (GPT)
The GPT-powered voice assistant answers user queries and provides helpful information. It:
- Responds to a wide range of questions, from general knowledge to conversational queries.
- Assists users with problem-solving or information gathering using natural and intuitive communication.
- Provides all feedback through audio responses, ensuring ease of use.
Challenges we ran into
- Ensuring the real-time video analyzer performed accurately and quickly using the Gemini Flash 2.0 API.
- Building a seamless voice-based navigation system that integrates well with all features.
- Designing an accessible and intuitive interface for users with visual impairments.
Accomplishments that we're proud of
- Successfully implementing real-time video analysis and voice-based navigation.
- Creating a voice assistant that delivers accurate and helpful responses to user queries.
- Building a cohesive system that integrates voice and video technologies effectively.
What we learned
- How to optimize real-time video analysis for smooth and fast performance.
- The importance of designing for accessibility and ensuring the app is user-friendly for visually impaired individuals.
- How to integrate multiple APIs (Gemini Flash 2.0 and the Web Speech API) into a unified experience.
What's next for Vision-Assist-AI
- Enhancing the real-time video analyzer for even more detailed scene analysis.
- Adding multi-language support for voice commands and audio responses.
- Collaborating with organizations to make Vision-Assist-AI accessible to a broader audience.
Note
This project is built entirely using the Gemini Flash 2.0 model, which was released just one month ago. It is the latest state-of-the-art model for analyzing video footage in real time with exceptional speed and accuracy. Gemini Flash 2.0 serves as the backbone for the real-time video analyzer, enabling precise and responsive feedback for users.
Built With
- fastapi
- gemini
- html
- javascript
- python
- speech-to-text
- tailwind

