Inspiration
The inspiration for Vision-Assist-AI came from a desire to empower visually impaired individuals by providing seamless, voice-controlled assistance for everyday tasks. By leveraging advanced video analysis and conversational AI, we aim to enhance independence and accessibility for users.
What it does
Vision-Assist-AI is a multi-functional assistant designed for visually impaired users. It features:
- Real-Time Video Analyzer: Uses live camera feeds and Gemini Flash 2.0 to analyze surroundings in real time and provide audio feedback about the environment.
- Voice-Based Navigation: Allows users to control the app hands-free through voice commands, such as switching between features with phrases like "Go to Scan" or "Go to GPT."
- Voice Assistant (GPT): A conversational assistant powered by GPT, capable of answering user queries, providing information, and assisting with general questions.
How we built it
We developed Vision-Assist-AI using:
- Front-end: Built with HTML, JavaScript, and Tailwind CSS for a clean, responsive, and accessible user interface.
- Back-end: Powered by FastAPI for fast and lightweight server-side processing.
- Voice and Video Integration:
  - Gemini Flash 2.0: Handles real-time video analysis with exceptional speed and accuracy.
  - Web Speech API: Captures voice commands for navigation and synthesizes spoken responses.
- AI Integration: Utilizes GPT to provide conversational assistance and answer user queries.
Features
1. Real-Time Video Analyzer
The real-time video analyzer leverages Gemini Flash 2.0 to:
- Analyze the live camera feed and describe the user's surroundings in real time.
- Provide instant and accurate feedback about the environment, such as "There are three people around you, and a chair in front of you."
- Respond quickly with high precision, ensuring a seamless experience for users.
This feature is designed to give visually impaired individuals greater awareness of their environment through audio feedback.
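The analyzer loop described above can be sketched as follows. This is a minimal, hedged illustration: `analyze_frame` is a hypothetical stand-in for the call to the Gemini Flash 2.0 API, and `frames` stands in for the live camera feed; neither name comes from the actual codebase.

```python
import time


def analyze_frame(frame: bytes) -> str:
    """Hypothetical stand-in for a Gemini Flash 2.0 API call.

    A real implementation would send the encoded camera frame to the
    model and return its scene description.
    """
    return "There are three people around you, and a chair in front of you."


def describe_surroundings(frames, speak, interval_s: float = 1.0) -> None:
    """Poll the camera feed at a fixed interval and voice each description.

    `speak` is a callback that hands the description text to speech
    synthesis on the front end.
    """
    for frame in frames:
        description = analyze_frame(frame)
        speak(description)
        if interval_s:
            time.sleep(interval_s)
```

In practice the interval trades latency for API cost: a shorter `interval_s` gives more responsive feedback but issues more model calls.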
2. Voice-Based Navigation
Using the Web Speech API, Vision-Assist-AI allows users to navigate the app through simple voice commands:
- "Go to Scan": Activates the real-time video analyzer to describe surroundings.
- "Go to GPT": Switches to the voice assistant for answering queries.
The voice-based navigation eliminates the need for physical interaction, making it highly accessible for visually impaired users.
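Once the Web Speech API has recognized a phrase, it yields a plain transcript string that can be routed to the matching feature. A minimal sketch of that routing step (the function and mode names are illustrative, not taken from the actual code):

```python
from typing import Optional

# Recognized phrases mapped to app modes; matching is case-insensitive
# and tolerant of surrounding whitespace.
COMMANDS = {
    "go to scan": "scan",  # real-time video analyzer
    "go to gpt": "gpt",    # conversational voice assistant
}


def route_command(transcript: str) -> Optional[str]:
    """Map a recognized voice phrase to an app mode, or None if unknown."""
    return COMMANDS.get(transcript.strip().lower())
```

Unknown phrases return `None`, which lets the app fall back to a spoken "command not recognized" prompt instead of silently ignoring the user.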
3. Voice Assistant (GPT)
The GPT-powered voice assistant answers user queries and provides helpful information. It:
- Responds to a wide range of questions, from general knowledge to conversational queries.
- Assists users with problem-solving or information gathering using natural and intuitive communication.
- Provides all feedback through audio responses, ensuring ease of use.
Challenges we ran into
- Ensuring the real-time video analyzer performed accurately and quickly using the Gemini Flash 2.0 API.
- Building a seamless voice-based navigation system that integrates well with all features.
- Designing an accessible and intuitive interface for users with visual impairments.
Accomplishments that we're proud of
- Successfully implementing real-time video analysis and voice-based navigation.
- Creating a voice assistant that delivers accurate and helpful responses to user queries.
- Building a cohesive system that integrates voice and video technologies effectively.
What we learned
- How to optimize real-time video analysis for smooth and fast performance.
- The importance of designing for accessibility and ensuring the app is user-friendly for visually impaired individuals.
- How to integrate multiple APIs (Gemini Flash 2.0 and the Web Speech API) into a unified experience.
What's next for Vision-Assist-AI
- Enhancing the real-time video analyzer for even more detailed scene analysis.
- Adding multi-language support for voice commands and audio responses.
- Collaborating with organizations to make Vision-Assist-AI accessible to a broader audience.
Note
This project is built entirely using the Gemini Flash 2.0 model, which was released just one month ago. It is the latest state-of-the-art model for analyzing video footage in real time with exceptional speed and accuracy. Gemini Flash 2.0 serves as the backbone for the real-time video analyzer, enabling precise and responsive feedback for users.
Built With
- fastapi
- gemini
- html
- javascript
- python
- speech-to-text
- tailwind

