Inspiration
The inspiration for Vision-Assist-AI came from a desire to empower visually impaired individuals by providing seamless, voice-controlled assistance for everyday tasks. By leveraging advanced video analysis and conversational AI, we aim to enhance independence and accessibility for users.
What it does
Vision-Assist-AI is a multi-functional assistant designed for visually impaired users. It features:
- Real-Time Video Analyzer: Uses live camera feeds and Gemini Flash 2.0 to analyze surroundings in real time and provide audio feedback about the environment.
- Voice-Based Navigation: Allows users to control the app hands-free through voice commands, such as switching between features with phrases like "Go to Scan" or "Go to GPT."
- Voice Assistant (GPT): A conversational assistant powered by GPT, capable of answering user queries, providing information, and assisting with general questions.
How we built it
We developed Vision-Assist-AI using:
- Front-end: Built with HTML, JavaScript, and Tailwind CSS for a clean, responsive, and accessible user interface.
- Back-end: Powered by FastAPI for fast and lightweight server-side processing.
- Voice and Video Integration:
  - Gemini Flash 2.0: Handles real-time video analysis with exceptional speed and accuracy.
  - Web Speech API: Processes voice commands for navigation and synthesizes speech for responses.
- AI Integration: Utilizes GPT to provide conversational assistance and answer user queries.
Features
1. Real-Time Video Analyzer
The real-time video analyzer leverages Gemini Flash 2.0 to:
- Analyze the live camera feed and describe the user's surroundings in real time.
- Provide instant and accurate feedback about the environment, such as "There are three people around you, and a chair in front of you."
- Respond quickly with high precision, ensuring a seamless experience for users.
This feature is designed to give visually impaired individuals greater awareness of their environment through audio feedback.
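A practical detail behind "real time": sending every camera frame to a vision model would add cost and latency, so the feed is typically sampled at a fixed interval. A minimal sketch of that throttling idea (the class name and 2-second default are our assumptions, not details from the app):

```python
import time

class FrameThrottle:
    """Decide which camera frames to forward for analysis.

    Only forwards a frame if at least `interval` seconds have passed
    since the last frame we analyzed; everything else is dropped.
    """

    def __init__(self, interval: float = 2.0, clock=time.monotonic):
        self.interval = interval
        self._clock = clock            # injectable clock, handy for testing
        self._last_sent = float("-inf")

    def should_analyze(self) -> bool:
        now = self._clock()
        if now - self._last_sent >= self.interval:
            self._last_sent = now
            return True
        return False
```

Tuning the interval trades environmental freshness against API load; a couple of seconds is usually enough for spoken scene descriptions.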
2. Voice-Based Navigation
Using the Web Speech API, Vision-Assist-AI allows users to navigate the app through simple voice commands:
- "Go to Scan": Activates the real-time video analyzer to describe surroundings.
- "Go to GPT": Switches to the voice assistant for answering queries.
The voice-based navigation eliminates the need for physical interaction, making it highly accessible for visually impaired users.
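In the app itself, speech recognition runs in the browser; the interesting logic is mapping a recognized transcript to a view. A Python sketch of that routing step, assuming hypothetical view names (the command phrases are the ones quoted above; transcripts are normalized because recognizers vary in case and spacing):

```python
from typing import Optional

# Spoken phrase -> view name (view names are illustrative assumptions).
COMMANDS = {
    "go to scan": "scanner",   # real-time video analyzer
    "go to gpt": "assistant",  # conversational voice assistant
}

def route_command(transcript: str) -> Optional[str]:
    """Map a recognized speech transcript to a view name.

    Lowercases and collapses whitespace before matching; returns None
    for unrecognized phrases so the caller can simply ignore them.
    """
    phrase = " ".join(transcript.lower().split())
    return COMMANDS.get(phrase)
```

Returning None for unknown phrases (rather than raising) matters here: a user talking near the device should never crash navigation.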
3. Voice Assistant (GPT)
The GPT-powered voice assistant answers user queries and provides helpful information. It:
- Responds to a wide range of questions, from general knowledge to conversational queries.
- Assists users with problem-solving or information gathering using natural and intuitive communication.
- Provides all feedback through audio responses, ensuring ease of use.
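For conversational queries, the assistant has to resend recent context with every model request. One minimal way to manage that history (the helper name and the ten-turn cap are our illustrative assumptions, not details from the app):

```python
def add_turn(history: list, role: str, content: str, max_turns: int = 10) -> list:
    """Append one chat turn and trim the oldest entries.

    Each turn is a {"role": ..., "content": ...} dict, the common shape
    for chat-completion APIs. Capping the list keeps the payload sent to
    the model small and the responses fast.
    """
    history.append({"role": role, "content": content})
    return history[-max_turns:]
```

The cap is a latency/coherence trade-off: more turns give the assistant longer memory, fewer turns keep each request cheap.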
Challenges we ran into
- Ensuring the real-time video analyzer performed accurately and quickly using the Gemini Flash 2.0 API.
- Building a seamless voice-based navigation system that integrates well with all features.
- Designing an accessible and intuitive interface for users with visual impairments.
Accomplishments that we're proud of
- Successfully implementing real-time video analysis and voice-based navigation.
- Creating a voice assistant that delivers accurate and helpful responses to user queries.
- Building a cohesive system that integrates voice and video technologies effectively.
What we learned
- How to optimize real-time video analysis for smooth and fast performance.
- The importance of designing for accessibility and ensuring the app is user-friendly for visually impaired individuals.
- How to integrate multiple APIs (Gemini Flash 2.0 and the Web Speech API) into a unified experience.
What's next for Vision-Assist-AI
Affordable Smart Glasses: Our primary focus is developing dedicated, low-cost smart glasses powered by Gemini Live, bringing seamless, hands-free assistance to all users and making assistive tech accessible to everyone.
Note
This project is built entirely using the Gemini Flash 2.0 model, which was released just two months ago. It is the latest model for analyzing video footage in real time with exceptional speed and accuracy. Gemini Flash 2.0 serves as the backbone for the real-time video analyzer, enabling precise and responsive feedback for users.
Built With
- api
- fastapi
- gemini
- html5
- javascript
- python
- speech-to-text
- tailwind
- vercel
