Inspiration

The inspiration for Vision-Assist-AI came from a desire to empower visually impaired individuals by providing seamless, voice-controlled assistance for everyday tasks. By leveraging advanced video analysis and conversational AI, we aim to enhance independence and accessibility for users.

What it does

Vision-Assist-AI is a multi-functional assistant designed for visually impaired users. It features:

  1. Real-Time Video Analyzer: Uses live camera feeds and Gemini Flash 2.0 to analyze surroundings in real time and provide audio feedback about the environment.
  2. Voice-Based Navigation: Allows users to control the app hands-free through voice commands, such as switching between features with phrases like "Go to Scan" or "Go to GPT."
  3. Voice Assistant (GPT): A conversational assistant powered by GPT that answers user questions, from general knowledge to everyday queries, and delivers every response as audio.

How we built it

We developed Vision-Assist-AI using:

  • Front-end: Built with HTML, JavaScript, and Tailwind CSS for a clean, responsive, and accessible user interface.
  • Back-end: Powered by FastAPI for fast and lightweight server-side processing.
  • Voice and Video Integration:
    • Gemini Flash 2.0: Handles real-time video analysis with exceptional speed and accuracy.
    • Web Speech API: Handles speech recognition for voice commands and speech synthesis for spoken responses.
  • AI Integration: Utilizes GPT to provide conversational assistance and answer user queries.
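
On the back end, the hand-off from a camera frame to Gemini amounts to packaging one JPEG frame into a `generateContent` request body. A minimal sketch of that step, following the Gemini REST API's inline-data shape (the prompt text and helper name are illustrative, not taken from the project):

```python
import base64

# Hypothetical prompt; the app's actual prompt is its own design choice.
SCENE_PROMPT = "Describe the surroundings for a visually impaired user."

def build_gemini_request(jpeg_bytes: bytes, prompt: str = SCENE_PROMPT) -> dict:
    """Package one camera frame as a generateContent request body.

    The backend would POST this JSON to the Gemini Flash model's
    generateContent endpoint and read the description from the response.
    """
    return {
        "contents": [{
            "parts": [
                {"text": prompt},
                {"inline_data": {
                    "mime_type": "image/jpeg",
                    "data": base64.b64encode(jpeg_bytes).decode("ascii"),
                }},
            ]
        }]
    }
```

The text part carries the instruction and the inline-data part carries the base64-encoded frame, so each request is self-contained and needs no file upload step.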

Features

1. Real-Time Video Analyzer

The real-time video analyzer leverages Gemini Flash 2.0 to:

  • Analyze the live camera feed and describe the user's surroundings in real time.
  • Provide instant and accurate feedback about the environment, such as "There are three people around you, and a chair in front of you."
  • Return descriptions with low enough latency that the audio feedback keeps pace with what the camera sees.

This feature is designed to give visually impaired individuals greater awareness of their environment through audio feedback.
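
Analyzing every frame would overwhelm both the API and the listener, so some form of throttling is needed between the camera feed and the model. A minimal sketch of that idea (the interval value and class name are illustrative, not the project's actual implementation):

```python
import time

class FrameThrottle:
    """Forward at most one camera frame per `min_interval` seconds.

    Keeps API usage and spoken feedback at a pace a listener can
    follow; frames arriving sooner than the interval are dropped.
    """

    def __init__(self, min_interval: float = 3.0, clock=time.monotonic):
        self.min_interval = min_interval
        self._clock = clock               # injectable for testing
        self._last_sent = float("-inf")   # so the first frame always passes

    def should_send(self) -> bool:
        now = self._clock()
        if now - self._last_sent >= self.min_interval:
            self._last_sent = now
            return True
        return False
```

The capture loop simply asks `should_send()` for each frame and only builds a Gemini request when it returns True.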

2. Voice-Based Navigation

Using the Web Speech API, Vision-Assist-AI allows users to navigate the app through simple voice commands:

  • "Go to Scan": Activates the real-time video analyzer to describe surroundings.
  • "Go to GPT": Switches to the voice assistant for answering queries.

The voice-based navigation eliminates the need for physical interaction, making it highly accessible for visually impaired users.

3. Voice Assistant (GPT)

The GPT-powered voice assistant answers user queries and provides helpful information. It:

  • Responds to a wide range of questions, from general knowledge to conversational queries.
  • Assists users with problem-solving or information gathering using natural and intuitive communication.
  • Provides all feedback through audio responses, ensuring ease of use.
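
Speech synthesis engines generally handle long replies better as sentence-sized utterances, so a long GPT answer benefits from being split before it is spoken. A minimal chunker along those lines (the splitting rules and length limit are assumptions, not taken from the app):

```python
import re

def split_for_speech(reply: str, max_len: int = 200) -> list[str]:
    """Split a GPT reply into utterance-sized chunks for TTS.

    Splits on sentence boundaries first, then packs sentences into
    chunks of at most `max_len` characters so each utterance stays
    short enough to synthesize smoothly.
    """
    sentences = re.split(r"(?<=[.!?])\s+", reply.strip())
    chunks, current = [], ""
    for s in sentences:
        if not s:
            continue
        if current and len(current) + 1 + len(s) > max_len:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be queued as its own utterance, which also lets the app stop speaking cleanly when the user interrupts with a new command.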

Challenges we ran into

  • Ensuring the real-time video analyzer performed accurately and quickly using the Gemini Flash 2.0 API.
  • Building a seamless voice-based navigation system that integrates well with all features.
  • Designing an accessible and intuitive interface for users with visual impairments.

Accomplishments that we're proud of

  • Successfully implementing real-time video analysis and voice-based navigation.
  • Creating a voice assistant that delivers accurate and helpful responses to user queries.
  • Building a cohesive system that integrates voice and video technologies effectively.

What we learned

  • How to optimize real-time video analysis for smooth and fast performance.
  • The importance of designing for accessibility and ensuring the app is user-friendly for visually impaired individuals.
  • How to integrate multiple APIs (Gemini Flash 2.0 and the Web Speech API) into a unified experience.

What's next for Vision-Assist-AI

  • Enhancing the real-time video analyzer for even more detailed scene analysis.
  • Adding multi-language support for voice commands and audio responses.
  • Collaborating with organizations to make Vision-Assist-AI accessible to a broader audience.

Note

This project is built entirely using the Gemini Flash 2.0 model, which was released just one month ago. It is the latest state-of-the-art model for analyzing video footage in real time with exceptional speed and accuracy. Gemini Flash 2.0 serves as the backbone for the real-time video analyzer, enabling precise and responsive feedback for users.

Built With

HTML, JavaScript, Tailwind CSS, FastAPI, Gemini Flash 2.0, GPT, Web Speech API
