WhatsAI - Agentic GenAI Interface with Google Gemini for Blind Users
Overview
WhatsAI is an AI-powered interactive assistant for blind and low-vision users, built on Google Gemini and Meta Ray-Ban smart glasses. The system enables hands-free interaction with real-world objects: the user holds up a number of fingers, and that count selects which question the AI answers about the scene.
Key Features
- Finger Counting Interaction: Uses the built-in camera to detect the number of fingers shown and triggers different AI-driven responses based on the count.
- Meta Ray-Ban Integration: Connects with Meta Ray-Ban smart glasses to capture real-time scenes, allowing users to interact with their surroundings seamlessly.
- Google Gemini AI: Processes images and audio to generate intelligent, context-aware responses about objects in the scene.
- Hands-Free Assistance: Provides descriptions, locations, sizes, and distances of objects, empowering blind users to navigate their environment independently.
How It Works
- The Meta Ray-Ban glasses capture the real-world scene and stream it to a connected PC.
- The PC processes the video feed using OpenCV and detects the number of fingers shown.
- The detected finger count maps to a predefined set of AI queries:
  - 1 Finger: Describe the color of objects.
  - 2 Fingers: Provide the egocentric location of objects.
  - 3 Fingers: Identify object names.
  - 4 Fingers: Estimate the size of objects.
  - 5 Fingers: Determine the distance of objects.
- The image and selected query are sent to Google Gemini for processing.
- The AI-generated response is converted to speech and played back to the user.
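The finger-count step above amounts to a lookup from a detected count to a text query. A minimal sketch of that lookup follows; the prompt wording and the names `FINGER_PROMPTS` and `prompt_for` are assumptions for illustration, and only the five categories come from the list above.

```python
# Hypothetical prompt wording; only the five categories come from the project description.
FINGER_PROMPTS = {
    1: "Describe the color of each object in this image.",
    2: "Describe where each object is relative to the viewer (left, right, ahead).",
    3: "Name the objects in this image.",
    4: "Estimate the size of each object in this image.",
    5: "Estimate how far each object is from the viewer.",
}


def prompt_for(finger_count):
    """Return the query for a detected finger count, or None if the count is unmapped."""
    return FINGER_PROMPTS.get(finger_count)


if __name__ == "__main__":
    # Counts outside 1-5 (no hand, or a misdetection) yield None and send nothing.
    for count in range(0, 7):
        print(count, "->", prompt_for(count))
```

Returning None for unmapped counts lets the caller skip the Gemini request entirely when no valid gesture is visible.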
Technology Stack
- Hardware: Meta Ray-Ban Smart Glasses, PC with camera
- Software:
  - Google Gemini API (Vision & Audio Processing)
  - OpenCV (Image Processing)
  - PyAudio (Audio Input & Output)
  - Pillow & mss (Screen Capture & Image Processing)
  - Python (Async Processing & API Integration)
Challenges Faced
- Ensuring accurate and real-time finger counting in varying lighting conditions.
- Reducing latency in processing AI responses to provide a smooth user experience.
- Handling different camera angles and occlusions for consistent hand tracking.
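One common way to cope with flickering per-frame counts under poor lighting or partial occlusion is to accept a count only after it wins a majority vote over recent frames. The sketch below illustrates that idea; it is not taken from the project's code, and the class name `FingerCountSmoother` and the default window sizes are assumptions.

```python
from collections import Counter, deque


class FingerCountSmoother:
    """Report a finger count only once it dominates the most recent frames."""

    def __init__(self, window=15, min_votes=10):
        # Keep only the last `window` per-frame detections.
        self.history = deque(maxlen=window)
        self.min_votes = min_votes

    def update(self, count):
        """Add one per-frame count; return the stable count, or None if none is stable yet."""
        self.history.append(count)
        value, votes = Counter(self.history).most_common(1)[0]
        return value if votes >= self.min_votes else None


if __name__ == "__main__":
    smoother = FingerCountSmoother(window=5, min_votes=3)
    for detected in [2, 3, 2, 2, 5]:
        print(detected, "->", smoother.update(detected))
```

A larger window gives steadier results but makes the system slower to react to a new gesture, so the two parameters trade accuracy against latency.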
Future Improvements
- Integrate haptic feedback for enhanced interaction.
- Expand gesture-based commands beyond finger counting.
- Improve response latency with edge processing techniques.
- Develop a mobile version for standalone use with smart glasses.
How to Run
Installation
pip install google-genai opencv-python pyaudio pillow mss python-dotenv
Setup
Create a .env file in the project directory and store your Google Gemini API key in it, so the key is loaded as an environment variable when the application starts.
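To check the setup before launching, a small script can confirm that a key is visible to Python. This is a sketch under assumptions: the variable names `GEMINI_API_KEY` and `GOOGLE_API_KEY` are guesses (check gemini_live.py for the name it actually reads), and the helper names are hypothetical. Under that assumption, the .env file would contain a line such as `GEMINI_API_KEY=your-key-here`.

```python
import os

# Assumed variable names; check gemini_live.py for the name it actually reads.
CANDIDATE_KEY_NAMES = ("GEMINI_API_KEY", "GOOGLE_API_KEY")


def load_env_file():
    """Copy entries from a local .env file into os.environ if python-dotenv is installed."""
    try:
        from dotenv import load_dotenv
    except ImportError:
        return False  # Fall back to variables already set in the shell.
    return load_dotenv()


def find_api_key(env=None):
    """Return the first non-empty API key found, or raise a clear error."""
    env = os.environ if env is None else env
    for name in CANDIDATE_KEY_NAMES:
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError(
        "No Gemini API key found; expected one of: " + ", ".join(CANDIDATE_KEY_NAMES)
    )
```

Calling `load_env_file()` and then `find_api_key()` from a Python shell in the project directory either returns the key or fails with a message naming the variables it looked for.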
Run the Application
python gemini_live.py
Team
Developed as a hackathon project to improve accessibility for blind and low-vision users using AI-driven real-world interactions. 🚀
Built With
- deepmind
- windsurf