Foresight

Foresight is an accessibility accelerator for visually impaired users. It combines ShareGPT4V for visual analysis, Detectron2 for object grounding, and the Gemma model for response generation, which together reduce hallucinations and improve accuracy. With speech-to-text input and text-to-speech output, Foresight delivers context-aware assistance and identifies 12% more information than the base model, making AI support more reliable in real-world scenarios. The project was awarded first place at the innovation fair at my undergraduate university.

Key Features

Image Analysis: Capture photos and ask questions about them
Multimodal AI: Combines vision and language models for accurate scene understanding
Object Grounding: Highlights objects in images with distinct colors
Voice Interface: Supports both speech-to-text for queries and text-to-speech for responses
Dynamic Conversations: Users can ask follow-up questions about captured images
Reduced Hallucinations: Combines multiple AI models to provide more accurate responses
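
As an illustration of the object grounding feature, the following is a minimal sketch of one way to give each detected object class its own highlight color. The function name distinct_colors and the choice of evenly spaced hues are assumptions made for this example, not a description of how Foresight actually picks its colors.

```python
import colorsys


def distinct_colors(labels):
    """Map each unique label to an RGB tuple using evenly spaced hues.

    Hypothetical helper: the real project may choose colors differently.
    """
    unique = sorted(set(labels))
    colors = {}
    for i, label in enumerate(unique):
        # Spread hues evenly around the color wheel so classes stay distinguishable.
        hue = i / max(len(unique), 1)
        r, g, b = colorsys.hsv_to_rgb(hue, 0.9, 0.95)
        colors[label] = (round(r * 255), round(g * 255), round(b * 255))
    return colors


# Example: repeated labels share a color, different labels get different ones.
palette = distinct_colors(["cup", "chair", "cup", "door"])
```

Sorting the labels keeps the mapping stable between runs, so the same class receives the same color whenever the set of detected classes is unchanged.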

Technology Stack

ShareGPT4V for visual analysis
Detectron2 for object detection and grounding
Gemma model for response generation
Speech-to-text and text-to-speech capabilities
Open-source architecture for customization
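
The way these components could fit together can be sketched as follows. This is a minimal sketch under stated assumptions: describe, detect, and generate are hypothetical wrappers around ShareGPT4V, Detectron2, and Gemma respectively, and the prompt layout, the min_score threshold, and the ForesightSession class are invented for illustration. The idea shown is that the language model is prompted with both the caption and the list of confidently detected objects, and that earlier turns are kept so follow-up questions refer to the same image.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Detection:
    label: str
    score: float


@dataclass
class ForesightSession:
    # All three callables are hypothetical wrappers, injected by the caller.
    describe: Callable[[bytes], str]             # e.g. a ShareGPT4V captioner
    detect: Callable[[bytes], List[Detection]]   # e.g. a Detectron2 predictor
    generate: Callable[[str], str]               # e.g. a Gemma text generator
    min_score: float = 0.5                       # assumed confidence cutoff
    context: str = ""
    history: List[str] = field(default_factory=list)

    def load_image(self, image: bytes) -> None:
        """Analyze a new image and reset the conversation."""
        caption = self.describe(image)
        # Keep only confident detections so weak guesses do not reach the prompt.
        labels = sorted(
            {d.label for d in self.detect(image) if d.score >= self.min_score}
        )
        objects = ", ".join(labels) or "none"
        self.context = f"Caption: {caption}\nDetected objects: {objects}"
        self.history = []

    def ask(self, question: str) -> str:
        """Answer a question, including earlier turns for follow-ups."""
        prompt = "\n".join(
            [self.context, *self.history, f"User: {question}", "Assistant:"]
        )
        answer = self.generate(prompt)
        self.history += [f"User: {question}", f"Assistant: {answer}"]
        return answer
```

In a voice-driven setup, the question passed to ask would come from the speech-to-text step and the returned answer would be handed to text-to-speech. Those steps are left out here because the source does not name the speech components used.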

Impact

Foresight identifies 11% more information in images than base models, making it a practical tool for assisting visually impaired individuals in their daily lives.