Inspiration
Visionary & Multimodal Mate is a comprehensive web application designed to empower visually impaired individuals by providing real-time, context-aware assistance. Through a combination of voice interaction and environmental analysis, Multimodal Mate harnesses data from various sources to offer a personalized and inclusive user experience.
What it does
- Provides voice commands for starting and stopping recording, along with gesture-based controls.
- Analyzes captured images, returning scene descriptions and flagging potential safety concerns.
- Retrieves real-time information through the Perplexity API.
- Offers walking directions through Google Maps integration.
- Supports multilingual interaction with automatic language detection and WaveNet voices.
- Provides haptic and auditory feedback for accessibility.
- Handles errors and API failures gracefully for a seamless user experience.
- Combines audio segments into a single stream for smooth playback (see the sketch after this list).
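To illustrate the multilingual synthesis and audio stitching above, here is a minimal sketch of that path, assuming the google-cloud-texttospeech Python client and MP3 output (MP3 frames are self-contained, so byte-level concatenation plays back cleanly); the voice mapping and the `synthesize_segments` helper are illustrative, not our production code.

```python
from google.cloud import texttospeech

# Illustrative mapping from detected language to a WaveNet voice.
# The voice names are real Google Cloud TTS voices; the mapping is an example.
WAVENET_VOICES = {
    "en": "en-US-Wavenet-D",
    "es": "es-ES-Wavenet-B",
    "fr": "fr-FR-Wavenet-A",
}

def synthesize_segments(segments: list[str], lang: str = "en") -> bytes:
    """Synthesize each text segment with a WaveNet voice matched to the
    detected language, then concatenate the MP3 bytes into one clip."""
    client = texttospeech.TextToSpeechClient()
    voice_name = WAVENET_VOICES.get(lang, WAVENET_VOICES["en"])
    voice = texttospeech.VoiceSelectionParams(
        language_code=voice_name[:5],  # e.g. "en-US"
        name=voice_name,
    )
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )
    chunks = []
    for text in segments:
        response = client.synthesize_speech(
            input=texttospeech.SynthesisInput(text=text),
            voice=voice,
            audio_config=audio_config,
        )
        chunks.append(response.audio_content)
    # MP3 frames are self-contained, so byte-level concatenation
    # plays back as one continuous clip in the browser.
    return b"".join(chunks)
```

The frontend can then play the returned bytes as a single audio source, avoiding audible gaps between segments.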
How we built it
- Frontend: HTML, Tailwind CSS, JavaScript
- Backend: FastAPI (Python)
- APIs and Services:
  - Gemini 1.5 Flash model (voice & image processing; see the sketch after this list)
  - Perplexity API (real-time information retrieval)
  - Google Cloud Text-to-Speech (WaveNet voices)
  - Google Maps Geolocation API
- Deployment: Replit
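As referenced above, here is a minimal sketch of the Gemini 1.5 Flash image call, assuming the google-generativeai Python SDK; the prompt and the `describe_scene` helper are illustrative, and the API key would normally come from an environment variable rather than a literal.

```python
import google.generativeai as genai

# Illustrative setup; in practice the key is read from the environment.
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

def describe_scene(image_bytes: bytes) -> str:
    """Ask Gemini 1.5 Flash to describe a captured camera frame and
    flag safety concerns (illustrative prompt, not our exact wording)."""
    response = model.generate_content([
        {"mime_type": "image/jpeg", "data": image_bytes},
        "Describe this scene for a visually impaired user and list any "
        "immediate safety concerns, in one or two short sentences.",
    ])
    return response.text
```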
Challenges we ran into
- Voice-only navigation guidance: implementing a robust, accurate voice-only navigation system that delivers clear, concise directions to visually impaired users.
- MapBox API limitations: the MapBox API requires precise longitude and latitude coordinates for both the current location and the destination, which is difficult for users unfamiliar with geographic coordinates.
- Google Places API integration: while the Google Places API can convert natural-language destinations to geographic coordinates, it does not always return the most accurate or relevant results, especially in areas with limited geographic data (a sketch of this lookup follows this list).
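To bridge the gap between spoken destinations and the coordinates MapBox expects, the destination can be geocoded first. Here is a minimal sketch against the Places Text Search REST endpoint, assuming the requests library; `destination_to_coords` is an illustrative helper with simplified error handling:

```python
import requests

PLACES_SEARCH_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"

def destination_to_coords(query: str, api_key: str) -> tuple[float, float] | None:
    """Resolve a natural-language destination (e.g. "nearest pharmacy")
    to (latitude, longitude) for the routing request."""
    resp = requests.get(PLACES_SEARCH_URL, params={"query": query, "key": api_key})
    resp.raise_for_status()
    results = resp.json().get("results", [])
    if not results:
        return None  # prompt the user to rephrase instead of guessing
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]
```

Taking the top result keeps the voice flow simple, at the cost of occasionally picking the wrong place in sparse areas, which is exactly the limitation noted above.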
Accomplishments that we're proud of
Leveraged AI to work out which computations could move to the client, significantly reducing server load by running them in the browser instead.
What we learned
How to work with AI agents, and how to vectorize a variety of file formats for efficient AI processing using retrieval-augmented generation (RAG).
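For context, here is a minimal sketch of that vectorization step, assuming text has already been extracted from each file format and using the Gemini SDK's embedding endpoint; the `vectorize` helper is illustrative:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # illustrative; use an env var

def vectorize(chunks: list[str]) -> list[list[float]]:
    """Embed text chunks (extracted from PDFs, images, audio transcripts,
    etc.) so they can be indexed and retrieved for RAG."""
    vectors = []
    for chunk in chunks:
        result = genai.embed_content(
            model="models/text-embedding-004",
            content=chunk,
        )
        vectors.append(result["embedding"])
    return vectors
```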
What's next for Visionary & Multimodal Mate
- Integrate navigation directly within the browser: minimize context switching and provide a more seamless navigation experience.
- Expand multimodal capabilities: explore additional sensory inputs, such as haptic feedback or sound cues, to enhance the experience for visually impaired users.
- Enhance language support: increase the number of supported languages to reach a wider range of users.
- Improve accuracy and reliability: continuously refine the AI models and algorithms behind voice recognition, image analysis, and navigation.
- Explore new features: consider object identification, scene understanding, and real-time alerts for even more comprehensive assistance.
