Inspiration

Since the very beginning of my university life, I have wanted to create an app that lets visually impaired people sense the beauty of this world. After building a basic version in Android Studio, I found that the TensorFlow model I used back then was not capable enough. With the boom in LLMs and the multimodal capabilities of Gemini, the dream of giving visually impaired people a friend by their side who never gets tired, fed up, angry, or annoyed has become a reality. Gemini's multimodal capabilities are really impressive.

What it does

It helps visually impaired people navigate on their own, avoid obstacles, and have a companion by their side 24/7 that describes their surroundings.

How we built it

The application relies on three core Gemini capabilities:

Native Audio Streaming: We eliminate the latency of traditional Speech-to-Text and Text-to-Speech pipelines. VisionGuide sends raw PCM audio and video inputs to Gemini, and the model responds with Native Audio output. This allows for immediate, human-like verbal warnings ("Stop! Pole ahead"), which are critical for user safety.

Function Calling (Tool Use): The model is integrated with a custom tool, updateObstacleProximity. While analyzing the video stream, Gemini autonomously calls this function to update the application state based on the distance of obstacles (Safe, Caution, Danger). This enables the model to control a sonar-like "Beep System" in the UI without needing to speak, keeping the audio channel clear for important commands.

Multimodal Context Switching: The app leverages Gemini's massive context window to perform two distinct tasks in one session. It seamlessly switches between a low-bandwidth "Safety Monitor" mode, where it scans for hazards, and a high-fidelity "Description Mode," where it processes a 2-second video burst to provide vivid, poetic descriptions of the user's surroundings.
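To make the function-calling flow more concrete, here is a minimal sketch of how a tool like updateObstacleProximity might be declared and how the app could map each call to a beep rate. Only the tool name and the three states (Safe, Caution, Danger) come from the description above. The parameter name `level`, the JSON-schema-style declaration shape, the `handleToolCall` helper, and the beep intervals are all assumptions made for illustration, not the project's actual code.

```typescript
// The three proximity states named in the write-up.
type Proximity = "Safe" | "Caution" | "Danger";

// Sketch of a tool declaration in JSON-schema style.
// The parameter name `level` and the description text are assumptions.
const updateObstacleProximityTool = {
  name: "updateObstacleProximity",
  description: "Report how close the nearest obstacle is.",
  parameters: {
    type: "object",
    properties: {
      level: { type: "string", enum: ["Safe", "Caution", "Danger"] },
    },
    required: ["level"],
  },
};

// Assumed beep intervals in milliseconds; null means the beeper is silent.
const BEEP_INTERVAL_MS: Record<Proximity, number | null> = {
  Safe: null,
  Caution: 800,
  Danger: 200,
};

// Type guard so that untrusted tool arguments are validated before use.
function isProximity(value: unknown): value is Proximity {
  return value === "Safe" || value === "Caution" || value === "Danger";
}

// Hypothetical handler for one function call emitted by the model.
// Returns the new beep interval, or throws on an unknown tool or bad argument.
function handleToolCall(
  name: string,
  args: Record<string, unknown>
): number | null {
  if (name !== updateObstacleProximityTool.name) {
    throw new Error(`Unknown tool: ${name}`);
  }
  const level = args["level"];
  if (!isProximity(level)) {
    throw new Error(`Invalid proximity level: ${String(level)}`);
  }
  return BEEP_INTERVAL_MS[level];
}
```

Because the handler only returns an interval, the UI can drive the beeps from a timer while the audio channel stays free for spoken warnings. Validating the argument before use keeps a malformed call from silently switching the beeper off.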

Challenges we ran into

The web version works well as a prototype, but a dedicated app is needed to support all of the functionality.

What we learned

We learned a lot about using AI Studio and the Gemini CLI, and about how to prompt the model to get the most out of it.

What's next for Third Eye

Build the full app using Gemini's multimodal capabilities, with the help of experts.
