Inspiration
Shopping in a supermarket is a simple, relaxing routine for most people, but for individuals who are blind, it is often stressful, disorienting, and dependent on external assistance. Aisles filled with unfamiliar products, changing layouts, and silent obstacles create a world that is hard to navigate independently. We believe that technology should remove barriers, not create new ones.
Our inspiration came from imagining a world where blind users could simply listen to their surroundings: a world where they can hear, in real time, what objects are around them such as “apples on your left,” “milk three steps ahead,” “checkout counter in front.”
We want to give blind individuals the same sense of confidence, freedom, and enjoyment that sighted people experience when exploring a store. Our goal is not just to identify objects, but to enable independent, dignified, and joyful navigation of everyday environments.
What it does
Our system enables blind users to “hear” their surroundings. It uses computer vision and voice interaction to:
- Identify nearby objects in real time, e.g., “There are bananas, bread, and yogurt in front of you.”
- Locate specific items the user wants. The user can say: “Where is the milk?” The system responds: “Milk is ahead on the right, three meters away.”
- Provide safe navigation guidance, e.g., “Shelf on your left,” “Checkout counter ahead,” “Obstacle in front.”
- Support hands-free interaction: the user controls the system entirely by voice, no screen needed.
- Create a more relaxing, independent shopping experience: by turning the visual world into audio cues, the system helps blind users navigate with confidence.
How we built it
We designed and built the system as an end-to-end multimodal pipeline that connects speech understanding, real-time vision perception, and spatial reasoning. Our goal is to enable a blind user to speak their needs and instantly receive audio guidance to the desired object.
1. Voice Command → Text Understanding
The user begins by speaking the name of the item they want to find. This audio stream is sent to our first AI agent, which performs:
- Speech-to-Text (ASR): Converts the raw audio into clean text.
- Keyword Extraction: Identifies the essential object name (e.g., “apple”, “milk”, “toothpaste”). This extracted keyword becomes the text prompt for the vision model.
This keeps the user interaction hands-free and fully accessible.
2. Real-Time Visual Perception
The smartphone continuously captures images or video frames of the user’s surroundings. We feed sampled frames together with the extracted text prompt into a vision-language model (e.g., Grounding-DINO). The model outputs:
- Whether the target object is detected
- The bounding box and confidence score
- Its approximate pixel-space location in the frame
These results are passed into our second AI agent for decision-making.
3. Intelligent Decision Agent
The second agent evaluates the detection results:
- If no object is found, the agent does not interrupt the user and simply waits for the next frame.
- If the object is found, the agent triggers the navigation stage and sends a concise TTS message → “I found milk.”
This modular design allows the model to act only when relevant information is available.
4. Spatial Localization via 3D Reconstruction
Once the target object is found in the image, we estimate the 3D position of both:
- The user (via camera pose estimation)
- The target object (via bounding box → depth estimation → 3D projection)
We experiment with lightweight 3D reconstruction/depth networks to convert bounding box detection into approximate world-space coordinates.
This step enables the system to provide accurate spatial guidance, rather than simply indicating “object detected.”
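The bounding-box-to-3D conversion reduces to back-projecting a pixel with estimated depth through the standard pinhole camera model. This is a minimal sketch assuming the camera intrinsics (fx, fy, cx, cy) are available from the phone's calibration:

```python
def pixel_to_camera(u: float, v: float, depth: float,
                    fx: float, fy: float,
                    cx: float, cy: float) -> tuple[float, float, float]:
    """Back-project a pixel (u, v) with estimated depth (meters) into
    camera-space coordinates using the pinhole model:
    x = (u - cx) * z / fx, y = (v - cy) * z / fy, z = depth."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)
```

For example, the bounding-box center of the detected object, paired with the depth network's estimate at that pixel, yields the object's approximate position relative to the camera.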
5. Relative Position Computation
Using the camera’s intrinsic parameters and the estimated depth, we compute the object’s direction, its distance, and its left/right/ahead orientation.
Example output of the geometric module:
Object direction: 27° to the right
Estimated distance: 2.1 meters
This is then sent to our final audio agent.
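The geometric module boils down to a bearing/distance computation in the horizontal plane. This sketch assumes a camera-space convention of x pointing right and z pointing forward; the 10° "ahead" tolerance is an illustrative choice:

```python
import math

def relative_position(x: float, z: float) -> tuple[float, float]:
    """Return (bearing in degrees, positive = right; horizontal distance
    in meters) for a camera-space point with x right and z forward."""
    bearing = math.degrees(math.atan2(x, z))
    distance = math.hypot(x, z)
    return (bearing, distance)

def direction_label(bearing_deg: float, ahead_tolerance: float = 10.0) -> str:
    """Map a bearing to the coarse left/right/ahead wording used in prompts."""
    if abs(bearing_deg) <= ahead_tolerance:
        return "ahead"
    return "on your right" if bearing_deg > 0 else "on your left"
```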
6. Audio Guidance via Speech Generation
The navigation agent converts spatial information into natural, actionable voice instructions:
- “The apples are two meters ahead on your right.”
- “Move slightly left.”
- “You are very close—reach forward.”
This closes the loop and helps the user physically reach the object.
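A simplified version of the instruction templating might look like the following; the distance threshold and exact phrasing are assumptions (in practice the prompts are tuned with user feedback):

```python
def make_instruction(target: str, bearing_deg: float, distance_m: float) -> str:
    """Turn bearing/distance geometry into a short spoken instruction.
    Thresholds and wording are illustrative."""
    if distance_m < 0.5:
        return "You are very close, reach forward."
    if abs(bearing_deg) <= 10.0:
        side = "ahead"
    elif bearing_deg > 0:
        side = "ahead on your right"
    else:
        side = "ahead on your left"
    return f"{target.capitalize()}: {distance_m:.1f} meters {side}."
```

The resulting string is handed to the TTS engine for playback.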
Challenges we ran into
Building an audio-guided object-finding assistant for blind users proved far more challenging than we expected. We had to overcome difficulties in speech understanding, real-time detection, frame-to-frame stability, 3D localization, system performance, and accessible voice guidance. Each challenge pushed us to refine the pipeline and improve the user experience.
Accomplishments that we're proud of
- A working prototype that detects obstacles in real time.
- Successfully tested with users in indoor navigation scenarios.
- Designed a low-cost and scalable solution.
- Built with a strong focus on accessibility and human-centered design.
What we learned
- Accessibility requires listening: we iterated based on real user needs
- Edge AI optimization is key to real-world adoption
- The best technology disappears into everyday life
What's next for The third Eye — Light for every life
- Expand support for indoor navigation scenarios with higher accuracy
- Enhance data processing and privacy protection to ensure secure and reliable operation
- Optimize real-time performance and computing efficiency for a smoother user experience
Built With
- fastapi
- javascript
- langchain
- python
- react
- swift