💡 Inspiration

For the 2.2 billion people globally with vision impairment, "seeing" the world through technology often means waiting. Existing accessibility tools rely on a "capture-and-wait" model: snap a photo, upload it, and wait seconds for a generic description. In dynamic environments—like a busy intersection or a cluttered kitchen—that latency isn't just frustrating; it's unsafe.

We wanted to build something faster, more natural, and conversational. We wanted to move from static captioning to real-time visual awareness.

🚀 What it Does

Second Sight transforms a smartphone camera into an "always-on" visual assistant. Instead of analyzing single photos, it processes a continuous video stream, allowing visually impaired users to have a natural dialogue with their environment.

Users can point their camera and ask complex, context-dependent questions in real-time without pausing to take a picture:

"Is the crosswalk clear right now?"

"Read the warning label on this bottle."

"Where did I set my keys down on this messy table?"

Second Sight provides instant, concise audio feedback, acting as a sighted guide in your pocket.

⚙️ How We Built It (The Tech Stack)

The crucial enabler for this project is Google Gemini 1.5 Flash. We specifically chose Flash over Pro because its low latency is essential for a real-time safety application.

Multimodal AI Engine: Gemini 1.5 Flash API. We continuously sample frames from the device's camera feed and pair them with transcribed user audio queries. Gemini interprets the visual and audio context simultaneously to provide an answer.
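As a rough sketch of that pairing step: each query bundles one sampled JPEG frame with the transcribed question into a single multimodal request. With the google-generativeai SDK, a list like the one below is what you would pass to `GenerativeModel("gemini-1.5-flash").generate_content(...)`; the function name and example values here are illustrative, not our production code.

```python
# Hedged sketch of request assembly: one sampled camera frame plus the
# transcribed voice query become one multimodal prompt. Names are
# illustrative only.

def build_query(jpeg_bytes: bytes, question: str) -> list:
    """Build the multimodal content list for one user question.

    With the google-generativeai SDK, this list is what you would pass
    to GenerativeModel("gemini-1.5-flash").generate_content(...).
    """
    return [
        {"mime_type": "image/jpeg", "data": jpeg_bytes},  # inline image part
        question,                                         # transcribed spoken query
    ]

# Example: the frame sampled while the user asks about the crosswalk.
parts = build_query(b"\xff\xd8...", "Is the crosswalk clear right now?")
```

Keeping the frame and the question in one request is what lets Gemini resolve deictic phrases like "this bottle" against what the camera is actually pointed at.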

Frontend: [Insert your choice here, e.g., Flutter / React Native / Swift] for robust mobile camera handling and a simple, accessible UI.

Audio Output: Native on-device Text-to-Speech (TTS) engine to ensure the lowest possible delay between Gemini's response and the user hearing it.

🧠 Challenges We Ran Into

Latency vs. Accuracy: Balancing the frame rate sent to the API was tricky. Sending too many frames clogged the network; sending too few made the AI miss context. We found a sweet spot of sampling frames every [X] milliseconds.
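The sampling logic boils down to a simple time-based throttle: frames arrive continuously from the camera, but at most one per interval is forwarded to the API and the rest are dropped. A minimal sketch (the interval constant is a placeholder, since the tuned value is left as [X] above):

```python
import time
from typing import Optional

# Placeholder interval; the tuned production value is not given in the write-up.
SAMPLE_INTERVAL_MS = 500

class FrameThrottle:
    """Forward at most one camera frame per interval, dropping the rest."""

    def __init__(self, interval_ms: int = SAMPLE_INTERVAL_MS):
        self.interval_s = interval_ms / 1000.0
        self._last_sent = float("-inf")  # so the very first frame always passes

    def should_send(self, now: Optional[float] = None) -> bool:
        """Return True if enough time has passed to upload another frame."""
        now = time.monotonic() if now is None else now
        if now - self._last_sent >= self.interval_s:
            self._last_sent = now
            return True
        return False
```

Dropping frames at the source, rather than queuing them, also keeps answers anchored to what the camera sees *now* instead of a backlog of stale frames.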

Prompt Engineering for Brevity: Gemini loves to be descriptive. We had to heavily engineer the system prompt to force it to be concise, prioritize safety warnings, and stop acting like a chatbot and start acting like a guide.
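To make that concrete, here is a hedged reconstruction of the kind of system prompt this implies; the actual production prompt is not shown in this write-up.

```python
# Illustrative system prompt only -- not the production prompt.
SYSTEM_PROMPT = (
    "You are a real-time sighted guide, not a chatbot. "
    "Answer in one or two short sentences, with no preamble. "
    "If anything in view is a safety hazard, mention it first. "
    "If you cannot tell from the frame, say so briefly."
)
```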

🌟 What's Next for Second Sight

Wearable Integration: Moving the software from handheld smartphones to smart glasses for a truly hands-free experience.

Proactive Notifications: Currently, the user has to ask a question. In the future, we want the agent to proactively announce hazards (e.g., "Obstacle ahead," "Stairs approaching") without being prompted.
