Inspiration

While GPS tools provide macro-navigation (street routes), they fail at micro-navigation: detecting hazards within the 2 meters immediately in front of a visually impaired person. We were inspired to bridge this "last-meter" safety gap by using Gemini’s multimodal reasoning to act as a digital sighted assistant that never blinks.

What it does

Symphony Vision AI is a real-time accessibility HUD. It captures a continuous stream of environmental video and uses the Gemini API to:

Detect Hazards: Identify uneven pavement, low-hanging branches, or oncoming traffic.

Haptic Translation: Convert visual threats into tactile vibration patterns (short pulses for minor obstacles, rapid vibration for immediate danger) via the Web Vibration API.

Visual HUD: Display high-contrast, simplified directional cues for users with low vision.
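A minimal sketch of how the haptic translation could be wired up in TypeScript. The severity names, the pulse durations, and the helper names `patternFor` and `signalHazard` are illustrative assumptions; only the use of the Web Vibration API (`navigator.vibrate`) comes from the description above.

```typescript
// Hypothetical severity levels: the writeup only distinguishes
// minor obstacles from immediate danger.
type Severity = "minor" | "immediate";

// Patterns alternate vibrate/pause durations in milliseconds, which is
// the format navigator.vibrate() accepts. Durations are assumed values.
const PATTERNS: Record<Severity, number[]> = {
  minor: [100, 200, 100], // two short pulses with a gap
  immediate: [50, 50, 50, 50, 50, 50, 50], // rapid buzzing
};

// Pure lookup, kept separate so it can be tested without a device.
function patternFor(severity: Severity): number[] {
  return PATTERNS[severity];
}

// Triggers the vibration if the runtime supports it.
// Returns false when the Vibration API is unavailable or the call is rejected.
function signalHazard(severity: Severity): boolean {
  const nav = (globalThis as unknown as {
    navigator?: { vibrate?: (pattern: number[]) => boolean };
  }).navigator;
  if (!nav || typeof nav.vibrate !== "function") {
    return false;
  }
  return nav.vibrate(patternFor(severity));
}
```

Keeping the pattern lookup pure means the mapping can be tuned and unit-tested on a laptop, while `signalHazard` degrades quietly on browsers that do not expose vibration.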

How we built it

We engineered a decoupled full-stack architecture optimized for low-latency inference:

Gemini API (gemini-3-flash): Leveraged for its high-velocity multimodal processing. We utilized System Instructions and Structured JSON Output to ensure the AI only returns actionable safety data.

Frontend: A React application that handles the hardware-level integration with the device camera and vibration motor.

Backend: A Node.js proxy to securely handle API keys and pre-process image buffers before they reach the LLM.
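To illustrate what "only returns actionable safety data" can mean in practice, here is a sketch of validating the model's JSON reply before it reaches the HUD or the vibration motor. The `HazardReport` shape and its field names are assumptions made for this example; the writeup does not specify the actual schema.

```typescript
// Hypothetical shape of one structured reply from the model.
interface HazardReport {
  hazard: string; // e.g. "low-hanging branch"
  severity: "minor" | "immediate";
  direction: "left" | "center" | "right";
}

const SEVERITIES: string[] = ["minor", "immediate"];
const DIRECTIONS: string[] = ["left", "center", "right"];

// Parses the raw reply text and returns a typed report,
// or null if the reply is not valid JSON or does not match the expected shape.
function parseHazardReport(raw: string): HazardReport | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (_err) {
    return null; // not JSON at all
  }
  if (typeof data !== "object" || data === null) {
    return null;
  }
  const r = data as Record<string, unknown>;
  if (typeof r.hazard !== "string") return null;
  if (typeof r.severity !== "string" || SEVERITIES.indexOf(r.severity) === -1) return null;
  if (typeof r.direction !== "string" || DIRECTIONS.indexOf(r.direction) === -1) return null;
  return {
    hazard: r.hazard,
    severity: r.severity as HazardReport["severity"],
    direction: r.direction as HazardReport["direction"],
  };
}
```

Rejecting anything that does not match the schema means a malformed or off-topic reply results in no signal rather than a misleading one.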

Challenges we ran into

The primary hurdle was latency. Analyzing every single frame was too resource-intensive. We solved this by implementing a smart-sampling throttler that dispatches one frame every 2 seconds, balancing real-time safety against API efficiency. We also struggled with "hallucinations" in spatial distance estimates, which we corrected by refining our System Prompts to focus on relative object size and ground-plane analysis.
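The throttling idea can be sketched as a small time gate that lets at most one frame through per interval. This is a simplified, purely time-based version; the class name `FrameThrottler`, its method, and the injectable clock are assumptions for illustration, and the project's actual "smart-sampling" logic may do more.

```typescript
// Lets a frame through only if at least intervalMs has passed
// since the last frame that was let through.
class FrameThrottler {
  private readonly intervalMs: number;
  private readonly now: () => number;
  private lastDispatch: number = -Infinity;

  // The clock is injectable so the gate can be tested without waiting.
  constructor(intervalMs: number, now: () => number = () => Date.now()) {
    this.intervalMs = intervalMs;
    this.now = now;
  }

  // Returns true if the caller should send the current frame to the backend.
  shouldDispatch(): boolean {
    const t = this.now();
    if (t - this.lastDispatch < this.intervalMs) {
      return false; // too soon, drop this frame
    }
    this.lastDispatch = t;
    return true;
  }
}
```

In a capture loop, each new camera frame would be checked with `shouldDispatch()` on a throttler created with `new FrameThrottler(2000)`, and frames that return false are simply discarded.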

Accomplishments that we're proud of

Zero-UI Reliability: Creating a system that provides value through touch (vibration) even if the user cannot see the screen.

Hardware Synergy: Successfully mapping AI logic to physical phone hardware (haptics) to create a "sixth sense" for the user.

Seamless Multimodality: Achieving a clean Image-to-JSON pipeline that handles complex environmental data in under 800 ms.

What we learned

We learned that multimodal AI is not just for chat; its true power lies in real-world perception. We also gained deep experience in Prompt Engineering for Safety, learning how to constrain a Large Language Model to behave as a reliable, deterministic sensor.

What's next for Symphony Vision AI

Edge Processing: Migrating to MediaPipe or local Gemini Nano for 100% offline, privacy-first navigation.

Voice Overlay: Integrating a natural language "Whisper Mode" for detailed environmental descriptions on demand.

Wearable Integration: Porting the logic to smart glasses to provide a hands-free navigation experience.
