Inspiration
We were inspired by the need to close the accessibility gap in online education. As online learning continues to scale globally, we noticed that students with visual impairments or reading disabilities are still left behind—especially when it comes to visual materials like charts, diagrams, and infographics. We wanted to build a tool that turns these silent visuals into inclusive learning opportunities.
What it does
EchoVision is an AI-powered browser plugin that converts images shown during online classes into descriptive audio in real time. It detects visual content (including uploaded slides, shared screens, and in-browser images), extracts meaningful context, and produces accurate, dynamic audio narration. It's designed to work seamlessly alongside screen readers and helps all learners better access visual information.
How we built it
EchoVision is a multi-agent system built with the Agent Development Kit (ADK) and deployed using Docker. The frontend collects user input (text, audio, or image), which an ADK-powered Agent Controller routes to the appropriate agent:
- Image input goes to a Vision Agent, which uses Gemini to generate a text description
- Audio input is transcribed by an STT Agent via Gemini
- Text questions are handled by a QA Agent that queries Gemini 2.5 Pro for a response
- The final text is converted into speech by a TTS Agent and returned to the user
The orchestration is modular, runs in Docker, and is optimized for deployment on Google Cloud Run. A simplified sketch of the routing logic follows.
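To make the flow concrete, here is a minimal sketch of how the controller could route each modality to its agent. It uses the google-genai Python SDK; the agent names mirror the description above, but the ADK wiring, error handling, and the TTS step are omitted, and the prompts shown are illustrative assumptions rather than our production prompts.

```python
# Minimal sketch of EchoVision's routing flow (illustrative, not production code).
# Assumes the google-genai Python SDK and a GOOGLE_API_KEY in the environment.
from dataclasses import dataclass

from google import genai
from google.genai import types

client = genai.Client()

@dataclass
class UserInput:
    kind: str             # "image", "audio", or "text"
    payload: bytes | str  # raw image/audio bytes, or the question text
    mime_type: str = ""   # e.g. "image/png" or "audio/wav"

def vision_agent(image_bytes: bytes, mime_type: str) -> str:
    """Describe a slide or diagram so it can be narrated (prompt is illustrative)."""
    image = types.Part.from_bytes(data=image_bytes, mime_type=mime_type)
    resp = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=["Describe this educational image for a visually impaired student.", image],
    )
    return resp.text

def stt_agent(audio_bytes: bytes, mime_type: str) -> str:
    """Transcribe spoken input with Gemini."""
    audio = types.Part.from_bytes(data=audio_bytes, mime_type=mime_type)
    resp = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=["Transcribe this audio verbatim.", audio],
    )
    return resp.text

def qa_agent(question: str) -> str:
    """Answer a text question directly."""
    resp = client.models.generate_content(model="gemini-2.5-pro", contents=question)
    return resp.text

def agent_controller(user_input: UserInput) -> str:
    """Route each modality to its agent; the result then goes to the TTS Agent."""
    if user_input.kind == "image":
        return vision_agent(user_input.payload, user_input.mime_type)
    if user_input.kind == "audio":
        return stt_agent(user_input.payload, user_input.mime_type)
    return qa_agent(user_input.payload)
```

In the actual plugin, the controller's output is handed to the TTS Agent and streamed back to the browser as audio.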
Challenges we ran into
- Ensuring real-time processing with minimal lag during live lectures
- Handling low-resolution or visually dense slides (e.g., handwritten notes or screenshots)
- Aligning audio output with existing screen reader workflows without causing overlaps or confusion
- Balancing technical accuracy of descriptions with human-like narration for better comprehension
Accomplishments that we're proud of
- In a short timeframe, built a working prototype that accurately converts complex visuals into spoken content
- Created a plug-and-play experience that works across multiple learning platforms
What we learned
- Collaborating across design, frontend, and ML pipelines taught us how to move fast while staying aligned on impact
- Accessibility is not just a feature—it’s a mindset. Building for inclusivity from day one leads to better outcomes for everyone
What's next for EchoVision
- Expand language support to make the plugin multilingual
- Improve visual context detection for more abstract images (e.g., graphs or art)
- Pilot the plugin in partnership with schools and edtech platforms to test at scale
- Open-source parts of the project to invite broader accessibility innovation
Built With
- docker
- gemini
- google-cloud
- tampermonkey