Inspiration
We were inspired by the need to close the accessibility gap in online education. As online learning continues to scale globally, we noticed that students with visual impairments or reading disabilities are still left behind—especially when it comes to visual materials like charts, diagrams, and infographics. We wanted to build a tool that turns these silent visuals into inclusive learning opportunities.
What it does
EchoVision is an AI-powered browser plugin that converts images shown during online classes into descriptive audio in real time. It detects visual content (including uploaded slides, shared screens, and in-browser images), extracts meaningful context, and produces accurate, dynamic audio narration. It's designed to work seamlessly alongside screen readers and helps all learners better access visual information.
How we built it
EchoVision is a multi-agent system built with the Agent Development Kit (ADK) and deployed using Docker. The frontend collects user input (text, audio, or image), which an ADK-powered Agent Controller routes to the appropriate agent:
- Image input goes to a Vision Agent, which uses Gemini to generate a text description
- Audio input is transcribed by an STT Agent via Gemini
- Text questions are handled by a QA Agent that queries Gemini 2.5 Pro for a response
- The final text is converted into speech by a TTS Agent and returned to the user
The orchestration is modular, runs in Docker, and is optimized for deployment on Google Cloud Run. A simplified sketch of the routing logic follows.
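To make the flow concrete, here is a minimal sketch of how the controller could route each modality to its agent. It uses the google-genai Python SDK; the agent names mirror the description above, but the ADK wiring, error handling, and the TTS step are omitted, and the prompts shown are illustrative assumptions rather than our production prompts.

```python
# Minimal sketch of EchoVision's routing flow (illustrative, not production code).
# Assumes the google-genai Python SDK and a GOOGLE_API_KEY in the environment.
from dataclasses import dataclass

from google import genai
from google.genai import types

client = genai.Client()

@dataclass
class UserInput:
    kind: str             # "image", "audio", or "text"
    payload: bytes | str  # raw image/audio bytes, or the question text
    mime_type: str = ""   # e.g. "image/png" or "audio/wav"

def vision_agent(image_bytes: bytes, mime_type: str) -> str:
    """Describe a slide or diagram so it can be narrated (prompt is illustrative)."""
    image = types.Part.from_bytes(data=image_bytes, mime_type=mime_type)
    resp = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=["Describe this educational image for a visually impaired student.", image],
    )
    return resp.text

def stt_agent(audio_bytes: bytes, mime_type: str) -> str:
    """Transcribe spoken input with Gemini."""
    audio = types.Part.from_bytes(data=audio_bytes, mime_type=mime_type)
    resp = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=["Transcribe this audio verbatim.", audio],
    )
    return resp.text

def qa_agent(question: str) -> str:
    """Answer a text question directly."""
    resp = client.models.generate_content(model="gemini-2.5-pro", contents=question)
    return resp.text

def agent_controller(user_input: UserInput) -> str:
    """Route each modality to its agent; the result then goes to the TTS Agent."""
    if user_input.kind == "image":
        return vision_agent(user_input.payload, user_input.mime_type)
    if user_input.kind == "audio":
        return stt_agent(user_input.payload, user_input.mime_type)
    return qa_agent(user_input.payload)
```

In the actual plugin, the controller's output is handed to the TTS Agent and streamed back to the browser as audio.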
Challenges we ran into
- Ensuring real-time processing with minimal lag during live lectures
- Handling low-resolution or visually dense slides (e.g., handwritten notes or screenshots)
- Aligning audio output with existing screen reader workflows without causing overlaps or confusion
- Balancing technical accuracy of descriptions with human-like narration for better comprehension
Accomplishments that we're proud of
- In a short timeframe, built a working prototype that accurately converts complex visuals into spoken content
- Created a plug-and-play experience that works across multiple learning platforms
What we learned
- Collaborating across design, frontend, and ML pipelines taught us how to move fast while staying aligned on impact
- Accessibility is not just a feature—it’s a mindset. Building for inclusivity from day one leads to better outcomes for everyone
What's next for EchoVision
- Expand language support to make the plugin multilingual
- Improve visual context detection for more abstract images (e.g., graphs or art)
- Pilot the plugin in partnership with schools and edtech platforms to test at scale
- Open-source parts of the project to invite broader accessibility innovation
Built With
- docker
- gemini
- google-cloud
- tampermonkey