Inspiration
We live in a world where new technologies appear every day, technologies that solve real problems for a lot of people. We chose to focus on one group of potential users: blind people. We aspired to make the browser see and react to what a blind user needs, so that stating your goal is enough and the agent carries it through for you.
The inspiration for Aeyes (pronounced "AI Eyes") came from a simple question: What if your favorite browser could see, think, and act on your behalf? So we moved past passive assistive applications and built an agent that serves as a blind user's eyes, helping them accomplish goals that sighted users manage without assistance.
What it does
Aeyes is an autonomous, voice-controlled browser agent designed for the visually impaired. It transforms the web from a visual landscape into a conversational partner.
- Goal-Oriented Interaction: Users state high-level goals ("Order headphones on Amazon"), not technical steps.
- Autonomous Navigation: The agent "sees" the page via DOM distillation and executes actions like clicking, typing, and scrolling.
- Expressive Feedback: Using neural voices, the agent confirms every step, so the user is never lost in "silent" navigation.
How we built it
The architecture of Aeyes is a distributed system that formalizes the interaction as a sequential mapping from speech to action. We used Chrome's new Side Panel API for the best possible communication experience with the agent.
The final Action is determined by Gemini, which analyzes the user's Transcribed Speech and the distilled version of the current webpage.
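To make that mapping concrete, here is a minimal sketch of how a single turn might be packaged for the model. The endpoint path and the exact action schema are illustrative assumptions rather than our production code; the key idea is that the transcript and the distilled page go up together and a single structured action comes back.

```typescript
// Minimal sketch of the speech-to-action mapping (names and schema are illustrative).

// The structured actions the agent can execute on the page.
type BrowserAction =
  | { kind: "click"; elementId: string }
  | { kind: "type"; elementId: string; text: string }
  | { kind: "scroll"; direction: "up" | "down" }
  | { kind: "speak"; message: string };

interface AgentRequest {
  goal: string;          // the high-level goal the user stated
  transcript: string;    // the latest utterance from the Web Speech API
  distilledDom: string;  // pruned, accessibility-aware snapshot of the page
}

// Hypothetical FastAPI route that proxies the request to Gemini via Vertex AI.
const AGENT_ENDPOINT = "http://localhost:8000/agent/next-action";

async function decideNextAction(req: AgentRequest): Promise<BrowserAction> {
  const res = await fetch(AGENT_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`Agent backend returned ${res.status}`);
  // The backend asks Gemini for one JSON action per turn and forwards it verbatim.
  return (await res.json()) as BrowserAction;
}
```

Keeping the reply down to one small structured action per turn is also what lets the agent confirm each step aloud before moving on.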
- Foundation: Built as a Chrome Extension (Manifest V3) using React and Vite.
- Brain: Gemini 2.0 Flash via Vertex AI handles the reasoning, intent parsing, and element matching.
- Voice: ElevenLabs provides the human-like resonance, while the Web Speech API handles real-time transcription.
- Orchestration: A FastAPI backend secures API keys and handles the complex logic of audio streaming and state management.
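To give a feel for the streaming half of that backend, below is a hedged client-side sketch of chunked speech playback. The /tts/stream route is an assumed name; the idea is that MP3 chunks relayed from ElevenLabs are appended to a MediaSource buffer as they arrive, so the agent can start talking before the full clip exists.

```typescript
// Sketch of chunked TTS playback in the side panel (endpoint name is an assumption).
const TTS_STREAM_ENDPOINT = "http://localhost:8000/tts/stream";

async function speakStreaming(text: string): Promise<void> {
  const mediaSource = new MediaSource();
  const audio = new Audio(URL.createObjectURL(mediaSource));
  void audio.play(); // kicks off loading; playback begins once the first chunk is buffered

  // Wait for the audio element to attach the MediaSource.
  await new Promise<void>((resolve) =>
    mediaSource.addEventListener("sourceopen", () => resolve(), { once: true }),
  );
  const buffer = mediaSource.addSourceBuffer("audio/mpeg"); // MP3 chunks

  const res = await fetch(TTS_STREAM_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  const reader = res.body!.getReader();

  // Append each chunk as it arrives instead of waiting for the whole file.
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer.appendBuffer(value);
    await new Promise<void>((resolve) =>
      buffer.addEventListener("updateend", () => resolve(), { once: true }),
    );
  }
  mediaSource.endOfStream();
}
```

Playback is assumed to follow a user interaction, since Chrome's autoplay policy can otherwise block the first play() call.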
Challenges we ran into
- Gemini API Onboarding: We had no idea how the API worked at first, or that we needed to create a Google Cloud project to use it.
- Chrome's Permission Wall: We discovered that Side Panels cannot request microphone access. We solved this by building a dedicated "Permission Bridge" page.
- The "Chaos" of the DOM: Standard HTML is too noisy for LLMs. Our biggest challenge was creating a Clustered DOM logic that pruned the tree by 85% without losing accessibility context.
- Dynamic State Synchronization: Keeping the agent's internal state synced across page reloads required a sophisticated messaging layer between the content scripts and the persistent side panel (a simplified sketch follows below).
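A simplified sketch of that messaging layer is shown here; the message names are illustrative rather than our exact protocol. The side panel asks the active tab's content script for a fresh snapshot, and the content script announces every page load so the panel can resynchronize.

```typescript
// --- Side panel (e.g. sidepanel.ts): request a fresh snapshot from the active tab ---
async function requestPageSnapshot(): Promise<string> {
  const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });
  if (!tab?.id) throw new Error("No active tab");
  const reply = await chrome.tabs.sendMessage(tab.id, { type: "GET_PAGE_SNAPSHOT" });
  return reply.distilledDom as string;
}

// --- Content script (e.g. content.ts): answer snapshot requests ---
chrome.runtime.onMessage.addListener((message, _sender, sendResponse) => {
  if (message.type === "GET_PAGE_SNAPSHOT") {
    // document.title stands in for the real Clustered DOM distillation.
    sendResponse({ distilledDom: document.title });
  }
});

// Announce reloads so the side panel can resync its internal state.
window.addEventListener("load", () => {
  void chrome.runtime.sendMessage({ type: "PAGE_RELOADED", url: location.href });
});
```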
Accomplishments that we're proud of
- True Proactive Agency: We moved beyond screen readers that merely "narrate" the page. Aeyes builds internal action plans and executes them autonomously.
- Low-Latency Audio Pipeline: We implemented a custom streaming bridge that delivers ElevenLabs neural audio in chunks, achieving "speech-while-acting" capabilities.
- Right in Your Browser: Because the agent ships as a browser extension, it should in theory run in any Chromium-based browser that supports the Side Panel API.
- Speech Interruption Logic: The agent can be "hushed" or interrupted, making it feel like a real conversational partner rather than a programmed script.
- Context-Aware DOM Pruning: Our Clustered DOM algorithm achieves a massive reduction in distraction, cutting page complexity by approximately 87% compared to raw HTML and making the page readable for the AI.
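The production Clustered DOM logic does more than this (it also groups related elements into clusters), but the sketch below captures the basic pruning idea under simplifying assumptions: walk the DOM, drop hidden and non-interactive nodes, and keep only compact entries carrying each element's role and accessible name.

```typescript
// Simplified, illustrative pruning pass; the real Clustered DOM algorithm is richer.
interface DistilledNode {
  id: number;    // stable index the agent can refer back to when acting
  role: string;  // ARIA role if present, otherwise the tag name
  label: string; // accessible name: aria-label, placeholder, or visible text
}

const INTERACTIVE_TAGS = new Set(["a", "button", "input", "select", "textarea"]);

function isVisible(el: HTMLElement): boolean {
  const style = getComputedStyle(el);
  return style.display !== "none" && style.visibility !== "hidden" &&
    el.getClientRects().length > 0;
}

function accessibleName(el: HTMLElement): string {
  const name =
    el.getAttribute("aria-label") ??
    el.getAttribute("placeholder") ??
    el.textContent?.trim() ??
    "";
  return name.slice(0, 80); // keep labels short to save model tokens
}

function distillDom(root: HTMLElement = document.body): DistilledNode[] {
  const nodes: DistilledNode[] = [];
  root.querySelectorAll<HTMLElement>("*").forEach((el) => {
    const tag = el.tagName.toLowerCase();
    if (!INTERACTIVE_TAGS.has(tag) && !el.hasAttribute("role")) return; // drop noise
    if (!isVisible(el)) return;                                         // drop hidden nodes
    nodes.push({
      id: nodes.length,
      role: el.getAttribute("role") ?? tag,
      label: accessibleName(el),
    });
  });
  return nodes;
}
```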
What we learned
- Latency is the UX: In voice-first design, the "perception of speed" is everything. We learned that parallelizing speech processing, DOM analysis, AI reasoning, and voice synthesis is crucial to minimizing total response time so the interaction feels real-time (a small sketch follows after this list).
- Small Models, Big Impact: Gemini 2.0 Flash proved that a fast, lightweight model beats heavier, slower models for rapid-fire browser-action reasoning in this use case.
- Accessibility as Default: Building for the visually impaired revealed fundamental flaws in modern web semantics, reinforcing our commitment to the "Accessibility First" philosophy.
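As a tiny illustration of that lesson, the final transcript and the page snapshot can be produced concurrently so the model call fires the moment both resolve. The helper names echo the earlier sketches and are stubbed here purely for illustration.

```typescript
// Stubs standing in for the real pieces (assumed names, illustration only).
const finalizeTranscript = async (): Promise<string> => "order noise-cancelling headphones";
const requestPageSnapshot = async (): Promise<string> => "<distilled page>";
const decideNextAction = async (_req: { goal: string; transcript: string; distilledDom: string }) =>
  ({ kind: "click", elementId: "buy-now" });
const executeAction = async (_action: unknown): Promise<void> => {};

// The latency trick: distill the DOM while the final transcript is still being
// produced, then reason over both as soon as they are ready.
async function handleUtterance(goal: string): Promise<void> {
  const [transcript, distilledDom] = await Promise.all([
    finalizeTranscript(),
    requestPageSnapshot(),
  ]);
  const action = await decideNextAction({ goal, transcript, distilledDom });
  await executeAction(action); // perform the click/type/scroll, then confirm aloud
}
```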
What's next for Aeyes
- Multimodal Visual Grounding: Integrating Gemini's vision capabilities to handle non-semantic elements (Canvas/SVG).
- Personalized Task Macros: Allowing users to save complex voice-activated scripts like "Order my usual Thursday coffee."
- Privacy-First Offline Mode: Exploring Edge AI for basic navigation to ensure user data remains on-device whenever possible.
Built With
- chrome-extension-api
- elevenlabs-api
- fastapi
- gemini-2.0-flash
- google-vertex-ai
- manifest-v3
- python
- react
- typescript
- vite
- web-speech-api