Delphi: Voice-Driven Browsing for the Visually Impaired
Inspiration
Most of us take our laptops for granted. Google is always a few easy clicks away. But for some, these basic technologies are unfriendly and inaccessible. Our friend Moksh, who is blind, is one such person.
For him, navigating the web is a maze of tabs and endless keyboard shortcuts to memorize. We watched his frustration as he listened for screen reader prompts and rebuilt his mental map of familiar websites every time an interface changed.
Delphi was born from that moment: a promise to transform browsing into a conversation, an experience where every click and keystroke is controlled through your voice.
How it Works
Delphi is a multi-agent network composed of four AI agents that together transform any website into a seamless interface for voice-driven interaction (a minimal sketch of the message flow follows the list).
- Voice Agent (Hermes): This agent lives in our minimal web interface and converses with users. In addition to holding a friendly conversation, it identifies the user's browsing intents and confirms actions aloud.
- Orchestrator (Zeus): This agent identifies overarching tasks, maintains state, and delegates tasks to the other agents in the network.
- Vision Agent (Theia): This agent uses computer vision to analyze screenshots, identify essential landmarks and UI elements, and describe what’s happening to the user.
- Browser Agent (Athena): This agent takes action in a browser to execute the user’s goals through clicks, keystrokes, navigation, and more.
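To make the delegation concrete, here is a minimal sketch of how an orchestrator like Zeus could route a task over Fetch.ai's uAgents framework. The message models, seed, and Athena's address are illustrative placeholders, not Delphi's actual implementation.

```python
# A minimal sketch of Zeus delegating a task over uAgents.
# Message models and addresses are illustrative, not Delphi's real ones.
from uagents import Agent, Context, Model


class BrowseTask(Model):
    instruction: str  # e.g. "open the news site and read the headlines"


class TaskResult(Model):
    summary: str      # spoken-back description of what happened


zeus = Agent(name="zeus", seed="zeus-demo-seed", port=8000)

# Hypothetical address of the browser agent (Athena) on the network.
ATHENA_ADDRESS = "agent1q...athena"


@zeus.on_message(model=BrowseTask)
async def delegate(ctx: Context, sender: str, task: BrowseTask):
    # Remember who asked, so the result can be routed back to Hermes.
    ctx.storage.set("requester", sender)
    await ctx.send(ATHENA_ADDRESS, task)


@zeus.on_message(model=TaskResult)
async def report(ctx: Context, sender: str, result: TaskResult):
    # Forward the outcome to the voice agent for spoken confirmation.
    requester = ctx.storage.get("requester")
    if requester:
        await ctx.send(requester, result)


if __name__ == "__main__":
    zeus.run()
```

In this pattern, each agent only knows the message schemas it handles, which is what lets any one of them be swapped out independently.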
Technology
- Fetch.ai uAgents + Agentverse
- Google Gemini 2.0 Flash (regular and Live)
- Browser-use
- Text-to-speech
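As an illustration of the vision step, here is a hedged sketch of how an agent like Theia might describe a screenshot with Gemini 2.0 Flash via the google-generativeai SDK. The prompt, file name, and API key handling are assumptions, not our exact code.

```python
# A minimal sketch of describing a page screenshot with Gemini 2.0 Flash.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # assumes a Google AI Studio key
model = genai.GenerativeModel("gemini-2.0-flash")

screenshot = Image.open("page.png")      # screenshot captured from the browser
response = model.generate_content([
    "Describe the key landmarks and interactive elements on this page "
    "for a blind user, briefly and in reading order.",
    screenshot,
])
print(response.text)                     # text the voice agent can speak aloud
```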
Challenges
- Latency: Cutting round-trip time between screenshot, vision inference, and browser action.
- Error recovery: Detecting mis-clicks or failed navigations and auto-retrying with minimal user friction (see the retry sketch after this list).
- UX tuning: Balancing verbosity vs. clarity in spoken feedback.
- Security & sandboxing: Preventing agents from executing unintended scripts or exposing sensitive data.
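One way to approach the error-recovery challenge is a bounded retry loop around each browser action. The sketch below is a generic pattern, not Delphi's actual recovery logic; `click_and_verify` is a hypothetical callback that performs an action and re-inspects the page to confirm it worked.

```python
# A minimal auto-retry sketch; the action callback is a placeholder.
import asyncio
from typing import Awaitable, Callable


async def with_retry(
    action: Callable[[], Awaitable[bool]],
    retries: int = 3,
    backoff_s: float = 0.5,
) -> bool:
    """Run a browser action; re-attempt if verification fails."""
    for attempt in range(1, retries + 1):
        if await action():
            return True
        # Brief, growing pause lets the page settle before the next attempt.
        await asyncio.sleep(backoff_s * attempt)
    return False  # surface failure so the voice agent can ask the user


# Usage (click_and_verify is hypothetical):
# asyncio.run(with_retry(lambda: click_and_verify("Submit")))
```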
Accomplishments
- Achieved action recognition on complex pages
- Modular design: each agent can be swapped out or upgraded independently
- Decentralized architecture: a multimodal agentic system on the Fetch.ai network
- Persistence across multiple tasks
- Central orchestrator agent with memory and a task queue (a minimal sketch follows this list)
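To show what "memory and a task queue" can look like in the simplest case, here is a sketch assuming a plain in-process deque and dict; Delphi's actual orchestrator state management may differ.

```python
# A minimal sketch of an orchestrator task queue with memory.
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    goal: str                 # e.g. "search for today's weather"
    status: str = "pending"   # pending -> running -> done/failed


@dataclass
class Orchestrator:
    queue: deque = field(default_factory=deque)
    memory: dict = field(default_factory=dict)  # facts that persist across tasks

    def enqueue(self, goal: str) -> None:
        self.queue.append(Task(goal))

    def next_task(self) -> Task | None:
        # Tasks run in the order the user asked for them.
        return self.queue.popleft() if self.queue else None


orch = Orchestrator()
orch.enqueue("open example.com")
orch.enqueue("read the first headline")
task = orch.next_task()
print(task.goal if task else "queue empty")
```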
Lessons Learned
- The Power of Multimodal AI: combining vision, text, and voice to unlock unique workflows and enable innovation
- Agent Orchestration: having a central node with clear goals and autonomy
- Complexities of Accessibility: there isn't a one-size-fits-all solution; users have a variety of different needs that must be accommodated
What’s Next
- Persistent Context: improve the orchestrator’s long-term memory to create personalized experiences for users
- Document Navigation: support PDFs, tables, and rich media beyond HTML
- Mobile Integration: bring Delphi to iOS/Android as a companion app using remote browsers
- Language Support: add a diverse range of languages to serve a global community
- Open Beta: onboard more visually impaired users for live feedback and iterative improvements
Built With
- browser-use
- fetchai
- gemini
- uagent