Delphi: Voice-Driven Browsing for the Visually Impaired
Inspiration
Most of us take our laptops for granted. Google is always a few easy clicks away. But for some, these basic technologies are unfriendly and inaccessible. Our friend Moksh, who is blind, is one such person.
For him, navigating the web is a maze of tabs and endless keyboard shortcuts to memorize. We watched his frustration as he listened for screen reader prompts and rebuilt his mental map of familiar websites every time an interface changed.
Delphi was born from that moment: a promise to transform browsing into a conversation, an experience where every click and keystroke is controlled through your voice.
How it Works
Delphi is a multi-agent network composed of four AI agents that together transform any website into a seamless interface for voice-driven interaction (a minimal sketch of the message flow follows the list).
- Voice Agent (Hermes): This agent lives in our minimal web interface and converses with users. In addition to holding a friendly conversation, it identifies the user's browsing intents and confirms actions aloud.
- Orchestrator (Zeus): This agent identifies overarching tasks, maintains state, and delegates tasks to the other agents in the network.
- Vision Agent (Theia): This agent uses computer vision to analyze screenshots, identify essential landmarks and UI elements, and describe what’s happening to the user.
- Browser Agent (Athena): This agent takes action in a browser to execute the user’s goals through clicks, keystrokes, navigation, and more.
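To make the delegation concrete, here is a minimal sketch of how an orchestrator like Zeus could route a task over Fetch.ai's uAgents framework. The message models, seed, and Athena's address are illustrative placeholders, not Delphi's actual implementation.

```python
# A minimal sketch of Zeus delegating a task over uAgents.
# Message models and addresses are illustrative, not Delphi's real ones.
from uagents import Agent, Context, Model


class BrowseTask(Model):
    instruction: str  # e.g. "open the news site and read the headlines"


class TaskResult(Model):
    summary: str      # spoken-back description of what happened


zeus = Agent(name="zeus", seed="zeus-demo-seed", port=8000)

# Hypothetical address of the browser agent (Athena) on the network.
ATHENA_ADDRESS = "agent1q...athena"


@zeus.on_message(model=BrowseTask)
async def delegate(ctx: Context, sender: str, task: BrowseTask):
    # Remember who asked, so the result can be routed back to Hermes.
    ctx.storage.set("requester", sender)
    await ctx.send(ATHENA_ADDRESS, task)


@zeus.on_message(model=TaskResult)
async def report(ctx: Context, sender: str, result: TaskResult):
    # Forward the outcome to the voice agent for spoken confirmation.
    requester = ctx.storage.get("requester")
    if requester:
        await ctx.send(requester, result)


if __name__ == "__main__":
    zeus.run()
```

In this pattern, each agent only knows the message schemas it handles, which is what lets any one of them be swapped out independently.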
Technology
- Fetch.ai uAgents + Agentverse
- Google Gemini 2.0 Flash (regular and Live)
- Browser-use
- Text-to-speech
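As an illustration of the vision step, here is a hedged sketch of how an agent like Theia might describe a screenshot with Gemini 2.0 Flash via the google-generativeai SDK. The prompt, file name, and API key handling are assumptions, not our exact code.

```python
# A minimal sketch of describing a page screenshot with Gemini 2.0 Flash.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # assumes a Google AI Studio key
model = genai.GenerativeModel("gemini-2.0-flash")

screenshot = Image.open("page.png")      # screenshot captured from the browser
response = model.generate_content([
    "Describe the key landmarks and interactive elements on this page "
    "for a blind user, briefly and in reading order.",
    screenshot,
])
print(response.text)                     # text the voice agent can speak aloud
```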
Challenges
- Latency: Cutting round-trip time between screenshot, vision inference, and browser action.
- Error recovery: Detecting mis-clicks or failed navigations and auto-retrying with minimal user friction (see the retry sketch after this list).
- UX tuning: Balancing verbosity vs. clarity in spoken feedback.
- Security & sandboxing: Preventing agents from executing unintended scripts or exposing sensitive data.
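One way to approach the error-recovery challenge is a bounded retry loop around each browser action. The sketch below is a generic pattern, not Delphi's actual recovery logic; `click_and_verify` is a hypothetical callback that performs an action and re-inspects the page to confirm it worked.

```python
# A minimal auto-retry sketch; the action callback is a placeholder.
import asyncio
from typing import Awaitable, Callable


async def with_retry(
    action: Callable[[], Awaitable[bool]],
    retries: int = 3,
    backoff_s: float = 0.5,
) -> bool:
    """Run a browser action; re-attempt if verification fails."""
    for attempt in range(1, retries + 1):
        if await action():
            return True
        # Brief, growing pause lets the page settle before the next attempt.
        await asyncio.sleep(backoff_s * attempt)
    return False  # surface failure so the voice agent can ask the user


# Usage (click_and_verify is hypothetical):
# asyncio.run(with_retry(lambda: click_and_verify("Submit")))
```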
Accomplishments
- Achieved action recognition on complex pages
- Modular design: each agent can be swapped out or upgraded independently
- Decentralized architecture: a multimodal agentic system on the Fetch.ai network
- Persistence across multiple tasks
- Central orchestrator agent with memory and a task queue (a minimal sketch follows this list)
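To show what "memory and a task queue" can look like in the simplest case, here is a sketch assuming a plain in-process deque and dict; Delphi's actual orchestrator state management may differ.

```python
# A minimal sketch of an orchestrator task queue with memory.
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    goal: str                 # e.g. "search for today's weather"
    status: str = "pending"   # pending -> running -> done/failed


@dataclass
class Orchestrator:
    queue: deque = field(default_factory=deque)
    memory: dict = field(default_factory=dict)  # facts that persist across tasks

    def enqueue(self, goal: str) -> None:
        self.queue.append(Task(goal))

    def next_task(self) -> Task | None:
        # Tasks run in the order the user asked for them.
        return self.queue.popleft() if self.queue else None


orch = Orchestrator()
orch.enqueue("open example.com")
orch.enqueue("read the first headline")
task = orch.next_task()
print(task.goal if task else "queue empty")
```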
Lessons Learned
- The Power of Multimodal AI: combining vision, text, and voice to unlock unique workflows and enable innovation
- Agent Orchestration: having a central node with clear goals and autonomy
- Complexities of Accessibility: there isn't a one-size-fits-all solution; users have a variety of different needs that must be accommodated
What’s Next
- Persistent Context: improve the orchestrator’s long-term memory to create personalized experiences for users
- Document Navigation: support PDFs, tables, and rich media beyond HTML
- Mobile Integration: bring Delphi to iOS/Android as a companion app using remote browsers
- Language Support: add a diverse range of languages to serve a global community
- Open Beta: onboard more visually impaired users for live feedback and iterative improvements
Built With
- browser-use
- fetchai
- gemini
- uagent