Inspiration
97% of the web is inaccessible to screen readers.
Not because the information isn't there — but because modern web interfaces rely on visual cues that screen readers cannot parse: date pickers rendered as calendar grids, dynamic dropdowns with no ARIA labels, React SPAs where the DOM never reflects what the user actually sees.
A blind user trying to book a flight on Google Flights doesn't fail because the site is broken. They fail because the interface was never designed to be navigated without sight.
We asked: what if the agent could simply look at the page — the way a sighted person does?
What it does
Apollos is a voice-controlled AI agent that navigates the web on behalf of blind and low-vision users.
The user speaks their intent naturally:
"Find me the cheapest flight from Ho Chi Minh City to Tokyo next month"
The agent:
- Clarifies ambiguity before starting — asking whether direct or connecting flights are preferred, rough dates, etc.
- Navigates autonomously — opening Google Flights, filling in origin, destination, and dates by interpreting screenshots with Gemini Vision
- Narrates every step in natural language — "I'm on the search form, entering departure date..." — so the user is never left in silence
- Surfaces options mid-task — "I found two options: Vietnam Airlines $298 with one stop, or Japan Airlines $341 direct. Which do you prefer?"
- Escalates gracefully when it reaches payment or sensitive forms — handing off to a human assistant via voice call rather than guessing
The result: a user who cannot see the screen completes a real web task, with full awareness of what the agent is doing at every step.
How we built it
The core loop is a Rust-native agentic system running on Google Cloud:
Voice intent
→ Gemini Vision (screenshot reasoning → AgentAction JSON)
→ chromiumoxide (headless Chrome execution)
→ status narration back to user
→ repeat until Done or Escalate
Gemini Vision receives a PNG screenshot of the current browser state plus the user's intent and conversation history. It responds with a single structured action: click, type, navigate, scroll, wait, ask_user, done, or escalate.
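In Rust, the loop and the action type reduce to something like the condensed sketch below. The helper functions (capture_screenshot, next_action, narrate, execute, wait_for_answer), the JSON field names, and the anyhow error handling are illustrative stand-ins rather than the project's actual code:

```rust
use serde::Deserialize;

// Deserialized form of the model's JSON reply. The `action` tag and the
// per-variant fields are illustrative; only the action set is from the design.
#[derive(Debug, Deserialize)]
#[serde(tag = "action", rename_all = "snake_case")]
enum AgentAction {
    Click { x: f64, y: f64 },
    Type { text: String },
    Navigate { url: String },
    Scroll { dy: i64 },
    Wait { ms: u64 },
    AskUser { question: String },
    Done { summary: String },
    Escalate { reason: String },
}

// Hypothetical stand-ins for the real integrations.
async fn capture_screenshot() -> anyhow::Result<Vec<u8>> { todo!("chromiumoxide PNG") }
async fn next_action(_png: &[u8], _intent: &str, _history: &[String]) -> anyhow::Result<AgentAction> {
    todo!("Gemini Vision call")
}
async fn narrate(_msg: &str) { todo!("voice channel") }
async fn execute(_action: AgentAction) -> anyhow::Result<()> { todo!("chromiumoxide") }
async fn wait_for_answer(_question: String) -> anyhow::Result<String> { todo!("oneshot, below") }

async fn run(intent: &str) -> anyhow::Result<()> {
    let mut history = Vec::new();
    loop {
        let png = capture_screenshot().await?;
        match next_action(&png, intent, &history).await? {
            AgentAction::Done { summary } => { narrate(&summary).await; break }
            AgentAction::Escalate { reason } => { narrate(&reason).await; break } // hand off to a human
            AgentAction::AskUser { question } => {
                narrate(&question).await;
                let answer = wait_for_answer(question).await?;
                history.push(format!("user: {answer}"));
            }
            action => execute(action).await?, // click, type, navigate, scroll, wait
        }
    }
    Ok(())
}
```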
The ask_user action was the key architectural insight. Rather than failing silently when the intent is ambiguous, the agent pauses, emits a question through the voice channel, and waits for the user's reply via a oneshot::channel, resuming the loop with the answer incorporated into context.
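A minimal sketch of that pause point, assuming the pending sender lives in shared session state behind a mutex (the Session type and its field names are hypothetical):

```rust
use std::sync::{Arc, Mutex};
use tokio::sync::oneshot;

// Shared per-session state; the HTTP layer holds a clone of this.
#[derive(Clone, Default)]
struct Session {
    pending_answer: Arc<Mutex<Option<oneshot::Sender<String>>>>,
}

impl Session {
    // Agent side: called on AskUser. Parks the task until the user's reply
    // arrives; no polling, the future simply isn't woken until `send` fires.
    async fn wait_for_answer(&self) -> Option<String> {
        let (tx, rx) = oneshot::channel();
        *self.pending_answer.lock().unwrap() = Some(tx);
        rx.await.ok()
    }

    // HTTP side: called by the endpoint handler with the user's reply.
    fn answer(&self, text: String) {
        if let Some(tx) = self.pending_answer.lock().unwrap().take() {
            let _ = tx.send(text);
        }
    }
}
```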
Safety is handled at the system prompt level for sensitive actions (payment pages, personal data forms → always escalate) and at the architecture level for physical safety: a hard-stop signal cancels the browser agent immediately via CancellationToken, ensuring the AI never delays a safety response.
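The hard stop can be wired with tokio_util's CancellationToken and a biased select!, so the stop branch is always checked first; run_agent_loop stands in for the loop above:

```rust
use tokio_util::sync::CancellationToken;

// Wrap the agent so the stop signal always wins, even mid-step.
async fn supervised(stop: CancellationToken) {
    tokio::select! {
        biased; // poll the stop signal before the agent future on every wakeup
        _ = stop.cancelled() => {
            // Safety path: tear down the browser session immediately.
            eprintln!("hard stop: browser agent cancelled");
        }
        _ = run_agent_loop() => {}
    }
}

async fn run_agent_loop() { /* screenshot → Gemini → action, as above */ }
```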
Challenges
The date picker problem. Google Flights' calendar grid has no aria-label on individual date cells: the classic accessibility failure we were trying to solve, now our own obstacle. The solution was to instruct Gemini to type dates as text into the field rather than click the grid. Screen readers can't do this. Gemini, looking at a screenshot, can.
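Executed through chromiumoxide, that strategy is just focus-and-type; the selector and date format below are illustrative, since in practice Gemini picks the target from the screenshot:

```rust
use chromiumoxide::Page;

// Type the date as text instead of clicking unlabeled calendar cells.
async fn type_date(page: &Page, selector: &str, date: &str) -> anyhow::Result<()> {
    page.find_element(selector) // e.g. the departure-date input
        .await?
        .click()                // focus the field
        .await?
        .type_str(date)         // plain text, e.g. "Mar 14"
        .await?
        .press_key("Enter")     // commit and dismiss the calendar popup
        .await?;
    Ok(())
}
```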
Pause/resume across an async boundary. Implementing ask_user required bridging an HTTP endpoint into a running tokio task without polling. The solution was a oneshot::channel stored in session state: the HTTP handler sends the user's answer, and the loop unblocks.
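The HTTP side then only has to take the stored sender and fire it. A sketch using axum, with an assumed /answer route and payload shape:

```rust
use axum::{extract::State, routing::post, Json, Router};
use serde::Deserialize;
use std::sync::{Arc, Mutex};
use tokio::sync::oneshot;

type PendingAnswer = Arc<Mutex<Option<oneshot::Sender<String>>>>;

#[derive(Deserialize)]
struct Answer {
    text: String,
}

// POST /answer: take the stored sender and fire it, which unblocks the
// rx.await inside the running agent task.
async fn answer(State(pending): State<PendingAnswer>, Json(body): Json<Answer>) -> &'static str {
    if let Some(tx) = pending.lock().unwrap().take() {
        let _ = tx.send(body.text);
    }
    "ok"
}

fn router(pending: PendingAnswer) -> Router {
    Router::new().route("/answer", post(answer)).with_state(pending)
}
```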
Knowing when to stop. The most important capability isn't navigation; it's recognizing when not to proceed. Payment confirmation, OTP fields, irreversible actions: the agent must escalate these to a human, not guess. This is enforced both in the system prompt and by a lexical guard in Rust.
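The lexical guard can be as simple as a keyword scan over the visible page text before any click or type is executed; the keyword list here is illustrative, not the actual one:

```rust
// Last-line guard, independent of the prompt: if the visible page text looks
// like a payment or OTP step, force an escalation before any action runs.
const SENSITIVE: &[&str] = &[
    "card number", "cvv", "cvc", "expiry", "one-time password", "otp",
    "billing address", "social security",
];

fn must_escalate(visible_text: &str) -> bool {
    let haystack = visible_text.to_lowercase();
    SENSITIVE.iter().any(|kw| haystack.contains(kw))
}
```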
What we learned
The 97% statistic is real and immediate. Testing on Google Flights — a site that has invested in accessibility — revealed multiple interactions that remain genuinely impossible for screen readers but trivial for a vision model. The gap isn't about broken sites. It's structural.
The most powerful feature isn't automation. It's narration. Blind users interacting with an agent that silently does things reported significantly higher anxiety than with one that narrates each step. The agent saying "I'm scrolling down to see more options" is not a UX nicety; it's trust.