# SilverSurfer — The Story Behind the Agent

## What Inspired Us

My grandmother called me one evening, frustrated. She had spent 45 minutes trying to book a doctor's appointment online. The button was too small. The calendar wouldn't respond to her touch. She gave up. She waited three more days for someone to help her.
That phone call became SilverSurfer.
There are 54 million seniors in the United States alone — and hundreds of millions more worldwide — who face this same invisible wall every single day. The internet was built for 28-year-old developers, not 74-year-old retired teachers. Every app has a different UI. Every checkout flow has a different pattern. Every form demands a different skill. We asked ourselves: what if the web could just… listen?
## What We Built

SilverSurfer is a voice-first AI accessibility agent that navigates the web on your behalf. You speak naturally:
"Book a doctor appointment with Dr. Smith for next Tuesday, and order a gallon of milk from Walmart."
Sylvia — our agent persona — listens, understands, opens browsers, fills forms, clicks buttons, and confirms tasks. She narrates every step in a warm, calm voice. You never touch a keyboard.

### The agent pipeline

At its core, SilverSurfer runs a continuous perception–action loop. Given the current screenshot, the current task, and prior context, Gemini 2.0 Flash outputs the next action as structured JSON:

```json
{
  "action": "click",
  "target": "calendar day 18",
  "x": 214,
  "y": 178,
  "narration": "I can see Tuesday the 18th — selecting it now."
}
```
This loop repeats until the **ADK Verifier agent** confirms task completion by visually inspecting the resulting screen state.
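The loop can be sketched as follows. This is a minimal illustration, not our production code: `get_screenshot`, `next_action`, `execute`, and `is_done` are hypothetical hooks standing in for the Playwright screenshot capture, the Gemini 2.0 Flash call, the browser action executor, and the ADK Verifier check.

```python
import json

def run_perception_action_loop(task, get_screenshot, next_action, execute,
                               is_done, max_steps=25):
    """Drive the screenshot -> model -> action loop until the verifier confirms.

    All four callables are hypothetical stand-ins for the real components:
    Playwright capture, the Gemini call, browser execution, and the Verifier.
    """
    history = []
    for _ in range(max_steps):
        screenshot = get_screenshot()                 # current screen state
        raw = next_action(screenshot, task, history)  # model returns JSON text
        action = json.loads(raw)                      # parse the action object
        execute(action)                               # click / type / scroll
        history.append(action)
        if is_done():                                 # verifier inspects result
            return history
    raise TimeoutError("task did not complete within max_steps")
```

The `max_steps` cap is a safety valve so a confused model cannot loop forever on a page it cannot parse.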
The expected number of loop steps \( N \) to complete a task follows roughly:
$$
\mathbb{E}[N] \approx \frac{C}{\rho}
$$
where \( C \) is task complexity (number of required UI interactions) and \( \rho \) is Gemini's visual accuracy rate — empirically **0.91** across our 50-task test suite.
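As a worked example (the five-interaction task is hypothetical): a booking flow with \( C = 5 \) required UI interactions at the measured accuracy \( \rho = 0.91 \) gives

$$
\mathbb{E}[N] \approx \frac{5}{0.91} \approx 5.5,
$$

i.e. about five to six loop iterations before the Verifier confirms completion.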
---
## How We Built It
### Architecture
```
   User Voice
       │
       ▼
Gemini Live API ──► ADK Orchestrator
                           │
                ┌──────────┼──────────┐
                ▼          ▼          ▼
           Navigator   Navigator   Verifier
           Agent (1)   Agent (2)    Agent
                │          │
                ▼          ▼
           Playwright  Playwright
                │          │
           Screenshot  Screenshot
                └────┬─────┘
                     ▼
      Gemini 2.0 Flash (Vision → Action JSON)
                     │
                     ▼
             Execute on Browser
```

### Tech stack

| Layer | Technology |
|---|---|
| Voice input | Gemini Live API — real-time, interruptible |
| Intent parsing | Google ADK — Orchestrator agent |
| Browser automation | Playwright + headless Chromium |
| Visual understanding | Gemini 2.0 Flash (screenshot → action) |
| Voice output | Google Cloud Text-to-Speech (Neural2) |
| Backend | FastAPI (Python 3.11) |
| Frontend | React — accessibility-first, large text |
| Hosting | Google Cloud Run |
| Secrets | Google Cloud Secret Manager |
| Logging | Google Cloud Logging |
| Persistence | Firestore — saved favourite tasks |
| Container registry | Artifact Registry |
| IaC / CI | Cloud Build via `cloudbuild.yaml` |

### ADK agent structure
- **Orchestrator agent** — receives parsed intent, manages the task queue, handles voice interruptions
- **Navigator sub-agent** — one per task; owns a Playwright browser session and runs the vision loop
- **Verifier sub-agent** — takes a final screenshot and confirms task completion before closing
## What We Learned
- **Empathy is an engineering problem.** We spent the first week building the agent. We spent the second week making it feel human. The two are completely different problems. Removing the word "Error" from all narration and replacing it with "Hmm, let me try a different way" measurably reduced user anxiety in testing. Slowing the TTS voice to 0.88× speed made elderly users feel heard rather than rushed.
Empathy is not a design afterthought. It is a first-class engineering requirement.
- **Visual grounding is harder than it looks.** Gemini Vision is remarkably good at identifying UI elements — but coordinate precision degrades at non-standard viewport sizes. We solved this with a DOM-fallback layer: when Gemini returns an element description, Playwright simultaneously attempts `page.locator()` as a parallel path. Whichever resolves first wins. This raised our task success rate from 78% to 94% across our test suite.
- **Google ADK is genuinely powerful.** ADK handles the hard parts of multi-agent coordination — task queuing, agent handoffs, retry logic, context passing — out of the box. What would have taken weeks to build from scratch took two days with ADK.
- **Real users taught us about trust.** We tested with three actual elderly users (ages 68–79). Their feedback was not about features. It was about trust.
"How do I know it's not buying the wrong thing?"
This led us to add a confirmation step before any financial transaction — Sylvia reads the item and price aloud and waits for a "yes" before proceeding. One sentence of narration. Massive improvement in trust.
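The "whichever resolves first wins" race from the visual-grounding lesson above can be sketched with `asyncio`. This is a simplified illustration: `vision_path` and `dom_path` are hypothetical coroutines standing in for Gemini's coordinate prediction and Playwright's `page.locator()` lookup.

```python
import asyncio

async def resolve_element(vision_path, dom_path):
    """Race two lookup strategies; the first to resolve supplies the target.

    vision_path and dom_path are hypothetical coroutines standing in for
    the Gemini coordinate prediction and the Playwright locator lookup.
    """
    racers = [asyncio.create_task(vision_path), asyncio.create_task(dom_path)]
    done, pending = await asyncio.wait(racers,
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()              # abandon the slower path
    return done.pop().result()     # result of the winner
```

Cancelling the loser matters in practice: a stale locator lookup left running could otherwise act on a page that has already changed.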
## Challenges We Faced

### The interruption problem

Gemini Live API supports real-time interruption — but our early pipeline would continue executing browser actions even after the user said "stop." We built an interrupt bus: a shared async flag that the voice stream can set at any time, and that the browser controller polls between every action step.

### Stateless browsers on Cloud Run

Cloud Run containers are stateless and can be spun down between requests, and Playwright browsers don't survive container restarts. We solved this by making every task execution self-contained: the browser is launched fresh per task session, with all context passed in via the task payload rather than stored in process memory.

### Hallucinated coordinates from Gemini Vision

Early versions occasionally hallucinated coordinates for elements that didn't exist on screen. We solved this with grounding validation: after receiving the action JSON, we verify that the (x, y) coordinates fall within the actual screenshot bounds, and that the element description matches what Playwright's accessibility tree reports at that location. If it doesn't match, we re-prompt with the discrepancy.
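The interrupt bus described above can be sketched as a thin wrapper around `asyncio.Event`. This is a minimal sketch, not our production code; `run_actions` and its `execute` callback are hypothetical names for the browser controller and its per-action executor.

```python
import asyncio

class InterruptBus:
    """Shared flag: the voice stream sets it, the browser controller polls it."""

    def __init__(self):
        self._stop = asyncio.Event()

    def interrupt(self):
        """Called from the voice handler when the user says 'stop'."""
        self._stop.set()

    def clear(self):
        """Reset before starting a new task."""
        self._stop.clear()

    def should_stop(self):
        """Polled by the browser controller between action steps."""
        return self._stop.is_set()

async def run_actions(actions, bus, execute):
    """Execute actions one at a time, checking the bus before each step."""
    for action in actions:
        if bus.should_stop():      # user interrupted mid-task
            return "interrupted"
        await execute(action)
    return "completed"
```

Because the flag is checked between steps rather than mid-step, a "stop" lands within at most one browser action's latency.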
## What's Next
- **Voice profiles** — learn each user's speech patterns and preferred sites over time via Firestore
- **Proactive reminders** — "Margaret, your prescription refill is due Thursday — want me to order it now?"
- **Multi-language support** — Tamil, Hindi, and Spanish for non-English-speaking elderly populations globally
- **WhatsApp integration** — send a voice note on WhatsApp, and SilverSurfer handles it in the background
- **Offline fallback** — cached task templates for common actions when connectivity is poor