Inspiration
2.2 billion people live with vision impairment and 43 million are blind — yet only ~5% of apps fully meet WCAG. VoiceOver helps, but it still assumes you can navigate the screen one element at a time. We wanted something fundamentally different: an agent where you say what you want and your iPhone does it, narrating every step along the way.
"Send my grandson a message." "What's my most recent Instagram DM?" That should be the whole interaction.
What it does
A Claude-powered agent that takes full, eyes-free control of your iPhone:
- Speak any task via the Action Button + a Shortcut
- Claude sees your screen using screenshots + the iOS UI hierarchy, and picks from 20+ tools (tap, scroll, swipe, openApp, composeMessage, searchMaps, webSearch, and more)
- Narrates every step through AVSpeechSynthesizer, with live progress in the Dynamic Island so blind users hear exactly what's happening — and can cancel or confirm at any point
Works across Messages, Mail, Maps, Safari, Calendar, Notes, Phone, and any installed app.
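To give a flavor of the tool set, here is a hedged sketch of how a few of the 20+ tool schemas could look in the Anthropic tool-calling format. The exact names, descriptions, and parameters below are illustrative, not the project's actual schemas:

```javascript
// Illustrative tool schemas in the Anthropic tool-use shape.
// Names/parameters are assumptions; the real agent defines 20+ of these.
const tools = [
  {
    name: "tap",
    description: "Tap the screen at the given coordinates.",
    input_schema: {
      type: "object",
      properties: {
        x: { type: "number", description: "X coordinate in points" },
        y: { type: "number", description: "Y coordinate in points" },
      },
      required: ["x", "y"],
    },
  },
  {
    name: "openApp",
    description: "Open an installed app by its bundle identifier.",
    input_schema: {
      type: "object",
      properties: {
        bundleId: { type: "string", description: "e.g. com.apple.MobileSMS" },
      },
      required: ["bundleId"],
    },
  },
  {
    name: "taskComplete",
    description: "Signal that the task is done; the summary is narrated aloud.",
    input_schema: {
      type: "object",
      properties: { summary: { type: "string" } },
      required: ["summary"],
    },
  },
];
```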
How we built it
The stack spans an iPhone 15 Pro and a MacBook bridged over USB:
- Node.js agent loop (agent.mjs) — reads the task, calls Claude, picks a tool, executes, repeats
- Claude Sonnet 4.5 — vision + tool calling, with the system prompt and 20+ tool schemas marked for prompt caching
- XCTest HTTP Runner (port 22087) — the hands on the device: taps, types, scrolls, drags
- Maestro + JVM (port 6001) — USB bridge that boots the XCTest Runner
- server.mjs (port 8000) — spawns the agent, exposes POST /task and GET /status
- Swift companion app — polls status every 500ms and drives the Dynamic Island live activity
- AVSpeechSynthesizer — step-by-step narration
- Action Button + Shortcuts — one-press voice trigger
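The server's HTTP surface is small enough to sketch. This is a minimal, assumed version of the POST /task and GET /status handlers (field names and the 202 response are our guesses); the real server.mjs would wire this into `http.createServer(...).listen(8000)` and spawn agent.mjs instead of the stub comment:

```javascript
// Shared status object the Swift companion app polls every 500ms.
const status = { state: "idle", step: 0, narration: "" };

// Route handler sketch: POST /task accepts a job, GET /status reports progress.
// In server.mjs this would sit inside http.createServer on port 8000.
function handle(req, res, body) {
  if (req.method === "POST" && req.url === "/task") {
    const { task } = JSON.parse(body || "{}");
    status.state = "running";
    status.narration = `Starting: ${task}`;
    // Real server would spawn agent.mjs here and stream progress into `status`.
    res.writeHead(202, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ accepted: true }));
  } else if (req.method === "GET" && req.url === "/status") {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify(status));
  } else {
    res.writeHead(404);
    res.end();
  }
}
```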
Control loop: takeScreenshot + getUIElements → Claude picks 1 of 20+ tools → execute → loop until taskComplete / taskFailed / askUser.
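The control loop above can be sketched as an observe → decide → act cycle. This is a simplified reconstruction with dependencies injected for clarity; the function names and message shapes are assumptions, not the actual agent.mjs code:

```javascript
// Sketch of the agent loop: observe the screen, let Claude pick one tool,
// execute it, and repeat until a terminal tool is called.
// All four injected functions are assumed interfaces.
async function runTask(task, { takeScreenshot, getUIElements, callClaude, execute }, maxSteps = 30) {
  const history = [{ role: "user", content: task }];
  for (let step = 0; step < maxSteps; step++) {
    // Observe: attach the current screenshot and UI hierarchy to the context.
    const screenshot = await takeScreenshot();
    const elements = await getUIElements();
    history.push({
      role: "user",
      content: [
        { type: "image", source: { type: "base64", media_type: "image/png", data: screenshot } },
        { type: "text", text: `UI elements:\n${JSON.stringify(elements)}` },
      ],
    });
    // Decide: Claude returns exactly one tool call for this step.
    const toolUse = await callClaude(history);
    if (["taskComplete", "taskFailed", "askUser"].includes(toolUse.name)) {
      return { status: toolUse.name, detail: toolUse.input };
    }
    // Act: run the tool on-device, then record the outcome for the next turn.
    const result = await execute(toolUse);
    history.push({ role: "assistant", content: `Used ${toolUse.name}` });
    history.push({ role: "user", content: `Result: ${result}` });
  }
  return { status: "taskFailed", detail: { reason: "step limit reached" } };
}
```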
Why Claude
- Prompt caching on system prompt + tool schemas cut per-step latency dramatically
- XML-tagged prompting is a first-class, documented pattern for Claude; our system prompt leans on it heavily
- Reliable tool calling over long horizons — mobile tasks run 15–30 steps, and tool choice couldn't drift as context grew
- Long context that doesn't degrade late in the trajectory
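Prompt caching works by marking cache breakpoints with `cache_control` in the Messages API request; a breakpoint on the last system block caches the tool-schema-plus-system prefix across steps. A sketch of building such a request (the model string and structure of our actual payload are assumptions):

```javascript
// Sketch: mark the system prompt and the last tool schema as cache
// breakpoints so the large static prefix is reused on every agent step.
function buildRequest(systemPrompt, tools, messages) {
  return {
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    // Breakpoint on the last tool caches the full tool-schema prefix...
    tools: tools.map((t, i) =>
      i === tools.length - 1 ? { ...t, cache_control: { type: "ephemeral" } } : t
    ),
    // ...and a breakpoint on the system block extends the cache through it.
    system: [{ type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } }],
    messages,
  };
}
```

Because the prefix is identical every step, only the fresh screenshot and step results are processed at full price and latency.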
Optimizations
The biggest win: a research-backed system prompt rewrite plus a single-image rolling-summary mode — only the latest screenshot stays in context; older ones are stripped and replaced with a one-line summary of what happened that step.
Result: 3.40× speedup vs. the all-screenshots baseline, no drop in task success.
Other tuning: explicit parallel tool-calling, iOS-specific UI hints (tab bars, modal sheets, keyboard quirks), forced screen-understanding before any action.
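The rolling-summary mode reduces to a small pruning pass over the conversation: keep the newest screenshot, replace every older one with a one-line text note. A sketch under an assumed message shape (the `summary` field and placeholder text are ours):

```javascript
// Sketch of single-image rolling-summary pruning: only the most recent
// screenshot survives; older image blocks become one-line text summaries.
function pruneScreenshots(messages) {
  // Find the last message that carries an image block.
  let lastImageIdx = -1;
  messages.forEach((m, i) => {
    if (Array.isArray(m.content) && m.content.some((b) => b.type === "image")) lastImageIdx = i;
  });
  return messages.map((m, i) => {
    if (i === lastImageIdx || !Array.isArray(m.content)) return m;
    const content = m.content.map((b) =>
      b.type === "image"
        ? { type: "text", text: `[screenshot removed: ${m.summary ?? "step completed"}]` }
        : b
    );
    return { ...m, content };
  });
}
```

Running this before each model call keeps context size roughly constant over a 15–30 step trajectory instead of growing by one image per step.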
Challenges
- Reverse-engineering the XCTest HTTP protocol for reliable low-level control without jailbreaking
- Dynamic Island sync at 500ms polling without draining battery or stuttering
- Safety under ambiguity — "text Kenny" with three Kennys in contacts. An askUser tool pauses the loop and surfaces a disambiguation prompt in the Dynamic Island
- Context bloat — raw screenshots every step blew past useful context fast, which drove the rolling-summary redesign
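The askUser pause can be modeled as a promise the agent awaits while the companion app shows the question. This is a hedged sketch of the mechanism only; the field names (`state`, `question`, `options`, `resolveAnswer`) are assumptions:

```javascript
// Sketch: an askUser tool call parks the agent on a promise. The status
// object is what the companion app polls; when the user picks an option,
// the app calls resolveAnswer and the loop resumes with the answer.
function handleAskUser(status, question, options) {
  status.state = "waiting_for_user";
  status.question = question; // e.g. "Which Kenny did you mean?"
  status.options = options;   // e.g. the three matching contacts
  return new Promise((resolve) => {
    status.resolveAnswer = (answer) => {
      status.state = "running";
      resolve(answer);
    };
  });
}
```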
Accomplishments
- End-to-end eyes-free control of a stock iPhone 15 Pro — no jailbreak, no app-specific integrations
- 3.40× speedup from the rolling-summary architecture
- A real accessibility use case, not a demo in disguise
What we learned
Mobile agents live or die on the loop between what the model sees and what it's allowed to do. The tools matter less than the discipline of the loop: cache aggressively, summarize old state, force the model to describe the screen before acting, always give the user an interrupt.
What's next
- Privacy: per-app blocklist, per-action consent, so screenshots of sensitive apps never leave the device
- Identity: voice biometrics or passphrase before destructive actions
- Reliability: pre- and post-action confirmation plus an undo window
- User studies with blind and low-vision participants
Built With
- activitykit
- anthropic-api
- avspeechsynthesizer
- claude
- claude-sonnet-4.5
- dynamic-island
- ios-shortcuts
- javascript
- maestro
- node.js
- prompt-caching
- swift
- swiftui
- usb
- websockets
- xcode
- xctest