Inspiration
2.2 billion people live with vision impairment and 43 million are blind — yet only ~5% of apps fully meet WCAG. VoiceOver helps, but it still assumes you can navigate the screen one element at a time. We wanted something fundamentally different: an agent where you say what you want and your iPhone does it, narrating every step along the way.
"Send my grandson a message." "What's my most recent Instagram DM?" That should be the whole interaction.
What it does
A Claude-powered agent that takes full, eyes-free control of your iPhone:
- Speak any task via the Action Button + a Shortcut
- Claude sees your screen using screenshots + the iOS UI hierarchy, and picks from 20+ tools (tap, scroll, swipe, openApp, composeMessage, searchMaps, webSearch, and more)
- Narrates every step through AVSpeechSynthesizer, with live progress in the Dynamic Island so blind users hear exactly what's happening — and can cancel or confirm at any point
Works across Messages, Mail, Maps, Safari, Calendar, Notes, Phone, and any installed app.
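To give a flavor of the tool set, here is a hedged sketch of how a few of the 20+ tool schemas could look in the Anthropic tool-calling format. The exact names, descriptions, and parameters below are illustrative, not the project's actual schemas:

```javascript
// Illustrative tool schemas in the Anthropic tool-use shape.
// Names/parameters are assumptions; the real agent defines 20+ of these.
const tools = [
  {
    name: "tap",
    description: "Tap the screen at the given coordinates.",
    input_schema: {
      type: "object",
      properties: {
        x: { type: "number", description: "X coordinate in points" },
        y: { type: "number", description: "Y coordinate in points" },
      },
      required: ["x", "y"],
    },
  },
  {
    name: "openApp",
    description: "Open an installed app by its bundle identifier.",
    input_schema: {
      type: "object",
      properties: {
        bundleId: { type: "string", description: "e.g. com.apple.MobileSMS" },
      },
      required: ["bundleId"],
    },
  },
  {
    name: "taskComplete",
    description: "Signal that the task is done; the summary is narrated aloud.",
    input_schema: {
      type: "object",
      properties: { summary: { type: "string" } },
      required: ["summary"],
    },
  },
];
```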
How we built it
The stack spans an iPhone 15 Pro and a MacBook bridged over USB:
- Node.js agent loop (agent.mjs) — reads the task, calls Claude, picks a tool, executes, repeats
- Claude Sonnet 4.5 — vision + tool calling, with the system prompt and 20+ tool schemas marked for prompt caching
- XCTest HTTP Runner (port 22087) — the hands on the device: taps, types, scrolls, drags
- Maestro + JVM (port 6001) — USB bridge that boots the XCTest Runner
- server.mjs (port 8000) — spawns the agent, exposes POST /task and GET /status
- Swift companion app — polls status every 500ms and drives the Dynamic Island live activity
- AVSpeechSynthesizer — step-by-step narration
- Action Button + Shortcuts — one-press voice trigger
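The server's HTTP surface is small enough to sketch. This is a minimal, assumed version of the POST /task and GET /status handlers (field names and the 202 response are our guesses); the real server.mjs would wire this into `http.createServer(...).listen(8000)` and spawn agent.mjs instead of the stub comment:

```javascript
// Shared status object the Swift companion app polls every 500ms.
const status = { state: "idle", step: 0, narration: "" };

// Route handler sketch: POST /task accepts a job, GET /status reports progress.
// In server.mjs this would sit inside http.createServer on port 8000.
function handle(req, res, body) {
  if (req.method === "POST" && req.url === "/task") {
    const { task } = JSON.parse(body || "{}");
    status.state = "running";
    status.narration = `Starting: ${task}`;
    // Real server would spawn agent.mjs here and stream progress into `status`.
    res.writeHead(202, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ accepted: true }));
  } else if (req.method === "GET" && req.url === "/status") {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify(status));
  } else {
    res.writeHead(404);
    res.end();
  }
}
```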
Control loop: takeScreenshot + getUIElements → Claude picks 1 of 20+ tools → execute → loop until taskComplete / taskFailed / askUser.
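The control loop above can be sketched as an observe → decide → act cycle. This is a simplified reconstruction with dependencies injected for clarity; the function names and message shapes are assumptions, not the actual agent.mjs code:

```javascript
// Sketch of the agent loop: observe the screen, let Claude pick one tool,
// execute it, and repeat until a terminal tool is called.
// All four injected functions are assumed interfaces.
async function runTask(task, { takeScreenshot, getUIElements, callClaude, execute }, maxSteps = 30) {
  const history = [{ role: "user", content: task }];
  for (let step = 0; step < maxSteps; step++) {
    // Observe: attach the current screenshot and UI hierarchy to the context.
    const screenshot = await takeScreenshot();
    const elements = await getUIElements();
    history.push({
      role: "user",
      content: [
        { type: "image", source: { type: "base64", media_type: "image/png", data: screenshot } },
        { type: "text", text: `UI elements:\n${JSON.stringify(elements)}` },
      ],
    });
    // Decide: Claude returns exactly one tool call for this step.
    const toolUse = await callClaude(history);
    if (["taskComplete", "taskFailed", "askUser"].includes(toolUse.name)) {
      return { status: toolUse.name, detail: toolUse.input };
    }
    // Act: run the tool on-device, then record the outcome for the next turn.
    const result = await execute(toolUse);
    history.push({ role: "assistant", content: `Used ${toolUse.name}` });
    history.push({ role: "user", content: `Result: ${result}` });
  }
  return { status: "taskFailed", detail: { reason: "step limit reached" } };
}
```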
Why Claude
- Prompt caching on system prompt + tool schemas cut per-step latency dramatically
- XML-tagged prompting is a first-class, documented pattern for Claude; our system prompt leans on it heavily
- Reliable tool calling over long horizons — mobile tasks run 15–30 steps, and tool choice couldn't drift as context grew
- Long context that doesn't degrade late in the trajectory
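Prompt caching works by marking cache breakpoints with `cache_control` in the Messages API request; a breakpoint on the last system block caches the tool-schema-plus-system prefix across steps. A sketch of building such a request (the model string and structure of our actual payload are assumptions):

```javascript
// Sketch: mark the system prompt and the last tool schema as cache
// breakpoints so the large static prefix is reused on every agent step.
function buildRequest(systemPrompt, tools, messages) {
  return {
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    // Breakpoint on the last tool caches the full tool-schema prefix...
    tools: tools.map((t, i) =>
      i === tools.length - 1 ? { ...t, cache_control: { type: "ephemeral" } } : t
    ),
    // ...and a breakpoint on the system block extends the cache through it.
    system: [{ type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } }],
    messages,
  };
}
```

Because the prefix is identical every step, only the fresh screenshot and step results are processed at full price and latency.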
Optimizations
The biggest win: a research-backed system prompt rewrite plus a single-image rolling-summary mode — only the latest screenshot stays in context; older ones are stripped and replaced with a one-line summary of what happened that step.
Result: 3.40× speedup vs. the all-screenshots baseline, no drop in task success.
Other tuning: explicit parallel tool-calling, iOS-specific UI hints (tab bars, modal sheets, keyboard quirks), forced screen-understanding before any action.
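The rolling-summary mode reduces to a small pruning pass over the conversation: keep the newest screenshot, replace every older one with a one-line text note. A sketch under an assumed message shape (the `summary` field and placeholder text are ours):

```javascript
// Sketch of single-image rolling-summary pruning: only the most recent
// screenshot survives; older image blocks become one-line text summaries.
function pruneScreenshots(messages) {
  // Find the last message that carries an image block.
  let lastImageIdx = -1;
  messages.forEach((m, i) => {
    if (Array.isArray(m.content) && m.content.some((b) => b.type === "image")) lastImageIdx = i;
  });
  return messages.map((m, i) => {
    if (i === lastImageIdx || !Array.isArray(m.content)) return m;
    const content = m.content.map((b) =>
      b.type === "image"
        ? { type: "text", text: `[screenshot removed: ${m.summary ?? "step completed"}]` }
        : b
    );
    return { ...m, content };
  });
}
```

Running this before each model call keeps context size roughly constant over a 15–30 step trajectory instead of growing by one image per step.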
Challenges
- Reverse-engineering the XCTest HTTP protocol for reliable low-level control without jailbreaking
- Dynamic Island sync at 500ms polling without draining battery or stuttering
- Safety under ambiguity — "text Kenny" with three Kennys in contacts. An askUser tool pauses the loop and surfaces a disambiguation prompt in the Dynamic Island
- Context bloat — raw screenshots every step blew past useful context fast, which drove the rolling-summary redesign
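The askUser pause can be modeled as a promise the agent awaits while the companion app shows the question. This is a hedged sketch of the mechanism only; the field names (`state`, `question`, `options`, `resolveAnswer`) are assumptions:

```javascript
// Sketch: an askUser tool call parks the agent on a promise. The status
// object is what the companion app polls; when the user picks an option,
// the app calls resolveAnswer and the loop resumes with the answer.
function handleAskUser(status, question, options) {
  status.state = "waiting_for_user";
  status.question = question; // e.g. "Which Kenny did you mean?"
  status.options = options;   // e.g. the three matching contacts
  return new Promise((resolve) => {
    status.resolveAnswer = (answer) => {
      status.state = "running";
      resolve(answer);
    };
  });
}
```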
Accomplishments
- End-to-end eyes-free control of a stock iPhone 15 Pro — no jailbreak, no app-specific integrations
- 3.40× speedup from the rolling-summary architecture
- A real accessibility use case, not a demo in disguise
What we learned
Mobile agents live or die on the loop between what the model sees and what it's allowed to do. The tools matter less than the discipline of the loop: cache aggressively, summarize old state, force the model to describe the screen before acting, always give the user an interrupt.
What's next
- Privacy: per-app blocklist, per-action consent, so screenshots of sensitive apps never leave the device
- Identity: voice biometrics or passphrase before destructive actions
- Reliability: pre- and post-action confirmation plus an undo window
- User studies with blind and low-vision participants
Built With
- activitykit
- anthropic-api
- avspeechsynthesizer
- claude
- claude-sonnet-4.5
- dynamic-island
- ios-shortcuts
- javascript
- maestro
- node.js
- prompt-caching
- swift
- swiftui
- usb
- websockets
- xcode
- xctest