Inspiration

We kept running into the same frustration. You're on your laptop, you need to check something on your phone, and suddenly you're unlocking it, finding the app, tapping through four screens just to get a price or send a message. It felt like a solved problem that nobody had actually solved.

The other thing that bothered us: every phone already exposes a complete semantic description of its UI for accessibility purposes. With VoiceOver on iOS and TalkBack on Android, the OS literally narrates every button, every label, every field. We thought that was a perception layer that nobody was using for automation.

OpenClaw worked on your computer but not on your phone.

What it does

Spectra is an agent that directly engages with iOS apps through the accessibility API rather than taking screenshots and inferring what to do from pixels. You give it a natural language instruction, something like "search for the best ramen spots nearby", "plot directions to the airport", or "text John that I am running late", and it reads the live semantic tree of whatever is on screen, reasons over it, and acts. It sees "Button: Send" and "Cell: Wi-Fi, Connected" as direct targets rather than regions of an image it has to interpret.

It can execute tasks across multiple apps in sequence. It pauses before sensitive actions like sending, paying, or deleting, and returns control to the user for inputs it should not handle autonomously, like passwords or payment details. A SwiftUI client shows live progress and surfaces approval requests directly in the notification banner so you can confirm or cancel without opening the app.
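A rough sketch of how this pause-or-handoff decision could be expressed. It is an illustration under assumptions, not Spectra's actual policy: the `Action` shape, the keyword lists, and the `gate` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO = "auto"        # safe to execute without asking
    APPROVE = "approve"  # pause and ask the user to confirm or cancel
    HANDOFF = "handoff"  # the user has to perform this step themselves


@dataclass
class Action:
    kind: str             # hypothetical action vocabulary, e.g. "tap" or "type"
    target_label: str     # accessibility label of the target element
    secure: bool = False  # True when the target is a secure text field


# Hypothetical keyword lists. Substring matching is crude and is used
# here only to keep the sketch short.
SENSITIVE_WORDS = ("send", "pay", "delete")
HANDOFF_WORDS = ("password", "card number", "cvv")


def gate(action: Action) -> Decision:
    """Classify an action before it is executed."""
    label = action.target_label.lower()
    if action.kind == "type" and (
        action.secure or any(word in label for word in HANDOFF_WORDS)
    ):
        return Decision.HANDOFF
    if action.kind == "tap" and any(word in label for word in SENSITIVE_WORDS):
        return Decision.APPROVE
    return Decision.AUTO
```

Under these assumptions, `gate(Action("tap", "Send"))` asks for approval, while typing into a secure field is handed back to the user.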

We also built a Safari extension that applies the same approach to the browser. It reads Safari's live accessibility tree using the macOS AX API, the same layer that powers VoiceOver on Mac, compresses it into the format Spectra already understands, and sends it to the backend over WebSocket. Spectra can then read, click, type, and navigate in any Safari tab using the same action vocabulary it uses on iOS.

How we built it

The backend is Python on FastAPI, communicating with the iOS simulator over WebDriverAgent and with clients over WebSockets. The agent loop follows three steps: read the current screen, compress the accessibility tree down to the elements that matter, then ask Gemini for the next action as a structured tool call. Before executing, the system checks whether the action requires user approval or a manual handoff.
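The loop can be sketched as below. This is a minimal outline under assumptions, not the project's code: every callable passed in (`read_screen`, `compress`, `plan`, `needs_user`, `ask_user`, `execute`) is a hypothetical stand-in for the WebDriverAgent, Gemini, and approval pieces described above, and actions are modeled as plain dictionaries.

```python
from typing import Callable, Optional

Tree = list[dict]  # stand-in for the raw or compressed accessibility tree
Action = dict      # stand-in for a structured tool call


def run_task(
    instruction: str,
    read_screen: Callable[[], Tree],
    compress: Callable[[Tree], Tree],
    plan: Callable[[str, Tree, list[Action]], Optional[Action]],
    needs_user: Callable[[Action], bool],
    ask_user: Callable[[Action], bool],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> list[Action]:
    """Run read -> compress -> plan -> gate -> execute until done."""
    history: list[Action] = []
    for _ in range(max_steps):
        tree = compress(read_screen())            # steps 1 and 2
        action = plan(instruction, tree, history)  # step 3: next tool call
        if action is None:
            break  # the planner signals that the task is finished
        if needs_user(action) and not ask_user(action):
            break  # the user declined, so stop without executing
        execute(action)
        history.append(action)
    return history
```

Passing the pieces in as callables keeps the sketch testable with fakes; `max_steps` is an assumed safety cap so a confused planner cannot loop forever.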

The iOS client is SwiftUI. The Safari agent is a native macOS app extension built entirely on AXUIElement with no content scripts and no DOM access.

We built two memory layers: a short-term scratchpad carries context across apps within a single task, and a persistent store records lessons from past failures and injects them into future prompts so the agent does not repeat the same mistakes across sessions.
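One way the two layers could fit together is sketched below. The class names, the JSON file format, and `build_prompt` are assumptions made for illustration; the writeup does not describe how Spectra actually stores or injects its lessons.

```python
import json
from pathlib import Path


class Scratchpad:
    """Short-term notes that live only for the duration of one task."""

    def __init__(self) -> None:
        self._notes: dict[str, str] = {}

    def note(self, key: str, value: str) -> None:
        self._notes[key] = value

    def render(self) -> str:
        return "\n".join(f"- {k}: {v}" for k, v in self._notes.items())


class LessonStore:
    """Persistent lessons, kept per app in a JSON file (assumed format)."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._lessons: dict[str, list[str]] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def add(self, app: str, lesson: str) -> None:
        bucket = self._lessons.setdefault(app, [])
        if lesson not in bucket:  # avoid storing the same lesson twice
            bucket.append(lesson)
            self._path.write_text(json.dumps(self._lessons, indent=2))

    def for_app(self, app: str) -> list[str]:
        return list(self._lessons.get(app, []))


def build_prompt(
    instruction: str, app: str, scratchpad: Scratchpad, lessons: LessonStore
) -> str:
    """Combine the task, in-task notes, and past lessons into one prompt."""
    parts = [f"Task: {instruction}"]
    notes = scratchpad.render()
    if notes:
        parts.append("Notes so far:\n" + notes)
    past = lessons.for_app(app)
    if past:
        parts.append(
            "Lessons from earlier runs:\n" + "\n".join(f"- {item}" for item in past)
        )
    return "\n\n".join(parts)
```

The scratchpad is discarded when the task ends, while the lesson file survives across sessions, which mirrors the split described above.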

Challenges

Latency was a recurring constraint throughout the build. Each step in the agent loop involves reading the accessibility tree, making a model call, and executing an action, and those costs compound quickly across a multi-step task. We spent a significant amount of time on tree compression, reducing hundreds of raw nodes down to the 20 or 30 that the planner actually needs, because every token in the context costs time.
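To make the compression idea concrete, here is a small sketch. The `Node` shape, the `KEEP_ROLES` set, and the pruning rules are assumptions for illustration; the real compressor is likely more selective. The output lines follow the "Button: Send" style mentioned earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    role: str
    label: str = ""
    value: str = ""
    visible: bool = True
    children: list["Node"] = field(default_factory=list)


# Hypothetical set of roles worth showing to the planner.
KEEP_ROLES = {"Button", "Cell", "TextField", "Switch", "Link", "StaticText"}


def compress(root: Node, limit: int = 30) -> list[str]:
    """Flatten the tree into at most `limit` short, labelled lines."""
    kept: list[str] = []

    def walk(node: Node) -> None:
        if not node.visible or len(kept) >= limit:
            return  # skip hidden subtrees and stop once the budget is used
        if node.label and node.role in KEEP_ROLES:
            text = f"{node.role}: {node.label}"
            if node.value:
                text += f", {node.value}"
            kept.append(text)
        for child in node.children:  # containers are dropped, children kept
            walk(child)

    walk(root)
    return kept
```

Unlabelled containers and hidden elements contribute nothing to the output, so the planner sees only nodes it could plausibly act on or read.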

Context triggers were also tough. Rather than running on a fixed schedule, these fire when the agent recognizes a familiar flow from a previous session. If a user has navigated to the same destination before and opens Maps, Spectra surfaces a suggestion based on what it learned from that prior run. Getting the pattern matching right so it surfaces relevant suggestions without false positives, and doing so without storing more user data than necessary, took considerable work.
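A deliberately simple sketch of such a trigger follows. It is not Spectra's matching logic: `FlowMemory`, the `min_repeats` threshold, and the choice to store only an app name plus a normalized goal string are all assumptions, picked to show one way of trading recall for fewer false positives while keeping stored data minimal.

```python
from collections import Counter
from typing import Optional


class FlowMemory:
    """Counts completed (app, goal) pairs and suggests frequent ones."""

    def __init__(self, min_repeats: int = 2) -> None:
        # Only the app name and a short goal string are kept,
        # never screen contents or full action traces.
        self._counts: Counter = Counter()
        self.min_repeats = min_repeats

    def record(self, app: str, goal: str) -> None:
        self._counts[(app, goal.strip().lower())] += 1

    def suggest(self, app: str) -> Optional[str]:
        """Return the most frequent goal for this app, if seen often enough."""
        candidates = [
            (count, goal)
            for (seen_app, goal), count in self._counts.items()
            if seen_app == app and count >= self.min_repeats
        ]
        if not candidates:
            return None
        return max(candidates)[1]
```

Raising `min_repeats` makes suggestions rarer but less likely to be noise; setting it to 1 would fire after a single prior run, which matches the Maps example above more literally.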

What we learned

Engaging directly with the accessibility tree rather than reasoning from screenshots changes the quality of the agent's decisions in a meaningful way. "Tap the Send button" is a different instruction than "click at coordinate (312, 748)." The model reasons more reliably, the action traces are easier to follow, and when something goes wrong it is much clearer why.

What's next

Physical device support beyond the simulator, a wider app registry, and tighter integration between the Safari and iOS agents so tasks can move between browser and phone mid-workflow.

Built With

  • fastapi
  • faster-whisper
  • gemini-api
  • iossimulator
  • macos-axuielement-api
  • python
  • swift
  • swiftui
  • webdriveragent
  • websockets