Inspiration
Many multimodal GUI agents are accurate but computationally slow. In a baseline agent loop, the visual input sequence X at step t grows linearly as it stacks every past high-resolution screenshot and action:
$$X_t = \{I_1, A_1, I_2, A_2, \dots, I_t\}$$
Because a transformer's attention mechanism scales quadratically with sequence length, this linear O(N) growth in image history makes per-step inference cost grow quadratically, and total episode cost even faster. We wanted to do better. Inspired by recent research on semantic context compression, we set out to completely decouple the agent's token footprint from the episode length, shifting from an expensive visual memory to a lightweight textual one.
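Concretely, if each screenshot contributes roughly $K$ visual tokens (an illustrative constant, not a measured number), stacking the full history makes the attention cost at step $t$

$$\text{cost}(t) \propto (tK)^2 = K^2 t^2,$$

so an episode of $T$ steps costs $\sum_{t=1}^{T} K^2 t^2 = O(K^2 T^3)$ in total, versus $O(K^2 T)$ when every step sees only a single current image.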
What it does
Our framework allows an LLM to navigate an iOS phone natively. By extracting the underlying UI code, filtering out the visual noise, and hashing the interactive elements, we compress the environment into a deterministic text buffer.
Instead of sending an expanding array of screenshots to a heavy multimodal model, we maintain a simple state update:
$$S_t = S_{t-1} + A_{t-1} + \text{Flattened\_UI}_t$$
This drops our visual token complexity from O(N) down to a strict O(1) per step (a single current image). The result is a 3.4x latency speedup over traditional screenshot-stacking baseline agents from the arXiv literature.
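A minimal sketch of this state update in Python (the function names and UI-tree fields here are illustrative stand-ins, not our actual API):

```python
# Sketch of the O(1)-per-step textual state buffer: the only thing that grows
# is plain text, never a stack of screenshots.

def flatten_ui(ui_tree: dict) -> str:
    """Illustrative: collapse a UI hierarchy into one text line per element."""
    lines = []

    def walk(node):
        if node.get("label"):
            lines.append(f'{node.get("type", "element")}: {node["label"]}')
        for child in node.get("children", []):
            walk(child)

    walk(ui_tree)
    return "\n".join(lines)

def update_state(prev_state: str, prev_action: str, ui_tree: dict) -> str:
    """S_t = S_{t-1} + A_{t-1} + Flattened_UI_t, all as plain text."""
    return "\n".join([prev_state, f"ACTION: {prev_action}", flatten_ui(ui_tree)])
```

Each step appends one short action line and one flattened snapshot, so the payload sent to the model stays textual and compact regardless of how many screenshots the episode has produced.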
How we built it
Goal: compress linear history to a lightweight text buffer. Our optimization does the following:
- Capture the deterministic iOS accessibility hierarchy (JSON)
- Flatten the previous state into a structured natural-language context message to send in the subsequent payload
- Algorithmically strip noise from the screenshot in frame
- Overlay the screenshot with a grid for precise movements
- (Optionally) zoom into unclear areas, or prompt the user if a dead end is approached
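The capture-and-flatten steps above can be sketched roughly like this (the field names are assumptions about an accessibility dump, not the exact Maestro schema, and the element types are illustrative):

```python
import hashlib

# Illustrative set of element types we treat as interactive.
INTERACTIVE = {"Button", "TextField", "Cell", "Link", "Switch"}

def flatten_hierarchy(node: dict, out: list) -> list:
    """Depth-first walk: keep interactive elements, skip layout containers."""
    if node.get("type") in INTERACTIVE:
        label = node.get("accessibilityLabel") or node.get("text", "")
        # Short stable hash so the model can name elements deterministically.
        eid = hashlib.sha1(f'{node["type"]}|{label}'.encode()).hexdigest()[:8]
        out.append(f'[{eid}] {node["type"]}: {label}')
    for child in node.get("children", []):
        flatten_hierarchy(child, out)
    return out
```

Hashing each interactive element gives the model a stable handle to target across turns, which is what makes the text buffer deterministic rather than positional.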
Challenges we ran into
- Noise: The raw Maestro JSON contains a massive amount of structural boilerplate. Tuning our algorithm to drop layout containers without accidentally stripping critical icon-only buttons (which rely on hidden accessibility labels) was tough.
- Text-only hallucinations: Early iterations of the text-only routing sometimes hallucinated interaction targets.
- Apple workarounds: Getting permission to run automations natively on the iPhone required workarounds that took a long time to figure out and verify.
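The container-stripping heuristic we converged on for the noise problem looks roughly like this (a simplified sketch with hypothetical type names; the real filter has more special cases):

```python
# Illustrative set of pure-layout container types to discard.
LAYOUT_TYPES = {"View", "ScrollView", "StackView", "Group"}

def keep_node(node: dict) -> bool:
    """Decide whether a node survives the noise filter."""
    if node.get("accessibilityLabel"):
        return True   # icon-only buttons live or die by this check
    if node.get("type") in LAYOUT_TYPES:
        return False  # structural boilerplate with no label: drop it
    return bool(node.get("text"))  # otherwise keep only nodes with visible text
```

Checking the accessibility label *before* the layout-type check is the key ordering: it is what keeps an unlabeled-looking icon button from being swept out with the containers.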
What we learned
We learned so much about the end-to-end flow of how agents respond to prompts. We explored various papers on the topic and brainstormed where we could inject our own micro-improvements to increase our speedup against the baseline method.