Inspiration
We set out to test the limits of current computer-use agents on realistic desktop tasks.
What it does
Executes high-level natural-language instructions (e.g., “download the dataset, unzip it, open it in Excel, make a pivot table”) by perceiving the screen and controlling the mouse and keyboard.
How we built it
- Screen capture + OCR for visual grounding
- Vision-language model for reasoning
- Action planner for granular interactions
- Safety/retry logic for robustness
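The components above form a perceive → plan → act loop, with re-grounding and bounded retries when an action fails. Here is a minimal sketch of that loop; all names (`Observation`, `Action`, `plan`, `run`) are illustrative stand-ins, and the heuristic planner substitutes for the actual vision-language model.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Observation:
    """Hypothetical screen state: OCR text plus detected UI elements."""
    ocr_text: str
    elements: List[str]

@dataclass
class Action:
    """A granular interaction the planner emits (click, type, ...)."""
    kind: str
    target: str

def plan(instruction: str, obs: Observation) -> List[Action]:
    """Stand-in for the VLM reasoner: map an instruction plus an
    observation to a list of granular actions. Illustrative keyword
    matching only; the real system queries a vision-language model."""
    return [Action("click", el) for el in obs.elements
            if el.lower() in instruction.lower()]

def run(instruction: str,
        perceive: Callable[[], Observation],
        execute: Callable[[Action], bool],
        max_retries: int = 2) -> bool:
    """Perceive -> plan -> act loop with bounded per-action retries."""
    obs = perceive()
    for action in plan(instruction, obs):
        for _attempt in range(max_retries + 1):
            if execute(action):
                break
            obs = perceive()  # re-ground on failure before retrying
        else:
            return False  # action kept failing: report instead of looping forever
    return True
```

In practice `perceive` would wrap screen capture + OCR and `execute` would drive the mouse/keyboard; keeping them as injected callables is what makes the architecture modular enough to add new skills.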
Challenges we ran into
- UI variability across themes/states
- Long-horizon task planning
- Latency vs. accuracy in perception
Accomplishments that we're proud of
- Agent completed multi-step workflows end-to-end
- Modular architecture for adding new skills
- Automatic recovery from common failures
What we learned
- Robust grounding is the bottleneck, not just model quality
- Simple guardrails/retries greatly boost success
- Human-like adaptability > perfect execution
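To make the guardrails/retries point concrete, here is one way such a layer can look. This is a sketch under assumptions, not the project's actual implementation: the `DANGEROUS` blocklist and `guarded` decorator are hypothetical names, and retrying on `RuntimeError` stands in for whatever the real failure signal is.

```python
import functools

# Illustrative blocklist of command tokens the agent should never run;
# not taken from the project, just an example guardrail.
DANGEROUS = {"rm", "format", "shutdown"}

def guarded(max_retries: int = 3):
    """Decorator sketch: refuse unsafe commands, retry transient failures."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(command: str, *args, **kwargs):
            if any(tok in DANGEROUS for tok in command.split()):
                raise PermissionError(f"blocked unsafe command: {command}")
            last_exc = None
            for _attempt in range(max_retries):
                try:
                    return fn(command, *args, **kwargs)
                except RuntimeError as exc:  # treated as transient here
                    last_exc = exc
            raise last_exc  # exhausted retries: surface the last failure
        return inner
    return wrap
```

Even a thin wrapper like this converts one-shot flaky actions into ones that succeed most of the time, which matches our experience that simple retries moved the success rate more than model tweaks did.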
What's next for SOTA Computer Use Agent Challenge
- Benchmark on standardized real-world tasks
- Extend to hybrid environments (desktop + web + APIs)
- Release starter framework for community use
Built With
- cua
- hud
- ollama
