Inspiration
Back when we took our Software Engineering module, one of the most painful parts of the project was manual UI testing. It was repetitive, time-consuming, and honestly, we still ended up missing a lot of bugs because it was easy to overlook steps after hours of testing. At that point, we kept thinking: “It would be so much nicer if there was a smarter way to do this automatically.”
So when this opportunity came, we immediately knew what we wanted to build: an automated UI tester that doesn't just click blindly but adapts, detects completion in real time, and saves both time and mistakes. In a way, Agent 007 is the tool we wish we had back then.
What it does
Agent 007 is an enhanced GUI automation system that can run complex workflows more efficiently by detecting real-time state changes instead of relying on static timers. It automatically switches between fast and accurate parsers, captures completion states using AI, and adapts to dynamic environments like mobile apps, browsers, and embedded UIs. The result: fewer failures, faster execution, and much more human-like automation.
How we built it
We built Agent 007 using Python, FastAPI, and PyTorch, with a backbone of CLIP models, YOLO-based parsing, and binary classifiers. The hardest part was designing the adaptive wait mechanism: we trained and integrated a lightweight classifier that continuously monitors screenshots to decide whether a step is actually complete. We combined that with a unified parser system (OmniParser for accuracy, CLIP Parser for speed), wrapped everything in a containerized service with Docker, and built a simple Gradio frontend for demos.
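To make the adaptive wait idea concrete, here is a rough sketch of the polling loop that replaces a fixed `sleep()`. The names (`capture_screenshot`, `is_complete`, the timeout values) are illustrative placeholders, not our exact implementation:

```python
import time

def adaptive_wait(capture_screenshot, is_complete, timeout=10.0, poll_interval=0.25):
    """Poll the screen until the completion classifier says the step is done,
    instead of sleeping for a fixed, worst-case duration."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        frame = capture_screenshot()
        # Binary classifier: "has this step actually finished?"
        if is_complete(frame):
            return True
        time.sleep(poll_interval)
    # Timed out: hand off to failure handling instead of blindly continuing.
    return False
```

The payoff is that fast steps finish as soon as the classifier fires, while slow steps still get the full timeout, which is where most of the execution-time savings come from.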
Challenges we ran into
- Balancing speed vs. accuracy: OmniParser was great for precision but slow, while CLIP Parser was fast but sometimes confused by noisy UIs.
- Building a binary completion detector that worked across very different apps (Google Maps vs TikTok).
- Orchestrating multiple services (FastAPI, vector DB, OCR, and parsers) without letting latency creep in.
- Debugging Android automation with ADB, which was unpredictable at times!
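The speed-vs-accuracy trade-off above is what pushed us toward the unified parser. A minimal sketch of the fallback logic (the parser callables, `ParseResult` shape, and confidence threshold here are illustrative assumptions, not our exact code):

```python
from dataclasses import dataclass

@dataclass
class ParseResult:
    elements: list      # detected UI elements
    confidence: float   # parser's self-reported confidence in [0, 1]

def parse_ui(screenshot, fast_parser, accurate_parser, threshold=0.8):
    """Try the fast CLIP-based parser first; fall back to the slower,
    OmniParser-style pass only when confidence is too low."""
    result = fast_parser(screenshot)
    if result.confidence >= threshold:
        return result
    # Noisy UI: pay the latency cost of the accurate parser instead.
    return accurate_parser(screenshot)
```

The design choice is that the expensive parser only runs on the minority of frames where the cheap one is unsure, so average latency stays close to the fast path.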
Accomplishments that we're proud of
- Cutting workflow execution time by up to 53% while keeping accuracy high.
- Making our automation robust to dynamic UIs, something that’s a common failure point in industry scripts.
- Designing a system that feels scalable and reusable, not just a one-off hack.
- Watching 007 complete a 'TikTok scroll & like' workflow dynamically, without breaking, for the first time was a memorable moment.
What we learned
- Small models (like binary classifiers) can make a huge difference when used in the right place.
- Combining multiple parsers is better than betting on one “perfect” solution: hybrid approaches win.
- Inference efficiency is just as important as model accuracy in real-world automation.
- Sometimes the bottleneck isn’t the model; it’s the orchestration of services.
What’s next for Agent 007
As we look ahead, our team has identified several directions to make Agent 007 even more powerful and efficient:
Smarter LLM Prompting
Add SRS (Software Requirement Specification)–driven documentation to improve the quality of prompts sent to the LLM platform. This ensures clearer task understanding and more consistent responses from the model.

Parallel Execution of Tasks
Within a workflow, identify independent tasks that can be executed in parallel. By leveraging DAGs (Directed Acyclic Graphs) and orchestration tools like Prefect, we can reduce total execution time significantly.

Concurrent Workflows
Introduce multi-threading to enable the system to run multiple workflows simultaneously, making Agent 007 capable of handling larger-scale automation in real-world environments.

Macro Actions
Implement macro nodes that group simple, repetitive actions into a single action unit. This reduces the need to constantly query the graph database and makes the workflow both faster and more efficient.
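To sketch what the parallel-execution idea could look like, here is a minimal DAG runner using only Python's standard library (`graphlib` plus a thread pool) rather than Prefect; the task names and dependency structure are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

def run_workflow(tasks, dependencies, max_workers=4):
    """Execute a workflow DAG, dispatching all tasks whose prerequisites
    are satisfied in parallel.

    tasks: name -> callable; dependencies: name -> set of prerequisite names.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # also raises CycleError if the graph isn't a DAG
    completed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # All tasks with no remaining prerequisites run concurrently.
            ready = sorter.get_ready()
            futures = {name: pool.submit(tasks[name]) for name in ready}
            for name, fut in futures.items():
                fut.result()  # propagate any task exception
                sorter.done(name)
                completed.append(name)
    return completed
```

For example, in a graph where tasks `b` and `c` both depend only on `a`, they would be dispatched in the same batch, which is exactly the latency win described above.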