AgentProof

Inspiration

As enterprises transition from rigid Robotic Process Automation to flexible, non-deterministic AI agents, the software testing paradigm has broken down. Traditional QA relies on strict string matching and hardcoded assertions. When an LLM-based agent processes an invoice and returns "Acme Corporation" instead of "Acme Corp.", a traditional test fails, even though the AI is semantically correct.

Furthermore, translating raw business requirements into comprehensive test cases is a slow, manual bottleneck that frequently misses edge cases. We realized that if we are deploying AI to do the work, we must deploy AI to test the work. This inspired us to build AgentProof: an autonomous QA system that tests other AI agents on the UiPath platform.

What it does

AgentProof is a self-contained, multi-agent evaluation pipeline deployed natively as a UiPath Coded Automation. It performs three core functions, plus a human-in-the-loop escalation step:

Requirements Extraction: It ingests raw business requirement documents and uses a large language model to automatically generate meaningful, edge-case-aware test scenarios.
Test Generation: It structures these scenarios into actionable test cases.
LLM-as-a-Judge Evaluation: It takes the actual outputs produced by a target automation and evaluates them against the generated test cases. Instead of rigid string matching, it evaluates semantic correctness.

Human-in-the-Loop Routing: If the LLM judge evaluates an output and the result is uncertain (for instance, if the target agent hallucinated a value that is borderline acceptable), AgentProof automatically creates a task in UiPath Action Center. This ensures humans remain accountable for high-impact decisions while the agents handle the repetitive evaluation workload.
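The routing rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the `Verdict` fields, the `Route` names, and the 0.8 threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PASS = "pass"
    FAIL = "fail"
    HUMAN_REVIEW = "human_review"


@dataclass
class Verdict:
    passed: bool        # the judge's semantic pass/fail decision
    confidence: float   # the judge's self-reported confidence, 0.0 to 1.0
    rationale: str = ""


# Hypothetical cut-off; a real deployment would tune this per use case.
REVIEW_THRESHOLD = 0.8


def route(verdict: Verdict, threshold: float = REVIEW_THRESHOLD) -> Route:
    """Accept confident verdicts as-is and send everything else to a human."""
    if not 0.0 <= verdict.confidence <= 1.0:
        # An out-of-range confidence suggests a malformed judge reply.
        return Route.HUMAN_REVIEW
    if verdict.confidence < threshold:
        return Route.HUMAN_REVIEW
    return Route.PASS if verdict.passed else Route.FAIL
```

With this rule, `route(Verdict(passed=True, confidence=0.5))` yields `Route.HUMAN_REVIEW`, which is the point at which a task would be created in Action Center.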

How we built it

We developed AgentProof entirely in Python using the UiPath SDK and the Google GenAI SDK. The architecture was designed to run serverless within UiPath Orchestrator.

Core Logic: The system relies on three distinct agent classes (Requirements Reader, Test Generator, and Judge). We utilized Google's Gemini models for the reasoning engine.
UiPath Integration: We integrated directly with the Orchestrator REST API. By authenticating via the Python SDK, our automation makes POST requests to the /odata/Tasks/GenericTasks/CreateTask endpoint, generating ExternalTask items directly in Action Center.
Packaging and Deployment: We used hatchling to package the Python project. Using the UiPath CLI (uipath pack), we compiled the source code into a .nupkg file, which was then uploaded to Orchestrator and executed as a standard Process.
Secret Management: We utilized Orchestrator's Environment Configuration to securely inject API keys and tenant variables into the Python process at runtime.
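As a rough sketch of the Action Center call, the snippet below builds (but does not send) the POST request using only the standard library. The endpoint path and the ExternalTask type come from the description above; the payload field names, the priority value, and the folder header are assumptions based on our reading of the Orchestrator API and should be checked against your tenant's API reference. The helper name `build_create_task_request` is hypothetical.

```python
import json
import urllib.request

# Endpoint path as given in this write-up.
CREATE_TASK_PATH = "/odata/Tasks/GenericTasks/CreateTask"


def build_create_task_request(base_url, token, folder_id, title, data):
    """Build a POST request that asks Orchestrator to create an external task."""
    payload = {
        "type": "ExternalTask",   # task type named in the write-up
        "title": title,           # assumed field name
        "priority": "Medium",     # assumed field name and value
        "data": data,             # assumed field: context shown to the reviewer
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + CREATE_TASK_PATH,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",  # stated explicitly, since the wrong verb returns 405
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            # Assumed header for selecting the Orchestrator folder.
            "X-UIPATH-OrganizationUnitId": str(folder_id),
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the payload easy to inspect and unit test without a live tenant.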

Challenges we ran into

Building a cloud-native agentic system introduced several unexpected technical hurdles:

UiPath Packaging Nuances: We encountered significant friction getting the hatchling build system to properly discover and package our Python code for the UiPath execution environment. By default, the builder attempted to match directory names to project names, which led to build failures in the cloud. We had to carefully configure pyproject.toml to explicitly define the package root to ensure the cloud dependency installation succeeded.
Action Center API Endpoints: Implementing human-in-the-loop without relying on a bulky Maestro BPMN workflow required reverse-engineering parts of the Orchestrator REST API. We initially faced 405 Method Not Allowed errors before identifying the precise endpoint routing and the mandatory ExternalTask JSON payload structure required to programmatically create actionable tasks.
LLM Rate Limiting: Processing dozens of test cases rapidly exhausted the daily quotas on the free-tier Gemini API, throwing 429 RESOURCE_EXHAUSTED errors mid-execution.

Accomplishments that we're proud of

To solve the API rate-limiting issue, we engineered a multi-model fallback chain. Our custom Gemini client intercepts quota exhaustion errors and automatically cascades to the next available model in a predefined priority list (e.g., from Gemini 3.5 Flash down to Gemini 3.1 Flash Lite). This keeps a large evaluation batch running when a single model's quota runs out, rather than aborting mid-execution.
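The fallback idea can be sketched independently of any SDK. In the example below, `QuotaExhausted` is a hypothetical stand-in for whatever exception the provider raises on a 429 RESOURCE_EXHAUSTED response, and `call_model` is an injected function that wraps the real API call. Neither name comes from the project or from the Google GenAI SDK.

```python
from typing import Callable, Sequence, Tuple


class QuotaExhausted(Exception):
    """Hypothetical stand-in for a provider's 429 RESOURCE_EXHAUSTED error."""


def generate_with_fallback(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each model in priority order; return (model_used, response_text)."""
    last_error = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except QuotaExhausted as exc:
            # Only quota errors trigger a fallback; other errors propagate.
            last_error = exc
    raise RuntimeError("all models in the fallback chain are out of quota") from last_error
```

Returning the model name alongside the response lets the pipeline record which model actually produced each verdict. Note that the chain still fails once every model is exhausted, so a large batch may also need retry delays or checkpointing.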

We are also incredibly proud of successfully deploying a pure Python, headless LLM orchestration pipeline natively into UiPath Orchestrator that can intelligently pause its own logical flow to ask a human for help via Action Center.

What we learned

We gained a deep understanding of the UiPath Orchestrator REST API, specifically regarding folder-scoped permissions and task management. We also learned how to properly structure, build, and deploy Python-based Coded Automations using modern build backends. On the AI front, we learned that prompting an LLM to act as a strict, binary judge requires highly specific instructions regarding confidence thresholds to prevent it from passing hallucinated data.
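To illustrate the kind of strictness this involves, here is a hedged sketch of a judge prompt and a defensive parser for its reply. The prompt wording, the JSON schema, and the function name `parse_judge_reply` are illustrative assumptions rather than the prompts AgentProof actually uses.

```python
import json

# Illustrative prompt; doubled braces are literal braces for str.format.
JUDGE_PROMPT = """You are a strict QA judge. Compare ACTUAL against EXPECTED.
Treat values as matching only if they mean the same thing.
If ACTUAL contains a value with no support in EXPECTED, answer FAIL.
Reply with JSON only:
{{"verdict": "PASS" or "FAIL", "confidence": <number from 0 to 1>, "reason": "<one sentence>"}}

EXPECTED: {expected}
ACTUAL: {actual}
"""


def parse_judge_reply(raw: str) -> dict:
    """Validate the judge's reply; anything malformed becomes a zero-confidence FAIL."""
    fallback = {"verdict": "FAIL", "confidence": 0.0, "reason": "unparseable judge reply"}
    try:
        reply = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return fallback
    if not isinstance(reply, dict):
        return fallback
    verdict = reply.get("verdict")
    confidence = reply.get("confidence")
    if verdict not in ("PASS", "FAIL"):
        return fallback
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return fallback
    if not 0.0 <= confidence <= 1.0:
        return fallback
    return {
        "verdict": verdict,
        "confidence": float(confidence),
        "reason": str(reply.get("reason", "")),
    }
```

Defaulting to a zero-confidence FAIL means a garbled or evasive reply can never be counted as a pass; it fails closed and can be flagged for review instead.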

What's next for AgentProof

Our immediate next step is to integrate AgentProof directly with UiPath Test Manager. While our current implementation handles evaluation and human review, pushing the final PASS/FAIL verdicts back into Test Manager via its REST API would provide QA teams with the standard dashboarding and reporting they are accustomed to. Eventually, we plan to expand the Judge Agent to accept visual inputs, allowing it to validate UI state changes made by target automations.

Built With

antigravity
gemini-api
hatchling
json
llm
python
uipath
uipath-action-center
uipath-orchestrator

Updates

Kautilya DK started this project — Jun 19, 2026 04:36 PM EDT
