Inspiration
As enterprises transition from rigid Robotic Process Automation to flexible, non-deterministic AI agents, the software testing paradigm has broken down. Traditional QA relies on strict string matching and hardcoded assertions. When an LLM-based agent processes an invoice and returns "Acme Corporation" instead of "Acme Corp.", a traditional test fails, even though the AI is semantically correct.
Furthermore, translating raw business requirements into comprehensive test cases is a slow, manual bottleneck that frequently misses edge cases. We realized that if we are deploying AI to do the work, we must deploy AI to test the work. This inspired us to build AgentProof: an autonomous QA system that tests other AI agents on the UiPath platform.
What it does
AgentProof is a self-contained, multi-agent evaluation pipeline deployed natively as a UiPath Coded Automation. It performs four critical functions:
- Requirements Extraction: It ingests raw business requirement documents and uses a large language model to automatically generate meaningful, edge-case-aware test scenarios.
- Test Generation: It structures these scenarios into actionable test cases.
- LLM-as-a-Judge Evaluation: It takes the actual outputs produced by a target automation and evaluates them against the generated test cases. Instead of rigid string matching, it evaluates semantic correctness.
- Human-in-the-Loop Routing: If the LLM judge evaluates an output and the result is uncertain (for instance, if the target agent hallucinated a value that is borderline acceptable), AgentProof automatically creates a task in UiPath Action Center. This ensures humans remain accountable for high-impact decisions while the agents handle the repetitive evaluation workload.
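To make the routing rule concrete, here is a minimal sketch of how a judge verdict could be turned into a pass, a fail, or a human review. The `Verdict` and `Route` types, the `route_verdict` function, and the 0.8 threshold are all hypothetical names and values chosen for illustration; they are not taken from the AgentProof source.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PASS = "pass"
    FAIL = "fail"
    HUMAN_REVIEW = "human_review"  # would become an Action Center task


@dataclass(frozen=True)
class Verdict:
    passed: bool       # the judge's binary decision
    confidence: float  # self-reported confidence, expected in [0.0, 1.0]
    rationale: str     # short explanation shown to the human reviewer


def route_verdict(verdict: Verdict, threshold: float = 0.8) -> Route:
    """Accept confident verdicts and escalate everything else to a human."""
    # An out-of-range (or NaN) confidence is treated as "uncertain".
    if not 0.0 <= verdict.confidence <= 1.0:
        return Route.HUMAN_REVIEW
    if verdict.confidence < threshold:
        return Route.HUMAN_REVIEW
    return Route.PASS if verdict.passed else Route.FAIL
```

Escalating on low confidence regardless of the binary decision means that a hesitant PASS and a hesitant FAIL both reach a reviewer, which matches the goal of keeping humans accountable for borderline cases.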
How we built it
We developed AgentProof entirely in Python using the UiPath SDK and the Google GenAI SDK. The architecture was designed to run serverless within UiPath Orchestrator.
- Core Logic: The system relies on three distinct agent classes (Requirements Reader, Test Generator, and Judge). We utilized Google's Gemini models for the reasoning engine.
- UiPath Integration: We integrated directly with the Orchestrator REST API. By authenticating via the Python SDK, our automation makes POST requests to the `/odata/Tasks/GenericTasks/CreateTask` endpoint, generating `ExternalTask` items directly in Action Center.
- Packaging and Deployment: We used `hatchling` to package the Python project. Using the UiPath CLI (`uipath pack`), we compiled the source code into a `.nupkg` file, which was then uploaded to Orchestrator and executed as a standard Process.
- Secret Management: We utilized Orchestrator's Environment Configuration to securely inject API keys and tenant variables into the Python process at runtime.
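The task-creation call described above can be sketched as follows. The endpoint path and the `ExternalTask` type come from this write-up; everything else is an assumption made for illustration: the helper name `build_create_task_request`, the `title`, `priority`, and `data` field names, and the use of the `X-UIPATH-OrganizationUnitId` header to select the folder. The sketch only builds the request and does not send it, and it takes the bearer token as a parameter rather than showing how the SDK obtains it.

```python
import json
from urllib.request import Request

# Endpoint path as quoted in the write-up.
CREATE_TASK_PATH = "/odata/Tasks/GenericTasks/CreateTask"


def build_create_task_request(base_url, token, folder_id, title, data):
    """Build (but do not send) a POST request that creates a review task."""
    body = {
        "type": "ExternalTask",  # task type named in the write-up
        "title": title,          # assumed field name
        "priority": "Medium",    # assumed field name and value
        "data": data,            # assumed field name for the task payload
    }
    return Request(
        url=base_url.rstrip("/") + CREATE_TASK_PATH,
        data=json.dumps(body).encode("utf-8"),
        method="POST",  # the write-up reports 405 errors during endpoint discovery
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            # Assumed: folder-scoped calls identify the folder via this header.
            "X-UIPATH-OrganizationUnitId": str(folder_id),
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call; the exact payload schema should be checked against the Orchestrator API definition for the target tenant before relying on it.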
Challenges we ran into
Building a cloud-native agentic system introduced several unexpected technical hurdles:
- UiPath Packaging Nuances: We encountered significant friction getting the `hatchling` build system to properly discover and package our Python code for the UiPath execution environment. By default, the builder attempted to match directory names to project names, which led to build failures in the cloud. We had to carefully configure `pyproject.toml` to explicitly define the package root to ensure the cloud dependency installation succeeded.
- Action Center API Endpoints: Implementing human-in-the-loop without relying on a bulky Maestro BPMN workflow required reverse-engineering parts of the Orchestrator REST API. We initially faced `405 Method Not Allowed` errors before identifying the precise endpoint routing and the mandatory `ExternalTask` JSON payload structure required to programmatically create actionable tasks.
- LLM Rate Limiting: Processing dozens of test cases rapidly exhausted the daily quotas on the free-tier Gemini API, throwing `429 RESOURCE_EXHAUSTED` errors mid-execution.
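For the packaging issue, the kind of `pyproject.toml` change involved looks roughly like the fragment below. It is a sketch under assumptions: the project name, version, and the `src/agentproof_agents` directory are hypothetical, and the actual layout of the AgentProof repository may differ. The `packages` option under `[tool.hatch.build.targets.wheel]` is the Hatchling setting for naming the package directory explicitly instead of relying on name-based discovery.

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "agentproof"
version = "0.1.0"

[tool.hatch.build.targets.wheel]
# Hypothetical layout: the importable package directory does not match the
# project name, so it is listed explicitly rather than auto-discovered.
packages = ["src/agentproof_agents"]
```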
Accomplishments that we're proud of
To solve the API rate-limiting issue, we built a multi-model fallback chain. Our custom Gemini client intercepts quota exhaustion errors and automatically cascades to the next available model in a predefined priority list (e.g., from Gemini 3.5 Flash, down to Gemini 3.1 Flash Lite, etc.). This keeps the testing pipeline running through a large evaluation batch as long as at least one model in the list still has quota.
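A fallback chain of this kind can be sketched in a few lines. This is an illustration rather than the project's client: `generate_with_fallback`, `AllModelsExhausted`, and `is_quota_error` are hypothetical names, the model call is injected as a plain callable so the sketch has no SDK dependency, and the quota check assumes the raised exception exposes the HTTP status as a `code` attribute.

```python
from typing import Callable, Sequence, Tuple


class AllModelsExhausted(RuntimeError):
    """Raised when every model in the chain reported quota exhaustion."""


def is_quota_error(exc: Exception) -> bool:
    # Assumption: the client library's exception carries the HTTP status in `.code`.
    return getattr(exc, "code", None) == 429


def generate_with_fallback(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
    quota_check: Callable[[Exception], bool] = is_quota_error,
) -> Tuple[str, str]:
    """Try each model in priority order; return (model_used, response_text)."""
    last_error = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:
            if not quota_check(exc):
                raise  # unrelated failures should surface, not be masked
            last_error = exc  # quota exhausted: fall through to the next model
    raise AllModelsExhausted(
        f"quota exhausted on all models: {list(models)}"
    ) from last_error
```

Only quota errors trigger the cascade. Re-raising everything else keeps genuine bugs, such as a malformed prompt or an authentication failure, from being hidden behind a silent model switch.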
We are also incredibly proud of successfully deploying a pure Python, headless LLM orchestration pipeline natively into UiPath Orchestrator that can intelligently pause its own logical flow to ask a human for help via Action Center.
What we learned
We gained a deep understanding of the UiPath Orchestrator REST API, specifically regarding folder-scoped permissions and task management. We also learned how to properly structure, build, and deploy Python-based Coded Automations using modern build backends. On the AI front, we learned that prompting an LLM to act as a strict, binary judge requires highly specific instructions regarding confidence thresholds to prevent it from passing hallucinated data.
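As an illustration of the judge-prompting lesson, the sketch below pairs an explicit prompt with a parser that fails closed. The prompt wording, the JSON reply shape, the 0.8 threshold, and the function names `build_judge_prompt` and `parse_judge_reply` are assumptions for this example and not the prompts AgentProof actually uses.

```python
import json


def build_judge_prompt(requirement, expected, actual, threshold=0.8):
    """Spell out the decision rule instead of leaving it to the model."""
    return (
        "You are a strict QA judge. Decide whether ACTUAL satisfies REQUIREMENT.\n"
        "Treat values as equal when they mean the same thing "
        "(for example 'Acme Corp.' and 'Acme Corporation').\n"
        f"Answer PASS only if your confidence is at least {threshold}.\n"
        "Never accept a value that is not supported by EXPECTED.\n"
        'Reply with JSON only: {"verdict": "PASS" or "FAIL", '
        '"confidence": 0.0-1.0, "reason": "..."}\n\n'
        f"REQUIREMENT: {requirement}\nEXPECTED: {expected}\nACTUAL: {actual}\n"
    )


def parse_judge_reply(reply):
    """Fail closed: any malformed reply becomes a zero-confidence FAIL."""
    fallback = {"verdict": "FAIL", "confidence": 0.0,
                "reason": "unparseable judge reply"}
    try:
        parsed = json.loads(reply)
    except (json.JSONDecodeError, TypeError):
        return fallback
    if not isinstance(parsed, dict):
        return fallback
    verdict = parsed.get("verdict")
    confidence = parsed.get("confidence")
    if verdict not in ("PASS", "FAIL"):
        return fallback
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return fallback
    if not 0.0 <= confidence <= 1.0:
        return fallback
    return {"verdict": verdict, "confidence": float(confidence),
            "reason": str(parsed.get("reason", ""))}
```

Defaulting to FAIL with zero confidence means a chatty or truncated reply can never be mistaken for a pass, and a zero-confidence result is a natural candidate for human review.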
What's next for AgentProof
Our immediate next step is to integrate AgentProof directly with UiPath Test Manager. While our current implementation handles evaluation and human review, pushing the final PASS/FAIL verdicts back into Test Manager via its REST API would provide QA teams with the standard dashboarding and reporting they are accustomed to. Eventually, we plan to expand the Judge Agent to accept visual inputs, allowing it to validate UI state changes made by target automations.