Inspiration

The proliferation of AI-powered features means frequent changes to LLM/AI agent prompts and AI-written code. Manually reviewing AI outputs for every prompt modification is time-consuming, subjective, and error-prone, leading to quality regressions, LLM hallucinations, and developer friction. I was inspired to bring objective, automated "testing" principles directly into the developer workflow to ensure the quality of AI features and AI-generated code.

What it does

The LLM Judge Agent is a GitLab Duo agent that automatically evaluates the quality of LLM outputs whenever prompt files change in a Merge Request. It runs predefined test cases, uses GitLab Duo LLMs to objectively judge the outputs against strict criteria (Accuracy, Completeness, Hallucination, Clarity), posts a detailed pass/fail report to the MR, and acts as a CI pipeline gate to prevent low-quality prompt changes from merging.
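To give a sense of what a test case and the judging rubric might look like, here is a minimal Python sketch. The names and schema are illustrative assumptions, not the exact format the agent uses.

```python
# Illustrative sketch only: the real test-case schema may differ.
# Each test case pairs an input for the prompt under review with the
# behavior the judge should verify.
TEST_CASES = [
    {
        "name": "summarize_release_notes",
        "input": "Summarize the following release notes in three bullet points: ...",
        "expected_behavior": "Three concise, factual bullets with no invented features.",
    },
]

# The judge scores every output on each criterion and returns pass/fail,
# which the CI gate aggregates into an overall verdict.
JUDGE_CRITERIA = ["Accuracy", "Completeness", "Hallucination", "Clarity"]
```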

How I built it

I developed the agent in Python, leveraging Anthropic Claude for local testing. The agent's logic is modular, consisting of a TestRunner, LLMJudge, and Reporter. It integrates seamlessly with GitLab CI/CD via .gitlab-ci.yml and is defined as a GitLab Duo Agent and Flow. For local testing and development, I created a simulated target LLM environment.
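A rough sketch of how the three modules could fit together is below. The class and function names, model identifier, and prompt wording are assumptions for illustration; only the overall shape (run test cases, judge with Claude, report pass/fail) reflects the actual design.

```python
import anthropic  # official Anthropic SDK; expects ANTHROPIC_API_KEY in the environment


def simulate_target_llm(prompt: str, case_input: str) -> str:
    """Stand-in for the target LLM during local development (hypothetical)."""
    return f"[simulated response for: {case_input[:40]}...]"


class TestRunner:
    """Runs each test case against the prompt under review and collects outputs."""

    def run(self, prompt: str, test_cases: list[dict]) -> list[dict]:
        return [
            {"case": case, "output": simulate_target_llm(prompt, case["input"])}
            for case in test_cases
        ]


class LLMJudge:
    """Asks Claude to grade each output against the evaluation criteria."""

    def __init__(self) -> None:
        self.client = anthropic.Anthropic()

    def judge(self, result: dict) -> str:
        response = self.client.messages.create(
            model="claude-3-5-sonnet-latest",  # placeholder model name
            max_tokens=512,
            messages=[{
                "role": "user",
                "content": (
                    "Evaluate this output for Accuracy, Completeness, Hallucination, "
                    f"and Clarity, and give a pass/fail verdict:\n\n{result['output']}"
                ),
            }],
        )
        return response.content[0].text


class Reporter:
    """Formats verdicts into a Markdown pass/fail report for the MR comment."""

    def build_report(self, verdicts: list[str]) -> str:
        return "\n".join(f"- {verdict}" for verdict in verdicts)


if __name__ == "__main__":
    cases = [{"name": "example", "input": "Summarize these release notes: ..."}]
    results = TestRunner().run("You are a release-notes summarizer.", cases)
    report = Reporter().build_report([LLMJudge().judge(r) for r in results])
    print(report)
```

In the real pipeline, the resulting report is posted as a comment on the MR and a failing verdict fails the CI job, as described above.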

Challenges I ran into

Integrating with the evolving GitLab Duo Agent Platform involved understanding its specific agent and flow definition structures.

Accomplishments that I am proud of

I successfully implemented a fully automated, LLM-powered quality gate for AI development, capable of providing objective feedback directly within the developer's workflow. Creating a robust testing environment for the agent itself was a significant accomplishment. The solution directly addresses a major bottleneck in AI feature development by enforcing quality through automation.

What I learned

I gained deep insights into the GitLab Duo Agent Platform's capabilities for event-driven automation and workflow integration. I also reinforced best practices for engineering LLM prompts for evaluation tasks and learned to navigate the nuances of integrating external AI services such as Anthropic Claude within a CI/CD context.
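One pattern that worked well for the evaluation prompts was asking the judge for a structured verdict per criterion, so the CI gate can parse pass/fail reliably. The template below is an illustrative sketch, not the exact prompt the agent ships with.

```python
# Hypothetical judge prompt template; the exact wording in the agent differs.
JUDGE_PROMPT_TEMPLATE = """You are a strict evaluator of LLM outputs.
Score the OUTPUT below for the given INPUT on Accuracy, Completeness,
Hallucination, and Clarity. Respond with JSON only, for example:
{{"Accuracy": "pass", "Completeness": "pass", "Hallucination": "fail",
  "Clarity": "pass", "notes": "<one-sentence justification>"}}

INPUT:
{case_input}

OUTPUT:
{model_output}
"""

prompt = JUDGE_PROMPT_TEMPLATE.format(
    case_input="Summarize these release notes: ...",
    model_output="[the target LLM's answer]",
)
```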

What's next for LLM Test Orchestrator

I plan to enhance the TestRunner with more sophisticated mock LLM capabilities for complex local testing scenarios, develop a more advanced regression tracking system, and explore integrating additional GitLab built-in tools for richer automated actions (e.g., automatically creating an issue if a critical regression is detected). I also aim to make the agent model-agnostic to support other LLM providers.

Built With

Python, GitLab CI/CD, GitLab Duo Agent Platform, Anthropic Claude