Inspiration

My inspiration came from a staggering statistic: 85% of GenAI projects fail to reach production. As an AI engineer myself, I've lived the pain of this "last mile" problem. I was frustrated with the fragmented landscape of evaluation tools that required me to stitch together 3-5 different platforms for testing, security, and monitoring.

The biggest challenge was the "black box" nature of debugging. When an evaluation failed, I'd get a low score but no clear reason why. I'd spend countless hours—sometimes 40% of my development time—on trial-and-error debugging, trying to figure out if the root cause was a bad prompt, a slow database query, a failed API call, or a token limit issue. I knew there had to be a better way to get actionable insights and ship with confidence.

What it does

AgentEval is the first multi-agent evaluation platform that pinpoints the exact root cause of failures in your GenAI application. It goes beyond just giving you a score by answering the critical question: "WHERE did my system fail?"

My platform coordinates three types of AI agents to conduct a comprehensive evaluation:

  1. Persona Agents simulate realistic user conversations (like a frustrated customer or a technical expert) to test for UX and quality issues across hundreds of turns.
  2. Red Team Agents automatically execute over 450 attack patterns, covering the OWASP LLM Top 10, to identify security vulnerabilities before attackers do.
  3. Judge Agents analyze every interaction for quality, safety, and correctness, assigning detailed scores.

The "secret sauce" is my Trace-Based Root Cause Analysis. AgentEval correlates every low score from the Judge Agent directly with the underlying distributed trace from your application, instantly identifying the bottleneck—whether it's a slow database query, an LLM hitting its token limit, or a faulty tool execution.

How I built it

I built AgentEval as an AWS-native solution to ensure scalability, reliability, and deep integration for the hackathon.

  • Core Architecture: The system is built on a serverless, event-driven architecture using AWS EventBridge to coordinate the three agents (Persona, Red Team, Judge). State management for conversations and agent memory is handled by Amazon DynamoDB, and all evaluation reports are stored in Amazon S3.

  • AI & Language Models: All agent reasoning and evaluation logic is powered by Amazon Bedrock, using models like Claude Sonnet for their advanced reasoning capabilities.

  • Trace Correlation (The Magic): This was the most critical piece. My agents propagate W3C Trace Context headers to the target application. I used the OpenTelemetry standard for instrumentation. The target application's traces are captured by AWS X-Ray. My Judge Agent then queries the X-Ray API to fetch the corresponding trace for any failed interaction. A custom analysis algorithm parses the trace, identifies anomalies (like high latency or errors), and constructs a "failure chain" to pinpoint the root cause and generate actionable recommendations.

Challenges I ran into

My biggest challenge was building the Trace Correlation Algorithm. It was one thing to get the trace data, but another to reliably link a qualitative failure (e.g., "the AI's response was incomplete") to a specific technical event in the trace (e.g., "the max_tokens parameter was set too low"). It required building a sophisticated set of heuristics and failure chain templates to make the analysis consistently accurate and, most importantly, actionable for a developer.

Another challenge was managing the project scope within the tight hackathon timeline. My vision is huge, so I had to be ruthless in my prioritization using the MoSCoW method to focus only on the "Must Have" features that delivered my core value proposition. This meant temporarily shelving features like a fancy UI dashboard to ensure the backend evaluation and tracing engine was rock-solid.

Accomplishments that I'm proud of

I am incredibly proud of successfully implementing the Trace-Based Root Cause Analysis. To my knowledge, this is the only evaluation platform that moves beyond scores to provide this level of deep, actionable insight. Seeing it correctly identify a 3500ms database query timeout as the root cause for an incomplete LLM response was a huge "aha!" moment.

I'm also proud of the multi-agent architecture. Getting three distinct AI agents to coordinate seamlessly through an event-driven system was a significant architectural achievement. Finally, building a fully reproducible, infrastructure-as-code deployment using AWS CloudFormation is an accomplishment that makes my solution enterprise-ready from day one.

What I learned

The biggest lesson was that for GenAI applications, a simple pass/fail score is not enough. The complexity of these systems, with their chains of LLMs, tools, and data sources, means that failures are often emergent and non-obvious. You cannot debug what you cannot see.

I learned that distributed tracing, a concept traditionally used in microservices for performance monitoring, is the key to unlocking observability for complex AI agentic systems. Applying it to evaluation is a powerful paradigm shift. I also validated that a comprehensive evaluation must be multi-faceted, combining UX (personas), security (red-teaming), and quality (judging) to truly ensure an application is production-ready.

What's next for AgentEval

Winning the hackathon is just the beginning! My roadmap is guided by the McKinsey Three Horizons framework:

  • Horizon 1 (0-12 months): Launch & Scale. My immediate next step is to launch the open-source core to build a community. I will onboard my first 100 beta customers, get listed on the AWS Marketplace, and aim to achieve $100K in ARR within six months.

  • Horizon 2 (12-24 months): Build New Capabilities. I plan to expand into multi-modal evaluation (images, video, audio), develop industry-specific evaluation frameworks (for finance, healthcare), and introduce white-label solutions for partners.

  • Horizon 3 (24-36 months): Create New Markets. My long-term vision is to build an AI agent marketplace and expand into AI safety and regulatory compliance automation, positioning AgentEval as the category leader in ensuring safe, reliable, and effective AI.

Built With

Share this project:

Updates