Inspiration

We kept running into the same thing: you ask an AI support agent for help, it confidently says "all set", and a minute later you're typing "talk to a human". Technically the agent succeeded. The ticket says resolved, the dashboards are green, nothing errored. But as a customer you walked away worse off. We realized there's a whole class of failures that never throw an error, so no monitoring or eval ever catches them. Someone has to go back and ask the one question nobody's measuring: did this actually serve the customer?

What it does

GapAudit is an autonomous auditor for customer-facing AI agents. On a regular basis, It reads accumulated agent traces from Arize Phoenix and judges each one across five service-quality lenses : false resolution, customer effort inflation, trust-damaging handling, ignored context, and recurring operational drift. For every finding it infers, from the trace alone, what the agent should have produced, why the behavior hurts the customer, and the exact spans that prove it. Findings form a history, which is stored in MongoDB database memory of GapAudit agent. It pulls the history, check if misbehavior repeats, and red-flags them with higher severity. Findings and patterns land in a review dashboard where a human can confirm or dismiss them, then turn a confirmed failure into a regression test that's pushed straight back into Phoenix. Supervisor can get a sense of how their agents are working sub-optimally in silent ways, develop them, and re-test them on real-world cases, so the same frustration can't ship twice. We support a detect, review, develop, prevent cycle for operation-level problem for your autonomous customer-facing AI Agents.

How we built it

GapAudit is built as a Google ADK Agent powered by Gemini and connected entirely through MCP. The audit agent pulls traces from Arize Phoenix, retrieves historical findings and related context from MongoDB, and reasons end-to-end about whether an interaction may have harmed the customer experience. Another key part of the system is recurrence analysis. Through MongoDB MCP, the audit agent piles up the behavioral signals of your AI agents, query historical findings and related current interaction traces directly. Instead of reviewing each trace in isolation, it investigates whether a behavior represents a repeated pattern across time and across multiple customers and incidents. This allows GapAudit to surface systemic risks rather than one-off mistakes : The hidden bad habit of your agent. The human review layer is a Next.js dashboard deployed on Vercel. Reviewers can inspect findings, validate them, and generate regression tests with confirmed failure cases that are pushed directly back into Phoenix.

Challenges we ran into

One of the hardest challenges was obtaining realistic failure cases. Real production traces rarely come with labels saying “this customer was frustrated". To evaluate GapAudit, we built mock customer-facing agents and intentionally equipped them with flawed tools. We deployed a simple agent written in Typescript loop on google-cloud given mock tools, then generated realistic interactions and exported their traces into Phoenix. Designing failures that felt natural and authentic to real-world customer-facing agent production scenarios were especially hard.

Accomplishments that we're proud of

We’re proud that GapAudit goes beyond trace review and reasons about recurring service failures. Rather than treating every trace independently, the agent actively investigates historical context to determine whether a behavior is an isolated incident or part of a larger operational pattern. This allows it to identify customer-impacting risks that traditional evaluations and observability systems often miss. We’re also proud of maintaining a separation between observation and judgment. The audit agent only sees what exists in the trace: inputs, outputs, tool calls, and execution history. From that evidence alone, it infers what likely went wrong, why it matters to customers, and how the issue should be addressed. This allows user to analyze and develop their agent in service operation-level, not trivial task solver.

What we learned

We learned that building a capable agent is only the beginning. Evaluations and observability are essential, but deploying and operating autonomous agents in production introduces an entirely different set of challenges. Many of the most important failures don’t appear as errors, exceptions, or failed evaluations. They emerge through repeated customer interactions and slowly accumulate into trust, satisfaction, and operational problems. We also learned that long-term agent operations require systems that continuously monitor behavior after deployment, not just before release. The gap between “the task was completed” and “the customer was actually helped” becomes increasingly important as agents take on more real-world responsibilities.

What's next for GapAudit

Our next step is to apply GapAudit to real production agents. We plan to connect deployed agent systems to Arize Phoenix Cloud through OpenTelemetry and use GapAudit to continuously audit real customer interactions in production environments. We’re also interested in expanding beyond support workflows into sales, recruiting, and operational agents, where customer-facing failures often remain invisible despite successful task completion. Our long-term goal is to become the customer-experience surveillance layer for autonomous agents, helping teams identify and prevent silent failures before they become business problems.

Built With

Share this project:

Updates