Phoenix Audit

Inspiration

The EU AI Act starts enforcement on August 2, 2026. Fines for non-compliance run up to €15M or 3% of global annual turnover, whichever is higher. Today, getting a production AI agent audited means hiring a Big-4 firm: around €80,000 and up to 18 months. By the time the report lands, the agent has been redeployed fifty times and the document describes software that no longer exists.

Meanwhile, the people actually on the hook — there are over 2,000 open AI Governance roles on LinkedIn right now — have no tooling. Observability platforms capture traces but don't produce attestations. AI insurance products underwrite certificates but don't run the audit.

So the question became: what if the audit itself were an agent? An agent doesn't get tired after probe number 40, it can read execution traces no human auditor would ever open, and it can re-run the entire audit every time you redeploy.

What it does

Point Phoenix Audit at any production AI agent — a customer-support bot, a healthcare prior-auth system, a coding assistant. It then does four things:

Connect. It inspects the agent's shape and picks the regulatory framework to test against (EU AI Act, NIST AI RMF, HIPAA, SOC 2).
Test. It runs an adversarial battery drawn from HarmBench, OWASP LLM Top 10, MITRE ATLAS and CARES — prompt injection, role confusion, data-exfiltration probes, tool misuse. Every probe runs as a real Arize Phoenix experiment against the live target.
Cluster. When tests fail, it reads the trace trees back and clusters failures by root cause. Three failures that share one upstream span become one finding, not three.
Report. It renders a cryptographically signed PDF and JSON evidence pack in EU AI Act Annex IV format, keyed to a commit SHA, and can open a hardening-recipe merge request against the agent's repo.
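
To make the Cluster step concrete, here is a minimal sketch, not the project's actual code. It assumes each failed probe has been reduced to a root-to-leaf path of spans, and it takes the first errored span on that path as the root cause. The `SpanStep` type, the span names and the probe IDs are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class SpanStep(NamedTuple):
    name: str       # span name, e.g. a tool or retriever call (hypothetical)
    is_error: bool  # whether this span recorded an error


def root_cause_key(path):
    """Return the name of the first errored span on a root-to-leaf path."""
    for step in path:
        if step.is_error:
            return step.name
    # No span flagged an error: keep the failure visible instead of hiding it.
    leaf = path[-1].name if path else "<empty-trace>"
    return "unattributed:" + leaf


def cluster_failures(failed_paths):
    """Group probe ids by root-cause key; one key is one finding."""
    findings = defaultdict(list)
    for probe_id, path in failed_paths.items():
        findings[root_cause_key(path)].append(probe_id)
    return {cause: sorted(ids) for cause, ids in findings.items()}


failed = {
    "probe-07": [SpanStep("agent.run", False),
                 SpanStep("retriever.fetch_policy", True),
                 SpanStep("llm.answer", True)],
    "probe-19": [SpanStep("agent.run", False),
                 SpanStep("retriever.fetch_policy", True),
                 SpanStep("tool.send_email", True)],
    "probe-33": [SpanStep("agent.run", False),
                 SpanStep("retriever.fetch_policy", True)],
}
findings = cluster_failures(failed)
```

In this example all three probes fail downstream of the same errored retriever span, so `findings` holds a single entry listing the three probe IDs.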

The whole run takes about 90 seconds. The signature comes from the customer's own Cloud KMS key — it's evidence their compliance officer signs, not a certificate we issue. That's deliberate: an auditor that also sells you the passing grade has a conflict of interest.
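
One way to picture "keyed to a commit SHA" is a digest computed over a canonical serialization of the evidence together with the commit. The sketch below is an assumption about how that could work, not a description of Phoenix Audit's report format: the `evidence_digest` name and the payload layout are hypothetical, and the actual signing call to the customer's Cloud KMS key is deliberately left out.

```python
import hashlib
import json


def evidence_digest(evidence, commit_sha):
    """SHA-256 over a canonical JSON form of the evidence plus the commit SHA."""
    payload = {"commit_sha": commit_sha, "evidence": evidence}
    # Sorted keys and fixed separators make the bytes independent of dict order.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()
```

Because the commit SHA is inside the hashed payload, the same findings at a different commit produce a different digest, so a signature over one cannot be reused for the other.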

How we built it

The orchestrator is a Google ADK SequentialAgent with four sub-agents (Inspector, Tester, Judge, Reporter) and Gemini 3.5 Flash as the judge LLM. Arize Phoenix is the evidence layer: every probe is a Phoenix experiment, and the clustering step reads span trees back through Phoenix MCP. The frontend is Next.js 16 with a live audit chamber that streams probe results over SSE as they happen. Everything runs on three Cloud Run services with blue/green deploys: build one image, smoke-test the candidate, then shift traffic.
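
For readers unfamiliar with SSE, each probe result travels as a small text frame. The sketch below shows the wire format only. The `probe_result` event name and the payload fields are hypothetical, and the HTTP framework that would serve the stream is omitted.

```python
import json


def sse_event(event, data, event_id=None):
    """Format one Server-Sent Events frame; a blank line terminates the event."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # Compact JSON stays on one line, so a single data field is enough.
    lines.append("data: " + json.dumps(data, separators=(",", ":")))
    return "\n".join(lines) + "\n\n"
```

A server would write these frames to a response with content type `text/event-stream`, one per probe as it completes.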

The build was strict TDD: every story's acceptance criteria became failing tests before any source code existed. The suite is at 964 tests. One rule mattered more than any other: tests assert on the Phoenix span tree structure, not on the natural-language output. An LLM's wording changes; the shape of what it actually did doesn't.
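
A minimal sketch of what asserting on shape rather than wording can look like. It assumes a span tree has already been loaded as nested dicts with `name`, `kind` and `children` keys. That layout and the span names are assumptions for illustration, not Phoenix's export format.

```python
def tree_shape(span):
    """Reduce a span tree to (name, kind, child shapes), dropping text, ids and timings."""
    children = tuple(tree_shape(child) for child in span.get("children", []))
    return (span["name"], span.get("kind", "UNKNOWN"), children)


run_a = {
    "name": "agent.run", "kind": "AGENT", "output": "Sorry, I can't share that.",
    "children": [
        {"name": "policy.check", "kind": "TOOL"},
        {"name": "llm.answer", "kind": "LLM", "output": "Sorry, I can't share that."},
    ],
}
run_b = {
    "name": "agent.run", "kind": "AGENT", "output": "I'm unable to provide that information.",
    "children": [
        {"name": "policy.check", "kind": "TOOL"},
        {"name": "llm.answer", "kind": "LLM", "output": "I'm unable to provide that information."},
    ],
}
run_c = {
    "name": "agent.run", "kind": "AGENT", "output": "Sorry, I can't share that.",
    "children": [
        {"name": "llm.answer", "kind": "LLM", "output": "Sorry, I can't share that."},
    ],
}
```

Here `run_a` and `run_b` differ only in wording and reduce to the same shape, while `run_c` says the same words as `run_a` but skipped the policy check, which a shape comparison catches and a string comparison would not.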

Challenges we ran into

The silent-pass problem. Early on we found that if a probe's input went missing, an empty string got forwarded to the judge LLM — which happily returned "passed," because there was nothing to object to. For an audit tool, a silent pass is worse than a crash. This turned into a project-wide rule: no empty-string fallbacks on audit inputs, and any code path that uses a fallback must stamp the output metadata so a regulator can tell real findings from filler.
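
The rule can be sketched as a guard in front of the judge. This is an illustration under assumptions, not the project's code: `JudgeRequest`, `MissingProbeInput` and the metadata keys are hypothetical names.

```python
from dataclasses import dataclass, field


class MissingProbeInput(ValueError):
    """Raised instead of forwarding an empty input to the judge."""


@dataclass
class JudgeRequest:
    probe_id: str
    text: str
    metadata: dict = field(default_factory=dict)


def build_judge_request(probe_id, raw_input, fallback=None):
    """Refuse empty inputs; stamp the metadata whenever a fallback is used."""
    if raw_input is not None and raw_input.strip():
        return JudgeRequest(probe_id, raw_input, {"input_source": "probe"})
    if fallback is not None and fallback.strip():
        # The stamp lets a reviewer separate real findings from filler.
        return JudgeRequest(
            probe_id, fallback, {"input_source": "fallback", "fallback_used": True}
        )
    raise MissingProbeInput(f"probe {probe_id!r} has no input; refusing to judge")
```

A crash here is the intended behavior: the probe shows up as an error in the run rather than as a pass nobody earned.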

The deploy that lied. A missing dependency in the Docker image made every staging deploy fail quietly for four days — health checks stayed green because the old revision kept serving traffic. The live demo was running four-day-old code and nobody noticed. The fix was an import smoke test inside the Docker build itself: if the app can't import, the image doesn't exist.
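
A minimal sketch of such an import smoke test, assuming the application's entry module can be imported by name. The function names here are hypothetical.

```python
import importlib
import sys


def smoke_import(module_names):
    """Try to import each module; return (name, error) pairs for the ones that fail."""
    failures = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # any import-time error should fail the build
            failures.append((name, f"{type(exc).__name__}: {exc}"))
    return failures


def main(module_names):
    """Print failures to stderr and return a process exit code."""
    failures = smoke_import(module_names)
    for name, error in failures:
        print(f"IMPORT FAILED {name}: {error}", file=sys.stderr)
    return 1 if failures else 0
```

Invoked from a Dockerfile `RUN` step with the result of `main` passed to `sys.exit`, a non-zero exit code aborts the build, so an image with a missing dependency is never produced. The simplest form of the same idea is a `RUN python -c "import <your module>"` line.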

Phoenix MCP only covers part of the API. Running experiments and logging span annotations aren't exposed as MCP tools, so we wrapped Phoenix's async client as custom ADK FunctionTools for those paths. Related lesson: the client library docs advertised a method the published version didn't have. Reading the installed source became standard practice.
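
The docs-versus-installed-version gap can be checked mechanically before code depends on a method. The sketch below is a generic helper, not something from the project. `check_api` is a hypothetical name, and the usage example runs against the standard library's `json` module rather than the Phoenix client.

```python
import importlib
from importlib import metadata


def check_api(module_name, attr_path, dist_name=None):
    """Return (exists, installed_version) for a dotted attribute path in a module."""
    obj = importlib.import_module(module_name)
    exists = True
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            exists = False
            break
        obj = getattr(obj, part)
    try:
        version = metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        version = None  # e.g. standard-library modules have no distribution
    return exists, version


exists, _ = check_api("json", "JSONDecoder.raw_decode")
missing, _ = check_api("json", "JSONDecoder.no_such_method")
```

Run in CI against the pinned dependency set, a check like this turns "the docs said it existed" into a failing test instead of a runtime surprise.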

Accomplishments that we're proud of

Nothing in the hot path is mocked. Real Phoenix experiments, real Gemini judge, real adversarial traffic against a real deployed target. The demo moment we built everything around: 47 tests run live, 3 fail, and the 3 failures collapse into one root cause on screen.

We're also proud of what the product refuses to do. A fresh account shows zero audits, not fake ones — sample data sits behind an explicit toggle. Report links only appear once a signed report actually exists. A compliance product that fakes its own UI has already failed its first audit.

What we learned

Building an auditor audits you back. Every silent-failure pattern we shipped a probe for, we then found a version of in our own code. That list of patterns became a standing checklist for every code review on the project.

The bigger lesson: observability and attestation are different products. Traces tell engineers what happened; an audit tells a regulator what happened, signed, at a specific commit, in a format the law recognizes. The gap between those two is the entire product.

What's next for Phoenix Audit

Continuous audits are already wired in — schedules re-run the battery and flag regressions, turning compliance from an annual event into a heartbeat. Next: open-sourcing the engine so it's self-hostable end to end, deepening the framework adapters (LangChain, CrewAI, OpenAI Agents SDK are instrumented today; black-box HTTP agents work via behavioral fingerprinting), and a BYO-key mode for regulated industries where trace evidence never leaves their tenancy.
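
Flagging regressions between scheduled runs reduces to comparing two result sets. A minimal sketch, assuming each run is summarized as a mapping from probe ID to a pass/fail boolean. The function name and that representation are assumptions.

```python
def compare_runs(previous, current):
    """Compare two {probe_id: passed} mappings from consecutive audit runs."""
    # Regressed: passed before, explicitly fails now.
    regressed = sorted(
        probe for probe, ok in previous.items() if ok and current.get(probe) is False
    )
    # Missing: ran before, absent now. Reported separately, never counted as a pass.
    missing = sorted(probe for probe in previous if probe not in current)
    return {"regressed": regressed, "missing": missing}
```

Keeping missing probes in their own bucket follows the same principle as the silent-pass rule: a probe that did not run is not evidence of anything.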

And eventually, the obvious closing of the loop: pointing Phoenix Audit at itself.