Adversary

Inspiration

Every company is now shipping AI agents that take real actions — issuing refunds, moving money, calling internal tools. But the open-source tools that are supposed to test them have a blind spot: they fire static prompts at a single model and grade the text it returns. They don't test what the agent actually does at the tool layer, and they never get smarter.

That gap is where real breaches live. An agent can say all the right things and still call issue_refund on an order it was never authorized to touch. We wanted a red-teamer that attacks the action layer, proves breaches with ground truth instead of an LLM's opinion, and — most importantly — learns from its own behavior over time.

What it does

Adversary is an autonomous red-team agent that attacks other AI agents at the tool/action layer and improves itself from its own Arize Phoenix traces.

It runs a campaign against a target support agent across multiple attack classes (indirect prompt injection, direct jailbreak, tool abuse, system-prompt leak). For each class it runs a closed loop:

plan → craft → fire → judge → re-plan → breach → report

Four Gemini sub-agents, built on Google's Agent Development Kit (ADK), drive it:

Strategist (Gemini 2.5 Pro) — the only agent with tools. Before each move it queries Phoenix over MCP to introspect its own past traces, finds techniques that worked on prior targets, and escalates accordingly.
Attacker — crafts one concrete adversarial payload (e.g. a malicious instruction hidden inside a customer email).
Target — a realistic, flawed support agent with a refund tool. We ship a vulnerable build and a patched build.
Analyst — judges each attempt, but its opinion never overrides ground truth.

The "it learned" moment: the Strategist tries a plain command, gets blocked, then queries Phoenix and finds a prior campaign against a different company's agent where reframing the request as company authority broke through. No record exists for this target — so it transfers the pattern, escalates to authority framing, and the target issues the refund. Breach.

Why you can trust the breach: a breach is only recorded when the refund ledger physically grows. The orchestrator snapshots len(REFUND_LEDGER) before every attempt and compares after; a real, unauthorized issue_refund tool call is the only thing that flips the verdict to BREACH — regardless of what the judge's prose says. The target's words can lie; its tool calls can't.

It closes the loop: re-run against the patched build and the same escalation that broke the agent is refused — zero refunds across every class. The regression diff (breach → blocked) is exactly what a CI gate would block a release on.

How we built it

Agents & orchestration: Google ADK with four LlmAgents coordinated by an explicit, inspectable Python campaign loop (no hidden auto-orchestration — the demo has to show the loop). Each target attempt runs against a fresh ADK Runner and session for isolation.
Models: Gemini 2.5 Pro for the Strategist's reasoning; Gemini for the Attacker, Analyst, and Reporter.
Self-improvement loop: the Arize Phoenix MCP server (@arizeai/phoenix-mcp), launched by ADK as a subprocess, gives the Strategist read access to its own traces, evaluations, and experiments at runtime.
Observability: every model call, MCP query, and tool call is traced into Arize Phoenix via OpenTelemetry / OpenInference. We wrap each target invocation in our own OTel span and write the verdict annotation back onto the exact span — so a reviewer can open one trace and replay the whole decision: plan → memory it read → payload → unauthorized refund.
Ground-truth eval: a module-level refund ledger acts as the source of truth; the breach detector is a before/after snapshot, not an LLM judge.
Backend: FastAPI streaming the campaign as Server-Sent Events (SSE), plus /report and /report/regression endpoints.
Frontend: a Next.js + React + TypeScript single-screen SOC console that renders the live stream — Attack Surface, Reasoning, and Scorecard panels — and a guided, AI-narrated tour.
Deployment: containerized and deployed to Google Cloud Run, frontend and API behind a single URL. A deterministic replay mode (DEMO_MODE) guarantees a flawless run for judging even on a quota-limited project; ?replay=false runs a fresh live campaign.

Challenges we ran into

Making "learning" real, not a lookup. The seed history is a different target (a different company, different order ids). The agent transfers a pattern to a fresh target and states on screen that no record exists for this target — so it's genuine generalization, not a cache hit.
Two eval-correctness bugs, fixed with regression tests. The worst-verdict rollup and the ground-truth ledger snapshot timing (OQ-7) both had to be exactly right — snapshot before the target runs, or the breach detector silently never fires.
Landing annotations on the correct trace. We had to capture the real OTel span id of each target invocation so the Phoenix verdict annotation attaches to the right span.
Surviving a throttled Vertex project. New projects have tiny per-minute Gemini quotas, so we built a patient, separate 429/RESOURCE_EXHAUSTED backoff that lets a throttled project still finish a full campaign instead of crashing mid-run.
Cloud Run routing. /healthz is intercepted by Cloud Run's edge, so we expose /health; and we serve the static Next.js export without a greedy mount that would hijack API/SSE routes.

What we learned

Ground truth beats LLM-as-judge for anything that takes a real action. The cheapest, most convincing signal in the whole system is one integer: the length of the refund ledger.
Observability isn't a dashboard you add at the end — it is the product. Because every decision is traced into Phoenix, the agent has a memory to learn from and a reviewer has receipts to trust.
An agent that red-teams other agents is only as credible as its target is realistic — so the target had to refuse bare demands and fall only to genuine social-engineering escalation.