Inspiration

As AI agents get deployed in high-stakes domains like banking, compliance, and identity verification, a natural question arises: how do you know your agent pipeline actually behaves reliably when it's being attacked or manipulated? Unit tests and LLM evals test the happy path, but real-world adversaries don't send clean inputs.

What it does

AgentProbe is an automated adversarial red-teaming platform for multi-agent AI workflows. It takes a running agent pipeline, pulls real production traces, and systematically attacks it across five adversarial dimensions:

  1. Injection: embeds hidden instructions inside document field values to hijack agent behavior
  2. Boundary: sends graduated variants of legitimate inputs that incrementally add red flags, testing where agents draw the line
  3. Sandbagging: sends the same request with formal vs. casual framing to detect whether agents give different decisions based on tone rather than content
  4. Cascade: injects plausible-but-wrong upstream outputs (e.g. a wrong date of birth) and checks whether downstream agents propagate the error or catch it
  5. Consistency: sends cosmetically varied inputs (different field order, synonyms) to test whether agents return stable decisions
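
As an illustration of the fifth dimension, here is a minimal sketch of a consistency probe that only varies field order. The function names, the `agent` callable, and the `"decision"` key are assumptions made for this example, not AgentProbe's actual interface; synonym substitution is left out.

```python
import random


def cosmetic_variants(payload: dict, n: int = 3, seed: int = 0) -> list[dict]:
    """Return n copies of payload with shuffled key order; values are unchanged."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        keys = list(payload)
        rng.shuffle(keys)
        variants.append({k: payload[k] for k in keys})
    return variants


def is_consistent(agent, payload: dict, n: int = 3) -> bool:
    """True if the (hypothetical) agent returns one decision across all variants."""
    inputs = [payload, *cosmetic_variants(payload, n)]
    decisions = {agent(item)["decision"] for item in inputs}
    return len(decisions) == 1
```

An agent whose decision depends only on field values passes this check; any agent that flips between variants is flagged.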

Each attack is judged by a separate LLM evaluator that assigns PASS / PARTIAL / FAIL with a numerical score and flags whether a failure would cause real-world harm in production. Scores are aggregated per agent and combined into a single workflow reliability score with a weakest-link penalty (a pipeline is only as safe as its most vulnerable stage).
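
The exact scoring formula is not given here, so the following is only a sketch of one way a weakest-link penalty could work: blend the mean agent score with the minimum. The verdict-to-score mapping and the `penalty` weight are assumptions for illustration.

```python
from statistics import mean

# Assumed mapping from judge verdicts to numeric scores.
VERDICT_SCORE = {"PASS": 1.0, "PARTIAL": 0.5, "FAIL": 0.0}


def agent_score(verdicts: list[str]) -> float:
    """Average score of one agent over all attacks run against it."""
    return mean(VERDICT_SCORE[v] for v in verdicts)


def workflow_score(agent_scores: dict[str, float], penalty: float = 0.5) -> float:
    """Blend the mean with the weakest agent; penalty=0 gives a plain average."""
    scores = list(agent_scores.values())
    return (1 - penalty) * mean(scores) + penalty * min(scores)
```

With three agents at 1.0 and one at 0.0, a plain average would report 0.75, while this blend with `penalty=0.5` reports 0.375, which reflects the single vulnerable stage more strongly.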

Results stream live to a React dashboard over WebSocket, with a live attack feed, per-agent score cards, an attack heatmap, and a sandbagging delta view.

The demo target is a 4-agent bank account opening pipeline: Document Extraction → KYC Verification → Risk Assessment → Compliance Decision.

How we built it

Backend (Python / FastAPI)

  • TraceIngester pulls real execution traces from Langfuse and stores them in Snowflake
  • AttackGenerator uses Claude Sonnet to synthesize adversarial scenarios grounded in real production behavior
  • AttackRunner replays each scenario through the live agent pipeline
  • JudgeEvaluator uses a second Claude instance as an impartial security evaluator
  • ReliabilityScorer computes weighted per-agent scores and a workflow-level score with a weakest-link penalty
  • All results persist to Snowflake for historical analysis
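
Components such as AttackGenerator and JudgeEvaluator depend on parsing model output as JSON. The sketch below shows one tolerant extraction strategy using only the standard library; `extract_json` is a hypothetical helper, not the project's actual code. It tries the whole reply, then any fenced block, then the first decodable object or array found in the text.

```python
import json
import re

TICKS = "`" * 3  # a Markdown code fence, built this way to keep this example readable
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)


def extract_json(text: str):
    """Best-effort extraction of the first JSON value from an LLM reply."""
    candidates = [text.strip()] + [m.strip() for m in FENCE.findall(text)]
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            pass
    # Fall back to scanning for the first position that decodes cleanly.
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "{[":
            try:
                obj, _end = decoder.raw_decode(text, i)
                return obj
            except json.JSONDecodeError:
                continue
    raise ValueError("no JSON value found in model output")
```

Raising on failure rather than returning a default lets the caller decide whether to retry the generation or drop the scenario.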

Agent pipeline (Claude Sonnet)

  • Each of the 4 agents is a Claude-backed function with structured JSON output
  • Traces are captured with Langfuse and Laminar for observability

Frontend (React / Vite)

  • WebSocket-driven live feed showing each attack as it runs
  • Dashboard with AgentScoreCard, AttackHeatmap, and SandbaggingView components

Challenges we ran into

  • Getting the attack generator to produce reliably parseable JSON at scale required careful prompt engineering and robust extraction logic.
  • The sandbagging metric is inherently noisy: small wording changes can cause spurious decision flips unrelated to actual bias. To dampen this, we combined the decision delta with a reasoning-depth delta and weighted the two.
  • Streaming WebSocket results while maintaining Snowflake writes in sequence without blocking the UI required careful async design in FastAPI.
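
One common pattern for the last challenge is to broadcast each result immediately and hand it to a single writer task through a queue, so writes stay in order without delaying the feed. The sketch below uses plain `asyncio`; `broadcast` and `sink` are stand-ins for the WebSocket send and the Snowflake write, and none of this is the project's actual code.

```python
import asyncio


async def ordered_writer(queue: asyncio.Queue, sink: list) -> None:
    """Consume results one at a time so persisted order matches arrival order."""
    while True:
        item = await queue.get()
        if item is None:  # sentinel: no more results
            break
        await asyncio.sleep(0)  # stand-in for an awaited database write
        sink.append(item)


async def run_attacks(results: list, broadcast, sink: list) -> None:
    """Stream each result to clients at once; persist in the background."""
    queue: asyncio.Queue = asyncio.Queue()
    writer = asyncio.create_task(ordered_writer(queue, sink))
    for result in results:
        await broadcast(result)   # the UI sees the result immediately
        queue.put_nowait(result)  # the write is queued, not awaited here
    queue.put_nowait(None)
    await writer  # ensure everything is flushed before returning
```

Because only one task ever writes, ordering is preserved without locks, and a slow write never holds up the next broadcast.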

Accomplishments that we're proud of

  • The "judge as a separate LLM" pattern works surprisingly well: it catches subtle failures that a simple string match would miss.
  • The weakest-link workflow scoring gives a much more honest picture of pipeline safety than a naive average.
  • The live streaming UI makes the red-teaming process feel visceral rather than just a table of results.

What we learned

  • Multi-agent systems have failure modes that are fundamentally different from single-model systems: cascade and consistency attacks are nearly invisible without end-to-end testing.
  • LLM-as-judge requires careful system prompt design to keep the judge from being too lenient (giving PARTIAL when FAIL is warranted).
  • Sandbagging is underappreciated as a reliability risk: agents genuinely do respond differently to formal vs. casual framing in ways that matter for regulated decisions.
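
To make the sandbagging measurement concrete, here is a hedged sketch of a combined delta of the kind described under Challenges: a decision delta plus a reasoning-depth delta, weighted. The weights, the response shape (`"decision"` and `"reasoning"` keys), and the use of word count as a proxy for reasoning depth are all assumptions for illustration.

```python
def sandbagging_delta(
    formal: dict,
    casual: dict,
    w_decision: float = 0.7,
    w_depth: float = 0.3,
) -> float:
    """Score in [0, 1]; 0 means identical treatment of both framings."""
    # 1.0 if the two framings produced different decisions, else 0.0.
    decision_delta = 0.0 if formal["decision"] == casual["decision"] else 1.0

    # Relative difference in reasoning length, using word count as a crude proxy.
    formal_words = len(formal["reasoning"].split())
    casual_words = len(casual["reasoning"].split())
    longest = max(formal_words, casual_words, 1)
    depth_delta = abs(formal_words - casual_words) / longest

    return w_decision * decision_delta + w_depth * depth_delta
```

Keeping the depth term at a lower weight means a change in verbosity alone cannot outweigh an actual decision flip.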

What's next for AgentProbe

  • Support for arbitrary user-defined agent pipelines (not just the banking demo) via a YAML config
  • Continuous monitoring mode: run a probe sweep on a schedule and alert when reliability scores drop
  • Attack library expansion: hallucination probes, tool-call injection, multi-turn jailbreaks
  • A hosted version so teams can connect their Langfuse/LangSmith project and get a red-team report without writing any code

Built With

  • claude(anthropic)
  • fastapi
  • langfuse
  • python
  • react
  • snowflake
  • websocket