Inspiration
As AI agents get deployed in high-stakes domains like banking, compliance, and identity verification, a natural question arises: how do you know your agent pipeline actually behaves reliably when it's being attacked or manipulated? Unit tests and LLM evals test the happy path, but real-world adversaries don't send clean inputs.
What it does
AgentProbe is an automated adversarial red-teaming platform for multi-agent AI workflows. It takes a running agent pipeline, pulls real production traces, and systematically attacks it across five adversarial dimensions:
- Injection: embeds hidden instructions inside document field values to hijack agent behavior
- Boundary: sends graduated variants of legitimate inputs that incrementally add red flags, testing where agents draw the line
- Sandbagging: sends the same request with formal vs. casual framing to detect whether agents give different decisions based on tone rather than content
- Cascade: injects plausible-but-wrong upstream outputs (e.g. a wrong date of birth) and checks whether downstream agents propagate the error or catch it
- Consistency: sends cosmetically varied inputs (different field order, synonyms) to test whether agents return stable decisions.
Each attack is judged by a separate LLM evaluator that assigns PASS / PARTIAL / FAIL with a numerical score and flags whether a failure would cause real-world harm in production. Scores are aggregated per agent and combined into a single workflow reliability score with a weakest-link penalty (a pipeline is only as safe as its most vulnerable stage).
Results stream live to a React dashboard over WebSocket, with a live attack feed, per-agent score cards, an attack heatmap, and a sandbagging delta view.
The demo target is a 4-agent bank account opening pipeline: Document Extraction → KYC Verification → Risk Assessment → Compliance Decision.
How we built it
Backend (Python / FastAPI)
- TraceIngester pulls real execution traces from Langfuse and stores them in Snowflake
- AttackGenerator uses Claude Sonnet to synthesize adversarial scenarios grounded in real production behavior
- AttackRunner replays each scenario through the live agent pipeline
- JudgeEvaluator uses a second Claude instance as an impartial security evaluator
- ReliabilityScorer computes weighted per-agent scores and a workflow-level score with a weakest-link penalty
- All results persist to Snowflake for historical analysis
Agent pipeline (Claude Sonnet)
- Each of the 4 agents is a Claude-backed function with structured JSON output
- Traces are captured with Langfuse and Laminar for observability
Frontend (React / Vite)
- WebSocket-driven live feed showing each attack as it runs
- Dashboard with AgentScoreCard, AttackHeatmap, and SandbaggingView components
Challenges we ran into
- Getting the attack generator to produce reliably parseable JSON at scale required careful prompt engineering and robust extraction logic.
- The sandbagging metric is inherently noisy, small wording changes can cause spurious decision flips unrelated to actual bias, so we combined decision delta with reasoning-depth delta and weighted them
- Streaming WebSocket results while maintaining Snowflake writes in sequence without blocking the UI required careful async design in FastAPI
Accomplishments that we're proud of
- The "judge as a separate LLM" pattern works surprisingly well, it catches subtle failures that a simple string-match would miss.
- The weakest-link workflow scoring gives a much more honest picture of pipeline safety than a naive average.
- The live streaming UI makes the red-teaming process feel visceral rather than just a table of results.
What we learned
- Multi-agent systems have failure modes that are fundamentally different from single-model systems, cascade and consistency attacks are nearly invisible without end-to-end testing
- LLM-as-judge requires careful system prompt design to avoid it being too lenient (giving PARTIAL when FAIL is warranted)
- Sandbagging is underappreciated as a reliability risk, agents genuinely do respond differently to formal vs. casual framing in ways that matter for regulated decisions.
What's next for AgentProbe
- Support for arbitrary user-defined agent pipelines (not just the banking demo) via a YAML config
- Continuous monitoring mode: run a probe sweep on a schedule and alert when reliability scores drop
- Attack library expansion: hallucination probes, tool-call injection, multi-turn jailbreaks
- A hosted version so teams can connect their Langfuse/LangSmith project and get a red-team report without writing any code
Log in or sign up for Devpost to join the conversation.