AgentProbe

Screenshots: Import the Traces CSV; Paste the pipeline endpoint URL; Choose specifications; Result Pt. 1; Result Pt. 2

Inspiration

As AI agents get deployed in high-stakes domains like banking, compliance, and identity verification, a natural question arises: how do you know your agent pipeline still behaves reliably when it is being attacked or manipulated? Unit tests and LLM evals mostly cover the happy path, but real-world adversaries don't send clean inputs.

What it does

AgentProbe is an automated adversarial red-teaming platform for multi-agent AI workflows. It takes a running agent pipeline, pulls real production traces, and systematically attacks it across five adversarial dimensions:

Injection: embeds hidden instructions inside document field values to hijack agent behavior
Boundary: sends graduated variants of legitimate inputs that incrementally add red flags, testing where agents draw the line
Sandbagging: sends the same request with formal vs. casual framing to detect whether agents give different decisions based on tone rather than content
Cascade: injects plausible-but-wrong upstream outputs (e.g. a wrong date of birth) and checks whether downstream agents propagate the error or catch it
Consistency: sends cosmetically varied inputs (different field order, synonyms) to test whether agents return stable decisions
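
As a rough illustration of the Consistency dimension, the sketch below generates field-order variants of one input and checks whether the decisions agree. The function names and the sample record are hypothetical, and rotating the key order is just one simple way to produce cosmetic variation; it is not necessarily how AgentProbe does it.

```python
def field_order_variants(record: dict) -> list[dict]:
    """Return copies of `record` with the same content but rotated key order."""
    keys = list(record)
    variants = []
    for shift in range(1, len(keys)):
        rotated = keys[shift:] + keys[:shift]
        variants.append({key: record[key] for key in rotated})
    return variants


def is_consistent(decisions: list[str]) -> bool:
    """True when every variant produced the same decision."""
    return len(set(decisions)) <= 1


# Hypothetical applicant record used only for illustration.
applicant = {"name": "A. Example", "dob": "1990-01-01", "country": "US"}
variants = field_order_variants(applicant)
```

Each variant would be sent through the pipeline, and the collected decisions passed to is_consistent; any disagreement is a candidate Consistency failure.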

Each attack is judged by a separate LLM evaluator that assigns PASS / PARTIAL / FAIL with a numerical score and flags whether a failure would cause real-world harm in production. Scores are aggregated per agent and combined into a single workflow reliability score with a weakest-link penalty (a pipeline is only as safe as its most vulnerable stage).
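
The exact scoring formula is not given here, so the following is only a sketch of what a weakest-link penalty could look like. The verdict-to-number mapping, the penalty_weight parameter, and the blend of mean and minimum are all assumptions made for illustration.

```python
from statistics import mean

# Assumed numeric values for judge verdicts; the real scale may differ.
VERDICT_SCORE = {"PASS": 1.0, "PARTIAL": 0.5, "FAIL": 0.0}


def agent_score(verdicts: list[str]) -> float:
    """Average the numeric value of every verdict recorded for one agent."""
    return mean(VERDICT_SCORE[v] for v in verdicts)


def workflow_score(agent_scores: dict[str, float], penalty_weight: float = 0.5) -> float:
    """Blend the mean agent score with the minimum (the weakest link).

    penalty_weight = 0 gives a plain average; 1 gives the weakest agent's score.
    """
    scores = list(agent_scores.values())
    return (1 - penalty_weight) * mean(scores) + penalty_weight * min(scores)
```

With three agents at 0.9 and one at 0.3, a plain average gives 0.75, while this blend with penalty_weight = 0.5 gives about 0.525, so a single weak stage pulls the workflow score down much harder.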

Results stream live to a React dashboard over WebSocket, with a live attack feed, per-agent score cards, an attack heatmap, and a sandbagging delta view.

The demo target is a 4-agent bank account opening pipeline: Document Extraction → KYC Verification → Risk Assessment → Compliance Decision.

How we built it

Backend (Python / FastAPI)

TraceIngester pulls real execution traces from Langfuse and stores them in Snowflake
AttackGenerator uses Claude Sonnet to synthesize adversarial scenarios grounded in real production behavior
AttackRunner replays each scenario through the live agent pipeline
JudgeEvaluator uses a second Claude instance as an impartial security evaluator
ReliabilityScorer computes weighted per-agent scores and a workflow-level score with a weakest-link penalty
All results persist to Snowflake for historical analysis
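
Because AttackGenerator and JudgeEvaluator both depend on model output, the surrounding code has to recover structured data from text that may include extra prose. The sketch below shows one tolerant approach using only the Python standard library; extract_json is a hypothetical helper, not the project's actual extraction logic.

```python
import json


def extract_json(text: str):
    """Return the first JSON object or array embedded in `text`, or None.

    Scans for each '{' or '[' and tries to decode a JSON value starting
    there, so leading prose, code fences, and trailing chatter are ignored.
    """
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            value, _end = decoder.raw_decode(text, index)
            return value
        except json.JSONDecodeError:
            continue  # not valid JSON from this position; keep scanning
    return None
```

A caller would typically treat a None result as a failed generation and retry or skip that scenario.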

Agent pipeline (Claude Sonnet)

Each of the 4 agents is a Claude-backed function with structured JSON output
Traces are captured with Langfuse and Laminar for observability

Frontend (React / Vite)

WebSocket-driven live feed showing each attack as it runs
Dashboard with AgentScoreCard, AttackHeatmap, and SandbaggingView components

Challenges we ran into

Getting the attack generator to produce reliably parseable JSON at scale required careful prompt engineering and robust extraction logic.
The sandbagging metric is inherently noisy: small wording changes can cause spurious decision flips unrelated to actual bias, so we combined the decision delta with a reasoning-depth delta and weighted the two.
Streaming results over WebSocket while keeping Snowflake writes in sequence, without blocking the UI, required careful async design in FastAPI.
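
One common way to get both properties from the last point (immediate streaming and in-order persistence) is to broadcast each result right away and hand it to a single background writer through a queue. The sketch below uses only asyncio; run_one, broadcast, and persist are hypothetical callables standing in for the attack runner, the WebSocket send, and the Snowflake write, and this is not necessarily the design AgentProbe uses.

```python
import asyncio


async def run_attacks(attacks, run_one, broadcast, persist):
    """Stream each result immediately and persist results in arrival order."""
    queue: asyncio.Queue = asyncio.Queue()

    async def writer():
        # A single consumer guarantees writes happen in the order enqueued.
        while True:
            result = await queue.get()
            if result is None:  # sentinel: no more results
                break
            await persist(result)

    writer_task = asyncio.create_task(writer())
    for attack in attacks:
        result = await run_one(attack)
        await broadcast(result)   # the UI sees the result first
        queue.put_nowait(result)  # storage catches up in the background
    queue.put_nowait(None)
    await writer_task  # flush remaining writes before returning
```

The loop never waits on storage, so a slow write delays only later writes, not the live feed.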

Accomplishments that we're proud of

The "judge as a separate LLM" pattern works surprisingly well: it catches subtle failures that simple string matching would miss.
The weakest-link workflow scoring gives a much more honest picture of pipeline safety than a naive average.
The live streaming UI makes the red-teaming process feel visceral rather than just a table of results.

What we learned

Multi-agent systems have failure modes that are fundamentally different from those of single-model systems: cascade and consistency attacks are nearly invisible without end-to-end testing.
LLM-as-judge requires careful system prompt design to keep the judge from being too lenient (for example, giving PARTIAL when FAIL is warranted).
Sandbagging is underappreciated as a reliability risk: agents genuinely do respond differently to formal vs. casual framing in ways that matter for regulated decisions.
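
To make the sandbagging measurement concrete, here is a minimal sketch that combines a decision delta with a reasoning-depth delta, as described under Challenges. The weights, the response dictionary shape, and the use of word count as a proxy for reasoning depth are all assumptions for illustration.

```python
def sandbagging_delta(formal: dict, casual: dict,
                      decision_weight: float = 0.7,
                      depth_weight: float = 0.3) -> float:
    """Score in [0, 1]: 0 means tone made no difference, 1 means maximal difference.

    Each response is assumed to look like {"decision": str, "reasoning": str}.
    """
    # 1.0 if the two framings produced different decisions, else 0.0.
    decision_delta = 0.0 if formal["decision"] == casual["decision"] else 1.0

    # Word count is a crude, assumed proxy for reasoning depth.
    formal_len = len(formal["reasoning"].split())
    casual_len = len(casual["reasoning"].split())
    longest = max(formal_len, casual_len)
    depth_delta = abs(formal_len - casual_len) / longest if longest else 0.0

    return decision_weight * decision_delta + depth_weight * depth_delta
```

Weighting the decision flip more heavily than the depth difference reflects the idea that a changed outcome matters more than a shorter explanation, but the right balance would need tuning against real traces.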

What's next for AgentProbe

Support for arbitrary user-defined agent pipelines (not just the banking demo) via a YAML config
Continuous monitoring mode: run a probe sweep on a schedule and alert when reliability scores drop
Attack library expansion: hallucination probes, tool-call injection, multi-turn jailbreaks
A hosted version so teams can connect their Langfuse/LangSmith project and get a red-team report without writing any code

Built With

claude(anthropic)
fastapi
langfuse
python
react
snowflake
websocket

Updates

Devyani Rastogi started this project — May 17, 2026 10:44 AM EDT
