Inspiration

Companies are shipping chatbots everywhere, and almost all of them carry something sensitive in their prompt — an API key, an internal policy, a "verified staff only" rule. The dangerous flaw is never the obvious request; it's the plausible one — the reasonable-sounding exception that becomes the exact door an attacker walks through. I wanted a tool that thinks like that attacker, then does what no scanner does: closes the hole and proves the fix.

What it does

Point Gauntlet at a chatbot and give it a goal. It runs an autonomous, escalating attack campaign — five social-engineering tactics (impersonation, fake incidents, encoding tricks, fiction, prompt injection) across three escalation levels. An LLM judge scores how much actually leaked, a reflector decides whether to push harder or pivot, and the moment it breaks the bot it self-heals: rewrites the prompt to close the loophole, re-fires the exact winning attack, and confirms the bot now refuses. It works against an API bot or a real chat widget in a live browser.

How we built it

A five-agent loop — strategist → attacker → target → judge → reflector — orchestrated with LangGraph, every agent powered by Claude Opus 4.8. The key idea was separating control from creativity: a deterministic policy owns the decisions (escalate/pivot/stop, what's untried) so the loop always makes progress and terminates, while Claude owns the attack craft. Around it: a FastAPI backend streaming the session over SSE, a Redis + RedisVL vector store that remembers winning exploits as few-shot examples, Browserbase + Playwright for live-browser mode, Phoenix/OpenTelemetry tracing, and a Next.js dashboard.

Challenges we ran into

Python 3.14 broke OpenTelemetry — Phoenix's auto-instrumentation crashed with a cryptic generator error before our heal loop could run; i switched to manual spans and moved healing outside the tracing context. A stale production build cost me hours — the dashboard showed nothing because next start was serving a build compiled before the feature existed; no refresh could fix it. Streaming through a 30-second silence — the patcher's long LLM call dropped the SSE connection, so i added keepalives plus a guaranteed final fetch. Honest metrics — the scoreboard was a frozen snapshot; i rebuilt it to compute live from real run outcomes.

Accomplishments that we're proud of

A red-teaming loop that's both creative and reliable — it never stalls, never repeats a failed approach, and always terminates. A semantic judge that catches paraphrased and encoded leaks, not just exact strings. And self-healing that's real, not cosmetic — it re-runs the actual winning attack against the patched bot and proves it holds. Plus a live dashboard that makes the whole attack legible to a non-expert in real time.

What we learned

The most dangerous prompt is never the obviously dangerous one — it's the reasonable exception. Semantic judging beats string matching decisively. LLM agents need a deterministic skeleton to be trustworthy. And a "fix" only means something if you re-test the original attack against it.

What's next for Team Gauntlet

Expand the tactic library and support multi-secret, multi-turn objectives; let teams point Gauntlet at their own bots and prompts directly; add regression mode (re-run every past exploit on each deploy as a CI gate); generate prioritized hardening reports; and grow the exploit memory into a shared, continuously-learning threat library across runs.

Built With

  • anthropic
  • browserbase
  • claude
  • fastapi
  • langgraph
  • next.js
  • opentelemetry
  • phoenix
  • playwright
  • python
  • react
  • redis
  • redisvl
  • sentence-transformers
  • server-sent-events
  • tailwindcss
  • typescript
Share this project:

Updates