SecondGuess

Base UI
Attacker Walkthrough
Freeform Tester
Benchmark Results on More derived prompts

What it does

SecondGuess is a cascade allocator for jailbreak detection. Instead of running every safety classifier on every prompt, it runs a cheap detector first (Llama Prompt Guard 2, 86M params). If that detector is confident, the system stops there. If it's uncertain, it escalates to two stronger detectors (ShieldGemma 2B and WildGuard 7B). If those two disagree, it escalates once more to Claude as a tiebreaker.

On the official JailbreakBench benchmark (200 prompts), this cascade catches more real jailbreaks than running all four detectors on every input (95% recall vs 93%), while using 1.6x less compute on average.

It does this at a real cost: precision drops from 0.83 to 0.75, meaning it flags more benign prompts than the always-on baseline. We're reporting that tradeoff honestly rather than hiding it.

A live demo walks through six rounds of an increasingly disguised jailbreak attempt, showing a single cheap detector missing the disguised versions while the cascade catches them by escalating. We're upfront that this 6-round sequence is self-authored to illustrate the mechanism, not an independent red-team benchmark.

Inspiration

I placed 1st in the Amazon Trusted AI Challenge for adversarial jailbreak detection, beating Claude 3.7 Sonnet's own detection on the benchmark task. That work showed something specific: a safety classifier that always runs the same way is predictable, and predictable defenses get broken by attackers who can iterate. Recent research backs this up directly, "The Attacker Moves Second" (arXiv 2510.09023) showed most published LLM defenses get broken once an attacker adapts to them, despite reporting near-zero attack success in their own papers.

How we built it

Four detector arms: Llama Prompt Guard 2 (86M, cheap floor), ShieldGemma 2B (precision specialist), WildGuard 7B (adversarial generalist), and Claude Sonnet 4.6 (tiebreaker judge, via the Anthropic API). The cascade runs the floor arm first, escalates only on uncertainty, and only calls Claude when the mid-tier arms disagree. We benchmarked against JailbreakBench's official 200-prompt set on a Prime Intellect H100 GPU pod.

Challenges we ran into

We found a real bug: Llama Prompt Guard 2 scored indirectly-framed harmful prompts (fictional, hypothetical, or "educational" wrappers) at nearly the same low score as genuinely benign prompts. No threshold could separate them. We fixed this by removing the floor arm's ability to mark something confidently benign on its own. This fix changed our numbers, an earlier toy-benchmark result showed larger compute savings before we caught this bug and before we ran the real JailbreakBench set. The smaller, real number (1.6x) is the one we stand behind.

What's next

Real online/bandit-based allocation instead of fixed thresholds. A genuine adaptive attacker built with iterative refinement, not hand-written prompts. Measuring whether escalation increases inconsistent refusals on benign edge cases.

Built With

anthropic-api
claude
fastapi
huggingface
llama-guard
python
pytorch
shieldgemma
transformers
wildguard

Updates

Rohan Chavan0701 started this project — Jun 20, 2026 08:10 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.