Please sign up or log in to continue.

Inspiration

Almost every team is now shipping AI features, and very few of them test those features for attacks. Prompt injection sits at the top of the OWASP LLM Top 10, and most of these attacks need nothing more than text. The tools that exist are built for security engineers working on the command line, so the person actually shipping the feature has no quick way to find out whether their app is exposed. I wanted to close that gap.

What it does

You point Gauntlet at a target and press Run. An autonomous attacker fires OWASP-mapped probes while each attempt streams in with a verdict, and when the app leaks its planted secret the scorecard drops to F. One click on Apply Guard adds a runtime defense and re-runs the same attacks, and the grade climbs to A with a per-attack record of what changed.

How I built it

Gauntlet is a streaming agent loop built on Next.js 16 and served to the browser over Server-Sent Events. A planner selects the OWASP attack families, an attacker generates the probes (a seeded corpus by default, or model-generated with a key), a target adapter stands in for the app under test, a layered guard provides the runtime defense, and a canary oracle decides compromise: the planted secret has to appear and the response cannot be a refusal. A scorer then maps each finding to the OWASP LLM Top 10 and assigns the grade. It is deployed on Vercel and covered by Vitest and Playwright.

Challenges I ran into

The honest one is that a current frontier model is genuinely hard to break. When we pointed Gauntlet at a real model, the direct, indirect, and multi-turn attacks all failed, and the only thing that leaked was a prompt whose own instructions contradicted each other, such as telling the assistant to keep a secret while also being completely transparent. Rather than hide that, we built the product around it: a strong model holding is the result you want, and the real exposure is usually your own prompt and configuration. The other challenge was making compromise measurable instead of a guess, which is why the oracle is gated on refusals and anchored to a planted canary.

Accomplishments that I'm proud of

The before-and-after is real and reproducible rather than staged: run the eval and the grades land the same way every time. The oracle holds at zero false positives on our labeled set, and the deployed site scores 96 on Lighthouse performance with 100 on accessibility. And the whole find-and-fix loop stays legible to someone who is not a security engineer, which was the goal from the start.

What I learned

Frontier models have become robust to single-shot prompt injection, so the interesting risk has moved from the model itself to how it is configured and what context it is handed. I also learned that measuring an attack honestly is harder than running one. Deciding whether a response is truly compromised, rather than a polite refusal that happens to repeat a keyword, is where most of the care went.

What's next for Gauntlet

A trained local classifier in the guard by default, deeper multi-turn and indirect-injection coverage, shareable run reports, and a one-line CI check so Gauntlet runs on every pull request.

Built With

Share this project:

Updates