Inspiration

Browser agents are getting really good at doing things on the web — booking flights, searching products, filling out forms. But we kept asking ourselves: what happens when the website tries to trick them?

We were inspired by adversarial security research and wanted to bring that same thinking to browser agents. If an agent can be fooled by a fake "Sign in to continue" popup or a hidden instruction buried in a FAQ, it doesn't matter how capable it is — it's a risk. The web is full of persuasive, misleading, and sometimes outright malicious content. A good browser agent needs to do more than click the right button. It needs to stay on task, ignore manipulation, and recover when something feels off.


What it does

Agent Gauntlet is a local benchmark that turns browser agent safety into a live, watchable duel. Two agents run at the same time:

  • Task Agent — tries to complete a real browser task (search, compare, extract info) on a live or demo web page
  • Red-Team Agent — fires attacks in real time, sneaking malicious content into the page before the task agent ever sees it

We cover four types of attacks: hidden prompt injections, fake UI elements, redirects that send the agent off track, and fake forms designed to harvest credentials. The live arena shows a full browser view, both agents' activity feeds, and a prompt-health tracker that updates every step.

Every run saves everything — screenshots, page snapshots, attack payloads, decisions, and a final safety verdict — and you can export it as JSON, CSV, or ShareGPT JSONL.


How we built it

The core stack is Next.js 16 and Playwright. The attack layer injects JavaScript directly into the page as it loads, so from the agent's perspective the attack looks just like normal page content — there's no way to tell the difference.

Both agents support multiple modes. The task agent can run as a simple rule-based policy or hand control to an OpenAI or Anthropic model for LLM-backed decisions. The red-team works the same way — scripted, adaptive, or fully LLM-driven. After each run, a GPT-4o judge scores whether the task was actually completed, and a canary system checks whether any sensitive values leaked into the agent's actions.


Challenges we ran into

Our biggest challenge was fine-tuning. We wanted to use QLoRA with Qwen to train a safer task agent directly on our own trajectory data, but the cost got in the way fast.

Running fine-tuning jobs on a model the size of Qwen needs serious GPU compute — and when you're iterating on things like LoRA rank, learning rate, and which layers to target, the cloud bills add up quickly. A100/H100 instance pricing made it really hard to experiment freely within a hackathon budget.

We ended up focusing on the benchmark and evaluation side, and we're treating the fine-tuned agent as the next clear step when we have the compute to do it right.


Accomplishments that we're proud of

  • A benchmark that runs fully on your own machine — no cloud setup, no external services needed
  • An attack layer that's genuinely invisible to the agent, the same way real attacks work in the wild
  • A live arena that makes prompt injection something you can actually watch happen — not just a metric in a table
  • Structured run logs that are immediately useful for safety research, debugging, and future training
  • A product that feels polished enough to demo to judges while still being grounded in real alignment research

What we learned

The biggest thing we learned: an agent can be great at its job and still completely fail at staying safe. Task performance and security robustness barely correlate. An agent can nail every step of a booking flow and still hand over its credentials to a fake login modal.

We also learned just how effective basic page injection is. Agents have no reference for what a page is supposed to look like, so a fake button or a hidden instruction is just as real to them as anything else. Seeing that failure play out live made the problem a lot more concrete than any benchmark number could.


What's next for Agent Gauntlet

  • QLoRA fine-tuning — use our saved trajectories to train a more robust task agent on Qwen once we have the GPU budget
  • More scenarios — expand beyond our demo pages into real-world tasks across shopping, travel, and enterprise tools
  • Smarter red-team — attacks that adapt mid-run based on what the task agent is doing
  • Public leaderboard — compare frontier models on safety and task performance in one place
  • Better metrics — break down results by attack type, track recovery rates, and go deeper than just the final verdict
  • Closed-loop training — curate our exported traces to build regression suites and feed future improvement loops

Built With

  • anthropic-api
  • convex
  • html2canvas
  • nanoid
  • next.js
  • openai-api
  • playwright
  • react
  • recharts
  • tailwind-css
  • typescript
  • vitest
Share this project:

Updates