Inspiration
Every AI security demo is the same: a model finds a bug in a toy app full of // VULNERABLE HERE comments, and a slide tells you it "passed." Nobody watches. Nobody learns. And you never see the other half of security, the defense.
We wanted a benchmark you could actually watch. So we built one as a sport. Two AI agents go head to head over a real-looking web app: one attacks, one defends. An LLM referee calls every goal. A football commentator screams the play-by-play over stadium noise. Model vs model, live.
## What it does
The Arena is an adversarial security benchmark framed as a soccer match.
- A deliberately vulnerable SaaS app, "Tasklight" (workspaces, tasks, billing, integrations, admin), seeded with 20 real vulnerabilities + 4 decoys across auth, IDOR, SQLi, XSS, SSRF, path traversal, business logic, and race conditions. No hints, no markers in the code.
- RED is a Claude Code session wired to an attacker MCP with 19 real pentest tools (raw HTTP, blind-SQLi differential probes, timing probes, a TOCTOU race prober, JWT forging, an out-of-band collaborator, a headless-Chromium XSS prover, IDOR multi-identity checks).
- BLUE is another Claude Code session living inside the source code. When RED strikes, BLUE is told the general area under attack and has to find and patch the flaw, by hand, before the clock runs out.
- An LLM judge holds the full answer sheet and re-proves every claim against the live app. An exploit only counts if it actually lands. A patch only counts if it kills the exploit and the feature still works. Bad patches get penalized and leak the target back to RED.
- The whole match streams to a live TV broadcast with a force-graph attack map, Red/Blue rails, a FIFA-style stat board, and an AI caster that voices continuous play-by-play with synced captions and a crowd that roars on goals.
## How we built it
Eight services, all on localhost, glued by one shared event contract.
- The target ("Tasklight") runs on Bun +
bun:sqlitewith a tiny hand-rolled Express-style layer, fully dependency-free. The vulnerabilities are genuine flaws woven into normal-looking route handlers. - The judge is an executable answer key: every vuln has an exploit probe that must land and a functional probe that must keep working. Deterministic probes are ground truth; an optional Claude pass reads the patch diff for color commentary. The same probe file generates the private answer sheet, the public vulnerability manifest that drives the visualization, and RED's payload-free strike list.
- The orchestrator (control plane) owns the board, scoring, and a hard reset-to-pristine after every single turn, so each round is clean and patches don't bleed across turns.
- The attacker and defender MCP servers (Python) run as dual transports: stdio for the Claude session, HTTP for orchestration. Both enforce a per-turn tool-call budget and a monotonic match clock with a hard stop gate.
- The runner is the match brain. It spawns the real RED and BLUE
claudesessions, wires each to its own MCP, runs a warmup recon window, then drives turn alternation. A stream tap parses the live agent output to broadcast each side's reasoning and file edits in real time. - The narrator is an inline relay: it ingests the event stream, runs an always-talking OpenAI
gpt-5.3-chat-latestcommentary loop, voices each line through ElevenLabs (word-level alignment for synced captions), classifies emotion to drive crowd intensity, and forwards everything to the TV. - The TV is a Next.js app. Producers POST contract events to one endpoint; an in-memory hub with a tape delay fans them out over SSE, and a pure reducer folds the stream into the live UI.
Challenges we ran into
- Making vulnerable look production. The hard part wasn't the bugs, it was hiding them. A defender reading the source gets zero free signals, so every flaw had to live inside code that looks like a real product.
- Scoring you can trust. "The model said it worked" is not a benchmark. We made deterministic probes the ground truth and forced the judge to re-prove every exploit against the live app, and to verify patches didn't just break the feature to kill the bug.
- Two real agents, live, on a clock. Keeping RED and BLUE fast and decisive under a tight per-turn budget, spawning them sandboxed, and capturing their reasoning without leaking the answer sheet took real plumbing.
- A broadcast that doesn't lie. Every pixel on the TV comes from the real event stream, no mock data. We added a tape delay so the AI caster and TTS could narrate in sync, like real live TV.
- The environment fought us. A stray
~/package.jsonhijacked dependency installs (so we went dependency-free), and iCloud kept evicting the Python venv mid-run (so the boot script heals the symlink every start).
What we learned
- A benchmark people watch is a different artifact than a benchmark people cite. The visualization forced honesty: you can't fake a goal on screen.
- Defense is the harder, more interesting half, and almost nobody benchmarks it. Watching a model race to patch a live exploit is genuinely tense.
- Separating ground truth (deterministic probes) from narration (LLMs) is what makes the whole thing credible. The fast model talks; the probes decide.
What's next
More vulnerabilities and difficulty tiers, more model matchups, and an ELO-style leaderboard so labs can see who actually attacks and defends best, not just who claims to.
Built With
- bun
- fastapi
- httpx
- nextjs
- playwright
- python
- react
- typescript

Log in or sign up for Devpost to join the conversation.