Siege: Autonomous Penetration Testing for GitLab

Inspiration

Security testing is one of the most expensive bottlenecks in the software development lifecycle. A single penetration test can cost $10,000–$50,000, take weeks to schedule, and typically happens quarterly at best. Meanwhile, developers ship code daily that has never been adversarially tested, so vulnerabilities are discovered in production, after the damage is done.

We asked: what if every merge request got its own pentest, automatically, for free?

The idea crystallized around a simple loop that mirrors how real red team/blue team engagements work: attack the application, fix what breaks, then verify the fixes hold. Three phases, fully autonomous, triggered by a GitLab webhook.

What it does

Siege first maps the application's attack surface, then runs an Attack → Defend → Verify loop on every merge request:

1. Analyze

A hybrid static analyzer instantly maps the application's attack surface: endpoints, auth gates, data stores, and data flows. It supports Express, Flask, FastAPI, and Django natively (<0.1s), with a Claude Code fallback for unknown frameworks.
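As a rough illustration of the regex-based path, the sketch below extracts Express routes and flags which ones pass through an auth middleware. The pattern, the `Endpoint` type, and the middleware names in `AUTH_HINTS` are assumptions made for this example, not Siege's actual rules.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: matches calls such as app.get('/path', ...) or
# router.post("/path", ...) and captures the rest of the line after the path.
ROUTE_RE = re.compile(
    r"""\b(?:app|router)\.(get|post|put|patch|delete)\(\s*(['"])(.+?)\2\s*,\s*([^\n]*)"""
)

# Assumed auth middleware names; a real analyzer would make these configurable.
AUTH_HINTS = ("requireAuth", "authenticate", "isLoggedIn")


@dataclass(frozen=True)
class Endpoint:
    method: str
    path: str
    has_auth_gate: bool


def map_express_endpoints(source: str) -> list[Endpoint]:
    """Return every route found in the source, noting whether it has an auth gate."""
    endpoints = []
    for match in ROUTE_RE.finditer(source):
        method, _quote, path, rest_of_line = match.groups()
        guarded = any(hint in rest_of_line for hint in AUTH_HINTS)
        endpoints.append(Endpoint(method.upper(), path, guarded))
    return endpoints
```

A single pass like this only sees routes declared on one line with a literal path, which is the kind of gap a reasoning-based fallback would need to cover.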

2. Attack

Three specialized AI agents run in parallel, each powered by Claude Code driving real Playwright browsers:

XSS Hunter: stored, reflected, and DOM-based cross-site scripting
SQLi Probe: UNION-based, blind, and error-based SQL injection, plus data exfiltration
Auth Bypass: IDOR, missing authentication, missing rate limiting, JWT tampering

Together, these agents target several OWASP Top 10 categories, including injection, broken access control, and authentication failures.

3. Defend

A defender agent reads each vulnerability finding, edits the source code to apply minimal targeted fixes, and runs tests to verify nothing breaks.

4. Verify

The same attackers re-run against the patched code. If the attacks fail, the fixes are confirmed. If any succeed, the loop repeats (up to 3 iterations).
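The control flow of the loop can be sketched in a few lines. Here `attack` and `defend` are hypothetical callables standing in for the attacker and defender agents; only the 3-iteration cap comes from the description above.

```python
from typing import Callable

MAX_ITERATIONS = 3  # the loop repeats up to 3 times


def run_siege_loop(
    attack: Callable[[], list[str]],
    defend: Callable[[list[str]], None],
    max_iterations: int = MAX_ITERATIONS,
) -> dict:
    """Attack, patch whatever was found, then re-attack to verify the patches."""
    findings = attack()  # initial attack phase
    iterations = 0
    while findings and iterations < max_iterations:
        defend(findings)     # defender patches the reported vulnerabilities
        findings = attack()  # verify: the same attackers re-run on patched code
        iterations += 1
    return {"secure": not findings, "iterations": iterations, "remaining": findings}
```

If findings remain after the last iteration, the result reports them rather than claiming the merge request is secure.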

A real-time 3D visualization shows the entire war playing out: nodes light up as they're scanned, projectiles travel along edges during attacks, shockwaves fire on breach, and shields materialize when fixes land.

How we built it

Architecture

GitLab MR Webhook → Python Orchestrator → Claude Code CLI (subprocess)
├── Static Analyzer (instant, regex-based)
├── 3 Attacker Agents (parallel, Playwright)
├── Defender Agent (code editing)
└── WebSocket → React Three Fiber Visualizer

Key technical decisions

Claude Code CLI over the API
Instead of using the Anthropic SDK, we invoke claude as a subprocess. This gives agents full access to Claude Code's built-in tools (Read, Edit, Write, Bash, Glob, Grep) without building a custom tool-use loop.

Hybrid static + AI analysis
Regex handles the deterministic work of mapping the attack surface in ~30ms, while AI handles the reasoning-heavy work of attacking and fixing.

Python orchestrator over Node.js
A ~250-line Python script using asyncio and subprocess proved simpler and faster than a Node.js orchestrator.

Deliberately vulnerable demo app
We built an Express.js app with 5 planted vulnerabilities for consistent testing.
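The "Claude Code CLI over the API" decision amounts to spawning a child process and collecting its output. The following is a minimal sketch, assuming the CLI is on the PATH as `claude` and is run in its non-interactive print mode (`-p`); the function name, the default command, and `STAND_IN` are illustrative, and any further flags the real orchestrator passes are not shown.

```python
import asyncio
import sys

# Stand-in command for trying the wrapper without the CLI installed:
# it simply echoes the prompt back in upper case.
STAND_IN = (sys.executable, "-c", "import sys; print(sys.argv[1].upper())")


async def run_agent(
    prompt: str,
    cwd: str = ".",
    command: tuple[str, ...] = ("claude", "-p"),  # assumed invocation
) -> str:
    """Run one agent as a child process and return its stdout as text."""
    proc = await asyncio.create_subprocess_exec(
        *command, prompt,
        cwd=cwd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    if proc.returncode != 0:
        message = stderr.decode(errors="replace")
        raise RuntimeError(f"agent exited with {proc.returncode}: {message}")
    return stdout.decode(errors="replace")
```

For example, `asyncio.run(run_agent("find xss", command=STAND_IN))` exercises the wrapper end to end without calling the real CLI.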

Tech stack

Python (orchestrator) + TypeScript (demo app, visualizer)
React Three Fiber + drei + postprocessing
Playwright
Claude Code CLI
WebSocket
Zustand
Google Cloud Run
GitLab CI/CD Components

Challenges we faced

1. Agent output parsing

Claude's output format varies from run to run, so we built a multi-strategy parser:

Try ```json blocks first
Fall back to raw JSON extraction
Last resort: parse structured data from text
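The first two strategies above might look like the sketch below. The function names are hypothetical, and the last-resort text parsing is omitted because it depends on the prompt format, which is not described here.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
FENCED_JSON_RE = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)


def _try_load(candidate: str) -> Optional[Any]:
    """Return the parsed value, or None if the candidate is not valid JSON."""
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


def extract_json(text: str) -> Optional[Any]:
    """Pull the first JSON value out of free-form agent output."""
    # Strategy 1: fenced json blocks.
    for block in FENCED_JSON_RE.findall(text):
        parsed = _try_load(block.strip())
        if parsed is not None:
            return parsed
    # Strategy 2: raw JSON embedded in prose. raw_decode parses one value
    # starting at the given index and ignores whatever follows it.
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(text, index)
            except json.JSONDecodeError:
                continue
            return value
    return None
```

One caveat of the second strategy is that it returns the first parseable value, so a stray bracketed token such as a citation marker would win over a later object.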

2. Streaming vs structured output

JSON mode gives structured output but no streaming
Text mode gives streaming but messy parsing
Solution: stream text to the UI as it arrives, then extract JSON from the full transcript afterward
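A minimal sketch of the stream-then-extract approach: each line is forwarded to a callback as it arrives (in Siege this would presumably feed the WebSocket), and the JSON is pulled from the accumulated transcript once the process exits. The function name, the `on_line` callback, and the first-brace-to-last-brace extraction are assumptions for illustration.

```python
import asyncio
import json
import sys
from typing import Callable


async def stream_then_parse(argv: list[str], on_line: Callable[[str], None]) -> dict:
    """Forward each stdout line as it arrives, then parse JSON from the transcript."""
    proc = await asyncio.create_subprocess_exec(*argv, stdout=asyncio.subprocess.PIPE)
    lines: list[str] = []
    while True:
        raw = await proc.stdout.readline()
        if not raw:  # empty bytes means end of stream
            break
        line = raw.decode(errors="replace").rstrip("\r\n")
        on_line(line)       # live update for the viewer
        lines.append(line)  # kept for parsing afterward
    await proc.wait()
    transcript = "\n".join(lines)
    # Naive extraction: take everything from the first "{" to the last "}".
    start, end = transcript.find("{"), transcript.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in agent output")
    return json.loads(transcript[start:end + 1])
```

The viewer sees progress line by line, while the orchestrator still ends up with a structured result.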

3. Analyzer bottleneck

The initial AI-based analyzer took 2–3 minutes. We replaced it with a regex-based version (~30ms), cutting total runtime nearly in half.

4. Visual coherence

It took multiple redesigns to fix:

Overlapping labels
Conflicting color systems
Confusing 3D layouts

The final approach separated the visual layers and simplified the geometry.

5. Parallel agent coordination

Attackers run concurrently via asyncio.gather
Their interleaved output required UI separation (tabs) for readability
The defender runs sequentially to avoid conflicting fixes
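The coordination pattern above, concurrent attackers followed by a sequential defender, can be sketched as follows. `run_attacker` and `apply_fix` are hypothetical stand-ins that sleep instead of driving a browser or editing code.

```python
import asyncio


async def run_attacker(name: str, delay: float) -> list[str]:
    """Hypothetical stand-in for one attacker agent."""
    await asyncio.sleep(delay)
    return [f"{name}: finding"]


async def apply_fix(finding: str, patched: list[str]) -> None:
    """Hypothetical stand-in for the defender patching one finding."""
    await asyncio.sleep(0)
    patched.append(finding)


async def attack_then_defend() -> list[str]:
    # Attackers run concurrently; gather returns results in argument order,
    # regardless of which agent finishes first.
    results = await asyncio.gather(
        run_attacker("xss", 0.02),
        run_attacker("sqli", 0.01),
        run_attacker("auth", 0.0),
    )
    findings = [f for agent_findings in results for f in agent_findings]
    # Fixes are applied one at a time so two edits never touch a file at once.
    patched: list[str] = []
    for finding in findings:
        await apply_fix(finding, patched)
    return patched
```

Running the attackers together bounds the attack phase by the slowest agent, while the sequential defender trades some speed for predictable edits.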

What we learned

Claude Code as a subprocess is extremely powerful
Agents can autonomously read, edit, and fix vulnerabilities with minimal prompting.

Static analysis + AI is the winning combo
Deterministic parsing plus AI reasoning is both fast and accurate: the analysis phase dropped from 2–3 minutes to ~30ms, a speedup of roughly 4,000x.

Updates

Taranveer Anand started this project — Mar 25, 2026 01:43 PM EDT
