Siege: Autonomous Penetration Testing for GitLab
Inspiration
Security testing is the most expensive bottleneck in the software development lifecycle. A single penetration test costs $10,000–$50,000, takes weeks to schedule, and happens quarterly at best. Meanwhile, developers ship code daily that has never been adversarially tested. Vulnerabilities are discovered in production — after the damage is done.
We asked: what if every merge request got its own pentest, automatically, for free?
The idea crystallized around a simple loop that mirrors how real red team/blue team engagements work: attack the application, fix what breaks, then verify the fixes hold. Three phases, fully autonomous, triggered by a GitLab webhook.
What it does
Siege runs an Attack → Defend → Verify loop on every merge request:
1. Analyze
A hybrid static analyzer instantly maps the application's attack surface: endpoints, auth gates, data stores, and data flows. It supports Express, Flask, FastAPI, and Django natively (<0.1s), with a Claude Code fallback for unknown frameworks.
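The analyzer's actual patterns are not shown in this write-up, so the following is only a minimal sketch of how a regex pass might pull Express route registrations out of source text. The names `EXPRESS_ROUTE` and `find_express_routes` are hypothetical, and the pattern assumes routes are registered on objects called `app` or `router` with a string-literal path.

```python
import re

# Hypothetical pattern: matches calls such as app.get('/path', ...) or
# router.post("/path", ...). Assumes the path is a plain string literal.
EXPRESS_ROUTE = re.compile(
    r"""\b(?:app|router)\.(get|post|put|patch|delete)\(\s*['"]([^'"]+)['"]"""
)


def find_express_routes(source: str) -> list[tuple[str, str]]:
    """Return (METHOD, path) pairs for each route registration found in source."""
    return [(m.group(1).upper(), m.group(2)) for m in EXPRESS_ROUTE.finditer(source)]


sample = """
app.get('/users/:id', requireAuth, getUser);
router.post("/login", login);
"""

print(find_express_routes(sample))
# [('GET', '/users/:id'), ('POST', '/login')]
```

A pass like this is deterministic and needs no model call, which is consistent with the sub-second timing reported for the supported frameworks.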
2. Attack
Three specialized AI agents run in parallel, each powered by Claude Code driving real Playwright browsers:
- XSS Hunter — Stored, reflected, and DOM-based cross-site scripting
- SQLi Probe — UNION, blind, error-based SQL injection + data exfiltration
- Auth Bypass — IDOR, missing authentication, rate limiting, JWT tampering
Together, these agents target several OWASP Top 10 categories, such as injection, broken access control, and authentication failures.
3. Defend
A defender agent reads each vulnerability finding, edits the source code to apply minimal targeted fixes, and runs tests to verify nothing breaks.
4. Verify
The same attackers re-run against the patched code. If the attacks fail, the fixes are confirmed. If any succeed, the loop repeats (up to 3 iterations).
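The control flow of the Attack → Defend → Verify loop can be sketched as below. This is an illustration under assumptions, not the project's orchestrator: `run_siege_loop`, `attack`, and `defend` are hypothetical names, and findings are modeled as plain strings.

```python
from typing import Callable

MAX_ITERATIONS = 3  # the write-up caps the loop at 3 iterations


def run_siege_loop(
    attack: Callable[[], list[str]],
    defend: Callable[[list[str]], None],
    max_iterations: int = MAX_ITERATIONS,
) -> tuple[bool, int]:
    """Attack, patch, and re-attack until no findings remain or the cap is hit.

    Returns (all_fixed, defend_rounds_used).
    """
    findings = attack()  # initial attack phase
    for round_number in range(1, max_iterations + 1):
        if not findings:
            return True, round_number - 1
        defend(findings)     # defender patches the reported issues
        findings = attack()  # verify: the same attackers run again
    return not findings, max_iterations


# Toy stand-ins: each defend round fixes one open vulnerability.
open_vulns = {"xss", "sqli"}
print(run_siege_loop(lambda: sorted(open_vulns),
                     lambda found: open_vulns.discard(found[0])))
# (True, 2)
```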
A real-time 3D visualization shows the entire war playing out: nodes light up as they're scanned, projectiles travel along edges during attacks, shockwaves fire on breach, and shields materialize when fixes land.
How we built it
Architecture
GitLab MR Webhook → Python Orchestrator → Claude Code CLI (subprocess)
├── Static Analyzer (instant, regex-based)
├── 3 Attacker Agents (parallel, Playwright)
├── Defender Agent (code editing)
└── WebSocket → React Three Fiber Visualizer
Key technical decisions
Claude Code CLI over the API
Instead of using the Anthropic SDK, we invoke `claude` as a subprocess. This gives agents full access to built-in tools (Read, Edit, Write, Bash, Glob, Grep) without building a custom tool-use loop.
Hybrid static + AI analysis
Regex handles deterministic tasks in ~30ms, while AI handles reasoning-heavy tasks.
Python orchestrator over Node.js
A ~250-line Python script using `asyncio` + `subprocess` proved simpler and faster.
Deliberately vulnerable demo app
Built an Express.js app with 5 planted vulnerabilities for consistent testing.
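As a rough illustration of the subprocess approach described above, the sketch below runs a command-line agent from `asyncio` and collects its output. The exact command line and the way the prompt is passed to the CLI are not given in this write-up, so the command is supplied by the caller and the prompt is sent on stdin as an assumption. `run_agent` and `STAND_IN` are hypothetical names, and the stand-in command is a tiny Python one-liner so the example runs without the real CLI installed.

```python
import asyncio
import sys


async def run_agent(command: list[str], prompt: str, timeout: float = 600.0) -> str:
    """Run a CLI agent, send the prompt on stdin, and return its stdout as text."""
    proc = await asyncio.create_subprocess_exec(
        *command,
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(
            proc.communicate(prompt.encode()), timeout
        )
    except asyncio.TimeoutError:
        proc.kill()        # do not leave a hung agent behind
        await proc.wait()
        raise
    if proc.returncode != 0:
        raise RuntimeError(
            f"agent exited with {proc.returncode}: {stderr.decode(errors='replace')}"
        )
    return stdout.decode()


# Stand-in for the real agent command (assumption): echoes stdin in upper case.
STAND_IN = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]

print(asyncio.run(run_agent(STAND_IN, "find xss")))
```

Keeping the command as a parameter means the same wrapper can serve attacker and defender agents with different prompts and flags.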
Tech stack
- Python (orchestrator) + TypeScript (demo app, visualizer)
- React Three Fiber + drei + postprocessing
- Playwright
- Claude Code CLI
- WebSocket
- Zustand
- Google Cloud Run
- GitLab CI/CD Components
Challenges we faced
1. Agent output parsing
Claude output isn’t deterministic. We built a multi-strategy parser:
- Try ```json blocks first
- Fall back to raw JSON extraction
- Last resort: parse structured data from text
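The three strategies above might be chained roughly as follows. This is a sketch, not the project's parser: `parse_agent_output` is a hypothetical name, and the last-resort step assumes the agent falls back to simple "key: value" lines, which the write-up does not specify.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this listing contains no literal fence
FENCED_JSON = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)
# Assumed plain-text fallback format: one "key: value" pair per line.
KEY_VALUE = re.compile(r"^\s*(\w+)\s*:\s*(.+?)\s*$", re.MULTILINE)


def parse_agent_output(text: str):
    """Return parsed data from agent output, or None if nothing is recoverable."""
    # Strategy 1: fenced json blocks.
    for block in FENCED_JSON.findall(text):
        try:
            return json.loads(block)
        except json.JSONDecodeError:
            continue
    # Strategy 2: raw JSON embedded in prose; try decoding at each '{' or '['.
    decoder = json.JSONDecoder()
    for match in re.finditer(r"[{\[]", text):
        try:
            parsed, _end = decoder.raw_decode(text, match.start())
            return parsed
        except json.JSONDecodeError:
            continue
    # Strategy 3: last resort, collect "key: value" lines into a dict.
    fields = dict(KEY_VALUE.findall(text))
    return fields or None


print(parse_agent_output('Found {"type": "sqli", "endpoint": "/login"} during scan'))
# {'type': 'sqli', 'endpoint': '/login'}
```

Ordering the strategies from strictest to loosest means well-formed output is never reinterpreted by the more permissive fallbacks.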
2. Streaming vs structured output
- JSON mode = structure, no streaming
- Text mode = streaming, messy parsing
- Solution: stream text, extract JSON afterward
3. Analyzer bottleneck
Initial AI-based analyzer took 2–3 minutes. Replaced with regex-based version (30ms), cutting total runtime nearly in half.
4. Visual coherence
Multiple redesigns to fix:
- Overlapping labels
- Conflicting color systems
- Confusing 3D layouts
Final approach separated visual layers and simplified geometry.
5. Parallel agent coordination
- Used `asyncio.gather` for attackers
- Required UI separation (tabs) for readability
- Defender runs sequentially to avoid conflicting fixes
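The coordination pattern in this list, parallel attackers followed by a sequential defender, could look something like the sketch below. The agents are simulated with `asyncio.sleep`; `run_attacker`, `apply_fix`, and `attack_then_defend` are hypothetical names standing in for the real agent calls.

```python
import asyncio


async def run_attacker(name: str, delay: float) -> dict:
    """Stand-in for one attacker agent; a real one would drive a browser session."""
    await asyncio.sleep(delay)  # simulates agent work
    return {"agent": name, "findings": [f"{name}-finding"]}


async def apply_fix(finding: str, patched: list[str]) -> None:
    """Stand-in for the defender editing source code for a single finding."""
    await asyncio.sleep(0)
    patched.append(finding)


async def attack_then_defend() -> list[str]:
    # Attackers are independent of each other, so they run concurrently.
    # gather returns results in argument order, not completion order.
    reports = await asyncio.gather(
        run_attacker("xss", 0.02),
        run_attacker("sqli", 0.01),
        run_attacker("auth", 0.03),
    )
    # Fixes may touch the same files, so they are applied one at a time.
    patched: list[str] = []
    for report in reports:
        for finding in report["findings"]:
            await apply_fix(finding, patched)
    return patched


print(asyncio.run(attack_then_defend()))
# ['xss-finding', 'sqli-finding', 'auth-finding']
```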
What we learned
Claude Code as a subprocess is extremely powerful
Agents can autonomously read, edit, and fix vulnerabilities with minimal prompting.
Static analysis + AI is the winning combo
Deterministic + reasoning = fast and accurate (up to 4000x faster analysis phase).