Inspiration
Pull requests move fast, but database-layer mistakes (unsafe SQL, risky ORM usage, migrations that change query surfaces) still slip through review. Static grepping misses context; serious validation means hitting a running endpoint with tools that understand injection—not just reading a diff.
We wanted a teammate that behaves like a focused red-team loop: notice DB-relevant changes, gather enough context to aim scanners, run proven tooling with hard timeouts, then explain results to humans in one evolving GitHub comment. Sponsor tracks around classification, WAF-aware scanning, and automated incentives gave us a concrete shape for that loop.
What it does
Orchestration beats prompts. A reliable agent is mostly state + boundaries: a single PRScanState, LangGraph-style nodes, and dependency injection so Brain (orchestration + LLM), Muscle (subprocess scanners), and Context (HTTP clients) don’t become a tangle of imports.
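A minimal sketch of that shape: `PRScanState` and `BrainDeps` come from the write-up, but every field name and the placeholder node logic below are illustrative assumptions, not the project's actual code.

```python
from dataclasses import dataclass, field

# Hypothetical shape of the shared state; only the PRScanState name
# comes from the write-up, the fields are illustrative.
@dataclass
class PRScanState:
    pr_number: int
    diff: str = ""
    db_relevant: bool = False
    target_url: str = ""
    tech_stack: list[str] = field(default_factory=list)
    findings: list[dict] = field(default_factory=list)
    comment_id: int = 0

# Dependency injection: nodes receive clients instead of importing them,
# so Brain, Muscle, and Context stay separable (and mockable).
@dataclass
class BrainDeps:
    github: object    # Context: diff + comment lifecycle
    scanners: object  # Muscle: sqlmap / nuclei wrappers
    llm: object       # summarizer

def classify_db_impact(state: PRScanState, deps: BrainDeps) -> PRScanState:
    # Each node takes state + deps and returns updated state.
    state.db_relevant = "ALTER TABLE" in state.diff  # placeholder heuristic
    return state
```

With this contract, swapping `deps` for stubs is all mock mode needs; the graph itself never changes.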
Mocks are a product feature. A --mock path with stubbed GitHub and stubbed scan output let us iterate on the summarization and graph without burning APIs—critical for a weekend build.
The “last mile” is the URL. Classifying a diff is useful; pointing sqlmap/nuclei at the right HTTP surface is what makes the demo real. We added an extract-context step and an operator override (TARGET_URL) when inference isn’t enough.
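The override-then-infer logic can be sketched as follows; `TARGET_URL` is the env var named above, while the route-regex inference and the localhost base URL are illustrative assumptions.

```python
import os
import re

def extract_target_url(diff: str) -> str:
    """Infer an HTTP surface from the diff, with an operator override."""
    # Operator override always wins when inference isn't enough.
    override = os.environ.get("TARGET_URL")
    if override:
        return override
    # Naive inference sketch: look for Flask/FastAPI-style route
    # decorators added in the diff. (Hypothetical heuristic.)
    m = re.search(r"@(?:app|router)\.(?:get|post|route)\(['\"]([^'\"]+)", diff)
    if m:
        return f"http://localhost:8000{m.group(1)}"
    return ""  # downstream scan nodes skip when no target is found
```

Returning an empty string (rather than raising) lets the graph degrade gracefully: scanners simply don't run without a surface to aim at.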
Honesty about tools. sqlmap is the deep SQL injection engine; nuclei adds broad template coverage (YAML-based), and our design excludes SQLi templates from nuclei so we don’t duplicate sqlmap’s lane.
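Keeping nuclei out of sqlmap's lane can be done with nuclei's tag-exclusion flag; the command builder below is a sketch, and the exact flag spellings (`-etags`, `-jsonl`) reflect recent nuclei releases.

```python
def build_nuclei_cmd(target_url: str) -> list[str]:
    # -etags excludes templates by tag; excluding "sqli" leaves deep
    # SQL injection testing to sqlmap, as described above.
    return [
        "nuclei",
        "-u", target_url,
        "-etags", "sqli",
        "-jsonl",  # structured output for downstream parsing
    ]
```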
How we built it
Brain (core/): LangGraph flow—pending comment, fetch diff, DB classifier, extract tech stack / endpoint hints, fetch WAF bypass ideas, run scans in parallel, optional bounty trigger, LLM-written final comment that updates the same GitHub comment.
Muscle (tools/): Async sqlmap and nuclei wrappers with asyncio.wait_for timeouts and structured {status, stdout, stderr, findings} results—no exceptions escaping to the graph.
Context (integrations/): GitHub (diff + comment lifecycle), Nia (bypass lists), AllScale (signed payout), and an OpenAI-backed DB-impact classifier (Greptile-shaped track; API limitations led to a pragmatic fallback).
DX: .env.example, fixture state for mocks, pytest for the runner contracts.
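The Muscle contract above can be sketched like this: the `{status, stdout, stderr, findings}` shape and the timeout come from the write-up, while `parse_findings` is a hypothetical placeholder for the real output parser.

```python
import asyncio

def parse_findings(stdout: str) -> list[dict]:
    # Hypothetical parser: one finding per non-empty output line.
    return [{"raw": line} for line in stdout.splitlines() if line.strip()]

async def run_scanner(cmd: list[str], timeout: float = 60.0) -> dict:
    """Run a scanner subprocess; always return a structured result.

    No exception escapes to the graph: timeouts and launch failures
    become {"status": ...} results instead.
    """
    try:
        proc = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        return {"status": "timeout", "stdout": "", "stderr": "", "findings": []}
    except OSError as exc:  # binary missing, permissions, etc.
        return {"status": "error", "stdout": "", "stderr": str(exc), "findings": []}
    findings = parse_findings(out.decode())
    # Non-zero exit with no parsed findings is failure, not silent success.
    status = "ok" if (proc.returncode == 0 or findings) else "error"
    return {"status": status, "stdout": out.decode(),
            "stderr": err.decode(), "findings": findings}
```

Because every path returns the same dict shape, the graph can branch on `status` without try/except at each node.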
Challenges we ran into
Live path completeness: Early on, the graph classified diffs but never set target_url / tech_stack, so downstream steps were hollow. Fixing it meant adding a real context-extraction node and clearer docs.
Subprocess semantics: Long sqlmap runs collided with hackathon time limits, forcing a tradeoff. We settled on a short timeout and treat a non-zero exit with no parsed findings as a failure, not a silent success.
Demo vs security reality: Automated payouts are powerful and dangerous; we keep the recipient as an explicit BOUNTY_WALLET configuration and gate on structured findings—not on vibes.
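A sketch of that gate: `BOUNTY_WALLET` is the config named above, but the severity-based rule and `min_findings` threshold are illustrative assumptions about what "structured findings, not vibes" looks like.

```python
import os

def should_pay_bounty(findings: list[dict],
                      min_findings: int = 1) -> tuple[bool, str]:
    """Gate automated payouts on explicit config and structured findings."""
    wallet = os.environ.get("BOUNTY_WALLET", "")
    if not wallet:
        # No explicit recipient configured: never pay.
        return False, "no BOUNTY_WALLET configured"
    confirmed = [f for f in findings
                 if f.get("severity") in {"high", "critical"}]
    if len(confirmed) < min_findings:
        return False, "no high-severity structured findings"
    return True, wallet
```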
Accomplishments that we're proud of
End-to-end PR workflow: one evolving GitHub comment (pending message, then final report) instead of spamming the thread.
Clean stream split: Brain (core/) orchestrates via PRScanState and BrainDeps; Muscle (tools/) owns subprocess scanners; Context (integrations/) owns outbound APIs. Mock mode swaps in stubs without rewriting the graph.
Live path that actually aims the scanners: extract_context plus optional TARGET_URL so sqlmap and nuclei get a real tech stack and HTTP surface, not empty strings.
Operational honesty: sqlmap 60s kill switch, non-zero exit without findings treated as error, and a guard when comment_id is 0 so failures are visible in logs.
Working demo loop: python3 main.py --mock exercises summarization and payouts with zero GitHub traffic, perfect for judges and iteration.
What we learned
Orchestration > raw LLM: the win is state machines, contracts, and dependency injection, not a bigger prompt.
The URL is the product: diff classification is table stakes; binding PR changes to a testable endpoint is what separates "interesting" from "useful."
Tool roles matter: sqlmap owns deep SQLi; nuclei (YAML templates) widens coverage; excluding overlapping SQLi templates avoids noisy duplication.
Sponsor APIs drift: a "Greptile-shaped" track still needs a reliable classifier; pragmatic fallbacks beat a broken integration on demo day.
Automated money needs hard gates: payouts should stay explicit (wallet in config, structured findings only) until confidence and policy catch up.
What's next for Aegis Agent
Smarter targeting: POST bodies, cookies, auth headers, and multi-endpoint graphs from the same PR (not only a GET target_url).
Stronger bounty policy: confidence thresholds, caps, optional human approval, and per-repo rules before any production payout.
CI-native runs: a GitHub Action with secrets, pinned tool versions, and documented patterns for a reachable TARGET_URL (staging, tunnel, or self-hosted runner).
Nuclei + LLM: rank or generate YAML template subsets from stack and diff context to shorten scan time without losing signal.
Real Greptile path: swap or complement the classifier when a stable query API is available, keeping the same Brain contract.