About the Project (ShieldFlow)

Inspiration

Security reviews are often the biggest bottleneck in the software development lifecycle. Developers wait hours or days for someone to manually review their merge requests for vulnerabilities. Compliance teams scramble to generate evidence for SOC 2 and GDPR audits. And when deadlines hit, security reviews get skipped entirely.

We asked: what if every single merge request got an instant, thorough security review — automatically?

What it does

ShieldFlow is a trigger-driven, multi-agent pipeline that autonomously reviews every GitLab Merge Request for security vulnerabilities, compliance risks, code quality issues, and energy efficiency — then posts a structured review comment directly on the MR within 60 seconds.

When a developer opens or updates an MR, ShieldFlow:

  1. Intercepts the event via GitLab webhook
  2. Fetches the MR diff via GitLab REST API
  3. Runs 8 specialized AI agents in a pipeline with sequential and parallel stages:
    • 🔍 Change Risk Agent — classifies security risks, assigns a score (1-10)
    • ⚠️ Threat Model Agent — generates attack scenarios per finding
    • 📋 Compliance Agent — maps findings to SOC 2, GDPR, HIPAA, NIST, ISO 27001 controls
    • 🔧 Remediation Agent — suggests exact code fixes with vulnerable vs. fixed code
    • 🌱 GreenCode Agent — calculates a Green Score (0-100) using Software Carbon Intensity methodology
    • 🤖 Claude Code Review Agent — reviews for bugs, security, and quality using Anthropic Claude via GitLab
    • 📝 Auto Documentation Agent — generates docstrings and changelog entries, commits them to the MR branch
    • 📊 Audit Trail Agent — synthesizes everything into a unified MR comment and generates audit artifacts
  4. Posts a structured security review comment on the MR with scores, findings, threat analysis, compliance mappings, remediation code, and documentation status
  5. Commits auto-generated documentation directly to the MR branch
  6. Generates audit-ready evidence artifacts for compliance teams
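
Step 1 can be sketched as a small filter that decides whether an incoming webhook payload should start a review. This is an illustration, not ShieldFlow's actual handler: the function name should_review and the set of triggering actions are assumptions, while the payload fields (object_kind, object_attributes.action, object_attributes.iid, project.id) follow GitLab's merge request webhook format.

```python
from typing import Any, Optional

# Assumption: only these merge request actions should trigger a review.
TRIGGER_ACTIONS = {"open", "reopen", "update"}


def should_review(payload: dict[str, Any]) -> Optional[tuple[int, int]]:
    """Return (project_id, mr_iid) if the event should start a review, else None."""
    # Ignore pushes, pipeline events, comments, and anything else.
    if payload.get("object_kind") != "merge_request":
        return None

    attrs = payload.get("object_attributes") or {}
    if attrs.get("action") not in TRIGGER_ACTIONS:
        return None

    project_id = (payload.get("project") or {}).get("id")
    mr_iid = attrs.get("iid")
    if project_id is None or mr_iid is None:
        return None
    return int(project_id), int(mr_iid)
```

Returning the project ID and MR IID together gives the later steps what they need to fetch the diff and post the comment.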

How we built it

Backend: Python 3.12 + FastAPI, deployed on Google Cloud Run

AI/LLM: Google Gemini (via Google AI Studio) powers 6 agents. Anthropic Claude (via GitLab integration) powers the code review agent. Dual-LLM review catches what a single model might miss.

Pipeline Architecture: Agents run in an optimized sequence — Change Risk and Threat Model run sequentially (each needs the previous output), then Compliance, Remediation, GreenCode, and Claude Code Review run in parallel via asyncio.gather with individual timeouts. Auto Documentation runs after the parallel group. This keeps total pipeline time under 60 seconds.
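
The staging described above can be sketched with asyncio.wait_for and asyncio.gather. This is a minimal illustration under assumptions: run_pipeline, its parameters, and the shared context dictionary are hypothetical names, and each agent is modelled as an async callable that reads the context and returns its result.

```python
import asyncio
from typing import Any, Awaitable, Callable

Context = dict[str, Any]
Agent = Callable[[Context], Awaitable[Any]]


async def run_pipeline(
    diff: str,
    sequential: list[tuple[str, Agent]],
    parallel: list[tuple[str, Agent]],
    final: list[tuple[str, Agent]],
    timeout_s: float = 30.0,
) -> Context:
    """Run agents in three stages and collect their outputs by name."""
    context: Context = {"diff": diff}

    # Stage 1: each agent sees the outputs of the agents before it.
    for name, agent in sequential:
        context[name] = await asyncio.wait_for(agent(context), timeout_s)

    # Stage 2: independent agents run concurrently, each with its own timeout.
    # return_exceptions=True keeps one failure from cancelling the others.
    outcomes = await asyncio.gather(
        *(asyncio.wait_for(agent(context), timeout_s) for _, agent in parallel),
        return_exceptions=True,
    )
    for (name, _), outcome in zip(parallel, outcomes):
        context[name] = outcome  # a result, or the exception that was raised

    # Stage 3: agents that need the results of the parallel group.
    for name, agent in final:
        context[name] = await asyncio.wait_for(agent(context), timeout_s)
    return context
```

With this shape, the wall-clock time of the parallel stage is bounded by its slowest agent (or by the timeout) rather than by the sum of all four.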

Resilience: Every agent is wrapped in a safe_run() utility with per-agent timeouts and graceful fallbacks. If any single agent fails or times out, the pipeline continues and the MR comment shows partial results. The pipeline never crashes silently — even on failure, it posts an error comment to the MR.
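
A wrapper in the spirit of safe_run() might look like the sketch below. Only the name safe_run comes from the project; the signature, the status strings, and the shape of the returned dictionary are assumptions made for illustration.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger("shieldflow.sketch")  # hypothetical logger name


async def safe_run(
    name: str,
    agent_call: Callable[[], Awaitable[Any]],
    timeout_s: float,
    fallback: Any,
) -> dict[str, Any]:
    """Run one agent; never raise, always return a status plus a usable result."""
    try:
        result = await asyncio.wait_for(agent_call(), timeout=timeout_s)
        return {"agent": name, "status": "ok", "result": result}
    except asyncio.TimeoutError:
        logger.warning("agent %s timed out after %.1fs", name, timeout_s)
        return {"agent": name, "status": "timeout", "result": fallback}
    except Exception as exc:  # isolate any agent failure from the pipeline
        logger.exception("agent %s failed", name)
        return {"agent": name, "status": "error", "result": fallback, "error": str(exc)}
```

Because every outcome carries a status, the comment renderer can list which agents succeeded and which were skipped instead of failing as a whole.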

GitLab Integration: Webhook-triggered (no polling). The service validates the X-Gitlab-Token header on every request, fetches diffs via GitLab REST API v4, posts comments as MR notes, and commits documentation changes via the Repository Commits API.
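
The token check and the REST paths involved can be sketched with the standard library alone. The helper names are hypothetical, and the choice of the merge request diffs endpoint is an assumption (the project only states that diffs come from REST API v4); the notes and repository commits paths are the standard v4 routes.

```python
import hmac
from typing import Optional, Union
from urllib.parse import quote


def is_valid_webhook_token(received: Optional[str], expected: str) -> bool:
    """Compare the X-Gitlab-Token header value with the configured secret."""
    if not received or not expected:
        return False
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(received.encode("utf-8"), expected.encode("utf-8"))


def mr_endpoints(project: Union[int, str], mr_iid: int) -> dict[str, str]:
    """Build the API v4 paths used to read the diff and write results back."""
    # A project can be addressed by numeric ID or by URL-encoded path.
    project_ref = quote(str(project), safe="")
    base = f"/api/v4/projects/{project_ref}"
    return {
        "diffs": f"{base}/merge_requests/{mr_iid}/diffs",  # GET (assumed endpoint)
        "notes": f"{base}/merge_requests/{mr_iid}/notes",  # POST the review comment
        "commits": f"{base}/repository/commits",           # POST documentation commit
    }
```

Rejecting a request with a missing or mismatched token before any API call keeps unauthenticated traffic from triggering LLM usage.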

Compliance Engine: Hybrid rules + AI approach. A deterministic CONTROL_MAP maps finding types to specific framework controls (SOC 2 CC6.1, GDPR Article 32, HIPAA 164.312, etc.), then Gemini explains why each mapping matters and what evidence is needed.
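
The deterministic half of this approach can be sketched as a plain lookup table. Only the name CONTROL_MAP and the control IDs (SOC 2 CC6.1, GDPR Article 32, HIPAA 164.312) come from the description above; the finding type names, the assignment of controls to each finding type, and the map_findings helper are hypothetical.

```python
from typing import Any

# Hypothetical finding types and assignments, for illustration only.
CONTROL_MAP: dict[str, dict[str, list[str]]] = {
    "hardcoded_secret": {"SOC 2": ["CC6.1"], "GDPR": ["Article 32"]},
    "missing_access_control": {"SOC 2": ["CC6.1"], "HIPAA": ["164.312"]},
    "unencrypted_personal_data": {"GDPR": ["Article 32"], "HIPAA": ["164.312"]},
}


def map_findings(finding_types: list[str]) -> list[dict[str, Any]]:
    """Attach framework controls to each finding type without calling an LLM."""
    mapped = []
    for finding_type in finding_types:
        controls = CONTROL_MAP.get(finding_type, {})
        mapped.append(
            {
                "finding_type": finding_type,
                "controls": controls,
                # Unmapped findings are kept and flagged rather than dropped.
                "mapped": bool(controls),
            }
        )
    return mapped
```

Keeping the mapping deterministic means the same finding always cites the same controls; the LLM step then only has to explain the mapping, not decide it.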

Green Scoring: Based on Software Carbon Intensity (SCI) methodology. The agent detects N+1 queries, blocking I/O in async contexts, nested loops, and missing pagination, and it rewards caching, async patterns, connection pooling, and resource cleanup.
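
One way to turn detected patterns into a 0-100 score is a baseline with penalties and rewards, clamped to the valid range. The sketch below is a heuristic illustration only: the baseline, every weight, and the pattern identifiers are invented for the example, and it does not compute an SCI value itself.

```python
from typing import Iterable

# All numbers below are hypothetical weights chosen for illustration.
BASE_SCORE = 70

PENALTIES = {
    "n_plus_one_query": 15,
    "blocking_io_in_async": 12,
    "missing_pagination": 10,
    "nested_loop": 8,
}

REWARDS = {
    "caching": 8,
    "async_pattern": 6,
    "connection_pooling": 6,
    "resource_cleanup": 5,
}


def green_score(detected_patterns: Iterable[str]) -> int:
    """Score a diff from its detected patterns; unknown patterns are ignored."""
    score = BASE_SCORE
    for pattern in detected_patterns:
        score -= PENALTIES.get(pattern, 0)
        score += REWARDS.get(pattern, 0)
    # Clamp so the result always fits the 0-100 Green Score range.
    return max(0, min(100, score))
```

For example, a diff with one N+1 query and one caching pattern scores 70 - 15 + 8 = 63 under these assumed weights.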

Development Process: Built entirely using Kiro's spec-driven development — 14 specs, each going through Requirements → Design → Tasks before implementation. 120 tests, all passing.

Challenges we ran into

  • gcloud ADC conflict: The google.generativeai library was picking up Application Default Credentials instead of our API key, causing all Gemini agents to fail silently. Fixed by calling genai.configure(api_key=...) fresh before each agent call and clearing GOOGLE_APPLICATION_CREDENTIALS.

  • GitLab Markdown rendering: <details> blocks require exact blank line placement — one missing blank line and the entire collapsible section renders as raw HTML. Took multiple iterations to get right.

  • Parallel agent timeouts: The Remediation Agent was timing out at 30 seconds on complex diffs, which we initially misidentified as a Claude agent failure. We raised its timeout to 45 seconds and kept the graceful fallback.

  • Dual-LLM orchestration: Running Gemini and Claude in the same parallel group required careful error isolation — a Claude auth failure shouldn't prevent Gemini agents from completing.
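
The blank-line issue with collapsible sections can be avoided by generating them through one helper. The sketch below assumes the placement recommended for GitLab Flavored Markdown, a blank line after the summary line and another before the closing tag; the function name details_block is hypothetical.

```python
import html


def details_block(summary: str, body_markdown: str) -> str:
    """Wrap Markdown in a collapsible <details> section for an MR comment."""
    return "\n".join(
        [
            "<details>",
            f"<summary>{html.escape(summary)}</summary>",
            "",  # blank line so the body is parsed as Markdown
            body_markdown.strip("\n"),
            "",  # blank line before the closing tag
            "</details>",
            "",
        ]
    )
```

Routing every collapsible section through a single function means the spacing only has to be right in one place.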

Accomplishments that we're proud of

  • 8 agents running across sequential and parallel stages, finishing in under 60 seconds on real GitLab MRs
  • 120 tests passing with full mocking of external APIs — no real network calls in tests
  • Dual-LLM review — Gemini for security analysis, Claude for code quality — two perspectives on every MR
  • Green Score based on actual SCI methodology, not just buzzwords
  • Auto-documentation committed directly to the MR branch — the MR gets docstrings and changelog entries without the developer lifting a finger
  • Graceful degradation — any agent can fail without crashing the pipeline. The comment shows what succeeded and what was skipped.
  • Real compliance mapping — specific SOC 2, GDPR, HIPAA, NIST 800-53, ISO 27001 control IDs, not generic categories

What we learned

  • Spec-driven development with Kiro is dramatically faster than vibe coding for complex multi-agent systems — having requirements and design documents before writing code prevented countless integration issues
  • LLM output parsing is the hardest part of building agents — structured JSON output from both Gemini and Claude requires careful prompt engineering, retry logic, and markdown fence stripping
  • Security review is a perfect use case for multi-agent architectures — different agents specialize in different aspects (risk, threats, compliance, fixes) and the combined output is more thorough than any single prompt could achieve
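
The fence stripping and retry logic mentioned above can be sketched as follows. The function names are hypothetical, and generate stands in for whatever call returns the raw model text; no Gemini or Claude client API is assumed here.

```python
import json
import re
from typing import Any, Callable

# Matches a reply that is entirely wrapped in one Markdown code fence,
# with an optional language tag such as "json" after the opening fence.
_FENCE = re.compile(r"^`{3}[\w-]*[ \t]*\n(.*?)\n?[ \t]*`{3}$", re.DOTALL)


def parse_llm_json(text: str) -> Any:
    """Strip an optional Markdown fence, then parse the remainder as JSON."""
    stripped = text.strip()
    match = _FENCE.match(stripped)
    if match:
        stripped = match.group(1)
    return json.loads(stripped)


def generate_json(generate: Callable[[], str], attempts: int = 3) -> Any:
    """Call the model until its reply parses as JSON, up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return parse_llm_json(generate())
        except json.JSONDecodeError as exc:
            last_error = exc  # malformed output: ask again
    raise ValueError(f"no valid JSON after {attempts} attempts") from last_error
```

Validating the parsed object against a schema (for example with pydantic, which the project lists) would be a natural next step after parsing.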

What's next for ShieldFlow

  • Cloud Run production deployment with Secret Manager and AlloyDB persistence
  • HTML dashboard artifacts uploaded to GCS with interactive charts
  • GitLab CI/CD integration — run ShieldFlow as a pipeline job, not just a webhook
  • Custom compliance rule sets — let teams define their own control mappings
  • Multi-language support — currently language-agnostic via diff analysis, but language-specific vulnerability patterns would improve detection
  • Slack/Teams notifications — alert security teams when high-risk MRs are opened

Built With

  • anthropic-claude
  • asyncio
  • docker
  • fastapi
  • gitlab-api
  • gitlab-webhooks
  • google-cloud-run
  • google-cloud-secret-manager
  • google-gemini
  • kiro
  • pydantic
  • python