Inspiration

Every enterprise shipping code to production faces the same compliance nightmare. A SOX audit means a compliance officer manually exports GitLab MR data to a spreadsheet, screenshots pipelines, counts approvals, and assembles a PDF — hours of work per release, repeated every sprint. DORA Article 11 adds another layer of ICT risk documentation on top. I built AuditForge because this problem is completely automatable, and it was the perfect use case for the GitLab Duo Agent Platform.

What it does

AuditForge is a multi-agent compliance orchestration system that hooks into GitLab's native webhook infrastructure. When a merge request is opened, a pipeline completes, or a deployment goes live, AuditForge automatically:

  1. Collects structured audit evidence — MR approvals, pipeline coverage, security scan findings, deployment metadata — from the GitLab API in parallel
  2. Validates the evidence against a YAML-defined rules engine covering SOX Section 404 ITGC controls and DORA Article 11 ICT risk requirements (e.g., SOX-CM-001: approvals >= 2)
  3. Narrates the findings using three focused Claude Sonnet 4.6 prompts simultaneously — a plain-English developer explanation, a formal audit narrative for regulators, and a 3-sentence executive summary for the CISO
  4. Publishes a signed PDF evidence package to Google Cloud Storage, logs the compliance event to BigQuery, posts a structured MR comment, and opens GitLab issues for each violation
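A control in the rules engine might look like the following sketch (field names are illustrative, not the project's actual schema; only the SOX-CM-001 "approvals >= 2" example comes from the text above):

```yaml
# Hypothetical shape of one YAML-defined control
- id: SOX-CM-001
  regulation: SOX-404-ITGC
  title: Change approval
  description: Every merge request must carry at least two approvals
  evidence_path: approvals.approved_count   # dotted path into collected evidence
  operator: gte
  value: 2
  severity: high
```

Because the control is pure data, adding a regulation means adding a file, not code.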

The result: developers see compliance feedback the moment their MR is touched. Auditors have an immutable PDF trail. CISOs have a dashboard-ready event log in BigQuery.

How we built it

AuditForge is built on Node.js 20 ESM — no TypeScript build step, so judges and evaluators can run it immediately with node. The architecture is a strict 4-phase pipeline orchestrated asynchronously:


  • Gateway: Express.js on Google Cloud Run, validates X-Gitlab-Token with crypto.timingSafeEqual(), responds 202 Accepted immediately to avoid GitLab webhook timeouts, then fires the orchestrator
  • EvidenceCollectorAgent: Uses Promise.allSettled across 4 parallel GitLab API calls — MR metadata, approvals, pipeline + test report, and vulnerability findings. Partial evidence is tolerated; one failing API call never aborts the run
  • PolicyValidatorAgent: Completely synchronous — rules are loaded from YAML at startup, validate() is pure computation with zero I/O. This makes it trivially testable and blazing fast (<1ms for 12 controls)
  • ClaudeNarratorAgent: Three independent, focused Claude Sonnet 4.6 calls running in Promise.all — one per audience. Each prompt is scoped and purposeful, never a mega-prompt
  • ReportPublisherAgent: Parallel publication to GCS (pdfkit PDF), BigQuery (structured event), GitLab MR comment, and GitLab issues. Each destination is non-fatal — one failure never blocks the others
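The partial-evidence behaviour of the EvidenceCollectorAgent can be sketched as follows (a minimal version with a pluggable map of fetchers; names are illustrative, not the project's real API):

```javascript
// Collect evidence from several sources in parallel; a rejected fetch
// yields a null slot instead of aborting the whole run.
async function collectEvidence(fetchers) {
  const names = Object.keys(fetchers);
  const results = await Promise.allSettled(names.map((name) => fetchers[name]()));
  const evidence = {};
  for (const [i, name] of names.entries()) {
    const r = results[i];
    evidence[name] = r.status === 'fulfilled' ? r.value : null; // tolerate failure
  }
  return evidence;
}
```

`Promise.allSettled` (unlike `Promise.all`) never short-circuits, which is exactly the "one failing API call never aborts the run" guarantee.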

All secrets live in Google Cloud Secret Manager. The Cloud Run service account has least-privilege IAM bindings. The full flow is defined as a GitLab Duo Agent Platform flow in auditforge-duo-flow.yaml.
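The gateway's timing-safe token check can be sketched with only node:crypto (a minimal version; the real middleware also wires in the Secret Manager-sourced secret):

```javascript
import { timingSafeEqual } from 'node:crypto';

// Compare the received X-Gitlab-Token against the configured secret without
// leaking prefix information through response timing.
function isValidWebhookToken(received, expected) {
  if (typeof received !== 'string' || received.length === 0) return false;
  const a = Buffer.from(received);
  const b = Buffer.from(expected);
  // timingSafeEqual throws on unequal lengths, so reject those up front
  // (the length itself is not treated as secret here).
  if (a.length !== b.length) return false;
  return timingSafeEqual(a, b);
}
```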

Challenges we ran into

The async timing problem was the first major challenge. GitLab webhooks time out after ~10 seconds, but a full compliance run — multiple API calls plus three Claude invocations — takes 15–60 seconds. The solution was strict fire-and-forget architecture: the gateway returns 202 Accepted before touching the orchestrator, and the orchestrator runs entirely asynchronously.
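The fire-and-forget pattern reduces to a few lines (res is any Express-style response object; names are illustrative):

```javascript
// Acknowledge the webhook immediately, then run the long compliance pipeline
// on a later event-loop turn so GitLab's ~10 s timeout is never at risk.
function handleWebhook(res, runOrchestrator) {
  res.status(202).json({ status: 'accepted' });
  setImmediate(() => {
    // Errors must be logged and swallowed: the HTTP response is already gone.
    runOrchestrator().catch((err) => console.error('orchestrator failed:', err));
  });
}
```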

Prompt engineering for three audiences required significant iteration. A single prompt trying to produce developer explanations, formal audit prose, and executive summaries always resulted in mediocre output for all three. The breakthrough was treating each audience as a completely separate Claude call with its own system prompt, persona, token budget, and success criteria. The developer prompt is direct and non-blaming. The audit narrative is past-tense and evidence-referencing. The executive summary is exactly three sentences, numbers first.
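The three-audience split amounts to three independent, concurrently-run calls; a sketch with a pluggable callModel function standing in for the Anthropic SDK (the prompts here are abbreviated paraphrases, not the project's real ones):

```javascript
// One focused call per audience, run concurrently. callModel(systemPrompt,
// userPrompt, maxTokens) is a stand-in for the actual Claude API call.
async function narrateForAllAudiences(findings, callModel) {
  const payload = JSON.stringify(findings);
  const [developer, auditor, executive] = await Promise.all([
    callModel('Explain findings to developers. Direct, actionable, non-blaming.', payload, 600),
    callModel('Write a formal past-tense audit narrative referencing evidence.', payload, 1200),
    callModel('Write exactly three sentences for a CISO, numbers first.', payload, 200),
  ]);
  return { developer, auditor, executive };
}
```

Each call carries its own persona and token budget, so no single prompt has to serve three masters.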

YAML rule engine generality was harder than expected. The goal was that adding a new regulation should require zero JavaScript changes — only a new YAML file. Getting the operator set (gte, lte, isEmpty, matches, notIn, etc.) right and handling null evidence fields (where a GitLab API call returned nothing) required careful design of both the rule schema and the evidence traversal logic.
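The operator dispatch and null-tolerant evidence traversal can be sketched like this (operator names come from the text above; the implementation details are illustrative):

```javascript
// Walk a dotted path ('approvals.approved_count') through collected evidence,
// returning undefined rather than throwing when a segment is null or missing.
function getEvidence(evidence, path) {
  return path.split('.').reduce((v, key) => (v == null ? undefined : v[key]), evidence);
}

const OPERATORS = {
  gte: (actual, expected) => actual != null && actual >= expected,
  lte: (actual, expected) => actual != null && actual <= expected,
  isEmpty: (actual) => actual == null || actual.length === 0,
  matches: (actual, expected) => typeof actual === 'string' && new RegExp(expected).test(actual),
  notIn: (actual, expected) => !expected.includes(actual),
};

// A rule over absent evidence fails safe instead of crashing the validator.
function checkRule(rule, evidence) {
  const actual = getEvidence(evidence, rule.evidence_path);
  return OPERATORS[rule.operator](actual, rule.value);
}
```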

Accomplishments that we're proud of

  • Zero hardcoded compliance logic in JavaScript. Every control — all 12 across SOX and DORA — lives in YAML. Adding HIPAA or PCI-DSS is a single file addition with no code changes
  • Three-audience Claude narration that actually reads correctly for each audience. The developer explanation is actionable. The audit narrative would pass a Big Four review. The executive summary fits on a slide
  • Sub-second policy validation despite evaluating 12 controls across 4 evidence domains — because PolicyValidatorAgent is pure synchronous computation
  • End-to-end enterprise artifact chain: GitLab event → evidence → policy → narrative → PDF in GCS → BigQuery log → MR comment, all automated, all traceable
  • Production-grade security: timing-safe webhook validation, Secret Manager integration, least-privilege service accounts, and an explicit data privacy position on Claude API usage

What we learned

The most important lesson was that focused AI prompts beat general ones by an order of magnitude. Early versions used a single Claude call to "analyze the compliance results and explain them." The output was technically correct but felt generic — like a compliance form, not a human explanation. Splitting into three purposeful calls, each with a tight system prompt and a specific audience persona, transformed the output quality completely.

The second lesson was about graceful degradation architecture. In a system that calls multiple external APIs (GitLab, Anthropic, GCS, BigQuery), the question isn't "what if everything works" but "what is the minimum acceptable output when things fail." Building explicit fallback paths — especially fallbackNarrative.js for when Claude is unavailable — meant the system always posts something to the MR, which is the most important guarantee for developer trust.
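The fallback path reduces to: try the model call, and on any failure return a deterministic template built from the validated findings (fallbackNarrative.js exists in the project; this body is an illustrative sketch, not its actual contents):

```javascript
// Always produce *something* to post on the MR, even when the model API fails.
async function narrateWithFallback(findings, callModel) {
  try {
    return await callModel(findings);
  } catch {
    const failed = findings.filter((f) => !f.passed);
    return failed.length === 0
      ? 'All evaluated controls passed. (Generated without AI narration.)'
      : `${failed.length} control(s) failed: ${failed.map((f) => f.id).join(', ')}. (Generated without AI narration.)`;
  }
}
```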

What's next for AuditForge

  1. Remediation Agents: Instead of explaining violations, automatically generate the fix. For a missing pipeline security scan job, AuditForge opens a merge request adding the correct .gitlab-ci.yml template
  2. Regulation expansion: Complete HIPAA Security Rule and PCI-DSS v4.0 rulesets (stubs already exist in the codebase)
  3. Multi-source evidence: Pull AWS CloudTrail logs, Datadog SLO metrics, and Jira ticket links into the evidence package for full environmental compliance — not just GitLab-internal data
  4. Compliance dashboard: A BigQuery-backed Looker Studio dashboard showing compliance trends across all projects and regulations over time

Built With

  • anthropic-api
  • claude-sonnet-4.6
  • docker
  • express.js
  • gitlab-api
  • gitlab-duo-agent-platform
  • google-cloud
  • google-cloud-bigquery
  • google-cloud-run
  • google-cloud-secret-manager
  • javascript-(esm)
  • js-yaml
  • node.js
  • pdfkit
  • vitest
  • zod