
Why I built this

At university I was part of the sustainability society. We measured everything we could: flight emissions, food waste, heating bills. The point was to make the numbers visible to people who wouldn't otherwise think about them. The belief was simple: you can't change what you can't see.

Then I became a software developer and discovered that the thing I spent most of my day doing, shipping code through CI/CD pipelines, had a completely invisible footprint. Every test run, every build, every deploy: compute spinning up, energy burning, CO₂ going somewhere. Nobody talked about it. There was no number on the screen.

At the same time, I watched AI tooling get embedded into pipelines faster than anyone could audit it. Teams had no idea which jobs were calling which APIs, how often, or whether sensitive files were being touched in the process. Risk assessment was an afterthought.

CICero is the tool I wanted to exist.

What it does 🌿

CICero runs as a webhook receiver on Google Cloud Run. When GitLab fires a pipeline event, it:

  1. Estimates energy and carbon — computes kWh and kg CO₂e from job runtimes using configurable emission factors per runner tag and cloud region (see the sketch just after this list)
  2. Audits AI API usage — scans job logs for calls to Anthropic, Gemini, and OpenAI endpoints using pre-compiled single-pass regex patterns
  3. Scores policy risk — evaluates changed file paths against configurable high-risk glob patterns (payments/**, auth/**, etc.) and assigns a 🟢 / 🟡 / 🔴 traffic-light risk level (a second sketch follows below)
  4. Runs Claude AI analysis — sends a compact digest of metrics and log snippets to Claude Haiku, which autonomously decides what it needs to inspect (via tool use: get_file, get_job_log) before writing its final report
  5. Tracks trends and budgets — reads the historical footprint log to compute rolling averages, trend direction (↑↓→), and monthly totals, with configurable budget thresholds
  6. Writes a live dashboard — overwrites footprint/dashboard.md on every run with budget gauges and a recent runs table
  7. Posts to the MR — idempotently upserts a rich structured comment on the merge request
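
The arithmetic behind step 1 is just energy = power × time and carbon = energy × grid factor. A minimal sketch, assuming a hypothetical config shape and fallback values (the real factors are configurable per runner tag and cloud region, as described above):

```ts
// Illustrative config shape; field names and fallback numbers are
// assumptions for this sketch, not CICero's actual schema.
interface EmissionConfig {
  wattsPerRunnerTag: Record<string, number>;   // average power draw per runner tag
  kgCO2PerKwhByRegion: Record<string, number>; // grid emission factor per cloud region
}

export function estimateFootprint(
  durationSeconds: number,
  runnerTag: string,
  region: string,
  cfg: EmissionConfig,
): { kwh: number; kgCO2e: number } {
  const watts = cfg.wattsPerRunnerTag[runnerTag] ?? 250;  // assumed average runner draw
  const kwh = (watts * durationSeconds) / (1000 * 3600);  // W·s → kWh
  const factor = cfg.kgCO2PerKwhByRegion[region] ?? 0.4;  // rough global grid average
  return { kwh, kgCO2e: kwh * factor };
}
```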

Everything happens automatically. No pipeline config changes required.
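
Step 3 reduces to glob matching plus a threshold. A sketch of the idea, with hypothetical default patterns and cut-offs (CICero reads the real ones from config), using minimatch purely for illustration:

```ts
import { minimatch } from "minimatch"; // any glob matcher would do here

type RiskLevel = "🟢" | "🟡" | "🔴";

// Hypothetical defaults; the real patterns are configurable.
const HIGH_RISK_GLOBS = ["payments/**", "auth/**"];

export function scoreRisk(changedPaths: string[]): RiskLevel {
  const hits = changedPaths.filter((path) =>
    HIGH_RISK_GLOBS.some((glob) => minimatch(path, glob)),
  );
  if (hits.length === 0) return "🟢"; // no sensitive paths touched
  return hits.length >= 3 ? "🔴" : "🟡"; // escalate as more sensitive paths change
}
```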

How I built it

The stack is Node.js 20 + TypeScript (strict, ESM) on Google Cloud Run. The AI layer uses @anthropic-ai/sdk with a genuine agentic tool-use loop: Claude Haiku autonomously calls get_file and get_job_log up to 3 times before producing its final report. I chose Claude Haiku specifically because it is the fastest and cheapest model in the family, and latency matters on a webhook hot path.
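
The loop itself is compact. A sketch of its shape: the tool schemas are abbreviated, runTool stands in for the real GitLab-backed executor, and the model id string is an assumption:

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Abbreviated tool schemas; the real ones describe inputs in more detail.
const tools: Anthropic.Tool[] = [
  {
    name: "get_file",
    description: "Fetch a file from the repository at the pipeline's ref",
    input_schema: { type: "object", properties: { path: { type: "string" } }, required: ["path"] },
  },
  {
    name: "get_job_log",
    description: "Fetch the trace of a pipeline job by id",
    input_schema: { type: "object", properties: { job_id: { type: "number" } }, required: ["job_id"] },
  },
];

// Placeholder executor; the real one calls the GitLab API.
async function runTool(name: string, input: unknown): Promise<string> {
  return `stub result for ${name}(${JSON.stringify(input)})`;
}

export async function analyse(digest: string): Promise<string> {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: digest }];
  for (let round = 0; round < 3; round++) { // hard cap on tool rounds
    const res = await client.messages.create({
      model: "claude-3-5-haiku-latest", // assumed model id
      max_tokens: 1024,
      tools,
      messages,
    });
    if (res.stop_reason !== "tool_use") {
      // Final answer: concatenate the text blocks.
      return res.content
        .filter((b): b is Anthropic.TextBlock => b.type === "text")
        .map((b) => b.text)
        .join("");
    }
    // Echo Claude's turn back, then answer each tool call it made.
    messages.push({ role: "assistant", content: res.content });
    const toolResults: Anthropic.ToolResultBlockParam[] = [];
    for (const block of res.content) {
      if (block.type === "tool_use") {
        toolResults.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: await runTool(block.name, block.input),
        });
      }
    }
    messages.push({ role: "user", content: toolResults });
  }
  return "Analysis unavailable: tool-round limit reached."; // graceful degradation
}
```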

The project integrates with the GitLab Duo Agent Platform in three ways: a /greenguard slash command (agent skill), an event-driven flow that triggers on pipeline completion (greenguard-flow.yml), and an AGENTS.md context file that gives all Duo agents operating in the codebase project-specific knowledge.

Testing uses Vitest with fast-check for property-based tests — 116 tests across 13 files, all running fully offline with mocked GitLab and Anthropic APIs.
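
In that style, a property pins down invariants of a function rather than exact values. A hedged example against the estimator sketched earlier (import path hypothetical):

```ts
import { describe, expect, test } from "vitest";
import fc from "fast-check";
import { estimateFootprint } from "../src/footprint"; // hypothetical path

describe("estimateFootprint", () => {
  test("footprint is non-negative and grows with runtime", () => {
    fc.assert(
      fc.property(
        fc.integer({ min: 1, max: 86_400 }), // runtimes in seconds, up to a day
        fc.integer({ min: 1, max: 86_400 }),
        (a, b) => {
          const cfg = { wattsPerRunnerTag: {}, kgCO2PerKwhByRegion: {} };
          const short = estimateFootprint(Math.min(a, b), "linux", "europe-west1", cfg);
          const long = estimateFootprint(Math.max(a, b), "linux", "europe-west1", cfg);
          expect(short.kwh).toBeGreaterThanOrEqual(0);
          expect(long.kgCO2e).toBeGreaterThanOrEqual(short.kgCO2e);
        },
      ),
    );
  });
});
```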

Challenges

The GitLab REST API has several undocumented or misleading behaviours I had to reverse-engineer: the job trace endpoint returns raw text/plain, not JSON; the MR pipelines endpoint returns an array (newest-first), not a single object; and file updates require PUT while creates require POST on the repository files endpoint. Getting the webhook validation right (timing-safe HMAC comparison to prevent timing attacks) was also non-obvious from the documentation.
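
On the webhook check specifically: GitLab sends the configured secret back verbatim in the X-Gitlab-Token header, so the comparison against the stored secret must not short-circuit. One way to do it (a sketch of the approach, not CICero's exact code) is to hash both sides to equal-length buffers so crypto.timingSafeEqual never throws:

```ts
import { createHash, timingSafeEqual } from "node:crypto";

// Compare the X-Gitlab-Token header against the configured secret in
// constant time. Hashing first normalises lengths, so timingSafeEqual
// cannot throw and the comparison leaks nothing about the secret.
export function isValidWebhook(headerToken: string | undefined, secret: string): boolean {
  if (!headerToken) return false;
  const a = createHash("sha256").update(headerToken).digest();
  const b = createHash("sha256").update(secret).digest();
  return timingSafeEqual(a, b);
}
```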

The agentic Claude loop needed careful prompt engineering to produce structured, consistent output while staying within the token budget on a latency-sensitive webhook path.

What I learned

Building a real agentic loop where Claude autonomously decides what information it needs before answering is qualitatively different from a single prompt/response. The tool-use architecture makes the system genuinely more capable than any static prompt could be, but it requires careful design around round limits, token budgets, and graceful degradation when the API is unavailable.

I also learned that making invisible infrastructure costs visible is genuinely motivating for engineering teams. The number on the screen changes behaviour.

Built With

  * anthropic-claude
  * gitlab
  * google-cloud-run
  * node.js
  * typescript
  * vitest
