🐱 AI RaidMeter — Green Coding Coach
From tokenmaxxing to value-aware AI governance. 🌱
💡 Inspiration
Most teams measure AI adoption with one number: how many tokens did you burn? It's the easiest metric to collect — and the most misleading one. It rewards looking busy. 🔥
A developer who burns 1M tokens thrashing through a problem outranks one who solves the same task cleanly at a fifth of the cost. We kept seeing the same waste patterns in AI-assisted coding:
- 📂 reading an entire file before locating the bug
- 🔁 looping on local tests instead of verifying in the cloud
- ♻️ retrying a failing command without ever reading the error
These aren't coding mistakes. They're workflow anti-patterns — and a token leaderboard can't see them.
❓ So we asked a different question: not who used the most AI, but which AI workflows actually created value — and can an agent catch the waste before it happens?
⚙️ What it does
AI RaidMeter inspects how an AI coding session actually unfolded, judges it with a multi-criteria method instead of a single number, and helps a developer improve against their own past baseline. It runs as a closed loop:
🛫 Pre-flight Guardrails (before the work): From the patterns a developer fell into before, the agent predicts the next task's risks and issues concrete guardrails up front (cap failed deploys at 2, never paste 30+ lines of raw logs, validate templates before remote deploy). A proactive agent, not a post-mortem dashboard.
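One way to picture the pre-flight step is a lookup from recurring anti-patterns to guardrail messages. The sketch below is illustrative only: the function name, the `min_occurrences` threshold, and the pairing of each guardrail with a specific anti-pattern are assumptions, and the real agent uses Gemini to predict risks rather than a static table.

```python
from collections import Counter

# Hypothetical pairing of anti-pattern names with the guardrails quoted above.
GUARDRAILS = {
    "Blind Retry": "Cap failed deploys at 2.",
    "Context Hoarding": "Never paste 30+ lines of raw logs.",
    "Local Loop": "Validate templates before remote deploy.",
}


def preflight_guardrails(past_sessions, min_occurrences=2):
    """Return guardrails for anti-patterns seen in at least min_occurrences sessions.

    past_sessions is a list of lists of anti-pattern names, one list per session.
    """
    # Count each pattern once per session so a single bad session cannot dominate.
    counts = Counter(name for session in past_sessions for name in set(session))
    recurring = [
        name for name, n in counts.items()
        if n >= min_occurrences and name in GUARDRAILS
    ]
    # Most frequent first, then alphabetical, so the output is deterministic.
    recurring.sort(key=lambda name: (-counts[name], name))
    return [GUARDRAILS[name] for name in recurring]
```

Patterns that showed up only once are left out, which keeps the guardrail list short enough to read before starting a task.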
🔍 Detection (*during*): Sessions are checked against seven token-waste anti-patterns (Full-file Devotion, Local Loop, Blind Retry, Context Hoarding, Sticky Command…), each a machine-readable rule over real trace fields.
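As a rough illustration of what a machine-readable rule could look like, here is a minimal sketch of a Blind Retry check. The `ToolCall` shape, its field names, and the `max_repeats` threshold are assumptions made for the example, not the project's actual rule schema.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical trace record: one shell command and its exit code."""
    command: str
    exit_code: int


def detect_blind_retry(calls, max_repeats=2):
    """Return True if the same failing command runs more than max_repeats times in a row."""
    streak = 0
    previous_failed_command = None
    for call in calls:
        if call.exit_code == 0:
            # A success breaks any streak of failures.
            streak = 0
            previous_failed_command = None
            continue
        if call.command == previous_failed_command:
            streak += 1
        else:
            streak = 1
        previous_failed_command = call.command
        if streak > max_repeats:
            return True
    return False
```

The rule only emits a signal. Whether that signal matters is left to the judgment step, in keeping with the idea that a signal is never a verdict.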
🩺 Clinical judgment (*after*): A signal is never a verdict. A weighted, multi-criteria scorer combines signals with task type, outcome, time-box and historical baseline, and applies justification credits, so a hard production incident isn't punished like idle thrashing. Only when evidence converges does a session reach Level 3.
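The idea of weighting signals and then subtracting justification credits can be sketched as follows. Every number here (weights, credit values, level thresholds) is a placeholder chosen for illustration, and the sketch leaves out task type, outcome, time-box and baseline, which the real scorer also considers.

```python
def severity_level(signals, weights, credits, thresholds=(1.0, 2.0, 3.0)):
    """Combine weighted signals, subtract justification credits, and map to a level.

    signals:    names of anti-patterns detected in the session
    weights:    hypothetical weight per anti-pattern name
    credits:    hypothetical justification credits, e.g. {"production_incident": 2.0}
    thresholds: hypothetical cut-offs for Level 1, Level 2 and Level 3
    """
    raw = sum(weights.get(name, 0.0) for name in signals)
    # Credits lower the score but never push it below zero.
    adjusted = max(0.0, raw - sum(credits.values()))
    # The level is the number of thresholds the adjusted score reaches.
    level = sum(adjusted >= t for t in thresholds)
    return adjusted, level
```

With these placeholder numbers, the same three signals reach Level 3 on a routine task and only Level 1 once a production-incident credit applies, which is the behaviour the justification layer is meant to produce.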
📊 Scoring & coaching: Three scores feed a natural-language coaching report with a diagnosis and specific prescriptions: Current Value, Delta (vs. your own baseline), and Green Efficiency (estimated savings, reported as a proxy).
🎯 The headline result
Same developer, before & after coaching, on the same class of cloud-deployment bugfix. The judgment behind this is produced live by a Gemini agent reading real traces through the Arize Phoenix MCP server. ✨
| | 🔴 Before | 🟢 After |
|---|---|---|
| Tokens | 1,000K | 420K ⬇️ |
| Time | 95 min | 38 min ⬇️ |
| Anti-patterns | 5 (Level 3) | 0 (Level 0) ✅ |
| Outcome | PR rejected ❌ | PR merged ✅ |
| RaidMeter Score | 0.0 | 56.1 🚀 |
📉 −58% tokens · −60% time · 5 → 0 anti-patterns · rejected → merged — Delta Score 75.4
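The percentage reductions in the summary line follow directly from the table. A small check, using a hypothetical helper name, is shown below; it reproduces only the token and time deltas, not the RaidMeter Score or the Delta Score, whose formulas are not given here.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent (negative means a reduction)."""
    return (after - before) * 100 / before


token_change = percent_change(1_000_000, 420_000)  # tokens: 1,000K -> 420K
time_change = percent_change(95, 38)               # minutes: 95 -> 38

print(f"{token_change:.0f}% tokens, {time_change:.0f}% time")
```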
🛠️ How we built it
Three pillars, all wired to live data:
🧠 Gemini — drives every piece of reasoning: detection narrative, multi-criteria coaching, pre-flight prediction. (3.5 Flash in production for fast warm responses; 3.1 Pro for deeper agent reasoning.)
🏗️ Google Cloud Agent Builder (ADK) — an LlmAgent built with the Agent Development Kit, deployed as its own Cloud Run service with a live dev-ui judges can interact with. Real tools, plans, reports — not a scripted reply.
🔭 Arize Phoenix MCP — the agent queries genuine traces via the Phoenix MCP server. OpenInference auto-captures each Gemini call (tokens, latency, spans); an adapter reads those real spans back into the detector. The loop is closed, not mocked.
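To make the adapter idea concrete, here is a minimal sketch that folds captured spans into the totals a detector might consume. It assumes each span arrives as a plain dict with a flat `attributes` mapping and a `latency_ms` field, which is an illustrative layout rather than the exact shape Phoenix returns. The attribute keys `openinference.span.kind` and `llm.token_count.total` follow the OpenInference semantic conventions as we understand them.

```python
def summarize_llm_spans(spans):
    """Aggregate token and latency totals over LLM spans only.

    spans: iterable of dicts with an "attributes" mapping and a "latency_ms"
    number (an assumed, simplified layout for this sketch).
    """
    llm_calls = 0
    total_tokens = 0
    total_latency_ms = 0.0
    for span in spans:
        attrs = span.get("attributes", {})
        # Skip tool, chain and other non-LLM spans.
        if attrs.get("openinference.span.kind") != "LLM":
            continue
        llm_calls += 1
        total_tokens += int(attrs.get("llm.token_count.total", 0))
        total_latency_ms += float(span.get("latency_ms", 0.0))
    return {
        "llm_calls": llm_calls,
        "total_tokens": total_tokens,
        "total_latency_ms": total_latency_ms,
    }
```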
The dashboard, coaching API, pre-flight API and live-trace panel run as a Flask app on Cloud Run. Outcome data is mocked for the demo; the trace intelligence is real. 🎯
🧗 Challenges we ran into
🗂️ Read-only container filesystem — On Cloud Run only /tmp is writable, so npx couldn't write its npm cache and the MCP tools silently failed to load. Fixed by pre-installing the Phoenix MCP package at Docker build time + pointing the cache at /tmp — nothing fetched at runtime.
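The cache redirection can be sketched from the Python side as an environment passed to whatever subprocess launches the MCP server. npm reads configuration from `npm_config_*` environment variables, so setting `npm_config_cache` moves the cache; the helper name and the `/tmp/.npm` path are assumptions for this example.

```python
import os


def mcp_subprocess_env(base_env=None, cache_dir="/tmp/.npm"):
    """Return a copy of the environment with the npm cache pointed at a writable path."""
    env = dict(os.environ if base_env is None else base_env)
    # npm treats npm_config_cache as the "cache" config option.
    env["npm_config_cache"] = cache_dir
    return env
```

The returned mapping can be handed to `subprocess.Popen(..., env=env)` or an equivalent launcher. With the package already installed at build time, the cache setting mainly guards against npx attempting a write to a read-only path.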
⚔️ Dependency conflict — ADK and the Phoenix client pin incompatible OpenTelemetry versions. We split them: Flask service and ADK agent service each run their own set, in isolated deployments.
🥶 Cold-start & timeouts: the observability stack is heavy to import. We raised memory, lengthened the gunicorn timeout, and kept a warm instance so the live demo never stalls.
🎓 Keeping judgment honest — the hard design problem was resisting single-signal verdicts. The justification-credit layer is what makes the scoring credible.
🏆 Accomplishments we're proud of
- 🔄 A genuinely closed loop on one page: pre-flight → live session → real Arize trace → detection → clinical scoring → coaching.
- 🤖 The agent runs live and interactive for judges, querying real traces via Phoenix MCP — the partner integration is core, not decoration.
- 🤝 Scoring measures people against their own past, not against each other — a coach, not a surveillance tool.
📚 What we learned
Token count is an incentive, not a measurement — and the wrong incentive produces the wrong behavior. The real signal lives in the shape of a session (the tool calls, the retries, the reads), which is exactly what trace observability exposes. And a useful judgment is multi-criteria: one symptom is never a diagnosis. 🩺
🚀 What's next for AI RaidMeter
- 🔌 Real outcome connectors (PR / CI / issue status) to replace the mock outcome layer.
- 📈 A growing personal baseline so the Delta score reflects a real improvement trend over time.
- 👥 A team-level view that surfaces high-value workflows to template — always at the workflow level, never as a ranking of people.
🧰 Built with
Gemini · Google Cloud Agent Builder (ADK) · Arize Phoenix (observability + MCP) · OpenInference · Cloud Run · Flask · Python
🔗 Live dashboard: https://ai-raidmeter-733974887555.us-central1.run.app
🔗 Interactive agent (dev-ui): https://ai-raidmeter-agent-733974887555.us-central1.run.app/dev-ui/
