AI RaidMeter — Green Coding Coach

From tokenmaxxing to value-aware AI governance. 🌱

💡 Inspiration

Most teams measure AI adoption with one number: how many tokens did you burn? It's the easiest metric to collect — and the most misleading one. It rewards looking busy. 🔥

A developer who burns 1M tokens thrashing through a problem outranks one who solves the same task cleanly at a fifth of the cost. We kept seeing the same waste patterns in AI-assisted coding:

📂 reading an entire file before locating the bug
🔁 looping on local tests instead of verifying in the cloud
♻️ retrying a failing command without ever reading the error

These aren't coding mistakes. They're workflow anti-patterns — and a token leaderboard can't see them.

❓ So we asked a different question: not who used the most AI, but which AI workflows actually created value — and can an agent catch the waste before it happens?

⚙️ What it does

AI RaidMeter inspects how an AI coding session actually unfolded, judges it with a multi-criteria method instead of a single number, and helps a developer improve against their own past baseline. It runs as a closed loop:

🛫 Pre-flight Guardrails (before the work): From the patterns a developer fell into before, the agent predicts the next task's risks and issues concrete guardrails up front (cap failed deploys at 2, never paste 30+ lines of raw logs, validate templates before remote deploy). A proactive agent, not a post-mortem dashboard.
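A minimal sketch of the pre-flight idea, assuming a plain lookup table stands in for the Gemini-driven prediction. The guardrail wording comes from the paragraph above; the names `GUARDRAILS` and `preflight_guardrails`, the `min_hits` threshold, and the pairing of each guardrail with a specific anti-pattern are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical lookup: anti-pattern name -> guardrail text.
# Which guardrail answers which anti-pattern is an assumption.
GUARDRAILS = {
    "Blind Retry": "Cap failed deploys at 2, then stop and read the error.",
    "Context Hoarding": "Never paste 30+ lines of raw logs.",
    "Local Loop": "Validate templates before remote deploy.",
}


def preflight_guardrails(past_sessions, min_hits=2):
    """Return guardrails for anti-patterns seen in at least `min_hits` past sessions."""
    # Count each anti-pattern once per session, however often it fired.
    hits = Counter(name for session in past_sessions for name in set(session))
    return [
        GUARDRAILS[name]
        for name, count in hits.most_common()
        if count >= min_hits and name in GUARDRAILS
    ]


# Each inner list holds the anti-patterns detected in one earlier session.
history = [
    ["Blind Retry", "Context Hoarding"],
    ["Blind Retry", "Local Loop"],
    ["Blind Retry", "Context Hoarding", "Full-file Devotion"],
]
for rule in preflight_guardrails(history):
    print(rule)
```

Here only the two patterns that recur across sessions produce guardrails, so a one-off slip does not clutter the next task's briefing.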

🔍 Detection (during): Sessions are checked against seven token-waste anti-patterns (Full-file Devotion, Local Loop, Blind Retry, Context Hoarding, Sticky Command…), each a machine-readable rule over real trace fields.
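To make "a rule over trace fields" concrete, here is a hedged sketch of one rule, Blind Retry, written as a small function. The `ToolCall` record, its field names, and the `max_repeats` threshold are hypothetical; real traces would need an adapter to produce this shape.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One step of a session trace. All field names are hypothetical."""
    command: str
    exit_code: int
    error_was_read: bool = False  # did the agent inspect the error afterwards?


def detect_blind_retry(calls, max_repeats=2):
    """Flag more than `max_repeats` consecutive failures of the same command
    with no read of the error output in between."""
    streak = 0
    last_command = None
    for call in calls:
        failed_blind = call.exit_code != 0 and not call.error_was_read
        if failed_blind and call.command == last_command:
            streak += 1
        elif failed_blind:
            streak = 1
        else:
            streak = 0
        # Only an unread failure can be continued by the next call.
        last_command = call.command if failed_blind else None
        if streak > max_repeats:
            return True
    return False


trace = [ToolCall("gcloud run deploy", exit_code=1) for _ in range(3)]
print(detect_blind_retry(trace))  # True
```

Reading the error, or switching to a different command, resets the streak, which is what separates deliberate iteration from blind retrying in this sketch.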

🩺 Clinical judgment (after): A signal is never a verdict. A weighted, multi-criteria scorer combines signals with task type, outcome, time-box and historical baseline, and applies justification credits, so a hard production incident isn't punished like idle thrashing. Only when evidence converges does a session reach Level 3.
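The shape of such a scorer can be sketched as follows. Every number here (signal points, context penalties, credits, level thresholds) is an illustrative assumption, not RaidMeter's actual weighting, and only the five anti-pattern names listed above are used.

```python
# Illustrative points per signal; integers avoid float rounding surprises.
SIGNAL_POINTS = {
    "Blind Retry": 30,
    "Local Loop": 25,
    "Context Hoarding": 20,
    "Full-file Devotion": 15,
    "Sticky Command": 10,
}
# Hypothetical justification credits that reduce the evidence total.
JUSTIFICATION_CREDITS = {"production_incident": 40, "within_time_box": 10}


def waste_level(signals, outcome_ok, over_baseline, justifications=()):
    """Combine signals with context and return a level from 0 to 3."""
    if not signals:
        return 0
    evidence = sum(SIGNAL_POINTS.get(name, 0) for name in signals)
    if not outcome_ok:
        evidence += 20  # a failed outcome corroborates the signals
    if over_baseline:
        evidence += 20  # so does spending more than the personal baseline
    credit = sum(JUSTIFICATION_CREDITS.get(j, 0) for j in justifications)
    evidence = max(0, evidence - credit)
    if evidence >= 110:
        return 3
    if evidence >= 50:
        return 2
    return 1


all_five = list(SIGNAL_POINTS)
print(waste_level(all_five, outcome_ok=False, over_baseline=True))  # 3
print(waste_level(all_five, outcome_ok=False, over_baseline=True,
                  justifications=["production_incident"]))  # 2
print(waste_level(["Blind Retry"], outcome_ok=True, over_baseline=False))  # 1
```

In this sketch a single signal with a good outcome stays at Level 1, and the same five signals drop a level once a production incident is credited.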

📊 Scoring & coaching: Three scores (Current Value; Delta, measured against your own baseline; and Green Efficiency, an estimated saving reported as a proxy) feed a natural-language coaching report with a diagnosis and specific prescriptions.

🎯 The headline result

Same developer, before & after coaching, on the same class of cloud-deployment bugfix. The judgment behind this is produced live by a Gemini agent reading real traces through the Arize Phoenix MCP server. ✨

	🔴 Before	🟢 After
Tokens	1,000K	420K ⬇️
Time	95 min	38 min ⬇️
Anti-patterns	5 (Level 3)	0 (Level 0) ✅
Outcome	PR rejected ❌	PR merged ✅
RaidMeter Score	0.0	56.1 🚀

📉 −58% tokens · −60% time · 5 → 0 anti-patterns · rejected → merged — Delta Score 75.4
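The two percentages in the summary line follow directly from the table. A quick check, using only the table values (the RaidMeter and Delta scores are not recomputed here, since their formulas are not given in this write-up):

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


tokens = pct_change(1_000_000, 420_000)  # table: 1,000K -> 420K tokens
minutes = pct_change(95, 38)             # table: 95 min -> 38 min
print(f"{tokens:+.0f}% tokens, {minutes:+.0f}% time")  # -58% tokens, -60% time
```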

🛠️ How we built it

Three pillars, all wired to live data:

🧠 Gemini — drives every piece of reasoning: detection narrative, multi-criteria coaching, pre-flight prediction. (3.5 Flash in production for fast warm responses; 3.1 Pro for deeper agent reasoning.)

🏗️ Google Cloud Agent Builder (ADK) — an LlmAgent built with the Agent Development Kit, deployed as its own Cloud Run service with a live dev-ui judges can interact with. Real tools, plans, reports — not a scripted reply.

🔭 Arize Phoenix MCP — the agent queries genuine traces via the Phoenix MCP server. OpenInference auto-captures each Gemini call (tokens, latency, spans); an adapter reads those real spans back into the detector. The loop is closed, not mocked.
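A rough sketch of what the adapter step could look like once spans have been fetched. The span shape (a dict with an "attributes" mapping) and the function name `session_features` are assumptions; the attribute keys are meant to follow the OpenInference semantic conventions and should be checked against the spans your instrumentation actually emits.

```python
def session_features(spans):
    """Fold a list of span dicts into per-session counters for the detector.

    Assumed span shape: {"name": str, "attributes": {key: value}}.
    """
    total_tokens = 0
    llm_calls = 0
    tool_calls = 0
    for span in spans:
        attrs = span.get("attributes", {})
        kind = attrs.get("openinference.span.kind")
        if kind == "LLM":
            llm_calls += 1
            # Missing counts are treated as zero rather than failing the session.
            total_tokens += int(attrs.get("llm.token_count.total", 0))
        elif kind == "TOOL":
            tool_calls += 1
    return {
        "total_tokens": total_tokens,
        "llm_calls": llm_calls,
        "tool_calls": tool_calls,
    }


# Made-up spans, for illustration only.
sample = [
    {"name": "generate", "attributes": {"openinference.span.kind": "LLM",
                                        "llm.token_count.total": 1200}},
    {"name": "read_file", "attributes": {"openinference.span.kind": "TOOL"}},
    {"name": "generate", "attributes": {"openinference.span.kind": "LLM",
                                        "llm.token_count.total": 800}},
]
print(session_features(sample))
```

Keeping the adapter this thin means the detection rules never touch the tracing backend directly, so they can be tested against hand-written spans like the ones above.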

The dashboard, coaching API, pre-flight API and live-trace panel run as a Flask app on Cloud Run. Outcome data is mocked for the demo; the trace intelligence is real. 🎯

🧗 Challenges we ran into

🗂️ Read-only container filesystem: On Cloud Run only /tmp is writable, so npx couldn't write its npm cache and the MCP tools silently failed to load. We fixed it by pre-installing the Phoenix MCP package at Docker build time and pointing the npm cache at /tmp, so nothing is fetched at runtime.
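One way to express the cache redirect from Python, as a sketch: npm picks up configuration from `npm_config_*` environment variables, so the process that launches the MCP server can be handed an environment whose cache points at a writable path. The helper name `mcp_subprocess_env` and the `/tmp/.npm` location are assumptions, not the project's actual code.

```python
import os


def mcp_subprocess_env(base_env=None, cache_dir="/tmp/.npm"):
    """Return a copy of the environment with npm's cache on a writable path."""
    env = dict(os.environ if base_env is None else base_env)
    env["npm_config_cache"] = cache_dir  # read by npm and npx as the `cache` setting
    return env


env = mcp_subprocess_env({"PATH": "/usr/bin"})
print(env["npm_config_cache"])  # /tmp/.npm
```

The returned mapping can be passed as `env=` to whatever starts the MCP server process; the caller's own environment is left untouched.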

⚔️ Dependency conflict: ADK and the Phoenix client pin incompatible OpenTelemetry versions. We split them: the Flask service and the ADK agent service each run their own dependency set, in isolated deployments.

🥶 Cold-start & timeouts: the observability stack is heavy to import. We raised memory, lengthened the gunicorn timeout, and kept a warm instance so the live demo never stalls.
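For the timeout part, gunicorn reads its settings from a Python config file. The values below are illustrative assumptions, not the project's actual settings; memory and the warm instance are configured on the Cloud Run service itself rather than here.

```python
# gunicorn.conf.py (illustrative values only)
workers = 1    # a single process keeps memory use predictable
threads = 8    # concurrent requests are served by threads in that process
timeout = 120  # seconds before a silent worker is restarted; gunicorn's default is 30
```

A longer timeout gives slow first requests room to finish while heavy imports complete, at the cost of taking longer to recycle a worker that is genuinely stuck.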

🎓 Keeping judgment honest — the hard design problem was resisting single-signal verdicts. The justification-credit layer is what makes the scoring credible.

🏆 Accomplishments we're proud of

🔄 A genuinely closed loop on one page: pre-flight → live session → real Arize trace → detection → clinical scoring → coaching.
🤖 The agent runs live and interactive for judges, querying real traces via Phoenix MCP — the partner integration is core, not decoration.
🤝 Scoring measures people against their own past, not against each other — a coach, not a surveillance tool.

📚 What we learned

Token count is an incentive, not a measurement — and the wrong incentive produces the wrong behavior. The real signal lives in the shape of a session (the tool calls, the retries, the reads), which is exactly what trace observability exposes. And a useful judgment is multi-criteria: one symptom is never a diagnosis. 🩺

🚀 What's next for AI RaidMeter

🔌 Real outcome connectors (PR / CI / issue status) to replace the mock outcome layer.
📈 A growing personal baseline so the Delta score reflects a real improvement trend over time.
👥 A team-level view that surfaces high-value workflows to template — always at the workflow level, never as a ranking of people.

🧰 Built with

Gemini · Google Cloud Agent Builder (ADK) · Arize Phoenix (observability + MCP) · OpenInference · Cloud Run · Flask · Python

🔗 Live dashboard: https://ai-raidmeter-733974887555.us-central1.run.app
🔗 Interactive agent (dev-ui): https://ai-raidmeter-agent-733974887555.us-central1.run.app/dev-ui/

Submitted to

Google Cloud Rapid Agent Hackathon

Created by

Built by Chloe Kao and MR. DRIVER.

Chloe Kao led the product and AI orchestration: defining the core concept (seven AI-coding waste anti-patterns), the clinical multi-criteria judgment model, the three-score system (Value / Delta / Green), and the overall architecture, and directing the build through a chat-coding workflow in which she owned all design and logic decisions.

MR. DRIVER led engineering and deployment: building and shipping the full system on Google Cloud, including the Gemini agent (ADK / Agent Builder), the Arize Phoenix MCP integration, the live trace pipeline (OpenInference), the Flask dashboard, and both Cloud Run services.

Chloe Kao
I did not enter this space through an IDE. I entered it through a dialogue box.

Updates

Chloe Kao started this project — Jun 06, 2026 05:02 PM EDT
