decarbify.rl

Inspiration

Training a single large language model can emit as much CO$_2$ as five cars over their entire lifetimes. But here's the thing most people miss: the same GPU job produces wildly different emissions depending on where and when you run it. A training run on California solar at noon emits a fraction of what the identical workload produces on Australian coal at night — we're talking 5–10x differences across regions and hours.

The infrastructure to exploit this already exists. Every major cloud provider operates datacenters on multiple continents, each plugged into a different electricity grid. Real-time carbon intensity APIs (like Electricity Maps) tell you exactly how dirty each grid is, right now. The missing piece isn't hardware or data — it's the intelligence to use it.

That's what inspired decarbify.rl. We wanted to build a system that learns, from experience, how to route AI training jobs across the globe to minimise carbon emissions — without sacrificing performance or blowing up costs.

What it does

decarbify.rl is a reinforcement learning system that controls a fleet of five datacenters spread across the globe — California, Germany, Chile, Singapore, and Australia — each powered by a distinct energy mix. Every 15 minutes, a trained Soft Actor-Critic (SAC) agent observes the full system state:

  • Carbon intensity at each datacenter (gCO$_2$/kWh)
  • Electricity prices (USD/kWh)
  • Temperature (affects cooling overhead)
  • Queue depth and resource utilisation

It then makes two decisions for every batch of incoming AI training tasks:

  1. Route — which datacenter should run this job right now?
  2. Defer — or should we hold it back and wait for a cleaner grid window?

The agent learns these decisions entirely from experience. It doesn't need carbon forecasts or hand-tuned rules. It discovers on its own that California is cleanest during solar hours, Germany swings wildly with wind, and Chile is generally reliable but not always the best option. It shifts workloads aggressively as conditions change.

We benchmark against two baselines:

  • Local Only — tasks stay at their origin datacenter. This is what most real schedulers do today.
  • Lowest Carbon (Greedy) — always route to whichever DC has the lowest carbon intensity right now. Sounds optimal, but it overloads that datacenter, causing SLA violations and high transmission costs.

Results from a 1-day simulation (96 timesteps):

Strategy         Total CO$_2$ (kg)   SLA Violations   Total Cost
RL Agent         ~203                ~2%              \$137
Local Only       ~370                ~6%              \$161
Lowest Carbon    ~240                ~10%             \$199

The RL agent achieves the lowest carbon emissions (~45% reduction vs local), the lowest SLA violation rate, and the lowest cost — simultaneously. It doesn't just optimise one metric at the expense of others. The greedy baseline proves that naively chasing the cleanest grid backfires: you overload it, miss deadlines, and pay transmission surcharges.

The dashboard makes all of this visible in real time. You can watch the agent shift workloads across the globe as grid conditions change — an interactive proof that where and when you compute matters.

How we built it

The RL environment is built on SustainCluster, an OpenAI Gym-compatible simulator from HPE Research. It models five datacenters with realistic carbon intensity traces, weather data, electricity pricing, and production workload patterns drawn from Alibaba's 2020 cluster traces. Each datacenter has distinct diurnal carbon cycles:

  • US-California: Solar-heavy, clean midday (80 gCO$_2$/kWh), dirty evenings with gas peakers (380 gCO$_2$/kWh)
  • Germany: Wind-dependent, volatile hour-to-hour (100–450 gCO$_2$/kWh)
  • Chile: Hydro + solar, generally clean but vulnerable to drought spikes (120–300 gCO$_2$/kWh)
  • Singapore: Natural gas baseline, consistently high (250–520 gCO$_2$/kWh)
  • Australia (NSW): Coal baseline with a strong solar dip at midday (200–600 gCO$_2$/kWh)
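The real SustainCluster traces are data-driven, but the diurnal shape can be pictured with a minimal sketch — `diurnal_carbon` below is our own illustrative cosine approximation of the California-style curve, not the simulator's data:

```python
import math

def diurnal_carbon(hour: float, low: float = 80.0, high: float = 380.0) -> float:
    """Illustrative solar-heavy carbon curve: cleanest near solar noon,
    dirtiest overnight/evening. Not the real SustainCluster trace."""
    # Cosine centred on 13:00: 0 at solar noon, 1 twelve hours away.
    solar = (1 - math.cos(2 * math.pi * (hour - 13) / 24)) / 2
    return low + (high - low) * solar

print(round(diurnal_carbon(13)))  # 80  (clean midday floor)
print(round(diurnal_carbon(1)))   # 380 (dirty overnight peak)
```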

The agent is a Soft Actor-Critic with a two-layer MLP (256 hidden units, LayerNorm, ReLU). It takes a 34-dimensional observation per task — time features (sin/cos encoded), task requirements (cores, GPUs, memory, duration, deadline), and per-DC state (available resources, carbon intensity, price, temperature). It outputs a probability distribution over 6 discrete actions: defer, or dispatch to one of 5 DCs. The reward combines energy price ($w = 0.9$) and carbon emissions ($w = 0.3$).
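The shape of the action head can be sketched in a few lines of stdlib Python — the logits here are placeholders standing in for the SAC policy network's output on one 34-dimensional observation, not the trained model:

```python
import math

N_DCS = 5
ACTIONS = ["defer"] + [f"dispatch_dc{i}" for i in range(N_DCS)]  # 6 discrete actions

def softmax(logits):
    """Turn raw logits into a probability distribution over actions."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Placeholder logits for one observation (time features + task
# requirements + per-DC state would feed the real network).
logits = [0.1, 2.0, -1.0, 0.5, 0.0, -0.5]
probs = softmax(logits)

assert len(probs) == len(ACTIONS)
best = ACTIONS[probs.index(max(probs))]  # greedy pick, for illustration
```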

We trained four checkpoint variants to explore the design space:

Checkpoint                    Routing       Deferral   Use case
multi_action_enable_defer     Any DC        Yes        Maximum flexibility (recommended)
multi_action_disable_defer    Any DC        No         Latency-sensitive workloads
single_action_enable_defer    Origin only   Yes        Conservative, low overhead
single_action_disable_defer   Origin only   No         Minimal intervention baseline

The backend is FastAPI serving simulation results over REST. It runs in two modes: a deterministic mock data generator (zero external dependencies, instant, reproducible) for demos, and a live mode that wraps the real SustainCluster environment with trained checkpoints. The mock generator faithfully reproduces the narrative — realistic diurnal cycles, random events (cloud cover, wind gusts, drought spikes), and strategy-specific routing logic — so the demo tells the same story as the real simulation without requiring the full RL stack.
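The two-mode split can be pictured as a simple selector — a hedged sketch, where `DECARBIFY_MODE`, `mock_timeline`, and `get_timeline` are illustrative names, not the actual backend API:

```python
import os
import random

def mock_timeline(steps: int = 96, seed: int = 42):
    """Deterministic stand-in for the mock generator: same seed, same story,
    so every demo run is instant and reproducible."""
    rng = random.Random(seed)
    return [{"t": t, "co2_kg": 2.0 + rng.random()} for t in range(steps)]

def get_timeline():
    # Mock mode is the default so the demo works with zero external
    # dependencies; live mode would wrap the real SustainCluster env.
    if os.environ.get("DECARBIFY_MODE", "mock") == "mock":
        return mock_timeline()
    raise NotImplementedError("live mode requires the sustain-cluster stack")

timeline = get_timeline()
assert len(timeline) == 96              # one simulated day, 15-min steps
assert timeline == get_timeline()       # reproducible across calls
```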

The frontend is React 18 + TypeScript with Zustand for state management. The dashboard features:

  • Global map (MapLibre GL) showing task distribution across DCs with carbon-intensity-scaled markers
  • Sliding CO$_2$ time-series chart (Plotly.js) with a 12-hour rolling window comparing all strategies
  • Per-datacenter carbon snapshots with energy, temperature, and utilisation breakdowns
  • RL savings panel computing real-time deltas vs baselines across CO$_2$, cost, water, and SLA
  • Action probability heatmap visualising the agent's decision distribution at each timestep
  • Playback controls with variable speed (0.5x–4x) and manual scrubbing

Everything is styled in a cyber-neon dark theme with strategy-coded colours: green for RL, red for Local, yellow for Lowest Carbon.

Challenges we ran into

Thread safety in the simulator. SustainCluster uses os.chdir() internally, which is process-global. Running multiple simulations concurrently caused race conditions where one evaluation would corrupt another's working directory. We had to serialise live-mode evaluations through a single-worker thread pool — not ideal for throughput, but correct.
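The fix can be sketched with a single-worker executor — `run_episode` here is an illustrative stand-in for the real live-mode evaluation, not our actual code:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# One worker => live-mode evaluations run strictly one at a time, so the
# simulator's process-global os.chdir() can't corrupt a sibling run.
_live_executor = ThreadPoolExecutor(max_workers=1)

def run_episode(workdir: str) -> str:
    """Stand-in for an evaluation that chdirs like SustainCluster does."""
    prev = os.getcwd()
    try:
        os.chdir(workdir)       # what the simulator does internally
        return os.getcwd()
    finally:
        os.chdir(prev)          # always restore, even on error

def evaluate(workdir: str) -> str:
    # Callers submit through the shared executor instead of running inline.
    return _live_executor.submit(run_episode, workdir).result()

with tempfile.TemporaryDirectory() as d:
    assert os.path.samefile(evaluate(d), d)
```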

Balancing the reward function. Early reward formulations that weighted carbon too heavily caused the agent to defer almost everything — great for emissions, terrible for SLA compliance. We settled on a weighted combination ($0.9 \times \text{price} + 0.3 \times \text{carbon}$) that lets the agent naturally discover the trade-off between deferring for cleaner windows and meeting deadlines. The price signal acts as an implicit SLA proxy since keeping tasks queued incurs ongoing costs.
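A minimal sketch of that weighted combination, under the assumption that both terms are normalised costs (the sign convention and the example numbers here are ours, for illustration):

```python
W_PRICE, W_CARBON = 0.9, 0.3

def reward(price_cost: float, carbon_cost: float) -> float:
    """Higher costs => lower reward. The price term doubles as an SLA
    proxy, since a deferred task keeps accruing queue/holding cost."""
    return -(W_PRICE * price_cost + W_CARBON * carbon_cost)

# Deferring to a cleaner window only pays off when the carbon saved
# outweighs the extra holding cost the wait incurs.
run_now   = reward(price_cost=1.0, carbon_cost=2.0)  # dirty grid, no wait
run_later = reward(price_cost=1.4, carbon_cost=0.5)  # cleaner grid, held cost
assert run_later > run_now
```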

The greedy baseline trap. We initially expected "always route to the lowest-carbon DC" to be a strong baseline. Instead, it turned out to be the worst strategy for SLA compliance and cost. Every task piles onto the same datacenter, creating contention, queue overflow, and transmission surcharges. This was a genuinely surprising result that validated the need for a learned policy that balances load.

Making the demo self-explanatory. The real simulation takes minutes to run and requires the full sustain-cluster submodule. We needed the dashboard to be instantly compelling at a booth demo. Building a mock data generator that faithfully reproduces the real simulation's narrative — without any ML dependencies — took significant effort, but it means the app loads in seconds and tells the same story every time.

Real-time carbon data reliability. The Electricity Maps API has per-zone rate limits and occasional outages. We implemented per-zone fallback (partial outage doesn't kill the whole dashboard) and default to mock data when the API is unavailable, so the demo never breaks regardless of network conditions.
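The per-zone fallback can be sketched as follows — the `fetch` callable, the zone codes, and the `MOCK_INTENSITY` table are illustrative stand-ins for the Electricity Maps client and the mock generator:

```python
MOCK_INTENSITY = {"US-CAL-CISO": 180.0, "DE": 260.0, "CL-SEN": 150.0,
                  "SG": 400.0, "AU-NSW": 450.0}  # illustrative gCO2/kWh

def carbon_by_zone(fetch) -> dict:
    """Per-zone fallback: a failing zone gets mock data; the rest stay live,
    so a partial outage never takes down the whole dashboard."""
    out = {}
    for zone, fallback in MOCK_INTENSITY.items():
        try:
            out[zone] = fetch(zone)    # live API call; may raise
        except Exception:
            out[zone] = fallback       # degrade only this zone
    return out

def flaky_fetch(zone):
    """Simulated client: one zone is rate-limited, the rest respond."""
    if zone == "DE":
        raise TimeoutError("rate limited")
    return 100.0

result = carbon_by_zone(flaky_fetch)
assert result["DE"] == 260.0            # fell back to mock
assert result["US-CAL-CISO"] == 100.0   # stayed live
```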

Accomplishments that we're proud of

  • A single lightweight RL policy — no forecasts, no oracle knowledge, just learned behaviour — achieves a 45% CO$_2$ reduction over standard scheduling while also improving cost and SLA compliance
  • The dashboard makes the invisible visible: you can literally watch the agent shift workloads in real time as grid conditions change
  • The mock data system means anyone can experience the full demo with zero setup — npm run dev and you're watching an RL agent decarbonise a datacenter fleet
  • Four trained checkpoint variants let users explore the design space interactively: what happens when you remove deferral? What about restricting routing to the origin DC only?

What we learned

  • Greedy isn't optimal. The most intuitive strategy — always route to the cleanest grid — is actually the worst for overall system health. Load balancing matters as much as carbon intensity.
  • Deferral is the RL agent's secret weapon. The ability to wait for a cleaner grid window, rather than committing immediately, accounts for a significant portion of the emissions reduction. This is something no static rule can replicate well, because the optimal deferral threshold depends on the current state of all five grids simultaneously.
  • The carbon opportunity is real and large. A 5–10x variation in carbon intensity across regions and hours means there is enormous room for optimisation — and almost nobody is doing it today. The infrastructure exists; the policy doesn't.
  • Demo UX matters as much as the model. A trained checkpoint sitting in a folder doesn't convince anyone. An interactive dashboard where you can watch the agent make decisions in real time, see the CO$_2$ trajectories diverge, and scrub through 24 hours of simulation — that's what makes people understand why this matters.

What's next

  • Multi-objective Pareto front: Train agents with different carbon/cost/SLA weightings and let operators choose their preferred trade-off point
  • Integration with real cloud schedulers: Wrap the SAC policy as a Kubernetes scheduler plugin that makes real routing decisions for GPU workloads
  • Longer horizon planning: Extend from reactive 15-minute decisions to planning over hours using carbon intensity forecasts as additional observations
  • Broader workload types: Expand beyond training jobs to inference serving, where latency constraints are tighter but carbon-aware routing is still possible
  • Live deployment pilot: Partner with a multi-region cloud operator to measure real-world emissions reductions against production baselines
