Inspiration

Training a frontier model can use as much electricity as a small town uses in a year. Inspired by Crusoe's track and its call for sustainable solutions, we developed CaSch. It gives the people running the jobs actionable insights, driven by live grid electricity and CO₂ data. You don't need a geothermal data centre to run cleaner. You just need to know when to run.

What it does

You upload your training code to CaSch. It runs a quick profiling pass with Zeus and torch.profiler to measure per-step GPU power draw. It then pulls a 24-hour carbon intensity forecast for your grid zone from WattTime or Electricity Maps and computes a tradeoff curve: the longer you're willing to wait, the less CO₂ you emit. You pick a wall-clock budget and hit run. CaSch inserts pause windows at the dirtiest grid moments and applies that schedule live while your training job runs.
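
To make the tradeoff curve concrete, here is a minimal sketch of how such a curve could be computed. It is not CaSch's actual code. It assumes hourly forecast slots, a constant average power draw, and pauses that cost nothing; the function name `tradeoff_curve` and the forecast values are made up for illustration.

```python
from typing import List, Tuple


def tradeoff_curve(
    forecast_g_per_kwh: List[float],  # one carbon-intensity value per hour
    run_hours: int,                   # hours of compute the job needs
    avg_power_kw: float,              # average draw from the profiling pass
) -> List[Tuple[int, float]]:
    """Return (wall-clock budget in hours, minimum grams of CO2) pairs."""
    if not 0 < run_hours <= len(forecast_g_per_kwh):
        raise ValueError("run_hours must fit inside the forecast window")
    curve = []
    for budget in range(run_hours, len(forecast_g_per_kwh) + 1):
        window = forecast_g_per_kwh[:budget]
        # With constant power and free pauses, running in the cleanest
        # hours of the window is optimal; every other hour is a pause.
        cleanest = sorted(window)[:run_hours]
        curve.append((budget, avg_power_kw * sum(cleanest)))
    return curve


if __name__ == "__main__":
    forecast = [400.0, 300.0, 500.0, 200.0, 100.0, 450.0]  # made-up values
    for budget, grams in tradeoff_curve(forecast, run_hours=2, avg_power_kw=0.3):
        print(f"budget {budget} h -> {grams:.0f} g CO2")
```

Each extra hour of budget can only add a candidate slot, so the curve never goes up as the budget grows.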

How we built it

CaSch has four components. The profiler combines Zeus with torch.profiler to sample power per training step during a short dry run. The carbon fetcher pulls real-time and 24-hour forecast carbon intensity from WattTime and Electricity Maps. The optimizer models your job's power draw as f(t) and the grid's carbon intensity as g(t), then finds the pause positions that minimise ∫f(t)·g(t)dt, the total CO₂ emitted, within your wall-clock budget. Each pause shifts the rest of f(t) later in time, which is what makes the placement non-trivial. The executor applies the policy live using pynvml, with threading locks to safely control GPU state during training. Everything is served by a FastAPI backend exposed through an ngrok tunnel, with a dashboard built entirely with Lovable.
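
One way to discretise the optimizer's objective is sketched below. This dynamic program is an illustration under our own assumptions, not necessarily the formulation CaSch uses. It splits the job and the forecast into equal-length slots and picks the grid slot in which each job slot runs, keeping the job's order. The name `plan_pauses` and the inputs are hypothetical.

```python
from typing import List, Tuple


def plan_pauses(f: List[float], g: List[float]) -> Tuple[float, List[int]]:
    """Minimise sum(f[i] * g[t_i]) over increasing slots t_0 < t_1 < ...

    f[i]: energy (kWh) of the i-th job slot, in execution order.
    g[t]: forecast carbon intensity (gCO2/kWh) of the t-th grid slot.
    len(g) is the wall-clock budget; grid slots not chosen are pauses.
    """
    n, T = len(f), len(g)
    if n > T:
        raise ValueError("budget too short for the job")
    INF = float("inf")
    # best[i][t]: least CO2 after finishing i job slots in the first t grid slots
    best = [[INF] * (T + 1) for _ in range(n + 1)]
    for t in range(T + 1):
        best[0][t] = 0.0
    for i in range(1, n + 1):
        for t in range(i, T + 1):
            pause = best[i][t - 1]                          # idle in grid slot t-1
            run = best[i - 1][t - 1] + f[i - 1] * g[t - 1]  # run job slot i-1 there
            best[i][t] = min(pause, run)
    # Walk back through the table to recover the chosen grid slots.
    slots, i, t = [], n, T
    while i > 0:
        if best[i][t] == best[i][t - 1]:
            t -= 1                       # grid slot t-1 was a pause
        else:
            slots.append(t - 1)          # job slot i-1 ran in grid slot t-1
            i, t = i - 1, t - 1
    return best[n][T], slots[::-1]


if __name__ == "__main__":
    cost, slots = plan_pauses(f=[1.0, 5.0], g=[100.0, 40.0, 60.0, 300.0])
    print(cost, slots)
```

With these made-up inputs, picking the two cleanest grid slots (1 and 2) costs 340 g, while the dynamic program runs the heavy second slot in the cleanest hour and reaches 300 g with slots 0 and 1. That is the kind of non-greedy placement the objective rewards.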

Challenges we ran into

Finding pause positions that are analytically good, not just greedy, meant formulating the optimisation problem carefully. Grid forecast APIs have inconsistent resolution and zone coverage. Making pynvml state changes thread-safe during live training took careful locking.
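
The locking problem can be illustrated with a small pause gate built only on the standard library. This is a sketch of the pattern, not CaSch's executor. The `PauseGate` class and `train` loop are hypothetical, and the pynvml calls the real executor makes are left out; they would sit inside the same lock.

```python
import threading
import time


class PauseGate:
    """Lets a scheduler thread pause a training loop at step boundaries."""

    def __init__(self) -> None:
        self._running = threading.Event()
        self._running.set()              # start in the "running" state
        self._lock = threading.Lock()    # serialises state changes
        self.pauses = 0

    def pause(self) -> None:
        with self._lock:
            if self._running.is_set():
                self._running.clear()
                self.pauses += 1         # GPU state changes would go here

    def resume(self) -> None:
        with self._lock:
            self._running.set()

    def wait_if_paused(self) -> None:
        # Called by the training loop between steps; blocks while paused.
        self._running.wait()


def train(gate: PauseGate, steps: int, done: list) -> None:
    for step in range(steps):
        gate.wait_if_paused()
        time.sleep(0.001)                # stand-in for one training step
        done.append(step)


if __name__ == "__main__":
    gate, done = PauseGate(), []
    worker = threading.Thread(target=train, args=(gate, 50, done))
    gate.pause()                         # dirty grid window: hold the job
    worker.start()
    time.sleep(0.05)
    print("steps while paused:", len(done))
    gate.resume()                        # cleaner window: let it run
    worker.join()
    print("steps after resume:", len(done))
```

Pausing only at step boundaries keeps the training loop itself free of locks; the scheduler thread is the only one that changes state.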

Accomplishments that we're proud of

A working end-to-end pipeline in under 12 hours of actual coding. The CO₂ vs duration tradeoff curve makes an abstract problem immediately concrete. The optimizer finds non-obvious pause positions that genuinely align idle time with dirty grid windows.

What we learned

We spent more time brainstorming than coding, and that was the right call. Early ideas were fine but not interesting. Once the core formulation clicked, the code followed fast. Zed Agent cut our development time by a factor of three. Lovable let us ship a real frontend in a single call!

What's next for CaSch

Broader grid zone coverage. Multi-GPU and distributed training support. Continuous throttle levels instead of binary pause/resume — closer to what Crusoe does at the infrastructure level, but accessible to any researcher with a GPU and a training script.
