๐Ÿš€ CarbonShift: Grid-Aware AI Training Orchestrator

๐Ÿ’ก Inspiration

The idea for CarbonShift hit us when we looked at two simple charts side-by-side.

On the left: AI compute demand, which is growing exponentially and running 24/7. On the right: The energy grid, which fluctuates wildly in price and cleanliness throughout the day.

We realized that training a model at 4 PM (when the grid is dirty and expensive) versus 3 AM (when wind energy is abundant and free) yields the exact same modelโ€”but with vastly different environmental and financial costs.

A single training run can emit hundreds of tons of COโ‚‚, yet our GPU clusters are completely blind to the world outside the data center. We asked ourselves: What if our AI models knew when the sun was shining?

๐Ÿ’ป What it does

CarbonShift is an intelligent orchestration layer that sits between your AI training workload and the power grid. It monitors real-time electricity prices and carbon intensity, then dynamically adjusts GPU power consumption to train during optimal conditions.

  • Green State (Clean/Cheap): The system boosts GPU power to maximum (250W) to speed up training.
  • Red State (Dirty/Expensive): The system pauses training or throttles power down (100W) to wait for better conditions.

It turns AI training from a dumb, constant load into an intelligent, grid-responsive asset.

โš™๏ธ How we built it

1. The Architecture

We built CarbonShift as a modular system with a "Brain" (the decision engine) and a "Body" (the training agents).

The Stack:

  • Hardware: NVIDIA A100 GPUs (via Brev.dev)
  • Backend: Python, FastAPI
  • AI Agents: Fetch.ai uAgents
  • Intelligence: OpenAI GPT-4 & Anthropic Claude 3.5 Sonnet
  • Grid Data: CAISO API (Prices) & WattTime (Carbon)

2. The "Eco-Pulse" Algorithm

The core logic runs every 60 seconds. We developed a state-machine algorithm that balances training velocity against environmental impact.

Mathematical logic for the decision thresholds:

$$Score_{grid} = \alpha \cdot P_{norm} + \beta \cdot C_{norm}$$

Where $P$ is price and $C$ is carbon intensity. When the score exceeds our critical threshold, the system triggers a Hardware Interrupt.

We interact directly with the GPU hardware using nvidia-ml-py to physically cap the wattage:

# Simplified Logic
def optimize_gpu(price, carbon):
    if price > CRITICAL_PRICE or carbon > CRITICAL_CARBON:
        # PAUSE: Save checkpoint, drop power to idle
        gpu.set_power_limit(100) 
        training.pause()
        return "RED_STATE"

    elif price < OPTIMAL_PRICE and carbon < OPTIMAL_CARBON:
        # BOOST: Maximize compute throughput
        gpu.set_power_limit(250)
        training.resume()
        return "GREEN_STATE"

3. Autonomous Agents (Fetch.ai)

Instead of a central server telling every GPU what to do, we wrapped our training jobs in Fetch.ai Agents. This allows for decentralized coordination. Each GPU acts as an independent agent that can "negotiate" with the grid, ensuring we don't accidentally spike demand when resuming.

4. LLM Prediction Layer

We didn't just want to react to the present; we wanted to predict the future.

  • GPT-4 analyzes the last 24 hours of grid data to forecast price spikes over the next 4 hours.
  • Claude 3.5 Sonnet acts as the "Explainability Engine," generating human-readable summaries of why the training was paused (e.g., "Training paused due to a sudden spike in coal generation in the CAISO region.").

๐Ÿšง Challenges we ran into

  • The "Zombie" GPU Process: Pausing a training loop without killing the process is incredibly difficult. We had to engineer a custom PyTorch wrapper that could "sleep" the training loop while keeping the model loaded in VRAM, allowing for instant resumption without reloading 80GB of parameters.
  • Grid Data is Messy: Real-time energy data is noisy. The CAISO API would frequently return null values or XML errors. We had to build a robust data cleaning pipeline with exponential backoff and linear interpolation to fill in the gaps.
  • Hardware Limits: We discovered that not all GPUs allow dynamic power limiting via software. We had to add hardware capability detection to prevent the system from crashing on unsupported cards.

๐Ÿ… Accomplishments that we're proud of

  • Real Impact: In our test run training ResNet-18, we achieved a 37% cost reduction and a 43% reduction in carbon footprint compared to a standard continuous run.
  • Hardware Control: Successfully controlling the physical power draw of an NVIDIA A100 via code felt like magic. Watching the wattage drop on our dashboard in real-time as prices spiked was a huge win.
  • Seamlessness: The user doesn't need to change their PyTorch code significantly. Our wrapper handles the complexity.

๐Ÿง  What we learned

  • Energy is Volatile: We assumed electricity prices were somewhat stable. We were wrong. They can jump from $20/MWh to $150/MWh in minutes.
  • Latency Matters: The grid changes fast. A 5-minute delay in data processing can mean the difference between training on wind vs. coal.
  • AI for AI: Using LLMs to optimize the training of other AI models creates a fascinating feedback loop.

๐Ÿ”ฎ What's next for CarbonShift

  • Multi-Region Hopping: Allowing agents to move training jobs geographically (e.g., move the job from Virginia to Oregon if the wind is blowing in the West).
  • Spot Instance Integration: Automatically bidding on spot instances when prices drop.
  • LLM Fine-tuning: Scaling up from ResNet to Llama-3 training runs where the energy savings would be in the thousands of dollars.

Built With: python pytorch fastapi fetch.ai openai anthropic react vite nvidia-ml-py

Built With

Share this project:

Updates