Inspiration

Datacenters drive AI progress but face major efficiency and cooling challenges. Intelligent thermal-management systems are mostly run by tech giants such as NVIDIA and Google, and those systems are closed and opaque. GridGuardian DC was built to open that black box: a transparent, agentic controller that reasons, plans, and self-evaluates datacenter stability in real time.

How It Works

GridGuardian DC simulates a GPU-driven datacenter divided into GPU, CPU, Storage, and Edge clusters. When it detects overheating or power imbalance, a chain of agents restores equilibrium:

1. Monitor Agent – flags thermal or power stress.
2. Planner Agent (Nemotron) – generates a structured JSON plan with actions such as:
   • increase cooling
   • discharge batteries
   • redistribute workloads
3. Executor Agent – applies those actions through tool calls (cooling_tool, battery_tool, redistribute_tool).
4. Verifier Agent – checks if all clusters return within safe limits.
5. Narrator Agent – produces a human-readable trace of the process.
6. Critic & ScenarioGen (Nemotron) – create new stress tests and grade recovery success.
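As an illustration of the Planner-to-Executor handoff, here is a minimal sketch of how a JSON plan could be dispatched to the three tools. Only the tool names (cooling_tool, battery_tool, redistribute_tool) come from the project; the plan layout, the action names, the state fields (temp_c, power_kw), and the toy physics are assumptions made for this example, not the project's actual implementation.

```python
import json

# Toy tools. Signatures, state fields, and coefficients are illustrative assumptions.
def cooling_tool(state, cluster, delta_pct):
    """Raise cooling on one cluster; in this toy model 10% extra cooling removes 2 °C."""
    state[cluster]["temp_c"] -= 0.2 * delta_pct
    return state

def battery_tool(state, cluster, kw):
    """Discharge batteries to offset part of a cluster's grid draw."""
    state[cluster]["power_kw"] -= kw
    return state

def redistribute_tool(state, source, target, kw):
    """Move load (expressed in kW) from one cluster to another."""
    state[source]["power_kw"] -= kw
    state[target]["power_kw"] += kw
    return state

# Hypothetical mapping from plan action names to tools.
TOOLS = {
    "increase_cooling": cooling_tool,
    "discharge_battery": battery_tool,
    "redistribute": redistribute_tool,
}

def execute_plan(plan_json, state):
    """Apply every action in the JSON plan, in order, and return the new state."""
    plan = json.loads(plan_json)
    for step in plan["actions"]:
        tool = TOOLS[step["action"]]
        state = tool(state, **step["args"])
    return state

if __name__ == "__main__":
    state = {
        "gpu": {"temp_c": 90.0, "power_kw": 500.0},
        "edge": {"temp_c": 40.0, "power_kw": 100.0},
    }
    plan = json.dumps({"actions": [
        {"action": "increase_cooling", "args": {"cluster": "gpu", "delta_pct": 50}},
        {"action": "redistribute", "args": {"source": "gpu", "target": "edge", "kw": 100}},
    ]})
    print(execute_plan(plan, state))
```

Keeping the plan as plain data and the tools in a lookup table means the Verifier and Narrator can inspect exactly what was requested before and after execution.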

Every loop follows the ReAct reasoning pattern (Reason → Act → Observe), and each step is logged in full for transparency.
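The Reason → Act → Observe cycle can be sketched as a small loop that records one trace entry per step. This is a simplified, single-sensor illustration: the temperature limit, the fixed 5 °C effect per cooling action, and the trace fields are all assumptions for the example rather than values from the project.

```python
def react_loop(temp_c, limit_c=80.0, max_steps=5):
    """Run Reason -> Act -> Observe until the reading is within the limit."""
    trace = []
    for step in range(1, max_steps + 1):
        # Reason: decide whether any action is needed.
        if temp_c <= limit_c:
            trace.append({"step": step, "reason": "within limit",
                          "action": None, "observation": temp_c})
            break
        reason = f"{temp_c:.1f} °C exceeds {limit_c:.1f} °C"
        # Act: toy model in which one cooling action removes 5 °C.
        action = "increase_cooling"
        temp_c -= 5.0
        # Observe: record the new reading next to the reasoning that led to it.
        trace.append({"step": step, "reason": reason,
                      "action": action, "observation": temp_c})
    return temp_c, trace

if __name__ == "__main__":
    final_temp, trace = react_loop(92.0)
    for entry in trace:
        print(entry)
```

Because every entry stores the reason, the action, and the resulting observation, the trace can be replayed or rendered directly as a human-readable log.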

Nemotron Integration

Nemotron powers three core functions:

• Planning: creates valid control plans obeying numeric limits.
• Scenario Generation: builds random thermal or power failure cases.
• Critique & Scoring: evaluates system performance and suggests improvements.
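To make "valid control plans obeying numeric limits" concrete, the following sketch checks a parsed plan against per-action bounds before it is executed. The action names, argument names, and limit values are hypothetical; the project's real schema is not shown in this write-up.

```python
# Hypothetical limits: action name -> (argument name, minimum, maximum).
LIMITS = {
    "increase_cooling": ("delta_pct", 0, 100),
    "discharge_battery": ("kw", 0, 250),
    "redistribute": ("kw", 0, 500),
}

def validate_plan(plan):
    """Return a list of error strings; an empty list means the plan is acceptable."""
    if not isinstance(plan, dict):
        return ["plan must be a JSON object"]
    actions = plan.get("actions")
    if not isinstance(actions, list) or not actions:
        return ["plan must contain a non-empty 'actions' list"]
    errors = []
    for i, step in enumerate(actions):
        if not isinstance(step, dict):
            errors.append(f"action {i}: must be an object")
            continue
        name = step.get("action")
        if name not in LIMITS:
            errors.append(f"action {i}: unknown action {name!r}")
            continue
        field, lo, hi = LIMITS[name]
        args = step.get("args")
        value = args.get(field) if isinstance(args, dict) else None
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"action {i}: '{field}' must be a number")
        elif not lo <= value <= hi:
            errors.append(f"action {i}: '{field}'={value} outside [{lo}, {hi}]")
    return errors

if __name__ == "__main__":
    bad = {"actions": [{"action": "increase_cooling", "args": {"delta_pct": 130}}]}
    print(validate_plan(bad))
```

Returning a list of errors rather than raising on the first one lets the caller feed all problems back to the model in a single retry prompt.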

If the Nemotron key is unavailable, the system safely falls back to local heuristics, ensuring reproducible offline runs.
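One way such a fallback could be structured is shown below. The environment variable name NEMOTRON_API_KEY, the remote_planner callable, and the proportional cooling rule are assumptions for illustration; the write-up does not specify the actual heuristics.

```python
import os

def heuristic_plan(state, temp_limit_c=80.0):
    """Local rule: request cooling proportional to the overshoot on each hot cluster."""
    actions = []
    # Sorting the clusters keeps the output deterministic across runs.
    for cluster, reading in sorted(state.items()):
        overshoot = reading["temp_c"] - temp_limit_c
        if overshoot > 0:
            delta = min(100, round(overshoot * 5))  # assumed gain: 5% cooling per °C
            actions.append({"action": "increase_cooling",
                            "args": {"cluster": cluster, "delta_pct": delta}})
    return {"source": "heuristic", "actions": actions}

def make_plan(state, remote_planner=None, env=os.environ):
    """Use the remote planner when a key is configured; otherwise plan locally."""
    api_key = env.get("NEMOTRON_API_KEY")  # hypothetical variable name
    if api_key and remote_planner is not None:
        try:
            return remote_planner(state, api_key)
        except Exception:
            pass  # any remote failure falls through to the local heuristic
    return heuristic_plan(state)

if __name__ == "__main__":
    state = {"gpu": {"temp_c": 90.0}, "cpu": {"temp_c": 70.0}}
    print(make_plan(state, env={}))
```

The heuristic path uses no randomness or network access, which is what makes offline runs repeatable.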

Visualization & Evaluation

The Streamlit dashboard shows:

• Before/after Power Balance and Temperature charts.
• Real-time Agent Trace and Plan JSON.
• Evaluator Panel summarizing pass rates.
• Nemotron Scenario Generator + Critic results.
• Alert banners for any instability.

This makes complex AI control interpretable and demo-ready.

Impact

GridGuardian DC demonstrates how agentic AI can manage physical infrastructure, turning static datacenters into adaptive, self-healing systems. It aligns with NVIDIA’s sustainability vision by combining reasoning, tool-calling, and self-evaluation to reduce energy waste while increasing reliability.

Challenges & Learnings

• Validating Nemotron JSON plans under strict schemas.
• Maintaining consistent multi-agent state (battery, cooling, workloads).
• Handling long-latency model calls gracefully.

Learned: practical multi-agent orchestration, real-time function calling, and explainable control for datacenter operations.
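For the long-latency challenge, one common approach is to run the model call in a worker thread and switch to a fallback when a deadline passes. The sketch below uses Python's standard concurrent.futures module; the helper name call_with_timeout, the simulated slow call, and the timeout values are illustrative assumptions, not the project's code.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in a worker thread; if it misses the deadline, return fallback() instead."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s), "model"
    except FutureTimeout:
        return fallback(), "fallback"
    finally:
        # Do not block on the slow call; its thread finishes in the background.
        pool.shutdown(wait=False)

def slow_model_call():
    """Stand-in for a slow remote planner call."""
    time.sleep(0.3)
    return {"actions": []}

if __name__ == "__main__":
    result, source = call_with_timeout(slow_model_call, 0.05, lambda: {"actions": []})
    print(source)
```

Returning the source label alongside the result lets the trace show whether a plan came from the model or from the fallback path.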
