Screenshots
• Real-time dashboard showing per-cluster power balance and thermal load before control actions.
• AI-generated cooling and redistribution plan (JSON) computed by the multi-agent planner.
• All clusters stabilized within thresholds; system passes automated safety checks.
• Step-by-step Reason→Act→Observe loop illustrating how agents reason and verify outcomes.
• Critical alert and diagnostic trace when GPU cluster overheats, triggering fail-safe response.
⸻
Inspiration
Datacenters drive AI progress but face major efficiency and cooling challenges. Intelligent thermal-management systems are mostly run by tech giants like NVIDIA and Google, and those systems are closed and opaque. GridGuardian DC was built to open that black box: a transparent, agentic controller that reasons, plans, and self-evaluates datacenter stability in real time.
⸻
How It Works
GridGuardian DC simulates a GPU-driven datacenter divided into GPU, CPU, Storage, and Edge clusters. When it detects overheating or power imbalance, a chain of agents restores equilibrium:
1. Monitor Agent – flags thermal or power stress.
2. Planner Agent (Nemotron) – generates a structured JSON plan with actions such as:
   • increase cooling
   • discharge batteries
   • redistribute workloads
3. Executor Agent – applies those actions through tool calls (cooling_tool, battery_tool, redistribute_tool).
4. Verifier Agent – checks if all clusters return within safe limits.
5. Narrator Agent – produces a human-readable trace of the process.
6. Critic & ScenarioGen (Nemotron) – create new stress tests and grade recovery success.
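To make the Planner-to-Executor handoff concrete, here is a minimal Python sketch of how a JSON plan could be dispatched to the three tools. Only the tool names (cooling_tool, battery_tool, redistribute_tool) come from the project; the plan layout, the argument names, the state dictionary, and the simple arithmetic inside each tool are assumptions made for illustration and may differ from the actual implementation.

```python
import json

# Hypothetical cluster state: temperature in °C and power draw in kW (illustrative values).
state = {
    "GPU": {"temp_c": 92.0, "load_kw": 120.0},
    "CPU": {"temp_c": 65.0, "load_kw": 60.0},
}

def cooling_tool(state, cluster, delta_c):
    """Assumed effect: extra cooling lowers the cluster temperature by delta_c."""
    state[cluster]["temp_c"] -= delta_c

def battery_tool(state, cluster, kw):
    """Assumed effect: discharging batteries offsets kw of the cluster's load."""
    state[cluster]["load_kw"] -= kw

def redistribute_tool(state, source, target, kw):
    """Assumed effect: move kw of workload from one cluster to another."""
    state[source]["load_kw"] -= kw
    state[target]["load_kw"] += kw

# Registry mapping tool names in the plan to callables.
TOOLS = {
    "cooling_tool": cooling_tool,
    "battery_tool": battery_tool,
    "redistribute_tool": redistribute_tool,
}

# Assumed plan layout: a list of actions, each naming a tool and its arguments.
plan_json = """
{"actions": [
  {"tool": "cooling_tool", "args": {"cluster": "GPU", "delta_c": 10.0}},
  {"tool": "redistribute_tool", "args": {"source": "GPU", "target": "CPU", "kw": 20.0}}
]}
"""

def execute_plan(state, plan_json):
    """Parse the plan and apply each action in order; return the tool names applied."""
    plan = json.loads(plan_json)
    applied = []
    for action in plan["actions"]:
        TOOLS[action["tool"]](state, **action["args"])
        applied.append(action["tool"])
    return applied

applied = execute_plan(state, plan_json)
```

Keeping the tools in a registry means the Executor never evaluates model output directly; an unknown tool name raises a KeyError instead of running arbitrary code.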
Every loop follows the ReAct reasoning pattern — Reason → Act → Observe — with complete logs for transparency.
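The Reason → Act → Observe cycle can be sketched as a small loop over a single cluster temperature. This is an illustrative toy, not the project's agent code: the 80 °C threshold, the fixed 5 °C cooling step, and the trace field names are all assumptions.

```python
SAFE_TEMP_C = 80.0  # assumed safe limit, for illustration only

def reason(temp_c):
    """Reason: pick an action from the latest observation, or None if stable."""
    if temp_c > SAFE_TEMP_C:
        return {"tool": "cooling_tool", "delta_c": 5.0}
    return None

def act(temp_c, action):
    """Act: apply the chosen action to the (simulated) cluster."""
    return temp_c - action["delta_c"]

def react_loop(temp_c, max_steps=10):
    """Run Reason -> Act -> Observe until stable or max_steps is reached."""
    trace = []
    for step in range(1, max_steps + 1):
        action = reason(temp_c)
        if action is None:
            # Nothing left to do: log the final observation and stop.
            trace.append({"step": step, "reason": "within limits",
                          "action": None, "observation": temp_c})
            break
        temp_c = act(temp_c, action)
        # Observe: record the new temperature alongside the action taken.
        trace.append({"step": step, "reason": "temperature above limit",
                      "action": action, "observation": temp_c})
    return temp_c, trace

final_temp, trace = react_loop(92.0)
```

Each trace entry records why an action was chosen and what was observed afterwards, which is the kind of log a narrator step could turn into readable text.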
⸻
Nemotron Integration
Nemotron powers three core functions:
• Planning: creates valid control plans obeying numeric limits.
• Scenario Generation: builds random thermal or power failure cases.
• Critique & Scoring: evaluates system performance and suggests improvements.
If the Nemotron key is unavailable, the system safely falls back to local heuristics, ensuring reproducible offline runs.
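One way to structure such a fallback is sketched below. The environment variable name NEMOTRON_API_KEY, the call_nemotron_planner placeholder, and the heuristic itself (cool each cluster by exactly its excess over an assumed 80 °C limit) are hypothetical; the writeup does not specify how the real fallback is implemented.

```python
import os

def call_nemotron_planner(state, api_key):
    """Placeholder for the remote planner call; the real client is not shown here."""
    raise NotImplementedError("remote planner is not wired up in this sketch")

def heuristic_plan(state, safe_temp_c=80.0):
    """Deterministic local rule: cool every overheated cluster back to the limit."""
    actions = []
    for cluster in sorted(state):  # sorted for reproducible ordering
        excess = state[cluster]["temp_c"] - safe_temp_c
        if excess > 0:
            actions.append({"tool": "cooling_tool",
                            "args": {"cluster": cluster, "delta_c": excess}})
    return {"source": "heuristic", "actions": actions}

def make_plan(state, env=os.environ):
    """Use the remote planner when a key is present, else fall back locally."""
    api_key = env.get("NEMOTRON_API_KEY")  # hypothetical variable name
    if api_key:
        try:
            return call_nemotron_planner(state, api_key)
        except Exception:
            pass  # any remote failure falls through to the local heuristic
    return heuristic_plan(state)

demo_state = {"GPU": {"temp_c": 92.0}, "CPU": {"temp_c": 65.0}}
plan = make_plan(demo_state, env={})
```

Because the heuristic has no randomness and iterates clusters in sorted order, the same input state always yields the same plan, which is what makes offline runs reproducible.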
⸻
Visualization & Evaluation
The Streamlit dashboard shows:
• Before/after Power Balance and Temperature charts.
• Real-time Agent Trace and Plan JSON.
• Evaluator Panel summarizing pass rates.
• Nemotron Scenario Generator + Critic results.
• Alert banners for any instability.
This makes complex AI control interpretable and demo-ready.
⸻
Impact
GridGuardian DC demonstrates how agentic AI can manage physical infrastructure—turning static datacenters into adaptive, self-healing systems. It aligns directly with NVIDIA’s sustainability vision, combining reasoning, tool-calling, and self-evaluation to reduce energy waste while increasing reliability.
⸻
Challenges & Learnings
• Validating Nemotron JSON plans under strict schemas.
• Maintaining consistent multi-agent state (battery, cooling, workloads).
• Handling long-latency model calls gracefully.
Learned: practical multi-agent orchestration, real-time function calling, and explainable control for datacenter operations.
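As an illustration of the first challenge, the sketch below checks a model-generated plan against a strict schema using only the standard library. The plan layout and the numeric ranges in LIMITS are assumptions chosen for the example; only the tool names come from the project.

```python
import json

# Assumed per-tool numeric limits (inclusive ranges); values are illustrative.
LIMITS = {
    "cooling_tool": {"delta_c": (0.0, 15.0)},
    "battery_tool": {"kw": (0.0, 50.0)},
    "redistribute_tool": {"kw": (0.0, 40.0)},
}

def validate_plan(raw):
    """Return a list of error strings; an empty list means the plan is accepted."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(plan, dict) or not isinstance(plan.get("actions"), list):
        return ["plan must be an object with an 'actions' list"]
    errors = []
    for i, action in enumerate(plan["actions"]):
        if not isinstance(action, dict):
            errors.append(f"action {i}: must be an object")
            continue
        tool = action.get("tool")
        if tool not in LIMITS:
            errors.append(f"action {i}: unknown tool {tool!r}")
            continue
        args = action.get("args")
        if not isinstance(args, dict):
            errors.append(f"action {i}: 'args' must be an object")
            continue
        for name, (lo, hi) in LIMITS[tool].items():
            value = args.get(name)
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                errors.append(f"action {i}: {name} must be a number")
            elif not lo <= value <= hi:
                errors.append(f"action {i}: {name}={value} outside [{lo}, {hi}]")
    return errors

ok = validate_plan('{"actions": [{"tool": "cooling_tool", "args": {"delta_c": 10}}]}')
bad = validate_plan('{"actions": [{"tool": "cooling_tool", "args": {"delta_c": 99}}]}')
```

Returning a list of errors rather than raising on the first one makes it possible to feed all problems back to the planner in a single retry prompt, or to reject the plan and use a local fallback.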
