Inspiration

In the high-stakes domain of large-scale AI training and inference, infrastructure reliability is everything. Modern AI clusters are not just traditional servers; they are highly interconnected, thermally sensitive, and hardware-dense ecosystems where a single GPU memory leak, network bottleneck, or silent data corruption can stall millions of dollars in compute time.

We set out to build an industrial-grade, high-density operational intelligence platform that handles the chaotic realities of multi-node GPU clusters, automates job scheduling using bin-packing principles, and diagnoses distributed training failures in real time. Its design aims to combine the information density of a NASA Mission Control center with the raw data throughput of a Bloomberg Terminal.

What it does

AI Factory Ops is a cohesive single-page operational center divided into three highly specialized workspaces that handle the lifecycle of an AI datacenter:

  1. App Performance Advisor (Cluster Telemetry & Optimization)

Live Infrastructure Telemetry: Monitors throughput, p95 latency, error rates, and aggregate GPU utilization as they fluctuate in real time.

Global Cluster Health Index: Calculates an algorithmic score derived from interconnect saturation, thermal patterns, and error propagation to classify the cluster state (e.g., DEGRADED, CRITICAL, HEALTHY).

Deterministic Action Simulator: Allows operators to immediately deploy resources, reroute high-bandwidth traffic, or execute error isolation routines, instantly recalculating downstream telemetry and mitigating cluster risks.
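As a rough illustration of how a health index like the one described above could be computed, the sketch below combines three normalized signals into a 0 to 100 score and maps that score to a cluster state. The field names, weights, and thresholds are assumptions made for this example, not values taken from AI Factory Ops.

```typescript
// Hypothetical input shape: every signal is normalized to 0..1,
// where 0 is ideal and 1 is the worst case.
interface ClusterSignals {
  interconnectSaturation: number;
  thermalStress: number;
  errorPropagation: number;
}

type ClusterState = "HEALTHY" | "DEGRADED" | "CRITICAL";

// Assumed weights (summing to 1) and thresholds, chosen only for illustration.
const WEIGHTS: ClusterSignals = {
  interconnectSaturation: 0.4,
  thermalStress: 0.25,
  errorPropagation: 0.35,
};
const DEGRADED_BELOW = 80;
const CRITICAL_BELOW = 50;

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Returns a score from 0 (worst) to 100 (best).
function healthIndex(s: ClusterSignals): number {
  const penalty =
    WEIGHTS.interconnectSaturation * clamp01(s.interconnectSaturation) +
    WEIGHTS.thermalStress * clamp01(s.thermalStress) +
    WEIGHTS.errorPropagation * clamp01(s.errorPropagation);
  return Math.round(100 * (1 - penalty));
}

// Maps a score to one of the three states named in the writeup.
function classifyCluster(score: number): ClusterState {
  if (score < CRITICAL_BELOW) return "CRITICAL";
  if (score < DEGRADED_BELOW) return "DEGRADED";
  return "HEALTHY";
}
```

Because both functions are pure, a simulated action only has to change the input signals; the score and state follow from the same calculation every time.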

  2. GPU Job Placement (Advanced Workload Orchestration)

Topology-Aware Cluster Grid: Visualizes multi-rack node clusters (NODE-A1 through NODE-B4), tracking active thermal zones and memory saturation bars.

Best-Fit Decreasing (BFD) Scheduler: Executes a live scheduling algorithm that prioritizes workloads by tier importance and packs large LLM or vision training pipelines into the node with the least remaining capacity that still satisfies the scheduling invariants.
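A minimal sketch of a best-fit decreasing pass is shown below. It assumes each job fits on a single node, that capacity is measured only in free GPUs, and that a lower `tier` number means higher importance; the type names and the tie-breaking rule are hypothetical, and the real scheduler may weigh topology, memory, or thermals as well.

```typescript
interface GpuNode {
  id: string;
  freeGpus: number;
}

interface Job {
  id: string;
  tier: number; // assumed: 1 is the most important tier
  gpus: number; // GPUs requested on a single node
}

interface Placement {
  jobId: string;
  nodeId: string | null; // null means the job is blocked
}

function scheduleBestFitDecreasing(
  nodes: GpuNode[],
  jobs: Job[],
): { placements: Placement[]; nodes: GpuNode[] } {
  // Work on copies so the caller's state is never mutated.
  const pool = nodes.map((n) => ({ ...n }));

  // "Decreasing": most important tier first, then largest demand first.
  const queue = [...jobs].sort((a, b) => a.tier - b.tier || b.gpus - a.gpus);

  const placements: Placement[] = [];
  for (const job of queue) {
    // "Best fit": the feasible node with the least free capacity.
    let best: GpuNode | null = null;
    for (const node of pool) {
      if (node.freeGpus < job.gpus) continue;
      if (best === null || node.freeGpus < best.freeGpus) best = node;
    }
    if (best !== null) best.freeGpus -= job.gpus;
    placements.push({ jobId: job.id, nodeId: best !== null ? best.id : null });
  }
  return { placements, nodes: pool };
}
```

Returning the updated node pool alongside the placements makes it straightforward to show before and after capacity side by side.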

  3. Job Failure Detective (Distributed Incident Forensic System)

Time-Series Post-Mortem Explorer: Visualizes the precise microsecond an out-of-memory (OOM) error or NVLink network breakdown occurs, overlaying real-time streaming infrastructure logs.

Ranked Root Cause Analysis Engine: Leverages telemetry fingerprints to rank failure triggers (e.g., NVLink Interconnect Degradation, CUDA Host-Side Memory Allocation Failures) alongside strict confidence boundaries and projected risk outcomes.
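One simple way to rank failure triggers from telemetry fingerprints is to score each candidate cause by how many of its expected signals were actually observed. The sketch below does that. The signal names and catalog are invented for illustration, and the "confidence" here is only a match ratio, not a statistical bound, so the real engine may work quite differently.

```typescript
interface CauseFingerprint {
  cause: string;
  signals: string[]; // telemetry signals expected when this cause is present
}

interface RankedCause {
  cause: string;
  confidence: number; // fraction of expected signals that were observed, 0..1
}

function rankRootCauses(
  observed: Set<string>,
  catalog: CauseFingerprint[],
): RankedCause[] {
  return catalog
    .map((fp) => {
      const hits = fp.signals.filter((s) => observed.has(s)).length;
      const confidence = fp.signals.length > 0 ? hits / fp.signals.length : 0;
      return { cause: fp.cause, confidence };
    })
    .filter((r) => r.confidence > 0) // drop causes with no supporting signal
    .sort((a, b) => b.confidence - a.confidence);
}

// Hypothetical catalog; the signal names are made up for this example.
const EXAMPLE_CATALOG: CauseFingerprint[] = [
  {
    cause: "NVLink Interconnect Degradation",
    signals: ["nvlink_crc_errors", "allreduce_latency_spike"],
  },
  {
    cause: "CUDA Host-Side Memory Allocation Failure",
    signals: ["host_alloc_failure", "gpu_oom"],
  },
  {
    cause: "Thermal Throttling",
    signals: ["gpu_temp_high", "clock_throttle"],
  },
];
```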

How we built it

The interface was engineered from the ground up for maximum visual density and minimal latency, utilizing a modern front-end engineering pipeline tailored for enterprise-grade tools:

Framework Architecture: Built entirely using React functional components combined with declarative state hooks (useState, useEffect, useMemo) to coordinate complex multi-subsystem states without dropping frames.

Industrial Utility Styling: Styled entirely with Tailwind CSS using a specialized industrial design language. We relied on deep slate backgrounds (#0A0E1A, #0D1117), high-contrast diagnostic indicators (Cyan #00D4FF, Crimson #FF4757), and a telemetry type scale built around monospaced readability (JetBrains Mono).

Challenges we ran into

  1. Synchronizing Multi-System State Dependencies

When an operator fires a mitigation action (like "Reroute Traffic"), that choice shouldn't just change one chart. It has to ripple through the Global Health Index, drop error metrics across the telemetry suite, and shift the natural language narrative generated by the SRE Copilot. Synchronizing these downstream side effects without causing layout shifts or infinite rendering loops required rigorous data design and declarative state modeling.

  2. Formulating the In-Browser Job Scheduler

Implementing a real-time Best-Fit Decreasing GPU Scheduler inside a client-side state engine meant translating complex algorithmic logic into immediate UI responses. We had to ensure that when a multi-node job array was scheduled, node availability vectors were updated instantly, blocking statuses were accurately caught, and before/after metrics were recalculated dynamically without locking up the main rendering thread.

  3. Balancing Information Density with Operator Cognition

Enterprise SREs hate wasted space, but they also hate visual chaos. Designing a layout that holds over 50 real-time telemetry points, action logs, system indicators, and node maps required maintaining a precise visual hierarchy. We resolved this by constraining the use of glow properties exclusively to CRITICAL status violations, grouping modular panels by analytical intent, and framing historical context neatly using compact badges.
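The state synchronization problem from the first challenge above is often handled by keeping one source of truth and deriving everything else from it. The sketch below assumes that approach: a pure reducer applies the operator's action to raw telemetry, and a pure selector derives the status and narrative. The action names, scaling factors, and thresholds are hypothetical. In a React component the selector would typically be wrapped in `useMemo` keyed on the telemetry object, so derived values never need their own `useState` and cannot drift out of sync.

```typescript
interface Telemetry {
  errorRate: number; // 0..1
  interconnectSaturation: number; // 0..1
}

type OperatorAction = { type: "REROUTE_TRAFFIC" } | { type: "ISOLATE_ERRORS" };

// Pure reducer: returns a new object and never mutates its input.
function applyAction(t: Telemetry, action: OperatorAction): Telemetry {
  switch (action.type) {
    case "REROUTE_TRAFFIC":
      // Assumed effect: rerouting halves interconnect saturation.
      return { ...t, interconnectSaturation: t.interconnectSaturation * 0.5 };
    case "ISOLATE_ERRORS":
      // Assumed effect: isolation cuts the error rate to a quarter.
      return { ...t, errorRate: t.errorRate * 0.25 };
  }
}

interface DashboardView {
  status: "HEALTHY" | "DEGRADED" | "CRITICAL";
  narrative: string;
}

// Pure selector: every downstream value is computed from telemetry alone.
function deriveView(t: Telemetry): DashboardView {
  const worst = Math.max(t.errorRate, t.interconnectSaturation);
  const status: DashboardView["status"] =
    worst >= 0.8 ? "CRITICAL" : worst >= 0.5 ? "DEGRADED" : "HEALTHY";
  const pct = (x: number): string => (x * 100).toFixed(1) + "%";
  const narrative =
    `Cluster is ${status}: error rate ${pct(t.errorRate)}, ` +
    `interconnect saturation ${pct(t.interconnectSaturation)}.`;
  return { status, narrative };
}
```

Since the view is recomputed from the telemetry rather than updated by effects, there is no chain of effects setting state that could trigger a render loop.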

Accomplishments that we're proud of

Zero-Lag Live System Feel: We successfully bypassed the typical "static mockup look" by structuring high-performance background tick simulations. The dashboard truly feels like it is hooked directly into a roaring rack of NVIDIA H100 nodes.
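A background tick simulation of this kind can be sketched as a bounded random walk. The version below assumes a seeded generator (the widely used mulberry32 routine) so that a run can be replayed exactly; whether the project seeds its simulation is not stated, and the metric names, step sizes, and bounds are hypothetical.

```typescript
// mulberry32: a small seeded PRNG that returns values in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface TelemetrySample {
  gpuUtilization: number; // percent, 0..100
  p95LatencyMs: number; // milliseconds, 1..500
}

const clamp = (x: number, lo: number, hi: number): number =>
  Math.min(hi, Math.max(lo, x));

// One tick: nudge each metric by a small random amount and keep it in range.
function nextSample(prev: TelemetrySample, rand: () => number): TelemetrySample {
  const jitter = (span: number): number => (rand() - 0.5) * span;
  return {
    gpuUtilization: clamp(prev.gpuUtilization + jitter(4), 0, 100),
    p95LatencyMs: clamp(prev.p95LatencyMs + jitter(10), 1, 500),
  };
}

// Generates a series of samples; the same seed always yields the same series.
function simulateTicks(
  seed: number,
  ticks: number,
  start: TelemetrySample,
): TelemetrySample[] {
  const rand = mulberry32(seed);
  const series: TelemetrySample[] = [];
  let current = start;
  for (let i = 0; i < ticks; i++) {
    current = nextSample(current, rand);
    series.push(current);
  }
  return series;
}
```

In a React component, `nextSample` would typically be called from a `setInterval` registered inside `useEffect`, with `clearInterval` in the cleanup function so the timer stops when the component unmounts.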

Deterministic Algorithmic Realism: The scheduler isn't random. It actively respects infrastructure capacity constraints and priority queuing rules, allowing judges and infrastructure engineers to quickly see genuine technical logic under the hood.

What we learned

State Telemetry Modeling: Designing software for SREs teaches you to value the signal-to-noise ratio over purely decorative aesthetics. Every pixel must serve a diagnostic purpose, immediately answering: What is happening, why is it happening, and what should be done next?
