Inspiration
Industrial equipment fails in silence. Most operators only know something is wrong when a machine stops — by which time the damage is done, the downtime clock is running, and emergency parts are on the way at 3x the price. We wanted to flip that model entirely: what if the machine told you it was dying, an AI agent swarm handled the response, and the human operator only needed to press one button to confirm?
That question became Gantry.
What It Does
Gantry is a real-time Digital Twin Command Center for industrial predictive maintenance. It monitors live engine telemetry streamed from Elasticsearch, detects imminent failures using a trained Deep Reinforcement Learning (PPO) policy, and deploys a multi-agent MCP swarm to propose and execute a repair plan — all before a human even reaches for the phone.
Key capabilities:
- Live telemetry dashboard — RUL (Remaining Useful Life), vibration (g RMS), unit status, and cycle count streamed over WebSocket from Elasticsearch
- Failure detection & system halt — When RUL hits zero the simulation freezes, the dashboard locks on failure values, and a live downtime counter starts
- Autonomous MCP Agent Swarm — 10 reasoning steps (Watchman → Foreman → Inventory → Procurement → Logistics → Shadow Model → DRL Policy → Personnel → Auditor → Gantry AI) broadcast live to the overlay
- Deep Reinforcement Learning — Stable-Baselines3 PPO model (gantry_policy_v1) trained on 20,631 NASA C-MAPSS records makes the repair/monitor decision
- Shadow Model conflict detection — A simple rule-based policy runs in parallel with DRL; conflicts are surfaced to the operator for human-in-the-loop review
- Cost comparison panel — Reactive ($18,500 / 72h downtime) vs Preventive ($7,200 / 24h) vs Predictive AI ($2,800 / 4h) shown side-by-side at solution time
- Agent Chat — A persistent chat interface backed by Elastic Agent Builder's /converse API. Context is injected dynamically from live telemetry, so answers always reflect the current dashboard state — not stale history
- One-click resume — "Accept & Resume System" calls a backend endpoint that clears the halt flag, broadcasts system_resumed over WebSocket, and triggers a 30-second grace window while the simulator catches up
How We Built It
Backend: FastAPI (Python) with uvicorn, managing WebSocket connections, MCP orchestration, DRL inference, and the Elastic Agent Builder proxy. All Elasticsearch auth stays server-side.
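A minimal sketch of the WebSocket fan-out this describes, assuming a simple connection-manager pattern (the endpoint path, ConnectionManager class, and payload shape are illustrative, not the actual Gantry source):

```python
# Hypothetical fan-out pattern for pushing telemetry ticks to dashboard clients.
import json
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()


class ConnectionManager:
    """Tracks open dashboard sockets and broadcasts telemetry ticks."""

    def __init__(self) -> None:
        self.active: list[WebSocket] = []

    async def connect(self, ws: WebSocket) -> None:
        await ws.accept()
        self.active.append(ws)

    def disconnect(self, ws: WebSocket) -> None:
        if ws in self.active:
            self.active.remove(ws)

    async def broadcast(self, payload: dict) -> None:
        for ws in list(self.active):
            try:
                await ws.send_text(json.dumps(payload))
            except Exception:
                self.disconnect(ws)


manager = ConnectionManager()


@app.websocket("/ws/telemetry")
async def telemetry_stream(ws: WebSocket):
    await manager.connect(ws)
    try:
        while True:
            # Keep the socket open; the server pushes, the client mostly listens.
            await ws.receive_text()
    except WebSocketDisconnect:
        manager.disconnect(ws)
```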
AI Layer:
- Stable-Baselines3 PPO model trained on the NASA C-MAPSS FD001 dataset (20,631 sensor rows, RUL labels derived from max-cycle normalization)
- A custom GantryEnv OpenAI Gym environment simulates the cost/risk tradeoff between monitoring and ordering parts
- Shadow model runs a deterministic rule (RUL < 10 → order) alongside DRL and flags divergence (a minimal sketch of both follows this list)
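A minimal sketch of what the GantryEnv tradeoff and the shadow rule could look like, using the old-style Gym API. The observation layout, thresholds, and reward values (borrowed from the cost panel above) are assumptions, not the trained policy's actual environment:

```python
# Toy cost/risk environment: each step the agent chooses 0 = keep monitoring, 1 = order the part.
import gym
import numpy as np
from gym import spaces


class GantryEnv(gym.Env):
    def __init__(self, rul_series: np.ndarray):
        super().__init__()
        self.rul_series = rul_series.astype(np.float32)
        self.t = 0
        # Observation: [current RUL, vibration proxy]; the real feature set is richer.
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(2,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)

    def _obs(self):
        rul = float(self.rul_series[self.t])
        vibration = 1.0 / max(rul, 1.0)  # crude stand-in for a sensor reading
        return np.array([rul, vibration], dtype=np.float32)

    def reset(self):
        self.t = 0
        return self._obs()

    def step(self, action):
        rul = float(self.rul_series[self.t])
        reward, done = 0.0, False
        if action == 1:                      # order the part now
            # Ordering close to failure ~ predictive ($2,800); far too early ~ preventive ($7,200).
            reward = -2_800.0 if rul < 15 else -7_200.0
            done = True
        elif rul <= 0:                       # monitored all the way into failure
            reward = -18_500.0               # reactive cost from the comparison panel
            done = True
        self.t = min(self.t + 1, len(self.rul_series) - 1)
        return self._obs(), reward, done, {}


def shadow_policy(rul: float) -> int:
    """Deterministic shadow rule: order the part once RUL < 10."""
    return 1 if rul < 10 else 0
```

Stable-Baselines3 can then be pointed straight at this environment, e.g. PPO("MlpPolicy", GantryEnv(ruls)).learn(total_timesteps=...), and the shadow_policy output is compared against the PPO action to flag divergence.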
Elasticsearch: Two indices — gantry_telemetry (sensor time-series) and gantry_personnel (technician availability). ES|QL queries power real-time telemetry fetch and personnel lookup inside the MCP swarm.
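A sketch of those lookups through the elasticsearch-py client's ES|QL endpoint (available in recent 8.x clients); the field names and exact query text are assumptions based on the two indices described above:

```python
# Hypothetical ES|QL lookups used by the MCP swarm.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="<api-key>")

# Latest telemetry row for a unit (assumed fields: unit_id, cycle, rul, vibration_g_rms).
telemetry_q = """
FROM gantry_telemetry
| WHERE unit_id == 3
| SORT cycle DESC
| LIMIT 1
| KEEP unit_id, cycle, rul, vibration_g_rms
"""

# Nearest available technicians (assumed fields: name, certification, status, distance_km).
personnel_q = """
FROM gantry_personnel
| WHERE status == "available"
| SORT distance_km ASC
| LIMIT 3
| KEEP name, certification, distance_km
"""

for q in (telemetry_q, personnel_q):
    resp = es.esql.query(query=q)
    print(resp["columns"], resp["values"])
```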
Elastic Agent Builder: A custom Agent is used as the conversational backbone. The backend injects a dynamic context string (unit ID, RUL, vibration, status, DRL decision, downtime) into every /converse call so the agent answers with live data, not generic text.
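A sketch of that context injection; the converse URL and request payload shape below are placeholders rather than the exact Agent Builder schema, so treat this as the idea, not the API contract:

```python
# Hypothetical dynamic-context injection into the Agent Builder chat call.
import httpx

CONVERSE_URL = "<kibana-agent-builder-converse-endpoint>"  # placeholder


def build_context(t: dict) -> str:
    """Flatten the latest dashboard state into a context string for the agent."""
    return (
        f"Unit {t['unit_id']} | RUL: {t['rul']} cycles | "
        f"Vibration: {t['vibration_g_rms']} g RMS | Status: {t['status']} | "
        f"DRL decision: {t['drl_decision']} | Downtime: {t['downtime_min']} min"
    )


async def ask_agent(question: str, telemetry: dict, api_key: str) -> str:
    payload = {
        # Prepend live context so the agent answers about *this* dashboard state.
        "input": f"[LIVE TELEMETRY] {build_context(telemetry)}\n\nOperator: {question}",
    }
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            CONVERSE_URL,
            json=payload,
            headers={"Authorization": f"ApiKey {api_key}", "kbn-xsrf": "true"},
        )
        resp.raise_for_status()
        return resp.json().get("response", "")  # assumed response field
```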
Frontend: React + Vite, Tailwind CSS, Framer Motion. WebSocket hook auto-reconnects with exponential backoff. The CriticalOverlay modal drives the entire failure → analysis → solution → resume UX flow.
MCP Tools Used: get_telemetry_status, inventory_procurement, personnel_locator, platform_core_execute_esql, observability_get_alerts
Challenges We Ran Into
Elasticsearch indexing lag — When trigger_failure.py injected a RUL=0 document, the MCP get_telemetry_status query would still return the previous healthy doc for 1–2 seconds. The agent steps were showing RUL=5 instead of RUL=0. Fixed by bypassing ES entirely for the telemetry step — the failure snapshot is stored in memory at broadcast time and injected directly into orchestration.
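Roughly what the fix looks like (names are illustrative; manager comes from the WebSocket sketch above, and fetch_latest_from_es stands in for the real ES fetch):

```python
# Capture the failure snapshot in memory at broadcast time instead of re-querying
# Elasticsearch, which may still serve the previous healthy document for a second or two.
_failure_snapshot: dict | None = None


async def broadcast_failure(doc: dict) -> None:
    global _failure_snapshot
    _failure_snapshot = doc                      # freeze the RUL=0 row we just injected
    await manager.broadcast({"type": "failure", "telemetry": doc})


async def get_telemetry_for_orchestration(unit_id: int) -> dict:
    # Bypass ES while the halt is active: the snapshot is guaranteed fresh, the index may not be.
    if _failure_snapshot is not None:
        return _failure_snapshot
    return await fetch_latest_from_es(unit_id)   # normal path (hypothetical helper)
```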
Post-resume dashboard staying red — After the operator accepted the solution, the WS loop would immediately re-fetch from ES — and the last document was still the failure row (simulation hadn't written a new healthy row yet). Fixed with a 30-second grace window that serves a synthetic healthy payload, then seamlessly hands off to real simulation data.
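A sketch of the grace-window logic, with assumed timings and synthetic field values:

```python
# After resume, serve a synthetic healthy payload until the simulator has written fresh rows.
import time

GRACE_SECONDS = 30
_resume_at: float | None = None


def mark_resumed() -> None:
    global _resume_at
    _resume_at = time.monotonic()


def telemetry_for_dashboard(latest_es_doc: dict) -> dict:
    """Return what the WS loop should push on this tick."""
    if _resume_at is not None and time.monotonic() - _resume_at < GRACE_SECONDS:
        # The last ES document is still the failure row; hand out healthy values
        # until the simulator catches up.
        return {**latest_es_doc, "rul": 120, "vibration_g_rms": 0.4, "status": "HEALTHY"}
    return latest_es_doc
```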
Vite proxy path stripping — The dev proxy rewrites /api/* → /* before forwarding to FastAPI. A @app.post("/api/system-resume") route was therefore a silent 404 on every "Accept" click, leaving _system_halted = True forever. The catch block swallowed the error so it was invisible. Fixed by aligning the FastAPI route to /system-resume.
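Continuing the sketches above, the fix was simply to register the route at the path the proxy actually forwards (handler body is illustrative):

```python
# The dev proxy rewrites /api/system-resume to /system-resume before it reaches FastAPI,
# so the route must be registered without the /api prefix.
@app.post("/system-resume")
async def system_resume():
    global _system_halted, _failure_snapshot
    _system_halted = False
    _failure_snapshot = None
    mark_resumed()                                   # start the 30s grace window
    await manager.broadcast({"type": "system_resumed"})
    return {"ok": True}
```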
Chat returning stale failure data after resume — The chat endpoint always read from _last_decision, which held CRITICAL/RUL=0 from the previous orchestration run. Added _last_live_telemetry — saved on every WS tick — so the chat context always reflects whatever the dashboard is currently showing.
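A sketch of the fix, with illustrative names mirroring the ones mentioned above:

```python
# Cache the most recent live tick alongside the last orchestration decision,
# and build chat context from the live cache.
_last_live_telemetry: dict | None = None
_last_decision: dict | None = None


def on_ws_tick(doc: dict) -> None:
    global _last_live_telemetry
    _last_live_telemetry = doc            # updated on every WebSocket push


def chat_context() -> dict:
    # Prefer what the dashboard is showing right now; fall back to the last
    # orchestration decision only if no live tick has arrived yet.
    return _last_live_telemetry or _last_decision or {}
```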
Accomplishments That We're Proud Of
- A fully end-to-end autonomous maintenance pipeline: sensor data → failure detection → MCP swarm → DRL decision → cost comparison → one-click resolution — no manual steps required
- The cost comparison panel makes the ROI of predictive AI immediately tangible: $15,700 saved per event vs reactive maintenance
- The Shadow Model / DRL conflict detection is a genuinely novel explainability layer — it shows operators why two AI systems disagree and lets them make an informed override
- The chat interface feels live because it actually is — it reads from the same telemetry the dashboard is displaying, not cached orchestration output
What We Learned
- Elastic Agent Builder's /converse API is powerful when paired with injected context — it turns a generic LLM into a domain-specific operator assistant with zero fine-tuning
- ES|QL makes complex aggregations inside an agentic pipeline remarkably clean — one query replaces what would otherwise be multiple API calls
- WebSocket state management and server-side halt logic need to be designed together from the start; retrofitting halt/resume into a live streaming system creates subtle race conditions at every layer
- Deep RL for maintenance scheduling is viable even on small datasets when the reward function is carefully designed around real cost structures
What's Next for Gantry
- Multi-unit fleet view — extend the twin to monitor 10+ engines simultaneously with a fleet health heatmap
- Causal root-cause analysis — use Elastic's ML anomaly detection to trace which sensor pattern caused the failure, not just that a failure occurred
- Automated work order generation — when the agent approves express shipping, auto-generate and dispatch the purchase order via a webhook integration
- Continuous DRL retraining — feed confirmed outcomes (was the repair correct? did the part arrive in time?) back into the training loop so the policy improves with every event
Built With
- elastic-agent-builder
- elasticsearch
- es|ql
- fastapi
- framer-motion
- lucide
- mcp-(model-context-protocol)
- nasa-c-mapss-dataset
- openai-gym
- ppo-(deep-reinforcement-learning)
- python
- react
- stable-baselines3
- tailwind-css
- uvicorn
- vite
- websockets