Inspiration
Industrial equipment fails in silence. Most operators only know something is wrong when a machine stops — by which time the damage is done, the downtime clock is running, and emergency parts are on the way at 3x the price. We wanted to flip that model entirely: what if the machine told you it was dying, an AI agent swarm handled the response, and the human operator only needed to press one button to confirm?
That question became Gantry.
What It Does
Gantry is a real-time Digital Twin Command Center for industrial predictive maintenance. It monitors live engine telemetry streamed from Elasticsearch, detects imminent failures using a trained Deep Reinforcement Learning (PPO) policy, and deploys a multi-agent MCP swarm to propose and execute a repair plan — all before a human even reaches for the phone.
Key capabilities:
- Live telemetry dashboard — RUL (Remaining Useful Life), vibration (g RMS), unit status, and cycle count streamed over WebSocket from Elasticsearch
- Failure detection & system halt — When RUL hits zero the simulation freezes, the dashboard locks on failure values, and a live downtime counter starts
- Autonomous MCP Agent Swarm — 10 reasoning steps (Watchman → Foreman → Inventory → Procurement → Logistics → Shadow Model → DRL Policy → Personnel → Auditor → Gantry AI) broadcast live to the overlay
- Deep Reinforcement Learning — Stable-Baselines3 PPO model (gantry_policy_v1) trained on 20,631 NASA C-MAPSS records makes the repair/monitor decision
- Shadow Model conflict detection — A simple rule-based policy runs in parallel with DRL; conflicts are surfaced to the operator for human-in-the-loop review
- Cost comparison panel — Reactive ($18,500 / 72h downtime) vs Preventive ($7,200 / 24h) vs Predictive AI ($2,800 / 4h) shown side-by-side at solution time
- Agent Chat — A persistent chat interface backed by Elastic Agent Builder's /converse API. Context is injected dynamically from live telemetry, so answers always reflect the current dashboard state — not stale history
- One-click resume — "Accept & Resume System" calls a backend endpoint that clears the halt flag, broadcasts system_resumed over WebSocket, and triggers a 30-second grace window while the simulator catches up
How We Built It
Backend: FastAPI (Python) with uvicorn, managing WebSocket connections, MCP orchestration, DRL inference, and the Elastic Agent Builder proxy. All Elasticsearch auth stays server-side.
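A minimal sketch of the WebSocket fan-out this describes, assuming a simple connection-manager pattern (the endpoint path, ConnectionManager class, and payload shape are illustrative, not the actual Gantry source):

```python
# Hypothetical fan-out pattern for pushing telemetry ticks to dashboard clients.
import json
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()


class ConnectionManager:
    """Tracks open dashboard sockets and broadcasts telemetry ticks."""

    def __init__(self) -> None:
        self.active: list[WebSocket] = []

    async def connect(self, ws: WebSocket) -> None:
        await ws.accept()
        self.active.append(ws)

    def disconnect(self, ws: WebSocket) -> None:
        if ws in self.active:
            self.active.remove(ws)

    async def broadcast(self, payload: dict) -> None:
        for ws in list(self.active):
            try:
                await ws.send_text(json.dumps(payload))
            except Exception:
                self.disconnect(ws)


manager = ConnectionManager()


@app.websocket("/ws/telemetry")
async def telemetry_stream(ws: WebSocket):
    await manager.connect(ws)
    try:
        while True:
            # Keep the socket open; the server pushes, the client mostly listens.
            await ws.receive_text()
    except WebSocketDisconnect:
        manager.disconnect(ws)
```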
AI Layer:
- Stable-Baselines3 PPO model trained on the NASA C-MAPSS FD001 dataset (20,631 sensor rows, RUL labels derived from max-cycle normalization)
- A custom GantryEnv OpenAI Gym environment simulates the cost/risk tradeoff between monitoring and ordering parts
- Shadow model runs a deterministic rule (RUL < 10 → order) alongside DRL and flags divergence (a minimal sketch of both follows this list)
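A minimal sketch of what the GantryEnv tradeoff and the shadow rule could look like, using the old-style Gym API. The observation layout, thresholds, and reward values (borrowed from the cost panel above) are assumptions, not the trained policy's actual environment:

```python
# Toy cost/risk environment: each step the agent chooses 0 = keep monitoring, 1 = order the part.
import gym
import numpy as np
from gym import spaces


class GantryEnv(gym.Env):
    def __init__(self, rul_series: np.ndarray):
        super().__init__()
        self.rul_series = rul_series.astype(np.float32)
        self.t = 0
        # Observation: [current RUL, vibration proxy]; the real feature set is richer.
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(2,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)

    def _obs(self):
        rul = float(self.rul_series[self.t])
        vibration = 1.0 / max(rul, 1.0)  # crude stand-in for a sensor reading
        return np.array([rul, vibration], dtype=np.float32)

    def reset(self):
        self.t = 0
        return self._obs()

    def step(self, action):
        rul = float(self.rul_series[self.t])
        reward, done = 0.0, False
        if action == 1:                      # order the part now
            # Ordering close to failure ~ predictive ($2,800); far too early ~ preventive ($7,200).
            reward = -2_800.0 if rul < 15 else -7_200.0
            done = True
        elif rul <= 0:                       # monitored all the way into failure
            reward = -18_500.0               # reactive cost from the comparison panel
            done = True
        self.t = min(self.t + 1, len(self.rul_series) - 1)
        return self._obs(), reward, done, {}


def shadow_policy(rul: float) -> int:
    """Deterministic shadow rule: order the part once RUL < 10."""
    return 1 if rul < 10 else 0
```

Stable-Baselines3 can then be pointed straight at this environment, e.g. PPO("MlpPolicy", GantryEnv(ruls)).learn(total_timesteps=...), and the shadow_policy output is compared against the PPO action to flag divergence.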
Elasticsearch: Two indices — gantry_telemetry (sensor time-series) and gantry_personnel (technician availability). ES|QL queries power real-time telemetry fetch and personnel lookup inside the MCP swarm.
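A sketch of those lookups through the elasticsearch-py client's ES|QL endpoint (available in recent 8.x clients); the field names and exact query text are assumptions based on the two indices described above:

```python
# Hypothetical ES|QL lookups used by the MCP swarm.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="<api-key>")

# Latest telemetry row for a unit (assumed fields: unit_id, cycle, rul, vibration_g_rms).
telemetry_q = """
FROM gantry_telemetry
| WHERE unit_id == 3
| SORT cycle DESC
| LIMIT 1
| KEEP unit_id, cycle, rul, vibration_g_rms
"""

# Nearest available technicians (assumed fields: name, certification, status, distance_km).
personnel_q = """
FROM gantry_personnel
| WHERE status == "available"
| SORT distance_km ASC
| LIMIT 3
| KEEP name, certification, distance_km
"""

for q in (telemetry_q, personnel_q):
    resp = es.esql.query(query=q)
    print(resp["columns"], resp["values"])
```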
Elastic Agent Builder: A custom Agent is used as the conversational backbone. The backend injects a dynamic context string (unit ID, RUL, vibration, status, DRL decision, downtime) into every /converse call so the agent answers with live data, not generic text.
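A sketch of that context injection; the converse URL and request payload shape below are placeholders rather than the exact Agent Builder schema, so treat this as the idea, not the API contract:

```python
# Hypothetical dynamic-context injection into the Agent Builder chat call.
import httpx

CONVERSE_URL = "<kibana-agent-builder-converse-endpoint>"  # placeholder


def build_context(t: dict) -> str:
    """Flatten the latest dashboard state into a context string for the agent."""
    return (
        f"Unit {t['unit_id']} | RUL: {t['rul']} cycles | "
        f"Vibration: {t['vibration_g_rms']} g RMS | Status: {t['status']} | "
        f"DRL decision: {t['drl_decision']} | Downtime: {t['downtime_min']} min"
    )


async def ask_agent(question: str, telemetry: dict, api_key: str) -> str:
    payload = {
        # Prepend live context so the agent answers about *this* dashboard state.
        "input": f"[LIVE TELEMETRY] {build_context(telemetry)}\n\nOperator: {question}",
    }
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            CONVERSE_URL,
            json=payload,
            headers={"Authorization": f"ApiKey {api_key}", "kbn-xsrf": "true"},
        )
        resp.raise_for_status()
        return resp.json().get("response", "")  # assumed response field
```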
Frontend: React + Vite, Tailwind CSS, Framer Motion. WebSocket hook auto-reconnects with exponential backoff. The CriticalOverlay modal drives the entire failure → analysis → solution → resume UX flow.
MCP Tools Used: get_telemetry_status, inventory_procurement, personnel_locator, platform_core_execute_esql, observability_get_alerts
Challenges We Ran Into
Elasticsearch indexing lag — When trigger_failure.py injected a RUL=0 document, the MCP get_telemetry_status query would still return the previous healthy doc for 1–2 seconds. The agent steps were showing RUL=5 instead of RUL=0. Fixed by bypassing ES entirely for the telemetry step — the failure snapshot is stored in memory at broadcast time and injected directly into orchestration.
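Roughly what the fix looks like (names are illustrative; manager comes from the WebSocket sketch above, and fetch_latest_from_es stands in for the real ES fetch):

```python
# Capture the failure snapshot in memory at broadcast time instead of re-querying
# Elasticsearch, which may still serve the previous healthy document for a second or two.
_failure_snapshot: dict | None = None


async def broadcast_failure(doc: dict) -> None:
    global _failure_snapshot
    _failure_snapshot = doc                      # freeze the RUL=0 row we just injected
    await manager.broadcast({"type": "failure", "telemetry": doc})


async def get_telemetry_for_orchestration(unit_id: int) -> dict:
    # Bypass ES while the halt is active: the snapshot is guaranteed fresh, the index may not be.
    if _failure_snapshot is not None:
        return _failure_snapshot
    return await fetch_latest_from_es(unit_id)   # normal path (hypothetical helper)
```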
Post-resume dashboard staying red — After the operator accepted the solution, the WS loop would immediately re-fetch from ES — and the last document was still the failure row (simulation hadn't written a new healthy row yet). Fixed with a 30-second grace window that serves a synthetic healthy payload, then seamlessly hands off to real simulation data.
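A sketch of the grace-window logic, with assumed timings and synthetic field values:

```python
# After resume, serve a synthetic healthy payload until the simulator has written fresh rows.
import time

GRACE_SECONDS = 30
_resume_at: float | None = None


def mark_resumed() -> None:
    global _resume_at
    _resume_at = time.monotonic()


def telemetry_for_dashboard(latest_es_doc: dict) -> dict:
    """Return what the WS loop should push on this tick."""
    if _resume_at is not None and time.monotonic() - _resume_at < GRACE_SECONDS:
        # The last ES document is still the failure row; hand out healthy values
        # until the simulator catches up.
        return {**latest_es_doc, "rul": 120, "vibration_g_rms": 0.4, "status": "HEALTHY"}
    return latest_es_doc
```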
Vite proxy path stripping — The dev proxy rewrites /api/* → /* before forwarding to FastAPI. A @app.post("/api/system-resume") route was therefore a silent 404 on every "Accept" click, leaving _system_halted = True forever. The catch block swallowed the error so it was invisible. Fixed by aligning the FastAPI route to /system-resume.
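Continuing the sketches above, the fix was simply to register the route at the path the proxy actually forwards (handler body is illustrative):

```python
# The dev proxy rewrites /api/system-resume to /system-resume before it reaches FastAPI,
# so the route must be registered without the /api prefix.
@app.post("/system-resume")
async def system_resume():
    global _system_halted, _failure_snapshot
    _system_halted = False
    _failure_snapshot = None
    mark_resumed()                                   # start the 30s grace window
    await manager.broadcast({"type": "system_resumed"})
    return {"ok": True}
```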
Chat returning stale failure data after resume — The chat endpoint always read from _last_decision, which held CRITICAL/RUL=0 from the previous orchestration run. Added _last_live_telemetry — saved on every WS tick — so the chat context always reflects whatever the dashboard is currently showing.
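A sketch of the fix, with illustrative names mirroring the ones mentioned above:

```python
# Cache the most recent live tick alongside the last orchestration decision,
# and build chat context from the live cache.
_last_live_telemetry: dict | None = None
_last_decision: dict | None = None


def on_ws_tick(doc: dict) -> None:
    global _last_live_telemetry
    _last_live_telemetry = doc            # updated on every WebSocket push


def chat_context() -> dict:
    # Prefer what the dashboard is showing right now; fall back to the last
    # orchestration decision only if no live tick has arrived yet.
    return _last_live_telemetry or _last_decision or {}
```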
Accomplishments That We're Proud Of
- A fully end-to-end autonomous maintenance pipeline: sensor data → failure detection → MCP swarm → DRL decision → cost comparison → one-click resolution — no manual steps required
- The cost comparison panel makes the ROI of predictive AI immediately tangible: $15,700 saved per event vs reactive maintenance
- The Shadow Model / DRL conflict detection is a genuinely novel explainability layer — it shows operators why two AI systems disagree and lets them make an informed override
- The chat interface feels live because it actually is — it reads from the same telemetry the dashboard is displaying, not cached orchestration output
What We Learned
- Elastic Agent Builder's /converse API is powerful when paired with injected context — it turns a generic LLM into a domain-specific operator assistant with zero fine-tuning
- ES|QL makes complex aggregations inside an agentic pipeline remarkably clean — one query replaces what would otherwise be multiple API calls
- WebSocket state management and server-side halt logic need to be designed together from the start; retrofitting halt/resume into a live streaming system creates subtle race conditions at every layer
- Deep RL for maintenance scheduling is viable even on small datasets when the reward function is carefully designed around real cost structures
What's Next for Gantry
- Multi-unit fleet view — extend the twin to monitor 10+ engines simultaneously with a fleet health heatmap
- Causal root-cause analysis — use Elastic's ML anomaly detection to trace which sensor pattern caused the failure, not just that a failure occurred
- Automated work order generation — when the agent approves express shipping, auto-generate and dispatch the purchase order via a webhook integration
- Continuous DRL retraining — feed confirmed outcomes (was the repair correct? did the part arrive in time?) back into the training loop so the policy improves with every event
Built With
- elastic-agent-builder
- elasticsearch
- es|ql
- fastapi
- framer-motion
- lucide
- mcp-(model-context-protocol)
- nasa-c-mapss-dataset
- openai-gym
- ppo-(deep-reinforcement-learning)
- python
- react
- stable-baselines3
- tailwind-css
- uvicorn
- vite
- websockets