ForgeMind

Title Page
Deteriorating Machines
Reset

Inspiration

Machines often fail, despite all the data lying around; machines tend to fail, and the entire optimal pipeline stops! Nobody connects sensor readings to decisions fast enough. A vibration spike at 2 AM becomes a three-day production halt by Monday morning. While I was sitting in my Operations class reading a similar case study, I thought of leveraging AI!

We don't just want to create a RUL calculator, cause thats very common! We saw a lot of projects and demos, in every predictive-maintenance demo: a beautiful deep-learning model that outputs a single number and then... nothing. No translation. No action. No context.

We wanted to build the thing that comes after the ML model. The part where a messy human fault description ("bearing overheating on Machine 3") gets turned into a grounded, machine-readable diagnosis, run through a real trained CNN-LSTM, converted into capacity impact, and surfaced on a live factory floor dashboard; all in under a few seconds. That's ForgeMind.

What it does

ForgeMind is an end-to-end predictive maintenance decision system. You type a fault description in natural language; it gives you back a full operational response.

The pipeline:

Input Guard filters noise and validates input.
Diagnostic Agent (Gemini) translates the fault text into a structured sensor-anomaly tensor: raw text → (50, 18) array of realistic sensor values.
DL Engine runs that tensor through a trained CNN-LSTM to predict Remaining Useful Life.
Capacity Agent (Gemini) reasons over the predicted RUL and current factory state to compute system-level capacity impact.
Ops Analytics detects RUL cliffs, sensor saturation, and degradation trends; schedules predictive maintenance.
Floor Manager (Gemini) generates a human-readable decision for the operator.
Terminal Dashboard (Textual UI) displays everything live across 4 panes: sensor feed, capacity dashboard, agent comms log, and a chaos engine for injecting faults.

The core design principle: LLMs handle interpretation and communication; the CNN-LSTM handles prediction; rules handle safety-critical logic. No component does something it shouldn't.

How we built it

Deep Learning: CNN + LSTM hybrid trained on NASA's N-CMAPSS turbofan degradation dataset. CNN layers capture local degradation patterns across 18 sensor channels; LSTM layers model long-term temporal dependencies over a 50-timestep sliding window. Trained with NASA's asymmetric scoring function so the model is penalized more for late failure predictions than early ones; which matches real-world maintenance cost.
Agent Layer: Three Gemini-powered agents using structured output schemas via Pydantic. Each agent has its own API key, its own prompt contract, and its own fallback path if the API fails.
Factory State: A single source of truth object that every component reads from and writes to; no duplicated state across UI, agents, and ML.
Terminal UI: Textual + Rich for a real-time four-pane dashboard with live sparklines, color-coded machine health, and non-blocking async workers so heavy ML runs don't freeze the UI.
Offline resilience: A fallback cache for Gemini responses means the demo keeps running even if the API rate-limits or the WiFi dies at the venue.

Challenges we ran into

Agent contracts are brittle. Getting Gemini to consistently return a valid (50, 18) sensor-spike specification took several iterations of prompt engineering, Pydantic validation, and a retry-with-fallback layer. One malformed JSON from an LLM can crash a whole system if you don't plan for it.
Keeping the terminal UI responsive. Running PyTorch inference + three Gemini calls on the main thread froze Textual. Had to move every heavy operation into @work(thread=True) workers and use call_from_thread for state updates.

What we learned

Grounding matters more than fluency. An LLM that hallucinates a sensor value wrecks the whole pipeline. Structured output + Pydantic validation + a real ML model for the numeric prediction made the system trustworthy in a way that pure-LLM "agent" demos never feel.
The scaler is part of the model. We almost shipped a version where the weights lived in one folder and the scaler lived somewhere else. If they drift apart, inference silently lies. Treating best_model.pt and scaler.pkl as a single atomic artifact saved us.
Fallbacks aren't optional for live demos. Every external call (Gemini, API keys, etc.) needs a degraded-but-alive path. Our demo runs with or without working API keys.
Terminal UIs are underrated. Textual gave us a genuinely impressive-looking dashboard in a fraction of the time a web app would have taken.

What's next for ForgeMind

Multi-model ensemble - swap in a Transformer-based RUL predictor and let a meta-agent reconcile disagreements.
Closed-loop scheduling - wire the Ops Analytics maintenance queue into an actual CMMS (computerized maintenance management system) so recommendations become work orders automatically.
Explainability layer - show operators why a RUL dropped, not just that it did. SHAP values per sensor channel, surfaced in a fifth terminal pane.

Built With

Updates

vybhav72954 Chaturvedi started this project — Apr 18, 2026 06:44 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.