Rescuing the MIMs: AI-Powered Major Incident Command
Inspiration
Major incident response becomes chaotic at exactly the moment teams need clarity most.
When a critical service fails, responders must quickly understand the incident, find the right operational knowledge, choose a safe remediation path, gain approval, execute the change, and verify recovery. In practice, that information is often fragmented across incident records, knowledge articles, implementation plans, validation steps, and individual experience.
I wanted to explore a simple question:
What would it look like if AI could reduce that cognitive load without removing human control?
That led me to build Rescuing the MIMs: AI-Powered Major Incident Command, an approval-gated incident-intelligence platform that turns incoming incidents into structured, reviewable, and auditable remediation workflows.
What it does
Rescuing the MIMs turns incoming operational and cyber incidents into approval-gated remediation workflows.
The AI coordinates the response; it does not receive unrestricted authority to change production systems.
For each incident, the platform:
- classifies whether Major Incident Management review is required;
- retrieves relevant Knowledge Base Articles, Detailed Implementation Plans, validation playbooks, and execution policies;
- proposes a sequenced remediation plan;
- pauses before execution until a human approves each action;
- executes permitted UAT actions through whitelisted Ansible playbooks;
- validates the outcome and records evidence;
- supports a separately approved rollback action where required.
The system supports both interactive incident submission through the dashboard and live ingestion through a Redis-backed queue.
The dashboard presents the incoming incident queue in a compact review table, including the incident ID, service, priority, classification, confidence score, and current workflow status.
Live demonstration scenario
The live demo uses a Salesforce SSO redirect loop caused by stale SAML certificate metadata.
The system creates a UAT-first workflow that:
- checks whether the active Salesforce-side certificate fingerprint has drifted;
- compares the vendor metadata with the approved identity-provider metadata;
- patches the UAT metadata and restarts a mock authentication deployment in Kubernetes;
- validates rollout readiness and runs a pilot-login check;
- records the UAT result before any production change is considered.
Each execution step requires explicit human approval.
The platform also ingests a broader set of normalized cyber incidents, including application exploitation, endpoint compromise, data exfiltration, and security-triage events. These incidents are matched against reusable operational playbooks and surfaced for review in the same queue.
How I built it
I built the project as a modular incident-response platform using Python, FastAPI, React, MongoDB, Redis, Ansible, Docker, and Google Cloud Platform.
Application architecture
Incoming incidents or streamed cyber events
↓
FastAPI API + Redis ingestion worker
↓
Workflow coordinator
↓
MongoDB operational memory
(KBAs, DIPs, validation playbooks, execution policies)
↓
Approval-gated remediation plan
↓
Whitelisted Ansible execution in UAT
↓
Validation evidence + rollback support
↓
React review dashboard
Backend
The FastAPI backend exposes endpoints for:
- creating workflows from manual or sample incident payloads;
- listing saved workflows for review;
- approving individual remediation actions;
- exposing available datasets and execution settings;
- checking service health.
The workflow coordinator maintains a structured workflow state containing:
- the incident details;
- the AI-generated analysis;
- matched KBAs and DIPs;
- the proposed action plan;
- approved action IDs;
- execution results;
- validation evidence;
- notes for the incident reviewer.
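A minimal sketch of how a workflow state like the one listed above could be modeled. The class, field, and method names are hypothetical and chosen for illustration; they are not taken from the project's code.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class WorkflowState:
    """Hypothetical shape of one workflow record; all names are illustrative."""
    incident: dict[str, Any]
    analysis: str = ""
    matched_kbas: list[str] = field(default_factory=list)
    matched_dips: list[str] = field(default_factory=list)
    action_plan: list[dict[str, Any]] = field(default_factory=list)
    approved_action_ids: set[str] = field(default_factory=set)
    execution_results: dict[str, Any] = field(default_factory=dict)
    validation_evidence: list[str] = field(default_factory=list)
    reviewer_notes: list[str] = field(default_factory=list)

    def approve(self, action_id: str) -> None:
        """Record approval for one planned action; unknown IDs are rejected."""
        if action_id not in {action["id"] for action in self.action_plan}:
            raise KeyError(f"unknown action: {action_id}")
        self.approved_action_ids.add(action_id)

    def pending_actions(self) -> list[str]:
        """Return IDs of planned actions still awaiting approval, in plan order."""
        return [
            action["id"]
            for action in self.action_plan
            if action["id"] not in self.approved_action_ids
        ]


# Example: two planned steps, only the first one approved so far.
state = WorkflowState(
    incident={"id": "INC-001", "service": "salesforce-sso"},
    action_plan=[{"id": "check-drift"}, {"id": "patch-metadata"}],
)
state.approve("check-drift")
print(state.pending_actions())  # ['patch-metadata']
```

Keeping approvals as a set of action IDs, separate from the plan itself, makes it easy to show a reviewer exactly which steps are still gated.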
Operational memory
I used MongoDB as an operational-memory layer, accessed through an MCP-compatible service.
The database stores:
- similar incidents;
- Detailed Implementation Plans;
- Knowledge Base Articles;
- validation playbooks;
- execution policies.
This allows the platform to retrieve context rather than generating remediation instructions from scratch.
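To illustrate the idea of retrieving context instead of generating it, here is a small sketch that ranks stored documents by keyword overlap with the incident summary. It uses an in-memory list as a stand-in for the MongoDB collections, and the document titles are invented examples. The project's actual retrieval logic may work quite differently.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of whitespace-separated words."""
    return set(text.lower().split())


def retrieve(incident_summary: str, documents: list[dict], limit: int = 2) -> list[dict]:
    """Return up to `limit` documents sharing the most title words with the summary."""
    query = tokenize(incident_summary)
    scored = []
    for doc in documents:
        overlap = len(query & tokenize(doc["title"]))
        if overlap:  # documents with no shared words are never returned
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:limit]]


# Hypothetical operational-memory entries (stand-in for database records).
documents = [
    {"type": "KBA", "title": "Salesforce SSO redirect loop after certificate rotation"},
    {"type": "DIP", "title": "Refresh SAML certificate metadata in UAT"},
    {"type": "KBA", "title": "Endpoint isolation checklist"},
]

matches = retrieve("Salesforce SSO redirect loop stale SAML certificate metadata", documents)
print([doc["type"] for doc in matches])  # ['KBA', 'DIP']
```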
Execution and safety controls
Approved actions are executed through whitelisted Ansible playbooks. I intentionally constrained the live demonstration to a UAT namespace in Google Kubernetes Engine.
The execution policy includes controls such as:
- UAT-only remediation;
- explicit human approval before execution;
- a restricted cluster and GCP location;
- a whitelist of permitted playbooks;
- simulated execution by default;
- real execution only when explicitly enabled;
- a separately generated rollback action requiring its own approval.
This design keeps the AI in a coordination role while maintaining a clear boundary between recommendation and execution.
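The controls above can be pictured as a single gate that every action passes through before anything runs. The following is a sketch under assumptions: the policy fields, playbook file names, and return strings are hypothetical, and the cluster and location restrictions are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionPolicy:
    """Hypothetical policy record; names and defaults are assumptions."""
    allowed_environment: str = "uat"
    allowed_playbooks: frozenset[str] = frozenset(
        {"check_cert_drift.yml", "patch_uat_metadata.yml"}  # illustrative names
    )
    real_execution_enabled: bool = False  # simulated execution by default


def decide(policy: ExecutionPolicy, environment: str, playbook: str, approved: bool) -> str:
    """Return 'blocked: <reason>', 'simulate', or 'execute' for one action."""
    if environment != policy.allowed_environment:
        return "blocked: environment not permitted"
    if playbook not in policy.allowed_playbooks:
        return "blocked: playbook not whitelisted"
    if not approved:
        return "blocked: awaiting human approval"
    return "execute" if policy.real_execution_enabled else "simulate"


default_policy = ExecutionPolicy()
print(decide(default_policy, "uat", "patch_uat_metadata.yml", approved=True))   # simulate
print(decide(default_policy, "uat", "patch_uat_metadata.yml", approved=False))  # blocked: awaiting human approval
print(decide(default_policy, "prod", "patch_uat_metadata.yml", approved=True))  # blocked: environment not permitted
```

Ordering the checks so that every failure returns before the execute branch means a misconfigured flag alone can never trigger a real run.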
Streaming and normalization
I also created a normalization and ingestion pipeline for cyber events.
A publisher script sends events into Redis with configurable delays and burst patterns, allowing the dashboard to simulate a live operational queue. A worker consumes these events, transforms them into incident workflows, and saves them for review.
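The publish-and-consume flow can be sketched as follows. To keep the example runnable without a server, an in-memory `deque` stands in for the Redis queue. The event fields and the `awaiting_review` status are assumptions made for illustration.

```python
import json
from collections import deque

# In-memory stand-in for the Redis queue (assumption: events travel as JSON strings).
queue: deque[str] = deque()
saved_workflows: list[dict] = []


def publish(event: dict) -> None:
    """Publisher side: serialize an event and append it to the queue."""
    queue.append(json.dumps(event))


def consume_one() -> bool:
    """Worker side: take the oldest event, wrap it as a workflow, and save it."""
    if not queue:
        return False
    event = json.loads(queue.popleft())
    saved_workflows.append({
        "incident_id": event["event_id"],
        "source_event": event,
        "status": "awaiting_review",  # hypothetical initial status
    })
    return True


publish({"event_id": "EVT-1", "category": "endpoint_compromise"})
publish({"event_id": "EVT-2", "category": "data_exfiltration"})
while consume_one():
    pass

print([w["incident_id"] for w in saved_workflows])  # ['EVT-1', 'EVT-2']
```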
I normalized a dataset of 16,860 cyber events and seeded a representative subset into the operational-memory layer. The current demo includes reusable mappings for categories such as:
| Cyber category | Example remediation context |
|---|---|
| Application exploitation | Application-server investigation and containment |
| Data exfiltration | Evidence preservation and access review |
| Endpoint compromise | Endpoint isolation and remediation |
| Security triage | Manual review and escalation |
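A mapping like the table above can be expressed as a simple lookup. In this sketch the key format and the fallback to security triage for unrecognized categories are assumptions, not confirmed behavior of the project.

```python
REMEDIATION_CONTEXT = {
    "application_exploitation": "Application-server investigation and containment",
    "data_exfiltration": "Evidence preservation and access review",
    "endpoint_compromise": "Endpoint isolation and remediation",
    "security_triage": "Manual review and escalation",
}


def normalize_category(raw: str) -> str:
    """Map a free-form category label to a known key, else fall back to triage."""
    key = raw.strip().lower().replace("-", " ").replace(" ", "_")
    return key if key in REMEDIATION_CONTEXT else "security_triage"


def remediation_context(raw_category: str) -> str:
    """Look up the remediation context for a raw category label."""
    return REMEDIATION_CONTEXT[normalize_category(raw_category)]


print(remediation_context("Data Exfiltration"))   # Evidence preservation and access review
print(remediation_context("unknown-event-type"))  # Manual review and escalation
```

Routing anything unrecognized to manual review keeps unexpected event types in front of a human instead of dropping them.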
Cloud infrastructure
The live remediation path runs against Google Cloud Platform using:
- Google Kubernetes Engine;
- Terraform-managed infrastructure;
- Kubernetes namespaces, roles, and bindings;
- Docker Compose for local orchestration;
- Cloud Run packaging for the hosted review dashboard and API.
Challenges I ran into
Handling rollback correctly
Rollback required more than running a second playbook. I implemented it as a distinct, auditable action linked to the original remediation step, with its own explicit approval requirement.
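One way to model that link is sketched below, with hypothetical names: the rollback is a separate action record that points back to the step it reverses and always starts unapproved, even when the original step was approved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    """Hypothetical action record; field names are illustrative."""
    action_id: str
    playbook: str
    approved: bool = False
    rollback_of: Optional[str] = None  # ID of the remediation step this reverses


def make_rollback(original: Action, rollback_playbook: str) -> Action:
    """Create a rollback action linked to the original step; it never inherits approval."""
    return Action(
        action_id=f"{original.action_id}-rollback",
        playbook=rollback_playbook,
        approved=False,
        rollback_of=original.action_id,
    )


# Playbook file names below are placeholders, not the project's real files.
patch = Action("patch-metadata", "patch_uat_metadata.yml", approved=True)
rollback = make_rollback(patch, "restore_uat_metadata.yml")
print(rollback.action_id, rollback.approved, rollback.rollback_of)
# patch-metadata-rollback False patch-metadata
```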
Connecting multiple independent services
The application is composed of several services that need to start and communicate correctly:
React frontend
→ FastAPI backend
→ MongoDB MCP HTTP service
→ MongoDB container

Redis publisher
→ Redis queue
→ Normalized incident worker
→ Saved workflow state

FastAPI approval endpoint
→ Ansible runner
→ GKE control plane
→ UAT Kubernetes deployment
Debugging container startup ordering, service connectivity, and environment configuration required careful iteration.
Turning operational knowledge into reusable context
Real remediation knowledge is not always neatly structured. I needed to model KBAs, DIPs, validation playbooks, and execution policies in a way that was detailed enough to be useful but simple enough to demonstrate clearly.
Building a realistic live remediation demo
The Salesforce SSO scenario required a visible and reversible fault state. I created Kubernetes ConfigMaps representing the approved and active certificate fingerprints, deliberately introduced stale metadata, and used Ansible to detect the drift, apply the corrected value, restart the UAT deployment, and validate the result.
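The drift check at the heart of this scenario comes down to comparing two fingerprint strings. The sketch below shows that comparison in isolation. It assumes fingerprints are hex strings that may differ in case or separators, and the values are made-up placeholders. In the demo itself the comparison is performed by Ansible against the ConfigMaps.

```python
def normalize_fingerprint(value: str) -> str:
    """Strip separators and case so 'AB:CD' and 'abcd' compare equal."""
    return value.replace(":", "").replace(" ", "").lower()


def detect_drift(approved: str, active: str) -> dict:
    """Compare the approved and active fingerprints and report the corrective target."""
    drifted = normalize_fingerprint(approved) != normalize_fingerprint(active)
    return {
        "drifted": drifted,
        "target_fingerprint": approved if drifted else None,
    }


# Placeholder values for illustration only.
approved_fp = "AA:BB:CC:DD"
stale_fp = "11:22:33:44"

print(detect_drift(approved_fp, stale_fp))
# {'drifted': True, 'target_fingerprint': 'AA:BB:CC:DD'}
print(detect_drift(approved_fp, "aabbccdd"))
# {'drifted': False, 'target_fingerprint': None}
```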
Keeping the project focused
There were many directions I could have expanded into: production approvals, richer agent orchestration, more infrastructure automation, more datasets, and deeper observability. One of the main challenges was keeping the hackathon version focused on a complete, demonstrable workflow.
Accomplishments that I am proud of
The project is more than a static AI prototype. It demonstrates a working end-to-end operational workflow.
The current version can:
- ingest incidents through a dashboard or Redis-backed stream;
- classify operational and cyber incidents;
- retrieve matched remediation context from MongoDB;
- generate a UAT-first action plan;
- require explicit approval before each execution step;
- run whitelisted Ansible playbooks;
- patch a live Kubernetes ConfigMap;
- restart and validate a deployment in GKE;
- preserve execution evidence and workflow state;
- generate a separately approved rollback action.
I also normalized a dataset containing 16,860 cyber events and added reusable mappings for application exploitation, endpoint compromise, data exfiltration, and security-triage scenarios.
What I learned
I learned that AI is most useful in Major Incident Management when it reduces cognitive overload rather than attempting to replace operational judgment.
The model does not need to be the final decision-maker to provide value. It can help by:
- structuring incomplete incident information;
- identifying likely remediation context;
- presenting a clear sequence of next actions;
- reducing the time spent searching through documentation;
- tracking what has already been approved and executed;
- preserving evidence for validation and audit;
- making rollback options visible before they are needed.
I also learned that the quality of operational knowledge matters as much as the model itself. The platform becomes more useful as teams contribute their own KBAs, DIPs, validation checks, and execution policies.
Finally, I learned that a strong prototype does not need to pretend to be a fully finished production platform. A focused UAT-first workflow with clear safety boundaries can demonstrate the core idea more convincingly than a broad but shallow feature set.
What's next for Rescuing the MIMs: AI-Powered Major Incident Command
The next step is to evolve the prototype into a more production-ready incident-command platform.
Planned improvements include:
- expanding the library of service-specific remediation plans;
- adding richer semantic retrieval across operational documentation;
- introducing role-based approval groups for Major Incident Managers, service owners, and change approvers;
- integrating with enterprise incident platforms such as ServiceNow;
- improving observability with richer execution logs, metrics, and audit views;
- adding more sophisticated validation checks and failure handling;
- supporting staged promotion from UAT to production after successful validation;
- expanding the normalized event pipeline with Pub/Sub and managed cloud services;
- packaging the deployment for repeatable installation through Terraform and Helm;
- improving the dashboard for concurrent incidents and high-volume operational queues.
The long-term vision is not a system that blindly automates production changes.
It is an AI-assisted incident-command layer that helps teams respond faster, reuse institutional knowledge, make safer decisions, and maintain human control when it matters most.