Inspiration

I built MIMIntel because major incident teams often have to make high-pressure decisions while switching between alerts, logs, past incidents, knowledge articles, runbooks, and approval processes. In real environments, the delay rarely comes from detecting the issue alone. It comes from understanding impact, finding the right operational context, deciding what action is safe, and getting that action approved.

MIMIntel was inspired by that gap: turning noisy incident signals into a structured, AI-assisted workflow that helps responders move from detection to decision faster, while still keeping humans in control of remediation.

What it does

MIMIntel ingests incident events and uses Splunk AI capabilities to support major incident classification, context retrieval, action planning, and approval-gated remediation.

The system can classify whether an incident is likely to be a major incident, enrich it with relevant known issues, previous incident patterns, knowledge base articles (KBAs), diagnostic investigation procedures (DIPs), and validation playbooks, then generate a recommended workflow for responders to review. Instead of automatically executing risky changes, MIMIntel creates a human approval checkpoint before any remediation step is run.

Once approved, remediation actions can be executed in a controlled UAT environment, with validation evidence captured back into the workflow. The result is an end-to-end incident intelligence loop: detect, classify, enrich, recommend, approve, execute, and validate.
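
The loop above can be pictured as a small state machine. This is a minimal sketch under my own assumptions, not the project's actual code: the names `Stage`, `ALLOWED`, and `advance` are hypothetical, and the `REJECTED` stage is added only to show what a declined approval might look like.

```python
from enum import Enum


class Stage(Enum):
    DETECTED = "detected"
    CLASSIFIED = "classified"
    ENRICHED = "enriched"
    RECOMMENDED = "recommended"
    APPROVED = "approved"
    EXECUTED = "executed"
    VALIDATED = "validated"
    REJECTED = "rejected"  # assumption: a reviewer can decline the plan


# Each stage may only advance to the stages listed here.
ALLOWED = {
    Stage.DETECTED: {Stage.CLASSIFIED},
    Stage.CLASSIFIED: {Stage.ENRICHED},
    Stage.ENRICHED: {Stage.RECOMMENDED},
    Stage.RECOMMENDED: {Stage.APPROVED, Stage.REJECTED},
    Stage.APPROVED: {Stage.EXECUTED},
    Stage.EXECUTED: {Stage.VALIDATED},
    Stage.VALIDATED: set(),
    Stage.REJECTED: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return target if the transition is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

The point of the table is that execution is unreachable from a recommendation without passing through approval, so skipping the human checkpoint fails loudly instead of silently.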

How I built it

I built MIMIntel as a cloud-native incident workflow demo with a FastAPI backend, a React dashboard, Redis-based incident queueing, MongoDB-backed operational memory, and Ansible remediation playbooks.

The backend coordinates the incident workflow lifecycle, including incident analysis, context retrieval, action planning, approval handling, execution, and validation. I used Redis to simulate a live queue of normalized incidents, while MongoDB stores operational memory such as similar incidents, known business impacts, diagnostic investigation procedures, knowledge base articles, validation playbooks, and execution policies.
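
To illustrate the queue-plus-memory idea, here is a rough sketch that uses in-memory stand-ins: a `deque` in place of the Redis queue and a plain dict in place of MongoDB. The field names, the sample incident, the playbook file name, and the `next_enriched_incident` helper are all hypothetical and are not taken from the project.

```python
from collections import deque

# Stand-in for the Redis queue of normalized incidents (hypothetical fields).
incident_queue = deque([
    {"id": "INC-1", "service": "salesforce-sso", "summary": "SAML login failures"},
])

# Stand-in for MongoDB operational memory, keyed by service name (assumption).
operational_memory = {
    "salesforce-sso": {
        "similar_incidents": ["INC-0"],
        "kbas": ["KBA: rotate SAML certificate"],
        "validation_playbooks": ["validate_sso_login.yml"],
    },
}


def next_enriched_incident(queue, memory):
    """Pop the oldest incident and attach any context stored for its service."""
    if not queue:
        return None
    incident = queue.popleft()
    context = memory.get(incident["service"], {})
    # Return a new dict so the queued incident itself is not mutated.
    return {**incident, "context": context}
```

In a real deployment the two stand-ins would be replaced by Redis and MongoDB clients, but the shape of the step stays the same: take one incident, look up what is already known, and pass both on to planning.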

For remediation, I used Ansible playbooks to demonstrate controlled operational actions, including a Salesforce SAML certificate drift scenario in a UAT Kubernetes environment. I deployed the supporting infrastructure on GCP, using GKE and Terraform-managed resources, along with Kubernetes namespaces, service accounts, and controlled execution policies.

Splunk sits at the observability and intelligence layer, providing the incident visibility, search, and analysis that the AI-assisted workflow builds on.

Challenges I ran into

One major challenge was designing the boundary between AI recommendation and operational safety. For incident response, it is not enough for AI to suggest an action. The system also needs approval gates, environment restrictions, validation steps, and rollback planning.
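
As a hedged illustration of that boundary, the sketch below checks a remediation request against three guards before anything runs: a recorded approver, an environment allowlist, and a playbook allowlist. The names `ExecutionRequest`, `policy_violations`, and the playbook file name are assumptions made for this example, not the project's real policy model.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: only UAT and one known playbook are permitted in the demo.
ALLOWED_ENVIRONMENTS = {"uat"}
ALLOWED_PLAYBOOKS = {"rotate_saml_cert.yml"}


@dataclass(frozen=True)
class ExecutionRequest:
    playbook: str
    environment: str
    approved_by: Optional[str] = None  # None means no human has approved yet


def policy_violations(request: ExecutionRequest) -> List[str]:
    """Return every reason the request must not run; an empty list means it may."""
    violations = []
    if not request.approved_by:
        violations.append("no human approval recorded")
    if request.environment not in ALLOWED_ENVIRONMENTS:
        violations.append(f"environment '{request.environment}' is not allowed")
    if request.playbook not in ALLOWED_PLAYBOOKS:
        violations.append(f"playbook '{request.playbook}' is not on the allowlist")
    return violations
```

Returning all violations at once, instead of stopping at the first, gives the reviewer a complete picture of why a request was blocked.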

Another challenge was connecting multiple moving parts into a coherent demo: live incident ingestion, workflow state management, operational memory, approval logic, playbook execution, Kubernetes validation, and Splunk visibility. I also ran into practical integration issues around queues, service startup order, Terraform provider configuration, Kubernetes namespaces, playbook whitelisting, and making sure workflow status transitions were clear enough to demonstrate.

The biggest product challenge was keeping the demo focused. Major incident management can quickly become too broad, so I focused MIMIntel on one clear value chain: turning an incident signal into an explainable, reviewable, approval-gated response workflow.

Accomplishments that I'm proud of

I’m proud that MIMIntel demonstrates more than a static chatbot or mock incident assistant. It models a real operational workflow with state, context, approvals, execution policies, remediation steps, and validation evidence.

The system can process both manual sample incidents and live queued incidents, classify impact, retrieve relevant operational context, and generate structured response plans. It also demonstrates a realistic UAT remediation path for a Salesforce SSO/SAML incident, where proposed actions must be approved before execution.

I’m especially proud of the approval-gated design. The goal was not to replace incident managers or engineers, but to reduce the manual effort required to understand an incident and prepare a safe response.

What I learned

I learned that the most valuable use of AI in major incident management is not simply summarising alerts. The real value comes from connecting observability data, operational history, known procedures, and controlled execution into one decision-support workflow.

I also learned how important it is to make AI recommendations auditable. In an incident environment, responders need to know why an action was suggested, what evidence supports it, what environment it will affect, and whether it has been validated.
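
One way to capture those four questions is a small audit record attached to every recommendation. This is only a sketch under my own assumptions: `RecommendationAudit`, its field names, and `to_audit_line` are hypothetical and do not describe the project's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class RecommendationAudit:
    action: str                       # what was suggested
    rationale: str                    # why it was suggested
    evidence: List[str] = field(default_factory=list)  # what supports it
    target_environment: str = "uat"   # what environment it will affect
    validated: bool = False           # whether recovery has been confirmed


def to_audit_line(record: RecommendationAudit) -> str:
    """Serialize the record as one JSON line with a stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Writing each record as a single JSON line keeps the trail easy to append to and easy to search later.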

Finally, I learned that successful automation in operations is less about full autonomy and more about trustworthy escalation: helping humans make faster, better-informed decisions without removing accountability.

What's next for MIMIntel

Next, I would like to extend MIMIntel with deeper Splunk integration, richer incident correlation, and more production-grade workflow tracking. This includes expanding the Splunk dashboards, improving the audit trail, adding more incident types, and connecting additional remediation playbooks.

Future versions could also support stronger rollback workflows, more granular approval policies, role-based access controls, and integrations with tools such as ServiceNow, Jira, PagerDuty, Slack, and cloud-native monitoring systems.

The long-term vision is for MIMIntel to become an incident command co-pilot: helping teams detect major incidents faster, understand operational impact sooner, recommend safe next actions, and validate recovery with evidence.
