SplunkSense: AI-Powered Incident Investigation and Remediation

MIT License

Copyright (c) 2026

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files...

Inspiration

Modern engineering teams are overwhelmed by alerts. While observability platforms can detect issues, engineers still spend significant time manually investigating logs, correlating metrics, identifying root causes, and deciding on remediation actions.

We wanted to answer a simple question:

What if an AI agent could handle the first stages of incident response automatically?

Our goal was to build a system that not only detects incidents but also investigates them, explains what happened, predicts what might happen next, and assists with remediation.

That idea became SplunkSense.


What it does

SplunkSense is an AI-powered incident response platform that combines Dynatrace observability data with Splunk analytics and AI-driven workflows.

When an incident occurs, SplunkSense:

  1. Ingests incidents from Dynatrace
  2. Launches an automated investigation workflow
  3. Uses Splunk MCP tools to gather logs and metrics
  4. Identifies probable root causes
  5. Generates human-readable explanations
  6. Predicts future failures using historical trends
  7. Recommends remediation actions and executes them once an engineer approves
  8. Maintains a complete audit trail of every action

The result is a system that helps teams move from alert to resolution in seconds instead of minutes.


How we built it

The solution consists of several integrated components:

Dynatrace Integration

Dynatrace provides real-time incidents, service health information, and observability signals.

Splunk Platform

Splunk serves as the analytics and investigation engine, storing and querying operational telemetry.

Splunk MCP Server

Splunk MCP enables AI agents to execute investigation workflows through tool-based interactions with Splunk.
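
To make the tool-based interaction concrete, here is a minimal sketch of the kind of message an agent could send to an MCP server. MCP carries tool invocations as JSON-RPC 2.0 requests with the method tools/call. The tool name run_splunk_query, the index name, and the SPL string are hypothetical and only illustrate the shape of the request; they are not taken from the project or from the Splunk MCP server.

```python
import json
from itertools import count

# JSON-RPC request ids should be unique within a session.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# Hypothetical tool name and query, for illustration only.
message = build_tool_call(
    "run_splunk_query",
    {"query": "search index=main log_level=ERROR | head 20"},
)
print(message)
```

The agent never talks to Splunk directly in this model: it only names a tool and passes arguments, and the MCP server decides how to run the search.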

AI Investigation Engine

An agentic workflow orchestrates multiple investigation steps:

  • Incident collection
  • Log analysis
  • Metric correlation
  • Root-cause discovery
  • Recommendation generation
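
The steps above can be sketched as a small sequential pipeline. This is an assumption about how such an orchestrator might look, not the project's actual implementation: the Investigation class, the step functions, and every value they store are hypothetical placeholders standing in for real Dynatrace and Splunk calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Investigation:
    """Shared state that each step reads from and writes to."""
    incident_id: str
    findings: Dict[str, str] = field(default_factory=dict)
    steps_run: List[str] = field(default_factory=list)


Step = Callable[[Investigation], None]


# Placeholder steps: a real system would call Dynatrace and Splunk here.
def collect_incident(inv: Investigation) -> None:
    inv.findings["service"] = "checkout"


def analyze_logs(inv: Investigation) -> None:
    inv.findings["log_signal"] = f"timeouts in {inv.findings['service']}"


def correlate_metrics(inv: Investigation) -> None:
    inv.findings["metric_signal"] = "connection pool saturated"


def discover_root_cause(inv: Investigation) -> None:
    inv.findings["root_cause"] = (
        f"{inv.findings['metric_signal']} causing {inv.findings['log_signal']}"
    )


def generate_recommendation(inv: Investigation) -> None:
    inv.findings["recommendation"] = "increase pool size (needs approval)"


def run_investigation(incident_id: str, steps: List[Step]) -> Investigation:
    """Run each step in order, recording which steps completed."""
    inv = Investigation(incident_id)
    for step in steps:
        step(inv)
        inv.steps_run.append(step.__name__)
    return inv


result = run_investigation(
    "INC-1",
    [collect_incident, analyze_logs, correlate_metrics,
     discover_root_cause, generate_recommendation],
)
print(result.findings["root_cause"])
```

Keeping the steps as plain functions over shared state makes the order explicit and gives the audit layer a natural place to record what ran.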

Forecasting Engine

Historical Splunk metrics are analyzed to identify resource exhaustion and capacity risks before they become outages.
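
One simple way to flag resource exhaustion early is to fit a straight line to recent usage and extrapolate to a limit. The sketch below assumes a linear trend and uses ordinary least squares; the source does not state which forecasting method the project uses, so treat this as an illustration only. The function names and sample numbers are made up.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct time points")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_threshold(
    hours: Sequence[float],
    usage_pct: Sequence[float],
    threshold: float = 100.0,
) -> Optional[float]:
    """Estimate hours from the last sample until usage reaches threshold.

    Returns None when usage is flat or falling, since the fitted line
    never reaches the threshold in that case.
    """
    slope, intercept = fit_line(hours, usage_pct)
    if slope <= 0:
        return None
    crossing_time = (threshold - intercept) / slope
    return max(0.0, crossing_time - hours[-1])


# Made-up disk usage samples: 60% rising by 2 points per hour.
sample_hours = [0, 1, 2, 3, 4]
sample_usage = [60, 62, 64, 66, 68]
print(hours_until_threshold(sample_hours, sample_usage, threshold=90.0))
```

A linear fit is easy to explain to an on-call engineer, which matters when the forecast is used to justify an action, but it will miss seasonal or bursty patterns.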

Remediation Workflow

Engineers can review recommendations and approve remediation actions through a human-in-the-loop process.
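
A minimal sketch of such an approval gate is shown below, assuming a simple state machine in which an action can run only after an engineer has approved it. The class, status names, and runner callback are hypothetical and not taken from the project.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXECUTED = "executed"


@dataclass
class RemediationRequest:
    action: str
    target: str
    status: Status = Status.PENDING
    reviewer: Optional[str] = None


def review(request: RemediationRequest, reviewer: str, approve: bool) -> None:
    """Record an engineer's decision; each request is reviewed once."""
    if request.status is not Status.PENDING:
        raise ValueError("request has already been reviewed")
    request.reviewer = reviewer
    request.status = Status.APPROVED if approve else Status.REJECTED


def execute(
    request: RemediationRequest,
    runner: Callable[[RemediationRequest], str],
) -> str:
    """Run the remediation only if it has been approved."""
    if request.status is not Status.APPROVED:
        raise PermissionError("remediation requires engineer approval")
    outcome = runner(request)
    request.status = Status.EXECUTED
    return outcome


request = RemediationRequest(action="restart", target="checkout-service")
review(request, reviewer="alice", approve=True)
print(execute(request, lambda r: f"{r.action} issued for {r.target}"))
```

Putting the check inside execute, rather than relying on callers to check first, means no code path can run a remediation that was never approved.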

Audit & Governance Layer

Every AI decision, investigation step, and remediation action is recorded for transparency and accountability.
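
As an illustration, an audit trail can be kept as an append-only list in which each entry stores the hash of the previous one, so later edits become detectable. Hash chaining is an assumption made for this sketch; the source only says that actions are recorded. The AuditTrail class and its field names are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, List


class AuditTrail:
    """Append-only log where each entry is chained to the previous hash."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, str]] = []

    @staticmethod
    def _digest(entry: Dict[str, str]) -> str:
        # Hash every field except the stored hash itself.
        payload = {k: v for k, v in entry.items() if k != "hash"}
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    def record(self, actor: str, event: str, detail: str) -> Dict[str, str]:
        """Append one entry linked to the hash of the previous entry."""
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "event": event,
            "detail": detail,
            "prev_hash": self._entries[-1]["hash"] if self._entries else "",
        }
        entry["hash"] = self._digest(entry)
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Return True if no entry was altered, removed, or reordered."""
        prev_hash = ""
        for entry in self._entries:
            if entry["prev_hash"] != prev_hash:
                return False
            if entry["hash"] != self._digest(entry):
                return False
            prev_hash = entry["hash"]
        return True


trail = AuditTrail()
trail.record("agent", "root_cause", "connection pool saturated")
trail.record("alice", "approval", "approved restart of checkout-service")
print(trail.verify())
```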


Challenges we ran into

One of the biggest challenges was integrating multiple observability systems into a single workflow.

We needed to:

  • Correlate Dynatrace incidents with Splunk telemetry
  • Design reliable agent workflows
  • Handle long-running Splunk searches
  • Create meaningful root-cause summaries
  • Maintain transparency for AI-generated decisions
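
The first item, correlating incidents with telemetry, can be sketched as a match on service name within a time window around the incident start. This is an assumption about one workable approach, not a description of the project's logic. The Incident and LogEvent classes and their fields are hypothetical and do not mirror the Dynatrace or Splunk schemas.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Incident:
    incident_id: str
    service: str
    started_at: datetime


@dataclass(frozen=True)
class LogEvent:
    service: str
    timestamp: datetime
    message: str


def correlate(
    incident: Incident,
    events: Iterable[LogEvent],
    window: timedelta = timedelta(minutes=5),
) -> List[LogEvent]:
    """Return events for the same service within +/- window of the start."""
    earliest = incident.started_at - window
    latest = incident.started_at + window
    matches = [
        e for e in events
        if e.service == incident.service and earliest <= e.timestamp <= latest
    ]
    return sorted(matches, key=lambda e: e.timestamp)


# Made-up sample data.
start = datetime(2026, 1, 1, 12, 0)
incident = Incident("INC-1", "checkout", start)
events = [
    LogEvent("checkout", start + timedelta(minutes=2), "timeout calling db"),
    LogEvent("checkout", start - timedelta(minutes=1), "pool near limit"),
    LogEvent("checkout", start + timedelta(minutes=30), "recovered"),
    LogEvent("search", start, "unrelated warning"),
]
related = correlate(incident, events)
print([e.message for e in related])
```

The window size trades recall for noise: a wider window catches slow-building causes but pulls in more unrelated events.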

Another challenge was balancing automation with operational safety. Instead of allowing fully autonomous actions, we implemented a human approval step before remediation execution.


What we learned

During development, we learned:

  • How AI agents can interact with observability platforms through MCP-based tooling
  • The importance of explainability in operational AI systems
  • Effective patterns for human-in-the-loop automation
  • Techniques for correlating logs, metrics, and incidents across platforms

We also gained hands-on experience building agentic workflows that interact with production-style monitoring environments.


What's next for SplunkSense

Future enhancements include:

  • Multi-agent collaboration for complex investigations
  • Automated runbook generation
  • Incident similarity detection
  • Advanced predictive analytics
  • Integration with ticketing and collaboration platforms
  • Expanded remediation capabilities

Our vision is to transform observability data into autonomous operational intelligence that helps teams resolve incidents faster and more confidently.


Why SplunkSense

SplunkSense transforms observability data into intelligent action.

By combining Dynatrace, Splunk, MCP tooling, and AI-driven workflows, it reduces investigation time, improves operational efficiency, and empowers engineers to focus on solving problems rather than searching for information.
