SplunkSense: AI-Powered Incident Investigation and Remediation
MIT License
Copyright (c) 2026
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files...
Inspiration
Modern engineering teams are overwhelmed by alerts. While observability platforms can detect issues, engineers still spend significant time manually investigating logs, correlating metrics, identifying root causes, and deciding on remediation actions.
We wanted to answer a simple question:
What if an AI agent could handle the first stages of incident response automatically?
Our goal was to build a system that not only detects incidents but also investigates them, explains what happened, predicts what might happen next, and assists with remediation.
That idea became SplunkSense.
What it does
SplunkSense is an AI-powered incident response platform that combines Dynatrace observability data with Splunk analytics and AI-driven workflows.
When an incident occurs, SplunkSense:
- Ingests incidents from Dynatrace
- Launches an automated investigation workflow
- Uses Splunk MCP tools to gather logs and metrics
- Identifies probable root causes
- Generates human-readable explanations
- Predicts future failures using historical trends
- Recommends remediation actions and executes them once an engineer approves
- Maintains a complete audit trail of every action
The result is a system that helps teams move from alert to resolution in seconds instead of minutes.
How we built it
The solution consists of several integrated components:
Dynatrace Integration
Dynatrace provides real-time incidents, service health information, and observability signals.
Splunk Platform
Splunk serves as the analytics and investigation engine, storing and querying operational telemetry.
Splunk MCP Server
Splunk MCP enables AI agents to execute investigation workflows through tool-based interactions with Splunk.
AI Investigation Engine
An agentic workflow orchestrates multiple investigation steps:
- Incident collection
- Log analysis
- Metric correlation
- Root-cause discovery
- Recommendation generation
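The steps above can be sketched as a simple sequential pipeline. This is a minimal illustration, not the project's actual code: the `Investigation` class, the step functions, and all placeholder values (service name, log line, memory figure, playbook text) are hypothetical, and the real steps would call Dynatrace and Splunk rather than return fixed data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Investigation:
    """Shared state handed from one investigation step to the next."""
    incident_id: str
    findings: Dict[str, object] = field(default_factory=dict)
    steps_run: List[str] = field(default_factory=list)


def collect_incident(inv: Investigation) -> None:
    # Placeholder: a real step would read the incident from Dynatrace.
    inv.findings["service"] = "checkout"


def analyze_logs(inv: Investigation) -> None:
    # Placeholder: a real step would run a Splunk search through MCP tools.
    inv.findings["error_lines"] = ["java.lang.OutOfMemoryError: Java heap space"]


def correlate_metrics(inv: Investigation) -> None:
    # Placeholder: a real step would pull metrics for the same time window.
    inv.findings["memory_pct"] = 97.0


def discover_root_cause(inv: Investigation) -> None:
    # Toy rule: out-of-memory errors plus high memory usage.
    oom = any("OutOfMemoryError" in line for line in inv.findings["error_lines"])
    high_memory = inv.findings["memory_pct"] > 90.0
    inv.findings["root_cause"] = "memory exhaustion" if oom and high_memory else "unknown"


def recommend(inv: Investigation) -> None:
    # Hypothetical playbook mapping a root cause to a suggested action.
    playbook = {"memory exhaustion": "restart the service and raise its memory limit"}
    inv.findings["recommendation"] = playbook.get(
        inv.findings["root_cause"], "escalate to the on-call engineer"
    )


STEPS: List[Callable[[Investigation], None]] = [
    collect_incident,
    analyze_logs,
    correlate_metrics,
    discover_root_cause,
    recommend,
]


def run_investigation(incident_id: str) -> Investigation:
    """Run every step in order and remember which steps were executed."""
    inv = Investigation(incident_id)
    for step in STEPS:
        step(inv)
        inv.steps_run.append(step.__name__)
    return inv
```

Keeping each step as a plain function over shared state makes it easy to log, reorder, or replace individual steps.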
Forecasting Engine
Historical Splunk metrics are analyzed to identify resource exhaustion and capacity risks before they become outages.
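One simple way to estimate resource exhaustion is to fit a straight line to recent usage and extrapolate to the limit. The sketch below assumes a linear trend and hourly samples. The function names and the 100% limit are illustrative assumptions, and the forecasting method SplunkSense actually uses may differ.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) samples of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_exhaustion(
    hours: Sequence[float],
    usage_pct: Sequence[float],
    limit_pct: float = 100.0,
) -> Optional[float]:
    """Hours from the last sample until the fitted trend reaches limit_pct.

    Returns None when usage is flat or falling, so no exhaustion is projected.
    """
    slope, intercept = fit_line(hours, usage_pct)
    if slope <= 0:
        return None
    crossing_hour = (limit_pct - intercept) / slope
    return max(0.0, crossing_hour - hours[-1])
```

For example, with hourly samples of 60, 62, 64, 66, and 68 percent, the fitted slope is 2 points per hour and the projected time to 100 percent is 16 hours after the last sample.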
Remediation Workflow
Engineers can review recommendations and approve remediation actions through a human-in-the-loop process.
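A human-in-the-loop gate can be modeled as a small state machine in which execution is refused unless an engineer has approved the request. The sketch below is illustrative only: `RemediationRequest`, `Status`, `ApprovalRequired`, and the `runner` callback are hypothetical names, not part of SplunkSense's published interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXECUTED = "executed"


class ApprovalRequired(Exception):
    """Raised when execution is attempted without an approval."""


@dataclass
class RemediationRequest:
    action: str
    target: str
    status: Status = Status.PENDING
    reviewer: str = ""


def review(request: RemediationRequest, reviewer: str, approve: bool) -> None:
    """Record an engineer's decision; a request can be reviewed only once."""
    if request.status is not Status.PENDING:
        raise ValueError(f"request already {request.status.value}")
    request.status = Status.APPROVED if approve else Status.REJECTED
    request.reviewer = reviewer


def execute(
    request: RemediationRequest,
    runner: Callable[[RemediationRequest], None],
) -> None:
    """Run the remediation only if it has been approved."""
    if request.status is not Status.APPROVED:
        raise ApprovalRequired("remediation requires engineer approval")
    runner(request)
    request.status = Status.EXECUTED
```

Putting the check inside `execute` rather than in the caller means no code path can run a remediation without passing through the approval state.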
Audit & Governance Layer
Every AI decision, investigation step, and remediation action is recorded for transparency and accountability.
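An audit layer of this kind can be approximated with an append-only list of timestamped entries. The following is a minimal in-memory sketch under assumed names (`AuditTrail`, `record`, and the entry fields are hypothetical); a real deployment would persist entries to durable storage, for example by indexing them in Splunk.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List


class AuditTrail:
    """Append-only record of investigation and remediation events."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, Any]] = []

    def record(self, incident_id: str, actor: str, event: str, detail: str) -> Dict[str, Any]:
        """Append one entry with a sequence number and a UTC timestamp."""
        entry = {
            "seq": len(self._entries) + 1,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "incident_id": incident_id,
            "actor": actor,  # e.g. "agent" or an engineer's user name
            "event": event,
            "detail": detail,
        }
        self._entries.append(entry)
        return dict(entry)

    def for_incident(self, incident_id: str) -> List[Dict[str, Any]]:
        """Return copies of all entries for one incident, in recorded order."""
        return [dict(e) for e in self._entries if e["incident_id"] == incident_id]

    def to_jsonl(self) -> str:
        """Serialize the trail as JSON Lines, one entry per line."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self._entries)
```

Returning copies from `record` and `for_incident` keeps callers from modifying stored entries by accident.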
Challenges we ran into
One of the biggest challenges was integrating multiple observability systems into a single workflow.
We needed to:
- Correlate Dynatrace incidents with Splunk telemetry
- Design reliable agent workflows
- Handle long-running Splunk searches
- Create meaningful root-cause summaries
- Maintain transparency for AI-generated decisions
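Long-running searches are commonly handled by polling for completion with a timeout and a growing delay between checks. The sketch below shows that pattern in a generic form. It does not call Splunk itself: `poll` is a hypothetical callback that returns `None` while the search is still running and the results once it has finished, and the default timings are arbitrary assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for_search(
    poll: Callable[[], Optional[T]],
    timeout_s: float = 300.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Poll until results arrive, doubling the delay up to max_delay_s.

    Raises TimeoutError if no result is available before timeout_s elapses.
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        result = poll()
        if result is not None:
            return result
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("search did not finish before the timeout")
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
```

The `sleep` and `clock` parameters exist so the function can be tested without real waiting. An agent workflow would pass only `poll` and, if needed, the timing limits.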
Another challenge was balancing automation with operational safety. Instead of allowing fully autonomous actions, we implemented a human approval step before remediation execution.
What we learned
During development, we learned:
- How AI agents can interact with observability platforms through MCP-based tooling
- The importance of explainability in operational AI systems
- Effective patterns for human-in-the-loop automation
- Techniques for correlating logs, metrics, and incidents across platforms
We also gained hands-on experience building agentic workflows that interact with production-style monitoring environments.
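One basic correlation technique is to match log events to an incident by service name and a time window around the incident's start. The sketch below illustrates that idea only. The `Incident` and `LogEvent` types and the window sizes are assumptions, not the data model SplunkSense uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Incident:
    incident_id: str
    service: str
    started_at: datetime


@dataclass(frozen=True)
class LogEvent:
    service: str
    timestamp: datetime
    message: str


def correlate(
    incident: Incident,
    events: Iterable[LogEvent],
    before: timedelta = timedelta(minutes=10),
    after: timedelta = timedelta(minutes=5),
) -> List[LogEvent]:
    """Return events for the incident's service inside the time window, oldest first."""
    start = incident.started_at - before
    end = incident.started_at + after
    matches = [
        e for e in events
        if e.service == incident.service and start <= e.timestamp <= end
    ]
    return sorted(matches, key=lambda e: e.timestamp)
```

A wider window before the start than after it reflects the assumption that causes usually precede the alert.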
What's next for SplunkSense
Future enhancements include:
- Multi-agent collaboration for complex investigations
- Automated runbook generation
- Incident similarity detection
- Advanced predictive analytics
- Integration with ticketing and collaboration platforms
- Expanded remediation capabilities
Our vision is to transform observability data into autonomous operational intelligence that helps teams resolve incidents faster and more confidently.
Why SplunkSense
SplunkSense transforms observability data into intelligent action.
By combining Dynatrace, Splunk, MCP tooling, and AI-driven workflows, it reduces investigation time, improves operational efficiency, and empowers engineers to focus on solving problems rather than searching for information.