Inspiration

SRE teams spend 70%+ of incident time on diagnosis rather than resolution. We asked: what if AI could predict failures before they happen, investigate them autonomously, and deliver a fix — all before users are impacted?

What it does

PRISM is a full-stack platform that connects Splunk observability data (via the Splunk MCP Server), Cisco's Deep Time Series Model (CDTSM) for predictive anomaly detection, and Google Gemini AI for multi-agent reasoning. It predicts incidents, investigates root causes through 5 specialized AI agents, and automatically generates remediation Pull Requests on GitHub.

How we built it

The backend is Fastify + TypeScript orchestrating a multi-agent pipeline. Metrics flow from Splunk through MCP, get split into coarse/fine temporal contexts, and are scored by CDTSM using:

$$\text{score} = (\text{trend\_acceleration} \times 50) + (p90_{\text{divergence}} \times 30) + 20$$
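As a rough illustration, the scoring formula above could be written as a small TypeScript function. This is a sketch, not the project's actual code: the `ScoreInputs` shape and the `riskScore` name are hypothetical, and the assumption that both inputs are normalized to the range 0 to 1 (which would bound the score between 20 and 100) is ours, since the write-up does not state the input ranges.

```typescript
// Hypothetical input shape; field names mirror the terms of the formula.
interface ScoreInputs {
  trendAcceleration: number; // assumed (not stated in the write-up) to lie in [0, 1]
  p90Divergence: number;     // assumed (not stated in the write-up) to lie in [0, 1]
}

// Weighted sum from the formula: 50 and 30 are the term weights, 20 is the constant offset.
function riskScore({ trendAcceleration, p90Divergence }: ScoreInputs): number {
  return trendAcceleration * 50 + p90Divergence * 30 + 20;
}

// Example: moderate acceleration and divergence.
const example = riskScore({ trendAcceleration: 0.5, p90Divergence: 0.5 }); // 25 + 15 + 20 = 60
```

Under that normalization assumption, the constant 20 acts as a floor, so a series with no acceleration and no divergence still scores 20.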

Agents stream results to the client in real time over Server-Sent Events (SSE). The frontend is React 19 with TanStack Router.
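To make the SSE streaming concrete, here is a minimal sketch of how one agent result might be serialized into an SSE frame. The `formatSseEvent` helper, the `agent_result` event name, and the payload fields are hypothetical; only the wire format (`event:` and `data:` lines terminated by a blank line) comes from the SSE specification.

```typescript
// Serialize one message as a Server-Sent Events frame.
// JSON.stringify without an indent argument emits a single line,
// so the payload fits in one "data:" field.
function formatSseEvent(event: string, payload: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(payload)}\n\n`;
}

// Hypothetical agent update; the field names are illustrative only.
const frame = formatSseEvent("agent_result", { agent: "triage", status: "done" });
// In a Fastify handler, a frame like this could be written to the underlying
// Node response (reply.raw) after sending a "Content-Type: text/event-stream" header.
```

On the browser side, an `EventSource` listener registered for the same event name would receive each frame as it arrives.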

Challenges we ran into

The hardest parts were designing a CDTSM context-splitting strategy that yields meaningful predictions, orchestrating 5 agents with interdependent outputs without blocking, and building a GitHub remediation pipeline that produces reviewable code rather than just suggestions.
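One way to run agents with interdependent outputs without blocking is to treat them as a dependency graph and start each agent as soon as its inputs resolve. The sketch below shows that general technique, not PRISM's actual orchestrator: `AgentSpec`, `runAgents`, and the agent names in the usage example are all hypothetical.

```typescript
// Each agent receives the outputs of the agents it depends on, keyed by name.
type AgentFn = (inputs: Record<string, unknown>) => Promise<unknown>;

interface AgentSpec {
  dependsOn: string[];
  run: AgentFn;
}

async function runAgents(
  specs: Record<string, AgentSpec>,
): Promise<Record<string, unknown>> {
  const started = new Map<string, Promise<unknown>>();
  const visiting = new Set<string>(); // names currently being resolved, for cycle detection

  const start = (name: string): Promise<unknown> => {
    const existing = started.get(name);
    if (existing) return existing; // each agent runs at most once
    const spec = specs[name];
    if (!spec) return Promise.reject(new Error(`unknown agent: ${name}`));
    if (visiting.has(name)) return Promise.reject(new Error(`cycle at agent: ${name}`));

    visiting.add(name);
    const deps = spec.dependsOn.map(start); // kicks off dependencies immediately
    visiting.delete(name);

    // The agent waits only on its own dependencies; unrelated agents keep running.
    const promise = Promise.all(deps).then((results) => {
      const inputs: Record<string, unknown> = {};
      spec.dependsOn.forEach((dep, i) => {
        inputs[dep] = results[i];
      });
      return spec.run(inputs);
    });
    started.set(name, promise);
    return promise;
  };

  const names = Object.keys(specs);
  const outputs = await Promise.all(names.map(start));
  const byName: Record<string, unknown> = {};
  names.forEach((n, i) => {
    byName[n] = outputs[i];
  });
  return byName;
}

// Usage with three illustrative agents: two independent, one that needs both.
const pipeline = runAgents({
  metrics: { dependsOn: [], run: async () => "metrics" },
  logs: { dependsOn: [], run: async () => "logs" },
  rootCause: {
    dependsOn: ["metrics", "logs"],
    run: async (inputs) => `${inputs.metrics}+${inputs.logs}`,
  },
});
```

In this usage, `metrics` and `logs` start concurrently, and `rootCause` begins only once both have finished. A failure in any agent rejects the whole run, which a real pipeline would likely want to soften with per-agent error handling.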

Accomplishments that we're proud of

  • Built a multi-agent incident investigation platform that combines Splunk operational data, AI reasoning, and GitHub workflows.
  • Successfully traced incidents back to the most likely pull request using telemetry, deployment, and code change correlation.
  • Implemented human-in-the-loop remediation, allowing engineers to review AI recommendations before creating fix PRs.
  • Integrated Splunk MCP to enable agents to investigate incidents directly from operational data rather than static datasets.
  • Added predictive reliability capabilities using time-series forecasting to identify potential issues before they become critical incidents.

What we learned

MCP as an integration protocol dramatically simplifies AI ↔ data-source communication. Multi-agent architectures shine when each agent has a narrow, well-defined scope with clear inputs/outputs.

What's next for PRISM

Enhance predictive incident prevention with advanced forecasting and anomaly detection models.
