Inspiration

Every "AI for observability" product on the market — Dynatrace Davis, Datadog Watchdog, AppDynamics Cognition, Splunk's own AI Troubleshooting Agent — answers the same question: "given this symptom, what's the root cause?" They all traverse upstream, from a broken thing back to the thing that broke it.

But in an incident, the on-call engineer already knows what broke — the alert just fired on paymentservice. The question that actually decides whether it's a Sev-3 or a Sev-1 is the opposite one: "what's about to break next, and who's going to feel it?" Nobody answers that proactively. We wanted to flip the arrow — traverse downstream from root cause to predicted cascade, and write that prediction into the alert before the cascade happens.

What it does

When a Splunk alert fires, Blast Radius Predictor predicts — within seconds, at alert time — which downstream services, business workflows, and user sessions will be impacted in the next 15 minutes, and pushes the prediction into the alert payload itself.

A single alert on paymentservice becomes:

Downstream services: those likely to degrade, ranked by propagation probability.

Business workflows: those affected (e.g., checkout 38%, search 22%), plus the RUM sessions at risk.

Cascade ETA: when each downstream service is predicted to degrade.

Revenue risk: estimated in $/min.

It surfaces this everywhere responders already look: a custom Splunk Alert Action, a | brpredict search command, a Dashboard Studio glass table with a cascade timeline, a Slack Block Kit message, and an MCP tool so Claude Desktop / Cursor can pull the same prediction.
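
As a rough illustration of how those four outputs could travel together, here is a minimal sketch of a single prediction object serialized to JSON. The class name BlastRadiusPrediction comes from the project; every field name, the ServiceImpact helper, and the sample values (service names, ETAs, session count) are assumptions made for this example, not the actual schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ServiceImpact:
    """Hypothetical per-service entry; field names are illustrative."""
    service: str
    probability: float  # propagation probability in [0, 1]
    eta_seconds: int    # predicted time until the service degrades


@dataclass
class BlastRadiusPrediction:
    """Hypothetical payload shape covering the four outputs listed above."""
    alerting_service: str
    horizon_minutes: int
    services: list[ServiceImpact] = field(default_factory=list)
    workflows: dict[str, float] = field(default_factory=dict)  # workflow -> affected fraction
    rum_sessions_at_risk: int = 0
    revenue_risk_per_min: float = 0.0

    def to_json(self) -> str:
        # Rank downstream services by propagation probability, highest first.
        ranked = sorted(self.services, key=lambda s: s.probability, reverse=True)
        body = asdict(self)
        body["services"] = [asdict(s) for s in ranked]
        return json.dumps(body, sort_keys=True)


# Illustrative values only; they are not measurements from the project.
prediction = BlastRadiusPrediction(
    alerting_service="paymentservice",
    horizon_minutes=15,
    services=[
        ServiceImpact("frontend", 0.41, 240),
        ServiceImpact("checkoutservice", 0.82, 90),
    ],
    workflows={"checkout": 0.38, "search": 0.22},
    rum_sessions_at_risk=120,
    revenue_risk_per_min=2400.0,
)
payload = json.loads(prediction.to_json())
```

Keeping the ranking inside to_json means every consumer of the payload sees the services in the same order without re-sorting.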

How we built it

Telemetry: The OpenTelemetry Astronomy Shop (10 microservices) → Splunk OTel Collector → Splunk Observability Cloud (APM Service Map, RUM, Business Workflows).

Graph: We build a weighted dependency graph in NetworkX where each edge is weighted by trace-derived RED metrics (rate, errors, duration) pulled via signalfx-python and the O11y REST API.

The algorithm: A forward-propagation engine does a BFS downstream from the alerting service, multiplying learned per-edge propagation probabilities along each path.

AI layer: Splunk AI Toolkit (MLTK) learns P(downstream fails | upstream fails) per edge; Cisco Deep Time Series forecasts each service's degradation curve; ITSI Predictive Analytics anchors the 30-minute health forecast.

Delivery: A Splunk app (custom alert action + search command), Dashboard Studio, Slack, and an MCP server — all consuming one BlastRadiusPrediction JSON payload.

Discipline: All three Splunk SDKs sit behind a src/brp/sources/ adapter layer, so the propagation engine has no Splunk dependency and can be unit-tested fully offline.
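
To make the engine-behind-an-adapter idea concrete, here is a minimal sketch of forward propagation over per-edge probabilities. It is an illustration under stated assumptions, not the project's code: the EdgeSource interface, InMemoryEdges, predict_cascade, the min_probability cutoff, and the edge values are all hypothetical, and a plain dict stands in for the NetworkX graph. It also swaps plain BFS for a best-first traversal so that, when several paths reach the same service, the service keeps the probability of its strongest path.

```python
import heapq
from typing import Protocol


class EdgeSource(Protocol):
    """Hypothetical adapter interface: the engine sees only this, never an SDK."""

    def downstream_edges(self, service: str) -> dict[str, float]:
        """Return {downstream_service: P(downstream fails | service fails)}."""
        ...


class InMemoryEdges:
    """Offline stand-in for a real adapter, usable in unit tests."""

    def __init__(self, edges: dict[str, dict[str, float]]) -> None:
        self._edges = edges

    def downstream_edges(self, service: str) -> dict[str, float]:
        return self._edges.get(service, {})


def predict_cascade(
    source: EdgeSource, root: str, min_probability: float = 0.2
) -> dict[str, float]:
    """Multiply edge probabilities along paths, keeping the best path per service."""
    best = {root: 1.0}
    heap = [(-1.0, root)]  # max-heap via negated probabilities
    while heap:
        neg_p, service = heapq.heappop(heap)
        p = -neg_p
        if p < best.get(service, 0.0):
            continue  # stale entry: a stronger path was already found
        for nxt, edge_p in source.downstream_edges(service).items():
            candidate = p * edge_p
            # Drop weak paths and anything that does not improve on a known path.
            if candidate >= min_probability and candidate > best.get(nxt, 0.0):
                best[nxt] = candidate
                heapq.heappush(heap, (-candidate, nxt))
    best.pop(root)
    return dict(sorted(best.items(), key=lambda kv: kv[1], reverse=True))


# Illustrative edge probabilities, not learned values.
edges = {
    "paymentservice": {"checkoutservice": 0.8},
    "checkoutservice": {"frontend": 0.5, "emailservice": 0.1},
}
cascade = predict_cascade(InMemoryEdges(edges), "paymentservice")
```

With these sample edges, emailservice falls below the cutoff (0.8 × 0.1 = 0.08) and is left out, while checkoutservice and frontend are returned in ranked order. Because the engine depends only on the EdgeSource interface, a test can pass InMemoryEdges while production code passes an adapter that wraps a real SDK.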

Challenges we ran into

Nobody publishes "forward" edge probabilities: RCA tooling is all upstream-oriented, so we had to derive propagation likelihoods from historical trace co-failure rather than buy them off the shelf.

Avoiding a cascade explosion: Naive BFS lights up the entire graph. We had to decay probability along paths and threshold aggressively to keep predictions precise instead of "everything is on fire."

Translating service impact into business impact: Mapping O11y Business Workflow definitions and RUM sessions onto the predicted set without hand-modeling every workflow.

Keeping the engine testable: Managing dependencies across three different Splunk SDKs — solved with a strict adapter/engine/delivery layering.

Speed: Ensuring the engine runs at alert time, not as an after-the-fact batch job.
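
The first challenge above, deriving forward edge probabilities from historical co-failure, can be sketched with simple counting. This is a plain-Python stand-in under stated assumptions, not the MLTK model: the function name, the incident-window representation (one set of degraded services per window), and the smoothing constant alpha are all hypothetical, and the sketch ignores the time ordering between upstream and downstream failures.

```python
from collections import Counter


def estimate_edge_probabilities(incidents, edges, alpha=1.0):
    """Estimate P(downstream fails | upstream fails) for each edge.

    incidents: list of sets, each holding the services that degraded
               within one incident window (assumed input format).
    edges:     list of (upstream, downstream) pairs from the dependency graph.
    alpha:     additive smoothing so sparse edges are not pinned to 0 or 1;
               an edge with no observations falls back to 0.5.
    """
    upstream_failures = Counter()
    co_failures = Counter()
    for failed in incidents:
        for up, down in edges:
            if up in failed:
                upstream_failures[(up, down)] += 1
                if down in failed:
                    co_failures[(up, down)] += 1
    return {
        edge: (co_failures[edge] + alpha) / (upstream_failures[edge] + 2 * alpha)
        for edge in edges
    }


# Illustrative incident history, not real data.
incidents = [
    {"paymentservice", "checkoutservice"},
    {"paymentservice", "checkoutservice", "frontend"},
    {"paymentservice"},
    {"paymentservice", "checkoutservice"},
    {"frontend"},
]
edges = [("paymentservice", "checkoutservice"), ("checkoutservice", "frontend")]
probs = estimate_edge_probabilities(incidents, edges)
```

In this sample, paymentservice degraded in four windows and checkoutservice degraded alongside it in three, so the smoothed estimate is (3 + 1) / (4 + 2). The 0.5 fallback for unseen edges is a deliberately neutral prior; a precision-focused setup might prefer a lower one.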

Accomplishments

Inverted the entire category: A working forward cascade predictor where every competitor is a backward root-cause finder.

End-to-end and real: Live OTel demo → fault injection → Splunk detector → prediction in the alert payload → Slack + dashboard + Claude, with a reproducible fault-injection evaluation harness (5 scenarios).

Clean architecture: An engine that's decoupled and offline-unit-testable, despite living on top of three Splunk SDKs.

Unified delivery: One prediction payload that fans out to five different surfaces (alert action, SPL command, dashboard, Slack, MCP) without duplication.

What we learned

The data is there: The graph and the metrics for forward prediction already exist in every APM stack — the missing piece was never data, it was direction. Same Service Map, opposite traversal.

Precision is king: Probability decay and thresholding matter more than graph depth; precision beats recall when you're writing into a pager.

Business context resonates: Business-impact translation is the part responders actually act on — "checkout down 38%, $2.4k/min" moves people in a way "12 downstream services" never does.

MCP is powerful: Exposing the prediction as an MCP tool made it trivially composable with AI assistants — the same payload an engineer reads is the one an agent reasons over.

What's next for Blast Radius Predictor

Close the loop: Feed actual post-incident outcomes back to MLTK to continuously sharpen edge probabilities.

Auto-remediation suggestions: Pair each predicted cascade with the runbook, circuit-breaker, or scale action most likely to contain it.

Advanced metrics: Confidence intervals on the cascade ETA, not just point predictions.

Broader topology sources: ITSI service trees, Kubernetes, and service mesh — beyond APM-derived edges.

"What-if" mode: Predict blast radius for a planned deploy or dependency change before shipping it, not just at alert time.
