Inspiration

Every engineer who has been paged at 3am knows the feeling: 47 alerts firing simultaneously, a wall of dashboards showing everything is red, and no clear answer to the only question that matters — what actually caused this?

Traditional monitoring tools give you correlation. "DB latency and API errors are moving together at r=0.91." That observation is technically accurate and completely useless when your checkout service is down and the SLA clock is running.

The inspiration for ARIA was a simple frustration. The gap between detecting that something is wrong (which Splunk does brilliantly) and understanding why it's wrong and what will break next is still filled by human intuition, tribal knowledge, and luck. We wanted to close that gap with mathematics.

We asked: what if instead of more alerts, we gave on-call teams a system that automatically proves causation, not correlation? One that tells you not just "DB latency spiked" but "DB connection pool exhaustion caused the checkout cascade with 94% confidence, and without it the error rate would have been 0.2% not 14.7%"? And what if it could predict which services will fail in the next three minutes — before they actually fail?

That's ARIA.


What it does

ARIA is an autonomous multi-agent incident command system built natively on Splunk. When production breaks, ARIA:

Detects — The Sentinel agent continuously monitors Splunk for anomalies via MCP tools. When an alert fires, it classifies severity (P1–P4) and activates the pipeline.

Proves — The Forensic agent fetches metric time-series from 5+ services via splunk_get_metrics, runs the Peter-Clark (PC) causal discovery algorithm to auto-discover the directed causal structure, then uses DoWhy to estimate effect sizes. The output isn't "these are correlated" — it's "DB connection pool exhaustion caused API errors with 94% confidence." It also generates counterfactual scenarios: "Without the DB exhaustion, checkout errors would have been 0.2% not 14.7%."

Predicts — The Propagation agent performs a probability-weighted BFS over the causal graph to predict which downstream services will fail next, with impact probabilities and ETAs in minutes, before those services actually degrade. This gives on-call teams a 3–5 minute early warning.

Fixes — The Remediation agent generates a step-by-step runbook tailored to the root cause, classifies each step by risk level, and presents a Human-in-the-Loop approval gate. Low-risk steps auto-approve after a countdown; high-risk steps require explicit operator sign-off. Approved steps execute via splunk_execute_action through the MCP Server.

Synthesizes — The Commander assembles a full Incident Brief in markdown — root cause, causal chain, blast radius, remediation status — broadcast live to the War Room dashboard and written back to Splunk.
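The approval gate described in the Fixes step can be sketched as a small decision function. This is a minimal illustration, not ARIA's actual code: the names `Risk`, `RunbookStep`, and `gate_decision` are hypothetical, and the rule that an explicit operator veto overrides the countdown is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Risk(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass
class RunbookStep:
    description: str
    risk: Risk


def gate_decision(step: RunbookStep,
                  operator_approved: Optional[bool],
                  countdown_expired: bool) -> str:
    """Return 'execute', 'wait', or 'rejected' for one runbook step.

    operator_approved is None while the operator has not responded.
    """
    if operator_approved is False:
        return "rejected"   # assumption: an explicit veto always wins
    if operator_approved is True:
        return "execute"    # explicit sign-off works for any risk level
    if step.risk is Risk.LOW and countdown_expired:
        return "execute"    # low-risk steps auto-approve after the countdown
    return "wait"           # high-risk steps wait for explicit sign-off
```

Keeping the decision in a pure function like this makes the gate easy to test separately from the UI countdown and the execution path.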

Everything is visible in real time in the War Room: a live D3 force-directed causal graph that builds as the Forensic agent works, an animated blast radius heatmap, and an agent communications feed showing exactly what each agent is telling the next.

All of this starts autonomously when a Splunk alert fires — no human has to open a dashboard or click anything until the approval gate.


How we built it

Backend — FastAPI + LangGraph (async-native)
We built the orchestration layer as a native async pipeline. LangGraph defines the agent DAG with conditional edges (remediation → approval gate → commander), and when LangGraph's sync executor conflicts with FastAPI's event loop, the pipeline falls back to sequential async execution with identical behavior. The orchestrator writes state to an in-memory cache after every agent step so late-joining WebSocket clients receive a full state replay.

Causal AI — the core differentiator
We implemented a two-stage causal inference pipeline. First, the PC algorithm from causal-learn runs on the multi-service metric DataFrame (response time, error rate, CPU, memory) and produces a directed causal graph — discovering which metrics causally precede which. Second, DoWhy builds a CausalModel with treatment and outcome variables correctly specified (DoWhy ≥0.9 requires both at construction time, not just at estimation), identifies the backdoor estimand, and estimates effect size via linear regression. The result feeds into a counterfactual analyzer and a BFS blast radius predictor with configurable depth decay.
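A probability-weighted BFS with depth decay could look roughly like the sketch below. It is an illustration under assumptions, not the actual predictor: the function name `predict_blast_radius`, the adjacency-list graph shape, the per-hop decay rule, and the `min_prob` cutoff are all hypothetical, and ETA estimation is left out.

```python
from collections import deque


def predict_blast_radius(graph, root, depth_decay=0.8, min_prob=0.05):
    """Estimate downstream impact probabilities from a root-cause node.

    graph maps a service to a list of (downstream_service, edge_probability).
    Returns {service: (impact_probability, hops)}, most likely first.
    """
    best = {}                          # service -> (probability, hops)
    queue = deque([(root, 1.0, 0)])    # (node, probability so far, depth)
    while queue:
        node, prob, depth = queue.popleft()
        for target, edge_prob in graph.get(node, []):
            # Assumed rule: each hop multiplies by the edge weight and a decay.
            p = prob * edge_prob * depth_decay
            if target == root or p < min_prob:
                continue
            # Re-visit a service only if this path is strictly more likely,
            # which also guarantees termination on cyclic graphs.
            if p > best.get(target, (0.0, 0))[0]:
                best[target] = (p, depth + 1)
                queue.append((target, p, depth + 1))
    return dict(sorted(best.items(), key=lambda kv: -kv[1][0]))


# Hypothetical edge weights, for illustration only.
example_graph = {
    "db": [("api", 0.9)],
    "api": [("checkout", 0.8), ("search", 0.1)],
}
blast = predict_blast_radius(example_graph, "db")
```

With these made-up weights, `api` ranks first at one hop and `checkout` follows at two hops; lowering `depth_decay` shrinks the predicted radius faster with distance.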

Splunk MCP integration
The SplunkMCPClient wraps every Splunk operation as a JSON-RPC 2.0 call to the MCP Server endpoint (/services/mcp) with MCP Encrypted Token authentication. Agents never call Splunk REST directly — everything routes through MCP. A full demo data layer means the entire system runs in DEMO_MODE=true without any live Splunk, which was essential for development and testing.
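For readers unfamiliar with the wire format, the sketch below builds a JSON-RPC 2.0 request for an MCP tool call. The `tools/call` method and its `name`/`arguments` parameters follow the MCP specification; everything else is an assumption for illustration, including the helper names, the argument keys passed to `splunk_get_metrics`, and the use of a Bearer header to carry the token. No network request is made.

```python
import itertools
import json

_request_ids = itertools.count(1)  # JSON-RPC ids only need to be unique per session


def build_mcp_tool_call(tool_name, arguments):
    """Serialize one MCP tool invocation as a JSON-RPC 2.0 request body."""
    payload = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(payload)


def build_headers(token):
    """Assumed header layout; the real token scheme may differ."""
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }


# Hypothetical argument names for the splunk_get_metrics tool.
body = build_mcp_tool_call("splunk_get_metrics", {"service": "checkout", "earliest": "-15m"})
```

The resulting body would be POSTed to the /services/mcp endpoint by whatever HTTP client the backend uses.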

Frontend — React + D3
The D3 force-directed causal graph was the hardest frontend problem. The ResizeObserver pattern ensures the simulation centers on actual pixel dimensions rather than zero (the early bug that put all nodes in the top-left corner). A Zustand subscribe() callback keeps graph data in refs so the observer always reads fresh values. The WebSocket hook implements exponential backoff reconnection, and on every connect sends get_state to replay the latest state snapshot — solving the curl-trigger race condition where the pipeline finishes before the client connects.

Splunk app
The app ships five Simple XML dashboards (Overview, Incidents, Causal Analysis, Service Health, Agent Activity) and a modular alert action (aria_trigger.py) that uses the Splunk Python SDK to POST to the ARIA REST API automatically when any configured alert fires.


Challenges we ran into

Port conflict with Splunk — Splunk Enterprise owns port 8000. We initially tried to run FastAPI on 8000 as well, which caused every API call from the frontend to hit Splunk's login redirect instead. The fix was moving ARIA to port 8001 and configuring Vite's proxy so all frontend requests use relative URLs — the browser never makes a cross-origin request, so CORS is not triggered in development at all.

DoWhy API changes — DoWhy 0.9+ changed its constructor to require both treatment and outcome at construction time, not just at identify_effect. The old pattern CausalModel(data=df, outcome=metric, graph=gml) raises a confusing error at runtime. We also had to guard against degenerate metric columns (zero variance) before passing them to the PC algorithm, which was producing numpy division warnings.
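The working constructor pattern can be sketched as follows. This is a hedged example rather than ARIA's code: the helpers `edges_to_gml` and `estimate_effect` are hypothetical names, the edge list stands in for the PC algorithm's output, and `proceed_when_unidentifiable=True` is an assumed choice. DoWhy is imported lazily so the GML helper can be used and tested without it installed.

```python
def edges_to_gml(edges):
    """Render directed edges [(cause, effect), ...] as a GML string for DoWhy."""
    nodes = sorted({name for edge in edges for name in edge})
    node_part = " ".join(f'node[id "{n}" label "{n}"]' for n in nodes)
    edge_part = " ".join(f'edge[source "{s}" target "{t}"]' for s, t in edges)
    return f"graph[directed 1 {node_part} {edge_part}]"


def estimate_effect(df, treatment, outcome, edges):
    """Estimate the effect of `treatment` on `outcome` with linear regression.

    df is a pandas DataFrame whose columns match the node names in `edges`.
    """
    from dowhy import CausalModel  # lazy import: requires dowhy to be installed

    # Both treatment and outcome are passed at construction time.
    model = CausalModel(
        data=df,
        treatment=treatment,
        outcome=outcome,
        graph=edges_to_gml(edges),
    )
    estimand = model.identify_effect(proceed_when_unidentifiable=True)
    estimate = model.estimate_effect(
        estimand, method_name="backdoor.linear_regression"
    )
    return estimate.value
```

A call such as `estimate_effect(df, "db_latency", "api_errors", [("db_latency", "api_errors")])` would return the estimated effect size; the column names here are placeholders.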

LangGraph in an async FastAPI context — LangGraph's .invoke() is synchronous. Wrapping it in run_in_executor inside a thread pool fails because the thread has no running asyncio event loop, and calling run_until_complete inside a thread pool raises "no current event loop." The solution was to treat LangGraph's ainvoke as the primary path (which supports async nodes natively) and fall back to a plain sequential await chain — functionally identical, no threading required.
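The primary-path-plus-fallback idea can be sketched with plain asyncio. The sketch assumes only that the compiled graph exposes an async `ainvoke(state)` method, as LangGraph graphs do; `run_pipeline`, the stub agents, and `BrokenGraph` are hypothetical stand-ins, and a real implementation would catch narrower exceptions and log the failure.

```python
import asyncio


async def run_pipeline(graph, agents, state):
    """Prefer graph.ainvoke; fall back to awaiting each agent in order."""
    if graph is not None:
        try:
            return await graph.ainvoke(state)
        except Exception:
            pass  # sketch only: fall through to the sequential path
    for agent in agents:
        state = await agent(state)  # each agent returns the updated state
    return state


# Stub agents for illustration; real agents would call Splunk via MCP.
async def sentinel(state):
    return {**state, "severity": "P1"}


async def forensic(state):
    return {**state, "root_cause": "db_pool_exhaustion"}


class BrokenGraph:
    """Simulates a graph whose executor is unavailable."""

    async def ainvoke(self, state):
        raise RuntimeError("executor unavailable")


result = asyncio.run(run_pipeline(BrokenGraph(), [sentinel, forensic], {"alert": "checkout errors"}))
```

Because every agent is an ordinary coroutine that takes and returns state, the fallback needs no threads and no second event loop.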

WebSocket state replay race condition — When an incident is triggered via API (curl or Splunk alert action), the pipeline completes in ~2 seconds. The frontend only opens the WebSocket after detecting the new incident via polling — by then, all the live messages have already fired. We solved this with a server-side in-memory state cache: every time the orchestrator saves state, it also writes to _state_cache. When a client connects and sends get_state, the server replays the full final state in one shot.
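A minimal sketch of the replay mechanism follows. Only `_state_cache` and the `get_state` message come from the description above; the function names, the `incident_id` key, and the `state_snapshot` reply shape are assumptions for illustration, and the WebSocket transport itself is omitted.

```python
import json

_state_cache = {}  # incident_id -> latest state snapshot


def save_state(incident_id, state):
    """Called by the orchestrator after every agent step."""
    _state_cache[incident_id] = dict(state)  # shallow copy of the current state


def handle_client_message(raw):
    """Return the JSON reply for one client message, or None if no reply is due."""
    message = json.loads(raw)
    if message.get("type") != "get_state":
        return None
    snapshot = _state_cache.get(message.get("incident_id"))
    # A late joiner receives the whole latest state in a single message.
    return json.dumps({"type": "state_snapshot", "state": snapshot})
```

A client that connects after the pipeline has finished sends `get_state` once and renders the returned snapshot, so it never depends on having seen the live messages.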

Splunk dashboard format compatibility — The initial dashboards used <dashboard version="1.1"> (Dashboard Studio format), which requires a separately licensed component not available on fresh Enterprise installs. We rewrote all five in classic Simple XML, which works on every Splunk version.


Accomplishments that we're proud of

The causal inference pipeline is the achievement we're most proud of. Combining the PC algorithm for structure discovery with DoWhy for effect estimation — and making it work reliably on real time-series metric data with proper guards for degenerate inputs — took significant effort to get right. The result produces output that is genuinely different from everything else in the observability space: not "these are correlated" but a mathematically verified causal chain with confidence scores and counterfactuals.

The counterfactual analysis is surprisingly compelling to watch in practice. Seeing "without DB connection pool exhaustion, checkout errors would have been 0.2% not 14.7%" gives an incident engineer immediate context about severity and urgency that no correlation coefficient can provide.

Building a genuinely async multi-agent pipeline that is robust to network failures, LangGraph unavailability, DoWhy edge cases, and WebSocket disconnections — while maintaining a smooth frontend experience throughout — required solving a lot of non-trivial engineering problems. The fallback architecture at every layer means ARIA degrades gracefully rather than crashing.


What we learned

Causal inference on live operational data is harder than academic causal inference. The PC algorithm's guarantees assume unlimited data and noise-free measurements, and real metric time-series offer neither. We learned to standardize columns before running the algorithm, remove degenerate (zero-variance) columns, and treat the discovered graph as a hypothesis to be merged with known topology rather than as ground truth.
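The two preprocessing guards can be sketched with the standard library alone. This is an illustration under assumptions: the metrics are held in a plain dict of lists rather than the DataFrame used in the project, and `prepare_columns` and the `eps` threshold are hypothetical.

```python
from statistics import mean, pstdev


def prepare_columns(columns, eps=1e-12):
    """Drop zero-variance columns and z-score the rest.

    columns maps a metric name to a list of float samples.
    """
    prepared = {}
    for name, values in columns.items():
        sd = pstdev(values)
        if sd <= eps:
            continue  # degenerate column: would cause division by zero downstream
        mu = mean(values)
        prepared[name] = [(v - mu) / sd for v in values]
    return prepared


# Made-up samples for illustration only.
cleaned = prepare_columns({
    "cpu": [1.0, 2.0, 3.0],
    "flat_metric": [5.0, 5.0, 5.0],
})
```

Here `flat_metric` is dropped because it never varies, and `cpu` is rescaled to zero mean and unit variance before any conditional-independence tests run.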

Running fully async multi-agent pipelines in production requires thinking about state persistence at every step, not just at the end. If an agent fails midway through, the orchestrator needs to know exactly where it got to. The per-step Redis/memory write pattern was the right call.

The WebSocket late-joiner problem is underappreciated. In most tutorials, the client is always connected before events start. In real incident response tools, the client often connects after the fact — an operator opens the War Room 30 seconds after the alert fired. Designing for this from the start (state replay on every connect) is much easier than retrofitting it later.

Splunk's Simple XML dashboard format, while less visually polished than Dashboard Studio, is universal and reliable. For a hackathon submission where you need things to work on the judge's machine, reliability beats aesthetics.


What's next for ARIA

Splunk AI Assistant integration — Natural language queries over incident history. "Show me all incidents where DB connection pool was the root cause last month" translated to SPL automatically.

Multi-incident correlation — Detecting when two simultaneous incidents share a common upstream cause — the "gray failure" pattern where the real root cause is invisible because it's triggering cascades in parallel.

Learning loop — Writing MTTD, MTTR, and root cause confidence back to Splunk as metrics after each resolved incident, then using that data to tune the PC algorithm's significance threshold per service cluster over time.

Real-time causal graph updates — Currently the causal graph is built once per incident from historical data. The next version rebuilds it continuously as the incident evolves, with edges appearing and disappearing as new metric data arrives.

ITSI integration — Mapping ARIA's causal chains to Splunk ITSI service trees, so blast radius predictions automatically update service health scores in real time.
