Inspiration

Supply chain attacks are among the fastest-growing threat vectors in software security. SolarWinds. Codecov. ua-parser-js. The xz backdoor. The pattern repeats: attackers compromise a trusted dependency, and the blast radius is massive before anyone notices.

The problem isn't detection. Splunk fires alerts. The problem is investigation speed. When an alert hits, analysts manually write SPL queries, cross-reference threat intel, trace lateral movement, map affected systems — hours or days while attackers are still moving.

I built a system that does what a senior SOC analyst would do if they could clone themselves three times and work in parallel: AI-powered and connected directly to Splunk through MCP.

What I Learned

  • LangGraph's Send() API is the key to real multi-agent parallelism. Nodes chained with plain edges run one after another. Send(), returned from a conditional edge function, fans out to concurrent agents, each with its own input, that write back to shared state through reducers. Biggest architectural insight: the supervisor isn't a node, it's a conditional edge.
  • MCP turns Splunk from a passive data store into an active tool. Agents generate SPL, execute it through MCP's splunk_run_query, and analyze real results — not simulated data.
  • LLM synthesis beats rule-based merging. Three agents return overlapping findings. An LLM deduplicating IOCs and resolving conflicting severity scores produces more coherent output than any heuristic I could write.
  • SSE streaming makes AI investigations feel alive. Watching each agent's reasoning appear in real time in the Decision Log completely changes the experience versus waiting for a final report.
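
The fan-out and reducer behavior from the first bullet can be modeled in plain Python. This is a minimal sketch using only the standard library, not LangGraph itself: the three agent functions, their return values, and the `merge_lists` body are hypothetical stand-ins, and the reducer is applied by hand here where LangGraph would apply it for you.

```python
import asyncio


def merge_lists(current: list, update: list) -> list:
    """Reducer: append new items, skipping ones already present."""
    return current + [item for item in update if item not in current]


# Hypothetical sub-agents; real ones would generate SPL and query Splunk.
async def ioc_hunter(alert: dict) -> dict:
    return {"findings": [f"ioc:{alert['package']}"]}


async def threat_intel(alert: dict) -> dict:
    return {"findings": [f"intel:{alert['package']}"]}


async def blast_radius(alert: dict) -> dict:
    return {"findings": [f"ioc:{alert['package']}", "host:build-01"]}


async def fan_out(alert: dict) -> dict:
    """Run all agents concurrently, then fold each update into shared state."""
    agents = [ioc_hunter, threat_intel, blast_radius]
    updates = await asyncio.gather(*(agent(alert) for agent in agents))
    state = {"findings": []}
    for update in updates:
        # Without a reducer, a plain assignment here would keep only the last update.
        state["findings"] = merge_lists(state["findings"], update["findings"])
    return state


state = asyncio.run(fan_out({"package": "ua-parser-js"}))
```

Each agent returns a partial update instead of mutating shared state, so the merge step is the only place where overlapping findings meet.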

How I Built It

Backend: FastAPI serves the REST API and SSE streaming endpoints. The core is a LangGraph StateGraph with checkpointing — DETECT classifies the alert and extracts IOCs, then a supervisor dispatches three parallel sub-agents (IOC Hunter, Threat Intel, Blast Radius) via Send(). Each sub-agent independently generates SPL queries, executes them against Splunk through MCP, and analyzes results. A merge node synthesizes findings with LLM-powered deduplication. ASSESS scores severity and maps blast radius. REMEDIATE generates prioritized actions gated by human approval.
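
As a rough illustration of the SSE side, the sketch below formats agent steps as Server-Sent Events frames. The event names (`agent_step`, `done`) and the step fields are assumptions for illustration, not the project's actual schema.

```python
import json


def format_sse(event: str, data: dict) -> str:
    """Encode one SSE frame: an event name, a JSON data line, and a blank line."""
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def decision_log_stream(steps):
    """Yield one SSE frame per agent step, then a final 'done' frame."""
    for step in steps:
        yield format_sse("agent_step", step)
    yield format_sse("done", {})


# Hypothetical step emitted by a sub-agent.
frames = list(
    decision_log_stream([{"agent": "ioc_hunter", "message": "Generated SPL query"}])
)
```

A generator like this could be handed to FastAPI's `StreamingResponse` with `media_type="text/event-stream"`; the browser's `EventSource` then receives each frame as it is yielded.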

Frontend: Next.js with a Splunk-native dark theme. Three-panel Investigation Workspace — agent state machine and decision log on the left, tabbed content (attack graph via React Flow, evidence, SPL queries, timeline) in the center, entity sidebar on the right. Remediation actions surface in a bottom bar with approve/reject controls.

Infrastructure: Splunk Enterprise in Docker with five custom indexes (cicd_events, git_events, secret_audit, extensions, threat_intel). A synthetic data generator creates realistic attack scenarios. The Splunk App for MCP exposes splunk_run_query and splunk_get_indexes as MCP tools over Streamable HTTP.
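
For reference, an MCP tool invocation is a JSON-RPC 2.0 request using the `tools/call` method. The sketch below builds such a request for `splunk_run_query`. The `query` argument name and the SPL string are assumptions; the real argument names come from the tool's input schema as published by the server.

```python
import json
from itertools import count

_request_ids = count(1)  # simple monotonically increasing JSON-RPC ids


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "query" as the argument name is an assumption; check the tool's input schema.
request = build_tool_call(
    "splunk_run_query",
    {"query": "search index=cicd_events | head 5"},
)
body = json.dumps(request)  # request body to POST to the MCP endpoint
```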

Challenges

  • MCP token management. Splunk's MCP app uses RSA-encrypted JWT tokens, not simple API keys. Tokens don't survive container restarts — the app regenerates its private key. Built a setup flow that generates and configures tokens programmatically via Splunk's REST API.
  • LangGraph Send() wiring. First attempt: Send() calls inside a regular node. LangGraph silently ignored them. Dug into the source — Send() objects must be returned from a function passed to add_conditional_edges(), not from a node. Single most important architectural constraint in the project.
  • MCP response parsing. Splunk's MCP returns inconsistent formats — nested JSON inside string fields, raw lists, varying structures. Index discovery required multiple extraction strategies with graceful fallbacks.
  • Parallel state merging. Three agents writing to the same state fields simultaneously. Without LangGraph's Annotated[list, merge_lists] reducer pattern, the last agent to finish silently overwrites the others.
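
The "multiple extraction strategies with graceful fallbacks" idea from the response-parsing bullet might look something like the sketch below. The wrapper key names (`results`, `indexes`, `text`) are guesses for illustration; the actual shapes depend on what the server returns.

```python
import json
from typing import Any


def extract_rows(payload: Any) -> list:
    """Best-effort normalization of a tool result into a list of rows.

    Tries, in order: a raw list, a JSON-encoded string, then a dict that
    wraps the rows under a known key. Returns [] if nothing matches.
    """
    if isinstance(payload, list):
        return payload
    if isinstance(payload, str):
        try:
            return extract_rows(json.loads(payload))
        except json.JSONDecodeError:
            return []
    if isinstance(payload, dict):
        # Key names are hypothetical; adjust to what the server returns.
        for key in ("results", "indexes", "text"):
            if key in payload:
                rows = extract_rows(payload[key])
                if rows:
                    return rows
    return []
```

For example, `extract_rows({"text": '{"results": [{"name": "cicd_events"}]}'})` unwraps the JSON string nested in the `text` field and returns the inner list, while unparseable input yields an empty list rather than an exception.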

What's Next

  • Expand MCP integrations. Splunk is one data source. GitHub (commit/PR audit), AWS CloudTrail, GCP Audit Logs, and container registries are next — each one is just another MCP tool the agents can call.
  • SBOM-aware blast radius mapping. Feed in SBOMs and the blast radius agent traces exactly which services pull the compromised dependency — not just which hosts talked to malicious IPs.
  • Persistent investigation memory. Back LangGraph's checkpointing with a durable store so analysts can pause, resume, and build on prior investigations.
  • Automated playbook generation. After enough investigations, the system has data to suggest reusable response playbooks for recurring attack patterns.
  • Fine-tuned SPL generation. Current queries are general-purpose. Fine-tuning on org-specific index schemas and field names should cut query errors.
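
To make the SBOM idea concrete, here is a minimal sketch of the lookup the blast radius agent would need. It assumes SBOMs have already been reduced to a per-service list of (name, version) components; the service names, package name, and versions are made up for illustration.

```python
from typing import NamedTuple


class Component(NamedTuple):
    name: str
    version: str


def affected_services(
    sboms: dict[str, list[Component]], package: str, bad_versions: set[str]
) -> list[str]:
    """Return services whose SBOM lists a compromised version of the package."""
    return sorted(
        service
        for service, components in sboms.items()
        if any(c.name == package and c.version in bad_versions for c in components)
    )


# Illustrative inventory only; none of these names or versions are real findings.
sboms = {
    "checkout-api": [Component("example-lib", "2.0.1"), Component("web-framework", "4.1.0")],
    "admin-ui": [Component("example-lib", "2.0.0")],
    "billing-worker": [Component("http-client", "1.3.0")],
}
hits = affected_services(sboms, "example-lib", {"2.0.1"})
```

Here only `checkout-api` is flagged: `admin-ui` depends on the same package but at a version outside the compromised set.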
