Inspiration
Every security operations center has the same painful bottleneck: threats arrive faster than engineers can write detections for them — and the rules they do write are noisy, burying analysts in false alarms. A new CVE drops, a Mandiant report lands, an analyst flags a technique… and turning that into a tuned, deployed Splunk detection takes a senior detection engineer the better part of a week.
The current wave of "AI for security" is all aimed downstream — triaging the alerts an engineer already wrote. We wanted to automate the upstream problem nobody touches: writing, testing, and tuning the detection itself. That's Counterspell — an autonomous detection engineer for Splunk.
What it does
Paste in a threat — a CVE, a threat-intel report, or a MITRE ATT&CK technique — and a team of five AI agents takes over:
- 🧠 Architect reads the threat, maps it to MITRE ATT&CK, and writes a tunable detection design.
- 🔴 Red-team generates a synthetic attack and injects it into Splunk via HEC — a guaranteed true positive, stamped with a scenario ID.
- ✍️ Translator turns the design into runnable SPL.
- 🔍 Validator — pure, deterministic Python, never an LLM — backtests the search and counts true vs. false positives.
- 🚀 Deployer writes the real, scheduled saved search and a SOC runbook.
The Validator hands its false positives back to the Architect, which tightens the rule, and the loop repeats. The headline visual is the false-positive count falling — 47 → 12 → 0 — with the attack still caught every time. Then a human approves, and Counterspell deploys a real saved search to Splunk with a runbook in the KV store.
The work that took a week, done in about three minutes.
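The tune-and-backtest loop described above can be sketched in a few lines. This is a minimal illustration, not Counterspell's actual orchestrator: `Backtest`, `tune_until_clean`, and the three callables (`translate`, `backtest`, `tighten`) are hypothetical names standing in for the Translator, Validator, and Architect.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Backtest:
    """Outcome of one backtest: TP count plus the false-positive rows."""
    true_positives: int
    false_positives: List[dict] = field(default_factory=list)


def tune_until_clean(
    design: Any,
    translate: Callable[[Any], str],            # stand-in for the Translator
    backtest: Callable[[str], Backtest],        # stand-in for the Validator
    tighten: Callable[[Any, List[dict]], Any],  # stand-in for the Architect
    max_iterations: int = 5,                    # assumed iteration cap
) -> Tuple[str, List[int], bool]:
    """Repeat translate -> backtest -> tighten until no false positives remain.

    Returns the last SPL, the false-positive count per iteration, and whether
    the loop converged (attack still caught, zero false positives).
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    history: List[int] = []
    spl = ""
    for _ in range(max_iterations):
        spl = translate(design)
        result = backtest(spl)
        history.append(len(result.false_positives))
        if result.true_positives > 0 and not result.false_positives:
            return spl, history, True
        # Hand the false positives back so the design can be tightened.
        design = tighten(design, result.false_positives)
    return spl, history, False
```

The returned `history` is the falling false-positive curve, and the boolean is what a human-approval gate would inspect before any deployment step runs.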
How we built it
Counterspell is a Python agent runtime around a single tight loop, talking to a live Splunk Enterprise instance.
- Five agents orchestrated by a deterministic, non-LLM `Orchestrator` that owns the loop, the iteration cap, and the human-approval gate.
- A provider-agnostic, OpenAI-compatible LLM drives the four LLM agents. Swapping providers (Groq, Ollama, self-hosted, or Splunk-hosted Foundation-Sec) is a `.env` change, not a code change. Every call is schema-locked with Pydantic and one-shot-repaired on malformed JSON.
- The Splunk MCP Server (v1.2.0) runs every backtest. The Validator calls `splunk_run_query` over JSON-RPC with an RSA-encrypted, audience-scoped token, and every run records `used_mcp = true`. If MCP is ever unreachable, it falls back transparently to the SDK so the loop never breaks.
- The Splunk Python SDK performs the real writes: HEC injection, the scheduled saved search (ES-ready, with notable + risk-based-alerting metadata when Enterprise Security is present), and the KV-store runbook.
- Deterministic, provable scoring. A result row is a true positive only when an attacker marker matches a field value exactly (whole-token, never substring):
$$\text{TP}(r)\iff \exists\, f\in r:\ \text{cs_scenario_id}\in \text{tokens}\big(r[f]\big)\ \lor\ \text{entity}\in \text{tokens}\big(r[f]\big)$$
No language model ever judges true vs. false positive — which is what makes the FP curve trustworthy.
- A generalization hold-out. The data generator plants a second class of benign events the tuning loop never sees. After deploying, `check_generalization.py` replays the rule against only that hold-out, and it fires on zero of them. It learned the pattern, not the planted noise.
- An in-Splunk app. A custom SPL command (`| counterspell`) and a SimpleXML dashboard let an analyst run the whole loop from the Splunk search bar and watch the FP curve drop natively.
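The true-positive rule above is small enough to show directly. The sketch below is one way to implement it. The tokenizer (splitting on whitespace and a handful of common delimiters) and the function names are assumptions made for illustration, and the real Validator may tokenize differently.

```python
import re

# Assumed delimiter set: whitespace plus common separators in log fields.
_DELIMS = re.compile(r"""[\s,;=|"'()\[\]]+""")


def tokens(value) -> set:
    """Split a field value into whole tokens, dropping empty strings."""
    return {tok for tok in _DELIMS.split(str(value)) if tok}


def is_true_positive(row: dict, scenario_id: str, entity: str) -> bool:
    """A row is a TP iff some field contains a marker as a whole token."""
    markers = {m for m in (scenario_id, entity) if m}
    return any(markers & tokens(value) for value in row.values())


def score(rows: list, scenario_id: str, entity: str) -> tuple:
    """Return (true_positive_count, false_positive_count) for a backtest."""
    tp = sum(is_true_positive(r, scenario_id, entity) for r in rows)
    return tp, len(rows) - tp
```

Matching on whole tokens rather than substrings means a benign account such as `svc_backup2` is not credited to an attacker entity named `svc_backup`.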
Challenges we ran into
Almost every interesting bug was at a boundary between systems:
- The backtest was secretly searching 12 hours, not 30 days. The LLM's SPL carried an inline `earliest=-12h`, which Splunk honors over the dispatch time range, hiding the entire noise floor and pinning the FP curve at zero. Stripping inline time terms fixed convergence.
- A "phantom" generalization failure. Our deployed rule did generalize, but the MCP client parsed the server's empty-result envelope `{"results": []}` as one bogus row, so the check always reported a hold-out hit. A truthiness bug (`[] or …`) hiding behind a perfectly good detection.
- Getting the in-Splunk command to run at all. Splunk's bundled Python lacked our deps, and installing into it risks breaking Splunk. We made the command a thin wrapper that shells out to the system Python, then peeled back four stacked failures, each only visible from inside the `NT SERVICE\Splunkd` service account: missing `splunklib`, no read access to the user-profile repo/deps, an `httpx` SSL crash from an inherited `SSL_CERT_FILE`, and free-tier token-per-minute rate limits. Each one taught us something about how Splunk actually runs external processes.
- Splunk's chunked-search protocol locks result columns to the first row, so our `iteration`/`fp_count`/`spl` fields silently vanished from the dashboard until we emitted a stable schema on every row.
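Two of the boundary bugs above are easy to reproduce in miniature. The sketch below shows one plausible shape for each fix. Both function names are hypothetical, the regex is deliberately naive (it does not understand quoting context or subsearches), and the exact buggy expression in the MCP client is not shown in this write-up, so the envelope handling is an assumption about its general form.

```python
import re

# Matches inline `earliest=` / `latest=` modifiers with a quoted or bare value.
# The leading \b keeps it from half-matching `_index_earliest`.
_INLINE_TIME = re.compile(r'\b(?:earliest|latest)\s*=\s*(?:"[^"]*"|[^\s|)\]]+)\s*')


def strip_inline_time_terms(spl: str) -> str:
    """Remove inline time modifiers so the dispatch time range applies."""
    return _INLINE_TIME.sub("", spl).strip()


def rows_from_envelope(payload: dict) -> list:
    """Extract result rows without treating an empty list as 'missing'.

    A truthiness test such as `payload.get("results") or fallback` takes the
    fallback when results is [], so membership is checked explicitly instead.
    """
    if "results" in payload:
        return list(payload["results"])
    # Assumption: a payload with no envelope is itself a single row.
    return [payload]
```

With the explicit membership test, an empty result set stays empty, which is what lets a hold-out check report zero hits.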
Accomplishments that we're proud of
- A genuinely closed loop: detect → design → attack → backtest → tune → deploy, with a real write to Splunk that survives a UI refresh.
- A visible magic moment — the FP curve dropping to zero is what people remember.
- Deterministic, provable scoring and a generalization proof that answers the sharpest judge question on screen.
- Real MCP integration, not a mock — backtests genuinely run through the Splunk MCP Server.
- Guardrails as a feature: a human-approval gate, a scoped service account, an iteration cap, and no outbound actions.
What we learned
The hard part of agentic security isn't the prompting — it's the boundaries: time ranges, token schemas, service-account permissions, SSL trust, rate limits. An agent is only as trustworthy as its deterministic parts, which is why we kept the Validator and the scoring out of the LLM entirely. And a held-out test set turns "trust me, it works" into a number on screen.
What's next
- The MCP-primary path for SPL generation once the AI Assistant for SPL add-on is installed.
- A wider MITRE ATT&CK coverage map driven by batch runs.
- Analyst-confirmed labels replacing synthetic ground truth — the same human-in-the-loop our approval gate already models.