Inspiration

Alert fatigue is the number one pain in a security operations center. Analysts drown in false positives long before anyone notices a missed detection. The usual fix, tuning a noisy rule by hand, is slow and quietly dangerous: a tighter rule can stop catching the very attacks it was built to find, and teams often discover that only after a breach. We wanted an agent that does the tedious tuning and, just as important, proves it did not break the detection.

What it does

RulePilot is an analyst-in-the-loop agent for Splunk that reduces false positives in noisy detection rules without dropping the events analysts care about.

You give it a noisy rule and describe, in plain English, what the rule must never stop catching. An AI model compiles that plain-English intent into an executable Splunk check, the native Splunk parser validates it, and you approve it. You never have to hand-write the verification query, but you can edit it before approving. RulePilot then diagnoses why the rule is noisy by running live Splunk searches through the Splunk MCP Server, proposes a tighter rule, and runs an agentic loop: validate the candidate through the Splunk parser, run it on real data, measure how much noise it removes, and confirm the must-catch entities still survive. If a candidate cannot both cut noise and preserve the flagged behavior, it is rejected, and the interface says so honestly instead of faking a result.
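The accept-or-reject logic of that loop can be sketched in a few lines. This is a minimal illustration under assumed names, not RulePilot's actual code: `refine`, `Verdict`, and the injected `parses` and `run` callables are hypothetical stand-ins for the parser check and the live Splunk search.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Set, Tuple


@dataclass
class Verdict:
    accepted: bool
    rule: str    # the rule to keep using (the original if nothing passed)
    reason: str


def refine(
    original_rule: str,
    baseline_alerts: int,
    must_catch: Set[str],
    candidates: Iterable[str],
    parses: Callable[[str], bool],                # hypothetical: parser pre-check
    run: Callable[[str], Tuple[int, Set[str]]],   # hypothetical: (alert count, entities surfaced)
) -> Verdict:
    """Return the first candidate that cuts noise AND keeps every must-catch entity."""
    for candidate in candidates:
        if not parses(candidate):
            continue  # rejected by the parser, so it is never executed
        alerts, surfaced = run(candidate)
        if alerts < baseline_alerts and must_catch <= surfaced:
            return Verdict(True, candidate, f"{baseline_alerts} -> {alerts} alerts")
    # Nothing passed both tests: keep the working rule and report the failure.
    return Verdict(False, original_rule, "no candidate cut noise and preserved the must-catch entities")
```

The design point is the final return: a failed run hands back the original rule with an explicit reason rather than the last candidate tried.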

On synthetic security data with live Splunk and GPT-4o, RulePilot cut a failed-login rule from 113 alerts to 1, a 99% reduction, while preserving 100% of the real brute-force attack. A suspicious-command rule went from 122 to 12, a 90% reduction, again fully preserved.
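For reference, the reduction percentages follow from the alert counts as (before - after) / before. The helper below is a hypothetical illustration of that arithmetic, not part of the project.

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage of alerts removed when a rule goes from `before` to `after` alerts."""
    if before <= 0:
        raise ValueError("before must be a positive alert count")
    return 100.0 * (before - after) / before


print(f"{reduction_pct(113, 1):.1f}%")   # failed-login rule: 99.1%
print(f"{reduction_pct(122, 12):.1f}%")  # suspicious-command rule: 90.2%
```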

It is not a model that writes detections. It is an agent that reduces alert noise and proves it did not break the detection.

How we built it

Splunk is the source of truth, and RulePilot reaches it through the Splunk MCP Server. Every search it runs (the baseline, the diagnostics, the candidate rules, and the preservation checks) executes through the MCP Server with token authentication, so the project uses a Splunk AI capability at run time, not just at setup. Every AI-generated query is also pre-validated by the native Splunk SPL parser before it runs; when the parser rejects a query, its own error message is fed straight back to the model as revision feedback, which turned out to be the single biggest reliability win.
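The parser-feedback step is essentially a bounded retry loop. The sketch below assumes two hypothetical callables: `generate(intent, feedback)` wraps whichever model is in use, and `parse(query)` returns `None` on success or the parser's error text on rejection. The names and the attempt limit are illustrative assumptions.

```python
from typing import Callable, Optional, Tuple


def generate_valid_query(
    intent: str,
    generate: Callable[[str, Optional[str]], str],  # hypothetical model wrapper
    parse: Callable[[str], Optional[str]],          # hypothetical: None if valid, else error text
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Ask the model for a query, feeding each parser error into the next attempt."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        query = generate(intent, feedback)
        error = parse(query)
        if error is None:
            return query, attempt
        feedback = error  # the parser's own message becomes revision feedback
    raise RuntimeError(
        f"no parseable query after {max_attempts} attempts; last error: {feedback}"
    )
```

Raising after the last attempt, instead of returning the final unparsed query, keeps an invalid query from ever reaching execution.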

The agent is model-agnostic behind one interface. You pick the provider at run time: a frontier model through the OpenAI API, a local open-source model (e.g. Qwen through Ollama), or the Splunk AI Assistant through the Splunk MCP Server. A verification gate probes whether the refined rule still surfaces the approved must-catch entities, and a Streamlit interface presents one shared input form across two worked examples and a blank bring-your-own-rule tab, plus a replay mode that renders saved runs with no Splunk or model required.
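One way to put several providers behind a single interface is a small registry that raises on unknown or incomplete configuration. Everything here is an assumed sketch: `Provider`, `ProviderError`, `register`, `get_provider`, and the `EchoProvider` stand-in are hypothetical names, and a real provider would call the OpenAI API, Ollama, or the MCP Server instead of echoing.

```python
from typing import Callable, Dict, Protocol


class Provider(Protocol):
    def complete(self, prompt: str) -> str: ...


class ProviderError(RuntimeError):
    """Raised when a provider is unknown or misconfigured."""


_FACTORIES: Dict[str, Callable[[dict], Provider]] = {}


def register(name: str):
    """Decorator that records a factory under a provider name."""
    def decorator(factory):
        _FACTORIES[name] = factory
        return factory
    return decorator


def get_provider(name: str, config: dict) -> Provider:
    """Build the provider chosen at run time, or fail with a clear message."""
    if name not in _FACTORIES:
        known = ", ".join(sorted(_FACTORIES)) or "none"
        raise ProviderError(f"unknown provider {name!r}; registered: {known}")
    return _FACTORIES[name](config)


class EchoProvider:
    """Stand-in provider used only for this illustration."""
    def __init__(self, prefix: str):
        self.prefix = prefix

    def complete(self, prompt: str) -> str:
        return f"{self.prefix} {prompt}"


@register("echo")
def _make_echo(config: dict) -> Provider:
    if "prefix" not in config:
        raise ProviderError("echo provider requires a 'prefix' setting")
    return EchoProvider(config["prefix"])
```

Calling `get_provider("echo", {"prefix": ">"})` returns a working provider, while a misspelled name or a missing setting raises `ProviderError` immediately.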

Challenges we ran into

A masked parser bug cost us the most time. The Splunk parser endpoint needs a POST with a form-encoded body; a GET returns an HTTP 405 that looks like a parse error. That single mismatch was silently rejecting every candidate. Fixing it unblocked the whole pipeline.
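A sketch of the corrected request shape, using only the Python standard library. The endpoint URL is deliberately left as a caller-supplied parameter, and the `q` and `output_mode` field names and the Bearer token header are assumptions to check against the Splunk REST API reference for your version. The code only builds the request; it does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_parser_request(parser_url: str, spl: str, token: str) -> Request:
    """Build a POST with a form-encoded body (a GET was what produced the 405)."""
    # Assumed field names: "q" for the query text, "output_mode" for JSON output.
    body = urlencode({"q": spl, "output_mode": "json"}).encode("utf-8")
    return Request(
        parser_url,  # supplied by the caller; no endpoint path is assumed here
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed token scheme
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


def is_transport_error(status: int) -> bool:
    """HTTP 405 means the method was wrong, so it must not be reported as a parse failure."""
    return status == 405
```

Checking for 405 separately keeps a client-side mistake from being counted as the parser rejecting a candidate.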

Keeping the verifier honest was the second challenge. If one model writes both the rule and the check, it can grade work it produced itself. We solved that by compiling the verification check once, up front, and having the analyst approve it before the loop runs, so the gate stays independent of the refinement.
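The independence property can be illustrated with an immutable record: once the analyst approves the check, nothing in the refinement loop can rewrite it. `ApprovedCheck` and `gate` are hypothetical names for this sketch, and the field layout is an assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class ApprovedCheck:
    """Verification check compiled once and approved before the loop starts."""
    spl: str                    # the verification query the analyst approved
    must_catch: FrozenSet[str]  # entities the refined rule must still surface
    approved_by: str


def gate(check: ApprovedCheck, surfaced: Set[str]) -> Set[str]:
    """Return the must-catch entities a candidate failed to surface (empty set = pass)."""
    return set(check.must_catch) - surfaced
```

Because the dataclass is frozen, an attempt to reassign `check.must_catch` during refinement raises `dataclasses.FrozenInstanceError`, so the gate always measures against what the analyst approved.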

The third was integrating the Splunk AI Assistant. The MCP tools for it are wired and exposed, but the cloud backend returns a 403: the tenant is entitled to the v1 Assistant API, which is why the browser assistant works, but not to the v2 SPL-generation API that the MCP tools call. The provider is fully implemented and switches on the moment the entitlement clears; until then we run on GPT-4o.

Accomplishments that we're proud of

  • A verification-first refinement agent that proves no regression, instead of generating SPL and hoping.
  • Built on the Splunk MCP Server: every Splunk search runs through it at run time, not just a connection test.
  • Honest-by-construction behavior: no fabricated metrics, no overwriting a working rule with a rejected attempt, no silent index changes.
  • Natural language to SPL, so analysts express intent rather than syntax.
  • A genuinely model-agnostic design that fails loudly and clearly when a provider is misconfigured, rather than failing silently.

What we learned

Ground the AI in the target system. The biggest reliability gain did not come from a cleverer prompt; it came from pre-flighting every generated query through the native Splunk parser and feeding the errors back to the model. The real value of AI for detections is not generation; it is trustworthy verification.

What's next for RulePilot

  • Splunk AI Assistant as the model once the v2 entitlement clears.
  • Live exemplars from the environment's own saved searches via MCP.
  • More scenarios (data exfiltration), analyst approve/edit history, and MITRE ATT&CK labeling on refined rules.
