SOC data: six label formats, 20% unlabeled. These are the conditions Squelch was designed for.
Three hypotheses tested, one wins. Attack injection excluded an unsafe IP. Precision: 24% → 59%.
Squelch declined to tune: 45.5% empty dest_ip from one sourcetype. Filed a diagnosis, not a filter.
The PR the engineer reviews Monday: SPL diff, eval before/after, attack injection results.
Full stack: trigger → LLM → MCP tools → adversarial eval harness → KV store → Git output.
Squelch
Inspiration
SIEM engineers carry tuning backlogs that grow faster than they shrink. A noisy correlation search generates hundreds of false positives per day, and the analyst on the other end stops reading those alerts. The engineer who wrote the rule knows it. But tuning a single rule properly takes three to five hours of manual investigation — querying notables, pivoting across fields, guessing at the pattern — and there's no way to validate the result. No precision score. No recall score. Just an eyeball check and a hope that you didn't quietly start dropping true positives behind a dashboard that still shows green.
Every AI demo I watched this year showed the LLM doing the hard part — generating queries, summarizing incidents, triaging alerts. But nobody was building the thing that proves the LLM's output is safe. The generation is easy. A NOT filter is one line of SPL. The hard part is: did that filter just hide a real attack? Did it overfit to a transient pattern? Would it survive if 10% of your labels were wrong?
That's where Squelch started. Not as a tuning tool — as a validation harness that happens to tune.
What it does
Every AI-Splunk tool generates SPL. Squelch is the only one that adversarially proves it's safe — and refuses to tune when the real problem is data quality.
Squelch is an adversarial eval harness for Splunk detection logic, delivered as a Splunk App. The user runs | squelch mode="tune" in the Splunk search bar, and the pipeline runs end-to-end: cluster FPs, propose a filter, attack the proposal, gate on recall, and deliver results as a GitHub PR or Issue. No dashboard, no chat interface — the outputs are PR diffs and eval tables that engineers review in their existing workflow.
The eval harness also ships standalone: | squelch mode="eval" runs precision, recall, perturbation, and holdout against any detection — no clustering, no LLM, no GitHub. Install it Monday, even if you never run the agent.
The pipeline:
Clusters false positives across multiple fields (IP, user, dest) and ranks every field as a filter hypothesis — showing explanatory power percentages and rejection reasons for each.
Proposes a minimal NOT filter using Gemini 2.5 Flash. The LLM writes one line of SPL. That's its entire contribution.
Attacks the proposal by injecting synthetic true positives matching the filter pattern. In the demo, the LLM proposed filtering 10 scanner IPs. The harness tested one, 192.168.40.81, and found it also carried true-positive traffic. That IP was excluded automatically. Nine shipped. Ten would have been unsafe.
Gates on recall as a hard veto. If the proposed revision drops even one true positive, it's rejected automatically. No override, no "close enough."
Runs label perturbation — flips 10% of golden labels across three seeded trials and checks whether precision and recall are stable under noise. Reports PASS/WARN.
Runs temporal holdout — splits the golden dataset 70/30 by time and checks whether the proposed filter generalizes to unseen data. Catches overfitting to transient patterns.
Declines to tune when filtering would be wrong. On the Endpoint detection in the demo, no field cluster cleared the 20% explanatory power floor. The agent discovered that 45.5% of false positives (96/211) had an empty dest_ip field, all from svc_install_log. A field extraction gap, not a filterable pattern. Instead of applying a filter that would mask the real problem, it filed a GitHub Issue with diagnostic evidence. The worst thing a tuning system can do is mask a data quality problem with a filter.
Delivers results as GitHub pull requests on per-detection branches: SPL diffs, eval before/after tables, attack injection results, FP cluster analysis, perturbation badges, temporal holdout numbers. The decision trail shows the math.
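The label perturbation and temporal holdout steps in the pipeline above can be sketched roughly as follows. This is a minimal illustration, not Squelch's code: the function names (trial_rng, flip_labels, temporal_split), the seed string layout, and the event shape (a dict with a numeric _time) are all assumptions made for the example.

```python
import hashlib
import random


def trial_rng(detection: str, trial: int, namespace: str = "perturb") -> random.Random:
    """Derive a reproducible RNG from SHA-256 of namespace, detection name, and trial index.

    The "namespace:detection:trial" layout is an assumption for this sketch.
    """
    digest = hashlib.sha256(f"{namespace}:{detection}:{trial}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def flip_labels(labels: dict, rng: random.Random, rate: float = 0.10) -> dict:
    """Return a copy of labels (event id -> is_true_positive) with about `rate` of them flipped."""
    ids = sorted(labels)  # sort so the sample does not depend on dict ordering
    flipped = set(rng.sample(ids, round(len(ids) * rate)))
    return {eid: (not lab) if eid in flipped else lab for eid, lab in labels.items()}


def temporal_split(events: list, train_pct: float = 0.70):
    """Split events at train_pct of the observed time range: earlier events train, later ones hold out."""
    times = [e["_time"] for e in events]
    lo, hi = min(times), max(times)
    cut = lo + (hi - lo) * train_pct
    train = [e for e in events if e["_time"] <= cut]
    holdout = [e for e in events if e["_time"] > cut]
    return train, holdout
```

Because the RNG is derived only from the detection name and trial index, rerunning a trial flips the same labels, which is what makes a PASS/WARN result repeatable.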
Here's the actual PR body for the DNS detection (PR #60):

And the decline-to-tune output for the Endpoint detection — no PR, just a diagnosis:

Demo results (verified against final run): DNS detection precision went from 24% to 59% (fp 205 → 43). Identity detection from 24% to 47%. Endpoint detection: declined — field extraction gap diagnosed. Not on clean data — six label formats, 20% unlabeled events.
How we built it
Architecture: Squelch is a native Splunk App — commands.conf registers a chunked Python streaming command (| squelch) that dispatches across five modes: test, tune, validate, llm_probe, and eval. The eval library (eval_lib.py, cluster.py, attack_inject.py, revise.py, github_integration.py) lives in the repo at eval/ and is vendored into the Splunk App at bin/lib/squelch_eval/. Every edit to eval/ gets mirrored to the vendored path; diffs must be empty after every session.
Splunk MCP Server (v1.1.3): Squelch integrates with the Splunk MCP Server for read-only data collection — 10 built-in tools (SPL queries, index metadata, saved searches) plus one custom BYOT tool (squelch_fp_rates_by_search) that exposes live FP-rate data to peer agents. Write paths and custom command invocation go through splunklib SDK directly, because MCP's command allowlist (safe_spl.json, 143 commands) excludes custom SPL commands. This is a documented architectural boundary: MCP for reads, SDK for writes.
LLM integration: Gemini 2.5 Flash via direct HTTPS from within the custom command. The LLM's job is narrow: given an original SPL query and a cluster of safe-to-filter values, produce the original SPL verbatim plus exactly one NOT {field} IN (...) clause. A structural validator (_structurally_valid()) rejects rewrites of the original query and requires a NOT clause to be present. A syntax checker runs the proposed SPL through Splunk (| head 0) to catch parse errors. The LLM is a component, not the product.
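A structural check of this kind could look something like the sketch below. It is not the project's _structurally_valid(); the function name, the regex, and the assumption that the filter is appended as a trailing "| search NOT field IN (...)" stage are all hypothetical choices made for illustration.

```python
import re

# Assumed shape of the appended stage: | search NOT <field> IN (<values>)
FILTER_STAGE = re.compile(
    r"\|\s*search\s+NOT\s+[A-Za-z_][\w.]*\s+IN\s*\(\s*[^()|]+\)",
    re.IGNORECASE,
)


def structurally_valid(original: str, proposed: str) -> bool:
    """Accept only the original SPL, verbatim, followed by exactly one NOT ... IN (...) stage."""
    # Collapse whitespace on both sides so formatting differences are not treated as rewrites.
    orig = " ".join(original.split())
    prop = " ".join(proposed.split())
    if not prop.startswith(orig):
        return False  # the model rewrote or reordered the original query
    suffix = prop[len(orig):].strip()
    # The whole remainder must be a single filter stage: nothing missing, nothing extra.
    return FILTER_STAGE.fullmatch(suffix) is not None
```

Checking the prefix and then requiring the remainder to match in full rejects three failure modes at once: a rewritten base query, a missing filter, and more than one appended stage.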
The eval harness is the core engineering:
evaluate_detection() computes event-level precision and recall using Splunk's _cd field for individual event identity, not aggregate counts
gate_revision() is a hard recall-preservation gate: proposed.recall >= baseline.recall or the revision is rejected, with the specific dropped event IDs captured
run_adversarial_eval() parses the proposed NOT filter, picks a target value at random (seeded RNG), synthesizes a true-positive event with that exact field value, injects it, and re-evaluates. If recall drops, the value is excluded from the filter
perturb_and_eval() uses SHA-256-seeded, namespaced RNG for reproducible trials: the same detection name and trial index always produce the same label flips
temporal_holdout_eval() queries golden data for min/max _time, computes the split point, and runs four evaluate_detection() calls: baseline×training, baseline×holdout, revised×training, revised×holdout
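The event-level precision/recall computation and the recall gate described above can be sketched as below. This is a simplified stand-in, not the harness itself: EvalResult, evaluate, and this gate_revision signature are assumptions, and plain string IDs stand in for Splunk's _cd values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    precision: float
    recall: float
    hit_tp_ids: frozenset  # true positives the detection actually fired on


def evaluate(fired_ids: set, golden: dict) -> EvalResult:
    """Score a detection per event. `golden` maps event id -> True (TP) or False (FP).

    Fired events with no golden label are ignored rather than guessed at.
    """
    tp_ids = {eid for eid, is_tp in golden.items() if is_tp}
    labeled_hits = golden.keys() & fired_ids
    hits = labeled_hits & tp_ids
    precision = len(hits) / len(labeled_hits) if labeled_hits else 0.0
    recall = len(hits) / len(tp_ids) if tp_ids else 0.0
    return EvalResult(precision, recall, frozenset(hits))


def gate_revision(baseline: EvalResult, proposed: EvalResult):
    """Hard veto: accept only if recall did not fall. Also report which TPs were dropped."""
    dropped = baseline.hit_tp_ids - proposed.hit_tp_ids
    accepted = proposed.recall >= baseline.recall
    return accepted, dropped
```

Working on sets of event IDs rather than counts is what lets the gate name the specific true positives a rejected revision would have hidden.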
Label normalization: A lookup (disposition_normalization.csv) maps six analyst label formats to two canonical values. Unlabeled events are excluded from the golden dataset, not imputed.
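In Python terms, the lookup behaves roughly like the sketch below. The six raw label strings are the ones listed under Challenges; everything else is an assumption for illustration, including the canonical value names, the "disposition" and "id" field names, and in particular how "resolved" and "closed" are mapped, which the write-up does not state.

```python
from typing import Optional

# Raw analyst label (lowercased) -> canonical value.
CANONICAL = {
    "true_positive": "true_positive",
    "false_positive": "false_positive",
    "fp": "false_positive",
    "fp - scanner": "false_positive",
    "resolved": "true_positive",   # ASSUMPTION: placeholder mapping, not from the source
    "closed": "false_positive",    # ASSUMPTION: placeholder mapping, not from the source
}


def normalize_label(raw: Optional[str]) -> Optional[str]:
    """Map a raw disposition to a canonical value, or None if missing or unrecognized."""
    if raw is None:
        return None
    return CANONICAL.get(raw.strip().lower())


def build_golden(events: list) -> dict:
    """Build event id -> is_true_positive, dropping unlabeled events instead of imputing them."""
    golden = {}
    for event in events:
        label = normalize_label(event.get("disposition"))
        if label is not None:
            golden[event["id"]] = (label == "true_positive")
    return golden
```

Excluding unlabeled events shrinks the golden set, but it keeps every precision and recall figure tied to a disposition an analyst actually recorded.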
Golden dataset: 1,000 seeded events across 8 detections (~125 per detection), distributed with realistic noise — six label variants, 20% unlabeled, three distinct FP root causes including a field extraction gap. Seeded via scripts/seed_notable.py.
GitHub integration: 12 REST API endpoints. Per-detection branches (squelch/tune/{slug}-{epoch}). PRs include SPL file commits (original + revised), eval before/after tables, attack injection results, FP cluster analysis, label sensitivity badges, temporal stability sections. The decline-to-tune path files Issues with diagnostic evidence. Credentials stored in Splunk's storage/passwords.
Named constants govern behavior, not magic numbers:
| Constant | Value | What it controls |
|---|---|---|
| MIN_TOP_ENTRY_FP_PCT | 0.20 | Minimum explanatory power for a cluster to be filterable |
| PERTURB_RECALL_PASS_THRESHOLD | 0.05 | Max recall delta under 10% label flip |
| HOLDOUT_SPLIT_PCT | 0.70 | 70% training, 30% holdout |
| HOLDOUT_PRECISION_FLOOR_DELTA | 0.0 | Holdout precision must not degrade |
| DIAGNOSE_EMPTY_THRESHOLD | 0.30 | Field empty in >30% of FPs triggers diagnosis |
| DIAGNOSE_SOURCETYPE_THRESHOLD | 0.80 | >80% of empties from one sourcetype → extraction gap |
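To show how three of these constants might interact in the tune-or-diagnose decision, here is a hedged sketch. The constant names and values come from the table; the functions, the event shape, and the reading of "explanatory power" as the top value's share of all false positives are assumptions, not the actual cluster.py logic.

```python
from collections import Counter

MIN_TOP_ENTRY_FP_PCT = 0.20
DIAGNOSE_EMPTY_THRESHOLD = 0.30
DIAGNOSE_SOURCETYPE_THRESHOLD = 0.80


def top_entry_share(fps: list, field: str):
    """Return (value, share of all FPs) for the most common non-empty value of `field`."""
    counts = Counter(e.get(field) for e in fps if e.get(field))
    if not counts:
        return None, 0.0
    value, n = counts.most_common(1)[0]
    return value, n / len(fps)


def diagnose_extraction_gap(fps: list, field: str):
    """Return the dominant sourcetype if empty values look like an extraction gap, else None."""
    empties = [e for e in fps if not e.get(field)]
    if not fps or len(empties) / len(fps) <= DIAGNOSE_EMPTY_THRESHOLD:
        return None
    sourcetype, n = Counter(e.get("sourcetype") for e in empties).most_common(1)[0]
    return sourcetype if n / len(empties) > DIAGNOSE_SOURCETYPE_THRESHOLD else None


def decide(fps: list, fields: list):
    """Return (action, field, detail): tune on a strong cluster, else diagnose, else decline."""
    for field in fields:
        value, share = top_entry_share(fps, field)
        if share >= MIN_TOP_ENTRY_FP_PCT:
            return ("tune", field, value)
    for field in fields:
        sourcetype = diagnose_extraction_gap(fps, field)
        if sourcetype:
            return ("diagnose", field, sourcetype)
    return ("decline", None, None)
```

With the Endpoint numbers from the demo (96 of 211 false positives with an empty dest_ip, all from svc_install_log), no value clears the 20% floor, the empty share of 45.5% exceeds 30%, and a single sourcetype accounts for all of the empties, so this sketch returns a diagnosis rather than a filter.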
Scale: 4,833 Python LOC. ~55 functions across 7 modules. 15 commits over 6 build days. 4 unit tests for temporal holdout. 6 bundle iterations, each with verified CSV captures. Solo builder.
Challenges we ran into
Label chaos was the first real problem. The demo seed data has six different label formats — true_positive, false_positive, resolved, closed, fp, FP - scanner — because that's what production SOC data actually looks like. The normalization lookup took 30 lines to build but it was the decision that unlocked everything downstream. Without it, precision and recall are meaningless because you can't tell true positives from false positives.
The recall metric was misleading. Early bundles reported "recall at 100%." A code audit revealed that Splunk's eval was returning recall = 6.5% (the detection only fires on ~6.5% of golden events), but true-positive preservation was 100% (zero TPs dropped by the filter). We rewrote the metric reporting to be honest: "recall held flat, zero true positives dropped" instead of "recall at 100%." This cost us a cleaner number but earned us a number we could defend.
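The distinction between the two numbers is easier to see in code. In this sketch the function names are hypothetical and the counts in the usage note are invented purely to reproduce a 6.5% recall; they are not taken from the demo data.

```python
def recall(fired: set, golden_tps: set) -> float:
    """Fraction of all golden true positives that the detection fires on."""
    return len(fired & golden_tps) / len(golden_tps) if golden_tps else 0.0


def tp_preservation(baseline_fired: set, revised_fired: set, golden_tps: set) -> float:
    """Fraction of the baseline's true-positive hits that the revised detection still catches."""
    caught_before = baseline_fired & golden_tps
    if not caught_before:
        return 1.0  # nothing to lose, so nothing was dropped
    return len(revised_fired & caught_before) / len(caught_before)
```

For example, if a detection fires on 13 of 200 golden true positives and a filter removes only false positives, recall is 0.065 before and after, while preservation is 1.0. Reporting the second number as "recall at 100%" would overstate what the detection catches.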
The shared-branch 422 problem. Bundle 3 used a single squelch/proposals branch for all PRs. GitHub allows only one open PR per head and base branch pair, so the second accepted detection's PR would fail with an HTTP 422. Bundle 4 fixed this with per-detection timestamped branches (squelch/tune/{slug}-{epoch}), making collisions vanishingly rare.
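A branch name in that squelch/tune/{slug}-{epoch} shape could be generated along these lines. The slug rules and both function names are assumptions for illustration; the write-up gives only the pattern, not how the slug is derived.

```python
import re
import time
from typing import Optional


def slugify(name: str) -> str:
    """Lowercase the detection name and collapse anything non-alphanumeric into single hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return slug or "detection"  # fall back if the name had no usable characters


def branch_name(detection_name: str, epoch: Optional[int] = None) -> str:
    """Build a per-detection, per-run branch name so each PR gets its own head branch."""
    if epoch is None:
        epoch = int(time.time())
    return f"squelch/tune/{slugify(detection_name)}-{int(epoch)}"
```

Two runs collide only if they target the same detection within the same second, which is why the timestamp suffix is enough in practice.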
MCP can't invoke custom commands. Splunk's MCP Server has a 143-command allowlist (safe_spl.json) that rejects custom SPL commands like | squelch. Our BYOT tools work for read-only queries wrapped in allowlisted SPL, but the write path (running the tune pipeline) goes through splunklib SDK directly. This is a genuine architectural finding, not a workaround — MCP for reads, SDK for writes.
Synthetic data is synthetic. The golden dataset is seeded, not production. We acknowledge this in the demo ("simulated SOC data") and in the architecture ("production deployment requires real analyst dispositions"). The label normalization layer, the eval harness, and the decline-to-tune logic were designed for the inconsistencies of production data, not the cleanliness of test data.
Accomplishments that we're proud of
The decline-to-tune beat. Squelch's most impressive moment is when it refuses to generate output — because the false positives are caused by a field extraction gap, not by a filterable pattern, and a filter would mask the real problem. An agent that knows when NOT to act is harder to build than one that always acts.
Every number is verified. Six bundle iterations, each with a CSV capture. Every precision number, every perturbation result traces to a specific row in a specific CSV committed to the repo. When the recall metric was misleading, we rewrote it. Not rounded. Not aspirational.
The eval harness ships standalone. | squelch mode="eval" runs precision, recall, perturbation, and holdout against any detection with zero side effects — no clustering, no LLM, no GitHub, no KV writes. It's the on-ramp: install the eval harness, get numbers on your existing detections, decide later if you want the agent.
What we learned
The validation harness took 3x longer to build than the LLM integration. call_gemini() is ~20 lines. evaluate_detection() is ~100. run_adversarial_eval() is ~75. perturb_and_eval() is 80+. temporal_holdout_eval() is 70+. The generation was a weekend. The validation was the project. This confirmed the thesis: the hard part isn't writing the fix — it's proving the fix is safe.
The golden dataset is synthetic — and we say so. The demo opens with "simulated SOC data" because the data is simulated. Production deployment requires real analyst dispositions; the architecture handles that input natively through the normalization layer. We designed for messy production labels, validated on messy synthetic labels.
Detection engineering is an eval problem, not a generation problem. The industry has plenty of tools that generate SPL. Nobody has built the eval harness. The gap isn't "write better queries" — it's "prove the queries you wrote are safe." Squelch exists because that gap exists.
Named constants > magic numbers. Every threshold in Squelch is a named constant with a comment explaining the choice. MIN_TOP_ENTRY_FP_PCT = 0.20 is readable, auditable, and tunable. if fp_pct > 0.2 buried in a function is none of those things.
What's next for Squelch: Adversarial Eval Harness for Splunk
Beyond NOT filters. The eval harness validates any SPL revision regardless of type. The current agent generates NOT filters — the simplest, safest class of detection change. The architecture supports time-window exclusions, lookup-based filters, and field-value combinations as the generation layer matures. The eval harness doesn't care what the LLM proposes. It cares whether the proposal is safe.
Complex SPL support. Production detections reference macros, eventtypes, lookup tables with staleness concerns, CIM field aliases, and nested search constructs. Extending the triage step to parse macro definitions, check lookup freshness, and resolve field aliases is the primary next-tier engineering challenge. Production label collection — integrating with case management systems for real analyst dispositions — is the other major prerequisite.
Adoption paths (already shipping). | squelch mode="eval" already works as a standalone tool — no LLM, no GitHub, no agent. squelch-harness (v0.1.0) is published on PyPI — the eval library without Splunk dependencies, for CI/CD pipelines and detection-as-code workflows. The next step is packaging the standalone eval as a lightweight Splunk App: five minutes to install, zero commitment, value on day one.