AgentGate: pre-action governance for Splunk AI agents

Pending findings queue and block log. Every refused call with agent ID, tool name, matched policies. Splunk-native analyst surface.
Live in Splunk Web. 12 decisions in 24 hours, 6 blocked, 3 pending findings. Every verdict traceable in HEC.
Five deterministic stages gate every agent action against Splunk. A sixth Foundation-Sec stage explains, never decides.
30-day cross-index query trips cost (POL-007) and reach (POL-012). AgentGate drafts an ES 8 Findings record for analyst review.
Benign call sails through. All five deterministic stages pass in under one millisecond. The gate isn't a bottleneck.
Indirect injection embedded in a log the agent just read. Override pattern matched in 0.1ms. POL-006 blocks before cost or policy run.
Cleanup agent tries to disable the only detection on a PCI-tagged host. KO-graph shows zero redundant coverage. BLOCK on 3 policies.

Inspiration

"Compliance and governance teams block the application from going to production." (Jeff Wiedemann, Global AI Partner Technical Leader, Splunk).

Splunk shipped six agentic products in twelve months (Triage Agent, Investigation Agent, Malware Reversal Agent, AI Playbook Authoring, AI Assistant for SPL, Foundation-Sec-8B). Every one of them can drive a destructive SPL command, disable a detection, or mutate the audit index. The visibility side was filled in last month with MCP Telemetry Dashboard v1.2, but it audits after the agent acts. The pre-action enforcement layer that procurement actually asks about did not exist.

I built AgentGate to be that layer: a deterministic, auditable, Splunk-native gate that sits in front of the MCP server and turns every agent action into a verdict an auditor can reproduce.

What it does

AgentGate is the missing pre-action governance layer for Splunk AI agents.

Every tool call from an MCP-speaking agent (Claude, Cursor, Triage Agent, SOAR playbook) is intercepted and run through five deterministic stages plus one advisory sixth:

Injection: pattern + homoglyph normalisation against NIST AISC, Anthropic, AgentDojo, and INJECAGENT-style corpora.
Blast radius: walks a NetworkX dependency graph of saved searches, indexes, sourcetypes, hosts, MITRE techniques, and compliance tags; computes per-asset redundancy.
Cost: predicted SVC-hour estimate before execution (Cisco Deep Time Series Model on Cloud).
Policy: twelve standards-mapped rules (NIST AI RMF, OWASP LLM Top 10, PCI DSS 10.6/10.2.4, HIPAA 164.308(a)(1)(ii)(D), SOX, ISO/IEC 42001, EU AI Act Article 14).
Decision: ALLOW | REQUIRE_APPROVAL | BLOCK, read only from the policy stage. Reproducible from the policy library alone.
Reasoning (advisory): Foundation-Sec-1.1-8B-Instruct generates the analyst-facing explanation. It never gates the verdict.

Outputs: a Splunk HEC audit event for every decision, an ES 8 Findings record for non-ALLOW verdicts, and a dark-theme Splunk dashboard.

Measured: precision 1.000, recall 0.971, specificity 1.000 on a committed corpus of 35 positives (including AgentDojo important_instructions templates and adversarial obfuscations) + 26 negatives. Policy-gate FPR 0.000 on 20 benign tool calls. Decision-path p95 = 0.56 ms. Reasoning-path mean = 9.4 s on local GPU (advisory only).

How I built it

Solo build, Python 3.12, Pydantic v2 for the domain model, structural typing via Protocol for stages, frozen dataclasses for policies.

Pipeline orchestrator (agentgate/middleware/pipeline.py) runs stages in order, accumulates StageResult objects, fails closed on policy-stage exceptions (an uncaught crash in the verdict path BLOCKs, it does not silently ALLOW), and computes the verdict from the policy stage only.
Knowledge-object graph (agentgate/graph.py) parses SPL from seeded saved searches into a NetworkX DiGraph and computes blast_radius() from per-asset redundancy + MITRE + compliance tags.
Policy library (agentgate/policies.py): twelve frozen-dataclass rules, each carrying id, title, standards tuple, rationale, action.
Foundation-Sec advisory stage (agentgate/middleware/reasoning.py) runs locally via Ollama at ~30 tok/s on RTX 5060. Verdict never reads its output.
Audit emitter (agentgate/audit.py) writes to Splunk HEC. Findings persistence (agentgate/findings.py) writes to a KV collection shaped like the ES 8 v2 /public/v2/investigations/{id}/findings body.
Dashboard: Splunk simple-XML, dark theme, traffic-light KPIs, by-policy and by-agent panels, pending-findings queue, block log.
Tests: 77 passed / 1 skipped / 1 xfailed. Benign-corpus FPR test asserts FPR=0. Scenario tests assert on specific matched policy IDs (POL-001, POL-002, POL-006, POL-009). A dedicated test_policy_stage_exception_fails_closed regression test proves that a crash inside the gating stage produces BLOCK rather than silently degrading to ALLOW.
CI: GitHub Actions runs ruff check . + pytest tests/ -v on every push.

Challenges I ran into

Determinism on top of an LLM. The hardest design call was deciding what Foundation-Sec is allowed to do. I locked it down to advisory: _decide() reads only the policy stage. A judge can verify that property from one file.
Fail-closed semantics. The first orchestrator had a broad per-stage except that silently turned a policy-stage crash into ALLOW, the canonical security anti-pattern. I caught it during adversarial review and special-cased the policy stage to BLOCK on exception, with a regression test.
Knowledge-object scale. The seeded SOC content is 12 saved searches + 9 KV assets. Real customers have thousands. The graph walk is O(V+E), so scaling is a measurement task, not a re-architecture, but I called it out in the "What I did NOT validate" section instead of hand-waving.
Splunk Hosted Models vs Ollama. Hosted Models is Cloud-only; my dev license is Enterprise. I shipped Ollama + Foundation-Sec-8B locally and documented the one-line Hosted Models swap path.
ES 8 v2 Findings API. The real /public/v2/investigations/{id}/findings POST needs an ES 8 Cloud entitlement I do not have. I wrote against the v2 shape and persist to a KV collection named agentgate_findings so the swap is a one-file change.

Accomplishments that I'm proud of

The deterministic-vs-generative boundary is enforceable from one file (pipeline.py::_decide).
The knowledge-object graph is the moat, and it is not LLM-firewall code anyone could have prompt-cloned. It is detection-engineering reasoning over saved searches, indexes, MITRE techniques, and compliance tags.
Reproducible, published metrics. Precision 1.000, recall 0.971, FPR 0.000, decision-path p95 0.56 ms. The corpora are in tests/corpora/. CI re-verifies them on every push.
Twelve policies mapped to seven named standards, wired into actual code, not decorative.
An honest "What I did NOT validate" section in the README. Procurement reviewers read that section first.

What I learned

Audit dashboards (MCP Telemetry Dashboard v1.2, MCP Watch) and pre-action gates are different products. Splunk shipped the audit side. The pre-action side was open ground.
Governance products live or die on what they do when something throws. Fail-closed semantics on the verdict stage is the single highest-leverage line of code in the system.
Standards mapping only counts if it is wired into the rule data structure. Decorating a README with NIST clause numbers is cheap; making the matched policy carry its standards tuple into the audit row is what gives an auditor something to query.
For a governance product, "no CI" reads as "no engineering rigor" before a judge has read a single line of code. The badge is the credibility signal.

What's next for AgentGate

Swap the ES 8 Findings KV mock for the real /public/v2/investigations/{id}/findings POST.
Swap Ollama + Foundation-Sec-8B for Splunk Hosted Models on Cloud (| ai provider=splunk model=foundation-sec-1.1-8b-instruct).
Scale the KO graph from 12 seeded saved searches to a real customer's 10k+ portfolio; measure graph-walk latency at scale.
Synthesize policies from the graph at evaluation time, so "removes sole coverage of T1190 on a PCI-tagged asset" is generated rather than hard-coded.
Cross-track: extend the gate to Observability (summary-index write protection) and ITSI (service-tree blast radius).

Built With

cisco-deep-time-series-model-target
github-actions
httpx
networkx
pydantic-v2
pytest
python-3.12
rich
ruff
splunk-ai-assistant-for-spl-(gated-via-mcp)
splunk-ai-toolkit
splunk-app-bundle-(savedsearches.conf-/-collections.conf-/-transforms.conf)
splunk-enterprise-10.4
splunk-enterprise-security-8-v2-findings-api-target
splunk-hec
splunk-hosted-models-target-(foundation-sec-1.1-8b-instruct-via-ollama-dev-license-substitute)
splunk-kv-store
splunk-mcp-server-1.2-(splunkbase-7931)
splunk-rest-api
splunk-sdk-(splunklib.client-+-splunklib.results)
splunk-simple-xml-dashboards
typer

Updates

Jonathan SolvesProblems started this project — Jun 14, 2026 08:48 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.