Inspiration
The idea behind Sentinel started with a pattern I kept noticing across engineering teams: the same incidents get solved over and over, by different engineers, from scratch, because nobody wrote down what worked last time. An experienced SRE carries years of institutional knowledge in their head: which Redis configuration causes connection pool exhaustion, which service dependency usually causes a latency cascade, which remediation steps resolve which class of failure. When they leave, that knowledge walks out with them. When a junior engineer gets paged at 3am, they start from zero.
The underlying problem is that institutional knowledge in software operations is unstructured, transient, and trapped in people. I wanted to solve this structurally: not with a better alert dashboard, but with an agent that accumulates operational knowledge with every incident it resolves and uses it to get meaningfully smarter over time.
Splunk is where operational data already lives for most engineering teams. It felt wrong to build an incident agent that pulls data into an external system. The agent should live where the data lives and speak the same language the data already speaks.
The core loop that drives Sentinel is not new, but applying it to SRE is:
observe problem → search institutional memory → reason through current context
→ act → verify → write resolution back to memory
This loop generalises. Customer support, legal review, security investigation, any domain where humans solve the same class of problem repeatedly with knowledge locked in their heads is a candidate. Incident response is just the most obviously broken version of it.
What it does
Sentinel monitors your Splunk instance for incident patterns and acts autonomously when they trigger: no command, and no human in the loop for low-risk remediations.
The production flow:
Your app → Splunk HEC (logs land in index=prod)
→ Splunk saved search detects error pattern
→ Splunk fires webhook to Sentinel API
→ Sentinel runs the full investigation and resolution cycle
→ Post-mortem indexed back into Splunk
→ Brain updated — next incident resolved faster
The eight-phase reasoning chain, streamed live to the UI as it executes:
| Phase | What happens |
|---|---|
| ASSESS | Parses the Splunk alert — service, symptoms, severity |
| REMEMBER | Searches KV Store for similar past incidents and what resolved them |
| INVESTIGATE | Runs targeted SPL queries against live log data in real time |
| MAP | Traverses the service dependency graph, identifies blast radius, upgrades severity if warranted |
| RETRIEVE | Selects the best matching runbook, or generates and saves a new one |
| ACT | Executes low-risk remediations; pauses and pages oncall for medium/high risk |
| VERIFY | Re-runs the diagnostic SPL query to confirm the fix actually worked |
| CLOSE | Writes a structured post-mortem to KV Store and indexes it into Splunk |
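The MAP and ACT rows can be sketched as two small pure functions. This is a minimal sketch, not Sentinel's actual code: the type names, the severity ladder, the one-step upgrade, and the function names are all assumptions for illustration. Only the 3-service threshold and the split between low risk and medium/high risk come from this write-up.

```typescript
// Hypothetical types; Sentinel's real type names are not shown in this write-up.
type Severity = "low" | "medium" | "high" | "critical";
type Risk = "low" | "medium" | "high";
type ActDecision =
  | { kind: "execute" }
  | { kind: "page_oncall"; reason: string };

const SEVERITY_ORDER: Severity[] = ["low", "medium", "high", "critical"];

// MAP: raise severity one step when the blast radius reaches the threshold.
// The one-step bump is an assumption; the write-up only says severity is upgraded.
function upgradeSeverity(
  current: Severity,
  affectedServices: string[],
  threshold = 3,
): Severity {
  if (affectedServices.length < threshold) return current;
  const next = Math.min(
    SEVERITY_ORDER.indexOf(current) + 1,
    SEVERITY_ORDER.length - 1,
  );
  return SEVERITY_ORDER[next] ?? current;
}

// ACT: only low-risk remediations run unattended; anything else pages a human.
function decideAction(risk: Risk): ActDecision {
  if (risk === "low") return { kind: "execute" };
  return { kind: "page_oncall", reason: `remediation risk is ${risk}` };
}
```

Keeping both decisions as pure functions makes them easy to unit test and easy to record in an audit trail, since the inputs fully determine the outcome.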
The brain grows with every resolved incident. Measured results from the same incident type run twice:
| Run | Brain state | Resolution time | Match confidence |
|---|---|---|---|
| First occurrence | No history | 21.6s | 13% |
| Second occurrence | Learned | 7.1s | 95% |
Additional capabilities:
- Autonomous detection — Sentinel fires from a Splunk scheduled search with no trigger command required
- Multi-signal correlation — queries all services in the blast radius during investigation, not just the one that fired
- Severity auto-upgrade — if blast radius analysis reveals 3+ affected services, Sentinel upgrades the severity before acting
- Confidence-based escalation — if match confidence is below threshold after failed remediations, Sentinel pages oncall with its full investigation context and stops acting
- Audit log — every tool call, confidence score, and decision is written to a real-time audit trail separate from post-mortems
- Multi-tenancy — each organisation gets an isolated brain; cross-org data access returns 403
- Native Splunk dashboard: a Simple XML dashboard (sentinel_overview) installed inside the Splunk app itself, showing active incidents, resolution timeline, brain growth, and agent decisions inside Splunk's own UI
- Dead letter queue: if the agent crashes mid-run, the incident is retried up to three times before being marked failed and the team notified
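The dead letter queue behaviour in the list above can be sketched as a pure state transition. This is an illustrative sketch under stated assumptions: the record shape and function name are hypothetical, and "retried up to three times" is read here as three retries after the first attempt.

```typescript
// Hypothetical queue record; field names are illustrative, not Sentinel's schema.
interface QueuedIncident {
  incidentId: string;
  retries: number;
  status: "pending" | "failed";
}

interface CrashOutcome {
  item: QueuedIncident;
  notifyTeam: boolean;
}

// Assumption: three retries are allowed after the initial attempt.
const MAX_RETRIES = 3;

// Called whenever an agent run crashes before reaching CLOSE.
function onAgentCrash(item: QueuedIncident): CrashOutcome {
  if (item.retries < MAX_RETRIES) {
    // Re-queue the incident and count the retry.
    return {
      item: { ...item, retries: item.retries + 1, status: "pending" },
      notifyTeam: false,
    };
  }
  // Retries exhausted: mark the incident failed and tell the team.
  return { item: { ...item, status: "failed" }, notifyTeam: true };
}
```

Under this reading, an incident that crashes on every run is attempted four times in total before it is marked failed.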
How we built it
Sentinel is a TypeScript monorepo (pnpm workspaces) structured as three layers: the brain, the agent, and the surfaces.
The brain lives in @sentinel/splunk-brain — a typed REST client for Splunk's management API, KV Store, and HEC. Every document written to KV Store gets an orgId field; every read filters by it. Post-mortems are dual-written: once to KV Store for agent lookups, once to Splunk via HEC for native search and dashboarding. Similarity search runs as SPL against indexed post-mortems, scoring results by keyword overlap since KV Store has no native vector index.
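The tenant filter on reads can be sketched as a URL builder for Splunk's KV Store REST endpoint, which accepts a JSON `query` parameter. This is a sketch rather than the code in @sentinel/splunk-brain: the function name, the `nobody` owner, the `postmortems` collection name used in the usage line, and the example host are assumptions.

```typescript
// Builds a KV Store read URL that always carries the tenant filter.
// Hypothetical helper; the real client in @sentinel/splunk-brain is not shown here.
function buildKvQueryUrl(
  baseUrl: string,
  app: string,
  collection: string,
  orgId: string,
  filter: Record<string, unknown> = {},
): string {
  // orgId is spread last so a caller-supplied filter cannot override it.
  const query = { ...filter, orgId };
  const path =
    `/servicesNS/nobody/${encodeURIComponent(app)}` +
    `/storage/collections/data/${encodeURIComponent(collection)}`;
  return `${baseUrl}${path}?query=${encodeURIComponent(JSON.stringify(query))}`;
}

// Usage: even if a caller passes another org's id, the tenant's own id wins.
const exampleUrl = buildKvQueryUrl(
  "https://splunk.example.com:8089",
  "sentinel",
  "postmortems",
  "org-a",
  { service: "checkout", orgId: "org-b" },
);
```

Forcing the filter at the client layer means individual tools cannot forget it, although a server-side check is still needed to return the 403 described above.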
The agent runs a structured reasoning loop in packages/agent. Each phase calls a tool that wraps either a KV Store query or a live SPL search via a custom Splunk MCP REST adapter — Splunk's official MCP package was not available on npm at build time, so I implemented the tool interface directly against Splunk's REST API. The agent probes for Splunk Hosted Models at startup using the AI Toolkit's | ai SPL command. On local Enterprise, the probe returns false and all generation routes to Gemini. On Splunk Cloud Platform, the probe would return true and Hosted Models activate without a code change.
The surfaces are the Next.js web app (apps/web) and the native Splunk dashboard. The web app uses Server-Sent Events to stream each reasoning phase to the UI in real time as the agent executes. The Splunk dashboard is a Simple XML file deployed into the sentinel Splunk app — it shows everything from inside Splunk's own interface.
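The per-phase streaming can be illustrated with a small formatter for Server-Sent Events frames. This is a sketch only: the `PhaseEvent` shape, the `phase` event name, and the function name are assumptions, not the payload that apps/web actually sends.

```typescript
// Hypothetical event payload; the real field names are not shown in this write-up.
interface PhaseEvent {
  phase: string;
  status: "started" | "completed" | "failed";
  detail?: string;
}

// Formats one Server-Sent Events frame. A frame is a set of "field: value"
// lines terminated by a blank line.
function formatSseEvent(eventName: string, payload: PhaseEvent, id?: number): string {
  const lines: string[] = [];
  if (id !== undefined) lines.push(`id: ${id}`);
  lines.push(`event: ${eventName}`);
  // JSON.stringify without indentation emits no raw newlines, so one data line suffices.
  lines.push(`data: ${JSON.stringify(payload)}`);
  return lines.join("\n") + "\n\n";
}

const exampleFrame = formatSseEvent("phase", { phase: "ASSESS", status: "started" }, 1);
```

The server writes frames like this to a response with `Content-Type: text/event-stream`, and a browser `EventSource` can listen for them with `addEventListener("phase", ...)`.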
The trigger is a Splunk saved search with a webhook alert action pointing at POST /webhooks/splunk-alert. This replaces any dependency on a public message queue (Pub/Sub, etc.) and means Sentinel fires whenever your existing Splunk alerts fire — you point the webhook at Sentinel, not the other way around.
I used Claude Code and Codex as the primary builders throughout, operating in goal mode on structured prompts that enforced no mocked data paths, no monkey-patches, and a strict shipped-vs-imagined verification table before any feature was marked done.
Challenges we ran into
Splunk Hosted Models are a Splunk Cloud Platform capability, not a local Enterprise feature. The | ai SPL command is available in AI Toolkit 5.7.4 on Enterprise, but the Splunk Hosted Models provider is Cloud Platform only. Trying to authenticate via SCS token endpoints on local Enterprise returns 404. The solution was a runtime capability probe: Sentinel runs a test search at startup, detects whether Hosted Models are available, logs the result, and routes generation to the appropriate provider. No environment flags required — the behaviour adapts to the deployment.
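The runtime capability probe described above might look roughly like this. It is a sketch under assumptions: the `SearchRunner` signature, the provider labels, and the function names are hypothetical, and the probe SPL is passed in as a parameter because the exact `| ai` invocation is not shown in this write-up.

```typescript
type Provider = "splunk-hosted-models" | "gemini";

// Hypothetical signature: runs an SPL search and resolves with its result rows.
type SearchRunner = (spl: string) => Promise<unknown[]>;

// Returns true only if the probe search succeeds and yields at least one row.
async function probeHostedModels(runSearch: SearchRunner, probeSpl: string): Promise<boolean> {
  try {
    const rows = await runSearch(probeSpl);
    return rows.length > 0;
  } catch {
    // Any failure (for example a 404 on local Enterprise) counts as "not available".
    return false;
  }
}

// Run once at startup; the result decides where generation requests go.
async function selectProvider(
  runSearch: SearchRunner,
  probeSpl: string,
  log: (message: string) => void = () => {},
): Promise<Provider> {
  const available = await probeHostedModels(runSearch, probeSpl);
  const provider: Provider = available ? "splunk-hosted-models" : "gemini";
  log(`hosted models available=${available}; routing generation to ${provider}`);
  return provider;
}
```

Treating every probe error as "not available" keeps startup from failing on local Enterprise, at the cost of hiding genuine misconfiguration, which is why the result is logged.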
Splunk's default webhook payload doesn't match what the docs describe. When a real Splunk scheduled search fires an alert action, it sends fields like host as arrays, not strings. The strict human-flow integration test caught this: the API schema rejected the real payload while accepting the hand-crafted test payloads. Fixing this required reading the actual incoming payloads and updating the Zod schema to handle both shapes.
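The "both shapes" handling can be illustrated without any library. The actual fix lives in a Zod schema, which is not reproduced here; this dependency-free sketch shows the same normalisation idea, and the function names and sample field values are assumptions.

```typescript
// Accepts "web-01" or ["web-01", "web-02"] and returns the first usable string.
function firstString(value: unknown): string | undefined {
  if (typeof value === "string") return value;
  if (Array.isArray(value)) {
    return value.find((v): v is string => typeof v === "string");
  }
  return undefined;
}

// Normalises every field of an alert result to a plain string,
// dropping fields that contain no string at all.
function normaliseResult(result: Record<string, unknown>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const key of Object.keys(result)) {
    const value = firstString(result[key]);
    if (value !== undefined) out[key] = value;
  }
  return out;
}

// Hand-crafted payloads tend to look like the first call; real ones like the second.
const fromTestPayload = normaliseResult({ host: "web-01", source: "app.log" });
const fromRealPayload = normaliseResult({ host: ["web-01"], source: "app.log" });
```

Taking the first array element is a simplification; a multi-value field may carry several hosts, and a production schema might keep the whole array instead.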
SPL keyword similarity is architecturally honest but imprecise. Splunk KV Store has no native vector index. The similarity search computes keyword overlap scores between symptom arrays, which means it misses synonyms and handles unusual incident descriptions poorly. I documented this as a known gap and built the system so the similarity score is always surfaced to the agent — it can reason about its own confidence rather than acting on false precision.
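To make the limitation concrete, here is one way keyword overlap between symptom arrays could be scored. Sentinel runs its scoring as SPL, so this TypeScript is only an illustration of the idea; the use of Jaccard similarity, the tokenisation rule, and the function names are assumptions.

```typescript
// Lowercases symptoms and splits them into a set of alphanumeric tokens.
function tokenise(symptoms: string[]): Set<string> {
  const tokens = new Set<string>();
  for (const symptom of symptoms) {
    for (const token of symptom.toLowerCase().split(/[^a-z0-9]+/)) {
      if (token.length > 0) tokens.add(token);
    }
  }
  return tokens;
}

// Jaccard similarity: shared tokens divided by all distinct tokens, in [0, 1].
function overlapScore(a: string[], b: string[]): number {
  const ta = tokenise(a);
  const tb = tokenise(b);
  if (ta.size === 0 || tb.size === 0) return 0;
  let shared = 0;
  ta.forEach((token) => {
    if (tb.has(token)) shared++;
  });
  return shared / (ta.size + tb.size - shared);
}

// Shared words score well; synonyms score zero, which is the documented gap.
const related = overlapScore(["redis connection timeout"], ["redis timeout"]);
const synonyms = overlapScore(["latency spike"], ["slow responses"]);
```

Here `related` is 2/3 because two of the three distinct tokens are shared, while `synonyms` is 0 even though both phrases describe the same symptom.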
Building the production gate. Sentinel blocks startup if it detects any combination of demo flags, offline generation, local URLs, or missing credentials. Identifying every "fake path" in the codebase required reading every conditional and env check that could silently substitute test behaviour for real behaviour. The gate runs before any service is bound.
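A gate of this kind can be sketched as a function that collects every violation before anything starts. The environment variable names below (`SENTINEL_DEMO_MODE`, `SENTINEL_OFFLINE_GENERATION`, `SPLUNK_URL`, `SPLUNK_TOKEN`) are hypothetical, as are the function names; only the four categories of check come from this write-up.

```typescript
type Env = Record<string, string | undefined>;

// Matches http(s) URLs whose host is localhost, 127.0.0.1, or [::1].
const LOCAL_URL = /^https?:\/\/(localhost|127\.0\.0\.1|\[::1\])(:\d+)?(\/|$)/i;

const isOn = (value: string | undefined): boolean =>
  value === "1" || value?.toLowerCase() === "true";

// Returns every reason the process must not start in production.
// All variable names here are hypothetical placeholders.
function productionGateViolations(env: Env): string[] {
  const violations: string[] = [];
  if (isOn(env["SENTINEL_DEMO_MODE"])) violations.push("demo flag is enabled");
  if (isOn(env["SENTINEL_OFFLINE_GENERATION"])) violations.push("offline generation is enabled");
  const url = env["SPLUNK_URL"];
  if (!url) violations.push("SPLUNK_URL is missing");
  else if (LOCAL_URL.test(url)) violations.push("SPLUNK_URL points at a local address");
  if (!env["SPLUNK_TOKEN"]) violations.push("SPLUNK_TOKEN is missing");
  return violations;
}

// Call this before binding any port; it throws with the full list of problems.
function assertProductionReady(env: Env): void {
  const violations = productionGateViolations(env);
  if (violations.length > 0) {
    throw new Error(`Production gate failed: ${violations.join("; ")}`);
  }
}
```

Reporting all violations at once, rather than failing on the first, saves a restart cycle per misconfigured setting.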
Splunk Cloud account verification. The Cloud credentials needed for full cloud cutover are still pending account verification — this is a platform timing issue, not an implementation gap. The full Splunk Cloud integration is implemented and the code paths are tested. The submission runs on local Splunk Enterprise exposed via a Cloudflare tunnel, which uses the same Splunk codebase and satisfies the full production flow.
Accomplishments that we're proud of
The learning loop number is the one I keep coming back to: the same incident type resolved in 21.6 seconds the first time and 7.1 seconds the second time, with match confidence moving from 13% to 95%.
A few others worth naming:
- The autonomous detection path — injecting logs with a standalone Node.js script, walking away, and watching Sentinel fire from a Splunk saved search without any command is the cleanest possible demonstration that the product works as described
- 498 audit entries written across test runs, proving every agent decision is traceable and not a black box
- A native Splunk dashboard that lives inside Splunk's own UI — not another external web app, but a real Splunk application component
- A production gate that physically blocks startup if demo flags are on — this matters because it means the verification suite and the production deployment run through entirely different code paths
- The strict human-flow test: a fresh org, real Splunk saved search scheduler, real HEC log ingestion, and no pnpm workspace shortcut, with the full chain working as a real user would experience it
What we learned
The most important thing was understanding the boundary between Splunk Enterprise and Splunk Cloud Platform as deployment targets for agentic capabilities. The | ai command and Hosted Models are genuinely distinct things with different availability. Building a capability probe rather than hardcoding environment flags is the right pattern for any agent that needs to run across multiple deployment environments.
Writing a custom MCP REST adapter taught me how thin the abstraction over Splunk's management API actually needs to be. Once you have a typed client for searches, KV Store operations, and HEC writes, the rest of the agent can be completely Splunk-agnostic. The same five tool contracts could point at Elastic or Datadog with different implementations underneath.
The similarity search gap made me think more carefully about what "the brain" actually is. A keyword match that returns an 87% confidence score is useful even when the underlying similarity is imprecise. As long as the agent surfaces the score and can reason about its own uncertainty, a lower-precision memory layer is still better than no memory at all. Honesty about confidence is more important than optimising the confidence score.
What's next for Sentinel
Vector embeddings for similarity search. The keyword overlap implementation is the documented known gap. Replacing it with Gemini embeddings stored in a dedicated vector index (or Splunk's ML Toolkit for a native solution) would significantly improve pattern recognition on novel incident types.
Splunk Cloud production deployment. The code is ready. As soon as account credentials are available, the full cloud cutover is a config change.
ThreatChain. Applying the same observe-remember-reason-act-learn loop to security operations — autonomous threat investigation that enriches IOCs from external threat intelligence feeds, classifies threat types, and executes containment actions. Security is Splunk's core market and the same architectural pattern applies directly.
Sub-agent delegation. A coordinator Sentinel spawning specialist sub-agents per incident domain — a database agent, a network agent, a security agent — each with its own focused brain and runbook library. The coordinator manages escalation and orchestration.
Cloud Run remediation execution. The current remediation layer calls registered admin endpoints on services. Full Cloud Run job dispatch requires GCP billing enabled, which is the remaining infrastructure gap for live automated remediation.
Built With
- cloudflare
- custom-splunk-mcp-rest-adapter
- docker
- express.js
- gemini
- google-cloud-run
- jwt
- next.js-14
- node.js-20
- pnpm
- railway
- server-sent-events
- spl-(search-processing-language)
- splunk-ai-toolkit-5.7.4
- splunk-alert-actions
- splunk-enterprise
- splunk-hec
- splunk-kv-store
- tailwind-css
- typescript
- zod

