## Inspiration

Splunk's own AI — AppInspect AI Intelligence Reports, the AI Assistant — is excellent at explaining what's wrong. But a developer still has to do the resolving, re-validating, and remembering by hand. Onboarding a new data source, getting an app through AppInspect, and writing cost-aware SPL are exactly the expert-gated, repetitive tasks that eat an afternoon and leave no audit trail. We wanted to close that loop: an agent that doesn't just describe a failure, but fixes it, proves the fix against real Splunk, and records why.

Our one-line thesis: Splunk's AI explains failures. This copilot resolves, validates, and remembers them.

What it does

One shared, artifact-agnostic self-heal engine — diagnose → select a deterministic patch → apply → re-validate → repeat — drives three working loops across the Splunk app lifecycle:

  1. Onboarding (live MCP). Ingests a raw UPI/GST-style log sample through HEC, then validates inline rex/eval field-extraction candidates against the real indexed events via the Splunk MCP server's splunk_run_query. It diagnoses coverage, CIM-mapping, and PII gaps, swaps to a more robust candidate, and re-validates until clean. On our fixture: 150 events ingested, 6 real splunk_run_query calls, 3 gaps → 0, 6 CIM mappings (amt→amount, status→action, vpa→dest, payer_vpa→src_user, txn_id→transaction_id, gstin→vendor_id) and 2 PII flags (payer_vpa, payer_mobile).

  2. AppInspect (static). Runs splunk-appinspect on a deliberately broken app, parses the JSON failures, applies deterministic patch functions (remove forbidden local/, default/user-seed.conf, forwarding outputs.conf), and re-runs until zero failures. No live Splunk required.

  3. Cost-aware SPL lint (static). Lints a deliberately costly search for anti-patterns (index=*, no time bound, unbounded | sort) and applies deterministic rewrites (index=main, earliest=-24h, | sort 1000) until clean.

Every patch across every loop is appended to a provenance ledger{stage, iteration, failure, diagnosis, patch, rationale, validation_result, timestamp, changed_paths} — the "and remember" half of the thesis.

A React dashboard replays all three loops from committed events (zero dependencies), opening on a Lifecycle overview and drilling into each stage's timeline, metrics, CIM/PII panel, and a full provenance audit trail. It also has a Live mode: a "Go Live" button streams a loop's events over Server-Sent Events as it actually runs, rendered through the same reducer as replay. The same copilot agent is wired into VS Code as Run Task / Run-and-Debug entry points.

How we built it

  • Python package lifecycle_copilot (CLI: copilot) built around a single SelfHealEngine reused verbatim by all three loops — only the diagnosis source and patch registry differ. Crucially, the LLM produces diagnosis and rationale text only; concrete file changes come from deterministic patch functions in a constrained registry. That keeps every run reproducible and the audit trail defensible.
  • Splunk MCP server as the action surface for onboarding: an encrypted-token-authenticated streamable-HTTP client calling splunk_run_query to validate extractions against real events. HEC for ingestion; splunklib SDK where simpler than MCP.
  • splunk-appinspect CLI for the AppInspect loop; a small deterministic SPL cost ruleset for the lint loop.
  • Dashboard: React + TypeScript + Vite + Bun, with a pure, unit-tested event-stream reducer that renders replay and live identically. Live transport is a Starlette + SSE server that runs a static loop in a background thread and streams each emitted event via an on_event hook → asyncio.Queue → SSE.
  • Splunk Enterprise in Docker (splunk/splunk:latest) with HEC enabled for the live onboarding slice.
  • Quality: green test suite — 32 Python tests + 18 dashboard tests, ruff clean, dashboard tsc + vite build green.

Challenges we ran into

  • Indexing race. HEC acknowledges before indexing completes, so the first MCP validation could see zero events and mis-diagnose. We added a bounded settle step that polls | stats count through splunk_run_query until the indexed count matches what was ingested — making runs deterministic. (You can see this as the index_settle-purpose MCP calls in a real run.)
  • One patch, many gaps. The onboarding candidate swap heals the timestamp, CIM, and PII gaps at once, but only the selected gap emits per-failure events. The dashboard reducer backfills the others on a clean completion so the timeline reads as fully healed.
  • Encrypted MCP token. The MCP token is RSA-encrypted by the Splunk MCP Server app — not a REST bearer token. Minting and wiring it correctly, including a relative-expiry quirk, took real care.
  • Python 3.13, not 3.14. On Python 3.14, AppInspect's bundled static analyzer fails to initialize and reports every Python check as an error, so the loop could never reach a clean result even after patching the real failures. We pinned the interpreter to 3.13 and documented it rather than weakening our "clean means zero failures and zero errors" definition.
  • Keeping it honest. We deliberately constrained the LLM to text and kept patches deterministic, so judges can trust the audit trail and reproduce every run.

Accomplishments that we're proud of

  • One engine, three loops. Adding the SPL lint loop reused the self-heal engine verbatim — the platform thesis made concrete, not aspirational.
  • A live, end-to-end MCP-validated onboarding run: 150 events ingested, three gaps healed, six CIM mappings, two PII flags, six real splunk_run_query calls — the MCP integration is load-bearing, not decorative.
  • A dashboard that runs in both replay and live mode off the same reducer, fully responsive, with a real provenance audit trail.
  • Deterministic, reproducible loops backed by a green test suite.

## What we learned

The leverage is in the closed loop plus provenance, not in a bigger model. Constraining the agent to diagnosis text and deterministic patches gave us repeatability and a defensible audit trail — and made the "one engine, many loops" platform story credible. We also learned how much of "production-readiness" in Splunk is really about validating against real data (the MCP splunk_run_query oracle) rather than reasoning about logs in the abstract.

What's next for Splunk App Lifecycle Copilot

  • Stage 3 (app scaffold + synthetic test-data generation) and Stage 5 (Simple XML → Dashboard Studio migration) on the same engine.
  • Use splunk_get_knowledge_objects to reconcile extractions against existing field extractions and data models before proposing new ones.
  • Emit final props.conf / transforms.conf from the onboarding loop, not just validated candidates.
  • Surface the provenance ledger back into Splunk as an index — so the audit trail of every automated fix is itself searchable in Splunk.

Built With

Share this project:

Updates