SplunkForge - From one sentence to a deployed Splunk app

Inspiration

Building a Splunk app is harder than it looks, and not for the reason you'd expect. It's not Python. It's not even SPL. It's the gap between knowing what you want and knowing how Splunk wants you to say it: conf files, restmap routes, AppInspect rules, packaging conventions, all the little opinions the platform has that nobody documents in one place. A senior Splunk engineer navigates that in a couple of days. Most people stall somewhere in the middle. And the people who feel the pain most acutely (say, the ops engineer watching error rates climb, or the security analyst who just wants an enrichment workflow) are usually not the people who can build the thing that fixes it.

We built this because we kept running into that gap. We wanted something that didn't just hand you a starting point and wish you luck. We wanted something that actually finished the job.

That tool is SplunkForge.

What it does

You describe what you need in plain English. SplunkForge builds and deploys a working Splunk app.

That sounds like a big claim, so here's exactly what happens under the hood. The pipeline is:

Intent classification figures out what kind of app you're asking for — alert, dashboard, AI agent, streaming command, enrichment workflow.
Schema grounding connects to your actual Splunk instance and discovers your real indexes, source types, and fields. Every query SplunkForge writes is grounded against your data, not a generic template.
Expert expansion is the part we're most proud of. A vague prompt like "monitor my checkout API" gets unpacked the way a senior engineer would unpack it: what questions does this app actually need to answer? Current state, trend, breakdown, comparison, anomaly, what to do about it. We also enforce minimums here — dashboards need at least 8 panels, agent apps need at least 3 subagents and 5 tools. This is what keeps the output from being a one-panel placeholder.
The Logic Plan is a plain-English description of what the app will do, shown to you before any code is written. Agent topologies, SPL queries, system prompts, output schemas — all readable, all editable. You review it, approve it, or ask for changes.
Code generation produces the full app: Python, conf files, system prompts, tool implementations, packaging. AI components use splunklib.ai with SCPv2 protocol compliance.
Deploy and verify installs the app on your Splunk instance, runs the saved searches, invokes the agents with test inputs, runs AppInspect, and shows you the results.

From prompt to verified app: about four minutes.
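The minimums enforced during expert expansion can be pictured as a simple shortfall check. The sketch below is illustrative only: the thresholds (8 panels, 3 subagents, 5 tools) come from the description above, while `ExpandedSpec`, `MINIMUMS`, and `missing_artifacts` are hypothetical names, and plain dataclasses stand in for whatever schema the real stage uses.

```python
from dataclasses import dataclass, field

# Thresholds quoted in the pipeline description; this layout is an assumption.
MINIMUMS = {
    "dashboard": {"panels": 8},
    "agent": {"subagents": 3, "tools": 5},
}


@dataclass
class ExpandedSpec:
    """Hypothetical stand-in for the output of the expansion stage."""
    app_type: str
    panels: list = field(default_factory=list)
    subagents: list = field(default_factory=list)
    tools: list = field(default_factory=list)


def missing_artifacts(spec: ExpandedSpec) -> dict:
    """Return how many artifacts of each kind are still missing (empty if none)."""
    shortfalls = {}
    for artifact, minimum in MINIMUMS.get(spec.app_type, {}).items():
        have = len(getattr(spec, artifact))
        if have < minimum:
            shortfalls[artifact] = minimum - have
    return shortfalls


thin = ExpandedSpec(app_type="dashboard", panels=["error rate"])
print(missing_artifacts(thin))  # {'panels': 7}
```

A non-empty result is the kind of signal that could send the spec back for another expansion pass instead of letting a one-panel dashboard through.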

How we built it

The frontend is a three-panel Next.js interface: prompt on the left, pipeline visualization and Logic Plan in the middle, file tree and generated code on the right. The Logic Plan panel is the centerpiece — it's there specifically so the AI's reasoning isn't hidden. You see exactly what's going to be built before it's built.

The backend is a stage-based pipeline with Pydantic v2 schemas at every handoff. Each stage produces a typed intermediate — AppSpec, ExpansionResult, LogicPlan — that the next stage consumes. This made it much easier to debug and to iterate on individual stages without breaking everything else.
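To make the typed handoffs concrete, here is a minimal sketch of three stages chained through typed intermediates. The type names `AppSpec`, `ExpansionResult`, and `LogicPlan` come from the paragraph above; every field, the stage functions, and the keyword rule in `classify` are assumptions for illustration. Frozen dataclasses are used here only to keep the example dependency-free; the project itself uses Pydantic v2.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AppSpec:
    app_type: str
    prompt: str


@dataclass(frozen=True)
class ExpansionResult:
    spec: AppSpec
    questions: Tuple[str, ...]


@dataclass(frozen=True)
class LogicPlan:
    title: str
    steps: Tuple[str, ...]


def classify(prompt: str) -> AppSpec:
    # Placeholder keyword rule; a real classifier would call a model.
    app_type = "dashboard" if "monitor" in prompt.lower() else "alert"
    return AppSpec(app_type=app_type, prompt=prompt)


def expand(spec: AppSpec) -> ExpansionResult:
    # The six question types listed in the expert expansion step.
    questions = ("current state", "trend", "breakdown",
                 "comparison", "anomaly", "next action")
    return ExpansionResult(spec=spec, questions=questions)


def plan(expansion: ExpansionResult) -> LogicPlan:
    steps = tuple(f"Answer: {q}" for q in expansion.questions)
    title = f"{expansion.spec.app_type} for: {expansion.spec.prompt}"
    return LogicPlan(title=title, steps=steps)


# Each stage accepts only the previous stage's output type.
logic_plan = plan(expand(classify("monitor my checkout API")))
print(logic_plan.title)  # dashboard for: monitor my checkout API
```

Because each function depends only on the type it receives, a single stage can be swapped or tested in isolation as long as its input and output types stay the same.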

Schema grounding uses the Splunk MCP Server with token-based auth. We chunk discovery queries and cache the results per session to avoid hitting MCP timeout limits on large indexes.
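The chunk-and-cache idea can be sketched as follows. This is not the SplunkForge implementation: `SchemaCache`, `chunked`, and the injected `run_query` callable are hypothetical, and `run_query` stands in for whatever call actually reaches the MCP Server. The sketch assumes that call takes a batch of index names and returns a mapping from index to field names.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class SchemaCache:
    """Per-session cache of discovered fields, keyed by (session_id, index)."""

    def __init__(self,
                 run_query: Callable[[List[str]], Dict[str, List[str]]],
                 chunk_size: int = 2) -> None:
        self._run_query = run_query  # hypothetical wrapper around the MCP call
        self._chunk_size = chunk_size
        self._cache: Dict[Tuple[str, str], List[str]] = {}

    def discover(self, session_id: str, indexes: List[str]) -> Dict[str, List[str]]:
        # Only query indexes this session has not seen yet.
        missing = [i for i in indexes if (session_id, i) not in self._cache]
        for batch in chunked(missing, self._chunk_size):
            for index, fields in self._run_query(batch).items():
                self._cache[(session_id, index)] = fields
        return {i: self._cache[(session_id, i)]
                for i in indexes if (session_id, i) in self._cache}


# Usage with a fake query function that records each batch it receives.
calls = []

def fake_query(batch):
    calls.append(list(batch))
    return {index: ["host", "status"] for index in batch}

cache = SchemaCache(fake_query, chunk_size=2)
cache.discover("session-1", ["web", "app", "db"])  # two small queries
cache.discover("session-1", ["web", "db"])         # served from the cache
print(calls)  # [['web', 'app'], ['db']]
```

Keying the cache by session keeps one user's grounded schema from leaking into another's, at the cost of repeating discovery for each new session.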

The Logic Generation Agent takes the expanded spec and grounded schema and writes the actual engineered content: system prompts tuned to the app's purpose, tool implementations with real SPL, structured output schemas, subagent topologies with non-overlapping responsibilities.

AI agent components are generated using splunklib.ai, with the async-to-sync bridging needed for SCPv2 custom commands. The generated code follows a tool registry pattern, a supervisor-plus-specialist agent topology, and Pydantic-validated structured outputs.
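One generic way to call async code from a synchronous entry point is to run a private event loop on a background thread and block on the result. The sketch below uses only the standard library and does not touch splunklib.ai; `AsyncBridge` and `ask_agent` are hypothetical names, and nothing here is claimed to match the wrapper SplunkForge ships.

```python
import asyncio
import threading
from typing import Any, Coroutine, Optional


class AsyncBridge:
    """Runs coroutines on a private event loop in a background thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro: Coroutine[Any, Any, Any],
            timeout: Optional[float] = None) -> Any:
        """Block the calling (synchronous) thread until the coroutine finishes."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout)

    def close(self) -> None:
        """Stop the loop, wait for the thread to exit, then release the loop."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def ask_agent(question: str) -> str:
    # Hypothetical stand-in for an awaitable agent call.
    await asyncio.sleep(0)
    return f"answer to: {question}"


bridge = AsyncBridge()
print(bridge.run(ask_agent("why are checkout errors rising?")))
bridge.close()
```

A dedicated loop thread avoids creating a new event loop for every record a synchronous command processes, which is one reason to prefer it over calling `asyncio.run` repeatedly.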

Challenges we ran into

Vague prompts producing thin apps: The first version generated technically correct apps that were operationally useless — one-panel dashboards, single-tool agents. We solved it by inserting the expert expansion stage between intent classification and Logic Plan generation, with hard minimum artifact counts that prevent the model from taking the easy path.

splunklib.ai is alpha: The library is on the develop branch, and the SCPv2 streaming command protocol needed manual async-to-sync bridging that isn't documented. We worked it out by reading the source, then packaged the wrapper into our generators.

MCP Server query timeouts: Some schema discovery queries against large indexes hit the MCP query timeout. We worked around it by chunking discovery into smaller targeted queries and caching grounded schema per session.

AppInspect compliance: Generated apps initially failed several AppInspect checks. We hardened the generators by treating AppInspect as a constraint upstream — the templates now produce structures that pass validation by construction rather than by remediation.

Demonstrating that the AI part actually works: A code generator that produces unrun code is not impressive. The hardest engineering decision was committing to the verification loop — automating deploy, search execution, agent invocation, and AppInspect into a single smoke test that surfaces real evidence in the UI.
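A single smoke test over several checks can be sketched as a runner that records evidence for each step. The four step names mirror the ones listed above; `CheckResult`, `run_smoke_test`, and the lambda checks are hypothetical placeholders, not the real deploy or AppInspect calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CheckResult:
    name: str
    passed: bool
    evidence: str


def run_smoke_test(checks: List[Tuple[str, Callable[[], str]]]) -> List[CheckResult]:
    """Run each check in order; a raised exception marks that check as failed."""
    results = []
    for name, check in checks:
        try:
            results.append(CheckResult(name, True, check()))
        except Exception as exc:  # keep going so every result can be shown
            results.append(CheckResult(name, False, f"{type(exc).__name__}: {exc}"))
    return results


def failing_appinspect() -> str:
    # Placeholder failure to show how errors are captured as evidence.
    raise RuntimeError("1 check failed")


report = run_smoke_test([
    ("deploy", lambda: "app installed"),
    ("search execution", lambda: "3 saved searches returned rows"),
    ("agent invocation", lambda: "agent answered test input"),
    ("AppInspect", failing_appinspect),
])
for result in report:
    print(f"{'PASS' if result.passed else 'FAIL'}  {result.name}: {result.evidence}")
```

This version continues after a failure so that all evidence is collected in one pass. A stricter variant could stop at the first failed step, which would make sense when later checks depend on a successful deploy.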

Accomplishments that we're proud of

A vague one-sentence prompt produces a real, multi-artifact, schema-grounded Splunk app, not just a skeleton.
The Logic Plan review step puts a human in the loop without slowing things down.
The verification loop closes the agentic cycle (describe → plan → generate → deploy → prove), so the output is a deployed, tested product, not a prototype.
The expert expansion stage is doing something genuinely different from typical AI scaffolding: it encodes what "good" looks like in this domain.

What we learned

The biggest lesson is that AI code generation gets dramatically better when you make the model think like a domain expert before it writes anything. The default behavior of "interpret the prompt literally" produces thin output. Forcing an expansion step that asks "what would a senior engineer build here?" produces output that's qualitatively different: not just more code, but better-reasoned code.

The second lesson is that the verification loop changes everything about how the project is perceived. A generated app that hasn't been run is a prototype. A generated app that has been deployed, executed, and validated against real data is a product.

The third is that the intermediate representation matters more than the final output. The Logic Plan being plain English and reviewable is what turns SplunkForge from a black box into a tool engineers actually want to use.

What's next for SplunkForge

Expand from three well-supported app archetypes to the full set: alerting workflows, security enrichment pipelines, observability dashboards, custom SPL commands, ML-enabled apps.
Conversational refinement - edit the Logic Plan through dialogue rather than direct manipulation.
Multi-app composition - let SplunkForge build apps that depend on or extend other apps.
Splunk Cloud deployment in addition to Splunk Enterprise.
A library of generated apps that users can share, fork, and remix as Logic Plans rather than as code.