Inspiration

We have production-grade observability for services (Datadog, Splunk Infrastructure), SRE workflows for jobs (Kubernetes, Airflow), and audit trails for human operators (SSO, GRC, SOC 2). But AI agents, the fastest-growing production workload of 2026 and arguably ever, run unsupervised with no flight recorder. When a "Cost Analyst Agent" spends $0.17 on a single investigation, who knows what tools it called, what data it saw, or whether the answer was right? When it loops, hallucinates, or silently switches providers, who catches it?

Existing LLM-observability tools (Langfuse, Helicone, LangSmith) bolt tracing on as an afterthought and treat it as separate from the rest of the production stack. We wanted the same kind of incident response for agents that Splunk gave infrastructure: durable, queryable, alert-able, with the same retry / replay / risk-grading workflows. Splunk already has the data model (events with timing, status, identifiers) and the AI primitives (MCP Server) to be the agent control plane. AgentScope is the missing link: the runtime that turns "agent did a thing" into "here is the audit trail, the SPL query that found the problem, and the risk-graded report".

What it does

AgentScope is the operations plane for AI employees. It is three things working together:

  1. A control plane for AI work. A "Run Splunk Investigation" button on /agents queues a job, a durable worker claims it, and the page shows live status as the job moves from Queued to Running to Completed.

  2. A black-box recorder that captures every model call and tool invocation in a replayable event timeline (ModelInvoked, ModelCompleted, ToolCalled, ToolReturned, CostRecorded, SessionStarted, SessionCompleted). The timeline is persisted in Postgres and forwarded to the Splunk HTTP Event Collector under the agentscope:event sourcetype. The session replay at /sessions/<id> re-reads that same timeline.

  3. An incident investigator built on the Splunk MCP Server. When a run finishes, the Splunk AI Investigator spawns the MCP Server over stdio and issues an SPL search against the active session's events to produce a risk-graded operations report: findings, risk level, and the exact SPL query used. The runtime path is fail-closed: missing Splunk MCP or AI credentials produce an explicit failed run, never synthetic analysis.

The web app also has a dashboard with agent scorecards, a Splunk health panel, a direct SPL search panel against the live index, and a Sessions page for full audit replay.
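
As a rough illustration of the event timeline described above, the sketch below shows one way an event could be typed and wrapped for the HTTP Event Collector. The event names and the agentscope:event sourcetype come from this write-up; the field names (sessionId, agentId, timestampMs, payload) and the helpers toHecEnvelope and replayOrder are assumptions made for the example, not the project's actual schema.

```typescript
// Event names are taken from the write-up; every field name is an assumption.
type AgentEventType =
  | "SessionStarted"
  | "ModelInvoked"
  | "ModelCompleted"
  | "ToolCalled"
  | "ToolReturned"
  | "CostRecorded"
  | "SessionCompleted";

interface AgentEvent {
  type: AgentEventType;
  sessionId: string;
  agentId: string;
  timestampMs: number; // epoch milliseconds
  payload: Record<string, unknown>;
}

// Shape of one HEC JSON event: "time" is epoch seconds, "event" is the body.
interface HecEnvelope {
  time: number;
  sourcetype: string;
  event: AgentEvent;
}

// Hypothetical helper: wrap an event for forwarding to HEC.
function toHecEnvelope(e: AgentEvent): HecEnvelope {
  return { time: e.timestampMs / 1000, sourcetype: "agentscope:event", event: e };
}

// Hypothetical helper: replay is a re-read of the same events in time order.
function replayOrder(events: AgentEvent[]): AgentEvent[] {
  return [...events].sort((a, b) => a.timestampMs - b.timestampMs);
}
```

Because both the replay page and Splunk consume the same AgentEvent values, a single type like this would keep the two views from drifting apart.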

How we built it

  • Web + API. Next.js 16 (App Router) with tRPC, Drizzle ORM against Postgres, Better Auth (Discord OAuth + email) with Resend for invites, Stripe for billing, and the Vercel AI SDK for the provider-agnostic agent runtime. Monorepo with pnpm workspaces and Turborepo.

  • Splunk integration lives in packages/telemetry with three touchpoints: splunk.ts forwards events to HEC behind a durable Postgres outbox (so no event is lost when Splunk is briefly unavailable), mcp.ts spawns the Splunk MCP Server over stdio with indexer-backoff retries, and anomaly.ts runs direct SPL through the management API for cost-by-agent, p50/p95 model latency, and tool reliability.

  • Worker. apps/workers is a separate Node process that polls the agent_run table, claims rows with row-level locks, executes the agent, and emits events. The web app never runs agents inline; the worker is the only thing that touches the LLM or MCP server.

  • Agents. packages/agents ships three starter AI employees (Cost Analyst, Reliability, Research) and a generic tool-using loop. The user-facing agents call a splunk-context-search tool during their own run; the investigator is a separate agent that calls the Splunk MCP Server (via mcpSearch from @agentscope/telemetry) exactly once per run and is instructed to ground every claim in the returned events.

  • Local + production deployment. Docker Compose brings up Postgres + Splunk Enterprise + the worker with a healthcheck-gated depends_on. Production deploys the worker as a separate process from the Next.js app (see docker-compose.prod.yml).
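
The outbox in front of HEC can be pictured as a small state machine over rows in Postgres. The sketch below is a minimal, in-memory version of that idea under stated assumptions: the OutboxRow shape, the 1-second base delay, the 60-second cap, and the function names are all hypothetical, since the write-up does not show the real table or retry policy.

```typescript
// Hypothetical outbox row; the real table layout is not shown in the write-up.
interface OutboxRow {
  id: number;
  body: string; // serialized HEC payload
  attempts: number; // failed delivery attempts so far
  nextAttemptAtMs: number; // earliest time the row may be retried
  deliveredAtMs: number | null; // null until HEC accepts the event
}

const BASE_DELAY_MS = 1000; // assumed
const MAX_DELAY_MS = 60000; // assumed

// Exponential backoff: 1s, 2s, 4s, ... capped at MAX_DELAY_MS.
function backoffDelayMs(attempts: number): number {
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** Math.max(0, attempts - 1));
}

// Rows that are still undelivered and whose retry time has arrived.
function dueRows(rows: OutboxRow[], nowMs: number): OutboxRow[] {
  return rows.filter((r) => r.deliveredAtMs === null && r.nextAttemptAtMs <= nowMs);
}

// Apply the outcome of one delivery attempt and return the updated row.
function recordResult(row: OutboxRow, ok: boolean, nowMs: number): OutboxRow {
  if (ok) return { ...row, deliveredAtMs: nowMs };
  const attempts = row.attempts + 1;
  return { ...row, attempts, nextAttemptAtMs: nowMs + backoffDelayMs(attempts) };
}
```

A forwarder loop would call dueRows on each tick, POST each body to HEC, and persist the result of recordResult. An event stays in the outbox until Splunk has acknowledged it, which is what makes a brief indexer outage survivable.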

Challenges we ran into

  • Splunk cold start. The Splunk container has a 360-second start_period and the HEC token has to be configured post-start with a one-shot script (./scripts/splunk-setup.sh). The worker uses depends_on: { splunk: { condition: service_healthy } } so it never starts before Splunk is ready, and the outbox handles the brief indexer-unavailable windows after start.

  • Telemetry schema drift. The first cut of the event emitter sometimes dropped duration, tokensIn, and provider on ModelCompleted. As a result, p50/p95 latency SLOs were uncomputable and cost attribution was inconsistent. We added a typed event schema and a per-event assertion in the outbox so missing fields fail loudly instead of silently degrading the data.

  • Cost guardrails. Every session costs roughly $0.17 because the agentic loop re-sends the full prompt three times. Without a per-agent per-period cap (AgentCostBudget with enforceHardCap, evaluated by evaluateAgentCostBudgets), a runaway agent could rack up dollars in minutes. We wired the cap into the run-queue admission check and emit a CostRecorded event on every session; the next iteration adds prompt caching for an expected 30–50% reduction.

  • MCP recursion. The investigator's own tool calls show up in its Splunk query results. We solved this by scoping the SPL to the active session via _raw="*${sessionId}*" so the investigator only sees events from the session under investigation, not the meta-events generated by the investigation itself.

  • MCP server lifecycle. The Splunk MCP Server is a stdio binary, not a long-lived service. Spawning it per-run with proper backoff, error reporting, and crash recovery was the most subtle piece of the runtime.
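
The per-event assertion mentioned under telemetry schema drift could look something like the sketch below. The three required fields on ModelCompleted (duration, tokensIn, provider) are the ones named above; the REQUIRED_FIELDS table and the function names are hypothetical, and a real implementation might use a schema library instead of a hand-rolled check.

```typescript
// Required payload fields per event type. Only ModelCompleted is listed because
// those are the fields the write-up names; other entries would be assumptions.
const REQUIRED_FIELDS: Record<string, readonly string[]> = {
  ModelCompleted: ["duration", "tokensIn", "provider"],
};

// Return the names of required fields that are absent or null.
function missingFields(type: string, payload: Record<string, unknown>): string[] {
  const required = REQUIRED_FIELDS[type] ?? [];
  return required.filter((f) => payload[f] === undefined || payload[f] === null);
}

// Fail loudly before the event is written to the outbox.
function assertEventComplete(type: string, payload: Record<string, unknown>): void {
  const missing = missingFields(type, payload);
  if (missing.length > 0) {
    throw new Error(`${type} is missing required field(s): ${missing.join(", ")}`);
  }
}
```

Rejecting the event at write time means a broken emitter shows up as an error in the worker rather than as a gap in the latency and cost queries days later.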

Accomplishments that we're proud of

  • The runtime MCP loop works end-to-end. An agent's behavior is recorded into Splunk, and a second agent (the investigator) uses the Splunk MCP Server to query that same data. This is a non-trivial second-order use of MCP: agents investigating themselves through Splunk, with no API glue and no shadow state.

  • Replay is just a UI re-read of the canonical event log. Every event the agent emits is the source of truth. The replay page, the dashboard, the SPL queries, and the investigator all hit the same agentscope:event data. If it isn't in Splunk, it didn't happen.

  • Risk-graded output, grounded in evidence. The investigator returns findings, a risk level, and the exact SPL query, so judges can re-run that query in Splunk Web and see the same data the agent saw.

  • A real, durable worker. The agent_run queue survives restarts (Postgres-backed), supports retries, and uses row-level locking to prevent two workers from claiming the same run. The same worker pattern works in the local Docker Compose stack and in production.

  • Zero-cost onboarding. docker compose up -d brings up the full stack (Postgres + Splunk + worker) on a laptop. New users get an organization, an owner membership, and three starter AI employees on first sign-in.
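
One common way to get the "no two workers claim the same run" guarantee from Postgres row-level locks is SELECT ... FOR UPDATE SKIP LOCKED inside an UPDATE. The sketch below assumes that approach; the write-up only says row-level locks are used, so SKIP LOCKED, the column names (status, claimed_by, claimed_at, created_at), and the QueryClient interface are all assumptions. Only the agent_run table name comes from the project.

```typescript
// Assumed claim query: lock one queued row, skipping rows other workers hold.
const CLAIM_NEXT_RUN_SQL = `
  UPDATE agent_run
     SET status = 'running', claimed_by = $1, claimed_at = now()
   WHERE id = (
     SELECT id
       FROM agent_run
      WHERE status = 'queued'
      ORDER BY created_at
      LIMIT 1
      FOR UPDATE SKIP LOCKED
   )
  RETURNING id`;

// Hypothetical minimal client; a node-postgres style client could be adapted to it.
interface QueryClient {
  query(sql: string, params: unknown[]): Promise<{ rows: Array<{ id: string }> }>;
}

// Returns the claimed run id, or null when the queue is empty.
async function claimNextRun(db: QueryClient, workerId: string): Promise<string | null> {
  const result = await db.query(CLAIM_NEXT_RUN_SQL, [workerId]);
  const row = result.rows[0];
  return row ? row.id : null;
}
```

With SKIP LOCKED, a second worker polling at the same moment does not block on the locked row; it simply moves on to the next queued run or gets nothing back.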

What we learned

  • Agent observability must be default-on, not opt-in. The agent runtime wraps every model call and tool invocation in an emitter, so the event log is a side-effect of execution, not a separate concern.

  • Splunk's data model is a great fit for agent sessions. Events with timing, status, identifiers, and a free-form JSON payload map cleanly onto ModelInvoked, ToolReturned, SessionCompleted, and the rest, with no custom schema gymnastics required.

  • MCP unlocks second-order agent use cases. The same MCP Server that a human could query from a chat app is also queryable from another agent, so you can build agents that investigate other agents through Splunk with no API glue, no auth boundary, and no extra infrastructure.

  • Cost attribution is the missing primitive for FinOps on agents. Without a per-session CostRecorded event, you cannot answer "what did this agent cost us this week?" and you cannot set per-agent budgets or trigger alerts on cost anomalies.

  • HEC durability matters. The outbox pattern (Postgres-backed retry queue in front of HEC) is essential for not losing events when Splunk is briefly unavailable, exactly the kind of failure mode that observability tooling has to be the most resilient against.

  • The investigator's own report is a meta-recursion test. Reading the AI Analysis it produced on its own session was a useful forcing function: it found a real schema bug, a real cost lever, and a real latency floor that we hadn't yet addressed. Agents are useful for investigating agents.
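
To make the "default-on" point concrete, here is a minimal sketch of a wrapper that emits ModelInvoked and ModelCompleted around any model call. The event names come from the write-up; withModelEvents, the Emit signature, the ModelResult shape, and the injectable clock are assumptions for illustration and are not the project's actual runtime.

```typescript
// Hypothetical emitter signature; the real emitter is not shown in the write-up.
type Emit = (
  type: "ModelInvoked" | "ModelCompleted",
  payload: Record<string, unknown>,
) => void;

// Hypothetical result shape for a model call.
interface ModelResult {
  text: string;
  tokensIn: number;
  tokensOut: number;
}

// Wrap a model call so events are a side effect of executing it.
async function withModelEvents(
  emit: Emit,
  meta: { sessionId: string; provider: string; model: string },
  call: () => Promise<ModelResult>,
  now: () => number = Date.now, // injectable clock for tests
): Promise<ModelResult> {
  const startedAt = now();
  emit("ModelInvoked", { ...meta });
  try {
    const result = await call();
    emit("ModelCompleted", {
      ...meta,
      status: "ok",
      duration: now() - startedAt,
      tokensIn: result.tokensIn,
      tokensOut: result.tokensOut,
    });
    return result;
  } catch (err) {
    // A failed call still completes the pair so the timeline has no dangling event.
    emit("ModelCompleted", {
      ...meta,
      status: "error",
      duration: now() - startedAt,
      tokensIn: 0, // no usage reported on failure (assumption)
      tokensOut: 0,
      error: String(err),
    });
    throw err;
  }
}
```

If every model call in the agent loop goes through a wrapper like this, there is no code path that talks to a provider without leaving a trace.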

What's next for AgentScope

  • Multi-tool sessions. Today the Cost Analyst only uses splunk-context-search. We want to wire in web-search, code-exec, git, and a SQL tool so the investigator can cross-reference Splunk data with the rest of the world and answer questions like "did this run match the production schema change from last week?".

  • Prompt caching. Caching the system-prompt prefix should cut cost by 30–50% (per the AI analysis the investigator itself produced on its own session).

  • Streaming model output to the UI. The current 30-second wall time per model call is a UX bottleneck. Streaming tokens should cut perceived latency to under a second and make the agent feel interactive.

  • Provider failover and circuit breakers. Today we are 100% on OpenAI-compatible providers. We want automatic failover to Anthropic when p95 latency exceeds 60 seconds, plus a circuit-breaker on the primary provider.

  • Open the MCP server to other teams. Let any agent, internal or third-party, point at AgentScope as its audit and replay layer. The Splunk index becomes the agent's permanent memory, and AgentScope becomes the way humans and agents collaborate on incident response.

  • Production hardening. Multi-tenant isolation for the Splunk index, a schema validator at ingest, SLOs on p95 model latency, and a per-session cost budget alert. Most of this is already in the architecture; the rest is operational work for the next sprint.
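
The planned failover rule can be sketched as a pure function over recent latency samples. The 60-second p95 threshold comes from the roadmap item above; the nearest-rank percentile method, the 20-sample minimum, and the function names are assumptions, and the circuit breaker itself is left out of this sketch.

```typescript
// Nearest-rank percentile over a list of samples (assumed method).
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)] as number;
}

const P95_FAILOVER_MS = 60000; // threshold named in the roadmap

// Decide which provider the next model call should use.
function chooseProvider(
  recentLatenciesMs: number[],
  minSamples: number = 20, // assumed: avoid flapping on too little data
): "primary" | "fallback" {
  if (recentLatenciesMs.length < minSamples) return "primary";
  return percentile(recentLatenciesMs, 95) > P95_FAILOVER_MS ? "fallback" : "primary";
}
```

Since ModelCompleted events already carry duration and provider, the samples could be read straight from the same event log that feeds the dashboard.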
