Project Story: log-book AI

Inspiration

Every engineer who has ever been on call knows the feeling. It is 2 AM. Your pager goes off. You have fifteen minutes to figure out why production is degrading — and the only window you have into the system is a Splunk search bar and a prayer that you remember the right SPL syntax.

We built the first version of this at a hackathon. The pitch was simple: what if you could just ask your logs what was wrong? An AI chatbot that could translate English into Splunk queries. Cute, right? Everyone else at the hackathon was building an "AI chatbot for X." We won the AI category.

But we could not shake the feeling that chatbots answer questions — they do not fix things.

We spent the next several months talking to SREs at three different companies. The pattern was universal and unsettling. Teams run 8 to 12 monitoring tools simultaneously. PagerDuty screams. Datadog dashboards flash. Splunk has the raw data. The human operator is the only integration point between them all. Incident response is not a search problem — it is an orchestration problem. The hardest part is not finding the error. It is connecting the error to a deployment, understanding the blast radius, deciding what to do, getting approval, executing safely, and not repeating the same mistake next week.

Existing AIOps tools (Datadog AI, New Relic AI, Splunk AI Assistant, more every quarter) are uniformly single-agent chat interfaces. They are reactive. They wait for you to ask. They have no memory of yesterday's outage, no knowledge that service A depends on service B, no ability to act when you are not at your desk.

The idea that became SplunkLens was not about building a better chatbot. It was about building an operations team made of software. Detect without being asked. Investigate while the human is still waking up. Generate a plan with risk scoring and rollback strategy. Present it for approval with confidence metrics and similar past incidents. Execute with safety gates. Store the resolution in a knowledge graph so the next incident starts from a smarter place.

That is what we built.

What It Does

SplunkLens AI is a desktop application that connects to Splunk Enterprise and deploys a fleet of autonomous AI agents to handle the full incident lifecycle.

Open the app for the first time and you see an onboarding wizard. Enter your Splunk URL, username, password. Pick your LLM provider from ten options — OpenAI, Anthropic, Google Gemini, Groq, OpenRouter, DeepSeek, Ollama (local or cloud), MiniMax, Z.AI. Select a model. Click Launch. That is the last time you need to touch configuration.

Behind the scenes, an Observer Agent starts ticking every thirty seconds. It runs five detection rules against your live Splunk data: error rate spikes, alert cascades, high latency, disk usage, authentication failures. Each rule produces a confidence score. If the score crosses 0.5, an Incident is created automatically. Deduplication keeps the same incident from firing twice within five minutes. The Coordinator Agent receives the incident, transitions it through a state machine with thirteen statuses and over twenty valid transitions, creates a timeline event, stores the incident in memory, and adds nodes and edges to the knowledge graph.
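
The score-then-deduplicate step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the Detection shape, the shouldCreateIncident helper, and the choice of rule id plus host as the dedup key are all assumptions. Only the 0.5 threshold and the five-minute window come from the description above.

```typescript
// Hypothetical shape; field names are assumptions for illustration.
interface Detection {
  ruleId: string;     // e.g. "error_rate_spike"
  host: string;
  confidence: number; // 0.0 to 1.0
}

const CONFIDENCE_THRESHOLD = 0.5;
const DEDUP_WINDOW_MS = 5 * 60 * 1000;

// Last time an incident was created for each dedup key (assumed key: rule + host).
const lastFired = new Map<string, number>();

function shouldCreateIncident(d: Detection, nowMs: number): boolean {
  // "Crosses 0.5" is read here as strictly greater than the threshold.
  if (d.confidence <= CONFIDENCE_THRESHOLD) return false;

  const key = `${d.ruleId}:${d.host}`;
  const previous = lastFired.get(key);
  if (previous !== undefined && nowMs - previous < DEDUP_WINDOW_MS) {
    return false; // same incident already fired inside the window
  }
  lastFired.set(key, nowMs);
  return true;
}
```

Passing the clock in as nowMs, instead of calling Date.now() inside the function, keeps the rule easy to test.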

Meanwhile, you see a dashboard with four views: Overview (total events, active alerts, error rates), Server Health (CPU, memory, disk across hosts), Alerts (fired alerts with severity), and Logs Explorer (virtualized table of log events with search). You can select rows and send them directly to the AI agent as context.

The AI Chat view is where the cognitive work happens. Type "why is my API gateway returning 503s" and the agent starts a streaming conversation. You see it think. You see it call tools — run_query against your Splunk data, get_metadata to discover hosts and sourcetypes, get_alerts to check for triggered alerts, correlate_events to find temporal relationships. Each tool call streams its result back to the UI. When the agent has enough evidence, it produces an analysis. If it identifies a root cause that can be remediated, it embeds a structured remediation plan in <remediation_plan> XML tags with steps, risk levels, rollback commands.
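
Pulling the embedded plan out of a finished response might look something like the sketch below. The extractRemediationPlan name is hypothetical, and because the format of the plan inside the tags is not described here, the sketch returns the raw inner text and leaves parsing to the caller.

```typescript
// Returns the text between the first pair of <remediation_plan> tags,
// or null when the response contains no plan.
function extractRemediationPlan(response: string): string | null {
  const match = /<remediation_plan>([\s\S]*?)<\/remediation_plan>/.exec(response);
  return match ? match[1].trim() : null;
}
```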

The Incident Console surfaces all of this in a unified view. A filterable incident list on the left. Incident detail with the full state machine timeline in the center — event by event, evidence by evidence, agent by agent. An Observer Panel on the right showing live observer status, active rules, and detection counters.

The Approval Panel (being built) will present generated plans to human operators with full context: confidence scores, risk assessments, blast radius analysis, rollback strategies, similar past incidents. Approve, reject, modify, or escalate.

The Executive Dashboard tracks MTTR, AI success rates, incident trends over time.

The Executor Agent takes approved plans and executes them step by step. Each step runs a health check after completion. If a step fails, the executor triggers automatic rollback, undoing completed steps in reverse order using their stored rollback commands. The Safety Engine validates every plan across five gates: command validation (blocked commands like rm -rf / are caught), blast radius, timing (business hours awareness), compliance (audit trail requirements), and reversibility (all steps must be reversible).
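
The execute, health-check, and roll-back loop can be illustrated with a small sketch. It is deliberately simplified and makes several assumptions: steps are synchronous callbacks rather than real commands, the PlanStep and ExecutionResult shapes are hypothetical, and the failing step itself is not rolled back.

```typescript
// Hypothetical step shape; a real executor would run commands asynchronously.
interface PlanStep {
  name: string;
  run: () => void;
  healthCheck: () => boolean;
  rollback: () => void;
}

interface ExecutionResult {
  succeeded: boolean;
  completed: string[];  // steps that ran and passed their health check
  rolledBack: string[]; // steps undone, in the order they were undone
}

function executePlan(steps: PlanStep[]): ExecutionResult {
  const done: PlanStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      if (!step.healthCheck()) {
        throw new Error(`health check failed after ${step.name}`);
      }
      done.push(step);
    } catch {
      // Undo completed steps, most recent first.
      const rolledBack: string[] = [];
      for (const prev of [...done].reverse()) {
        prev.rollback();
        rolledBack.push(prev.name);
      }
      return { succeeded: false, completed: done.map((s) => s.name), rolledBack };
    }
  }
  return { succeeded: true, completed: done.map((s) => s.name), rolledBack: [] };
}
```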

The Memory Engine maintains short-term and long-term stores with TTL-based eviction, consolidation, and semantic retrieval. The Knowledge Graph Engine tracks entities — incidents, services, hosts, alerts, deployments, root causes, remediations — and the relationships between them. When a new incident arrives, the coordinator automatically queries the graph for similar past incidents and links the new incident to affected services.

The entire system runs with contextIsolation: true and nodeIntegration: false. All credentials are encrypted with a machine-bound AES key. All network calls originate from the main process. The renderer never touches Node.js.

How We Built It

Architecture

The application has three processes: a main process (Node.js), a preload bridge (contextBridge), and a renderer (React 19). This is standard Electron, but the architecture inside each process is far from standard.

The main process contains seven layers stacked vertically:

Layer 1 — Shell: Electron BrowserWindow with the usual lifecycle handlers. Nothing special except the show-after-ready pattern that prevents white flash.

Layer 2 — IPC Surface: Nine handler modules register around sixty typed channels. Every handler is registered in main.ts at startup. Every channel is defined in shared/types/electron-api.types.ts ahead of time — the full type contract between processes is explicit, not implicit. The preload file is a 300-line mapping from typed function calls to ipcRenderer.invoke calls. If a new IPC channel is needed, it must be added in three places: the type interface, the preload api object, and the main process handler. We accepted this friction because it makes the system auditable and type-safe.

Layer 3 — Agent Layer: The Coordinator Agent is the central state machine. It manages a 13-status transition table with over 20 valid transitions. Every transition is validated before execution — there is no way to skip from detected to resolved without going through investigating, analyzing, planning, pending_approval, approved, executing, and verifying. The Observer Agent polls Splunk every 30 seconds with five detection rules. The Executor Agent runs step-by-step execution with automatic rollback. Each agent is a singleton with listeners. They communicate through the event bus and through the incident store.

Layer 4 — Orchestration: This is the part we are most proud of architecturally. The Orchestrator manages a full multi-agent lifecycle: a Registry (agent registration with capabilities), a RoutingEngine (task-to-agent matching with scoring), a LoadBalancer (tracks task counts per agent), a MessageBus (typed pub/sub with priority queues and dead letter retention), a TaskScheduler (creates, starts, fails, completes tasks with timeout), a DependencyResolver (ensures task DAG ordering), a RetryHandler (exponential backoff with configurable max retries), a ConflictDetector (detects resource conflicts between tasks), a ConflictResolver (priority, queue, merge, or escalate), and a SessionManager (tracks collaboration sessions between agents). This layer is fully implemented, not aspirational. We wrote it because the chat-agent loop, while powerful, is fundamentally single-threaded. The orchestration layer is designed for the future where multiple agents run concurrently and need to coordinate without stepping on each other.
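
As one concrete piece of this layer, the delay calculation behind a retry handler with exponential backoff could look like the sketch below. The RetryOptions fields, the doubling factor, and the delay cap are assumptions made for illustration. The text only states that backoff is exponential and that max retries is configurable.

```typescript
// Hypothetical options; only "exponential backoff" and "configurable max retries" come from the text.
interface RetryOptions {
  maxRetries: number;  // retries allowed after the first attempt
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number;  // assumed upper bound on any single delay
}

// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped.
function backoffDelayMs(attempt: number, opts: RetryOptions): number {
  return Math.min(opts.baseDelayMs * Math.pow(2, attempt - 1), opts.maxDelayMs);
}

// Full delay schedule for a task that fails on every attempt.
function retrySchedule(opts: RetryOptions): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= opts.maxRetries; attempt++) {
    delays.push(backoffDelayMs(attempt, opts));
  }
  return delays;
}
```

With maxRetries 5, baseDelayMs 500, and maxDelayMs 4000, the schedule is 500, 1000, 2000, 4000, 4000 milliseconds.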

Layer 5 — MCP Tool Layer: Thirty-plus Splunk tools, each defined with a Zod schema that validates arguments at runtime. The tools are categorized into seven groups: core Splunk (search, info, indexes, metadata), alerts and monitoring (list alerts, get fired alerts, alert history), advanced search and query (search jobs, explain results, compare time ranges), license and config (license usage, server roles, apps), data inputs (inputs, input health, forwarder status), security (roles, audit logs), and SPL education (suggest, validate). Each tool has a human-readable description, a Zod input schema, and an execute function. The zodSchemaToJsonSchema function converts Zod schemas to OpenAI-compatible JSON schema for LLM tool definitions. This is a nontrivial recursive transformation that handles strings, numbers, booleans, enums, objects (with optional and default), records, and nested schemas.

Layer 6 — Intelligence Layer: The Memory Engine manages short-term (TTL-based, ephemeral) and long-term (persistent, cross-session) memory with importance scoring, ranking, consolidation, eviction, and semantic retrieval. The Knowledge Graph Engine uses an adjacency list data structure with BFS, DFS, and Dijkstra traversal, plus a visualization data provider for the interactive graph view. The Context Engine maintains session-level context baselines with reconciliation — when the Splunk environment changes (new alerts, health degradation), the context manager detects the delta and injects a context update into the agent's conversation.
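
To make the adjacency-list idea concrete, here is a minimal sketch of a graph with a breadth-first lookup of nearby entities. It is an illustration under assumptions, not the engine itself: the KnowledgeGraph class, the string node ids, the undirected edges, and the within method are all hypothetical.

```typescript
// Hypothetical node ids such as "incident:1" or "service:api-gateway".
class KnowledgeGraph {
  private adjacency = new Map<string, Set<string>>();

  // Edges are stored as undirected here for simplicity.
  addEdge(a: string, b: string): void {
    this.neighborsOf(a).add(b);
    this.neighborsOf(b).add(a);
  }

  private neighborsOf(id: string): Set<string> {
    let set = this.adjacency.get(id);
    if (!set) {
      set = new Set<string>();
      this.adjacency.set(id, set);
    }
    return set;
  }

  // Breadth-first search: every node within maxHops of start, excluding start.
  within(start: string, maxHops: number): string[] {
    const seen = new Set<string>([start]);
    const found: string[] = [];
    let frontier: string[] = [start];
    for (let hop = 0; hop < maxHops && frontier.length > 0; hop++) {
      const next: string[] = [];
      for (const id of frontier) {
        const neighbors = this.adjacency.get(id);
        if (!neighbors) continue;
        neighbors.forEach((n) => {
          if (!seen.has(n)) {
            seen.add(n);
            next.push(n);
            found.push(n);
          }
        });
      }
      frontier = next;
    }
    return found;
  }
}
```

A lookup like within("incident:1", 2) would surface the services an incident touches plus other incidents that touched the same services, which is one simple way to find similar past incidents.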

Layer 7 — Safety: Five validation gates run against every remediation plan before execution. Command validation blocks known dangerous commands. Blast radius analysis flags plans affecting more than five services. Timing validation warns about business-hours execution for non-low-risk plans. Compliance checks ensure audit trails for data modifications. Reversibility validation flags irreversible steps. Plans that fail critical gates are blocked entirely. Plans that fail warning gates proceed with flags visible to the human operator.
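
A rough sketch of how gates with two severities might compose is shown below. It covers only three of the five gates. The Plan and GateFailure shapes, the blocklist patterns, and the decision to treat command validation as critical and the other two as warnings are assumptions. The five-service limit comes from the description above.

```typescript
type GateSeverity = "critical" | "warning";

interface GateFailure {
  gate: string;
  severity: GateSeverity;
  message: string;
}

// Hypothetical plan shape for illustration.
interface Step { command: string; rollbackCommand?: string; }
interface Plan { steps: Step[]; affectedServices: string[]; }

// Illustrative blocklist only; a real list would be much longer.
const BLOCKED_PATTERNS: RegExp[] = [/\brm\s+-rf\s+\/(\s|$)/, /\bmkfs\b/];
const MAX_SERVICES = 5;

function validatePlan(plan: Plan): { blocked: boolean; failures: GateFailure[] } {
  const failures: GateFailure[] = [];
  for (const step of plan.steps) {
    if (BLOCKED_PATTERNS.some((p) => p.test(step.command))) {
      failures.push({ gate: "command", severity: "critical", message: `blocked command: ${step.command}` });
    }
    if (!step.rollbackCommand) {
      failures.push({ gate: "reversibility", severity: "warning", message: `no rollback for: ${step.command}` });
    }
  }
  if (plan.affectedServices.length > MAX_SERVICES) {
    failures.push({ gate: "blast_radius", severity: "warning", message: `${plan.affectedServices.length} services affected` });
  }
  // Any critical failure blocks the plan; warnings are surfaced but do not block.
  return { blocked: failures.some((f) => f.severity === "critical"), failures };
}
```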

Technology Decisions

Electron over web: The application needs direct access to a local Splunk instance. Browser-based apps cannot make requests to localhost:8089 in most enterprise networks. Electron gives us Node.js file system access for persisted incident storage, raw HTTPS for Splunk communication, and the ability to run as a standalone desktop tool that does not require a server infrastructure.

Vercel AI SDK over direct API calls: The SDK provides a unified interface across ten providers. We did not want to write ten integration paths. The ai package handles streaming, tool calling, and error normalization. When a user switches from OpenAI to Ollama local, the code path is identical — only the config changes.

Zustand over Redux: Twelve stores, each small and focused. Zustand's selector-based re-render optimization keeps the UI fast. No boilerplate, no context providers, no action creators. Each store mirrors a main-process domain via IPC.

File-based JSON over SQLite: For a desktop app targeting a single operator, SQLite adds build complexity (native bindings, cross-platform compilation). JSON files in userData/incidents/ and userData/timeline/ are human-readable, debuggable, and trivial to back up. The tradeoff is synchronous I/O, which is acceptable at desktop scale but not for thousands of concurrent writes.

Singleton engines over dependency injection: Every engine (Memory, Graph, Orchestrator, Safety, Executor) is a singleton with a getX() factory and a resetX() teardown. This is not architecturally pure, but for a single-user desktop app with eager initialization, it is pragmatic. The try/catch wrappers around each initialization ensure one engine failing does not prevent others from starting.
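
The getX()/resetX() pattern with isolated startup can be sketched as follows. The MemoryEngine body and the initEngines helper are placeholders invented for illustration. The point is the shape of the pattern, not the engine internals.

```typescript
// Placeholder engine; the real Memory Engine is far richer.
class MemoryEngine {
  entries = new Map<string, unknown>();
}

let memoryEngine: MemoryEngine | null = null;

// Factory: creates the instance on first use, then always returns the same one.
function getMemoryEngine(): MemoryEngine {
  if (memoryEngine === null) memoryEngine = new MemoryEngine();
  return memoryEngine;
}

// Teardown: the next getMemoryEngine() call builds a fresh instance.
function resetMemoryEngine(): void {
  memoryEngine = null;
}

// Hypothetical startup helper: each engine initializes inside its own try/catch,
// so one failure does not stop the others, and every outcome is recorded.
function initEngines(factories: Record<string, () => unknown>): Record<string, boolean> {
  const status: Record<string, boolean> = {};
  for (const name of Object.keys(factories)) {
    try {
      factories[name]();
      status[name] = true;
    } catch (err) {
      console.error(`[init] ${name} failed:`, err);
      status[name] = false;
    }
  }
  return status;
}
```

Returning a status map, rather than only swallowing the error, gives the rest of the app something to check before relying on an engine.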

The Data Model

The most complex model is the Incident. It tracks severity (critical through info), priority (P1 through P5), confidence (0.0 to 1.0), category (availability, performance, security, data, configuration), status (13 states), phase (10 phases), detection source (observer, security, user, external), evidence chain (typed evidence items with relevance and confidence scores), affected services, remediation plan reference, approval timestamps, execution timestamps, resolution summary, similar incident references, knowledge graph references, and a full audit log. Every mutation creates an audit entry. Every status transition creates a timeline event. The incident is both a data record and an event source.

Challenges We Ran Into

The Streaming Content Bug

The agent loop streams tool_call and tool_result events to the renderer. During testing with long investigations (15+ tool calls), the UI would accumulate stale content and the virtualized message list would desync. The root cause was that our agent:stream event listener was not properly cleaning up between sessions. Fixing it required adding a cleanup effect in AIChatShell that unsubscribes the stream listener when the session ends. One line of cleanup code, three hours of debugging.

The State Machine Complexity

The incident state machine started with five states. It grew to thirteen. Adding the thirteenth state (reopened) broke four existing transition paths because the transition table had become a densely connected graph rather than a clean pipeline. We had to re-architect the transition validation from if-else chains to a declarative transition table (VALID_TRANSITIONS) where every status explicitly declares its valid targets. The table is more verbose but far easier to maintain. Every transition is now auditable in one place.
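
A declarative table of this kind might look like the sketch below. It is an illustrative subset, not the real table: only status names that appear in this write-up are included, and the specific edges (a linear pipeline plus a reopen path) are assumptions. The canTransition and transition helpers are hypothetical names.

```typescript
// Illustrative subset: the real machine has thirteen statuses.
type Status =
  | "detected" | "investigating" | "analyzing" | "planning"
  | "pending_approval" | "approved" | "executing" | "verifying"
  | "resolved" | "reopened";

// Every status declares its valid targets in one place. Edges here are assumed.
const VALID_TRANSITIONS: Record<Status, readonly Status[]> = {
  detected: ["investigating"],
  investigating: ["analyzing"],
  analyzing: ["planning"],
  planning: ["pending_approval"],
  pending_approval: ["approved"],
  approved: ["executing"],
  executing: ["verifying"],
  verifying: ["resolved"],
  resolved: ["reopened"],
  reopened: ["investigating"],
};

function canTransition(from: Status, to: Status): boolean {
  return VALID_TRANSITIONS[from].includes(to);
}

// Validates before moving; an invalid jump throws instead of silently applying.
function transition(current: Status, next: Status): Status {
  if (!canTransition(current, next)) {
    throw new Error(`invalid transition: ${current} -> ${next}`);
  }
  return next;
}
```

Because the table is typed as Record<Status, ...>, adding a new status to the union without giving it a row becomes a compile error, which is the kind of breakage the if-else version hid.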

The Singleton Trap

Engines that depend on other engines create initialization order problems. The Coordinator Agent tries to register adapters with the Memory Engine, the Graph Engine, the Orchestrator, and the Planner Agent. If any of those engines fail to initialize (missing config, import error), the Coordinator would crash on startup. We wrapped every engine initialization in its own try/catch block, but this introduced a subtler problem: the Coordinator might start successfully while the Graph Engine silently failed, and nobody would notice until a graph-dependent feature broke hours later. The solution was to make every engine's getX() function log its initialization outcome, plus add a health check endpoint that the Observer Agent could poll.

Port Auto-Correction

Splunk's management API runs on port 8089. Users consistently entered port 8000 (Splunk Web) or 8088 (HTTP Event Collector). We wrote an auto-correction system that detects common wrong ports and attempts connections on port 8089 automatically. The same system auto-corrects protocol (http vs https) when TLS handshakes fail. The error messages for the most common failure modes (EPROTO, ECONNREFUSED, ECONNRESET, ETIMEDOUT, self-signed cert) were carefully crafted to tell the user exactly what to fix rather than dumping raw SSL errors. This was not glamorous work, but it made the difference between "this app does not work" and "oh, I need to use port 8089."
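
The port half of that correction can be sketched in a few lines. This is a simplified illustration: the correctSplunkUrl name, the Correction shape, and the wording of the note are hypothetical, and the sketch only rewrites the URL without attempting a connection or handling the protocol fallback.

```typescript
const MANAGEMENT_PORT = 8089;

// Ports users commonly enter by mistake, per the description above.
const COMMON_WRONG_PORTS: Record<number, string> = {
  8000: "Splunk Web",
  8088: "HTTP Event Collector",
};

interface Correction {
  url: string;
  note: string | null; // user-facing explanation when a correction was made
}

function correctSplunkUrl(input: string): Correction {
  const url = new URL(input);
  const port = Number(url.port); // 0 when no explicit port was given
  const wrongService = COMMON_WRONG_PORTS[port];
  if (wrongService === undefined) {
    return { url: url.toString(), note: null };
  }
  url.port = String(MANAGEMENT_PORT);
  return {
    url: url.toString(),
    note: `Port ${port} is ${wrongService}; trying management port ${MANAGEMENT_PORT} instead.`,
  };
}
```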

The Token Budget

Different LLMs have wildly different context windows, from 8K (older models) to 200K (Claude). Letting the agent fill the context window without limit caused two problems: API calls became expensive (paying for wasted tokens) and the LLM's reasoning quality degraded at high context utilization. We built a token budget system that estimates tokens per message, tracks utilization, and triggers context compaction when utilization exceeds 75%. The compaction engine uses an LLM to summarize old messages while preserving critical information, which adds latency but can extend session lifetime by 5-10x. The compaction thresholds are aggressive (75% warning, 90% critical) because we prioritized reasoning quality over session length.
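
The utilization check at the core of such a budget can be sketched as below. The four-characters-per-token estimate is a common rough heuristic assumed here, not the project's actual estimator, and the function names are hypothetical. The 75% and 90% thresholds come from the text.

```typescript
type BudgetLevel = "ok" | "warning" | "critical";

const WARNING_THRESHOLD = 0.75;
const CRITICAL_THRESHOLD = 0.9;

// Assumed heuristic: roughly four characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function checkBudget(
  messages: string[],
  contextWindow: number,
): { used: number; utilization: number; level: BudgetLevel } {
  const used = messages.reduce((sum, m) => sum + estimateTokens(m), 0);
  const utilization = used / contextWindow;
  let level: BudgetLevel = "ok";
  if (utilization > CRITICAL_THRESHOLD) level = "critical";
  else if (utilization > WARNING_THRESHOLD) level = "warning"; // compaction would start here
  return { used, utilization, level };
}
```

Because the context window is a parameter, the same check works for an 8K model and a 200K model without any change to the thresholds.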

The File Persistence Tradeoff

Storing incidents as individual JSON files in userData/incidents/{uuid}.json is simple and debuggable, but it introduced a subtle consistency problem. The in-memory cache (a Map<string, Incident>) could desync from disk if the user somehow modified files externally. For a desktop app this is theoretical, but the pattern prevented us from implementing a clean persistence abstraction. We accepted this tradeoff for the initial build, knowing that a migration to SQLite or LevelDB would be straightforward behind the existing incidentStore interface.

Accomplishments That We're Proud Of

The Full Incident State Machine

Thirteen statuses. Over twenty validated transitions. Each transition creates a timeline event, emits an agent event, and persists to disk. The Coordinator enforces the pipeline: you cannot execute a plan that has not been approved, and you cannot approve a plan that has not been generated. The state machine is the beating heart of the system, and it took multiple iterations to get right.

The Orchestration Layer

We built a complete multi-agent orchestration system — registry, routing, load balancing, message bus, task scheduling, dependency resolution, retry handling, conflict detection and resolution, session management — that is fully implemented and operational. Most projects at this stage would have a Coordinator and a few agents communicating via ad-hoc callbacks. We have a system designed for concurrent multi-agent operation with explicit conflict resolution protocols.

The Safety Engine

Five gates that validate every remediation plan. Command validation catches rm -rf / before it reaches execution. Blast radius analysis flags plans affecting more than five services. Timing validation catches risky executions during business hours. Compliance checks ensure data modifications leave audit trails. Reversibility validation ensures every destructive step has a rollback. The safety engine is not bolted on — it is invoked at plan generation, before approval display, and before execution. Three separate checkpoints.

The MCP Tool Layer

Thirty-plus tools, each with Zod validation, JSON schema conversion, typed argument parsing, and descriptive help text. The tool definitions are comprehensive enough that the LLM can discover and use them autonomously without additional guidance. The zodSchemaToJsonSchema function was a significant engineering effort — Zod's internal type representation is complex, and converting it to OpenAI-compatible JSON schema required handling ZodString, ZodNumber, ZodBoolean, ZodEnum, ZodObject (with optional/default/required), ZodRecord, ZodOptional, and ZodDefault shapes recursively.

Ten Provider Support Out of the Box

OpenAI, Anthropic, Google Gemini, Groq, OpenRouter, Ollama (local), Ollama (cloud), DeepSeek, MiniMax, Z.AI. Each with its own base URL, default models, capability matrix, auth method, and recommended settings. The provider definitions are declarative — adding a new provider requires adding an entry to shared/provider-definitions.ts with its API details, models, and capabilities. The rest of the system consumes the registry generically.

The Boot Safety System

If the renderer crashes, the app enters Safe Mode — a minimal React component that renders independently of the component tree. This was born from a debugging session where we could not tell if the crash was in React initialization or in the IPC layer. Safe Mode answers that question in one glance. It is a simple trick, but it has saved us hours of debugging.

What We Learned

The hardest part of an agent system is not the AI. It is the state machine. The LLM is the most predictable component — it takes input and produces output. The state machine, the orchestration, the persistence, the error recovery, the UI synchronization — these are where complexity multiplies. We spent three times as long on the Coordinator as we did on the chat agent.

Singleton engines are a debt that must be repaid. They made early development fast. They make testing, parallel operation, and graceful degradation harder. If we rebuilt this, we would use a dependency injection container with explicit lifecycle management.

File-based persistence is the right call for a desktop app, but the abstraction must be clean. The IncidentStore interface (list, get, create, update, addEvidence, subscribe) is implementation-agnostic. Switching from JSON files to SQLite would require changing only the store internals, not the consumers. We invested in the interface, not the implementation.

Streaming UIs need defensive design. The virtualized message list must handle rapid updates, mid-stream disconnections, reconnection with partial state, and cleanup between sessions. Every missing removeListener call is a memory leak that manifests as a subtle performance degradation over time.

The token budget is a first-class system constraint, not an optimization. Models have different limits. Conversations grow unboundedly. Compaction changes the conversation history irrevocably. We learned to treat token management as a core architectural concern, not a performance knob.

Write the error messages first. The Splunk connection error messages were rewritten three times. The final versions — "SSL/TLS protocol mismatch — the server is not speaking HTTPS... If you entered port 8000 (Splunk Web), change it to 8089" — are the result of watching real users fail and iterating on the guidance. Error messages are UX, not logging.

What Is Next for log-book AI

Short Term

The approval panel UI needs to ship. The state machine, IPC channels, and frontend store are in place — the actual approval workflow interface with approve/reject/modify/escalate buttons is the remaining gap. The streaming timeline visualization — events that update in real-time as the agent works — is the next UX priority. And the test suite must be built. Zero unit tests on a system with this complexity is not sustainable.

Mid Term

The Planner Agent — the component that takes investigation results and generates structured remediation plans with risk scoring, rollback strategy, and alternative approaches — is the highest-impact missing agent. The Investigator Agent — deep-dive root cause analysis with evidence collection, temporal correlation, and hypothesis testing — would close the loop between detection and planning. The Memory System needs learning — automatically extracting patterns from resolved incidents and surfacing them when similar symptoms appear. The Knowledge Graph should be queryable from the chat interface: "show me all incidents related to the API gateway last month."

Long Term

We want log-book to become the operations platform, not just the Splunk tool. Integration with PagerDuty (receive alerts, auto-acknowledge), Slack (notify, request approval), Jira (create tickets from incidents), and cloud providers (AWS, GCP, Azure for infrastructure context). A plugin system for custom detection rules and remediation actions. Support for multiple data sources — Elasticsearch, Datadog, Prometheus — via the MCP abstraction layer. And eventually, a web-based deployment for team operations, with multi-user collaboration and shared knowledge graphs.

The vision has not changed since that first hackathon. Every other AIOps tool is an AI chatbot that answers questions. log-book is an AI operations team that detects, investigates, plans, executes, and learns — with the human operator as the commander, not the search engine.

Splunk is the nervous system. The agents are the reflexes. The operator is the brain.

We built the reflexes.

Built With

electron
mcp
react
splunk
tailwind
typescript
vite