Inspiration
In the enterprise world, Site Reliability Engineering (SRE) and Incident Response are plagued by human bottlenecks and passive alerting systems. Traditional monitoring setups treat log aggregators like Splunk as graveyard registries where logs are stored passively until an error occurs. When a crash is detected, a developer or SRE receives a notification, manually logs into a telemetry portal, executes custom search queries to gather context, searches the code repository for the offending file, designs a bug fix, writes a pull request, and deploys it. This manual investigation loop typically takes anywhere from 30 minutes to several hours, directly impacting business SLA metrics.
We built Talos to rewrite this workflow. Our goal was to close the loop on incident response by transforming Splunk from a passive telemetry archive into an active, autonomous SRE partner. Inspired by self-healing software paradigms and recent advances in LLM tool execution, we created an agentic loop that intercepts crashes at the client runtime, correlates logs across Splunk indices via the Model Context Protocol (MCP), computes a statistical anomaly score to estimate blast radius, generates structured root-cause reports with proposed code diffs, and delivers actionable alerts to developer communication platforms (Slack/Discord), all in a matter of seconds.
What it does
Talos is an autonomous, self-healing site reliability engineering loop designed to handle telemetry collection, anomaly triage, AI-driven log research, code fix generation, and alerting. It is built around a strict 4-stage lifecycle:
- Capture & Intercept (Talos SDK): The `@mylife-as-miles/talos-sdk` is integrated into customer applications (Browser and Node.js). It sets up global exception listeners (`window.onerror`, `process.on('unhandledRejection')`) to catch unhandled errors. Concurrently, it records runtime breadcrumbs (such as navigation events, network requests, or user clicks) to reconstruct the sequence of steps leading to the failure.
- Ingest Gateway Proxy: To avoid exposing sensitive credentials on client devices, Talos routes all events through a secure server-side Ingest Gateway. The gateway validates incoming JSON payloads, records them in the local database for historical reporting, and relays them to the Splunk HTTP Event Collector (HEC).
- Agentic Splunk Search (MCP & REST): Upon receiving a crash event, the Headless AI Resolver is invoked. It acts as an autonomous agent, executing research tasks. It uses the Model Context Protocol (MCP) to interact with a local Splunk MCP Server to run real-time search queries. It checks logs matching the error signature, tracks historical repeat counts, and analyzes adjacent services. If the MCP server is unreachable, it falls back to the native Splunk REST API.
- Anomaly Scoring & Cognitive RCA: The agent feeds the full log context, stack traces, breadcrumbs, and service configurations to an anomaly scoring algorithm and Google's Gemini models. It produces a detailed root-cause triage report containing:
  - An explanation of why the crash occurred.
  - A calculated anomaly severity score.
  - A proposed, syntax-highlighted Git diff that resolves the bug.
  - Automated notification payloads sent directly to target Slack or Discord webhooks.
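The breadcrumb recording in the capture stage can be sketched as a bounded buffer. This is only an illustration of the idea, not the SDK's actual code: the `BreadcrumbBuffer` class, its default capacity of 20, and its method names are all assumptions introduced here.

```typescript
// Hypothetical sketch: the real SDK internals are not shown in this write-up.
interface Breadcrumb {
  category: string; // e.g. "ui", "network", "navigation"
  message: string;
  timestamp: string; // ISO 8601, matching the payload format
}

class BreadcrumbBuffer {
  private items: Breadcrumb[] = [];
  private readonly capacity: number;

  constructor(capacity: number = 20) {
    if (capacity < 1) {
      throw new RangeError("capacity must be at least 1");
    }
    this.capacity = capacity;
  }

  // Record a breadcrumb, evicting the oldest entry once the buffer is full.
  add(category: string, message: string, now: Date = new Date()): void {
    this.items.push({ category, message, timestamp: now.toISOString() });
    if (this.items.length > this.capacity) {
      this.items.shift();
    }
  }

  // Return copies so callers cannot mutate the internal state.
  snapshot(): Breadcrumb[] {
    return this.items.map((b) => ({ ...b }));
  }
}
```

Capping the buffer keeps memory use constant no matter how long a session runs; an exception handler would call `snapshot()` and attach the result to the outgoing event.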
How we built it
Talos is developed as a modular TypeScript monorepo using pnpm workspaces:
- Next.js 15 (App Router): Powers the Neubrutalist dashboard UI, as well as the serverless API handlers for crash simulation (`/api/simulate-crash`), payload ingestion (`/api/ingest`), agent execution (`/api/agent`), and notification routing (`/api/notify`).
- Talos SDK (`packages/sdk`): A universal TypeScript library exporting ESM and CommonJS modules for runtime exception wrapping.
- Splunk HEC Integration: Built a custom fetch wrapper supporting connection pooling and token authentication to Splunk's HTTP Event Collector.
- Splunk MCP Submodule (Python): Integrates the upstream Splunk MCP Server, enabling LLMs to run search tools natively.
- Google Gemini (Gemini Flash): Configured with structured JSON schema outputs to guarantee that the generated triage report matches the application's strict database interfaces.
- Neubrutalist UI System: A custom CSS design token system built on high-contrast black borders, flat primary accents, and responsive telemetry charts.
- Browser-based Config Overrides (BYOK): To bypass configuring `.env.local` variables, we built a header-forwarding system. Browser forms save HEC URLs, HEC tokens, MCP URLs, and webhook URLs to `localStorage`. Dashboard actions automatically attach these to request headers (`x-talos-hec-url`, etc.), overriding environment configuration dynamically.
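As a rough illustration of the HEC integration, the sketch below shapes a single request for Splunk's `/services/collector/event` endpoint using the `Authorization: Splunk <token>` scheme. It covers request construction only, not connection pooling. The `HecConfig` type, the `buildHecRequest` helper, and the `talos:error` sourcetype are hypothetical names chosen for this example.

```typescript
// Hypothetical helper; the project's actual fetch wrapper is not shown here.
interface HecConfig {
  baseUrl: string; // e.g. "https://splunk.example.com:8088" (placeholder host)
  token: string;
  index?: string;
  sourcetype?: string;
}

interface HecRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

function buildHecRequest<T extends { timestamp: string; service: string }>(
  event: T,
  config: HecConfig,
): HecRequest {
  // HEC expects an envelope whose "time" field is in epoch seconds.
  const envelope: Record<string, unknown> = {
    time: Date.parse(event.timestamp) / 1000,
    source: event.service,
    sourcetype: config.sourcetype ?? "talos:error", // assumed sourcetype name
    event,
  };
  if (config.index) {
    envelope["index"] = config.index;
  }
  return {
    // Strip trailing slashes so the path is not doubled up.
    url: `${config.baseUrl.replace(/\/+$/, "")}/services/collector/event`,
    headers: {
      Authorization: `Splunk ${config.token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(envelope),
  };
}
```

The result can be passed to `fetch(req.url, { method: "POST", headers: req.headers, body: req.body })`. Keeping request construction separate from the network call makes the payload shape easy to unit test.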
1. The SDK Ingest Payload Model (TalosErrorEvent)
The SDK packages crash telemetry into a structured layout:
{
  "eventId": "e4b2d13a-7f2c-4903-a178-59a6c9d74f32",
  "projectKey": "checkout-prod-009",
  "environment": "production",
  "release": "v1.4.2",
  "service": "checkout-service",
  "route": "/api/checkout",
  "timestamp": "2026-06-15T11:45:00.000Z",
  "error": {
    "name": "TypeError",
    "message": "Cannot read properties of undefined (reading 'email')",
    "stack": "TypeError: Cannot read properties of undefined (reading 'email')\n at processPayment (/app/checkout.ts:44:28)\n at POST (/app/api/checkout/route.ts:12:10)"
  },
  "breadcrumbs": [
    { "category": "ui", "message": "User clicked submit checkout order button", "timestamp": "2026-06-15T11:44:50.000Z" }
  ],
  "context": {
    "userId": "user_99a8b11c",
    "tags": { "gateway": "Stripe" }
  }
}
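One way the gateway could validate this payload is with a TypeScript interface and a runtime type guard, sketched below. The field names come from the sample above; which fields are optional, and the `isTalosErrorEvent` helper itself, are assumptions made for illustration.

```typescript
// Field names follow the sample payload; optionality is an assumption.
interface TalosErrorEvent {
  eventId: string;
  projectKey: string;
  environment: string;
  release?: string; // may be empty, which the scoring engine penalizes
  service: string;
  route: string;
  timestamp: string; // ISO 8601
  error: { name: string; message: string; stack?: string };
  breadcrumbs?: Array<{ category: string; message: string; timestamp: string }>;
  context?: { userId?: string; tags?: Record<string, string> };
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Hypothetical guard: checks structure only, not business rules.
function isTalosErrorEvent(value: unknown): value is TalosErrorEvent {
  if (!isRecord(value)) return false;

  const requiredStrings = ["eventId", "projectKey", "environment", "service", "route", "timestamp"];
  for (const key of requiredStrings) {
    if (typeof value[key] !== "string") return false;
  }
  // Reject timestamps that cannot be parsed as dates.
  if (Number.isNaN(Date.parse(String(value["timestamp"])))) return false;

  const release = value["release"];
  if (release !== undefined && typeof release !== "string") return false;

  const err = value["error"];
  if (!isRecord(err)) return false;
  if (typeof err["name"] !== "string" || typeof err["message"] !== "string") return false;
  if (err["stack"] !== undefined && typeof err["stack"] !== "string") return false;

  const crumbs = value["breadcrumbs"];
  if (crumbs !== undefined) {
    if (!Array.isArray(crumbs)) return false;
    for (const c of crumbs) {
      if (!isRecord(c)) return false;
      if (typeof c["category"] !== "string" || typeof c["message"] !== "string") return false;
      if (typeof c["timestamp"] !== "string") return false;
    }
  }

  const ctx = value["context"];
  return ctx === undefined || isRecord(ctx);
}
```

An ingest handler could call `isTalosErrorEvent(await request.json())` and respond with HTTP 400 when it returns `false`, so malformed events never reach Splunk.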
2. The Statistical Anomaly Scoring Engine
To prevent alert fatigue, Talos calculates an anomaly score \(\text{Score} \in [0, 100]\) for every incident to determine severity. The score is computed using the following model:
\[\text{Score} = \min\left(100,\ \text{Base} + S_V + S_M + S_C + S_P + S_R\right)\]
Where:
- \(\text{Base}\): Initial starting constant score of \(20\).
- Volume Variance (\(S_V\)): Evaluates real-time error counts (\(V\)) against historical averages (\(\mu\)) and standard deviation (\(\sigma\)): \[S_V = \begin{cases} 45 & \text{if } V > \mu + 3\sigma \\ 32 & \text{if } V > \mu + 2\sigma \\ 18 & \text{if } V > \mu + \sigma \\ 0 & \text{otherwise} \end{cases}\]
- Match Repeat Frequency (\(S_M\)): Measures Splunk match repetition (\(M\)): \[S_M = \begin{cases} 18 & \text{if } M \ge 5 \\ 0 & \text{otherwise} \end{cases}\]
- Critical Route (\(S_C\)): Matches path namespace to high-priority features (e.g., checkout, payment, auth): \[S_C = \begin{cases} 15 & \text{if path matches} \\ 0 & \text{otherwise} \end{cases}\]
- Environment Bias (\(S_P\)): Blast-radius penalty for production deployments: \[S_P = \begin{cases} 10 & \text{if environment is "production"} \\ 0 & \text{otherwise} \end{cases}\]
- Release Regression Bias (\(S_R\)): Missing release version penalty: \[S_R = \begin{cases} 5 & \text{if release metadata is empty} \\ 0 & \text{otherwise} \end{cases}\]
Severity is classified into four tiers:
- Critical: \(\text{Score} \ge 85\)
- High: \(70 \le \text{Score} < 85\)
- Warning: \(45 \le \text{Score} < 70\)
- Normal: \(\text{Score} < 45\)
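The scoring model and severity tiers above translate fairly directly into code. The following is a minimal sketch rather than the project's implementation: the `ScoreInput` shape, the function names, and the substring-based route match against `checkout`, `payment`, and `auth` are assumptions.

```typescript
type Severity = "Critical" | "High" | "Warning" | "Normal";

interface ScoreInput {
  volume: number; // V: current error count
  mean: number; // mu: historical average
  stdDev: number; // sigma: historical standard deviation
  matchCount: number; // M: repeated Splunk matches
  route: string;
  environment: string;
  release?: string;
}

const BASE_SCORE = 20;
// Assumed keyword list, taken from the examples in the text.
const CRITICAL_ROUTE_KEYWORDS = ["checkout", "payment", "auth"];

function volumeVariance(v: number, mu: number, sigma: number): number {
  if (v > mu + 3 * sigma) return 45;
  if (v > mu + 2 * sigma) return 32;
  if (v > mu + sigma) return 18;
  return 0;
}

function classify(score: number): Severity {
  if (score >= 85) return "Critical";
  if (score >= 70) return "High";
  if (score >= 45) return "Warning";
  return "Normal";
}

function computeAnomalyScore(input: ScoreInput): { score: number; severity: Severity } {
  const sV = volumeVariance(input.volume, input.mean, input.stdDev);
  const sM = input.matchCount >= 5 ? 18 : 0;
  const route = input.route.toLowerCase();
  const sC = CRITICAL_ROUTE_KEYWORDS.some((k) => route.includes(k)) ? 15 : 0;
  const sP = input.environment === "production" ? 10 : 0;
  const sR = input.release === undefined || input.release.trim() === "" ? 5 : 0;

  const score = Math.min(100, BASE_SCORE + sV + sM + sC + sP + sR);
  return { score, severity: classify(score) };
}
```

As a worked example, a production crash on `/api/checkout` with \(V = 120\), \(\mu = 20\), \(\sigma = 10\), \(M = 7\), and a release tag present gives \(20 + 45 + 18 + 15 + 10 + 0 = 108\), which is capped at 100 and classified as Critical.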
Challenges we ran into
- Header-based Dynamic Override Routing: Since Next.js API route handlers run on the server, letting users configure settings in the browser settings form (without editing server `.env.local` files) was difficult. We resolved this with a request-header proxy pattern. When the dashboard triggers an SRE resolver run, it reads the keys from browser `localStorage` and attaches them as custom `x-talos-*` headers on the fetch request. The API route handlers read these headers, override the default server variables, and configure the Splunk/webhook clients dynamically on a per-request basis.
- TypeScript Monorepo Compilation: Managing dependency compilation between local workspaces (`packages/sdk` and `apps/web`) required strict build scheduling. We configured a pre-build TypeScript command, `npx tsc -p ../../packages/sdk/tsconfig.json`, to compile the SDK output target into the local workspace cache before Next.js triggers static page collection, ensuring the compiler always has access to matching exports.
- Preventing static page compilation errors: In Next.js, pages that read file-based data stores at build time fail if those files do not exist yet. Next.js attempted to compile the `/reports` and `/incidents` pages statically, throwing `ENOENT` errors because the mock files didn't exist during compilation. We fixed this by declaring `export const dynamic = "force-dynamic"` at the page boundary.
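The header override pattern from the first challenge can be sketched as a small resolver that prefers a request header and falls back to an environment value. Only `x-talos-hec-url` is named in this write-up; the other header names, the environment variable names, and the `resolveConfig` helper are assumptions for illustration.

```typescript
interface TalosConfig {
  hecUrl?: string;
  hecToken?: string;
  mcpUrl?: string;
  webhookUrl?: string;
}

// [config key, request header, environment variable]
// Only "x-talos-hec-url" appears in the text; the rest are assumed names.
const OVERRIDES: Array<[keyof TalosConfig, string, string]> = [
  ["hecUrl", "x-talos-hec-url", "SPLUNK_HEC_URL"],
  ["hecToken", "x-talos-hec-token", "SPLUNK_HEC_TOKEN"],
  ["mcpUrl", "x-talos-mcp-url", "SPLUNK_MCP_URL"],
  ["webhookUrl", "x-talos-webhook-url", "NOTIFY_WEBHOOK_URL"],
];

function resolveConfig(
  getHeader: (name: string) => string | null | undefined,
  env: Record<string, string | undefined>,
): TalosConfig {
  const config: TalosConfig = {};
  for (const [key, header, envVar] of OVERRIDES) {
    // A non-blank header wins; otherwise fall back to the server environment.
    const fromHeader = getHeader(header)?.trim();
    const value = fromHeader ? fromHeader : env[envVar];
    if (value) {
      config[key] = value;
    }
  }
  return config;
}
```

Inside a route handler this could be called as `resolveConfig((name) => request.headers.get(name), process.env)`, producing a fresh configuration for each request without mutating server state.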
Accomplishments that we're proud of
- First-Class BYOK local settings: Users can bring their own Gemini key and Splunk HEC details directly in the browser settings UI. No terminal setup, file editing, or environment variables are required.
- Zero-Dependency Offline Sandbox: Built a highly responsive offline mode. If no Splunk credentials exist, the resolver falls back to deterministic mock datasets and mock schemas, enabling developers to test the dashboard, inspect telemetry, and trigger notifications with zero external setup.
- Neubrutalist Design Pattern: Transformed a dry DevOps SRE console into a polished neubrutalist interface with thick contrast borders, interactive tables, code-diff displays, and vivid status indicators.
What we learned
- Model Context Protocol (MCP): Gained deep experience configuring MCP servers, learning how to expose databases and search indexes securely to LLM agents as standard tools.
- Client Telemetry Wrapping: Learned how to construct robust breadcrumb tracking loops that collect browser network logs and user actions without introducing memory leaks.
What's next for Talos: Autonomous AI Incident Resolver
- Automated PR Submissions: Integrate with GitHub APIs to automatically check out a Git branch, commit the generated code diff, and submit a pull request for SRE approval.
- Self-Healing Test Suites: Spin up ephemeral Docker containers to apply the code diff, compile the project, and run unit tests to verify the fix before alerting developers.
- Multi-Platform Integrations: Expand telemetry ingestion support beyond Splunk to Datadog, New Relic, and Elastic, giving enterprises unified self-healing pipelines.
Built With
- css
- discord-api
- gemini-api
- javascript
- model-context-protocol
- next.js
- node.js
- python
- react
- slack-api
- splunk
- typescript