Inspiration

I've been on enough late-night calls where a service goes down and the next hour is just... chaos. Someone's grepping logs, someone else is checking dashboards, and everyone's asking the same question in Slack — what happened, and can the requests be safely retried? Nobody knows.

When I found the Splunk MCP Server during this hackathon, my first thought wasn't "let me build something cool." It was: this can actually close that gap. After some brainstorming I landed on one idea. What if, when something breaks, a system runs the root cause analysis (RCA) automatically? It would work out where and why the request failed, then tell the on-call engineer exactly what happened and whether it's safe to retry, before they've even had time to open a terminal.

That's THREAD.

What it does

THREAD watches your distributed services and jumps in the moment a transaction fails. It uses the Splunk MCP Server to trace the failure, check service health, and analyze error trends, then uses Splunk's anomalydetect command to decide if it's safe to retry. The full investigation lands in Slack as an alert with a one-click Replay button, usually before the on-call engineer has opened a terminal. Click Replay and THREAD pulls the original request body and re-executes it. Failure to recovery in under 30 seconds, no runbook required.

How I built it

I built THREAD around a single idea: the Splunk MCP Server shouldn't just answer questions, it should run investigations. Every service publishes a 5-field contract to RabbitMQ and logs to Splunk HEC on every request.

When a failure hits, the Thread Platform consumes it, waits for Splunk to index, then opens a persistent MCP session over Streamable HTTP + JSON-RPC 2.0 and fires 6 splunk_run_query calls in sequence: transaction chain, failure details, service health, system errors, error rate timeseries, and a sixth call with Splunk's anomalydetect SPL command to score the anomaly. The results feed an InvestigationResult that calculates a dynamic replay limit, posted to Slack as a Block Kit alert with one-click recovery.

For /thread-search, THREAD calls saia_generate_spl via MCP first to convert plain English to SPL, falling back to Groq if the AI Assistant isn't available. It then executes the SPL through splunk_run_query and returns a results table in Slack.

The original request body is stored in SQLite at the start of every transaction so THREAD can re-execute it exactly: no reconstruction, no guessing.
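To make the six-call sequence concrete, here is a minimal sketch of how the JSON-RPC 2.0 payloads for it could be built. It is not THREAD's actual code. The `tools/call` method and the `name`/`arguments` shape come from the MCP specification, but the `query` argument name, the SPL strings, and field names such as `transaction_id` and `service` are assumptions made for illustration.

```python
import json

# Step names follow the order described above. The SPL templates are
# illustrative placeholders, not THREAD's real searches.
INVESTIGATION_STEPS = [
    ("transaction_chain", 'search index=thread_logs transaction_id="{txn}" | sort _time'),
    ("failure_details", 'search index=thread_logs transaction_id="{txn}" status=failed'),
    ("service_health", 'search index=thread_logs service="{svc}" | stats count by status'),
    ("system_errors", 'search index=thread_logs level=ERROR | stats count by service'),
    ("error_rate_timeseries",
     'search index=thread_logs service="{svc}" level=ERROR | timechart span=1m count'),
    ("anomaly_score",
     'search index=thread_logs service="{svc}" level=ERROR '
     '| timechart span=1m count | anomalydetect count'),
]


def build_tool_calls(txn, svc):
    """Return one JSON-RPC 2.0 tools/call request per investigation step."""
    calls = []
    for request_id, (step, template) in enumerate(INVESTIGATION_STEPS, start=1):
        calls.append({
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {
                "name": "splunk_run_query",
                # "query" is an assumed argument name for this tool.
                "arguments": {"query": template.format(txn=txn, svc=svc)},
            },
        })
    return calls


if __name__ == "__main__":
    # Print the first request to show the wire format.
    print(json.dumps(build_tool_calls("txn-42", "payments")[0], indent=2))
```

Each payload would be POSTed in order over the same MCP session, so the sixth call reuses the transport and client of the first five and only the SPL differs.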

Challenges I ran into

The biggest one was Splunk Cloud itself fighting me at every turn.

My original idea was to auto-generate a Splunk dashboard after every investigation — AI picks the right visualizations, creates the panels, links it in the Slack message. Looked great on paper. In practice, every POST request I sent through the Splunk Cloud proxy came back as a 303 redirect to the login page. The management port is blocked externally. I spent more time on that than I'd like to admit before accepting it wasn't going to work and switching to Groq-generated search links instead.

Then there was saia_generate_spl — Splunk's AI SPL generator. It returned an error JSON saying the service wasn't initialized. Fine. Except the error message happened to contain the string index=thread_logs, so my code thought it was valid SPL and tried to run it. That caused some genuinely confusing failures before I caught it.
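One way to guard against that kind of false positive is to check the shape of the response instead of searching it for a substring. The sketch below is an assumption about how such a check could look, not the fix THREAD shipped. The `error`, `isError`, and `spl` keys are hypothetical, since the exact response format of saia_generate_spl isn't given here.

```python
import json

# A generated search should start like SPL, not merely contain "index=".
SPL_PREFIXES = ("search ", "|", "index=")


def extract_spl(raw):
    """Return the generated SPL, or None if the response looks like an error."""
    text = raw.strip()
    try:
        payload = json.loads(text)
    except ValueError:
        payload = None  # not JSON, treat the text itself as the candidate SPL

    if isinstance(payload, dict):
        # Hypothetical error markers: reject before looking for SPL.
        if payload.get("error") or payload.get("isError"):
            return None
        # Hypothetical key holding the generated query.
        text = str(payload.get("spl", "")).strip()

    if not text.lower().startswith(SPL_PREFIXES):
        return None
    return text
```

A `None` result is the signal to fall back to Groq instead of running the text as a search.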

The anomalydetect command also tripped me up. The model needs at least 3 time buckets of history to work. Fresh failures — which are exactly the ones you most want to analyze — often don't have that. I had to fall back to the timeseries data I'd already pulled in query 5 and score it directly.
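A rough sketch of that fallback is shown below. The 3-bucket minimum comes from the paragraph above. Everything else (the spike ratio, the minimum error count, and the replay limits of 1 and 3) is a made-up placeholder, not THREAD's real scoring.

```python
from statistics import mean

MIN_BUCKETS = 3  # anomalydetect needs at least this many time buckets


def needs_fallback(counts):
    """True when there is too little history for anomalydetect."""
    return len(counts) < MIN_BUCKETS


def is_spiking(counts, ratio=2.0, floor=5):
    """Score per-bucket error counts (oldest first) without anomalydetect.

    The latest bucket is a spike when it reaches `floor` errors and exceeds
    `ratio` times the average of the earlier buckets. Both thresholds are
    assumed values.
    """
    if not counts:
        return False
    latest = counts[-1]
    earlier = counts[:-1]
    baseline = mean(earlier) if earlier else 0.0
    return latest >= floor and latest > ratio * baseline


def replay_limit(spiking, base_limit=3):
    """Allow a single replay while errors are spiking, else the base limit."""
    return 1 if spiking else base_limit
```

For example, `replay_limit(is_spiking([2, 12]))` returns 1, because 12 errors in the latest bucket is well above twice the earlier average of 2.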

And Slack's 3-second ack window. If your handler doesn't respond in time, Slack retries the button click — which means double replays. The fix was simple once I understood it: ack immediately, do all the real work on a queue. But the first time I saw a transaction replayed three times I definitely panicked.
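The ack-then-queue pattern can be sketched with nothing but the standard library. This is an illustration under assumptions, not THREAD's handler: the function names are hypothetical, and the duplicate check on `action_id` is an extra guard added here that the original fix may not include.

```python
import queue
import threading

jobs = queue.Queue()
seen_actions = set()
seen_lock = threading.Lock()


def handle_replay_click(action_id, transaction_id):
    """Runs inside the Slack request handler: returns fast, does no real work."""
    with seen_lock:
        if action_id in seen_actions:
            # A retried delivery of the same click: ack it, but don't replay.
            return {"ack": True, "queued": False}
        seen_actions.add(action_id)
    jobs.put(transaction_id)
    return {"ack": True, "queued": True}


def worker(replay):
    """Background thread: performs the slow replay outside the ack window."""
    while True:
        txn = jobs.get()
        try:
            if txn is None:  # sentinel used to stop the worker
                return
            replay(txn)
        finally:
            jobs.task_done()
```

The handler only touches an in-memory set and a queue, so it returns well inside the 3-second window regardless of how long the replay itself takes.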

Accomplishments I'm proud of

I made MCP the investigation engine, not a wrapper. Every failure triggers 6 real splunk_run_query calls over Streamable HTTP + JSON-RPC 2.0 — named, timed, and visible in the terminal. This isn't a chatbot on top of Splunk. The MCP Server is doing the actual work.

The full loop works end-to-end. Failure to HTTP 200 recovery in under 30 seconds, completely automated. No runbook, no manual log hunting, no gut-feel retry decision. Verified it live on Splunk Cloud.

Wiring anomalydetect in as the 6th MCP call was a clean solution — same transport, same client, but the SPL does the heavy lifting. The IsOutlier field drives a dynamic replay limit that actually means something.

/thread-search works in plain English. Type a question in Slack, get a Splunk results table back. saia_generate_spl via MCP first, Groq fallback when unavailable — ops never touch SPL.

Slack as the entire ops interface. No custom dashboard, no new tool to learn, no context switching. The alert, the investigation summary, the replay button — all in the channel the team is already watching. One click, done.

And I shipped it. Full E2E pipeline, 5 slash commands, anomaly detection, one-click replay, natural language search — built and verified before the deadline.

What I learned

MCP makes observability actually programmable in a way that feels different. Not "here's a log dump, figure it out" — more like having a junior engineer who knows exactly how to query Splunk and comes back with structured answers in under 300ms.

I also learned that the ops interface matters as much as the tech. I almost built a custom dashboard. I'm glad I didn't. Nobody wants to learn a new tool at 2 AM. Slack is already open. The best interface is the one people don't have to think about.

And honestly — the 10-second indexing delay was the most important line of code I wrote. It's not clever. It's not elegant. But without it, every MCP query fires before Splunk has the event and you get nothing. Sometimes the boring fix is the right one.

What's next for THREAD

I want to add blast radius mapping — given one failing service, figure out which other services are affected by the same root cause, automatically. The data is already in Splunk. It's just a graph traversal over the transaction chain.
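As a sketch of what that traversal might look like, the code below assumes each transaction chain is available as an ordered list of service names and treats "affected" as every service that directly or indirectly calls the failing one. Both the data shape and that definition are assumptions, and the function names are hypothetical.

```python
from collections import defaultdict, deque


def build_call_graph(chains):
    """Map each service to the set of services that call it.

    `chains` is a list of service-name lists, each in call order for one
    transaction.
    """
    callers = defaultdict(set)
    for chain in chains:
        for upstream, downstream in zip(chain, chain[1:]):
            callers[downstream].add(upstream)
    return callers


def blast_radius(callers, failing_service):
    """Breadth-first walk from the failing service up through its callers."""
    affected = set()
    frontier = deque([failing_service])
    while frontier:
        svc = frontier.popleft()
        for caller in callers.get(svc, ()):
            if caller != failing_service and caller not in affected:
                affected.add(caller)
                frontier.append(caller)
    return affected
```

With chains `gateway -> orders -> payments` and `admin -> orders -> payments`, a failure in `payments` would mark `orders`, `gateway`, and `admin` as affected.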

And I want to get the dashboard generation working properly once the API access is available. The investigation already produces everything you'd need — failure class, error timeseries, affected services. Turning that into a live Splunk dashboard automatically feels like the natural next step.
