Inspiration
Data pipelines rarely fail loudly. They fail silently.
A schema upstream shifts a field name. An ingestion job quietly halts and volume drops to a trickle. A scheduler stalls and the freshest record is six hours old. None of these throw an exception. No alert fires. The pipeline keeps "running," producing data that is incomplete, stale, or subtly wrong. By the time someone notices, an executive dashboard has already shown the wrong number, or a downstream model has already trained on corrupted input.
When that happens, the on-call data engineer becomes a detective. They open Splunk, pivot across a dozen sourcetypes, manually correlate timestamps, and try to reconstruct what changed and where. Mean-time-to-diagnosis stretches from minutes into hours, not because the answer is hard, but because the correlation work is tedious and human-bound.
We built Pipeline Doctor to ask a simple question: what if an agent could work through that investigation the way a senior data engineer does, tracing lineage first, backing every claim with evidence, and committing to a verdict?
What it does
Pipeline Doctor is an AI diagnostic agent for silent data-pipeline failures. Given a symptom in a Splunk-monitored pipeline, it autonomously investigates and returns a root-cause diagnosis that a human can act on immediately.
For each incident, the agent produces a structured diagnosis containing:
- The exact root cause: not a vague "something is wrong," but the specific failure class and where it originated.
- Supporting evidence: the actual events and metrics from Splunk that justify the conclusion.
- A remediation recommendation: what to do to fix it.
- Explicit elimination of alternatives: why it is not the other plausible failure modes, so the engineer can trust the verdict instead of second-guessing it.
- Actionable Splunk alerts: 2 to 4 ready-to-use SPL saved-search definitions, tailored to the diagnosed root cause, that can be pasted directly into Splunk to prevent recurrence.
It currently diagnoses three of the most common, and most dangerous, classes of silent failure:
- schema_change: an upstream field is renamed, dropped, or retyped, breaking parsing downstream without an error.
- volume_drop: ingestion partially or fully stalls, so record counts fall far below the expected baseline.
- freshness_delay: the pipeline is alive but lagging; the most recent record is older than the freshness SLA, where lag is simply $\Delta t = t_{\text{now}} - t_{\text{last\_event}}$.
These three look deceptively similar from the outside (a dashboard "looks wrong"), which is exactly why diagnosing them by hand is so error-prone, and why an agent that can tell them apart is valuable.
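To make the lag definition and the baseline comparison concrete, here is a minimal sketch of the two numeric checks. The function names, the one-hour SLA, and the 0.5 volume ratio are illustrative assumptions, not values taken from Pipeline Doctor.

```python
from datetime import datetime, timedelta, timezone


def freshness_lag(now: datetime, last_event: datetime) -> timedelta:
    """Lag as defined above: delta_t = t_now - t_last_event."""
    return now - last_event


def violates_freshness_sla(now: datetime, last_event: datetime, sla: timedelta) -> bool:
    """True when the newest record is older than the freshness SLA."""
    return freshness_lag(now, last_event) > sla


def is_volume_drop(observed: int, baseline: int, min_ratio: float = 0.5) -> bool:
    """True when the record count falls below min_ratio of the baseline.

    min_ratio is an assumed threshold chosen for illustration only.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return observed / baseline < min_ratio


# Example: the freshest record is six hours old and volume is at 12% of baseline.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
last_event = now - timedelta(hours=6)
lag = freshness_lag(now, last_event)
stale = violates_freshness_sla(now, last_event, sla=timedelta(hours=1))
dropped = is_volume_drop(observed=120, baseline=1000)
```

Checks like these only flag a symptom. Both can fire during the same incident, which is why the agent still has to work out which failure class is the actual origin.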
How we built it
Pipeline Doctor is built in two clean layers:
1. The data layer (generate_data.py) simulates a realistic, observable pipeline. It injects synthetic events into Splunk via HEC (HTTP Event Collector) across five sourcetypes (job logs, schema registry, data quality checks, lineage edges, and alerts) and can deterministically trigger any of the three fault scenarios on demand. Each scenario is a 60-minute "incident recording" with a normal baseline, a root-cause event, downstream propagation, and recovery. This gives us a controlled, reproducible "patient" to diagnose.
2. The agent layer (agent.py) is the diagnostician. Powered by the Claude API (claude-sonnet-4-6), it connects to Splunk through the Splunk MCP Server (using the Python MCP SDK with Streamable HTTP transport) as its default path, which gives it three tools: run SPL queries, list indexes, and get metadata. The agent follows a lineage-first diagnostic strategy: instead of pattern-matching on surface symptoms, it traces the data's journey through the pipeline, locates where the signal degrades, and reasons from that point to a root cause, actively ruling out the failure modes that don't fit the evidence. A --use-rest flag provides a direct REST API fallback if MCP is unavailable.
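The lineage-first idea can be illustrated outside the LLM with a small sketch: walk upstream from the asset where the symptom was noticed and pick the node whose first anomaly is earliest. In Pipeline Doctor this reasoning is carried out by the agent through SPL queries; the node names, the edge list, and the timestamps below are hypothetical.

```python
from collections import deque


def upstream_closure(symptom_node, edges):
    """Return symptom_node plus every node upstream of it.

    edges is a list of (upstream, downstream) pairs.
    """
    parents = {}
    for up, down in edges:
        parents.setdefault(down, []).append(up)
    seen = {symptom_node}
    queue = deque([symptom_node])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def earliest_anomaly(symptom_node, edges, first_anomaly_at):
    """Pick the upstream node whose first anomaly has the earliest timestamp.

    first_anomaly_at maps node -> epoch seconds of its first anomaly;
    nodes with no anomaly are simply absent from the mapping.
    """
    candidates = [
        (first_anomaly_at[node], node)
        for node in upstream_closure(symptom_node, edges)
        if node in first_anomaly_at
    ]
    return min(candidates)[1] if candidates else None


# Hypothetical lineage: two sources feed one dashboard.
edges = [
    ("inventory_db", "ingest_job"),
    ("ingest_job", "inventory_table"),
    ("inventory_table", "revenue_dashboard"),
    ("orders_db", "revenue_dashboard"),
]
first_anomaly_at = {
    "ingest_job": 1_700_000_120,
    "inventory_table": 1_700_000_300,
    "revenue_dashboard": 1_700_000_600,
}
origin = earliest_anomaly("revenue_dashboard", edges, first_anomaly_at)
```

Here the dashboard is where the problem was seen, but the walk points at ingest_job as the place to start looking, and orders_db is ruled out because it shows no anomaly at all.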
Three key design choices in the MCP integration:
- Runtime tool discovery. The client calls list_tools() on startup and auto-maps MCP tool names to the agent's internal tool names, with no hardcoded paths. This means the agent works against any MCP server that exposes run_query, get_indexes, and get_metadata.
- Async-to-sync bridging. The MCP SDK is async-native; we bridge it into the synchronous agent loop via a background event loop and run_coroutine_threadsafe.
- Graceful degradation. If MCP fails to connect, the agent prints the error and suggests --use-rest, so the diagnostic workflow never just dies.
The full stack: Python, the Anthropic Claude API, the Splunk MCP Server (Python MCP SDK, Streamable HTTP), the Splunk REST API (fallback), and HEC for fault injection, running against a local Splunk instance.
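As an illustration of runtime tool discovery, the sketch below maps discovered tool names onto the three internal names. The substring-matching rule and the example server tool names are assumptions made for this sketch; the source does not spell out the actual mapping logic in agent.py. In practice the discovered names would come from the server's list_tools() response.

```python
REQUIRED_TOOLS = ("run_query", "get_indexes", "get_metadata")


def map_tools(discovered_names, required=REQUIRED_TOOLS):
    """Map each internal tool name to a discovered server tool name.

    Matching rule (an assumption): a server tool matches when its name
    contains the internal name, e.g. "splunk_run_query" for "run_query".
    """
    mapping = {}
    for internal in required:
        matches = [name for name in discovered_names if internal in name]
        if not matches:
            raise LookupError(f"MCP server exposes no tool matching {internal!r}")
        # Prefer an exact match, otherwise the shortest name that contains it.
        mapping[internal] = internal if internal in matches else min(matches, key=len)
    return mapping


# Hypothetical tool names, as a server might report them.
discovered = ["splunk_run_query", "splunk_get_indexes", "get_metadata", "splunk_get_user_info"]
tool_map = map_tools(discovered)
```

Failing fast with a clear error when a required tool is missing fits the graceful-degradation goal: the caller can catch it and suggest --use-rest.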
We also built a Pipeline Health dashboard in Splunk that visualizes all three DQ dimensions side by side, with an inventory-vs-revenue blast-radius comparison panel.
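For readers unfamiliar with HEC, this is roughly what injecting one synthetic event looks like. It is a minimal sketch that only builds the request. The sourcetype name, the event fields, the token, and the helper name are hypothetical, and generate_data.py may be structured quite differently.

```python
import json
import time
import urllib.request


def build_hec_request(base_url, token, sourcetype, event, index="main", event_time=None):
    """Build (but do not send) a POST request for Splunk's HEC event endpoint."""
    payload = {
        "time": time.time() if event_time is None else event_time,
        "sourcetype": sourcetype,
        "index": index,
        "event": event,
    }
    return urllib.request.Request(
        url=f"{base_url}/services/collector/event",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Splunk {token}",  # HEC token scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical schema-registry event for the schema_change scenario.
req = build_hec_request(
    base_url="https://localhost:8088",
    token="00000000-0000-0000-0000-000000000000",  # placeholder token
    sourcetype="pipeline:schema_registry",  # hypothetical sourcetype name
    event={"field": "stock_count", "action": "renamed", "new_name": "available_qty"},
    event_time=1700000000,
)
body = json.loads(req.data)
# urllib.request.urlopen(req) would send it to a running Splunk instance.
```

Setting the time field explicitly, rather than letting Splunk stamp the event on arrival, is what makes it possible to replay a 60-minute incident recording with controlled timestamps.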
Challenges we ran into
The "editorial log" trap. Our first data generator produced log messages like "Possible schema mismatch after recent deployment", which is the conclusion the agent is supposed to reach, not something a log should contain. We learned to write factual-but-not-diagnostic messages: "KeyError: 'stock_count' — field not found in database response. Available keys: ['available_qty', ...]" gives the agent raw evidence without doing the reasoning for it.
Timestamp causality. The agent reasons about cause and effect using timestamps. If a schema change event and the first null-rate spike happen at the same second, or worse in the wrong order, the causal chain breaks. We enforced a 2-minute gap between root-cause events and first symptoms, with realistic propagation delays downstream.
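The causality rule can be sketched as a timeline builder plus a validator. The 2-minute gap comes from the text above; the stage names and the downstream delays are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MIN_CAUSAL_GAP = timedelta(minutes=2)


def build_timeline(root_cause_at, propagation_delays):
    """Place the first symptom MIN_CAUSAL_GAP after the root cause, then
    add each later stage's delay cumulatively."""
    timeline = [("root_cause", root_cause_at)]
    current = root_cause_at + MIN_CAUSAL_GAP
    for stage, delay in propagation_delays:
        current = current + delay
        timeline.append((stage, current))
    return timeline


def check_causality(timeline, min_gap=MIN_CAUSAL_GAP):
    """True when the first symptom trails the root cause by at least min_gap
    and every timestamp is strictly later than the one before it."""
    times = [ts for _, ts in timeline]
    if len(times) < 2:
        return True
    if times[1] - times[0] < min_gap:
        return False
    return all(earlier < later for earlier, later in zip(times, times[1:]))


root = datetime(2025, 1, 1, 12, 20, tzinfo=timezone.utc)
timeline = build_timeline(
    root,
    [
        ("null_rate_spike", timedelta(0)),           # first symptom, root + 2 min
        ("dq_check_failed", timedelta(minutes=3)),   # root + 5 min
        ("alert_raised", timedelta(minutes=5)),      # root + 10 min
    ],
)
valid = check_causality(timeline)

# A same-second root cause and symptom is exactly the case to reject.
broken = [("root_cause", root), ("null_rate_spike", root)]
invalid = check_causality(broken)
```

Running a validator like this over every generated scenario is a cheap way to catch ordering mistakes before the agent ever sees the data.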
Async/sync bridging for MCP. The Python MCP SDK is fully async, but our agent loop is synchronous to keep the Claude API interaction simple. We solved this with a dedicated background thread running its own asyncio event loop, using run_coroutine_threadsafe to bridge calls. Getting the lifecycle right (startup, cleanup, error propagation) took several iterations.
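The bridging pattern described above can be sketched with the standard library alone. The class name and the fake_run_query coroutine are stand-ins for illustration; the real agent would submit MCP client calls instead.

```python
import asyncio
import threading


class AsyncBridge:
    """Runs an asyncio event loop on a background thread so synchronous
    code can submit coroutines and block on their results."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def call(self, coro, timeout=30.0):
        """Submit a coroutine to the background loop and wait for its result.

        Exceptions raised inside the coroutine are re-raised in the caller.
        """
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def close(self):
        """Stop the loop from its own thread, wait for the thread, then close."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fake_run_query(spl):
    """Stand-in for an async MCP tool call."""
    await asyncio.sleep(0.01)
    return {"spl": spl, "rows": 3}


bridge = AsyncBridge()
try:
    result = bridge.call(fake_run_query("search index=main | head 3"))
finally:
    bridge.close()
```

The close() ordering matters: the loop has to be stopped via call_soon_threadsafe and the thread joined before the loop is closed, otherwise close() raises because the loop is still running.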
Getting reasoning instead of guessing. An LLM will happily produce a confident-sounding answer. We didn't want a confident guess; we wanted a diagnosis backed by evidence with explicit elimination of alternatives. Shaping the agent to rule out competing hypotheses, rather than just assert one, was the core prompt-design challenge.
Accomplishments that we're proud of
We didn't just demo it once and hope. We ran a 15-run validation suite across all three scenarios, five runs each, and graded every output against a strict four-level rubric.
All 15 runs scored L4, with zero variance. L4, our highest tier, requires the agent to identify the exact root cause with supporting evidence, provide a remediation, and explicitly eliminate the other two scenarios, every time.
The same system prompt was used for all three scenarios with zero adjustments per scenario. It encodes a general diagnostic methodology (follow lineage, check each DQ dimension, rule out alternatives), not scenario-specific hints. The agent had to figure out what broke, and why, from the evidence alone. That a single, fixed prompt reaches L4 on three structurally different failure modes is strong evidence that the agent is applying a general diagnostic method rather than matching scenario-specific patterns.
That zero-variance result is the accomplishment we care about most. Pipeline Doctor behaves like an instrument, not a slot machine: same fault in, same correct diagnosis out.
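The L4 criteria described above can be written down as a simple check. This is a sketch under assumptions: the diagnosis dict shape and its field names are hypothetical, the example remediation text is illustrative, and nothing here implies the actual grading was automated this way.

```python
SCENARIOS = ("schema_change", "volume_drop", "freshness_delay")


def meets_l4(diagnosis: dict, expected: str) -> bool:
    """True when a diagnosis satisfies every L4 criterion for the injected fault:
    correct root cause, evidence, remediation, and both alternatives eliminated."""
    others = set(SCENARIOS) - {expected}
    return (
        diagnosis.get("root_cause") == expected
        and bool(diagnosis.get("evidence"))
        and bool(diagnosis.get("remediation"))
        and others <= set(diagnosis.get("eliminated", []))
    )


def l4_count(runs) -> int:
    """Count the (expected_scenario, diagnosis) pairs that reach L4."""
    return sum(meets_l4(diagnosis, expected) for expected, diagnosis in runs)


good = {
    "root_cause": "schema_change",
    "evidence": ["KeyError: 'stock_count'"],
    "remediation": "Update the parser to read available_qty.",
    "eliminated": ["volume_drop", "freshness_delay"],
}
incomplete = dict(good, eliminated=["volume_drop"])  # one alternative not ruled out

runs = [("schema_change", good), ("schema_change", incomplete)]
passed = l4_count(runs)
```

A diagnosis that names the right root cause but fails to rule out one of the other two scenarios does not count, which is the property that separates L4 from a lucky correct guess.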
What we learned
- Lineage beats symptoms. The single biggest leap in diagnostic accuracy came from making the agent reason along the data's path rather than off its surface symptoms. Observability is fundamentally about narrowing causes systematically, and lineage is the structure that makes narrowing possible.
- Prompt design is experiment design. Early versions included hints like "when you see timeouts, check the network," essentially hardcoding the answer. We learned to encode methodology, not answers: "trace back to the earliest anomaly across all sourcetypes" works for every scenario without pointing to any specific one.
- A good diagnosis includes what it rules out. Forcing the agent to justify why it isn't the other failure modes didn't just make outputs more trustworthy; it made them more correct, because the act of elimination catches reasoning that would otherwise slip through.
- Repetition is the only honest test for AI agents. LLM agents are stochastic. A single successful run proves nothing. Running 5× per scenario (15 total) with a single, unmodified prompt gave us actual confidence and forced us to fix real prompt weaknesses rather than overfit to one lucky run.
- Reliability is a feature, not an afterthought. Treating consistency as a first-class goal, and validating it across many runs, turned a clever demo into something that behaves like a real diagnostic tool.
What's next for Pipeline Doctor
More failure classes. Schema changes, volume drops, and freshness delays are the start. Duplicate-record storms, partial-partition failures, and late-arriving-data anomalies are natural next additions.
From synthetic to production. The next milestone is pointing Pipeline Doctor at real production pipelines, so it diagnoses live incidents instead of injected ones, turning hours of manual log-correlation into seconds of agent reasoning.
Claude Desktop integration. We plan to bring Pipeline Doctor into Claude Desktop as an MCP-native tool, so an engineer can diagnose a live pipeline conversationally, right from where they work.