Inspiration

Data pipeline failures are one of the most time-consuming problems for data teams. When a pipeline breaks at 2 AM, engineers spend hours manually tracing dependencies, reading logs, identifying root causes, and writing fixes. I asked: what if an AI agent could handle the entire incident response autonomously?

What it does

Agentic Pipeline Repair is a multi-agent system that monitors, diagnoses, and repairs data pipeline failures automatically. A single check command triggers a workflow that:

  1. Detects failures, SLA breaches, schema drift, and data quality violations
  2. Traces the pipeline DAG upstream to find root causes
  3. Reads actual dbt model SQL to understand the problem
  4. Proposes exact code fixes with before/after diffs
  5. Applies fixes with human approval, then verifies the repair

The system coordinates five specialized agents (Monitor, Diagnostics, Repair, Verification, and Orchestrator), each using Amazon Nova 2 Lite's extended reasoning capabilities.
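
To make step 2 concrete, here is a minimal sketch of an upstream trace, not the project's actual implementation. It assumes the DAG is available as a plain dictionary mapping each node to its direct upstream dependencies, and that a set of currently failing nodes is known; `find_root_causes` and the node names are hypothetical.

```python
from collections import deque


def find_root_causes(upstream, failed, start):
    """Walk upstream from `start` through failing nodes.

    A node counts as a root cause when it is failing and none of its
    direct upstream dependencies are failing.
    `upstream` maps node -> list of direct upstream nodes (assumed shape).
    """
    seen = {start}
    queue = deque([start])
    roots = []
    while queue:
        node = queue.popleft()
        failed_parents = [p for p in upstream.get(node, []) if p in failed]
        if node in failed and not failed_parents:
            roots.append(node)
        # Only follow failing parents; healthy branches are not explored.
        for parent in failed_parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(roots)


# Hypothetical example DAG: a report built from two cleaned tables.
example_upstream = {
    "daily_revenue": ["orders_clean", "customers"],
    "orders_clean": ["raw_orders"],
    "customers": ["raw_customers"],
}
example_failed = {"daily_revenue", "orders_clean", "raw_orders"}
root_causes = find_root_causes(example_upstream, example_failed, "daily_revenue")
```

This simple version only follows failing parents, so it would miss a problem that passes silently through a healthy-looking intermediate model; an agent with access to schemas and SQL can reason past that limit.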

How I built it

  • Amazon Nova 2 Lite on Bedrock powers all agent reasoning with extended thinking enabled for complex root cause analysis
  • Strands Agents SDK coordinates the multi-agent workflow
  • 16 MCP Tools let agents interact with pipeline infrastructure (read schemas, check quality, apply dbt fixes)
  • FastAPI + React Dashboard provides real-time visibility into pipeline health
  • PostgreSQL + dbt manages pipeline metadata and transformations
  • Docker enables one-command local deployment

The agents discover what to monitor at runtime instead of relying on hardcoded pipeline names, so the system is not tied to any particular dataset.
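
As an illustration of runtime discovery, the sketch below reads the list of pipelines from a metadata table instead of a hardcoded list. It is an assumption-heavy stand-in: it uses `sqlite3` from the standard library in place of PostgreSQL, and the `pipeline_runs` table, its `pipeline_name` column, and the `discover_pipelines` function are hypothetical names, not the project's schema or tools.

```python
import sqlite3


def discover_pipelines(conn):
    """Return the distinct pipeline names recorded in the metadata table.

    `pipeline_runs` and `pipeline_name` are hypothetical; a real
    deployment would query whatever metadata the pipelines already write.
    """
    rows = conn.execute(
        "SELECT DISTINCT pipeline_name FROM pipeline_runs ORDER BY pipeline_name"
    ).fetchall()
    return [name for (name,) in rows]
```

A monitoring agent could call a tool like this at the start of each check, so a newly added pipeline is picked up without any code change.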

Challenges I ran into

  • Extended thinking tuning: Finding the right reasoning intensity for each agent type. Diagnostics needs deep analysis (high), while Verification needs quick checks (low).
  • Tool design: Creating tools that give agents enough context without overwhelming them. The get_dbt_model_sql tool was crucial for generating accurate fixes.
  • Human-in-the-loop: Balancing autonomy with safety. The system proposes fixes but requires approval before modifying code.
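
The approval gate described in the last point can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not the project's code: `ProposedFix`, `apply_with_approval`, and the `approve` and `write` callbacks are hypothetical names, and the before/after diff is produced with the standard library's `difflib`.

```python
import difflib
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedFix:
    """A candidate change to one model's SQL (hypothetical structure)."""
    model_name: str
    before: str
    after: str

    def diff(self) -> str:
        # Unified diff of the model SQL, suitable for showing to a reviewer.
        return "".join(
            difflib.unified_diff(
                self.before.splitlines(keepends=True),
                self.after.splitlines(keepends=True),
                fromfile=f"{self.model_name}.sql (before)",
                tofile=f"{self.model_name}.sql (after)",
            )
        )


def apply_with_approval(
    fix: ProposedFix,
    approve: Callable[[str], bool],
    write: Callable[[str, str], None],
) -> bool:
    """Show the diff to a human; write the new SQL only if they approve."""
    if not approve(fix.diff()):
        return False
    write(fix.model_name, fix.after)
    return True
```

Keeping `approve` as a callback means the same gate could sit behind a dashboard button, a CLI prompt, or a chat message, while the agent itself never gets a code path that writes without a human decision.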

Accomplishments that I'm proud of

  • End-to-end autonomous repair: From detection to verified fix in under 3 minutes
  • The agent reads real dbt SQL and generates working fixes, not just suggestions
  • Dashboard updates in real time as agents work
  • Rollback capability if a fix fails compilation
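
The rollback behavior in the last point can be sketched like this, as one possible shape rather than the actual implementation. `apply_with_rollback` and the `compiles` callback are hypothetical; in a dbt project the callback might shell out to `dbt compile`, but that wiring is an assumption and is left out here.

```python
from pathlib import Path
from typing import Callable


def apply_with_rollback(
    path: Path,
    new_sql: str,
    compiles: Callable[[Path], bool],
) -> bool:
    """Write `new_sql` to `path`, restoring the original if the check fails.

    `compiles` is a caller-supplied check (hypothetical); an exception
    raised by it is treated the same as a failed compilation.
    """
    original = path.read_text()
    path.write_text(new_sql)
    try:
        ok = compiles(path)
    except Exception:
        ok = False
    if not ok:
        # Restore the previous file contents so the model is left unchanged.
        path.write_text(original)
    return ok
```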

What I learned

Nova 2 Lite's extended thinking made a substantial difference for this agentic system. The model's ability to reason through complex dependency chains and generate syntactically correct SQL fixes exceeded my expectations. The Strands Agents SDK made multi-agent coordination surprisingly straightforward.

What's next for Agentic Pipeline Repair

  • Slack/PagerDuty integration for production alerting
  • Learning from past repairs to improve fix accuracy
  • Multi-pipeline cascading repairs when failures affect multiple downstream systems
  • Cost optimization by routing simpler tasks to lighter models
