Agentic Pipeline Repair

Images: System Architecture, Cloud Deployment Architecture, React Dashboard

Inspiration

Data pipeline failures are one of the most time-consuming problems for data teams. When a pipeline breaks at 2 AM, engineers spend hours manually tracing dependencies, reading logs, identifying root causes, and writing fixes. I asked: what if an AI agent could handle the entire incident response autonomously?

What it does

Agentic Pipeline Repair is a multi-agent system that monitors, diagnoses, and repairs data pipeline failures automatically. A single check command triggers a workflow that:

Detects failures, SLA breaches, schema drift, and data quality violations
Traces the pipeline DAG upstream to find root causes
Reads actual dbt model SQL to understand the problem
Proposes exact code fixes with before/after diffs
Applies fixes with human approval, then verifies the repair
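
As a rough illustration of the upstream-tracing step, here is a minimal sketch in Python. The model names, the `UPSTREAM` graph, the `STATUS` map, and `find_root_causes` are all hypothetical stand-ins, not the project's actual tools. The idea is to walk from a failed model toward its sources and report the failed models that have no failed parents of their own.

```python
from collections import deque

# Hypothetical DAG: each model maps to the models it reads from (its upstream parents).
UPSTREAM = {
    "revenue_report": ["orders_clean", "customers_clean"],
    "orders_clean": ["raw_orders"],
    "customers_clean": ["raw_customers"],
    "raw_orders": [],
    "raw_customers": [],
}

# Hypothetical health status per model, as a monitoring step might report it.
STATUS = {
    "revenue_report": "failed",
    "orders_clean": "failed",
    "customers_clean": "ok",
    "raw_orders": "failed",
    "raw_customers": "ok",
}


def find_root_causes(failed_model, upstream, status):
    """Walk upstream from a failed model and return the failed models
    that have no failed parents of their own (the likely root causes)."""
    roots = []
    seen = {failed_model}
    queue = deque([failed_model])
    while queue:
        model = queue.popleft()
        failed_parents = [p for p in upstream.get(model, []) if status.get(p) != "ok"]
        if not failed_parents:
            # Nothing upstream is broken, so the problem most likely starts here.
            roots.append(model)
        for parent in failed_parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return roots


root_causes = find_root_causes("revenue_report", UPSTREAM, STATUS)
print(root_causes)
```

With this sample data the walk skips the healthy customers branch and stops at the raw orders source, which is where a diagnostics step would then read the model SQL.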

The system coordinates five specialized agents (Monitor, Diagnostics, Repair, Verification, and Orchestrator), each using Amazon Nova 2 Lite's extended thinking.

How I built it

Amazon Nova 2 Lite on Bedrock powers all agent reasoning with extended thinking enabled for complex root cause analysis
Strands Agents SDK coordinates the multi-agent workflow
16 MCP Tools let agents interact with pipeline infrastructure (read schemas, check quality, apply dbt fixes)
FastAPI + React Dashboard provides real-time visibility into pipeline health
PostgreSQL + dbt manages pipeline metadata and transformations
Docker enables one-command local deployment

The agents discover what to monitor at runtime instead of relying on hardcoded pipeline names, so the system is not tied to any particular dataset.
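
One way runtime discovery could look is sketched below. This is an assumption-heavy illustration: the `pipeline_runs` table, its columns, and `discover_pipelines` are hypothetical, and an in-memory SQLite database stands in for the PostgreSQL metadata store so the example runs on its own.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL metadata store (assumption).
conn = sqlite3.connect(":memory:")
# Hypothetical metadata table; the real schema is not shown in this write-up.
conn.execute("create table pipeline_runs (pipeline_name text, status text)")
conn.executemany(
    "insert into pipeline_runs values (?, ?)",
    [
        ("orders_daily", "success"),
        ("orders_daily", "failed"),
        ("customers_daily", "success"),
    ],
)


def discover_pipelines(connection):
    """Return the pipeline names found in the metadata, with no names hardcoded."""
    rows = connection.execute(
        "select distinct pipeline_name from pipeline_runs order by pipeline_name"
    ).fetchall()
    return [name for (name,) in rows]


pipelines = discover_pipelines(conn)
print(pipelines)
```

The point of the design is that adding a new pipeline only means new rows in the metadata, not a code change in the agents.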

Challenges I ran into

Extended thinking tuning: Finding the right reasoning intensity for each agent type. Diagnostics needs deep analysis (high), while Verification needs quick checks (low).
Tool design: Creating tools that give agents enough context without overwhelming them. The get_dbt_model_sql tool was crucial for generating accurate fixes.
Human-in-the-loop: Balancing autonomy with safety. The system proposes fixes but requires approval before modifying code.
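
The approval gate can be sketched with the standard library alone. The snippet below is an assumption about the shape of such a gate, not the project's code: `render_diff`, `apply_with_approval`, the sample SQL, and the model path are hypothetical. It builds a before/after diff with `difflib` and only returns the changed SQL when an approval callback accepts it.

```python
import difflib


def render_diff(before, after, path):
    """Return a unified diff of the proposed change for the reviewer."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


def apply_with_approval(before, after, path, approve):
    """Return the new SQL only if the approve callback accepts the diff;
    otherwise return the original SQL unchanged."""
    diff = render_diff(before, after, path)
    return after if approve(diff) else before


# Hypothetical model SQL before and after a proposed null-handling fix.
before_sql = "select order_id, amount\nfrom raw_orders\n"
after_sql = "select order_id, coalesce(amount, 0) as amount\nfrom raw_orders\n"
model_path = "models/orders_clean.sql"

# A real gate would show the diff to a person; these lambdas are stand-ins.
rejected = apply_with_approval(before_sql, after_sql, model_path, lambda diff: False)
approved = apply_with_approval(before_sql, after_sql, model_path, lambda diff: True)
print(render_diff(before_sql, after_sql, model_path))
```

Passing the decision in as a callback keeps the gate testable: the same function works with a terminal prompt, a dashboard button, or an automated policy.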

Accomplishments that I'm proud of

End-to-end autonomous repair: From detection to verified fix in under 3 minutes
The agent reads real dbt SQL and generates working fixes, not just suggestions
Dashboard updates in real-time as agents work
Rollback capability if a fix fails compilation
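
A minimal sketch of the rollback idea follows, assuming a file-based backup. `apply_fix_with_rollback` and `fake_compiles` are hypothetical; in the real system the check would presumably be a dbt compilation step, but here a simple parenthesis count stands in so the example is self-contained.

```python
import shutil
import tempfile
from pathlib import Path


def apply_fix_with_rollback(model_path, fixed_sql, compiles):
    """Write fixed_sql to model_path and restore the original file if
    compiles(model_path) returns False or raises. Returns True when the fix is kept."""
    model_path = Path(model_path)
    backup_path = model_path.with_name(model_path.name + ".bak")
    shutil.copy2(model_path, backup_path)
    kept = False
    try:
        model_path.write_text(fixed_sql)
        kept = bool(compiles(model_path))
        return kept
    finally:
        if not kept:
            # Roll back to the original contents.
            shutil.copy2(backup_path, model_path)
        backup_path.unlink()


def fake_compiles(path):
    """Stand-in for a real compile check: reject SQL with unbalanced parentheses."""
    text = path.read_text()
    return text.count("(") == text.count(")")


original_sql = "select amount from raw_orders\n"
broken_fix = "select coalesce(amount, 0 from raw_orders\n"
working_fix = "select coalesce(amount, 0) as amount from raw_orders\n"

with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "orders_clean.sql"
    model.write_text(original_sql)
    bad_kept = apply_fix_with_rollback(model, broken_fix, fake_compiles)
    after_bad = model.read_text()
    good_kept = apply_fix_with_rollback(model, working_fix, fake_compiles)
    after_good = model.read_text()
```

Restoring inside `finally` means the original model also comes back if the check itself crashes, which matters when the check shells out to an external tool.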

What I learned

Nova 2 Lite's extended thinking is a game-changer for agentic systems. The model's ability to reason through complex dependency chains and generate syntactically correct SQL fixes exceeded my expectations. The Strands Agents SDK made multi-agent coordination surprisingly straightforward.

What's next for Agentic Pipeline Repair

Slack/PagerDuty integration for production alerting
Learning from past repairs to improve fix accuracy
Multi-pipeline cascading repairs when failures affect multiple downstream systems
Cost optimization by routing simpler tasks to lighter models

Built With

amazon-bedrock
amazon-nova-2-lite
aws-ec2
aws-rds
dbt
docker
fastapi
html
javascript
mcp-tools
postgresql
python
react
strands-agents-sdk

Updates

Sanidhya Karnik started this project — Feb 13, 2026 01:33 PM EST
