Inspiration
Data pipeline failures are one of the most time-consuming problems for data teams. When a pipeline breaks at 2 AM, engineers spend hours manually tracing dependencies, reading logs, identifying root causes, and writing fixes. I asked: what if an AI agent could handle the entire incident response autonomously?
What it does
Agentic Pipeline Repair is a multi-agent system that monitors, diagnoses, and repairs data pipeline failures automatically. A single check command triggers a workflow that:
- Detects failures, SLA breaches, schema drift, and data quality violations
- Traces the pipeline DAG upstream to find root causes
- Reads actual dbt model SQL to understand the problem
- Proposes exact code fixes with before/after diffs
- Applies fixes with human approval, then verifies the repair
The system coordinates five specialized agents - Monitor, Diagnostics, Repair, Verification, and Orchestrator - each using Amazon Nova 2 Lite's extended reasoning capabilities.
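The upstream tracing step can be illustrated with a small graph walk. This is a minimal sketch, not the project's actual code: the function name find_root_causes, the upstream mapping, the status values, and the model names are all hypothetical, and it assumes each node's status is already known.

```python
from collections import deque


def find_root_causes(failed_node, upstream, status):
    """Walk upstream from a failed node (breadth-first) and return the
    failing nodes that have no failing parents of their own."""
    roots = []
    seen = {failed_node}
    queue = deque([failed_node])
    while queue:
        node = queue.popleft()
        failing_parents = [
            parent for parent in upstream.get(node, [])
            if status.get(parent) == "failed"
        ]
        if not failing_parents:
            # Nothing further upstream is broken, so the fault likely starts here.
            roots.append(node)
        for parent in failing_parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return roots


# Hypothetical DAG: each key maps a model to the models it depends on.
upstream = {
    "daily_revenue": ["orders_clean", "customers_clean"],
    "orders_clean": ["raw_orders"],
    "customers_clean": ["raw_customers"],
}
# Hypothetical health status per node.
status = {
    "daily_revenue": "failed",
    "orders_clean": "failed",
    "raw_orders": "failed",
    "customers_clean": "ok",
    "raw_customers": "ok",
}

print(find_root_causes("daily_revenue", upstream, status))
```

With this sample data the walk follows only the failing branch and stops at raw_orders, the furthest upstream node that is itself failing.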
How I built it
- Amazon Nova 2 Lite on Bedrock powers all agent reasoning with extended thinking enabled for complex root cause analysis
- Strands Agents SDK coordinates the multi-agent workflow
- 16 MCP Tools let agents interact with pipeline infrastructure (read schemas, check quality, apply dbt fixes)
- FastAPI + React Dashboard provides real-time visibility into pipeline health
- PostgreSQL + dbt manages pipeline metadata and transformations
- Docker enables one-command local deployment
The agents dynamically discover what to monitor at runtime - no hardcoded pipeline names - making the system work with any dataset.
Challenges I ran into
- Extended thinking tuning: Finding the right reasoning intensity for each agent type. Diagnostics needs deep analysis (high), while Verification needs quick checks (low).
- Tool design: Creating tools that give agents enough context without overwhelming them. The get_dbt_model_sql tool was crucial for generating accurate fixes.
- Human-in-the-loop: Balancing autonomy with safety. The system proposes fixes but requires approval before modifying code.
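Two of the points above, per-agent reasoning intensity and the approval gate, can be sketched in a few lines. This is an illustrative sketch under stated assumptions: only the "high" setting for Diagnostics and the "low" setting for Verification come from the write-up, while the "medium" default, the ProposedFix class, and the function names are hypothetical and do not reflect any SDK API.

```python
from dataclasses import dataclass
from typing import Callable

# Reasoning intensity per agent. Only these two values are stated in the
# write-up; the default for other agents is an assumption.
REASONING_LEVEL = {
    "diagnostics": "high",   # deep root cause analysis
    "verification": "low",   # quick checks
}
DEFAULT_LEVEL = "medium"  # assumption, not from the project


def reasoning_level_for(agent_name: str) -> str:
    """Return the reasoning intensity to request for an agent."""
    return REASONING_LEVEL.get(agent_name.lower(), DEFAULT_LEVEL)


@dataclass
class ProposedFix:
    """Hypothetical container for a proposed change to a dbt model."""
    model_name: str
    before_sql: str
    after_sql: str


def apply_with_approval(
    fix: ProposedFix,
    approve: Callable[[ProposedFix], bool],
    apply: Callable[[ProposedFix], None],
) -> bool:
    """Apply a fix only if the approval callback returns True."""
    if not approve(fix):
        return False  # nothing is modified without approval
    apply(fix)
    return True
```

Passing the approval step as a callback keeps the gate independent of how approval is collected, whether that is a dashboard button or a command-line prompt.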
Accomplishments that I'm proud of
- End-to-end autonomous repair: From detection to verified fix in under 3 minutes
- The agent reads real dbt SQL and generates working fixes, not just suggestions
- Dashboard updates in real-time as agents work
- Rollback capability if a fix fails compilation
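The rollback behavior can be sketched as a write, check, and restore sequence. This is a minimal sketch and not the project's implementation: apply_fix_with_rollback and the compiles callback are hypothetical names, and the compile check is injected so that it could, for example, shell out to dbt compile in a real setup.

```python
from pathlib import Path
from typing import Callable


def apply_fix_with_rollback(
    model_path: str,
    new_sql: str,
    compiles: Callable[[Path], bool],
) -> bool:
    """Write new SQL to a model file and restore the original contents
    if the compile check fails or raises."""
    path = Path(model_path)
    original = path.read_text()  # keep a copy for rollback
    path.write_text(new_sql)
    try:
        ok = compiles(path)  # hypothetical check supplied by the caller
    except Exception:
        ok = False  # treat a crashing check as a failed compilation
    if not ok:
        path.write_text(original)  # roll back to the previous SQL
    return ok
```

The function returns True when the fix is kept and False when it was rolled back, so the caller can report the outcome to the Verification step.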
What I learned
Nova 2 Lite's extended thinking made a real difference for this agentic system. The model's ability to reason through complex dependency chains and generate syntactically correct SQL fixes exceeded my expectations. The Strands Agents SDK made multi-agent coordination surprisingly straightforward.
What's next for Agentic Pipeline Repair
- Slack/PagerDuty integration for production alerting
- Learning from past repairs to improve fix accuracy
- Multi-pipeline cascading repairs when failures affect multiple downstream systems
- Cost optimization by routing simpler tasks to lighter models
Built With
- amazon-bedrock
- amazon-nova-2-lite
- aws-ec2
- aws-rds
- dbt
- docker
- fastapi
- html
- javascript
- mcp-tools
- postgresql
- python
- react
- strands-agents-sdk