Inspiration
In modern data platforms, pipeline failures can halt analytics, delay reporting, and waste countless hours of engineering time. After dealing with these issues firsthand in my past roles, I was inspired to build a solution where AI agents take on the burden of identifying and resolving incidents automatically, freeing up developers and ensuring smoother operations.
The idea of leveraging multiple specialized agents that work in sync to detect, remediate, and document failures felt like the perfect real-world application of the Agent Development Kit.
What it does
AI Agent-Driven Data Pipeline Incident Resolver is an intelligent, multi-agent system that:
• Detects pipeline failures from Airflow errors redirected to Slack
• Diagnoses the root cause using Vertex AI Search (which indexes the internal documentation) and Google Search
• Executes fixes (e.g., running SQL, or modifying files by uploading them to Cloud Storage) through a Remediation Agent
• Generates a post-mortem report that is stored for future reference and indexed with Vertex AI Search
This all happens autonomously, with minimal manual input, creating a smart, self-healing pipeline environment.
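As a rough illustration of the detection step, the sketch below parses an Airflow-style failure message into structured fields. The message format and the `parse_airflow_alert` helper are invented for this example; the real payload depends entirely on how the Airflow-to-Slack redirection is configured.

```python
import re

# Hypothetical example of an Airflow error redirected to Slack.
# The actual format depends on the configured failure callback.
SAMPLE_ALERT = (
    "Task Failed: dag_id=silver_transform, task_id=load_orders, "
    "error=google.api_core.exceptions.BadRequest: Syntax error in SQL"
)

def parse_airflow_alert(text: str) -> dict:
    """Extract DAG, task, and error message from an alert string (illustrative only)."""
    match = re.search(r"dag_id=(\w+), task_id=(\w+), error=(.+)", text)
    if not match:
        raise ValueError("unrecognized alert format")
    dag_id, task_id, error = match.groups()
    return {"dag_id": dag_id, "task_id": task_id, "error": error}

incident = parse_airflow_alert(SAMPLE_ALERT)
print(incident["dag_id"])  # silver_transform
```

A structured incident like this is what the downstream agents would reason over, rather than the raw Slack message text.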
How I built it
I used the Agent Development Kit (ADK) to define and orchestrate three core agents:
• Knowledge Agent: Investigates errors using Vertex AI Search (over internal documentation and past incident records) or external web resources.
• Remediation Agent: Applies appropriate fixes, such as updating SQL queries or correcting Python/SQL files.
• Post-Mortem Agent: Summarizes the issue and resolution steps, and stores the report for learning and future troubleshooting.
These agents run on Google Cloud Vertex AI Agent Engine, triggered when the agent is mentioned on Slack with the error details provided by Airflow. The underlying data pipeline runs through BigQuery, with Airflow DAGs orchestrating Bronze, Silver, and Gold data transformations. All components are deployed via Cloud Run, with credentials managed by Secret Manager.
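The handoff between the three agents can be sketched in plain Python. This models the flow only: the real agents are defined with ADK and run on Vertex AI Agent Engine, and every function body below is a stub invented for illustration.

```python
# Illustrative-only model of the three-agent flow. Names mirror the
# architecture above; the logic is stubbed rather than calling ADK.

def knowledge_agent(error: str) -> str:
    # Real system: query Vertex AI Search over internal docs and past
    # incidents, falling back to Google Search for external resources.
    return f"diagnosis for: {error}"

def remediation_agent(diagnosis: str) -> str:
    # Real system: run corrected SQL or upload fixed files to Cloud Storage.
    return f"fix applied based on {diagnosis}"

def post_mortem_agent(error: str, diagnosis: str, fix: str) -> dict:
    # Real system: store the report and index it with Vertex AI Search
    # so future incidents can retrieve it.
    return {"error": error, "diagnosis": diagnosis, "resolution": fix}

def resolve_incident(error: str) -> dict:
    diagnosis = knowledge_agent(error)
    fix = remediation_agent(diagnosis)
    return post_mortem_agent(error, diagnosis, fix)

report = resolve_incident("BigQuery job failed: invalid column name")
```

The key design choice this captures is that each agent has a single responsibility and passes a narrow artifact (diagnosis, fix, report) to the next, which is what made the agents easy to modularize and test independently.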
Challenges I ran into
• Vertex AI Search setup was non-trivial; splitting it out from the knowledge logic helped modularize the agent properly.
• Slack integration debugging exposed subtle bugs with session ID types in the Agent Engine.
• Coordinating services (Composer, Pub/Sub, Cloud Run, Slack bot, BigQuery) took careful planning to avoid timing or auth conflicts.
Accomplishments that I'm proud of
• Created a working multi-agent architecture with live remediation capabilities.
• Built a robust incident resolution flow powered end to end by AI agents, with no human input required.
• Achieved seamless integration of Google Cloud tools to minimize boilerplate and maximize functionality.
• Authored a Medium blog post to share the architecture and learnings with the wider community.
What I learned
• A clear use case dramatically accelerates development.
• The Agent Starter Pack was instrumental in rapid testing and development.
• Using Gemini Code Assist sped up instruction crafting and fine-tuning agent logic.
• Google Cloud’s services enabled clean, scalable orchestration without needing to build custom infrastructure.
What's next for AI Agent-Driven Data Pipeline Incident Resolver
• Fine-tune the agent instructions (more confirmation safeguards, coverage of more failure cases, ...)
• Make the deployment production-ready with proper CI/CD
• Handle more error types from additional integrations
Built With
- bigquery
- cloudrun
- composer
- computengine
- python
- secretmanager
- slackapi
- vertexaisearch