💡 Inspiration: The 2 AM Pager Dread
Every data engineer knows the dreaded 2 AM pager alarm. An upstream team silently modifies a JSON schema—changing user_id to customer_id or introducing unexpected null values. Suddenly, your downstream ETL pipeline crashes, and dashboards go dark.
Traditionally, an engineer has to wake up, groggily sift through error logs, manually compare data with code, write a fix, and submit a Merge Request. This process means at least 30 minutes of downtime. We asked ourselves: Why act as an AI Copilot when we can build an AI Autopilot? We wanted to create a "Digital First Responder" that could diagnose and heal data pipelines with zero human intervention.
⚙️ What it does
DataOps Auto-Healer is an autonomous agentic AI workflow built on GitLab Duo. Triggered by a simple @mention in a GitLab Issue, it executes a strict, secure 4-step loop:
- Cross-File Detective: Automatically reads raw_orders.json and etl_pipeline.py to pinpoint schema mismatches.
- Secure Remediation: Backs up the database and commits exact code/data fixes to a completely isolated, dynamically generated branch.
- Automated Branching & MR: Automatically opens a Merge Request against the main branch.
- Comprehensive Reporting: Runs sanity checks (e.g., revenue validation) and leaves a highly detailed, formatted Emoji report in the MR comments for human review.
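As a rough illustration of the Cross-File Detective step, the sketch below compares the fields a pipeline script reads with the fields actually present in the data. It is not the agent's actual logic: the sample file contents, the `row["..."]` access pattern, and both helper names are assumptions made for this example.

```python
import json
import re


def fields_read_by_code(source: str) -> set:
    """Rough heuristic: collect keys used as row["..."] subscripts in the code."""
    return set(re.findall(r"row\[['\"](\w+)['\"]\]", source))


def find_schema_mismatches(records: list, expected_fields: set) -> dict:
    """Compare the fields the code expects with the fields found in the data."""
    seen = set()
    null_fields = set()
    for record in records:
        seen.update(record.keys())
        null_fields.update(key for key, value in record.items() if value is None)
    return {
        "missing": sorted(expected_fields - seen),      # read by code, absent in data
        "unexpected": sorted(seen - expected_fields),   # present in data, unused by code
        "nulls": sorted(null_fields & expected_fields), # expected fields containing nulls
    }


# Hypothetical stand-ins for raw_orders.json and etl_pipeline.py.
raw_orders = json.loads(
    '[{"customer_id": 1, "amount": 9.5}, {"customer_id": 2, "amount": null}]'
)
etl_source = (
    'total = sum(row["amount"] for row in rows)\n'
    'ids = [row["user_id"] for row in rows]\n'
)

report = find_schema_mismatches(raw_orders, fields_read_by_code(etl_source))
print(report)
```

With this sample input, the renamed key shows up as one missing field (user_id) paired with one unexpected field (customer_id), which is the kind of signal a fix could be based on.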
🛠️ How we built it
We utilized the GitLab AI Catalog to orchestrate a multi-agent system using YAML flow definitions (data_healer_flow.yml). We defined distinct components (detect_failure, analyze_and_fix, open_merge_request, and validate_and_report) and equipped the AI with precise toolsets, such as read_file, create_branch, create_commit, and create_merge_request.
By strictly managing the prompt context and execution order, we transformed a generalized LLM into a highly specialized DevOps engineer.
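The idea of a fixed execution order with context handed from step to step can be sketched in plain Python. This is only an analogy for the YAML flow, not its real schema: the step names come from the flow described above, while the function bodies, the context keys, and the branch naming pattern are hypothetical.

```python
from typing import Callable, Dict, List

Context = Dict[str, str]


def detect_failure(ctx: Context) -> Context:
    # Placeholder finding; a real step would inspect the data and the code.
    return {**ctx, "failure": "KeyError: 'user_id'"}


def analyze_and_fix(ctx: Context) -> Context:
    # The branch name is created here and must reach the next step unchanged.
    return {**ctx, "branch": f"auto-heal/issue-{ctx['issue_id']}"}


def open_merge_request(ctx: Context) -> Context:
    # Uses the exact branch name from the previous step instead of guessing it.
    return {**ctx, "mr_source": ctx["branch"], "mr_target": "main"}


def validate_and_report(ctx: Context) -> Context:
    return {**ctx, "report": f"MR from {ctx['mr_source']} into {ctx['mr_target']}"}


# The order is fixed up front; no step can jump backwards.
FLOW: List[Callable[[Context], Context]] = [
    detect_failure,
    analyze_and_fix,
    open_merge_request,
    validate_and_report,
]


def run_flow(ctx: Context) -> Context:
    for step in FLOW:
        ctx = step(ctx)
    return ctx


result = run_flow({"issue_id": "42"})
print(result["report"])
```

Each step returns a new context instead of mutating shared state, so a missing key such as the branch name fails loudly at the step that needs it.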
⚠️ Challenges we ran into
Building true autonomy is incredibly hard. Our biggest hurdle was the "Infinite Read Loop." Initially, if the agent failed to create a Merge Request (due to missing project path context or branch names not passing between steps), it would panic. Thinking it missed something, it would infinitely loop back to call the read_file tool.
Furthermore, navigating the strict YAML schema validation for the AI Catalog (tool_name configurations) required pixel-perfect syntax tuning. We had to dive deep into the platform's CI/CD pipeline logs to reverse-engineer the required schema.
🏆 Accomplishments that we're proud of
We are immensely proud of solving the infinite loop by implementing a "Hard Logic Lock." We deliberately stripped the agent of its read_file tool during the Merge Request phase, forcing it to exclusively use create_merge_request with the exact branch name passed from the previous step.
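A minimal sketch of the "Hard Logic Lock" idea is a per-step tool allowlist. Only the open_merge_request entry reflects what is described above; the other mappings, the exception class, and the dispatcher are assumptions for illustration. The reporting step is left out because its tools are not named here.

```python
# Which tools each step may call. Only the open_merge_request entry is taken
# from the write-up; the other two mappings are illustrative assumptions.
ALLOWED_TOOLS = {
    "detect_failure": {"read_file"},
    "analyze_and_fix": {"read_file", "create_branch", "create_commit"},
    "open_merge_request": {"create_merge_request"},
}


class ToolNotAllowed(Exception):
    """Raised when a step asks for a tool outside its allowlist."""


def call_tool(step: str, tool: str, **kwargs) -> dict:
    if tool not in ALLOWED_TOOLS.get(step, set()):
        raise ToolNotAllowed(f"{tool} is not available during {step}")
    # Stand-in for the real tool dispatch.
    return {"step": step, "tool": tool, "args": kwargs}


# Allowed: the MR step creating a merge request from the passed-in branch.
ok = call_tool(
    "open_merge_request",
    "create_merge_request",
    source_branch="auto-heal/issue-42",  # hypothetical branch name
    target_branch="main",
)

# Blocked: the MR step cannot fall back to reading files again.
try:
    call_tool("open_merge_request", "read_file", path="etl_pipeline.py")
    blocked = False
except ToolNotAllowed:
    blocked = True

print(ok["tool"], blocked)
```

Because the retry path through read_file does not exist in that phase, a failed Merge Request surfaces as an error instead of restarting the read loop.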
We also achieved a massive efficiency gain. By shifting from manual debugging to our Agentic Workflow, we fundamentally altered the incident response math. If $T_{manual}$ is 30 minutes and $T_{auto}$ is 30 seconds, the efficiency gain is:
$$Efficiency = \frac{T_{manual} - T_{auto}}{T_{manual}} \times 100\% \approx 98.3\%$$
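The arithmetic can be checked in a few lines once both durations are in the same unit (seconds here). The function name is just a label for this example.

```python
def efficiency_gain(t_manual_s: float, t_auto_s: float) -> float:
    """Percentage of response time saved relative to the manual process."""
    return (t_manual_s - t_auto_s) / t_manual_s * 100


t_manual_s = 30 * 60  # 30 minutes in seconds
t_auto_s = 30         # 30 seconds

gain = efficiency_gain(t_manual_s, t_auto_s)
print(f"{gain:.1f}%")  # 98.3%
```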
We cut downtime to near zero while enforcing enterprise-grade security (no direct commits to main).
📚 What we learned
We learned that Prompt Engineering for Agentic Workflows is fundamentally different from chatting with an LLM. You aren't just giving instructions; you are building a state machine. We learned the critical importance of restricting an AI's toolset at specific lifecycle stages to prevent hallucination and enforce DevOps best practices (always branch, always backup).
🚀 What's next for DataOps Auto-Healer
Currently, the Auto-Healer excels at localized JSON schema fixes. Next, we plan to:
- Integrate it directly with modern data stacks like dbt and Apache Airflow.
- Implement auto-rollback features if the sanity checks in the final reporting phase fail.
- Add Webhook listeners so the agent triggers automatically on pipeline failure, removing the need for even the initial @mention!
Built With
- gitlab
- python