Inspiration

Developers often lose significant time acting as "pipeline janitors." When a Continuous Integration/Continuous Delivery (CI/CD) pipeline breaks (whether due to a linting issue, a dependency failure, or a new Static Application Security Testing (SAST) vulnerability), development comes to a screeching halt. The process involves manually sifting through extensive logs, pinpointing the error, crafting a patch, and then waiting for the entire pipeline to restart. This severely damages development velocity.

This challenge led to the creation of Duo Auto-Heal, with the vision of moving AI beyond a passive chat function to an active force in the DevOps lifecycle. The goal was to build an autonomous, virtual team member capable of not just flagging a pipeline failure, but actually writing and testing the necessary fix, allowing the developer to continue coding uninterrupted.

What it does

Duo Auto-Heal is designed to automatically handle GitLab CI/CD pipeline failures. When a pipeline fails, this agentic orchestrator springs into action to deliver an automated, precise fix through the following steps:

  • Failure Detection: The system intercepts the pipeline failure webhook in real time.
  • Error Contextualization: It securely gathers the necessary information, including the exact tail-end of the failing job log and the affected code repository files.
  • Intelligent Routing: A sandboxed AI router, known as The Universal Troubleshooter, safely categorizes the identified problem.
  • Automated Remediation: A specialized Remediation Agent then writes a precise code patch, creates a new Git branch, and opens a Merge Request (MR) targeting the original branch.

Crucially, Duo Auto-Heal operates with a strict "Human-in-the-Loop" security model. It never pushes changes directly to production. Instead, it provides a verified, passing Merge Request, ready for a developer's one-click approval.
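As a rough illustration of the detection step, the sketch below filters an incoming GitLab pipeline webhook payload down to its failed jobs and plans a Merge Request that targets the original branch. The payload fields (object_kind, object_attributes, builds) follow GitLab's pipeline event format; the function names and the auto-heal/ branch naming are hypothetical and are not taken from the project's code.

```python
from __future__ import annotations

from typing import Any


def failed_jobs(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the failed jobs in a pipeline webhook payload, or [] if there is nothing to heal."""
    # Ignore push, merge request, and other event types.
    if payload.get("object_kind") != "pipeline":
        return []
    # Only react to pipelines whose overall status is "failed".
    if payload.get("object_attributes", {}).get("status") != "failed":
        return []
    return [
        {"job_id": build["id"], "name": build.get("name"), "stage": build.get("stage")}
        for build in payload.get("builds", [])
        if build.get("status") == "failed"
    ]


def plan_merge_request(payload: dict[str, Any]) -> dict[str, str]:
    """Describe the fix branch and its target; nothing is pushed to the original branch."""
    attrs = payload["object_attributes"]
    return {
        # Hypothetical naming scheme for the fix branch.
        "source_branch": f"auto-heal/pipeline-{attrs['id']}",
        # The MR targets the branch whose pipeline failed.
        "target_branch": attrs["ref"],
    }
```

Keeping the filter as a pure function means it can be unit-tested without a running web server or a GitLab connection.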

How we built it

AI-Powered GitLab CI/CD Troubleshooting System

This application features a modern, decoupled, event-driven architecture designed to autonomously troubleshoot and fix GitLab CI/CD pipeline errors.

Core Architecture

  • Decoupled Backend: A lightweight Python service using FastAPI handles GitLab webhooks as the primary entry point.
  • GitLab Integration: The python-gitlab library is used to manage repository operations, including extracting job traces, creating new branches, and generating Merge Requests (MRs).
  • AI Engine (The Universal Troubleshooter): Powered by Google's Gemini models, this component is highly constrained using system instructions and response_mime_type: "application/json" to ensure stable, predictable, and structured outputs.
  • Infrastructure: The entire service is containerized with Docker and deployed serverless on Google Cloud Run. Deployment is fully automated via a custom .gitlab-ci.yml pipeline that builds the container image, pushes it to Google Artifact Registry, and updates the Cloud Run service on every commit.

Innovations and Solutions to Key Challenges

  • Challenge: The "Giant Log" Problem (massive CI/CD logs wasting tokens and causing AI hallucination)
    Solution & Innovation: A custom extraction script keeps only the last 150 lines of a failed job trace, the section most likely to contain the actual error footprint.
  • Challenge: AI Hallucination & Safety (risk of AI writing malicious code based on poisoned logs)
    Solution & Innovation: The Universal Troubleshooter was designed as a strictly sandboxed, non-coding routing layer. It is isolated from the Remediation Engine and only permitted to categorize the error into strict JSON schemas (e.g., SYNTAX_ERROR, SECURITY_VULNERABILITY), creating a hard architectural safety boundary between untrusted log text and code generation.
  • Challenge: Cloud IAM Permissions (complex security configuration for cross-platform deployment)
    Solution & Innovation: Extensive debugging was required to lock down the exact minimum necessary Google Cloud IAM roles for GitLab to build and deploy to Cloud Run securely: Artifact Registry Writer, Cloud Run Admin, and Service Account User.
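The "Giant Log" mitigation can be sketched in a few lines. This is a minimal illustration, not the project's extraction script: the function name is hypothetical, and it accepts either bytes or text on the assumption that the raw job trace may arrive undecoded.

```python
from __future__ import annotations


def tail_of_trace(trace: bytes | str, max_lines: int = 150) -> str:
    """Return at most the last `max_lines` lines of a CI job trace."""
    if isinstance(trace, bytes):
        # Replace undecodable bytes instead of failing on a malformed log.
        trace = trace.decode("utf-8", errors="replace")
    if max_lines <= 0:
        return ""
    lines = trace.splitlines()
    return "\n".join(lines[-max_lines:])
```

Capping the input by line count keeps the prompt size roughly bounded whatever the full log length, at the cost of missing errors that surface earlier in the trace.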

Proud Accomplishments

The primary success of this project is the creation of a truly serverless, event-driven AI architecture that seamlessly connects GitLab webhooks with Google Cloud infrastructure. By enforcing strict JSON schemas for agent routing and implementing a secure, human-in-the-loop MR generation process, the solution moves beyond simple API calls and functions like a production-ready enterprise tool.
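One way the strict routing schema could be enforced on the service side is to reject any router reply that is not exactly one known category. The sketch below is an assumption about how such a check might look: the "category" field name and the function name are hypothetical, and only the two category values named in this write-up are listed.

```python
from __future__ import annotations

import json
from typing import Optional

# Only the categories named in the write-up; a real deployment would list its full set.
ALLOWED_CATEGORIES = frozenset({"SYNTAX_ERROR", "SECURITY_VULNERABILITY"})


def parse_router_reply(raw: str) -> Optional[str]:
    """Return the category if the reply is exactly {"category": <allowed value>}, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Reject anything that is not an object with the single expected key,
    # so extra fields (for example injected code) never reach the next stage.
    if not isinstance(data, dict) or set(data) != {"category"}:
        return None
    category = data["category"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        return None
    return category
```

Because the function returns only a value from a fixed set, text from a poisoned log cannot pass through this layer verbatim.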

What we learned

This project significantly advanced my expertise in core DevOps automation. Key areas of growth included mastering GitLab Webhook payload structures, configuring robust Google Cloud IAM security policies, and applying advanced prompt engineering techniques. Specifically, I gained a deeper understanding of how to implement operational guardrails to ensure AI models can function safely and reliably within enterprise CI/CD environments.

What's next for Duo Auto-Heal

The next step is to expand the Remediation Agent to handle complex, multi-file architectural fixes. Beyond that, I plan to package the Duo Auto-Heal webhook configuration and CI/CD deployment pipeline as an official, plug-and-play component for the GitLab CI/CD Catalog, allowing any organization to drop self-healing capabilities into their projects with a single line of code.

Built With

  • gemini-api
  • gitlab-api
  • gitlab-ci/cd
  • gitlab-webhooks
  • google-cloud