Inspiration
Every developer has faced this. 2am. Pipeline is red. You open the logs, find a bad version pin, spend 20 minutes reading the error, finding the correct version, updating the file, opening an MR, waiting for review. For a version pin.
We asked ourselves: why is a human doing this? The log tells you exactly what broke. The fix is mechanical. The MR is boilerplate. This is pure toil.
That question became Self-Healing Pipeline.
What we learned
The GitLab Duo Agent Platform is powerful but has real constraints in beta: multi-agent chaining via routers doesn't execute subsequent agents yet. We adapted by building a single healer agent that maintains full context across all three phases, producing more coherent fixes than siloed agents would.
Prompt engineering for reliability matters more than architectural complexity. Calling `get_job_logs` exactly once, using unique branch names per job ID, detecting existing MRs before creating duplicates: these small decisions made the difference between a demo that works once and one that works every time.

The hardest problem wasn't diagnosis; Claude Sonnet classifies failure types correctly every time. The hardest problem was making the fix-writing reliable across repeated runs without leaving orphaned branches or duplicate MRs.
How we built it
The agent runs three phases in sequence inside a single flow:
Phase 1 — Diagnose
Calls `get_job_logs` exactly once and classifies the failure into one of four types: `broken_dependency`, `flaky_test`, `real_bug`, or `config_error`. Claude Sonnet reasons about the error patterns in the raw log output.
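To make the taxonomy concrete, here is a rough Python sketch of the four-way classification. In the real flow Claude Sonnet does this by reasoning over the raw log; the `classify` heuristic below is purely illustrative, and the log patterns it matches are assumptions rather than the agent's actual rules.

```python
# Illustrative only: the agent classifies with Claude Sonnet, not regexes.
from enum import Enum

class FailureType(Enum):
    BROKEN_DEPENDENCY = "broken_dependency"
    FLAKY_TEST = "flaky_test"
    REAL_BUG = "real_bug"
    CONFIG_ERROR = "config_error"

def classify(log: str) -> FailureType:
    """Heuristic stand-in for the model's reasoning over the job log."""
    if "No matching distribution found" in log:   # pip can't resolve a pin
        return FailureType.BROKEN_DEPENDENCY
    if "TimeoutError" in log or "ConnectionResetError" in log:
        return FailureType.FLAKY_TEST             # timing or network noise
    if "could not be parsed" in log and ".yml" in log:
        return FailureType.CONFIG_ERROR           # e.g. a broken CI config
    return FailureType.REAL_BUG                   # default to the safe bucket
```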
Phase 2 — Fix
For `broken_dependency`: reads the requirements file, replaces the bad version pin with the latest stable version, commits to a unique branch `fix/auto-heal-{job_id}`, and opens an MR (both mechanical repairs are sketched below).
For `flaky_test`: reads the failing test files, adds `@pytest.mark.flaky(reruns=3, reruns_delay=1)` to each non-deterministic test function, adds `pytest-rerunfailures` to requirements, commits both files, and opens an MR.
For `real_bug` and `config_error`: skips Phase 2 entirely — logic errors are unsafe to auto-fix. Goes straight to report.
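A minimal sketch of the two mechanical repairs in plain Python, assuming Phase 1 already identified the package, its latest stable version, and the failing test names. The helper names (`bump_pin`, `mark_flaky`, `ensure_rerunfailures`) are illustrative rather than the agent's literal tool calls; only the `@pytest.mark.flaky` marker itself comes from pytest-rerunfailures.

```python
import re
from pathlib import Path

MARKER = "@pytest.mark.flaky(reruns=3, reruns_delay=1)"

def bump_pin(req_file: Path, package: str, latest: str) -> None:
    """broken_dependency: swap the bad pin for the latest stable version."""
    reqs = req_file.read_text()
    reqs = re.sub(rf"^{re.escape(package)}==.*$", f"{package}=={latest}",
                  reqs, flags=re.MULTILINE)
    req_file.write_text(reqs)

def mark_flaky(test_file: Path, test_names: list[str]) -> None:
    """flaky_test: prepend the rerun marker to each named test function."""
    source = test_file.read_text()
    for name in test_names:
        # Match the def line and insert the marker with matching indentation.
        pattern = re.compile(rf"^([ \t]*)def {re.escape(name)}\(", re.MULTILINE)
        source = pattern.sub(rf"\1{MARKER}\n\1def {name}(", source, count=1)
    test_file.write_text(source)

def ensure_rerunfailures(req_file: Path) -> None:
    """flaky_test: add pytest-rerunfailures to requirements if missing."""
    reqs = req_file.read_text()
    if "pytest-rerunfailures" not in reqs:
        req_file.write_text(reqs.rstrip("\n") + "\npytest-rerunfailures\n")
```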
Phase 3 — Report
Posts a structured summary comment on the triggering issue with the root cause, what was fixed, and a direct link to the MR.
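Here is a minimal sketch of that report step using python-gitlab; the token, project path, issue IID, and summary text are placeholders for what the flow actually fills in.

```python
import gitlab

gl = gitlab.Gitlab("https://gitlab.com", private_token="<service-account-token>")
project = gl.projects.get("<group>/<write-target-project>")
issue = project.issues.get(1)  # the tracking issue that triggered the run

# Post the structured summary as a comment on the triggering issue.
issue.notes.create({
    "body": (
        "## Self-Healing Pipeline report\n"
        "* **Root cause:** broken_dependency (bad version pin)\n"
        "* **Fix:** bumped the pin to the latest stable release\n"
        "* **MR:** <link to the fix/auto-heal-{job_id} merge request>\n"
    )
})
```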
Auto-trigger
A `.post` CI job in the target project fires automatically on pipeline failure, mentions the agent in a tracking issue, and the full repair loop runs without any human involvement.
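A minimal sketch of what that trigger can look like in `.gitlab-ci.yml`, assuming a project access token stored in a `HEALER_TOKEN` CI variable, a tracking issue with IID 1, and `@healer-agent` as the agent's handle; the `.post` stage, `when: on_failure`, and the `CI_*` variables are standard GitLab CI.

```yaml
notify-healer:
  stage: .post              # reserved stage, always runs after everything else
  when: on_failure          # only fires when an earlier job in the pipeline failed
  image: curlimages/curl:latest
  script:
    - >
      curl --request POST
      --header "PRIVATE-TOKEN: $HEALER_TOKEN"
      --data-urlencode "body=@healer-agent pipeline $CI_PIPELINE_ID failed in $CI_PROJECT_PATH, please repair"
      "$CI_API_V4_URL/projects/$CI_PROJECT_ID/issues/1/notes"
```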
Challenges
Platform constraints
The service account can only write to projects inside the hackathon group. We worked around this by making project 19983282 the write target for all fix commits while still reading job logs from any accessible project.
Multi-agent chaining
We originally built this as three separate specialist agents chained via routers. The platform's beta only executes the first agent and closes the session. We collapsed everything into a single healer agent — which turned out to produce better results anyway, since it maintains full context across all three phases.
Reliability across repeated runs
Early sessions would fail if the fix branch already existed from a previous run. We added unique branch naming per job ID, branch collision fallback with -2, -3 suffixes, and existing MR detection — so the agent handles repeated triggers gracefully without manual cleanup.
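A sketch of those idempotency checks with python-gitlab; the helper names are hypothetical, but the branch and MR lookups mirror what the agent does before writing anything.

```python
from gitlab.exceptions import GitlabGetError

def pick_branch(project, job_id: int) -> str:
    """Return fix/auto-heal-{job_id}, falling back to -2, -3, ... on collision."""
    base = f"fix/auto-heal-{job_id}"
    name, n = base, 1
    while True:
        try:
            project.branches.get(name)   # branch exists: try the next suffix
        except GitlabGetError:
            return name                  # 404: this name is free to use
        n += 1
        name = f"{base}-{n}"

def existing_mr(project, branch: str):
    """Return the open MR from an earlier run for this branch, if any."""
    mrs = project.mergerequests.list(source_branch=branch, state="opened")
    return mrs[0] if mrs else None
```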
Knowing what not to fix
The hardest design decision was intentional restraint. The agent does not touch business logic. It does not guess at algorithm fixes. It repairs mechanical failures and escalates everything else. Teaching an AI agent what it should NOT do was harder than teaching it what to do.
Built With
- agent
- anthropic
- api
- claude
- duo
- gitlab
- pip
- pytest
- pytest-rerunfailures
- python
- yaml