Pipeline Doctor 1.0

Introduction

Pipeline Doctor 1.0 is a site reliability engineering (SRE) agent built on the GitLab Duo Agent Platform. It analyzes failed pipelines using job logs, repository context, and issue history, then returns structured diagnostics with a confidence score and fix recommendations.

Inspiration

CI/CD failures are one of the biggest productivity drains in software teams. Engineers often spend hours manually parsing long job traces to find one meaningful error. We built Pipeline Doctor 1.0 to automate that triage process and turn noisy failure output into clear, actionable reliability guidance.

What It Does

Pipeline Doctor 1.0 performs SRE-style pipeline diagnostics through a structured framework:

- Failure categorization: classifies failures as INFRA, DEPENDENCY, SYNTAX, SECURITY, FUNCTIONAL, or FLAKY.
- Evidence gathering: reads job traces and source files to extract concrete failure evidence (error lines, stack traces, file-level context).
- Context correlation: correlates failure evidence with code changes and historical issue patterns.
- Flakiness detection: flags likely non-deterministic failures based on repeated historical signatures.
- Prescriptive remediation: recommends concrete next actions, such as a code patch, CI variable update, pipeline configuration fix, or infrastructure adjustment.
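
To make the categorization and flakiness steps above concrete, here is a minimal rule-based sketch in Python. It is an illustration only: the agent itself reasons over logs through its prompt, and the keyword rules, the `classify` and `normalize` helpers, the `UNKNOWN` fallback, and the flakiness threshold are all hypothetical. The sketch assumes `history` holds evidence lines from earlier failures that passed on retry without a code change.

```python
import re
from collections import Counter

# Hypothetical keyword rules, checked in order; not the agent's real logic.
RULES = [
    ("SECURITY", re.compile(r"secret detected|CVE-\d{4}-\d+", re.I)),
    ("DEPENDENCY", re.compile(r"ModuleNotFoundError|No matching distribution", re.I)),
    ("SYNTAX", re.compile(r"SyntaxError|unexpected token", re.I)),
    ("INFRA", re.compile(r"no space left on device|connection refused", re.I)),
    ("FUNCTIONAL", re.compile(r"AssertionError|\d+ failed", re.I)),
]


def normalize(line: str) -> str:
    """Replace volatile details (numbers, hex ids) so recurring failures share a signature."""
    return re.sub(r"0x[0-9a-f]+|\d+", "<n>", line.strip().lower())


def classify(trace: str, history: list[str], flaky_threshold: int = 2) -> dict:
    """Classify the first trace line that matches a rule.

    If the same normalized signature already appears in `history` at least
    `flaky_threshold` times, the failure is reported as FLAKY instead.
    """
    for line in trace.splitlines():
        for category, pattern in RULES:
            if pattern.search(line):
                signature = normalize(line)
                seen = Counter(normalize(h) for h in history)[signature]
                if seen >= flaky_threshold:
                    category = "FLAKY"
                return {"category": category, "evidence": line.strip(), "seen_before": seen}
    # Fallback used only by this sketch; it is not one of the six categories above.
    return {"category": "UNKNOWN", "evidence": "", "seen_before": 0}


if __name__ == "__main__":
    trace = "$ pytest\nE   ModuleNotFoundError: No module named 'requests'"
    print(classify(trace, history=[]))
```

Keeping the matched line as `evidence` mirrors the evidence-gathering step: every category is backed by a concrete line from the trace rather than a bare label.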

How We Built It

We implemented Pipeline Doctor 1.0 as a GitLab Duo custom agent and flow using the official hackathon template structure.

Core components:

Agent definition: agents/agent.yml
Flow definition: flows/flow.yml
Catalog mapping: .ai-catalog-mapping.json

Tooling used by the agent:

get_job_logs
read_file
read_files
list_issues

We rewrote the agent prompt as a multi-phase SRE reasoning workflow with an explicit output schema and confidence scoring.
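
As a rough illustration of what an explicit output schema with confidence scoring could look like, the Python sketch below models one diagnosis as a validated dataclass. The `Diagnosis` class, its field names, and the 0 to 1 confidence scale are assumptions made for this example; the actual schema lives in the agent prompt and may differ.

```python
import json
from dataclasses import asdict, dataclass, field

# The six categories named in this writeup.
CATEGORIES = {"INFRA", "DEPENDENCY", "SYNTAX", "SECURITY", "FUNCTIONAL", "FLAKY"}


@dataclass
class Diagnosis:
    """Hypothetical shape of one structured diagnostic result."""

    category: str
    confidence: float  # assumed scale: 0.0 (guess) to 1.0 (certain)
    evidence: list[str] = field(default_factory=list)
    recommendation: str = ""

    def __post_init__(self) -> None:
        # Reject values outside the schema so downstream tooling can rely on it.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

    def to_json(self) -> str:
        """Serialize the diagnosis for an MR comment, log line, or API response."""
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    result = Diagnosis(
        category="DEPENDENCY",
        confidence=0.9,
        evidence=["ModuleNotFoundError: No module named 'requests'"],
        recommendation="Add the missing package to the project's dependency file.",
    )
    print(result.to_json())
```

Validating the category and confidence at construction time means a malformed response fails loudly instead of reaching an engineer as a plausible-looking diagnosis.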

Tech Stack

GitLab Duo Agent Platform
YAML agent/flow orchestration
Python (test scenarios)
GitLab CI/CD
Markdown documentation for operations and submission packaging

Challenges We Ran Into

CI behavior differed between branch/test contexts and project-level settings.
Catalog validation of tool names was strict: near-miss names produced schema mismatch errors.
Distinguishing platform/enablement issues from agent-logic issues required isolating variables quickly.
Prompts had to be both strict and useful without becoming overly verbose.
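
One way to catch near-miss tool names before the catalog rejects them is a small local pre-check. The Python sketch below is a hypothetical helper, not part of the GitLab tooling: `KNOWN_TOOLS` contains only the four tool names listed in this writeup (the real catalog may contain more), and `check_tool_names` is an invented name.

```python
import difflib

# Tool names as listed in this writeup; assumed, not an authoritative catalog.
KNOWN_TOOLS = ["get_job_logs", "read_file", "read_files", "list_issues"]


def check_tool_names(requested: list[str], known: list[str] = KNOWN_TOOLS) -> dict[str, list[str]]:
    """Map each unrecognized tool name to its closest known names (possibly none)."""
    problems: dict[str, list[str]] = {}
    for name in requested:
        if name not in known:
            # get_close_matches ranks candidates by similarity ratio, best first.
            problems[name] = difflib.get_close_matches(name, known, n=2, cutoff=0.6)
    return problems


if __name__ == "__main__":
    for bad, suggestions in check_tool_names(["get_job_log", "read_file"]).items():
        print(f"unknown tool {bad!r}; did you mean {suggestions}?")
```

Running a check like this in a test job would surface a typo such as `get_job_log` with a suggested correction, rather than a schema mismatch at submission time.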

Accomplishments That We're Proud Of

Built a robust, structured SRE diagnostic framework instead of a generic log summarizer.
Added a confidence-aware output schema to improve trust and decision-making speed.
Designed category-specific remediation paths (code/config/env/infra/flake handling).
Produced a reusable baseline architecture that can scale to richer autonomous reliability workflows.

What We Learned

Prompt quality is architecture: explicit workflows outperform vague instructions.
Reliability automation needs both technical diagnosis and confidence transparency.
DevEx improves dramatically when AI output is structured for fast operational action.
Catalog/schema alignment is as important as model quality for production workflows.

What's Next for Pipeline Doctor

Add auto-remediation MRs for high-confidence failures.
Add pipeline cost and waste analytics for failed runs.
Introduce pattern-learning from resolved incidents to improve recommendations over time.
Add team-facing integrations (MR comments, chatops summaries, incident routing).
Expand to cross-repo reliability intelligence for platform teams.

Impact

Pipeline Doctor 1.0 is designed to reduce mean time to repair by converting unstructured pipeline failure noise into focused, actionable diagnostics. The expected impact is faster triage, fewer context switches, and improved delivery reliability.

Built With

gitlab-ci/cd
gitlab-duo-agent
markdown
python
yaml

Updates

John Mamodu started this project — Mar 25, 2026 03:29 PM EDT
