Introduction
Pipeline Doctor 1.0 is a software reliability engineering agent built on the GitLab Duo Agent Platform. It analyzes failed pipelines using job logs, repository context, and issue history, then returns a structured diagnosis with a confidence score and fix recommendations.
Inspiration
CI/CD failures are one of the biggest productivity drains in software teams. Engineers often spend hours manually parsing long job traces to find one meaningful error. We built Pipeline Doctor 1.0 to automate that triage process and turn noisy failure output into clear, actionable reliability guidance.
What It Does
Pipeline Doctor 1.0 performs SRE-style pipeline diagnostics through a structured framework:
- Failure categorization: classifies failures as INFRA, DEPENDENCY, SYNTAX, SECURITY, FUNCTIONAL, or FLAKY.
- Evidence gathering: reads job traces and source files to extract concrete failure evidence (error lines, stack traces, file-level context).
- Context correlation: correlates failure evidence with code changes and historical issue patterns.
- Flakiness detection: flags likely non-deterministic failures based on repeated historical signatures.
- Prescriptive remediation: recommends concrete next actions, such as a code patch, CI variable update, pipeline configuration fix, or infrastructure adjustment.
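The categorization and flakiness steps above can be illustrated with a small rule-based sketch. This is not the agent's implementation: the agent reasons over traces through its prompt, whereas the regex patterns, the `normalize` and `categorize` helpers, the `UNKNOWN` fallback, and the idea that `history` holds signatures of past failures that later passed on retry are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical keyword patterns per category (illustration only).
CATEGORY_PATTERNS = {
    "INFRA": r"runner system failure|no space left on device|connection refused",
    "DEPENDENCY": r"could not resolve dependency|no matching distribution|modulenotfounderror",
    "SYNTAX": r"syntaxerror|unexpected token",
    "SECURITY": r"secret detected|vulnerabilit",
    "FUNCTIONAL": r"assertionerror|test failed",
}


def normalize(line):
    """Turn an error line into a signature by masking numbers and hex values."""
    return re.sub(r"0x[0-9a-f]+|\d+", "N", line.lower()).strip()


def categorize(trace, history, flaky_threshold=2):
    """Return (category, evidence_line) for the first matching trace line.

    `history` is assumed to be a list of normalized signatures from earlier
    failures that passed on retry; a signature seen at least
    `flaky_threshold` times is reported as FLAKY instead of its raw category.
    """
    seen = Counter(history)
    for line in trace.splitlines():
        for category, pattern in CATEGORY_PATTERNS.items():
            if re.search(pattern, line, re.IGNORECASE):
                if seen[normalize(line)] >= flaky_threshold:
                    return "FLAKY", line.strip()
                return category, line.strip()
    # Fallback used only by this sketch; it is not one of the agent's categories.
    return "UNKNOWN", ""
```

Masking numbers in the signature lets two failures that differ only in a port, line number, or timestamp count as the same historical pattern, which is what makes the repeated-signature check usable.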
How We Built It
We implemented Pipeline Doctor 1.0 as a GitLab Duo custom agent and flow using the official hackathon template structure.
Core components:
- Agent definition: agents/agent.yml
- Flow definition: flows/flow.yml
- Catalog mapping: .ai-catalog-mapping.json
Tooling used by the agent:
- get_job_logs
- read_file
- read_files
- list_issues
We restructured the agent prompt into a multi-phase SRE reasoning workflow with an explicit output schema and confidence scoring.
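To make the idea of an explicit output schema concrete, here is a minimal sketch of how a consumer might validate the agent's structured diagnosis. The write-up does not publish the actual schema, so the field names (`category`, `confidence`, `evidence`, `recommendation`), the 0.0 to 1.0 confidence range, and the `parse_diagnosis` helper are all hypothetical.

```python
import json
from dataclasses import dataclass, field

# Categories as listed in this write-up.
CATEGORIES = {"INFRA", "DEPENDENCY", "SYNTAX", "SECURITY", "FUNCTIONAL", "FLAKY"}


@dataclass
class Diagnosis:
    """Hypothetical shape of one structured diagnosis."""
    category: str
    confidence: float  # assumed to be a score between 0.0 and 1.0
    evidence: list = field(default_factory=list)
    recommendation: str = ""

    def __post_init__(self):
        # Reject output that drifts from the expected schema.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence!r}")


def parse_diagnosis(raw):
    """Parse a JSON string into a validated Diagnosis."""
    return Diagnosis(**json.loads(raw))
```

Validating at the boundary means a malformed or off-schema response fails loudly instead of being passed on as a trustworthy diagnosis.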
Tech Stack
- GitLab Duo Agent Platform
- YAML agent/flow orchestration
- Python (test scenarios)
- GitLab CI/CD
- Markdown documentation for operations and submission packaging
Challenges We Ran Into
- CI behavior differed between branch/test contexts and project-level settings.
- Catalog validation of tool names was strict: near-miss names produced schema mismatch errors.
- Distinguishing platform/enablement issues from agent-logic issues required isolating variables quickly.
- Writing prompts that were strict enough to be reliable yet concise enough to stay useful was difficult.
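One way to catch near-miss tool names before the catalog rejects them is a local pre-check. The sketch below is an assumption, not part of the project: `KNOWN_TOOLS` contains only the four tool names mentioned in this write-up rather than the full catalog, and `check_tool_names` is a hypothetical helper that takes a plain list of names instead of parsing the agent definition.

```python
import difflib

# Only the tools named in this write-up; the real catalog is larger.
KNOWN_TOOLS = ["get_job_logs", "read_file", "read_files", "list_issues"]


def check_tool_names(names, known=KNOWN_TOOLS):
    """Map each unrecognized tool name to a list with its closest known name.

    The list is empty when nothing in `known` is similar enough.
    """
    problems = {}
    for name in names:
        if name not in known:
            problems[name] = difflib.get_close_matches(name, known, n=1, cutoff=0.6)
    return problems
```

For example, `check_tool_names(["get_job_log", "list_issues"])` reports `get_job_log` with `get_job_logs` as the suggested correction, while the valid `list_issues` is not reported.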
Accomplishments That We're Proud Of
- Built a robust, structured SRE diagnostic framework instead of a generic log summarizer.
- Added confidence-aware output schema to improve trust and decision-making speed.
- Designed category-specific remediation paths (code/config/env/infra/flake handling).
- Produced a reusable baseline architecture that can scale to richer autonomous reliability workflows.
What We Learned
- Prompt quality is architecture: explicit workflows outperform vague instructions.
- Reliability automation needs both technical diagnosis and confidence transparency.
- DevEx improves dramatically when AI output is structured for fast operational action.
- Catalog/schema alignment is as important as model quality for production workflows.
What's Next for Pipeline Doctor
- Add auto-remediation MRs for high-confidence failures.
- Add pipeline cost and waste analytics for failed runs.
- Introduce pattern-learning from resolved incidents to improve recommendations over time.
- Add team-facing integrations (MR comments, chatops summaries, incident routing).
- Expand to cross-repo reliability intelligence for platform teams.
Impact
Pipeline Doctor 1.0 is designed to reduce mean time to repair by converting unstructured pipeline failure noise into focused, actionable diagnostics. The expected impact is faster triage, fewer context switches, and improved delivery reliability.