Inspiration

Every DevOps team has experienced this: a pipeline fails at 2 AM, a production alert fires, and engineers scramble to diagnose and fix the issue. We asked ourselves — what if GitLab could heal itself?

SentryFlow was inspired by the concept of self-healing systems in Site Reliability Engineering. Instead of leaving engineers to read logs, trace errors, and write fixes by hand, we built an AI-powered agent flow that automates the path from detection to diagnosis to fix deployment.

What it does

SentryFlow is a four-agent self-healing DevOps flow built on the GitLab Duo Agent Platform:

  1. 🔍 Sentinel Agent — Ingests pipeline failure events and GCP Cloud Monitoring alerts, normalizing them into a unified incident schema
  2. 🔬 Diagnostician Agent — Performs root cause analysis by reading job logs, analyzing code diffs, checking recent commits, and classifying failures (test failures, dependency issues, config errors, infrastructure timeouts)
  3. 🔧 Surgeon Agent — Takes action based on confidence level: creates auto-fix merge requests (high confidence), rollback MRs (production incidents), or triage issues (low confidence)
  4. 📋 Reporter Agent — Posts structured summaries, links related artifacts, and maintains an audit trail
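
As a rough illustration of the Surgeon's decision, the routing could be sketched in Python as below. This is a minimal sketch, not the agent's actual logic (which lives in its system prompt): the `Diagnosis` fields, the `Action` names, and the rule that production incidents are rolled back before confidence is considered are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    AUTO_FIX_MR = "auto_fix_merge_request"
    ROLLBACK_MR = "rollback_merge_request"
    TRIAGE_ISSUE = "triage_issue"


@dataclass(frozen=True)
class Diagnosis:
    # Hypothetical fields; the real schema is defined in AGENTS.md.
    confidence: str       # "high" or "low"
    is_production: bool   # True when the incident affects production


def choose_action(diagnosis: Diagnosis) -> Action:
    """Pick the Surgeon's action for a diagnosed incident."""
    # Assumed precedence: a production incident is rolled back first,
    # regardless of how confident the diagnosis is.
    if diagnosis.is_production:
        return Action.ROLLBACK_MR
    if diagnosis.confidence == "high":
        return Action.AUTO_FIX_MR
    # Anything short of high confidence is handed to a human.
    return Action.TRIAGE_ISSUE
```

Defaulting to a triage issue keeps the failure mode conservative: an unrecognized confidence value never produces an automatic change.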

The flow also handles edge cases: it detects flaky tests, analyzes multi-job failures, and suppresses duplicate alerts.
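
Duplicate alert suppression, for example, can be approximated with a fingerprint and a time window. The sketch below is one possible approach, not SentryFlow's implementation: the incident field names and the 15-minute default window are assumptions.

```python
import hashlib


def fingerprint(incident: dict) -> str:
    """Stable key for incidents that describe the same failure."""
    # Hypothetical field names; the write-up does not list the schema fields.
    parts = (incident["source"], incident["project"],
             incident["category"], incident["ref"])
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


class DuplicateSuppressor:
    """Drops incidents whose fingerprint was seen within `window_seconds`."""

    def __init__(self, window_seconds: float = 900.0) -> None:
        self.window_seconds = window_seconds
        self._last_seen = {}  # fingerprint -> timestamp of the latest sighting

    def should_process(self, incident: dict, now: float) -> bool:
        key = fingerprint(incident)
        previous = self._last_seen.get(key)
        # Record every sighting, so an alert that keeps firing stays
        # suppressed until it has been quiet for a full window.
        self._last_seen[key] = now
        return previous is None or now - previous >= self.window_seconds
```

Passing `now` explicitly, rather than reading the clock inside the method, keeps the behavior deterministic and easy to test.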

How we built it

SentryFlow is built entirely on the GitLab Duo Agent Platform using:

  • Custom Agents (YAML) — Four agents with specialized system prompts and curated tool sets
  • Custom Flow (YAML v1 schema) — Linear orchestration: Sentinel → Diagnostician → Surgeon → Reporter
  • AGENTS.md — Project-level context defining the unified incident schema, failure categories, confidence levels, and coding conventions
  • SKILL.md files — Reusable patterns for pipeline diagnosis, incident correlation, and GCP integration
  • Unified Incident Schema — A JSON contract that flows between all four agents, ensuring structured communication

The architecture follows a clear separation of concerns — each agent has a single responsibility and communicates through a well-defined schema.
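
To make the idea of a JSON contract concrete, here is a hedged sketch of what such a schema could look like as a Python dataclass. The field names are illustrative assumptions; only the three incident sources and the four failure categories come from the description above, and the actual schema is defined in AGENTS.md.

```python
import json
from dataclasses import asdict, dataclass, field

# Sources named in the write-up: pipeline failures, GCP alerts, mentions.
SOURCES = {"pipeline", "gcp_alert", "mention"}
# Failure classes named in the write-up, plus an assumed "unknown" fallback.
CATEGORIES = {"test_failure", "dependency_issue", "config_error",
              "infrastructure_timeout", "unknown"}


@dataclass
class Incident:
    """Hypothetical shape of the incident passed between agents."""
    incident_id: str
    source: str
    project: str
    summary: str
    category: str = "unknown"
    confidence: str = "low"
    artifacts: list = field(default_factory=list)  # links to MRs, issues, logs

    def __post_init__(self) -> None:
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source!r}")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "Incident":
        # Unexpected or missing keys raise TypeError, so a malformed
        # hand-off fails loudly instead of propagating downstream.
        return cls(**json.loads(payload))
```

Validating on construction means each agent can reject a bad hand-off at the boundary instead of reasoning over a partially filled incident.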

Challenges we faced

  1. Group CI Policy Override — The hackathon group's pipeline execution policy overrides project-level .gitlab-ci.yml, preventing us from triggering real pipeline failures. We adapted by using the Mention trigger with pre-filled incident context.

  2. No Pipeline Event Trigger — GitLab Duo flows currently only support Mention, Assign, and Assign Reviewer triggers — not pipeline failure events. We designed the Sentinel agent to handle mention-based triggers as a third path alongside pipeline failures and GCP alerts.

  3. Template Variable Mapping — We discovered that flow component input names must exactly match the variable names used in the prompt template ({{sentinel}}, not {{incident_context}}). Tracking this down required reading session logs to understand how the flow engine injects variables.

  4. Platform Configuration — We hit rootNamespaceId resolution issues and token scope limitations in the hackathon sandbox environment, which we could only diagnose by carefully reading the Duo CLI execution logs.
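
The template variable mismatch from challenge 3 is the kind of error a small pre-flight check can catch. The helper below is a hypothetical sketch and not part of the GitLab Duo tooling: it assumes placeholders use the `{{name}}` form shown above and that the input names are supplied by the caller.

```python
import re

# Matches {{name}} placeholders, allowing optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def check_template_inputs(template: str, input_names: set) -> dict:
    """Compare placeholders in a prompt template with declared input names."""
    used = set(PLACEHOLDER.findall(template))
    return {
        # Referenced in the prompt but never supplied by the flow.
        "missing_inputs": used - input_names,
        # Supplied by the flow but never referenced in the prompt.
        "unused_inputs": input_names - used,
    }
```

For instance, `check_template_inputs("Incident: {{incident_context}}", {"sentinel"})` reports `incident_context` as missing and `sentinel` as unused, which is exactly the mismatch described above.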

What we learned

  • The GitLab Duo Agent Platform can express a multi-agent workflow declaratively: our four agents and the flow that chains them are defined in YAML
  • Designing a clear data contract (unified incident schema) between agents is critical for reliable multi-agent systems
  • Self-healing systems need confidence-based routing — not every fix should be auto-applied
  • Edge case handling (flaky tests, cascading failures, duplicate alerts) separates a toy demo from a production-ready system
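
As one example of the edge cases listed above, a flaky test can be approximated as a job that both failed and succeeded on the same commit. The sketch below shows that heuristic only; the record shape (`name`, `sha`, `status`) is an assumption, and SentryFlow's agents may well use different signals.

```python
from collections import defaultdict


def find_flaky_jobs(job_runs: list) -> list:
    """Return names of jobs that both failed and passed on the same commit."""
    outcomes = defaultdict(set)
    for run in job_runs:
        # Hypothetical record shape: {"name": ..., "sha": ..., "status": ...}
        outcomes[(run["name"], run["sha"])].add(run["status"])
    flaky = {
        name
        for (name, _sha), statuses in outcomes.items()
        if {"success", "failed"} <= statuses
    }
    return sorted(flaky)
```

A job that fails consistently on a commit is not reported, so real regressions still reach the diagnosis step.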

What's next for SentryFlow

  • GCP Cloud Function webhook that turns pipeline failure events into agent triggers, so the flow starts without a manual mention
  • BigQuery audit trail for tracking all incidents and resolutions
  • Learning loop — agents improve diagnosis accuracy based on whether their MRs were accepted or rejected
  • Multi-project support — one SentryFlow instance monitoring an entire group's pipelines

Built With

  • ai-agents
  • ci-cd
  • cloud-monitoring
  • devops
  • gcp
  • gitlab
  • gitlab-duo
  • python
  • yaml