💡 Inspiration

Every team that runs CI knows the quiet betrayal of a flaky test: it passes on one run and fails on the next, on the exact same commit. No code changed. Nothing is "broken." But the red ❌ is there.

So people learn the worst habit in software: "just re-run it until it's green." Real failures start hiding behind fake ones, trust in the pipeline erodes, and eventually a genuine bug walks straight through a suite nobody believes anymore.

Formally, a test $t$ is flaky when its verdict is non-deterministic for a fixed commit:

$$\text{flaky}(t)\iff \exists\,c:\quad 0 < \Pr\big[\,t\text{ passes}\mid \text{commit}=c\,\big] < 1$$

We wanted an agent that doesn't just flag that lie, but reproduces it, explains it, fixes it, proves the fix, and then stops, handing the final decision to a human. It acts through GitLab's own first-party tooling, with least privilege and a full audit trail. That's Aegis Guardian: an autonomous flaky-test doctor for GitLab.

🩺 What it does

Aegis Guardian is a crew of five specialist AI agents that runs an end-to-end remediation for a flaky test:

  1. Detector: reads pipeline history and finds the "lie": a test that both passed and failed on the same SHA.
  2. Reproducer: re-triggers the pipeline to capture hard evidence of non-determinism.
  3. Diagnostician: reads job logs + source and classifies the root cause (random input, race condition, state leak) with a confidence score.
  4. Fixer: writes the minimal fix on a new branch and opens a Merge Request, but never merges.
  5. Validator: runs the fix N times, confirms it is green every time, comments the verdict, and hands off to a human.

Everything streams live to a command center UI as it happens.

🧠 How it works: the crew pipeline

   MCP read        MCP read           MCP read         REST writes      REST verify
 ┌───────────┐  ┌────────────┐  ┌────────────────┐  ┌──────────┐  ┌──────────────┐
 │ DETECTOR  │─▶│ REPRODUCER │─▶│ DIAGNOSTICIAN  │─▶│  FIXER   │─▶│  VALIDATOR   │
 │ history → │  │ re-runs →  │  │ logs+source →  │  │ branch + │  │ green N/N →  │
 │ candidate │  │ evidence   │  │ root cause     │  │ commit→MR│  │ human        │
 └───────────┘  └────────────┘  └────────────────┘  └────┬─────┘  └──────┬───────┘
 flaky_candidates   evidence       diagnosis             │ patch         │ verdict
        └──────────── shared session state flows down ───┘               │
                                                         ▲               ▼
                                                 HUMAN-IN-THE-LOOP  comment on MR
                                                  "Authorize MR"  "a human merges"

State is passed agent-to-agent as a typed chain: flaky_candidates → evidence → diagnosis → patch → verdict.
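One way to picture this chain is the plain-Python sketch below. It is not the project's ADK code: `Stage`, `run_crew`, and the stub stages are hypothetical names invented for illustration, and the values they return are made up. Each stage reads the shared state and writes its result under exactly one key.

```python
from typing import Any, Callable

# Hypothetical stage type: reads the shared state, returns its own output.
Stage = Callable[[dict[str, Any]], Any]

# Output keys in the order given in the text.
CHAIN = ["flaky_candidates", "evidence", "diagnosis", "patch", "verdict"]


def run_crew(stages: dict[str, Stage]) -> dict[str, Any]:
    """Run one stage per key, in order, accumulating results in shared state."""
    state: dict[str, Any] = {}
    for key in CHAIN:
        # Each stage sees everything written by the stages before it.
        state[key] = stages[key](state)
    return state


# Stub stages standing in for the five agents (illustrative values only).
stub_stages: dict[str, Stage] = {
    "flaky_candidates": lambda s: ["test_checkout"],
    "evidence": lambda s: {t: ["pass", "fail"] for t in s["flaky_candidates"]},
    "diagnosis": lambda s: {"cause": "random input", "tests": list(s["evidence"])},
    "patch": lambda s: {"branch": "fix/flaky", "cause": s["diagnosis"]["cause"]},
    "verdict": lambda s: "handed to human" if s["patch"] else "no patch",
}

final_state = run_crew(stub_stages)
```

Keeping one output key per stage means a later agent can only depend on what an earlier agent explicitly published, which keeps the hand-offs easy to audit.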

The Detector's heuristic is simple and explainable: a commit $c$ is suspect when its set of observed run outcomes $\mathcal{O}(c)$ contains both verdicts:

$$\{\text{pass},\text{fail}\}\subseteq \mathcal{O}(c)$$
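As a rough sketch of that check, assuming test results are available as `(sha, test_name, outcome)` tuples (a record shape chosen for this example, not GitLab's API format), the heuristic is a set-containment test per commit and test. The function name `find_flaky_candidates` is hypothetical.

```python
from collections import defaultdict


def find_flaky_candidates(runs):
    """Return sorted (sha, test) pairs whose outcomes include both verdicts.

    `runs` is an iterable of (sha, test_name, outcome) tuples; this shape is
    an assumption made for the sketch.
    """
    outcomes = defaultdict(set)
    for sha, test, outcome in runs:
        outcomes[(sha, test)].add(outcome)
    # Suspect when {"pass", "fail"} is a subset of the observed outcomes.
    return sorted(key for key, seen in outcomes.items() if {"pass", "fail"} <= seen)


history = [
    ("abc123", "test_login", "pass"),
    ("abc123", "test_login", "fail"),
    ("abc123", "test_cart", "pass"),
    ("def456", "test_cart", "fail"),  # different commit, so not evidence of flakiness
]
candidates = find_flaky_candidates(history)
```

Here only `test_login` on `abc123` is flagged; `test_cart` changed verdict across two different commits, which may be a real regression rather than a flake.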

The Validator's gate only accepts a fix when the target test passes in all $N$ runs ($N=4$):

$$\text{accept}\iff\bigwedge_{i=1}^{N}\text{pass}_i$$
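The gate itself is a plain all-pass check. The sketch below uses hypothetical names (`accept_fix`, `chance_flake_slips_through`) and assumes each run is recorded as the string "pass" or "fail". The second helper is included because N green runs lower the odds of a lingering flake but cannot rule one out: assuming independent runs, a test that still fails with probability p passes all N runs with probability (1-p)^N.

```python
def accept_fix(results, n=4):
    """Accept only when exactly n runs were recorded and all of them passed."""
    return len(results) == n and all(r == "pass" for r in results)


def chance_flake_slips_through(p_fail, n=4):
    """Probability that a test still failing with probability p_fail
    passes n independent runs in a row."""
    return (1.0 - p_fail) ** n


all_green = accept_fix(["pass", "pass", "pass", "pass"])
one_red = accept_fix(["pass", "pass", "fail", "pass"])
too_few = accept_fix(["pass", "pass", "pass"])
```

For example, a test that still fails one run in five would pass four runs in a row about 41% of the time, which is one reason the final merge decision stays with a human.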

| # | Agent | Reads / Writes | Tools | Output |
|---|---|---|---|---|
| 1 | Detector | read (MCP) | list_pipelines, get_pipeline_jobs | candidates |
| 2 | Reproducer | read (MCP) | trigger_pipeline, get_job_log | evidence |
| 3 | Diagnostician | read (MCP) | get_file_contents, search_labels | diagnosis |
| 4 | Fixer | write (REST bot) | create_branch, commit_file, create_merge_request | patch / MR |
| 5 | Validator | write (REST bot) | verify_test_on_branch, comment_on_mr | verdict |
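A per-agent tool allowlist along these lines could be enforced with a simple lookup before any tool call is dispatched. This is a minimal sketch: the tool names come from the table, while `ALLOWLIST`, `check_tool`, and the lowercase agent keys are hypothetical.

```python
# Tool names taken from the table; the dict layout and keys are assumptions.
ALLOWLIST = {
    "detector": {"list_pipelines", "get_pipeline_jobs"},
    "reproducer": {"trigger_pipeline", "get_job_log"},
    "diagnostician": {"get_file_contents", "search_labels"},
    "fixer": {"create_branch", "commit_file", "create_merge_request"},
    "validator": {"verify_test_on_branch", "comment_on_mr"},
}


def check_tool(agent: str, tool: str) -> bool:
    """Raise PermissionError unless `tool` is on the agent's allowlist."""
    if tool not in ALLOWLIST.get(agent, set()):
        raise PermissionError(f"{agent} may not call {tool}")
    return True
```

An unknown agent gets an empty allowlist, so the check fails closed rather than open.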

🏗️ Architecture

   ┌────────────────────────────────────────────────────────────────────┐
   │                      Cloud Run  (serverless)                       │
   │     ┌──────────────────────────────────────────────────────────┐   │
 Browser │  FastAPI Command Center                                  │   │
 (judge) │  GET /  ·  POST /scan  ·  POST /demo  ·  GET /events     │   │
   │     │  ◀── Server-Sent Events (live agent feed + audit trail)  │   │
   │     └──────────────┬──────────────────┬────────────────────────┘   │
   │            ┌───────▼────────┐   guard_write()  ┌───────────┐       │
   │            │ ADK Sequential │─────────────────▶│  AUDIT    │       │
   │            │ Agent crew ×5  │  scans EVERY     │  trail    │       │
   │            │ (Gemini 3.5)   │  write payload   │ (append-  │       │
   │            └──┬─────────┬───┘                  │  only)    │       │
   │     MCP reads │         │ REST writes          └───────────┘       │
   └───────────────│─────────│──────────────────────────────────────────┘
    human OAuth ── │         │ ── least-privilege bot PAT
    (scope = mcp)  ▼         ▼   (Developer · CANNOT merge protected main)
            ┌───────────────────────────────────┐
            │            GitLab.com             │
            │   pipelines · jobs · files · MRs  │
            │   + official GitLab MCP server    │
            └───────────────────────────────────┘

The key design choice is a split identity:

  • Reads flow through GitLab's official first-party MCP server under a human OAuth token (scope=mcp).
  • Writes flow through a least-privilege project bot (Developer role). The bot can open an MR but has no permission to merge into a protected branch, so human-in-the-loop is enforced by the platform, not by a prompt.

🔐 Trust by construction

| Guarantee | How |
|---|---|
| First-party | Acts through GitLab's own MCP server: native, not a scraper |
| Least privilege | Bot scoped to Developer; the platform forbids it from merging into protected main |
| Every write screened | guard_write() blocks leaked secrets (glpat-…, API keys, private keys, JWTs) and prompt-injection markers before anything touches the repo |
| Nothing invisible | An append-only audit trail logs every tool call: which agent, which transport (MCP/REST), which target |
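To make the screening step concrete, here is a minimal sketch of what a `guard_write()`-style check might look like. The regular expressions, the injection phrases, the `BlockedWrite` exception, and the function signature are all assumptions for illustration; the project's real rules are not shown in this writeup and are likely broader.

```python
import re

# Illustrative patterns only; real detection rules would be more thorough.
SECRET_PATTERNS = {
    "gitlab_pat": re.compile(r"glpat-[A-Za-z0-9_\-]{20,}"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_\-]+\.[A-Za-z0-9_\-]+\.[A-Za-z0-9_\-]+"),
}
# Hypothetical prompt-injection markers, matched case-insensitively.
INJECTION_MARKERS = ("ignore previous instructions", "disregard the above")


class BlockedWrite(Exception):
    """Raised when a write payload fails screening."""


def guard_write(payload: str) -> str:
    """Return the payload unchanged, or raise BlockedWrite if it looks unsafe."""
    for name, pattern in SECRET_PATTERNS.items():
        if pattern.search(payload):
            raise BlockedWrite(f"secret detected: {name}")
    lowered = payload.lower()
    for marker in INJECTION_MARKERS:
        if marker in lowered:
            raise BlockedWrite(f"prompt-injection marker: {marker!r}")
    return payload
```

Raising instead of silently redacting keeps the failure visible, so a blocked write can be recorded in the audit trail like any other event.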

🛠️ How we built it

  • Agents: Google Agent Development Kit (ADK) SequentialAgent orchestrating five LlmAgents, each with a callable instruction, an output_key, and a per-agent least-privilege tool allowlist.
  • Model: Gemini 3.5 Flash via Vertex AI (us-central1).
  • GitLab access: the official GitLab MCP server over Streamable-HTTP JSON-RPC, authorized via OAuth Dynamic Client Registration + PKCE (scope=mcp); REST (via a bot PAT) for writes and tool gaps.
  • Backend: FastAPI + Server-Sent Events for the live command center; a dependency-light security/audit core.
  • Deploy: Cloud Run (serverless), secrets in Secret Manager, container via Cloud Build.
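For readers unfamiliar with Server-Sent Events, the wire format behind the live feed is small: each event is a few `field: value` lines followed by a blank line. The sketch below shows one way to encode a frame; the helper name `sse_event`, the event name `tool_call`, and the payload fields are hypothetical, not the project's actual event schema.

```python
import json


def sse_event(event: str, data: dict) -> str:
    """Encode one Server-Sent Event frame: event line, data line, blank line."""
    # json.dumps emits a single line by default, so one data: line is enough.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


frame = sse_event(
    "tool_call",
    {"agent": "detector", "transport": "MCP", "tool": "list_pipelines"},
)
```

In a FastAPI app, frames like this would typically be yielded from a generator wrapped in a `StreamingResponse` with media type `text/event-stream`.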

🧗 Challenges we ran into

  • Unlocking the official MCP server. The endpoint only accepts scope=mcp, which a manually created OAuth app can't even request ("scope is invalid"), and a bot PAT gets 403 insufficient_scope. The fix was unauthenticated Dynamic Client Registration (POST /oauth/register) to mint a public PKCE client that is allowed to request mcp, followed by a standard authorize→token PKCE flow. We also learned that the handshake is stateless and requires the negotiated MCP-Protocol-Version header on every call.
  • The endpoint moved under us, mid-hackathon. GitLab split its MCP service (/api/v4/mcp ↔ a new /api/v4/orbit/mcp) and the full server briefly returned 404. So we made reads MCP-first with an auditable REST fallback: the app never hard-fails, and it flips back to MCP automatically the moment the endpoint returns. Graceful degradation as a feature.
  • Serverless + rotating OAuth tokens. The human token lives 2h and its refresh token is single-use and rotating, which is deadly for a stateless Cloud Run service. We added proactive refresh at the token's half-life, write-back of the rotated token to Secret Manager, and pinned a single warm instance so the refresh chain is never forked.
  • ADK response schemas. The global Gemini endpoint rejected ADK's array response-schemas (400, missing items), which drove our model + region choices.
  • 30-minute pipelines vs. a 2-minute pitch. Real verification triggers real CI and waits. We built a scripted Demo Mode that emits the exact same SSE events as the real crew (a full, faithful run in ~15 seconds, with zero real writes) so judges see everything without the wait.
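The PKCE half of the OAuth flow mentioned in the first challenge is standard (RFC 7636) and fits in a few lines. The sketch below shows only the verifier/challenge pair for the S256 method; the function names are hypothetical, and the registration, authorize, and token requests themselves are omitted.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    """Random URL-safe verifier; 32 random bytes encode to 43 characters."""
    return secrets.token_urlsafe(32)


def code_challenge_s256(verifier: str) -> str:
    """S256 challenge: base64url(SHA-256(verifier)) with the '=' padding removed."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = make_code_verifier()
challenge = code_challenge_s256(verifier)
```

The client sends the challenge with the authorize request and the original verifier with the token request, so a public client needs no stored secret.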

📚 What we learned

  • Least privilege beats guardrails-by-prompt. The strongest safety control wasn't a clever instruction; it was a bot that can't merge.
  • Agents are only as trustworthy as their audit trail. Streaming every tool call made the system legible and debuggable.
  • MCP is powerful but young: first-party agent access to a platform is a glimpse of the future, and resilience (fallbacks, token lifecycle) matters as much as capability.
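The fallback side of that resilience can be sketched as a small wrapper: try the MCP transport first, fall back to REST if it fails, and record both attempts. Everything here is a hypothetical illustration (`read_with_fallback`, the in-memory `audit_log`, and the callables standing in for real transports), not the project's implementation.

```python
from typing import Any, Callable

# In-memory stand-in for an append-only audit trail.
audit_log: list[dict[str, str]] = []


def read_with_fallback(
    tool: str,
    mcp_call: Callable[[], Any],
    rest_call: Callable[[], Any],
) -> Any:
    """Try MCP first; on any failure, fall back to REST and audit both attempts."""
    try:
        result = mcp_call()
        audit_log.append({"tool": tool, "transport": "MCP", "status": "ok"})
        return result
    except Exception as exc:  # deliberately broad for this sketch
        audit_log.append({"tool": tool, "transport": "MCP", "status": f"failed: {exc}"})
    result = rest_call()
    audit_log.append({"tool": tool, "transport": "REST", "status": "ok"})
    return result


def broken_mcp():
    raise ConnectionError("404 from MCP endpoint")


pipelines = read_with_fallback("list_pipelines", broken_mcp, lambda: ["pipeline-1"])
```

Because every call tries MCP first, the wrapper returns to the primary transport on its own as soon as the endpoint recovers, with no flag to reset.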

🚀 What's next

  • Multi-repo / org-wide scheduled scans, with an auto-opened "flaky digest."
  • A richer flaky taxonomy and automatic flaky-test::* labeling.
  • Quarantine + auto-retry policies, and a PR-bot mode for GitHub.
  • Confidence-gated auto-merge for trivial fixes (still human-approved by default).

🧰 Built with (tags for the "Built with" field)

python, fastapi, google-adk, gemini, vertex-ai, google-cloud-run,
secret-manager, gitlab, mcp, model-context-protocol, oauth, pkce,
server-sent-events, pytest, docker, httpx
