Inspiration
Every team that runs CI knows the quiet betrayal of a flaky test: it passes on one run and fails on the next, on the exact same commit. No code changed. Nothing is "broken." But the red ✗ is there.
So people learn the worst habit in software: "just re-run it until it's green." Real failures start hiding behind fake ones, trust in the pipeline erodes, and eventually a genuine bug walks straight through a suite nobody believes anymore.
Formally, a test $t$ is flaky when its verdict is non-deterministic for a fixed commit:
$$\text{flaky}(t)\iff \exists\,c:\quad 0 < \Pr\big[\,t\text{ passes}\mid \text{commit}=c\,\big] < 1$$
We wanted an agent that doesn't just flag that lie, but reproduces it, explains it, fixes it, proves the fix, and then stops, handing the final decision to a human. It acts through GitLab's own first-party tooling, with least privilege and a full audit trail. That's Aegis Guardian: an autonomous flaky-test doctor for GitLab.
What it does
Aegis Guardian is a crew of five specialist AI agents that runs an end-to-end remediation for a flaky test:
- Detector: reads pipeline history and finds the "lie," a test that both passed and failed on the same SHA.
- Reproducer: re-triggers the pipeline to capture hard evidence of non-determinism.
- Diagnostician: reads job logs + source and classifies the root cause (random input, race condition, state leak) with a confidence score.
- Fixer: writes the minimal fix on a new branch and opens a Merge Request, but never merges.
- Validator: runs the fix N times, proves it green every time, comments the verdict, and hands off to a human.
Everything streams live to a command center UI as it happens.
How it works: the crew pipeline
Crew pipeline (diagram), left to right, with shared session state flowing down the chain:
- DETECTOR (MCP read): history → candidate; emits flaky_candidates
- REPRODUCER (MCP read): re-runs → evidence; emits evidence
- DIAGNOSTICIAN (MCP read): logs + source → root cause; emits diagnosis
- FIXER (REST writes): branch + commit → MR; emits patch; HUMAN-IN-THE-LOOP gate: "Authorize MR"
- VALIDATOR (REST verify): green N/N → human; emits verdict; comments on the MR: "a human merges"
State is passed agent-to-agent as a typed chain:
flaky_candidates → evidence → diagnosis → patch → verdict.
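As a rough illustration of that chain, the shared state could be modeled as a single dataclass with one slot per stage. Only the five key names come from the write-up; the field types, the `Diagnosis` shape, and the `next_missing` helper are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional

# The five keys are named in the write-up; everything inside them is hypothetical.
CHAIN = ("flaky_candidates", "evidence", "diagnosis", "patch", "verdict")


@dataclass
class Diagnosis:
    root_cause: str    # e.g. "random input", "race condition", "state leak"
    confidence: float  # 0.0 .. 1.0


@dataclass
class SessionState:
    flaky_candidates: Optional[list] = None
    evidence: Optional[dict] = None
    diagnosis: Optional[Diagnosis] = None
    patch: Optional[dict] = None
    verdict: Optional[str] = None

    def next_missing(self) -> Optional[str]:
        """Return the first key in the chain that no agent has filled yet."""
        for key in CHAIN:
            if getattr(self, key) is None:
                return key
        return None
```

A helper like `next_missing` makes it easy to see which agent should run next, or where a run stopped.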
The Detector's heuristic is simple and explainable. A commit $c$ is suspect when its set of run outcomes $\mathcal{O}(c)$ contains both verdicts:
$$\{\text{pass},\text{fail}\}\subseteq \mathcal{O}(c)$$
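A minimal sketch of this heuristic, assuming pipeline history has already been flattened into `(sha, test_name, outcome)` tuples. That input shape and the function name are hypothetical, not the project's actual code.

```python
from collections import defaultdict


def find_flaky_candidates(runs):
    """Flag (sha, test) pairs whose recorded outcomes include both verdicts.

    runs: iterable of (sha, test_name, outcome), outcome in {"pass", "fail"}.
    """
    outcomes = defaultdict(set)
    for sha, test, outcome in runs:
        outcomes[(sha, test)].add(outcome)
    # Subset check mirrors the formula: {pass, fail} must be contained in O(c).
    return sorted(key for key, seen in outcomes.items() if {"pass", "fail"} <= seen)
```

Grouping by commit and test means a test that fails on one SHA and passes on a different one is not flagged, since that could be a real regression or a real fix.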
The Validator's gate only accepts a fix when the target test passes in all $N$ runs ($N=4$):
$$\text{accept}\iff\bigwedge_{i=1}^{N}\text{pass}_i$$
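The gate can be sketched in a few lines. The function names are illustrative, and the second helper assumes the runs are independent, which real CI runs may not be.

```python
def accept_fix(run_results, n=4):
    """Accept only if exactly n runs were recorded and all of them passed."""
    # The length check matters: all([]) is True, so an empty list must not pass.
    return len(run_results) == n and all(run_results)


def false_accept_probability(p_pass, n=4):
    """Chance that a still-flaky test passes n independent runs in a row."""
    return p_pass ** n
```

Under that independence assumption, a test that still passes only half the time would slip through four runs with probability $0.5^4 = 0.0625$, which is why a larger $N$ gives a stronger guarantee at the cost of more CI time.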
| # | Agent | Reads / Writes | Tools | Output |
|---|---|---|---|---|
| 1 | Detector | read (MCP) | `list_pipelines`, `get_pipeline_jobs` | candidates |
| 2 | Reproducer | read (MCP) | `trigger_pipeline`, `get_job_log` | evidence |
| 3 | Diagnostician | read (MCP) | `get_file_contents`, `search_labels` | diagnosis |
| 4 | Fixer | write (REST bot) | `create_branch`, `commit_file`, `create_merge_request` | patch / MR |
| 5 | Validator | write (REST bot) | `verify_test_on_branch`, `comment_on_mr` | verdict |
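One way to picture the per-agent allowlist in the table is a plain mapping checked before every tool call. The tool names are taken from the table; the `authorize` helper and the use of `PermissionError` are assumptions for this sketch and say nothing about how ADK itself wires tools to agents.

```python
# Tool names copied from the table above; the enforcement code is hypothetical.
ALLOWLIST = {
    "Detector": {"list_pipelines", "get_pipeline_jobs"},
    "Reproducer": {"trigger_pipeline", "get_job_log"},
    "Diagnostician": {"get_file_contents", "search_labels"},
    "Fixer": {"create_branch", "commit_file", "create_merge_request"},
    "Validator": {"verify_test_on_branch", "comment_on_mr"},
}


def authorize(agent: str, tool: str) -> None:
    """Raise PermissionError unless the tool is on the agent's allowlist."""
    if tool not in ALLOWLIST.get(agent, set()):
        raise PermissionError(f"{agent} may not call {tool}")
```

With a check like this, the Detector asking for `create_merge_request` fails fast, independent of what the model was prompted to do.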
Architecture
Architecture (diagram):
- Browser (judge) talks to the FastAPI Command Center running on Cloud Run (serverless): GET / · POST /scan · POST /demo · GET /events, with Server-Sent Events carrying the live agent feed + audit trail.
- The Command Center drives the ADK SequentialAgent crew ×5 (Gemini 3.5).
- guard_write() scans EVERY write payload and feeds the AUDIT trail (append-only).
- MCP reads reach GitLab.com under the human OAuth token (scope = mcp).
- REST writes reach GitLab.com under the least-privilege bot PAT (Developer · CANNOT merge protected main).
- GitLab.com: pipelines · jobs · files · MRs + official GitLab MCP server.
The key design choice is a split identity:
- Reads flow through GitLab's official first-party MCP server under a human OAuth token (`scope=mcp`).
- Writes flow through a least-privilege project bot (Developer role). The bot can open an MR but is physically incapable of merging a protected branch, so human-in-the-loop is enforced by the platform, not by a prompt.
Trust by construction
| Guarantee | How |
|---|---|
| First-party | Acts through GitLab's own MCP server: native, not a scraper |
| Least privilege | Bot scoped to Developer; the platform forbids it from merging main |
| Every write screened | `guard_write()` blocks leaked secrets (`glpat-…`, API keys, private keys, JWTs) and prompt-injection markers before anything touches the repo |
| Nothing invisible | An append-only audit trail logs every tool call: which agent, which transport (MCP/REST), which target |
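To make the last two rows concrete, here is a rough sketch of a write screen paired with an audit record. It is not the project's `guard_write()`: the function name, the regular expressions, the single injection phrase, and the record fields are all assumptions chosen for illustration, and a real screen would need a much broader pattern set.

```python
import re

# Illustrative patterns only; not the project's actual rule set.
BLOCK_PATTERNS = {
    "gitlab_pat": re.compile(r"glpat-[0-9A-Za-z_-]{20,}"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "jwt": re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+"),
    "prompt_injection": re.compile(r"ignore (all )?previous instructions", re.I),
}

AUDIT_TRAIL = []  # append-only by convention: records are never edited or removed


def guard_write_sketch(agent: str, target: str, payload: str) -> None:
    """Log the write attempt, then refuse it if any blocked pattern matches."""
    hits = sorted(name for name, rx in BLOCK_PATTERNS.items() if rx.search(payload))
    AUDIT_TRAIL.append(
        {"agent": agent, "transport": "REST", "target": target, "blocked": hits}
    )
    if hits:
        raise ValueError(f"write to {target} blocked: {', '.join(hits)}")
```

Logging before raising means a blocked write still leaves a trace, which keeps the "nothing invisible" guarantee intact even for rejected payloads.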
How we built it
- Agents: Google Agent Development Kit (ADK) `SequentialAgent` orchestrating five `LlmAgent`s, each with a callable instruction, an `output_key`, and a per-agent least-privilege tool allowlist.
- Model: Gemini 3.5 Flash via Vertex AI (`us-central1`).
- GitLab access: the official GitLab MCP server over Streamable-HTTP JSON-RPC, authorized via OAuth Dynamic Client Registration + PKCE (`scope=mcp`); REST (via a bot PAT) for writes and tool gaps.
- Backend: FastAPI + Server-Sent Events for the live command center; a dependency-light security/audit core.
- Deploy: Cloud Run (serverless), secrets in Secret Manager, container via Cloud Build.
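For readers unfamiliar with PKCE, the client side of the exchange boils down to generating a random verifier and sending its SHA-256 hash as the challenge (the `S256` method from RFC 7636). The sketch below uses only the standard library; the function names are illustrative and nothing here is specific to GitLab's endpoints.

```python
import base64
import hashlib
import secrets


def challenge_for(verifier: str) -> str:
    """S256 challenge: base64url(SHA-256(verifier)) with '=' padding stripped."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def make_pkce_pair():
    """Return (code_verifier, code_challenge) for one authorization attempt."""
    # 64 random bytes encode to 86 URL-safe characters, within the 43..128 limit.
    verifier = secrets.token_urlsafe(64)
    return verifier, challenge_for(verifier)
```

The challenge goes out with the authorize request and the verifier with the token request, so a public client can prove it started the flow without holding a client secret.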
Challenges we ran into
- Unlocking the official MCP server. The endpoint only accepts `scope=mcp`, which a manually-created OAuth app can't even request ("scope is invalid"), and a bot PAT gets `403 insufficient_scope`. The fix was unauthenticated Dynamic Client Registration (`POST /oauth/register`) to mint a public PKCE client allowed `mcp`, then a standard authorize→token PKCE flow. We also learned its handshake is stateless and requires the negotiated `MCP-Protocol-Version` header on every call.
- The endpoint moved under us, mid-hackathon. GitLab split its MCP service (`/api/v4/mcp` → a new `/api/v4/orbit/mcp`) and the full server briefly returned `404`. So we made reads MCP-first with an auditable REST fallback: the app never hard-fails, and flips back to green MCP automatically the moment the endpoint returns. Graceful degradation as a feature.
- Serverless + rotating OAuth tokens. The human token lives 2h and its refresh token is single-use/rotating, which is deadly for a stateless Cloud Run service. We added half-life proactive refresh, write-back of the rotated token to Secret Manager, and pinned a single warm instance so the refresh chain is never forked.
- ADK response schemas. The global Gemini endpoint rejected ADK's array response-schemas (`400`, missing `items`), which drove our model + region choices.
- 30-minute pipelines vs. a 2-minute pitch. Real verification triggers real CI and waits. We built a scripted Demo Mode that emits the exact same SSE events as the real crew (a full, faithful run in ~15 seconds, with zero real writes) so judges see everything without the wait.
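The MCP-first read path with an auditable REST fallback might look roughly like this. The `McpUnavailable` exception, the callables, and the audit record shape are all hypothetical stand-ins; the point is only that MCP is attempted on every call, so reads return to MCP as soon as the endpoint is healthy again.

```python
class McpUnavailable(Exception):
    """Hypothetical error raised when the MCP endpoint cannot serve a read."""


def read_with_fallback(mcp_read, rest_read, audit):
    """Try the MCP read first; on failure, record it and fall back to REST."""
    try:
        result = mcp_read()
        audit.append({"transport": "MCP", "ok": True})
        return result
    except McpUnavailable as exc:
        audit.append({"transport": "MCP", "ok": False, "error": str(exc)})
    # Fallback path: the REST call is logged too, so the switch stays visible.
    result = rest_read()
    audit.append({"transport": "REST", "ok": True})
    return result
```

Because no state about the outage is cached, there is nothing to reset when the MCP endpoint comes back.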
What we learned
- Least privilege beats guardrails-by-prompt. The strongest safety control wasn't a clever instruction; it was a bot that can't merge.
- Agents are only as trustworthy as their audit trail. Streaming every tool call made the system legible and debuggable.
- MCP is powerful but young: first-party agent access to a platform is a glimpse of the future, and resilience (fallbacks, token lifecycle) matters as much as capability.
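On the token-lifecycle point, a half-life refresh rule is small enough to state in code. This is a sketch under assumptions: timestamps are plain seconds, the function name is made up, and the actual refresh call plus the write-back to Secret Manager are left out.

```python
def needs_refresh(issued_at: float, expires_in: float, now: float) -> bool:
    """True once at least half of the token's lifetime has elapsed."""
    return now - issued_at >= expires_in / 2
```

For a token that lives two hours (7200 seconds), this triggers a refresh from the one-hour mark onward, leaving a full hour of slack before expiry.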
What's next
- Multi-repo / org-wide scheduled scans, with an auto-opened "flaky digest."
- A richer flaky taxonomy and automatic `flaky-test::*` labeling.
- Quarantine + auto-retry policies, and a PR-bot mode for GitHub.
- Confidence-gated auto-merge for trivial fixes (still human-approved by default).
Built with (tags for the "Built with" field)
python, fastapi, google-adk, gemini, vertex-ai, google-cloud-run,
secret-manager, gitlab, mcp, model-context-protocol, oauth, pkce,
server-sent-events, pytest, docker, httpx