Aegis Guardian: Autonomous Flaky-Test Doctor

💡 Inspiration

Every team that runs CI knows the quiet betrayal of a flaky test: it passes on one run and fails on the next — on the exact same commit. No code changed. Nothing is "broken." But the red ❌ is there.

So people learn the worst habit in software: "just re-run it until it's green." Real failures start hiding behind fake ones, trust in the pipeline erodes, and eventually a genuine bug walks straight through a suite nobody believes anymore.

Formally, a test $t$ is flaky when its verdict is non-deterministic for a fixed commit:

$$\text{flaky}(t)\iff \exists\,c:\quad 0 < \Pr\big[\,t\text{ passes}\mid \text{commit}=c\,\big] < 1$$

We wanted an agent that doesn't just flag that lie, but reproduces it, explains it, fixes it, proves the fix, and then stops — handing the final decision to a human. Acting through GitLab's own first-party tooling, with least privilege and a full audit trail. That's Aegis Guardian — an autonomous flaky-test doctor for GitLab.

🩺 What it does

Aegis Guardian is a crew of five specialist AI agents that runs an end-to-end remediation for a flaky test:

Detector — reads pipeline history and finds the "lie": a test that both passed and failed on the same SHA.
Reproducer — re-triggers the pipeline to capture hard evidence of non-determinism.
Diagnostician — reads job logs + source and classifies the root cause (random input, race condition, state leak) with a confidence score.
Fixer — writes the minimal fix on a new branch and opens a Merge Request — but never merges.
Validator — runs the fix N times, proves it green every time, comments the verdict, and hands off to a human.

Everything streams live to a command center UI as it happens.

🧠 How it works — the crew pipeline

   MCP read        MCP read           MCP read         REST writes      REST verify
 ┌───────────┐  ┌────────────┐  ┌────────────────┐  ┌──────────┐  ┌──────────────┐
 │ DETECTOR  │─▶│ REPRODUCER │─▶│ DIAGNOSTICIAN  │─▶│  FIXER   │─▶│  VALIDATOR   │
 │ history → │  │ re-runs →  │  │ logs+source →  │  │ branch + │  │ green N/N →  │
 │ candidate │  │ evidence   │  │ root cause     │  │ commit→MR│  │ human        │
 └───────────┘  └────────────┘  └────────────────┘  └────┬─────┘  └──────┬───────┘
 flaky_candidates   evidence       diagnosis              │ patch        │ verdict
        └──────────── shared session state flows down ────┘              │
                                                    ▲                     ▼
                                            HUMAN-IN-THE-LOOP      comment on MR
                                            "Authorize MR"         "a human merges"

State is passed agent-to-agent as a typed chain: flaky_candidates → evidence → diagnosis → patch → verdict.

The Detector's heuristic is simple and explainable — a commit is suspect when its run outcomes contain both verdicts:

$${\text{pass},\text{fail}}\subseteq \mathcal{O}(c)$$

The Validator's gate only accepts a fix when the target test passes in all $N$ runs ($N=4$):

$$\text{accept}\iff\bigwedge_{i=1}^{N}\text{pass}_i$$

#	Agent	Reads / Writes	Tools	Output
1	Detector	read (MCP)	`list_pipelines`, `get_pipeline_jobs`	candidates
2	Reproducer	read (MCP)	`trigger_pipeline`, `get_job_log`	evidence
3	Diagnostician	read (MCP)	`get_file_contents`, `search_labels`	diagnosis
4	Fixer	write (REST bot)	`create_branch`, `commit_file`, `create_merge_request`	patch / MR
5	Validator	write (REST bot)	`verify_test_on_branch`, `comment_on_mr`	verdict

🏗️ Architecture

   ┌──────────────────────────────────────────────────────────────┐
   │                  Cloud Run  (serverless)                      │
   │   ┌────────────────────────────────────────────────────────┐ │
 Browser │  FastAPI Command Center                               │ │
 (judge) │  GET /  ·  POST /scan  ·  POST /demo  ·  GET /events   │ │
   │  ◀── Server-Sent Events (live agent feed + audit trail) ──   │ │
   │   └───────────────┬───────────────────────┬────────────────┘ │
   │            ┌──────▼───────┐   guard_write()  ┌───────────┐    │
   │            │ ADK Sequential│─────────────────▶│  AUDIT    │   │
   │            │ Agent crew ×5 │  scans EVERY     │  trail    │   │
   │            │ (Gemini 3.5)  │  write payload   │ (append-  │   │
   │            └──┬─────────┬──┘                  │  only)    │   │
   │     MCP reads │         │ REST writes         └───────────┘   │
   └──────────────│─────────│──────────────────────────────────────┘
   human OAuth ── │         │ ── least-privilege bot PAT
   (scope = mcp)  ▼         ▼   (Developer · CANNOT merge protected main)
            ┌──────────────────────────────────┐
            │            GitLab.com             │
            │   pipelines · jobs · files · MRs  │
            │   + official GitLab MCP server    │
            └──────────────────────────────────┘

The key design choice is a split identity:

Reads flow through GitLab's official first-party MCP server under a human OAuth token (scope=mcp).
Writes flow through a least-privilege project bot (Developer role). The bot can open an MR but is physically incapable of merging a protected branch — so human-in-the-loop is enforced by the platform, not by a prompt.

🔐 Trust by construction

Guarantee	How
First-party	Acts through GitLab's own MCP server — native, not a scraper
Least privilege	Bot scoped to Developer; the platform forbids it from merging `main`
Every write screened	`guard_write()` blocks leaked secrets (`glpat-…`, API keys, private keys, JWTs) and prompt-injection markers before anything touches the repo
Nothing invisible	An append-only audit trail logs every tool call — which agent, which transport (MCP/REST), which target

🛠️ How we built it

Agents: Google Agent Development Kit (ADK) SequentialAgent orchestrating five LlmAgents, each with a callable instruction, an output_key, and a per-agent least-privilege tool allowlist.
Model: Gemini 3.5 Flash via Vertex AI (us-central1).
GitLab access: the official GitLab MCP server over Streamable-HTTP JSON-RPC, authorized via OAuth Dynamic Client Registration + PKCE (scope=mcp); REST (via a bot PAT) for writes and tool gaps.
Backend: FastAPI + Server-Sent Events for the live command center; a dependency-light security/audit core.
Deploy: Cloud Run (serverless), secrets in Secret Manager, container via Cloud Build.

🧗 Challenges we ran into

Unlocking the official MCP server. The endpoint only accepts scope=mcp, which a manually-created OAuth app can't even request ("scope is invalid"), and a bot PAT gets 403 insufficient_scope. The fix was unauthenticated Dynamic Client Registration (POST /oauth/register) to mint a public PKCE client allowed mcp, then a standard authorize→token PKCE flow. We also learned its handshake is stateless and requires the negotiated MCP-Protocol-Version header on every call.
The endpoint moved under us, mid-hackathon. GitLab split its MCP service (/api/v4/mcp ↔ a new /api/v4/orbit/mcp) and the full server briefly returned 404. So we made reads MCP-first with an auditable REST fallback — the app never hard-fails, and flips back to green MCP automatically the moment the endpoint returns. Graceful degradation as a feature.
Serverless + rotating OAuth tokens. The human token lives 2h and its refresh token is single-use/rotating — deadly for a stateless Cloud Run service. We added half-life proactive refresh, write-back of the rotated token to Secret Manager, and pinned a single warm instance so the refresh chain is never forked.
ADK response schemas. The global Gemini endpoint rejected ADK's array response-schemas (400, missing items), which drove our model + region choices.
30-minute pipelines vs. a 2-minute pitch. Real verification triggers real CI and waits. We built a scripted Demo Mode that emits the exact same SSE events as the real crew — a full, faithful run in ~15 seconds, with zero real writes — so judges see everything without the wait.

📚 What we learned

Least privilege beats guardrails-by-prompt. The strongest safety control wasn't a clever instruction — it was a bot that can't merge.
Agents are only as trustworthy as their audit trail. Streaming every tool call made the system legible and debuggable.
MCP is powerful but young — first-party agent access to a platform is a glimpse of the future, and resilience (fallbacks, token lifecycle) matters as much as capability.

🚀 What's next

Multi-repo / org-wide scheduled scans, with an auto-opened "flaky digest."
A richer flaky taxonomy and automatic flaky-test::* labeling.
Quarantine + auto-retry policies, and a PR-bot mode for GitHub.
Confidence-gated auto-merge for trivial fixes (still human-approved by default).

🧰 Built with (tags for the "Built with" field)

python, fastapi, google-adk, gemini, vertex-ai, google-cloud-run,
secret-manager, gitlab, mcp, model-context-protocol, oauth, pkce,
server-sent-events, pytest, docker, httpx

Built With

docker
fastapi
gemini
gitlab
google-adk
google-cloud-run
mcp
model-context-protocol
oauth
pkce
pytest
python
secret-manager
server-sent-events
vertex-ai

Updates

Trishit Bodkhe started this project — Jun 11, 2026 03:38 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.