Inspiration

Every engineering team has been there: a reviewer opens a merge request to find a single-line description that says "fix", a commit history full of "wip wip wip final FINAL", and zero tests for three new services. The reviewer now has to spend 20 minutes just figuring out what the MR is even trying to do before they can evaluate whether it does it correctly.

That frustration was the seed of this project. The core question was simple: what if we could stop low-effort MRs before a human ever had to look at them?

Code review is expensive. Treating reviewer attention as a finite and valuable resource means protecting it from noise is just as important as the review itself. MR Quality Gate was built to be that first line of defense.


What it does

MR Quality Gate is a fully automated GitLab agent that intercepts every merge request and scores it across 6 quality dimensions before any human reviewer sees it.

It fetches the MR metadata, diffs, commits, linked issues, and scans for code smells, then computes a weighted score out of 100. Based on the result, it posts a structured scorecard as an MR comment, applies a quality tier label, and takes action:

  • EXCELLENT (85-100): Posts praise with minor suggestions
  • GOOD (70-84): Posts improvement suggestions
  • NEEDS WORK (50-69): Posts a numbered issues list and requests changes
  • POOR (30-49): Requests changes and flags for maintainer attention
  • JUNK (0-29): Posts an explanation and automatically closes the MR

No human has to triage noise. The agent handles it.


How we built it

The agent is defined as a YAML workflow using GitLab Duo Workflow's AgentComponent model, running in ambient mode so it triggers automatically on MR events rather than waiting for user input.

The scoring formula is a weighted sum across six dimensions:

$$ S_{total} = (D_1 \times 2.5) + (D_2 \times 2.0) + (D_3 \times 2.0) + (D_4 \times 2.0) + (D_5 \times 1.0) + (D_6 \times 0.5) $$

Where $D_i \in [1, 10]$ for Description Quality, Commit Hygiene, Scope Integrity, Test Coverage, Code Smell Signals, and MR Hygiene respectively.

The agent follows a strict 11-step tool sequence: checking for duplicate scorecards first, fetching all MR data, running grep for code smells, scoring, posting the comment, labeling, and optionally closing. It is designed as a zero-interaction flow, meaning it never asks clarifying questions and always proceeds from the inputs it is given.


Challenges we ran into

Prompt injection defense was a real concern. The agent reads file contents, diffs, and commit messages directly, meaning a malicious contributor could embed instructions inside a commit message. Explicit boundaries in the system prompt ensure the agent treats all file and diff content purely as data, never as instructions to follow.

Graceful degradation took careful thought. If one tool call fails (say, list_merge_request_diffs is unavailable), the agent should not abort the entire run. Each tool failure has a defined fallback so the agent always produces a complete report, even if some dimensions are marked N/A.

Scoring calibration was the most time-consuming challenge. Deciding what a "5 out of 10" means for commit hygiene versus description quality required multiple rubric iterations to feel consistent and fair across very different MR types.


Accomplishments that we're proud of

  • A rubric that feels fair and consistent across tiny config patches and large feature branches alike
  • A tone that is helpful even when closing an MR, always including at least one positive observation regardless of score
  • A duplicate detection guard that checks existing notes before posting, so re-triggers never pile up multiple scorecards
  • The agent running fully autonomously with zero human input needed at any point in the flow

What we learned

Agent behavior is only as good as the prompt contract. The tool sequence, input parsing rules, and boundary conditions all had to be written with the same rigor you would apply to a formal specification. Vague instructions produce inconsistent behavior.

Weighting reflects values. Description quality and test coverage together account for 45% of the total score. That is a deliberate statement: an MR that explains itself and proves it works is more valuable than one with clean commits but no context.

Automation should be respectful. The agent is enforcing standards, not punishing people. That distinction had to be built into the tone guidelines explicitly, not assumed.


What's next for MR Gate

CI pipeline integration to optionally block merges below a configurable score threshold

Built With

  • claude
  • gitlab
  • yaml
Share this project:

Updates