Inspiration
Data engineering teams on Google Cloud often catch problems too late: after a broken dashboard, a surprise BigQuery bill, a failed Airflow run, or a Spark job that crashes in production. We wanted to move that review upstream into the merge request itself.
The idea behind DataFleet was simple: what if a merge request could be reviewed by a fleet of specialized AI agents that understand data engineering, not just generic application code? Instead of one general reviewer, we designed a system where different agents focus on BigQuery cost, schema risk, data quality, infrastructure efficiency, deployment readiness, and recurring anti-patterns.
We were especially inspired by the fact that small SQL mistakes can create huge cost differences. If a query scans 365x more data because of a missing partition filter, that is not just a code smell, it is a real operational and financial problem.
What we built
We built DataFleet, an AI-powered review fleet for Google Cloud data engineering teams using the GitLab Duo Agent Platform, GitLab CI, and Google Cloud.
DataFleet reviews merge requests for:
- Spark anti-patterns
- dbt quality and contract issues
- Airflow reliability risks
- BigQuery SQL cost and optimization problems
- schema blast radius
- deployment readiness
The system does more than comment on code. It can also:
- run BigQuery dry-run checks to estimate real scan cost
- use BigQuery as memory to detect recurring anti-patterns by author
- auto-commit SQL fixes for selected issues
- create incident issues for severe schema risk
- close merge requests on
NO-GO - block CI with a second enforcement layer
That combination made the project feel less like a chatbot and more like an active review system.
How we built it
We designed the project around three layers:
- GitLab CI prepares context and enforces the CI-side gate.
- GitLab Duo flows and agents analyze the merge request and produce decisions.
- Google Cloud provides memory, pricing, dry-run cost validation, secure auth, and webhook automation.
GitLab Duo Agent Platform
We modeled the review process as a fleet of flows and agents instead of a single monolithic reviewer. The centerpiece is the Master Orchestrator, which runs an 8-stage pipeline from analysis to final readiness reporting.
A major design choice was the split between:
AgentComponentfor reading and analysisOneOffComponentfor write actions like posting notes, creating issues, committing fixes, and closing MRs
This separation helped us reduce prompt-injection risk by ensuring the component that reads untrusted MR content is not the component that performs sensitive write actions.
Google Cloud integrations
We used Google Cloud as an active part of the review loop:
- BigQuery stores historical findings so the system can detect repeated mistakes
- BigQuery dry-run provides real query scan estimates
- Cloud Billing pricing data helps agents estimate financial impact
- Cloud Run powers zero-touch reviewer assignment through webhooks
- Workload Identity Federation enables keyless authentication from CI
This gave the project a stronger grounding in real platform behavior instead of relying only on model reasoning.
Challenges we faced
One of the biggest challenges was turning AI review from something advisory into something operationally meaningful.
1. Making the system enforceable
It is easy to build an agent that comments on code. It is much harder to build one that can safely participate in merge governance. We solved this with a two-layer gate:
- Layer 1: the agent closes the MR on
NO-GO - Layer 2: CI polls the verdict and exits with failure
That redundancy made the gate much harder to bypass.
2. Handling untrusted input safely
Merge requests, SQL files, and comments are all untrusted input. Prompt injection was a real concern. We addressed this by separating readers from writers and by explicitly instructing prompts to ignore instructions embedded in repository content.
3. Grounding cost analysis in reality
Cost analysis is only useful if it is believable. Rather than guessing BigQuery cost from patterns alone, we integrated bq --dry_run to get planned bytes scanned. That gave us a more reliable signal and made the feedback much stronger.
If a query scans $B$ bytes, then the approximate on-demand BigQuery cost is:
$$ \text{Cost} \approx \frac{B}{2^{40}} \times 5 $$
where the 5 represents roughly $5 per TB scanned.
4. Orchestrating many specialized agents
A fleet is more powerful than a single reviewer, but it also introduces coordination overhead. We had to think carefully about stage sequencing, data handoff, report structure, and how one agent's output becomes another agent's input without losing clarity.
What we learned
We learned that domain-specific AI review becomes dramatically more useful when it is connected to real infrastructure signals.
A few key lessons stood out:
- Specialized agents beat generic review for complex platforms like GCP data engineering.
- Ground truth matters: using dry-run bytes, pricing feeds, and historical findings made the system more trustworthy.
- Security boundaries matter just as much as model quality.
- Automation is most valuable at merge time, when teams can still prevent cost, breakage, and incidents before deployment.
We also learned that good architecture diagrams and documentation are not just polish. They are essential when a project combines CI, cloud infrastructure, AI orchestration, and security design.
Why this project matters
Data engineering failures are often expensive in ways that do not show up in traditional code review. They show up later as cloud spend, operational toil, broken pipelines, stale dashboards, and poor data trust.
DataFleet is our attempt to make merge requests smarter, stricter, and more aware of how data systems actually fail in production. It brings AI review closer to the realities of modern GCP data engineering, where correctness, cost, reliability, and governance all matter at the same time.
Built With
- gcp
- gitlabduoagentplatform
- python
- yaml
Log in or sign up for Devpost to join the conversation.