DataFleet

DataFleet
Problem we are solving
Solution we Built
CI Pipeline Architecture Diagram
GCP Architecture Diagram
Fleet Agents Details
Security Architecture

Inspiration

Data engineering teams on Google Cloud often catch problems too late: after a broken dashboard, a surprise BigQuery bill, a failed Airflow run, or a Spark job that crashes in production. We wanted to move that review upstream into the merge request itself.

The idea behind DataFleet was simple: what if a merge request could be reviewed by a fleet of specialized AI agents that understand data engineering, not just generic application code? Instead of one general reviewer, we designed a system where different agents focus on BigQuery cost, schema risk, data quality, infrastructure efficiency, deployment readiness, and recurring anti-patterns.

We were especially inspired by the fact that small SQL mistakes can create huge cost differences. If a query scans 365x more data because of a missing partition filter, that is not just a code smell, it is a real operational and financial problem.

What we built

We built DataFleet, an AI-powered review fleet for Google Cloud data engineering teams using the GitLab Duo Agent Platform, GitLab CI, and Google Cloud.

DataFleet reviews merge requests for:

Spark anti-patterns
dbt quality and contract issues
Airflow reliability risks
BigQuery SQL cost and optimization problems
schema blast radius
deployment readiness

The system does more than comment on code. It can also:

run BigQuery dry-run checks to estimate real scan cost
use BigQuery as memory to detect recurring anti-patterns by author
auto-commit SQL fixes for selected issues
create incident issues for severe schema risk
close merge requests on NO-GO
block CI with a second enforcement layer

That combination made the project feel less like a chatbot and more like an active review system.

How we built it

We designed the project around three layers:

GitLab CI prepares context and enforces the CI-side gate.
GitLab Duo flows and agents analyze the merge request and produce decisions.
Google Cloud provides memory, pricing, dry-run cost validation, secure auth, and webhook automation.

GitLab Duo Agent Platform

We modeled the review process as a fleet of flows and agents instead of a single monolithic reviewer. The centerpiece is the Master Orchestrator, which runs an 8-stage pipeline from analysis to final readiness reporting.

A major design choice was the split between:

AgentComponent for reading and analysis
OneOffComponent for write actions like posting notes, creating issues, committing fixes, and closing MRs

This separation helped us reduce prompt-injection risk by ensuring the component that reads untrusted MR content is not the component that performs sensitive write actions.

Google Cloud integrations

We used Google Cloud as an active part of the review loop:

BigQuery stores historical findings so the system can detect repeated mistakes
BigQuery dry-run provides real query scan estimates
Cloud Billing pricing data helps agents estimate financial impact
Cloud Run powers zero-touch reviewer assignment through webhooks
Workload Identity Federation enables keyless authentication from CI

This gave the project a stronger grounding in real platform behavior instead of relying only on model reasoning.

Challenges we faced

One of the biggest challenges was turning AI review from something advisory into something operationally meaningful.

1. Making the system enforceable

It is easy to build an agent that comments on code. It is much harder to build one that can safely participate in merge governance. We solved this with a two-layer gate:

Layer 1: the agent closes the MR on NO-GO
Layer 2: CI polls the verdict and exits with failure

That redundancy made the gate much harder to bypass.

2. Handling untrusted input safely

Merge requests, SQL files, and comments are all untrusted input. Prompt injection was a real concern. We addressed this by separating readers from writers and by explicitly instructing prompts to ignore instructions embedded in repository content.

3. Grounding cost analysis in reality

Cost analysis is only useful if it is believable. Rather than guessing BigQuery cost from patterns alone, we integrated bq --dry_run to get planned bytes scanned. That gave us a more reliable signal and made the feedback much stronger.

If a query scans $B$ bytes, then the approximate on-demand BigQuery cost is:

$$ \text{Cost} \approx \frac{B}{2^{40}} \times 5 $$

where the 5 represents roughly $5 per TB scanned.

4. Orchestrating many specialized agents

A fleet is more powerful than a single reviewer, but it also introduces coordination overhead. We had to think carefully about stage sequencing, data handoff, report structure, and how one agent's output becomes another agent's input without losing clarity.

What we learned

We learned that domain-specific AI review becomes dramatically more useful when it is connected to real infrastructure signals.

A few key lessons stood out:

Specialized agents beat generic review for complex platforms like GCP data engineering.
Ground truth matters: using dry-run bytes, pricing feeds, and historical findings made the system more trustworthy.
Security boundaries matter just as much as model quality.
Automation is most valuable at merge time, when teams can still prevent cost, breakage, and incidents before deployment.

We also learned that good architecture diagrams and documentation are not just polish. They are essential when a project combines CI, cloud infrastructure, AI orchestration, and security design.

Why this project matters

Data engineering failures are often expensive in ways that do not show up in traditional code review. They show up later as cloud spend, operational toil, broken pipelines, stale dashboards, and poor data trust.

DataFleet is our attempt to make merge requests smarter, stricter, and more aware of how data systems actually fail in production. It brings AI review closer to the realities of modern GCP data engineering, where correctness, cost, reliability, and governance all matter at the same time.

Built With

gcp
gitlabduoagentplatform
python
yaml

Updates

Lajwith Singh Ranawat started this project — Mar 25, 2026 12:48 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.