Inspiration
Every data engineer has the same problem: copying production data into staging is illegal in half the world (GDPR, HIPAA, CCPA). Most teams ship with broken fake data and discover bugs at 3 a.m. Existing synthetic-data tools (SDV, Gretel, MOSTLY AI) are one-shot scripts no one wants to maintain.
We wanted a real product surface: an autonomous agent that runs nightly, generates realistic-but-fake data, validates its fidelity against four statistical metrics, asks a human to approve once, and ships it via the same managed pipeline (Fivetran) the data team already trusts in production.
What it does
Synth is an autonomous agent built on Google Cloud Agent Builder, Gemini 3.1 Pro Preview, and the Fivetran MCP. When triggered (button click in the dashboard, HTTP call, or schedule), it:
- Discovers the source schema via the Engine API
- Generates synthetic rows using a Gaussian Copula (preserves column correlations) or Conditional Histogram (preserves marginals)
- Validates fidelity with four metrics:
  - TSTR (Train Synthetic, Test Real): XGBoost AUC trained on synth, scored on a real holdout. Measures utility.
  - KS_avg: mean per-column Kolmogorov–Smirnov statistic. Measures numeric distribution similarity.
  - JS_avg: mean per-column Jensen–Shannon divergence. Measures categorical similarity.
  - DCR_min: minimum Distance to Closest Record. Measures privacy (synthetic rows aren't too close to any real row).
- Branches on failure: if the gate fails (TSTR < 0.75, or KS > 0.15, or DCR < 0.10), the agent retries once with the alternate engine. If still failing, it halts.
- Requests human approval by writing the run to Firestore. The dashboard polls and shows the metrics + Plotly distribution-comparison plots overlaying real (gray) vs synth (mint).
- On approval: uploads the synth as Parquet to a GCS bucket, calls sync_connection on a pre-configured Fivetran GCS→BigQuery connection, and polls until succeeded_at advances.
- Notifies Slack with the result. The synth lives in synth_staging.loan_applications in BigQuery.
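As a rough illustration of the validation and gate steps above, the sketch below computes three of the four metrics in plain Python and applies the gate thresholds (TSTR < 0.75, KS > 0.15, DCR < 0.10) with a single retry on the alternate engine. The function names, the engine identifiers, the base-2 logarithm for Jensen–Shannon, and the Euclidean distance for DCR are assumptions made for the example, not Synth's actual implementation. The TSTR score is passed in rather than computed.

```python
import math
from collections import Counter


def ks_statistic(real, synth):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    real, synth = sorted(real), sorted(synth)
    i = j = 0
    gap = 0.0
    while i < len(real) and j < len(synth):
        x = min(real[i], synth[j])
        while i < len(real) and real[i] <= x:
            i += 1
        while j < len(synth) and synth[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(real) - j / len(synth)))
    return gap


def js_divergence(real, synth):
    """Jensen-Shannon divergence of category frequencies (base 2, range 0..1)."""
    p, q = Counter(real), Counter(synth)
    total = 0.0
    for cat in set(p) | set(q):
        a, b = p[cat] / len(real), q[cat] / len(synth)
        m = (a + b) / 2
        if a > 0:
            total += 0.5 * a * math.log2(a / m)
        if b > 0:
            total += 0.5 * b * math.log2(b / m)
    return total


def dcr_min(real_rows, synth_rows):
    """Smallest Euclidean distance from any synthetic row to any real row."""
    return min(math.dist(s, r) for s in synth_rows for r in real_rows)


def gate_failures(metrics):
    """Return the names of the metrics that violate the gate thresholds."""
    failures = []
    if metrics["tstr"] < 0.75:
        failures.append("tstr")
    if metrics["ks_avg"] > 0.15:
        failures.append("ks_avg")
    if metrics["dcr_min"] < 0.10:
        failures.append("dcr_min")
    return failures


# Hypothetical engine identifiers for the two generators.
ALTERNATE = {
    "gaussian_copula": "conditional_histogram",
    "conditional_histogram": "gaussian_copula",
}


def run_with_retry(score_engine, first_engine="gaussian_copula"):
    """Try one engine, retry once with the alternate, else halt (engine=None)."""
    for engine in (first_engine, ALTERNATE[first_engine]):
        metrics = score_engine(engine)
        if not gate_failures(metrics):
            return engine, metrics
    return None, metrics
```

Here score_engine stands in for "generate with this engine, then validate", and returns a dict with the keys tstr, ks_avg, and dcr_min. The per-column averages would come from calling ks_statistic on each numeric column and js_divergence on each categorical column, then taking the mean.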
The 9 tools the agent calls — discover_schema, generate_synthetic, validate_fidelity, write_run, request_human_approval, notify_slack, upload_synthetic_to_gcs, trigger_fivetran_sync, wait_for_sync_complete — are HTTP endpoints on a Cloud Run service. The Gemini agent calls them by name with structured arguments and gets JSON back.
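To make "calls them by name with structured arguments" concrete, here is a minimal sketch of how a tool call could be turned into an HTTP request. The nine tool names come from the list above. The /tools/<name> path, the base URL, and the build_tool_request helper are hypothetical and chosen only for illustration.

```python
import json
from urllib.request import Request

TOOL_NAMES = {
    "discover_schema", "generate_synthetic", "validate_fidelity",
    "write_run", "request_human_approval", "notify_slack",
    "upload_synthetic_to_gcs", "trigger_fivetran_sync",
    "wait_for_sync_complete",
}


def build_tool_request(base_url, tool_name, arguments):
    """Build (but do not send) a JSON POST for one named tool call."""
    if tool_name not in TOOL_NAMES:
        raise ValueError(f"unknown tool: {tool_name}")
    return Request(
        url=f"{base_url.rstrip('/')}/tools/{tool_name}",  # hypothetical route
        data=json.dumps(arguments).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it. Rejecting unknown names before any network call keeps a mistaken tool name from the model from turning into a confusing 404.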
How we built it
Three Cloud Run services + the Fivetran-managed pipeline + the Gemini agent itself:
- Engine (FastAPI, Python): generators + validators + GCS upload. Reads from Cloud SQL Postgres (100k Lending Club loans seeded from Kaggle as our "prod" data). XGBoost handles the TSTR scoring, including the gnarly string-multiclass case (loan_status has 7 classes; we label-encode after filtering classes absent from synth-train).
- Agent Tools (FastAPI, Python): the 9 HTTP tools the agent calls + /agent/trigger to start a new run from the dashboard. A background thread runs the Gemini loop without blocking the HTTP response.
- Dashboard (Next.js 14, App Router, Geist + Plotly): dark B2B-infrastructure aesthetic (Linear/Vercel-class), single mint accent, sparklines + threshold bars on each metric card, status-aware row tints, real-time Firestore polling.
- Fivetran: GCS source connector → BigQuery destination. The agent's tool calls are the REST equivalents of the official fivetran-mcp server's sync_connection and get_connection_details tools: same API surface, just invoked in-process so the agent can run inside Cloud Run without spawning a subprocess.
- Gemini agent: built with the Vertex AI SDK's GenerativeModel + FunctionDeclaration + function-calling loop. Runs at the global Vertex AI endpoint. The system prompt encodes the 9-step decision graph; the model picks tools and arguments autonomously.
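The "poll until succeeded_at advances" step behind wait_for_sync_complete can be sketched as follows. This is a sketch under assumptions: get_succeeded_at is a hypothetical callable that would wrap the connection-details lookup and return the connection's succeeded_at timestamp as an ISO 8601 UTC string (or None), and the timeout and interval defaults are invented for the example.

```python
import time


def wait_for_sync_complete(get_succeeded_at, previous, timeout_s=300.0,
                           interval_s=5.0, sleep=time.sleep,
                           clock=time.monotonic):
    """Poll until succeeded_at is later than the value seen before the sync."""
    deadline = clock() + timeout_s
    while True:
        current = get_succeeded_at()
        # Same-format ISO 8601 UTC strings sort chronologically as text.
        if current is not None and (previous is None or current > previous):
            return current
        if clock() >= deadline:
            raise TimeoutError("sync did not complete before the timeout")
        sleep(interval_s)
```

Recording succeeded_at before triggering the sync and waiting for a strictly later value avoids mistaking the previous successful sync for the new one. The sleep and clock parameters exist only so the loop can be tested without real waiting.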
Challenges we ran into
- Gemini 3 access: initial probing returned 404 NOT_FOUND for every model name we tried (gemini-3.0-pro, gemini-3-pro, gemini-3-pro-preview, etc.) across every region. We almost posted on the hackathon forum asking for allowlist access. After enumerating the Model Garden, we discovered the working combination: model name gemini-3.1-pro-preview (the .0 is dropped) + endpoint locations/global (not us-central1) + the x-goog-user-project header. Documented in docs/specs/2026-05-23-gemini3-access.md.
- Fivetran MCP capabilities: we initially assumed the MCP could push synthetic rows directly into a destination. It can't: the Fivetran MCP is for managing the Fivetran control plane (connectors, syncs). We pivoted the architecture to use Fivetran for what it's built for: the agent writes Parquet to GCS, then triggers sync_connection to move GCS → BigQuery. Better architecture; better story for the demo.
- String-multiclass TSTR: real loan_status has 7 string-valued classes with extreme imbalance (62% "Fully Paid", < 1% "Default"). Naive XGBoost on those labels exploded with non-contiguous class indices after filtering. We now label-encode to contiguous integers after filtering, and we switched the AUC to multi_class="ovr", average="weighted" so that undefined rare-class AUCs don't turn the result into NaN.
- Cloud Run in-memory cache: the engine caches synthetic outputs in a Python dict keyed by run_id. Cloud Run scales horizontally, so different requests can hit different instances and miss the cache. We pinned the engine to min-instances=1, max-instances=1 so the cache survives.
- Dashboard design: our first dashboard iteration was "editorial / scientific journal" (Fraunces serif, warm newsprint paper). It was distinctive, but it didn't look like a B2B SaaS product; it looked like a typography project. We rebuilt it as a dark infrastructure dashboard (Linear / Vercel / Tailscale-class), added subtle color washes, sparklines, threshold bars, micro-animations. Documented in docs/specs/2026-05-22-synth-design.md and the dashboard's git history.
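The label-encoding fix for the string-multiclass TSTR case can be illustrated with a small sketch. It assumes the filtering step drops real holdout rows whose class never appears in the synthetic training labels. The helper name and the sample labels in the usage note are illustrative, not taken from the Synth codebase.

```python
def encode_after_filtering(synth_train_labels, real_test_labels):
    """Map string classes to contiguous ints, dropping unseen test classes.

    Returns (classes, y_train, kept_test_indices, y_test).
    """
    # Only classes present in synth-train get an index, so the encoded
    # labels are always 0..k-1 with no gaps.
    classes = sorted(set(synth_train_labels))
    index = {label: i for i, label in enumerate(classes)}

    y_train = [index[label] for label in synth_train_labels]

    # Keep only the real holdout rows whose class the model was trained on.
    kept = [i for i, label in enumerate(real_test_labels) if label in index]
    y_test = [index[real_test_labels[i]] for i in kept]
    return classes, y_train, kept, y_test
```

For example, with synth-train labels ["Fully Paid", "Charged Off", "Fully Paid"] and real holdout labels ["Default", "Fully Paid", "Charged Off"], the "Default" row is dropped and the remaining labels encode to 0 and 1. The kept indices let the caller select the matching feature rows before scoring.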
Accomplishments we're proud of
- The full decision graph runs autonomously — Gemini picks each tool and argument on its own. No hardcoded sequence.
- The agent retries on fidelity failure with the alternate engine by itself — it's a real agent loop, not a pipeline.
- TSTR on real Lending Club data lands at 0.96+ with the Gaussian Copula, validated end-to-end.
- The Fivetran MCP integration is real — synth data flows GCS → Fivetran sync → BigQuery in ~30-45 seconds per run.
- The dashboard's "Approve & push" button completes the loop with a human-in-the-loop signal back to the running agent via Firestore polling — no webhooks, no complicated queue, just one document field.
- Built in 14 days by one developer.
What we learned
- The Fivetran MCP is a CONTROL PLANE, not a data plane. Once we understood that, the architecture became cleaner.
- Gemini 3 / Vertex AI naming has subtle traps: global location, x-goog-user-project header, no .0 in the model name. None of these are obvious from the docs.
- B2B SaaS dashboards win on restraint, not chrome. Linear, Vercel, and Tailscale all converge on the same pattern for good reasons.
- Function calling in Gemini is dramatically easier than spinning up Agent Engine for the same job. A FunctionDeclaration per tool + a generate_content loop = working agent in ~150 lines.
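The shape of such a function-calling loop can be sketched without any SDK. In the sketch below, next_step is a hypothetical stand-in for the model call: it receives the history so far and returns either a (tool_name, arguments) pair or None when the model is done. This is not the Vertex AI SDK's API, only the control flow around it.

```python
def run_agent_loop(next_step, tools, max_turns=20):
    """Let the model pick tools until it stops or max_turns is reached."""
    history = []
    for _ in range(max_turns):
        call = next_step(history)
        if call is None:  # the model produced a final answer, not a tool call
            return history
        name, args = call
        if name in tools:
            result = tools[name](**args)
        else:
            # Report the mistake back to the model instead of crashing.
            result = {"error": f"unknown tool: {name}"}
        history.append({"tool": name, "args": args, "result": result})
    raise RuntimeError("agent did not finish within max_turns")
```

Feeding each tool result back into the history is what lets the model branch, for example by retrying with the alternate engine after a failed gate. The max_turns cap is an assumption added here so a confused model cannot loop forever.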
What's next for Synth
- Cloud Scheduler: nightly trigger for every staging environment
- Multi-source federation: one agent run, multiple source tables, parallelized
- PR-driven runs: GitHub webhook triggers a run when a schema migration lands in prod
- Differential privacy budget tracking: track ε across runs so the cumulative privacy cost is bounded
- CTGAN / TVAE engines via SDV: deep-learning generators for the cases where copula's marginal-shape assumptions break
- Connector SDK source: turn Synth itself into a Fivetran source so the entire flow is one sync_connection call
Built With
- agent-builder
- bigquery
- cloud-build
- cloud-run
- cloud-scheduler
- cloud-sql
- fastapi
- firestore
- fivetran
- geist
- gemini
- google-cloud
- mcp
- nextjs
- numpy
- pandas
- plotly
- postgresql
- pyarrow
- pydantic
- python
- react
- scikit-learn
- scipy
- tailwindcss
- typescript
- vertex-ai
- xgboost