GenAI Powered Synthetic Data Generation and Validation

Inspiration

In today's data-driven world, access to high-quality datasets is often blocked by privacy concerns, regulatory hurdles, or data scarcity—especially in sensitive fields like healthcare and finance. We were inspired to build a GenAI-powered pipeline that can bridge the gap between data privacy and innovation by generating synthetic data that is realistic, validated, and secure, unlocking new possibilities for AI, analytics, and testing without the risk of exposing sensitive information.

What it does

Our solution is an end-to-end GenAI-Powered Synthetic Data Generation and Validation Pipeline that:

Ingests data from multiple sources (BigQuery, Firestore, CloudSQL, CSV, JSON, Excel)

Reads schema and generates realistic synthetic data using Claude 3.7 (via AWS Bedrock)

Scales up the generated data with CTGAN for large-volume synthesis

Validates data quality using Great Expectations, ensuring accuracy levels above 92%

Pushes the validated synthetic data back into the source system or allows downloads

Protects all sensitive fields using Cloud DLP, ensuring compliance with regulations like GDPR and HIPAA
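
As a rough illustration of the schema-reading step in the list above, the sketch below infers column types from a few sample records and turns them into a generation prompt. The function names (`infer_type`, `infer_schema`, `build_prompt`), the type labels, and the prompt wording are assumptions made for this example, not the project's actual code, and the call to Claude through Bedrock is deliberately left out.

```python
import json
from datetime import date


def infer_type(value):
    """Map a Python value to a coarse type label (labels are illustrative)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        try:
            date.fromisoformat(value)
            return "date"
        except ValueError:
            return "string"
    return "unknown"


def infer_schema(rows):
    """Infer one type per column; conflicting types fall back to 'string'."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            if value is None:
                continue  # nulls carry no type information
            label = infer_type(value)
            previous = schema.setdefault(column, label)
            if previous != label:
                schema[column] = "string"
    return schema


def build_prompt(schema, sample_rows, n_rows):
    """Assemble a prompt that asks for JSON rows matching the schema."""
    return (
        f"Generate {n_rows} synthetic rows as a JSON array.\n"
        f"Schema: {json.dumps(schema, sort_keys=True)}\n"
        f"Example rows (style only, do not copy): {json.dumps(sample_rows)}\n"
        "Return only JSON."
    )


# Hypothetical sample records, invented for this example.
rows = [
    {"patient_id": 1, "name": "Ana", "admitted": "2024-01-15", "balance": 10.5},
    {"patient_id": 2, "name": "Raj", "admitted": "2024-02-03", "balance": None},
]
schema = infer_schema(rows)
prompt = build_prompt(schema, rows[:1], n_rows=50)
print(prompt)
```

In a privacy-sensitive setup, any sample rows would need to be masked before they are placed in a prompt.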

How we built it

We architected a modular pipeline using the following technologies:

GenAI Models (Claude 3.7 and Claude 4) via AWS Bedrock for schema-aware data generation

CTGAN (Conditional GAN) for scalable tabular data synthesis

Great Expectations for data validation, profiling, and quality thresholds

AWS Lambda functions to orchestrate data extraction, generation, validation, and export

Cloud DLP to protect and mask sensitive data during processing

Support for Multi-Source Ingestion (BigQuery, Firestore, CloudSQL, and files like CSV/Excel/JSON)

Each step is automated, ensuring high accuracy, traceability, and end-to-end flexibility.
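
One way to picture the automated, traceable flow described above is a list of named stages run in order, with a trace entry recorded per stage. This is a minimal sketch under that assumption: `run_pipeline` and the placeholder stages are hypothetical, and real stages would call Bedrock, CTGAN, Great Expectations, and Cloud DLP instead of the toy logic shown here.

```python
import time
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[list], list]]


def run_pipeline(rows: list, stages: List[Stage]) -> Tuple[list, list]:
    """Run stages in order and record a trace entry per stage."""
    trace = []
    for name, stage in stages:
        started = time.perf_counter()
        rows = stage(rows)
        trace.append({
            "stage": name,
            "rows_out": len(rows),
            "seconds": round(time.perf_counter() - started, 4),
        })
    return rows, trace


# Placeholder stages (assumptions for illustration only).
def extract(_rows):
    return [{"id": 1, "amount": 20.0}, {"id": 2, "amount": -5.0}]


def generate(rows):
    # Stand-in for model-based generation: copy each row under a new id.
    return rows + [{"id": r["id"] + len(rows), "amount": r["amount"]} for r in rows]


def validate(rows):
    # Keep only rows that satisfy a simple rule.
    return [r for r in rows if r["amount"] >= 0]


stages = [("extract", extract), ("generate", generate), ("validate", validate)]
result, trace = run_pipeline([], stages)
for entry in trace:
    print(entry)
```

Because each stage only takes and returns a list of rows, a stage can be swapped or reordered without touching the runner.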

Challenges we ran into

Field Mapping Complexity: Particularly when using generators like Faker or multiple LLMs (Claude, Gemini, Mistral), we ran into schema mismatches and parameter alignment issues.

Processing Time: Generating larger datasets (250+ rows) with GenAI alone took roughly 7–8 minutes, which led us to hand off scaling to CTGAN.

Model Compatibility: Some models performed well on simple data but failed with complex or relational schemas.

Validation Accuracy: Getting generated data to consistently reach 90%+ validation accuracy across formats and data types took significant tuning.
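
The write-up does not define how validation accuracy is computed. One plausible reading is the share of expectation checks that pass, which the sketch below computes in plain Python rather than through the Great Expectations API. The checks, field names, and sample rows are hypothetical.

```python
def pass_rate(rows, checks):
    """Return the fraction of (row, check) pairs that pass."""
    total = len(rows) * len(checks)
    if total == 0:
        return 0.0
    passed = sum(1 for row in rows for check in checks.values() if check(row))
    return passed / total


# Hypothetical expectations, written as simple predicates.
checks = {
    "age_in_range": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
    "email_has_at": lambda r: "@" in str(r.get("email", "")),
    "country_known": lambda r: r.get("country") in {"US", "IN", "DE"},
}

rows = [
    {"age": 34, "email": "a@example.com", "country": "US"},
    {"age": 51, "email": "b@example.com", "country": "IN"},
    {"age": 150, "email": "c@example.com", "country": "DE"},  # fails age_in_range
    {"age": 28, "email": "d@example.com", "country": "DE"},
]

THRESHOLD = 0.90  # mirrors the 90%+ target mentioned above
rate = pass_rate(rows, checks)
accepted = rate >= THRESHOLD
print(f"pass rate: {rate:.1%}, accepted: {accepted}")
```

Here 11 of the 12 row-level checks pass, so the batch clears the 90% bar but would not clear a stricter 92% one.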

Accomplishments that we're proud of

Successfully built a cross-compatible pipeline capable of ingesting and pushing data across cloud and file-based sources.

Achieved 92–98% validation accuracy using a hybrid GenAI + CTGAN approach.

Seamlessly integrated Great Expectations to generate human-readable validation reports.

Ensured privacy compliance using automated DLP masking—no real data was exposed at any stage.

Created a plug-and-play framework adaptable to various industries like healthcare, finance, and research.
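
To make the masking idea concrete, here is a minimal sketch that uses regular expressions and hashing as a stand-in for Cloud DLP. It is not the DLP API and is far less thorough than a managed inspection service. The patterns, the `mask_record` helper, and the salt value are assumptions for illustration.

```python
import hashlib
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_text(text):
    """Replace matched emails and SSN-shaped numbers with fixed tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return US_SSN.sub("[SSN]", text)


def pseudonymize(value, salt="demo-salt"):
    """Deterministic token, so the same input always maps to the same output."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return "id_" + digest[:12]


def mask_record(record, sensitive_fields):
    """Pseudonymize named fields; scrub free text in every other string field."""
    masked = {}
    for key, value in record.items():
        if key in sensitive_fields:
            masked[key] = pseudonymize(value)
        elif isinstance(value, str):
            masked[key] = mask_text(value)
        else:
            masked[key] = value
    return masked


# Hypothetical record, invented for this example.
record = {
    "patient_id": "P-1001",
    "note": "Contact jane.doe@example.com, SSN 123-45-6789.",
    "age": 42,
}
masked = mask_record(record, sensitive_fields={"patient_id"})
print(masked)
```

Deterministic pseudonyms keep identifiers consistent across tables, which matters when relational data is masked before generation.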

What we learned

GenAI models like Claude 3.7 and Claude 4 excel at contextual generation but benefit from prompt engineering and schema sampling on structured tasks.

Hybrid pipelines (GenAI + GANs) can outperform single-model solutions, especially at scale.

Validation isn't just a final step—it must be embedded in the generation flow to maintain consistency.

Modularity matters—designing each phase (generation, validation, ingestion) as reusable components allows for rapid scaling and customization.
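
The point about embedding validation in the generation flow can be sketched as a loop that validates every batch before keeping any of it and asks the generator for more until the target is met. This is an illustrative sketch only: `generate_validated`, the fake generator, and the age rule are hypothetical stand-ins for an LLM or CTGAN call and a real expectation suite.

```python
import random


def generate_validated(generate_batch, is_valid, target_rows, max_attempts=10):
    """Collect valid rows batch by batch, validating before anything is kept."""
    accepted = []
    attempts = 0
    while len(accepted) < target_rows and attempts < max_attempts:
        attempts += 1
        batch = generate_batch(target_rows - len(accepted))
        accepted.extend(row for row in batch if is_valid(row))
    return accepted[:target_rows], attempts


# Hypothetical generator standing in for an LLM or CTGAN call.
rng = random.Random(7)  # seeded so runs are repeatable


def fake_generator(n):
    return [{"age": rng.randint(-10, 100)} for _ in range(n)]


def age_ok(row):
    return 0 <= row["age"] <= 100


rows, attempts = generate_validated(fake_generator, age_ok, target_rows=20)
print(len(rows), "valid rows after", attempts, "attempt(s)")
```

The `max_attempts` cap keeps a generator that never produces valid rows from looping forever, and the generator and validator stay separate, reusable components.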

What's next for GenAI Powered Synthetic Data Generation and Validation

Real-Time Streaming Support: Add capability to generate and validate synthetic data on-the-fly from streaming platforms like Kafka or Pub/Sub.

Intelligent Schema Mapping Assistant: Automate field-type detection and transformation using advanced GenAI logic.

Enhanced Analytics Layer: Integrate dashboards to visualize data quality metrics, lineage, and synthetic-to-real comparison.

Multi-Model Flexibility: Expand model options (Mistral, GPT-4, Gemini) based on data type and complexity.

Synthetic Data as a Service (S-DaaS): Package the solution as a cloud-native service with APIs and UI, enabling plug-and-play data generation for enterprises.

Agentic Generation: Use agentic AI with AWS MCP servers to generate synthetic data and save it back to the database.

Built With

amazon-web-services
aws-bedrock-(claude-3.7)
aws-lambda
bigquery
cloud-dlp-(sensitive-data-protection-api)
cloudsql
csv-parser
ctgan
docker
excel-parser
firestore
github
great-expectations
json-parser
numpy
pandas
s3
scikit-learn
serverless-framework
terraform

Updates

Komal Kekare started this project — Jul 09, 2025 03:00 PM EDT
