Inspiration

In many real-world domains such as healthcare, finance, and education, high-quality labeled data is hard to obtain because of privacy concerns, legal restrictions, or a simple lack of availability. While building AI models for academic and hackathon projects, I repeatedly ran into setbacks caused by small or incomplete datasets. This motivated me to build a tool that generates realistic, domain-specific synthetic data quickly, privately, and with a high degree of customization, to fill that gap.

What It Does

The Smart Synthetic Data Generator allows users to:

  • Select a domain (e.g., healthcare, finance, retail, education)
  • Instantly generate structured, realistic datasets using intelligent field-type logic
  • Export synthetic data as CSV for immediate use in model training or testing

It supports deep customization through schema JSON files and can be extended easily for new domains.
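As an illustration of what a schema file could contain, the sketch below parses a hypothetical healthcare schema and checks that each field declares a name and a type. The keys (`domain`, `fields`, `min`, `max`, `options`), the type labels, and the `load_schema` helper are assumptions made for this example, not the project's actual format.

```python
import json

# Hypothetical schema text; the project's real keys and type labels may differ.
SCHEMA_TEXT = """
{
  "domain": "healthcare",
  "fields": [
    {"name": "patient_id", "type": "integer", "min": 1000, "max": 9999},
    {"name": "blood_type", "type": "choice", "options": ["A", "B", "AB", "O"]},
    {"name": "age", "type": "integer", "min": 0, "max": 100}
  ]
}
"""


def load_schema(text):
    """Parse a schema and verify that every field has a name and a type."""
    schema = json.loads(text)
    for field in schema["fields"]:
        if "name" not in field or "type" not in field:
            raise ValueError(f"Field is missing a name or type: {field}")
    return schema


schema = load_schema(SCHEMA_TEXT)
field_names = [field["name"] for field in schema["fields"]]
```

Under this layout, adding a new domain would amount to writing another JSON file with its own field list, with no change to the generator code.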

How I Built It

  • Frontend: Built with Streamlit for rapid prototyping and an interactive UI.
  • Backend Logic: Python, using Faker, pandas, and custom field rules for realism.
  • Schemas: Domain-specific .json schemas define field types and the relations between fields.
  • Deployment: Deployed on a hosting platform such as Render or Streamlit Cloud for public access.

Code example (the core step that fills each field of a generated row):

row[field_name] = self.generate_field_value(field_type)
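To show where that line might sit, here is a minimal sketch of a generator class built around it. Only `generate_field_value`, `row`, `field_name`, and `field_type` come from the snippet above; the class name, the `generate_row` and `to_csv` methods, and the type labels are hypothetical. The project itself uses Faker and pandas; this sketch swaps in the standard library's `random` and `csv` modules so that it runs without extra dependencies.

```python
import csv
import io
import random


class SyntheticGenerator:
    """Hypothetical generator; stdlib random stands in for Faker here."""

    def __init__(self, fields, seed=None):
        self.fields = fields  # list of (field_name, field_type) pairs
        self.rng = random.Random(seed)  # seeded for repeatable output

    def generate_field_value(self, field_type):
        # The type labels below are illustrative assumptions.
        if field_type == "integer":
            return self.rng.randint(0, 100)
        if field_type == "float":
            return round(self.rng.uniform(0, 1000), 2)
        if field_type == "boolean":
            return self.rng.choice([True, False])
        raise ValueError(f"Unknown field type: {field_type}")

    def generate_row(self):
        row = {}
        for field_name, field_type in self.fields:
            row[field_name] = self.generate_field_value(field_type)
        return row

    def to_csv(self, n_rows):
        """Return n_rows generated rows as CSV text with a header line."""
        buffer = io.StringIO()
        names = [name for name, _ in self.fields]
        writer = csv.DictWriter(buffer, fieldnames=names)
        writer.writeheader()
        for _ in range(n_rows):
            writer.writerow(self.generate_row())
        return buffer.getvalue()


generator = SyntheticGenerator(
    [("age", "integer"), ("balance", "float"), ("active", "boolean")],
    seed=42,
)
csv_text = generator.to_csv(5)
```

Dispatching on a type label keeps each field rule in one place, so supporting a new field type means adding one branch rather than touching the row loop.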
