Inspiration

The need for high-quality, privacy-preserving synthetic data is growing rapidly across industries like healthcare, finance, and retail. Real-world datasets are often limited by privacy concerns, regulatory restrictions, or data scarcity. SmartSynth was inspired by the desire to empower data scientists and ML practitioners to generate realistic, safe-to-share synthetic datasets that accelerate innovation while protecting sensitive information.

What it does

SmartSynth is a domain-agnostic synthetic data generation framework that supports four data modalities: tabular, time-series, text, and image data. It provides both a web interface (built with Streamlit) and a command-line interface. Users can upload datasets, profile and visualize them, configure generation and privacy settings, generate synthetic data with a model suited to the modality (CTGAN, TVAE, or CopulaGAN for tabular data; TimeGAN for time series; transformer-based models for text; diffusion-based models for images), and evaluate the quality and privacy of the results.

How we built it

  • Frontend: Streamlit for an interactive web UI.
  • Backend: Python, leveraging PyTorch, scikit-learn, SDV, ydata-synthetic, transformers, and diffusers.
  • Architecture: Modular design with a generator factory pattern for easy extension, utility modules for profiling, visualization, and evaluation, and privacy filters for differential privacy, k-anonymity, and membership inference protection.
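
To illustrate the generator factory pattern mentioned above, here is a minimal sketch. It is not SmartSynth's actual code: the names `register`, `create_generator`, and `BootstrapTabularGenerator` are hypothetical, and the toy generator simply resamples observed column values as a stand-in for a real model such as CTGAN.

```python
import random
from typing import Callable, Dict, List, Protocol


class Generator(Protocol):
    """Interface every generator is assumed to share (hypothetical)."""

    def fit(self, data: List[dict]) -> None: ...
    def sample(self, n: int) -> List[dict]: ...


# Hypothetical registry: modality name -> zero-argument generator constructor.
_REGISTRY: Dict[str, Callable[[], Generator]] = {}


def register(modality: str):
    """Class decorator that adds a generator class to the registry."""
    def decorator(cls):
        _REGISTRY[modality] = cls
        return cls
    return decorator


@register("tabular")
class BootstrapTabularGenerator:
    """Toy stand-in for a real tabular model: samples each column independently."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self._columns: Dict[str, list] = {}

    def fit(self, data: List[dict]) -> None:
        if not data:
            raise ValueError("cannot fit on an empty dataset")
        # Remember the observed values of every column.
        self._columns = {key: [row[key] for row in data] for key in data[0]}

    def sample(self, n: int) -> List[dict]:
        # Draw each column value independently from its observed values.
        return [
            {key: self._rng.choice(values) for key, values in self._columns.items()}
            for _ in range(n)
        ]


def create_generator(modality: str) -> Generator:
    """Factory: look up the modality and build the matching generator."""
    try:
        return _REGISTRY[modality]()
    except KeyError:
        raise ValueError(f"no generator registered for modality {modality!r}") from None
```

With this layout, supporting a new modality would only require a new class decorated with `@register("...")`; callers keep using `create_generator` unchanged.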

Challenges we ran into

  • Integrating multiple generative models and ensuring seamless switching between data modalities.
  • Balancing data utility and privacy, especially when applying differential privacy and k-anonymity.
  • Designing a user interface that is both powerful for advanced users and accessible for beginners.
  • Ensuring compatibility and performance across different environments and hardware setups.
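
The privacy and utility tension with k-anonymity can be made concrete with a small sketch. This is an illustration under assumptions, not the project's privacy filter: the helper names `k_anonymity` and `suppress_small_groups` and the example column names are hypothetical. A dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k rows; enforcing that by suppression removes rows, which is where utility is lost.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def _group_key(row: Dict[str, object], quasi_identifiers: Sequence[str]) -> Tuple:
    """Values of the quasi-identifier columns for one row."""
    return tuple(row[col] for col in quasi_identifiers)


def k_anonymity(rows: List[Dict[str, object]], quasi_identifiers: Sequence[str]) -> int:
    """Return the size of the smallest quasi-identifier group (0 if no rows)."""
    if not rows:
        return 0
    groups = Counter(_group_key(row, quasi_identifiers) for row in rows)
    return min(groups.values())


def suppress_small_groups(
    rows: List[Dict[str, object]], quasi_identifiers: Sequence[str], k: int
) -> List[Dict[str, object]]:
    """Drop every row whose quasi-identifier group has fewer than k members."""
    groups = Counter(_group_key(row, quasi_identifiers) for row in rows)
    return [row for row in rows if groups[_group_key(row, quasi_identifiers)] >= k]
```

For example, with three rows where two share `("30-39", "021")` and one has `("40-49", "021")`, `k_anonymity` reports 1; suppressing with `k=2` leaves two rows, so a third of the data is the price of 2-anonymity.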

Accomplishments that we're proud of

  • Supporting end-to-end synthetic data workflows for tabular, time-series, text, and image data.
  • Implementing robust privacy-preserving techniques and comprehensive evaluation metrics.
  • Delivering both a web-based and command-line interface for maximum flexibility.
  • Achieving a modular, extensible codebase that can be easily adapted for new data types and models.

What we learned

  • The importance of modular design for supporting rapid prototyping and extension.
  • The trade-offs between privacy and data utility, and how different privacy techniques impact downstream ML tasks.
  • How to leverage open-source generative models and adapt them for practical, real-world synthetic data generation.
  • The value of clear documentation and intuitive UI/UX in driving adoption.
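
One way to see the privacy and utility trade-off noted above is the Laplace mechanism for a differentially private mean. The sketch below is a textbook illustration, not SmartSynth's implementation. It assumes the number of records is public and that neighboring datasets differ by replacing one record, so the sensitivity of the clipped mean is `(upper - lower) / n`; the function names are hypothetical.

```python
import random
from typing import Optional, Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(
    values: Sequence[float],
    lower: float,
    upper: float,
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Epsilon-DP estimate of the mean of values clipped to [lower, upper]."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not values:
        raise ValueError("values must not be empty")
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    rng = rng or random.Random()
    # Clipping bounds the influence any single record can have on the mean.
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Assumption: n is public and one record is replaced between neighbors.
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)
```

The noise scale is `sensitivity / epsilon`, so a smaller epsilon (stronger privacy) gives a noisier mean, and downstream models trained on such statistics see less accurate inputs.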

What's next for SmartSynth

  • Adding more advanced generative models, such as TimeLLMs for time-series and prompt-controlled LLMs for text.
  • Adding fine-grained privacy controls, including customizable differential privacy parameters.
  • Enhanced evaluation dashboards with more visualizations and interpretability tools.
  • REST API for programmatic access and integration with ML pipelines.
  • Expanded deployment options, including cloud-native and enterprise-ready solutions.

Built With

  • matplotlib
  • python
  • sdv
  • streamlit
  • ydata-synthetic