SmartSynth

Inspiration

The need for high-quality, privacy-preserving synthetic data is growing rapidly across industries like healthcare, finance, and retail. Real-world datasets are often limited by privacy concerns, regulatory restrictions, or data scarcity. SmartSynth was inspired by the desire to empower data scientists and ML practitioners to generate realistic, safe-to-share synthetic datasets that accelerate innovation while protecting sensitive information.

What it does

SmartSynth is a domain-agnostic synthetic data generation framework that supports tabular, time-series, text, and image data. It provides both a web interface built with Streamlit and a command-line interface. Users can upload a dataset, profile and visualize it, configure generation and privacy settings, generate synthetic data, and evaluate the quality and privacy of the results. Supported models include CTGAN, TVAE, and CopulaGAN for tabular data, TimeGAN for time series, transformer-based models for text, and diffusion-based models for images.

How we built it

Frontend: Streamlit for an interactive web UI.
Backend: Python, leveraging PyTorch, scikit-learn, SDV, ydata-synthetic, transformers, and diffusers.
Architecture: Modular design with a generator factory pattern for easy extension, utility modules for profiling, visualization, and evaluation, and privacy filters for differential privacy, k-anonymity, and membership inference protection.
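To illustrate the generator factory idea, here is a minimal sketch of a registry-based factory in plain Python. It is not SmartSynth's actual code: `BaseGenerator`, `register`, `create_generator`, and the toy `BootstrapGenerator` are hypothetical names chosen for this example, and the toy generator only resamples column values, so it offers no privacy protection.

```python
import random
from typing import Callable, Dict, List


class BaseGenerator:
    """Hypothetical common interface; the real base class may differ."""

    def fit(self, rows: List[dict]) -> "BaseGenerator":
        raise NotImplementedError

    def sample(self, n: int) -> List[dict]:
        raise NotImplementedError


# Maps a short model name to the class (or callable) that builds it.
_REGISTRY: Dict[str, Callable[..., BaseGenerator]] = {}


def register(name: str):
    """Class decorator that adds a generator to the registry under `name`."""
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls
    return decorator


def create_generator(name: str, **kwargs) -> BaseGenerator:
    """Look up a generator by name and construct it with the given options."""
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(
            f"Unknown generator {name!r}; available: {sorted(_REGISTRY)}"
        ) from None
    return factory(**kwargs)


@register("bootstrap")
class BootstrapGenerator(BaseGenerator):
    """Toy stand-in for a real model: resamples each column independently."""

    def __init__(self, seed: int = 0):
        self._rng = random.Random(seed)
        self._columns: Dict[str, list] = {}

    def fit(self, rows: List[dict]) -> "BootstrapGenerator":
        if not rows:
            raise ValueError("cannot fit on an empty dataset")
        # Store the observed values of every column.
        self._columns = {key: [row[key] for row in rows] for key in rows[0]}
        return self

    def sample(self, n: int) -> List[dict]:
        # Draw each column value independently from its observed values.
        return [
            {key: self._rng.choice(values) for key, values in self._columns.items()}
            for _ in range(n)
        ]
```

With this layout, adding a new model means writing one class and decorating it with `@register("some-name")`. The UI and CLI can then call `create_generator(name, **options)` without knowing which class sits behind the name.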

Challenges we ran into

Integrating multiple generative models and ensuring seamless switching between data modalities.
Balancing data utility and privacy, especially when applying differential privacy and k-anonymity.
Designing a user interface that is both powerful for advanced users and accessible for beginners.
Ensuring compatibility and performance across different environments and hardware setups.
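The utility cost of k-anonymity mentioned above can be made concrete with a small sketch. It assumes rows are plain dictionaries and that the caller names the quasi-identifier columns. The function names are hypothetical and this is not SmartSynth's privacy filter. It enforces k-anonymity by suppression only, which is the simplest strategy and not necessarily the one the project uses.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def k_anonymity(rows: Sequence[dict], quasi_identifiers: Iterable[str]) -> int:
    """Return the size of the smallest group of rows that share the same
    quasi-identifier values (0 for an empty dataset)."""
    keys = list(quasi_identifiers)
    if not rows:
        return 0
    groups = Counter(tuple(row[key] for key in keys) for row in rows)
    return min(groups.values())


def suppress_small_groups(
    rows: Sequence[dict], quasi_identifiers: Iterable[str], k: int
) -> List[dict]:
    """Drop every row whose quasi-identifier group has fewer than k members."""
    keys = list(quasi_identifiers)
    groups = Counter(tuple(row[key] for key in keys) for row in rows)
    return [row for row in rows if groups[tuple(row[key] for key in keys)] >= k]


# Illustrative data only; column names and values are made up.
records = [
    {"age_band": "30-39", "zip3": "021", "diagnosis": "A"},
    {"age_band": "30-39", "zip3": "021", "diagnosis": "B"},
    {"age_band": "40-49", "zip3": "021", "diagnosis": "A"},
]
kept = suppress_small_groups(records, ["age_band", "zip3"], k=2)
```

In this example the third record is unique on `age_band` and `zip3`, so reaching k = 2 means dropping it. Every suppressed row is information a downstream model no longer sees, which is the utility side of the trade-off.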

Accomplishments that we're proud of

Supporting end-to-end synthetic data workflows for tabular, time-series, text, and image data.
Implementing robust privacy-preserving techniques and comprehensive evaluation metrics.
Delivering both a web-based and command-line interface for maximum flexibility.
Achieving a modular, extensible codebase that can be easily adapted for new data types and models.
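The source does not list which evaluation metrics SmartSynth computes. As one example of the kind of fidelity metric such a toolkit might include, the sketch below computes the total variation distance between a categorical column in the real data and the same column in the synthetic data. The function name is an assumption made for this illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def total_variation_distance(
    real: Sequence[Hashable], synthetic: Sequence[Hashable]
) -> float:
    """Total variation distance between the empirical distributions of two
    categorical columns: 0.0 means identical frequencies, 1.0 means no overlap."""
    if not real or not synthetic:
        raise ValueError("both columns must be non-empty")
    p, q = Counter(real), Counter(synthetic)
    n_p, n_q = len(real), len(synthetic)
    # Counter returns 0 for categories that appear in only one column.
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in set(p) | set(q))
```

A per-column score like this only checks marginal distributions. It says nothing about correlations between columns or about privacy, so it would normally be one metric among several.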

What we learned

The importance of modular design for supporting rapid prototyping and extension.
The trade-offs between privacy and data utility, and how different privacy techniques impact downstream ML tasks.
How to leverage open-source generative models and adapt them for practical, real-world synthetic data generation.
The value of clear documentation and intuitive UI/UX in driving adoption.
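The privacy and utility trade-off noted in this list can be seen in a few lines using the Laplace mechanism, a standard building block of differential privacy. This is a sketch under stated assumptions: values are clipped to a known range, the dataset size is treated as public, and the function names are hypothetical. It is not SmartSynth's implementation, and Python's `random` module is not a cryptographically secure noise source.

```python
import random
from typing import Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(
    values: Sequence[float],
    lower: float,
    upper: float,
    epsilon: float,
    rng: random.Random,
) -> float:
    """Mean of values clipped to [lower, upper] with Laplace noise calibrated
    to epsilon. Assumes the number of records is public."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not values:
        raise ValueError("values must be non-empty")
    clipped = [min(max(v, lower), upper) for v in values]
    # Changing one record moves the clipped mean by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    true_mean = sum(clipped) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(epsilon: float, trials: int = 500, seed: int = 0) -> float:
    """Average absolute error of dp_mean on a fixed toy dataset."""
    rng = random.Random(seed)
    data = [20.0, 35.0, 41.0, 58.0, 63.0]  # made-up ages
    true_mean = sum(data) / len(data)
    errors = [
        abs(dp_mean(data, 0.0, 100.0, epsilon, rng) - true_mean)
        for _ in range(trials)
    ]
    return sum(errors) / trials
```

Comparing `mean_abs_error(0.1)` with `mean_abs_error(10.0)` shows the effect: a smaller epsilon gives a stronger privacy guarantee but adds noise with a larger scale, so the released statistic is less accurate.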

What's next for SmartSynth

Adding more advanced generative models, such as TimeLLMs for time-series and prompt-controlled LLMs for text.
Fine-grained privacy controls, including customizable differential privacy parameters.
Enhanced evaluation dashboards with more visualizations and interpretability tools.
REST API for programmatic access and integration with ML pipelines.
Expanded deployment options, including cloud-native and enterprise-ready solutions.

Built With

matplotlib
python
sdv
streamlit
ydata-synthetic

Updates

Vishal Kumar started this project — Jul 09, 2025 09:41 AM EDT
