Inspiration
The need for high-quality, privacy-preserving synthetic data is growing rapidly across industries like healthcare, finance, and retail. Real-world datasets are often limited by privacy concerns, regulatory restrictions, or data scarcity. SmartSynth was inspired by the desire to empower data scientists and ML practitioners to generate realistic, safe-to-share synthetic datasets that accelerate innovation while protecting sensitive information.
What it does
SmartSynth is a domain-agnostic synthetic data generation framework that supports tabular, time-series, text, and image data. It provides both a web interface (built with Streamlit) and a command-line interface. Users can upload datasets, profile and visualize them, configure generation and privacy settings, generate synthetic data, and evaluate the quality and privacy of the results. Supported models include CTGAN, TVAE, CopulaGAN, and TimeGAN, along with transformer-based text models and diffusion-based image models.
How we built it
- Frontend: Streamlit for an interactive web UI.
- Backend: Python, leveraging PyTorch, scikit-learn, SDV, ydata-synthetic, transformers, and diffusers.
- Architecture: Modular design with a generator factory pattern for easy extension, utility modules for profiling, visualization, and evaluation, and privacy filters for differential privacy, k-anonymity, and membership inference protection.
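To make the generator factory idea concrete, here is a minimal sketch of how such a registry could look. It is not SmartSynth's actual code: `BaseGenerator`, `register`, `create_generator`, and the toy `BootstrapGenerator` are hypothetical names chosen for illustration, and the toy generator only resamples rows, so it stands in for a real model such as CTGAN without offering any privacy protection.

```python
import random
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type


class BaseGenerator(ABC):
    """Interface every generator is assumed to share (hypothetical)."""

    @abstractmethod
    def fit(self, data: List[dict]) -> None:
        ...

    @abstractmethod
    def sample(self, n: int) -> List[dict]:
        ...


# Maps a short name (e.g. chosen in the UI or CLI) to a generator class.
_REGISTRY: Dict[str, Type[BaseGenerator]] = {}


def register(name: str) -> Callable[[Type[BaseGenerator]], Type[BaseGenerator]]:
    """Class decorator that adds a generator to the registry under `name`."""
    def decorator(cls: Type[BaseGenerator]) -> Type[BaseGenerator]:
        _REGISTRY[name] = cls
        return cls
    return decorator


@register("bootstrap")
class BootstrapGenerator(BaseGenerator):
    """Toy placeholder: resamples training rows. Not a real generative model."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self._rows: List[dict] = []

    def fit(self, data: List[dict]) -> None:
        self._rows = list(data)

    def sample(self, n: int) -> List[dict]:
        if not self._rows:
            raise RuntimeError("call fit() with at least one row before sample()")
        return [dict(self._rng.choice(self._rows)) for _ in range(n)]


def create_generator(name: str, **kwargs) -> BaseGenerator:
    """Factory: look up a generator by name and construct it."""
    try:
        cls = _REGISTRY[name]
    except KeyError:
        raise ValueError(
            f"Unknown generator {name!r}; available: {sorted(_REGISTRY)}"
        ) from None
    return cls(**kwargs)
```

With this layout, supporting a new model or modality means writing one class and decorating it with `@register("some-name")`; the calling code keeps using `create_generator(name)` unchanged.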
Challenges we ran into
- Integrating multiple generative models and ensuring seamless switching between data modalities.
- Balancing data utility and privacy, especially when applying differential privacy and k-anonymity.
- Designing a user interface that is both powerful for advanced users and accessible for beginners.
- Ensuring compatibility and performance across different environments and hardware setups.
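The utility versus privacy tension mentioned above can be illustrated with k-anonymity. The sketch below is a simplified illustration, not SmartSynth's privacy filter: `k_anonymity` and `generalize_age` are hypothetical helpers, and the column names are made up. It measures the smallest group of rows that share the same quasi-identifier values, then shows how coarsening a column raises k at the cost of detail.

```python
from collections import Counter
from typing import Dict, List, Sequence


def k_anonymity(rows: List[Dict], quasi_identifiers: Sequence[str]) -> int:
    """Return the size of the smallest group of rows that share the same
    values on all quasi-identifier columns (0 for an empty dataset)."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values()) if counts else 0


def generalize_age(row: Dict, width: int = 10) -> Dict:
    """Replace an exact integer age with a bucket such as '20-29'."""
    low = (row["age"] // width) * width
    return {**row, "age": f"{low}-{low + width - 1}"}


rows = [
    {"age": 23, "zip": "10001"},
    {"age": 27, "zip": "10001"},
    {"age": 34, "zip": "10001"},
    {"age": 38, "zip": "10001"},
]

k_before = k_anonymity(rows, ["age", "zip"])
k_after = k_anonymity([generalize_age(r) for r in rows], ["age", "zip"])
```

Here `k_before` is 1, because every exact age is unique, and `k_after` is 2 once ages are bucketed by decade. The privacy gain comes directly from discarding the exact ages, which is the utility cost a downstream model has to absorb.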
Accomplishments that we're proud of
- Supporting end-to-end synthetic data workflows for tabular, time-series, text, and image data.
- Implementing robust privacy-preserving techniques and comprehensive evaluation metrics.
- Delivering both a web-based and command-line interface for maximum flexibility.
- Achieving a modular, extensible codebase that can be easily adapted for new data types and models.
What we learned
- The importance of modular design for supporting rapid prototyping and extension.
- The trade-offs between privacy and data utility, and how different privacy techniques impact downstream ML tasks.
- How to leverage open-source generative models and adapt them for practical, real-world synthetic data generation.
- The value of clear documentation and intuitive UI/UX in driving adoption.
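The privacy and utility trade-off noted in this list can be made concrete with a small differential privacy example. The following sketch uses the standard Laplace mechanism to release a mean; it is an illustration under stated assumptions, not SmartSynth's implementation. `dp_mean` and `laplace_noise` are hypothetical helpers, the dataset size is assumed to be public, and the floating-point sampling is not hardened for production use.

```python
import random
from typing import Optional, Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponential
    variables, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(
    values: Sequence[float],
    lower: float,
    upper: float,
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Epsilon-differentially-private mean of values clipped to [lower, upper].

    Assumes the number of records is public, so changing one record moves
    the clipped mean by at most (upper - lower) / n.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not values:
        raise ValueError("values must be non-empty")
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    rng = rng or random.Random()
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clipped)
    true_mean = sum(clipped) / len(clipped)
    # Smaller epsilon means a larger noise scale: more privacy, less accuracy.
    return true_mean + laplace_noise(sensitivity / epsilon, rng)
```

The noise scale is `sensitivity / epsilon`, so halving epsilon doubles the expected absolute error of the released statistic. Any downstream task that depends on that statistic inherits the extra error.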
What's next for SmartSynth
- More advanced generative models, such as TimeLLMs for time-series and prompt-controlled LLMs for text.
- Fine-grained privacy controls, including customizable differential privacy parameters.
- Enhanced evaluation dashboards with more visualizations and interpretability tools.
- REST API for programmatic access and integration with ML pipelines.
- Expanded deployment options, including cloud-native and enterprise-ready solutions.
Built With
- matplotlib
- python
- sdv
- streamlit
- ydata-synthetic