🚀 Inspiration

Data privacy and scarcity are key challenges in AI model training, especially in healthcare, finance, and retail. We were inspired to build a secure, scalable tool that empowers developers to generate high-quality synthetic datasets, enabling faster, safer, and smarter model development without compromising real-world data.


💡 What it does

SynthGen AI is an intelligent synthetic data generation platform. Users can upload a simple CSV schema and, with a few clicks, generate a rich, domain-specific dataset. The platform features secure user authentication through AWS Cognito (with a seamless guest mode), and allows fine-grained control over the output, including adjustable noise levels, class balancing for ML training, and robust PII masking. To make the process even smoother, a smart AI agent is available to guide users through the configuration options.
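The three output controls described above can be illustrated with a minimal standard-library sketch. This is not the project's actual code: the function names, the regular expressions, and the oversampling strategy are all assumptions chosen for illustration, and real PII detection would need much broader coverage than two patterns.

```python
import random
import re

# Hypothetical patterns for illustration; not the project's actual PII rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def mask_pii(text):
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def add_noise(value, noise_level, rng):
    """Perturb a numeric value with Gaussian noise scaled by noise_level.

    A noise_level of 0.0 returns the value unchanged; 0.1 uses a standard
    deviation equal to 10% of the value's magnitude.
    """
    if noise_level <= 0:
        return value
    return value + rng.gauss(0.0, noise_level * abs(value))


def balance_classes(rows, label_key, rng):
    """Oversample minority classes until every class matches the largest."""
    if not rows:
        return []
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra rows (with replacement) to reach the target size.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced
```

Passing an explicit `random.Random` instance keeps generation reproducible when a seed is supplied, which is useful when the same dataset has to be regenerated for a later training run.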

Update: The app has continued to receive updates, most recently on 13/10/2025. Major changes on 13/10/2025:

  • Enhanced the architecture diagram and the SynthGen app in general.
  • Revised the Flask code for better code execution.
  • Updated the HTML templates; aiagent.html and feedback.html in particular were improved for a better developer experience and code execution environment.
  • Reflected these changes in readme.md and across the app.
  • Changed the code repository from private to public.


🛠️ How we built it

We designed a robust, cloud-native architecture to power SynthGen AI, ensuring scalability, security, and performance. The application is built on a foundation of powerful AWS services, orchestrated by a Python backend.

  • Frontend: HTML templates served by Flask and styled with Bootstrap for a clean, responsive experience.
  • Backend & Application Logic: A core Flask application deployed on AWS Elastic Beanstalk, providing a scalable and managed environment.
  • AI & Data Generation: Integrated Amazon Bedrock (Titan Model) to power both the intelligent user-assistance agent and the core logic for generating high-quality, context-aware synthetic data.
  • Authentication: Secure and reliable user management handled by Amazon Cognito, which integrates seamlessly with our application and provides the Cognito Hosted UI.
  • Database: Amazon DynamoDB serves as our primary data store for generated synthetic data records and user feedback, offering high performance and scalability.
  • DevOps & Deployment: The entire stack is deployed via AWS Elastic Beanstalk for automated provisioning and management of the underlying EC2 infrastructure.

Check out our full architecture below:

[Figure: SynthGen AI architecture diagram]
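As a rough illustration of the Bedrock integration listed among the components, the sketch below builds a Titan text request and sends it through the `bedrock-runtime` client in boto3. Several details are assumptions: the write-up names only "Titan", so the model ID is one possible choice, and the prompt wording, region, generation settings, and function names are hypothetical.

```python
import json

# Assumption: the write-up says only "Titan"; this is one Titan text model ID.
MODEL_ID = "amazon.titan-text-express-v1"


def build_titan_request(columns, n_rows, temperature=0.7):
    """Build the JSON body for a Titan text model asking for CSV rows."""
    prompt = (
        f"Generate {n_rows} rows of realistic synthetic CSV data "
        f"with the columns: {', '.join(columns)}. "
        "Return only CSV rows, without a header."
    )
    return json.dumps({
        "inputText": prompt,
        "textGenerationConfig": {
            "maxTokenCount": 2048,
            "temperature": temperature,
            "topP": 0.9,
        },
    })


def generate_rows(columns, n_rows, region="us-east-1"):
    """Send the request to Bedrock and return the raw generated text."""
    import boto3  # imported lazily so the request builder works without AWS

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(
        modelId=MODEL_ID,
        body=build_titan_request(columns, n_rows),
        contentType="application/json",
        accept="application/json",
    )
    payload = json.loads(response["body"].read())
    return payload["results"][0]["outputText"]
```

Keeping the request builder separate from the network call means the prompt and generation settings can be unit tested without AWS credentials.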


⚠️ Challenges we ran into

  • Cognito & Flask Integration: Configuring the Cognito OAuth flow with Flask and Authlib required careful handling of redirect URIs and token exchanges, especially within the Elastic Beanstalk environment.
  • HTTPS & Certificates: Ensuring secure HTTPS on our custom Elastic Beanstalk domain involved troubleshooting certificate verification and load balancer listener rules.
  • Robust Data Parsing: Handling diverse CSV schema formats and potential edge cases in user uploads required building resilient error-handling logic.
  • Resource Management: Operating within a budget meant we had to be strategic about our AWS service choices, favoring serverless-like models where possible to manage costs effectively.
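
The resilient schema parsing mentioned in the list above might look something like the following sketch. The write-up does not specify the schema format, so this assumes a two-column `name,type` CSV; the allowed type names, the `SchemaError` class, and `parse_schema` are all hypothetical.

```python
import csv
import io

# Hypothetical set of supported column types.
ALLOWED_TYPES = {"int", "float", "string", "date", "category"}


class SchemaError(ValueError):
    """Raised when an uploaded schema cannot be used."""


def parse_schema(raw_bytes):
    """Parse a 'name,type' CSV schema into a list of (name, type) pairs."""
    try:
        # utf-8-sig also strips the byte-order mark that some exports add.
        text = raw_bytes.decode("utf-8-sig")
    except UnicodeDecodeError as exc:
        raise SchemaError("Schema must be UTF-8 encoded") from exc

    columns, seen = [], set()
    for row_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip blank rows
        if len(cells) != 2:
            raise SchemaError(f"Row {row_no}: expected 'name,type'")
        name, col_type = cells[0], cells[1].lower()
        if (name.lower(), col_type) == ("name", "type"):
            continue  # skip an optional header row
        if not name:
            raise SchemaError(f"Row {row_no}: empty column name")
        if col_type not in ALLOWED_TYPES:
            raise SchemaError(f"Row {row_no}: unknown type '{col_type}'")
        if name in seen:
            raise SchemaError(f"Row {row_no}: duplicate column '{name}'")
        seen.add(name)
        columns.append((name, col_type))

    if not columns:
        raise SchemaError("Schema contains no columns")
    return columns
```

Raising a single exception type with the row number lets the upload handler turn any failure into one user-facing message instead of a server error.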

🏆 Accomplishments that we're proud of

  • End-to-End Authentication: We successfully implemented a full authentication system with Cognito, complete with a fallback guest mode for ease of access.
  • A Complete Generation Pipeline: We built a seamless workflow from schema input all the way to a downloadable, ready-to-use CSV output.
  • Human-in-the-Loop UX: We created a smooth user experience in which an AI assistant actively guides the user, making a complex process feel simple.
  • Rapid & Secure Deployment: We're incredibly proud of designing and deploying this entire secure application on AWS in under 48 hours.

📚 What we learned

This project was a deep dive into the practicalities of building and deploying a real-world AI application on AWS. We gained significant experience in integrating multiple AWS services (Cognito, DynamoDB, Bedrock, Elastic Beanstalk) into a cohesive Python application. We also learned valuable lessons in balancing the speed of data generation with the need for accuracy and realism, and reinforced the importance of great UX, even for developer-focused tools.


🔮 What's next for SynthGen AI

The foundation is strong, and we have an exciting roadmap ahead!

  • Multimodal Data Support: Expand beyond tabular data to generate datasets with images, text, and corresponding labels.
  • Fine-Tuned Generation Models: Leverage Amazon Bedrock to fine-tune Titan or Claude models on specific data schemas for even higher-fidelity output.
  • Enhanced Accessibility: Integrate voice input for commands and offer a multilingual UI.
  • True AI Copilot: Evolve the AI assistant from a guide into a natural-language Copilot, allowing users to generate data by simply describing what they need.
  • Expanded Integrations: Enable direct export of generated datasets to Amazon S3 and RDS, and create connectors for direct ingestion into ML model training pipelines.
