Inspiration:

Access to large and diverse datasets is critical to biomedical research, but it is often limited by privacy concerns, costs, or logistical challenges. We set out to address this by using generative AI models to create high-quality synthetic datasets that mimic real-world data. The aim was to give researchers, clinicians, and data scientists a tool for generating synthetic data without compromising patient privacy, accelerating the pace of innovation in research.

What it does:

Synthia is a synthetic data generation tool powered by a fine-tuned Phi 3 LLM and NVIDIA AI Workbench. It lets users generate realistic synthetic datasets based on real-world medical data. Under the hood, a CTGAN model learns the patterns in a given dataset and generates new, synthetic records that retain the same statistical properties. Users upload a CSV file, interact with the tool through a web interface, and receive synthetic data suitable for analysis, training machine learning models, or hypothesis testing.
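As a rough illustration of the CSV-to-synthetic-data step, here is a minimal sketch using the open-source `ctgan` Python package. It is not Synthia's actual code: the helper names (`load_csv`, `discrete_columns`, `generate_synthetic`), the "non-numeric means categorical" heuristic, and the default epoch count are all assumptions made for this example, and missing values are not handled.

```python
import csv
import io


def load_csv(text):
    """Parse CSV text into a list of row dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def _is_number(value):
    """Return True if the string can be read as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def discrete_columns(rows):
    """Guess the categorical columns.

    Assumed heuristic: a column is treated as discrete if any of its values
    is non-numeric. Integer-coded categories would have to be listed by hand.
    """
    return [
        name
        for name in rows[0]
        if not all(_is_number(row[name]) for row in rows)
    ]


def generate_synthetic(csv_text, n_rows, epochs=300):
    """Fit CTGAN on the uploaded table and sample n_rows synthetic rows."""
    # Imported lazily so the helpers above work without these packages.
    import pandas as pd
    from ctgan import CTGAN

    data = pd.read_csv(io.StringIO(csv_text))
    model = CTGAN(epochs=epochs)
    # CTGAN needs to know which columns are categorical.
    model.fit(data, discrete_columns(load_csv(csv_text)))
    return model.sample(n_rows)
```

Calling `generate_synthetic(uploaded_text, 1000)` would return a table of 1,000 synthetic rows with the same columns as the upload, provided `pandas` and `ctgan` are installed.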

How we built it:

We started by integrating NVIDIA AI Workbench with a TensorFlow container to streamline our machine learning development environment. We then fine-tuned the Phi 3 model to act as a smart query handler for users who want to generate specific types of synthetic data. The synthetic data generation itself is powered by a CTGAN model, which was trained on various biomedical datasets, such as gene expression profiles. The frontend of the web application was built in TypeScript to give users a smooth, interactive experience. We also developed a backend pipeline that bridges the fine-tuned Phi 3 model and the CTGAN model, so that queries handled by Phi 3 can drive CTGAN generation.
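To make the idea of a bridge between the two models concrete, here is one way such a pipeline could be sketched. Everything in it is hypothetical: the JSON reply format, the `GenerationRequest` fields, and the `ask_llm` and `sample` callables are assumptions for illustration, standing in for the fine-tuned Phi 3 model and the trained CTGAN sampler.

```python
import json
from dataclasses import dataclass, field


@dataclass
class GenerationRequest:
    """Structured request extracted from the language model's reply."""
    n_rows: int
    columns: list = field(default_factory=list)


def parse_llm_reply(reply, max_rows=100_000):
    """Validate the JSON object the language model is assumed to return."""
    payload = json.loads(reply)
    n_rows = int(payload["n_rows"])
    if not 1 <= n_rows <= max_rows:
        raise ValueError(f"n_rows must be between 1 and {max_rows}")
    columns = [str(c) for c in payload.get("columns", [])]
    return GenerationRequest(n_rows=n_rows, columns=columns)


def handle_query(query, ask_llm, sample):
    """Route one user query: LLM reply -> validated request -> generator.

    ask_llm(query) returns a JSON string; sample(n) returns n row dicts.
    Both are injected so either model can be swapped or stubbed.
    """
    request = parse_llm_reply(ask_llm(query))
    rows = sample(request.n_rows)
    if request.columns:
        # Keep only the columns the user asked for.
        rows = [{k: row[k] for k in request.columns} for row in rows]
    return rows
```

Validating the model's reply before it reaches the generator keeps a malformed or oversized request from triggering an expensive sampling run.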

Challenges we ran into:

One of the biggest challenges was fine-tuning the Phi 3 model to understand domain-specific queries related to synthetic data generation. It was also difficult to ensure that the CTGAN model generated high-quality synthetic data that preserved the statistical integrity of the original datasets. A further challenge was integrating all the components (Phi 3, CTGAN, and NVIDIA AI Workbench) into a single, cohesive platform while maintaining efficient performance and scalability.
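One simple way to check statistical integrity is to compare per-column summary statistics of the real and synthetic tables. The sketch below is an assumed example of such a check, not the evaluation we actually ran; the function names are made up for illustration, and it only covers numeric columns.

```python
from statistics import mean, stdev


def column_summary(rows, column):
    """Mean and sample standard deviation of one numeric column."""
    values = [float(row[column]) for row in rows]
    return mean(values), stdev(values)


def fidelity_report(real, synthetic, columns):
    """Absolute gap in mean and standard deviation for each column."""
    report = {}
    for column in columns:
        real_mean, real_std = column_summary(real, column)
        synth_mean, synth_std = column_summary(synthetic, column)
        report[column] = {
            "mean_gap": abs(real_mean - synth_mean),
            "std_gap": abs(real_std - synth_std),
        }
    return report
```

Small gaps are necessary but not sufficient: matching means and spreads says nothing about correlations between columns, so a fuller check would also compare pairwise relationships.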

Accomplishments that we're proud of:

We’re proud to have fine-tuned the Phi 3 model for specialized queries and to have developed a system that generates synthetic data with strong fidelity to the original datasets. We are also proud of the user interface, which lets researchers and developers generate synthetic data in just a few clicks. Most importantly, we are proud to contribute a tool that can help accelerate biomedical research by providing a scalable and private way to generate synthetic datasets.

What we learned:

Through this project, we learned a lot about the complexities of synthetic data generation, particularly when dealing with sensitive data. Fine-tuning large language models like Phi 3 taught us how important it is to focus on domain-specific training. We also learned how to combine multiple machine learning models, such as LLMs and generative adversarial networks (GANs), in a single pipeline to solve a real-world problem. Lastly, we gained a deeper understanding of the challenges of integrating ML models with web applications and ensuring a smooth user experience.

What's next for Synthia: Synthetic Data Generation with Phi 3 & AI Workbench:

The next step for Synthia is to expand its capabilities by supporting more types of data, including medical industry data, image data, and multi-omics datasets. We also aim to improve the user experience by adding more customization options for the generated datasets, such as controlling for specific variables or setting conditions based on research needs. Another goal is to scale the platform to handle larger datasets and integrate with cloud-based storage solutions to make it accessible for larger research projects. Finally, we hope to further fine-tune the Phi 3 model to handle even more complex queries and provide more nuanced responses to users.

Built With

  • ctgan
  • next-js
  • phi-3