Inspiration

Ever found the perfect ML idea but got stuck hunting for a clean dataset? We did — many times. That’s what sparked DataReplica: a tool that turns small data samples into large, high-quality synthetic datasets in minutes.

What it does

DataReplica lets users upload a small dataset and generate a large, realistic synthetic version in minutes, with an optional data quality report.
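To give a feel for what a data quality report might contain, here is a minimal sketch that compares one numeric column of the real sample against its synthetic counterpart. The function names `ks_statistic` and `column_report` are hypothetical and the choice of metrics is an assumption; this is not DataReplica's actual report.

```python
from bisect import bisect_right
from statistics import mean, stdev


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the two empirical CDFs:
    0.0 means identical distributions, 1.0 means no overlap.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    gap = 0.0
    # The gap between two step functions can only peak at an observed value.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def column_report(real, synthetic):
    """Summarize how closely a synthetic column tracks the real one (illustrative only)."""
    return {
        "real_mean": mean(real),
        "synthetic_mean": mean(synthetic),
        "real_std": stdev(real),
        "synthetic_std": stdev(synthetic),
        "ks": ks_statistic(real, synthetic),
    }
```

A low `ks` value together with similar means and standard deviations suggests the synthetic column follows the real one closely; a value near 1.0 suggests the generator missed the distribution.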

How we built it

- Frontend: React + Tailwind (Dockerized)
- Backend: FastAPI with SDV synthesizers (CTGAN, TVAE, GaussianCopula) and DistilGPT2 for text generation
- Deployment: Docker Compose on AWS EC2 behind an Nginx reverse proxy
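For readers curious how a copula-based generator can grow a small sample into a larger one, the following is a simplified two-column sketch of the Gaussian copula idea using only the Python standard library. It is not SDV's implementation and not DataReplica's code; the function names are hypothetical, and it assumes two numeric columns with at least two rows.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def to_normal_scores(column):
    """Replace each value by the standard-normal quantile of its rank."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def empirical_quantile(sorted_values, u):
    """Map a probability u in [0, 1] back to an observed value."""
    index = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]


def sample_pair(col_a, col_b, n_samples, seed=0):
    """Draw synthetic (a, b) rows that keep the columns' rank correlation."""
    rng = random.Random(seed)
    rho = pearson(to_normal_scores(col_a), to_normal_scores(col_b))
    # Clamp to guard against floating-point values slightly above 1.
    noise_scale = math.sqrt(max(0.0, 1.0 - rho * rho))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(n_samples):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + noise_scale * rng.gauss(0, 1)  # correlated with z1
        rows.append((
            empirical_quantile(sorted_a, STD_NORMAL.cdf(z1)),
            empirical_quantile(sorted_b, STD_NORMAL.cdf(z2)),
        ))
    return rows
```

Because this sketch maps samples back through the empirical quantiles, it only ever emits values seen in the input. A real synthesizer would fit a parametric or smoothed marginal so that new values can appear.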

Challenges we ran into

- Real-world EC2 deployment and Nginx configuration for production
- Time-consuming text generation

Accomplishments that we're proud of

- Fully containerized and deployed ML app
- Clean, multi-step user interface with instant feedback
- Reliable synthetic data generation from minimal input
- Automatic selection of a suitable model based on the uploaded dataset
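Automatic model selection could look roughly like the sketch below. The rules, the thresholds, and the function name `choose_model` are all assumptions made for illustration; the writeup does not describe the actual selection logic.

```python
def choose_model(rows, text_length_threshold=50, small_table_rows=500):
    """Pick a model name for a list-of-dicts dataset.

    Hypothetical rules (not DataReplica's real logic):
      - any long free-text value        -> "DistilGPT2"
      - all-numeric table, few rows     -> "GaussianCopula"
      - all-numeric table, many rows    -> "TVAE"
      - anything else (mixed columns)   -> "CTGAN"
    """
    if not rows:
        raise ValueError("dataset is empty")

    values = [value for row in rows for value in row.values()]

    # Long strings are treated as free text rather than categories.
    if any(isinstance(v, str) and len(v) > text_length_threshold for v in values):
        return "DistilGPT2"

    # bool is a subclass of int, so exclude it from the numeric check.
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if all_numeric:
        return "GaussianCopula" if len(rows) < small_table_rows else "TVAE"
    return "CTGAN"
```

For example, `choose_model([{"age": 31, "city": "Pune"}])` returns `"CTGAN"` under these rules because the table mixes numeric and categorical columns.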

What we learned

- Support for multi-table and time-series datasets
- Advanced quality metrics and drift detection

What's next for DataReplica

- Support for multi-table and time-series datasets
- Advanced quality metrics and drift detection
- Faster generation
