GenDataset–Gemini3Pro

Inspiration

Creating high-quality datasets for machine learning is one of the most time-consuming and repetitive tasks for engineers and researchers. We noticed that a significant amount of development time goes into searching for, cleaning, formatting, and structuring data before model training even begins.

We wanted to build a solution that automates dataset generation while maintaining realism, structure, and reusability. This idea led to the creation of GenDataset–Gemini3Pro.


What it does

GenDataset–Gemini3Pro is an AI-powered dataset generation platform that creates structured, domain-specific datasets based on user-defined requirements.

Users can:

  • Select a dataset domain
  • Provide a custom dataset description through a prompt
  • Define the number of rows and columns
  • Customize schema and data types
  • Upload a sample reference dataset
  • Preview generated data
  • Export datasets in CSV, JSON, or Excel format

The system generates realistic datasets within minutes, ready for machine learning model training.
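The options listed above amount to a small request specification. A minimal sketch of how such a request might be represented and checked before generation, assuming hypothetical names (`ColumnSpec`, `DatasetRequest`) and an assumed set of column types that are not taken from the project's code:

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed vocabularies for illustration; the project's real options may differ.
ALLOWED_TYPES = {"int", "float", "str", "bool", "date"}
ALLOWED_FORMATS = {"csv", "json", "excel"}


@dataclass
class ColumnSpec:
    """One column in the user-defined schema (hypothetical structure)."""
    name: str
    dtype: str

    def __post_init__(self):
        if self.dtype not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type for {self.name!r}: {self.dtype!r}")


@dataclass
class DatasetRequest:
    """Everything the user configures before generation (hypothetical structure)."""
    domain: str
    description: str
    rows: int
    columns: List[ColumnSpec]
    export_format: str = "csv"
    sample_path: Optional[str] = None  # optional uploaded reference dataset

    def __post_init__(self):
        if self.rows <= 0:
            raise ValueError("rows must be positive")
        if not self.columns:
            raise ValueError("at least one column is required")
        names = [c.name for c in self.columns]
        if len(set(names)) != len(names):
            raise ValueError("column names must be unique")
        if self.export_format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported export format: {self.export_format!r}")
```

Validating the request up front keeps malformed schemas from ever reaching the generation step.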


How we built it

The core intelligence layer is powered by Gemini 3 Pro, which handles AI-driven dataset generation.

Our system:

  • Uses structured prompts to generate datasets aligned with the user-defined schema
  • Integrates the Kaggle API to reference real-world datasets for realistic feature relationships and value distributions
  • Allows optional sample dataset uploads to improve contextual accuracy
  • Stores generated datasets and metadata in MongoDB for versioning and reuse
  • Provides export functionality in multiple formats (CSV, JSON, Excel)
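To illustrate the structured-prompt step, here is a minimal sketch that builds a schema-aware prompt and checks the model's reply against the same schema. The function names, the prompt wording, and the JSON-array reply format are assumptions for illustration, and the call to Gemini 3 Pro itself is deliberately left out:

```python
import json

# Assumed mapping from schema type names to Python types.
PY_TYPES = {"int": int, "float": (int, float), "str": str, "bool": bool}


def build_prompt(domain, description, schema, n_rows):
    """Render the user's requirements as an explicit, structured prompt."""
    columns = "\n".join(f"- {name}: {dtype}" for name, dtype in schema.items())
    return (
        f"Generate a {domain} dataset. {description}\n"
        f"Return exactly {n_rows} rows as a JSON array of objects.\n"
        f"Each object must have exactly these keys and types:\n{columns}\n"
        "Return only JSON, with no commentary."
    )


def parse_rows(model_text, schema, n_rows):
    """Parse the model's reply and reject rows that break the schema."""
    rows = json.loads(model_text)
    if not isinstance(rows, list) or len(rows) != n_rows:
        raise ValueError("reply is not a JSON array of the requested length")
    for i, row in enumerate(rows):
        if not isinstance(row, dict) or set(row) != set(schema):
            raise ValueError(f"row {i} does not match the schema keys")
        for name, dtype in schema.items():
            value = row[name]
            # bool is a subclass of int in Python, so reject it explicitly
            # for every column that is not declared as bool.
            if dtype != "bool" and isinstance(value, bool):
                raise ValueError(f"row {i}: {name!r} should be {dtype}")
            if not isinstance(value, PY_TYPES[dtype]):
                raise ValueError(f"row {i}: {name!r} should be {dtype}")
    return rows
```

Checking the reply against the schema means a malformed generation can be retried instead of being stored or exported.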

The backend handles generation logic and storage, while the frontend provides an interactive configuration interface.
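As a rough illustration of the export step on the backend, the sketch below serializes generated rows to CSV and JSON using only the Python standard library. The function names are hypothetical, and Excel export is omitted because it would need a third-party library:

```python
import csv
import io
import json


def to_csv(rows):
    """Serialize a list of row dicts to CSV text, using the first row's keys as the header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize the same rows to pretty-printed JSON text."""
    return json.dumps(rows, indent=2)


rows = [{"age": 30, "name": "Ann"}, {"age": 41, "name": "Bo"}]
print(to_csv(rows))
print(to_json(rows))
```

Keeping the generated data as a plain list of dicts lets every export format be produced from one in-memory representation.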


Challenges we ran into

  • Ensuring generated datasets follow realistic data distributions instead of purely synthetic patterns
  • Designing flexible schema customization without making the interface complex
  • Integrating Kaggle API references effectively without overfitting to external data
  • Managing dataset storage and reuse logic so that stored datasets can improve future generations

Balancing realism, flexibility, and generation speed was a key technical challenge.
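One simple way to check the realism concern above is to compare summary statistics of a generated numeric column against a reference column. The sketch below is an illustrative check only, not the project's method; the function names and the 25% tolerance are assumptions:

```python
import statistics


def summary_gap(reference, generated):
    """Relative gap in mean and standard deviation between two numeric columns."""
    def relative(a, b):
        # Fall back to the absolute difference when the reference value is zero.
        return abs(a - b) / abs(a) if a != 0 else abs(b)

    return {
        "mean_gap": relative(statistics.fmean(reference), statistics.fmean(generated)),
        "std_gap": relative(statistics.pstdev(reference), statistics.pstdev(generated)),
    }


def looks_realistic(reference, generated, tolerance=0.25):
    """True when both gaps stay within the (assumed) tolerance."""
    return all(gap <= tolerance for gap in summary_gap(reference, generated).values())
```

A check like this only catches gross mismatches; it says nothing about relationships between columns, which would need a separate comparison.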


Accomplishments that we're proud of

  • Successfully integrating Gemini 3 Pro as an intelligent dataset generation engine
  • Reducing dataset creation time from hours to minutes
  • Building schema-aware generation with domain-based suggestions
  • Implementing dataset versioning and reuse using MongoDB
  • Creating a scalable system that reuses stored datasets as references, so data quality improves over time
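The versioning and reuse idea can be sketched without a database. The example below uses an in-memory dictionary as a stand-in for a MongoDB collection; the class name, document fields, and the choice to key datasets by a hash of domain and schema are all assumptions, not the project's actual storage layout:

```python
import hashlib
import json
from datetime import datetime, timezone


def request_key(domain, schema):
    """Stable key for 'similar' requests: same domain and same schema."""
    canonical = json.dumps({"domain": domain.lower(), "schema": schema}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DatasetStore:
    """In-memory stand-in for a MongoDB collection of versioned datasets."""

    def __init__(self):
        self._versions = {}  # request key -> list of version documents

    def save(self, domain, schema, rows):
        key = request_key(domain, schema)
        history = self._versions.setdefault(key, [])
        document = {
            "key": key,
            "version": len(history) + 1,
            "domain": domain,
            "schema": schema,
            "row_count": len(rows),
            "rows": rows,
            "created_at": datetime.now(timezone.utc).isoformat(),
        }
        history.append(document)
        return document

    def latest(self, domain, schema):
        """Most recent stored dataset for a matching request, or None."""
        history = self._versions.get(request_key(domain, schema), [])
        return history[-1] if history else None
```

A lookup through `latest` before generation is one way a stored dataset could serve as a reference for a later, similar request.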

What we learned

  • Prompt engineering plays a critical role in structured AI output
  • Grounding AI-generated content with real-world data significantly improves realism
  • Dataset schema design impacts model training quality
  • Building reusable data systems creates long-term efficiency benefits

What's next for GenDataset–Gemini3Pro

  • Add support for larger-scale dataset generation
  • Introduce synthetic image and text dataset generation
  • Add automated data validation and quality scoring
  • Enable API access for direct integration into ML pipelines
  • Implement dataset visualization and analytics features

Our goal is to make GenDataset–Gemini3Pro a complete intelligent data generation platform for machine learning engineers.

Updates

GenDataset–Gemini3Pro

Creating reliable datasets for machine learning is time-consuming and often limited by data availability, quality, and reusability. Our goal is to automate dataset creation while maintaining realism, structural accuracy, and long-term usability across multiple machine learning workflows.

To achieve this, the application uses Gemini 3 Pro as the core intelligence layer for custom AI-driven dataset generation. Users can configure dataset requirements through multiple customizable options (as demonstrated in the video), including dataset domain, column structure, data size, data types, and contextual constraints. Gemini 3 Pro uses its reasoning and instruction-following capabilities to generate structured datasets that closely align with user-defined specifications.

To further enhance data quality and realism, the system integrates the Kaggle API, allowing the generator to reference existing datasets across various categories. These reference datasets provide grounding patterns for feature relationships, value distributions, and schema consistency, so Gemini 3 Pro can enrich the generated data rather than relying solely on synthetic patterns.

Once generated, datasets can be downloaded in the required format. At the same time, the dataset and its metadata are stored in MongoDB, enabling versioning and traceability. Stored datasets are also reused as references for future, similar requests, reducing redundant computation and continuously improving dataset quality over time.

Overall, Gemini 3 Pro is central to enabling scalable, intelligent, and reusable dataset generation tailored for machine learning applications.