Inspiration

In the lifecycle of training Large Language Models, the need for high-quality, domain-specific data is a recurring and critical challenge. We found that existing tools for synthetic data generation were often limited, hard to scale, or required deep prompt engineering expertise. There is a growing body of work on this topic, usually published as research papers or, at best, as a GitHub repo or a Hugging Face dataset. Even so, it remains difficult for the average developer to implement each of these approaches manually, compare them meaningfully, and iterate quickly for a specific use case.

Our inspiration was to build a platform that could bridge this gap, empowering developers to accelerate the creation of vast, nuanced datasets. We wanted to build a system that relies on the generative power of LLMs but is guided by a sophisticated, human-driven characterization process. The goal was to transform a simple text input into tens of thousands of high-quality training samples, efficiently and intuitively, abstracting away state-of-the-art synthetic data methodologies.

How We Built It

Our project is a full-stack application designed for scalability and real-time interaction. The backend is built with Python and FastAPI and serves an 8-step pipeline orchestrated by a central PipelineOrchestrator. When a user provides input text, the pipeline runs through stages including the following:

  1. Concept Extraction: An LLM call extracts the core ideas from the input.
  2. Multi-Dimensional Characterization: Five specialized "AI Agents" (Geographic, Cultural, Linguistic, Persona, and Domain), each powered by a dedicated LLM prompt, enrich the core concepts with diverse contextual layers.
  3. Human-in-the-Loop Validation: The user is presented with the generated concepts and can prune or add to the lists, ensuring full control over the final output.
  4. Combinatorial Generation: The core of our scaling engine. The platform calculates all possible unique combinations of the validated concepts and, for each combination, uses a custom-templated prompt to generate a user-specified number of samples.
  5. Multi-Format Export: The generated data is parsed and made available for download in multiple formats, including SFT, DPO, Q&A, and raw text, with compatibility for the Hugging Face datasets library.
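
As a rough illustration of the Combinatorial Generation step above, the sketch below enumerates every combination across a few validated dimensions and fills in one prompt per combination. The concept values, the template text, and the build_prompts function are hypothetical stand-ins; the project's real templates and PipelineOrchestrator internals are not shown in this write-up.

```python
from itertools import product

# Hypothetical validated concepts, one list per characterization dimension.
validated = {
    "geographic": ["Brazil", "Japan"],
    "persona": ["student", "nurse"],
    "domain": ["finance"],
}

# Hypothetical prompt template (an assumption, not the project's actual template).
TEMPLATE = (
    "Write {n} training samples about {domain} "
    "for a {persona} based in {geographic}."
)


def build_prompts(concepts, samples_per_combo):
    """Return one filled-in prompt per unique combination of concepts."""
    dims = list(concepts)
    prompts = []
    # product() yields the full Cartesian product across all dimensions.
    for values in product(*(concepts[d] for d in dims)):
        combo = dict(zip(dims, values))
        prompts.append(TEMPLATE.format(n=samples_per_combo, **combo))
    return prompts


prompts = build_prompts(validated, samples_per_combo=3)
print(len(prompts))  # 2 * 2 * 1 = 4 combinations
```

Each prompt would then be sent to the model, so the total sample count is the number of combinations multiplied by the user-specified samples per combination.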

The frontend is a modern, responsive UI built with React and Vite. A key feature is its real-time nature, achieved using WebSockets. This allows the backend to stream progress updates directly to the user as it moves through the long-running generation process, providing a seamless and transparent experience.
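
The progress streaming described above might look something like the following on the backend. This is a minimal sketch, not the project's code: the stage names, the message fields, and the stream_progress helper are assumptions. It only relies on the connection object exposing an async send_json method, which FastAPI's WebSocket class provides, so a small fake is used here to keep the example runnable without a server.

```python
import asyncio

# Assumed stage names, loosely following the pipeline steps listed earlier.
STAGES = [
    "concept_extraction",
    "characterization",
    "human_validation",
    "combinatorial_generation",
    "export",
]


async def stream_progress(websocket, stages):
    """Send one JSON progress message per stage over an accepted WebSocket."""
    total = len(stages)
    for index, stage in enumerate(stages, start=1):
        await asyncio.sleep(0)  # placeholder for the real stage work
        await websocket.send_json(
            {
                "stage": stage,
                "step": index,
                "total": total,
                "percent": round(100 * index / total),
            }
        )


# Wiring into FastAPI would look roughly like this (route name is assumed):
#
#   @app.websocket("/ws/pipeline")
#   async def pipeline_ws(websocket: WebSocket):
#       await websocket.accept()
#       await stream_progress(websocket, STAGES)
#       await websocket.close()


class FakeWebSocket:
    """Stand-in connection that records messages instead of sending them."""

    def __init__(self):
        self.sent = []

    async def send_json(self, data):
        self.sent.append(data)


fake = FakeWebSocket()
asyncio.run(stream_progress(fake, STAGES))
print(fake.sent[-1])
```

The frontend can render each message as it arrives, so the UI reflects progress rather than waiting for the whole run to finish.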

The entire process is powered by local LLMs served via Ollama, giving the user privacy and control over the models used for generation.
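
For readers unfamiliar with Ollama, a call to a locally served model can be sketched with nothing but the standard library. The snippet assumes Ollama is running on its default local port and uses a placeholder model name; the build_request and generate helpers are illustrative and are not taken from the project.

```python
import json
import urllib.request

# Ollama's default local endpoint for single-prompt generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt):
    """Build a non-streaming generation request for a local Ollama server."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the prompt and return the generated text (needs a running server)."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Example call, left commented out because it needs a running Ollama server
# and a pulled model ("llama3" is only a placeholder name):
#
#   text = generate("llama3", "List three core concepts in this text: ...")
```

Because the request never leaves the machine, the input text and the generated samples stay under the user's control.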

Challenges We Faced

One of the main challenges was managing the "combinatorial explosion." With dozens of concepts across five dimensions, the number of possible combinations can become astronomical. We had to design the ConceptCombinator to intelligently limit and sample these combinations to keep generation times manageable while still ensuring diversity.
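
One way to cap the combinations without enumerating them all is to sample distinct indices into the Cartesian product and decode each index into a combination. The sketch below takes that approach; uniform random sampling, the sample_combinations name, and the placeholder concept values are assumptions, since the actual strategy inside ConceptCombinator is not described here.

```python
import math
import random


def sample_combinations(concepts, limit, seed=0):
    """Pick up to `limit` distinct combinations without building the full product."""
    dims = list(concepts)
    sizes = [len(concepts[d]) for d in dims]
    total = math.prod(sizes)  # size of the full Cartesian product

    rng = random.Random(seed)
    # Sampling from a range draws distinct indices lazily, so memory use
    # depends on `limit`, not on `total`.
    indices = rng.sample(range(total), k=min(limit, total))

    combos = []
    for idx in indices:
        combo = {}
        # Decode the index as a mixed-radix number, one digit per dimension.
        for dim, size in zip(dims, sizes):
            idx, pos = divmod(idx, size)
            combo[dim] = concepts[dim][pos]
        combos.append(combo)
    return combos


# Placeholder data: five dimensions with twelve concepts each.
dimension_names = ["geographic", "cultural", "linguistic", "persona", "domain"]
concepts = {
    name: [f"{name}_{i}" for i in range(12)] for name in dimension_names
}

combos = sample_combinations(concepts, limit=100)
print(len(combos), "of", 12 ** 5, "possible combinations")
```

Fixing the seed keeps a run reproducible, and because every index maps to exactly one combination, the sampled set contains no duplicates.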

Another challenge was designing a responsive user interface for a long-running background task. Implementing the WebSocket architecture with custom React hooks (usePipelineWebSocket) was crucial to prevent the UI from feeling frozen and to provide the user with meaningful, real-time feedback on the pipeline's progress.

What We Learned

This project was a deep dive into the practicalities of applied AI. We learned that the most powerful solutions often lie in a hybrid human-AI approach; the LLM provides the creative scale, while human validation provides the necessary guardrails and quality control. We also learned a great deal about orchestrating complex, multi-stage AI workflows and the critical importance of asynchronous processing and real-time communication in creating a usable AI-powered tool.

What's next for synthetic-data-plataform

We are excited about the future of the platform and plan to focus on several key areas for enhancement:

  • Advanced Quality Assurance & Validation: A major focus will be on building a robust quality assurance framework. This includes integrating automated quality metrics (e.g., using BERT-based models for scoring), validating the coherence of generated samples, and providing users with detailed analytics on the dataset's statistical properties.
  • Agentic Enhancement: The future involves making our specialized agents even more powerful. We envision agents that can perform active web research for real-time data, self-correct their outputs, and even be fine-tuned by users for highly specific domains.
  • Deeper Input Customization: We plan to allow for more granular control over the initial input, enabling users to define negative constraints, set concept weights, and provide more nuanced domain context to guide the generation process.
  • Enhanced User Experience: We will continue to refine the user experience, focusing on a more interactive and visual concept management interface, better progress visualization for large-scale generation tasks, and providing more in-depth tools for analyzing the final generated dataset.

Built With

  • fastapi
  • huggingface
  • ollama
  • react
  • vite