About the Project

Inspiration

As the founder of Quark, a startup focused on developing compact, specialized AI models, I realized the critical need for vast amounts of high-quality supervised training data to fine-tune these models effectively. Collecting such data at scale is challenging, time-consuming, and often costly. This project was inspired by the need to streamline and automate the data collection and annotation process to accelerate model development.

What it does

DataAget is a prototype tool that automates the extraction of rich, technical content from curated web sources and converts it into structured, high-quality supervised training examples. It uses Google's Gemini model to identify scrape-friendly URLs, scrape detailed content from them, and transform the raw text into JSON-formatted training data ready for fine-tuning language models.
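At a high level, the pipeline has three stages. The sketch below shows the shape of that flow; the function names and the (instruction, output) example format are illustrative stand-ins, not the actual DataAget API, and the stages are stubbed rather than calling Gemini or the network.

```python
# Illustrative sketch of the three-stage pipeline: discover URLs,
# scrape them, and annotate the raw text into training examples.
# All three stages are stubbed; names are hypothetical.

import json

def discover_urls(topic):
    # Stage 1: in DataAget, the model proposes scrape-friendly URLs.
    return [f"https://example.com/{topic}"]

def scrape(url):
    # Stage 2: in DataAget, the page is fetched and cleaned.
    return f"raw text from {url}"

def annotate(raw_text):
    # Stage 3: in DataAget, the model converts raw text into a
    # structured supervised example.
    return {"instruction": "Summarize the source.", "output": raw_text}

def build_dataset(topic):
    # Chain the stages into one dataset-building pass.
    return [annotate(scrape(url)) for url in discover_urls(topic)]

print(json.dumps(build_dataset("transformers"), indent=2))
```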

How we built it

The tool is built using Python with:

  • Streamlit for an interactive user interface,
  • Requests and BeautifulSoup for scraping,
  • Google Gemini API for generating search queries and converting raw data into structured annotations,
  • CSV export for easy integration with machine learning pipelines.

This modular design allows continuous updates and improvements.
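As a rough illustration of the scrape-and-export portion of that stack, the sketch below parses an inline HTML snippet with BeautifulSoup and writes the result as CSV. It is a minimal sketch, not DataAget's actual code: the `<article>` selector, column names, and sample page are assumptions, and a real run would fetch pages with Requests instead of using a hard-coded string.

```python
# Minimal sketch: extract readable text with BeautifulSoup, then
# export (source, text) rows as CSV for an ML pipeline.

import csv
import io
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<html><body>
  <article>
    <h1>Gradient Descent</h1>
    <p>Gradient descent minimizes a loss function iteratively.</p>
  </article>
</body></html>
"""

def extract_text(html):
    # Pull the readable text out of the main <article> element.
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article")
    return article.get_text(separator=" ", strip=True)

def to_csv(rows):
    # Serialize rows with a header so downstream tools can load them.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["source", "text"])
    writer.writerows(rows)
    return buf.getvalue()

text = extract_text(SAMPLE_HTML)
print(to_csv([("example.com", text)]))
```

Keeping extraction and export as separate functions mirrors the modular design mentioned above: either piece can be swapped out as sites or output formats change.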

Challenges we ran into

  • Ensuring the URLs returned are scrape-friendly and contain technical content suitable for training data.
  • Handling varied website structures and content formatting during scraping.
  • Processing large text data chunks within API limits while maintaining annotation quality.
  • Parsing and validating the JSON output from the AI model reliably.
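The last challenge, reliably parsing model output, generally comes down to defensive cleanup and validation. The sketch below shows one plausible approach, assuming an instruction/output training format; the field names and the code-fence stripping are assumptions about how the model responds, not DataAget's actual parser.

```python
# Hedged sketch: clean and validate JSON emitted by a language model
# before accepting it as a training example.

import json

def parse_training_example(model_output):
    # Models often wrap JSON in Markdown code fences; strip them first.
    cleaned = model_output.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`")
        if cleaned.startswith("json"):
            cleaned = cleaned[len("json"):]
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None  # caller can retry the chunk or skip it
    # Validate the fields the fine-tuning format expects.
    if not all(isinstance(data.get(k), str) for k in ("instruction", "output")):
        return None
    return data

good = '```json\n{"instruction": "Explain X", "output": "X is ..."}\n```'
print(parse_training_example(good))
```

Returning `None` instead of raising lets the pipeline drop or retry a bad chunk without aborting the whole run.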

Accomplishments that we're proud of

  • Successfully automating the entire pipeline from data discovery to training data generation.
  • Creating a manageable prototype that can be improved iteratively.
  • Demonstrating how AI can assist in generating high-quality supervised datasets from unstructured web data.

What we learned

  • The importance of carefully designing prompts for AI models to get precise and usable outputs.
  • Handling real-world scraping issues requires robust error handling and content cleaning.
  • Chunking large texts is essential to work within API constraints and improve annotation accuracy.
  • Prototyping with Streamlit accelerates development and user feedback cycles.
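The chunking lesson above can be sketched as a simple sliding window. This is a minimal illustration, assuming a character budget as a stand-in for the real token limit (actual limits are model-specific), with a small overlap so context is not lost at chunk boundaries.

```python
# Simple sketch: split a long text into overlapping chunks that each
# fit within an assumed per-request size budget.

def chunk_text(text, max_chars=200, overlap=20):
    # Overlapping windows preserve context across chunk boundaries.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

doc = "word " * 100  # 500 characters of placeholder text
chunks = chunk_text(doc)
print(len(chunks), max(len(c) for c in chunks))  # → 3 200
```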

What's next for DataAget

  • Enhancing URL filtering to improve scrape quality and relevance.
  • Adding support for more complex annotation formats and multi-turn conversational data.
  • Integrating more sources beyond web scraping, like PDFs or APIs.
  • Building an API-first version for seamless integration with data pipelines.
  • Continuously updating the tool to address data collection bottlenecks faced by startups like Quark.

Note: Due to time constraints, I couldn't integrate sponsored tools or advanced features during this initial prototype development. Instead, I relied on large language models (LLMs) like Gemini to drive core functionality, proving the concept and laying the foundation for future enhancements.
