Inspiration

I was working on a fine-tuning project and struggled to find a suitable dataset for it. So I decided to build a data operations AI agent that can discover and download datasets for you, and also perform operations on them such as data cleaning, validation, synthetic data generation, and taxonomy generation.

What it does

Datateam is a DataOps AI Agent — a system that automates the full lifecycle of dataset generation and processing. It sits between raw data and model training, doing the heavy lifting: collecting, cleaning, augmenting, validating, and preparing high-quality datasets.

In the first iteration, the target was to build a simple AI agent that can discover and download datasets, or create them from scratch from the live web.

As of now, it can perform the following actions:

  1. Dataset discovery
  2. Dataset download from Kaggle
  3. Dataset generation by scraping the web
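
The flow described above (look for an existing dataset first, otherwise build one from the live web) can be sketched roughly as follows. This is a minimal illustration, not Datateam's actual code: `acquire_dataset`, `DatasetResult`, the `fake_*` functions and the catalog entry are all hypothetical names that stand in for the real discovery, download and scraping tools.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DatasetResult:
    source: str        # "existing" if discovered and downloaded, "web" if generated
    rows: list[dict]


def acquire_dataset(
    topic: str,
    discover: Callable[[str], Optional[str]],
    download: Callable[[str], list[dict]],
    scrape: Callable[[str], list[dict]],
) -> DatasetResult:
    """Prefer an existing dataset; fall back to generating one from the web."""
    ref = discover(topic)
    if ref is not None:
        return DatasetResult("existing", download(ref))
    return DatasetResult("web", scrape(topic))


# Hypothetical stand-ins for the real tools (assumptions, for illustration only).
CATALOG = {"sentiment": "example-user/sentiment-reviews"}


def fake_discover(topic: str) -> Optional[str]:
    # Returns a dataset reference if one is known, otherwise None.
    return CATALOG.get(topic)


def fake_download(ref: str) -> list[dict]:
    return [{"ref": ref, "text": "sample row"}]


def fake_scrape(topic: str) -> list[dict]:
    return [{"topic": topic, "text": "scraped row"}]


found = acquire_dataset("sentiment", fake_discover, fake_download, fake_scrape)
generated = acquire_dataset("rare-topic", fake_discover, fake_download, fake_scrape)
print(found.source, generated.source)
```

Passing the tools in as plain callables keeps the routing logic independent of any particular data source, so the stand-ins can be swapped for real tool calls without changing `acquire_dataset`.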

Planned for the next iteration:

  1. Pre-processing (Cleaning, formatting, parsing, normalizing)
  2. Synthetic data generation
  3. Data validation
  4. Taxonomy generation
  5. Data augmentation

How we built it

Datateam is built with Python (3.12 or later) and the Google Agent Development Kit (ADK). It uses the Google Gemini API for its AI capabilities and Bright Data MCP for web interaction, including its web unblocker and scraping browser functionality.

The agent can be run via the ADK Web Interface, accessible locally after starting the ADK web server.

Challenges we ran into

  1. The UI provided by adk web isn't very user-friendly
  2. The Google ADK docs are not very detailed and feel outdated

Accomplishments that we're proud of

Integrating the Bright Data MCP server to implement dataset generation from the live web, the same kind of data collection process that some very large AI companies are founded on.

What we learned

  • How to use Google ADK
  • Integrating MCP servers with Google ADK
  • Agentic AI design

What's next for Datateam AI

The next phase for Datateam AI is to turn it into a full-blown DataOps AI agent by implementing the following features:

  1. Pre-processing (Cleaning, formatting, parsing, normalizing)
  2. Synthetic data generation
  3. Data validation
  4. Taxonomy generation
  5. Data augmentation
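
To make the first and third items more concrete, here is a small sketch of what a pre-processing and validation pass might look like. None of this exists in Datateam yet. The function names, the normalization rules and the `required` field list are assumptions chosen for illustration.

```python
import json


def clean_record(record: dict) -> dict:
    """Normalize keys to lowercase and collapse whitespace in string values."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = " ".join(value.split())
        cleaned[key.strip().lower()] = value
    return cleaned


def validate_record(record: dict, required: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    return [
        f"missing field: {name}"
        for name in required
        if record.get(name) in (None, "")
    ]


def preprocess(records: list[dict], required: list[str]) -> list[dict]:
    """Clean every record, drop invalid ones, and remove exact duplicates."""
    seen = set()
    result = []
    for raw in records:
        record = clean_record(raw)
        if validate_record(record, required):
            continue  # skip records that fail validation
        fingerprint = json.dumps(record, sort_keys=True, default=str)
        if fingerprint in seen:
            continue  # skip duplicates of an already accepted record
        seen.add(fingerprint)
        result.append(record)
    return result


raw_rows = [
    {" Text ": "  hello   world ", "label": "a"},
    {"text": "hello world", "label": "a"},   # duplicate once cleaned
    {"text": "", "label": "b"},              # fails validation
]
clean_rows = preprocess(raw_rows, required=["text", "label"])
print(clean_rows)
```

Cleaning before validation and deduplication means that records which differ only in whitespace or key casing are treated as the same row.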

Built With

  • brightdata
  • google-adk
  • perplexity
  • python