Inspiration

Universities are sitting on one of the most valuable untapped data assets in the world: decades of research papers, grant proposals, datasets, faculty work, and expert analysis. AI labs want this data. Institutions want new revenue streams. But the market is stuck.

The problem is not demand. The problem is preparation.

Universities usually do not know exactly what they own, what they are allowed to license, or which documents are blocked by publisher rights, funder restrictions, FERPA, HIPAA, IRB consent, or unclear ownership. Even when data is rights-clean, raw PDFs and institutional files are not immediately useful as training data. AI labs want structured, labeled, high-quality metadata: methodology, novelty claims, evidence quality, citation context, claim graphs, and domain tags.

Datalake was built to solve both problems at once.

What it does

Datalake is an agentic data preparation system for universities and research institutions.

It ingests a folder of raw institutional documents (research papers, grant proposals, datasets, faculty publications, and related files) and runs a dense multi-pass agent loop over each document.

For every document, Datalake produces two outputs:

  1. A catalog record
    It identifies what the document is, who likely owns it, which compliance regimes apply, whether it is commercially viable, and whether it is license-ready.

  2. A rich label payload
    It extracts training-grade metadata such as structured abstracts, methodology tags, novelty claims, evidence quality, claim graphs, citations, and domain classifications.

The result is a sellable, AI-lab-ready dataset. Institutions get both an inventory of what they own and a value-multiplied version of the subset they can actually license.
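
As a rough illustration of the two outputs, the sketch below models a catalog record and a label payload as plain dataclasses. The field names, defaults, and the `to_dataset_row` helper are assumptions made for this example, not Datalake's actual schema, and the standard library is used only to keep the sketch dependency-free.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class CatalogRecord:
    """Rights and ownership view of one document (illustrative fields)."""
    doc_id: str
    doc_type: str                     # e.g. "research_paper", "grant_proposal"
    likely_owner: str
    compliance_regimes: list[str] = field(default_factory=list)  # e.g. ["FERPA"]
    commercially_viable: bool = False
    license_ready: bool = False


@dataclass
class LabelPayload:
    """Training-grade metadata for the same document (illustrative fields)."""
    doc_id: str
    structured_abstract: str = ""
    methodology_tags: list[str] = field(default_factory=list)
    novelty_claims: list[str] = field(default_factory=list)
    evidence_quality: str = "unknown"
    claim_graph: list[tuple[str, str, str]] = field(default_factory=list)
    citations: list[str] = field(default_factory=list)
    domains: list[str] = field(default_factory=list)


def to_dataset_row(catalog: CatalogRecord, labels: LabelPayload) -> dict:
    """Join both outputs into one exportable row, keyed by the shared doc_id."""
    if catalog.doc_id != labels.doc_id:
        raise ValueError("catalog and label records describe different documents")
    return {"catalog": asdict(catalog), "labels": asdict(labels)}
```

Keeping both records keyed by the same document ID means the license-ready flag and the labels travel together when the dataset is filtered or exported.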

How we built it

The core of Datalake is a shared catalog-and-label agent loop.

Each document moves through several stages:

Read & extract: parses the document and pulls out text, structure, and references.

Propose: launches multiple parallel agents that independently generate draft catalog and label records.

Critique: uses separate agents to review each proposal for methodology specificity, ownership defensibility, compliance coverage, and label quality.

Refine: revises the proposals using the critiques.

Vote: selects the strongest record or flags the document as low-confidence.

Enrich: generates the final structured payload that makes the data valuable to AI labs.
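
The stages above can be sketched as a small asyncio pipeline. This is a minimal illustration under stated assumptions, not the project's actual code: the `model(stage, payload)` signature, the `stub_model` stand-in, the majority vote on `license_ready`, and the `min_agreement` threshold are all hypothetical choices made so the example runs offline.

```python
import asyncio
from collections import Counter


async def stub_model(stage: str, payload: dict) -> dict:
    """Deterministic stand-in for an inference call (hypothetical interface)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if stage == "propose":
        # Agents 0 and 1 judge the document license-ready; agent 2 disagrees.
        return {"agent": payload["agent"], "license_ready": payload["agent"] != 2}
    if stage == "critique":
        return {"issues": [] if payload["license_ready"] else ["ownership unclear"]}
    if stage == "refine":
        return {**payload["draft"], "critique": payload["critique"]["issues"]}
    if stage == "enrich":
        return {**payload, "labels": {"domain_tags": ["example"]}}
    raise ValueError(f"unknown stage: {stage}")


async def process_document(text, model, n_agents=3, min_agreement=0.6):
    # Read & extract: real parsing is omitted; the raw text is passed through.
    extracted = {"text": text}
    # Propose: independent drafts generated in parallel.
    drafts = await asyncio.gather(
        *(model("propose", {"agent": i, **extracted}) for i in range(n_agents))
    )
    # Critique: one review per draft, also in parallel.
    critiques = await asyncio.gather(*(model("critique", d) for d in drafts))
    # Refine: each draft is revised using its critique.
    refined = await asyncio.gather(
        *(model("refine", {"draft": d, "critique": c})
          for d, c in zip(drafts, critiques))
    )
    # Vote: majority on license_ready; weak agreement flags the document.
    votes = Counter(r["license_ready"] for r in refined)
    winner, count = votes.most_common(1)[0]
    agreement = count / n_agents
    if agreement < min_agreement:
        return {"status": "low_confidence", "agreement": agreement}
    chosen = next(r for r in refined if r["license_ready"] == winner)
    # Enrich: build the final payload from the winning record.
    final = await model("enrich", chosen)
    return {"status": "ok", "agreement": agreement, "record": final}
```

Calling `asyncio.run(process_document("some text", stub_model))` runs all five stages against the stub. With three agents this sketch makes ten model calls per document, which sits at the low end of the 10–20 calls quoted for Datalake.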

We built the backend in Python with async parallel inference, document parsing, structured JSON outputs, trace logging, and dataset export. The dashboard shows documents streaming through the system, live agent traces, cost comparisons, quality metrics, and a filterable catalog view where users can isolate the license-ready subset.
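
Isolating the license-ready subset for export could look roughly like the following. The record layout (a `catalog` dict carrying a `license_ready` flag), the function name, and the choice of JSON Lines as the output format are assumptions for illustration only.

```python
import io
import json


def export_license_ready(records, out):
    """Write only license-ready records as JSON Lines; return the count written."""
    written = 0
    for rec in records:
        # Records without an explicit license_ready flag are treated as blocked.
        if rec.get("catalog", {}).get("license_ready"):
            out.write(json.dumps(rec) + "\n")
            written += 1
    return written


# Usage: any text file object works; an in-memory buffer is used here.
buffer = io.StringIO()
count = export_license_ready(
    [
        {"catalog": {"doc_id": "a", "license_ready": True}},
        {"catalog": {"doc_id": "b", "license_ready": False}},
        {"catalog": {"doc_id": "c"}},
    ],
    buffer,
)
```

Defaulting to "blocked" when the flag is missing is the conservative choice for a compliance-facing export.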

Why Wafer matters

The product only works if inference is cheap and fast.

Datalake uses 10–20 inference calls per document. That would be economically infeasible with traditional frontier-model pricing at university scale. On a corpus of millions of files, GPT-4-style costs would erase the licensing margin before the institution ever made money.
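
To make the arithmetic concrete, here is a back-of-the-envelope cost function. Only the 10–20 calls per document and the "millions of files" scale come from the text above. The tokens-per-call figure and both per-million-token prices are placeholder assumptions, not quoted GPT-4 or Wafer rates.

```python
def corpus_cost(n_docs, calls_per_doc, tokens_per_call, usd_per_million_tokens):
    """Estimate total inference spend in USD for one pass over a corpus."""
    total_tokens = n_docs * calls_per_doc * tokens_per_call
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder inputs: 1M documents, 15 calls each, 4,000 tokens per call.
expensive = corpus_cost(1_000_000, 15, 4_000, usd_per_million_tokens=10.0)
cheap = corpus_cost(1_000_000, 15, 4_000, usd_per_million_tokens=0.5)
print(f"hypothetical high price: ${expensive:,.0f}")  # $600,000
print(f"hypothetical low price:  ${cheap:,.0f}")      # $30,000
```

Cost scales linearly in every input, so the price per token is the main lever that decides whether running the full loop on every document is affordable.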

Wafer changes the shape of the product. Because inference is cheap enough to treat agent loops as infrastructure, Datalake can run proposal, critique, refinement, voting, and enrichment passes on every document, not just a small sample.

The agent loop is not a feature layered on top of the product. The agent loop is the product.

Challenges we ran into

The hardest part was making the system do cataloging and labeling together instead of treating them as separate workflows.

Most data-labeling tools assume the customer already knows what the data is and already has the rights to use it. That assumption fails for universities. Institutional files are a mess, ownership is ambiguous, and compliance risk is the reason these datasets are not already being sold.

We also had to design outputs that were useful to two very different users: compliance teams need conservative, evidence-backed catalog records, while AI labs want rich, structured metadata that makes the data immediately useful for training.

Accomplishments that we’re proud of

We are proud that Datalake turns a vague institutional problem (“we have valuable data somewhere”) into a concrete workflow: ingest, catalog, label, filter, export, and sell.

We are also proud of the positioning. This is not a generic labeling platform. Existing labelers serve AI labs, which are data buyers. Datalake serves institutions, which are data sellers.

Our favorite part is the cost meter: watching the Wafer cost stay low while the GPT-4-equivalent counter climbs makes the business case immediately obvious.

What we learned

We learned that cataloging and labeling are more connected than they first appear. When an agent reads a paper to infer ownership, compliance, and commercial viability, it is already building the same understanding needed to extract methodology, claims, citations, and novelty.

Splitting cataloging and labeling into two vendor workflows would duplicate inference, lose context, and reintroduce the legal friction that prevents these deals from closing in the first place.

Catalog is the wedge. Labeling is the business.

What’s next for Datalake

Next, we want to move from a hackathon MVP to real institutional workflows.

Future versions would include connectors for S3, SharePoint, DSpace, Box, and other institutional repositories; support for lectures, videos, larger datasets, and archival material; human review for low-confidence records; and one-click publishing to marketplaces or direct AI-lab licensing deals.

The long-term vision is bigger than universities. The same system could help hospital systems, museums, research labs, and archives turn messy institutional knowledge into rights-aware, AI-ready datasets.

Datalake makes university data labelable in the first place.

https://github.com/namdabest253/datalake

Built With

  • aiohttp
  • aiosqlite
  • arxiv
  • asyncio
  • framer-motion
  • hatchling
  • hugging-face-datasets
  • loguru
  • nsf
  • openai-gpt-4-turbo
  • postcss
  • pydantic
  • pydantic-settings
  • pymupdf
  • pytest
  • python
  • pyyaml
  • qwen-3.5-397b
  • react
  • react-router
  • ruff
  • sql
  • sqlite
  • streamlit
  • tailwind-css
  • tiktoken
  • typer
  • typescript
  • vite
  • wafer-serverless