Inspiration

When you build an ML project, you're pulling in licenses from everywhere - the Python packages, the models you download, the datasets you train on. Most developers only think about direct dependencies, and even then it's usually an afterthought. We wanted to build something that catches the problems before they become legal issues.

What it does

The ML License Compliance Agent scans your ML project and tells you what license violations you have and how to fix them. It checks three things:

  • Python packages - flags restrictive licenses (copyleft, non-commercial terms) hiding in your dependency tree
  • Hugging Face models - checks whether your usage complies with each model's license (Llama has strict rules most people don't know about)
  • Training datasets - flags datasets you can't legally use for commercial products

You can trigger it by commenting on any GitLab issue or MR and it replies with a full report. It also runs automatically in CI and blocks merges when it finds critical violations.
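The merge-blocking behavior comes down to a severity gate: the CI job exits non-zero when any finding is critical, which GitLab treats as a failed pipeline. A minimal sketch of that gate - the finding schema and names here are illustrative, not the project's actual API:

```python
def gate(findings):
    """Return a CI exit code: 1 if any finding is critical, else 0."""
    critical = [f for f in findings if f["severity"] == "critical"]
    for f in critical:
        print(f"CRITICAL: {f['subject']} ({f['license']})")
    return 1 if critical else 0

# Illustrative findings; in CI you would call sys.exit(gate(findings)).
exit_code = gate([
    {"subject": "requests", "license": "Apache-2.0", "severity": "ok"},
    {"subject": "meta-llama/Llama-2-7b", "license": "llama2", "severity": "critical"},
])
```

Here `exit_code` is 1, so the pipeline fails and the merge is blocked.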

How we built it

The scanner is Python code that reads your files and calls public APIs to look up license information. No AI needed for the actual scanning - it's just rules applied consistently.
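As a rough sketch of what "rules applied consistently" looks like for the dependency check: parse each requirement, look up its license, and classify it. The hardcoded license map below is a stand-in for the public-API lookup (e.g. deps.dev) the real scanner performs; all names are illustrative:

```python
# Stand-in for a real license lookup; the actual scanner queries
# public APIs such as deps.dev for this information.
KNOWN_LICENSES = {"requests": "Apache-2.0", "mysqlclient": "GPL-2.0"}

# Simple, consistent rules - no AI involved in the scan itself.
COPYLEFT = {"GPL-2.0", "GPL-3.0", "AGPL-3.0"}

def scan_requirements(lines):
    """Yield (package, license, risk) for each pinned requirement line."""
    for line in lines:
        line = line.split("#")[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name = line.split("==")[0].strip().lower()
        license_id = KNOWN_LICENSES.get(name, "UNKNOWN")
        risk = "high" if license_id in COPYLEFT else "low"
        yield name, license_id, risk
```

For example, `list(scan_requirements(["mysqlclient==2.2.0"]))` returns `[("mysqlclient", "GPL-2.0", "high")]`.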

Claude comes in through the GitLab Duo Agent Platform for the interactive part. When a developer doesn't understand a finding, they ask the agent and get a plain-English explanation with specific steps to fix it.

Challenges we ran into

AI licenses are genuinely confusing. The Llama license has a clause that says you can't use its outputs to train other AI models - most developers have never heard of this. CC-BY-NC datasets look fine until you try to ship a commercial product. Finding these violations requires understanding context, not just reading a license string.

Accomplishments that we're proud of

The agent actually works end-to-end. You comment on a GitLab issue, it reads your repo, and it comes back with a real compliance report - specific violations, exact license clauses, and concrete steps to fix each one. We ran it against our demo repo and it caught all 5 intentional violations correctly.

What we learned

Most ML teams are unknowingly violating licenses right now. The tooling to catch this doesn't really exist yet, especially for models and datasets. This felt like a real gap worth filling.

What's next for Aryaay Tech

  • Support for more AI license types as new models ship
  • Auto-PR generation that fixes violations directly
  • Slack/Teams notifications for new violations introduced in a merge
  • A dashboard showing license risk trends over time across all your ML projects

Built With

  • cyclonedx
  • deps.dev
  • gitlab
  • huggingface
  • python