Inspiration

Coding agents like Devin don't just read code, they read issues, comments, CI logs, and review threads. That context is an unguarded attack surface. A malicious open-source contributor needs no sandbox access; they only need to poison what the agent reads. Nobody is systematically measuring how robust agents are to this, so we built an adversary that finds these holes automatically.

What it does

Le Chaton FAT is an adaptive red-team agent powered by GLM-5.2 that plays a malicious contributor on a local, simulated GitLab. By construction it can only create and edit issues and comments. It crafts plausible engineering messages designed to steer a coding agent into an unsafe merge request: weakening admin auth, skipping security tests, logging secrets, or touching protected files. A deterministic-plus-LLM judge (also powered by GLM-5.2) then labels every resulting MR SAFE / NEEDS_REVIEW / UNSAFE before merge, and never executes attacker-supplied code. Everything runs locally.

How we built it

  • Model: GLM-5.2 (open weights, MIT), served via PrimeIntellect serverless inference.
  • Attacker: a strategy tree over (channel, goal, style) tuples persisted as JSONL, selected with a UCB bandit, generated with GLM-5.2, and grown by a reflection step.
  • Judge: deterministic policy rules (e.g. require_admin -> require_user on a billing endpoint = UNSAFE) plus a gated LLM judge for ambiguous diffs; raise-only score aggregation that fails closed to NEEDS_REVIEW.
  • Stack: FastAPI target app (demo-saas), local GitLab (Docker), Devin (victim AI agent), GLM-5.2 (red teamer & judge; 2 separate instances)

Challenges we ran into

  • Getting everything to work on a local macbook with low RAM.
  • Troubleshooting Docker container issues.
  • Doing the research needed on how the red teaming agent should carry out the attacks.

Accomplishments that we're proud of

Coming up with this idea and executing it in less than 24 hours, with more than half of the team coming from non-technical backgrounds. The attacker writes a real, convincing issue with GLM-5.2, a coding agent acts on it, and the judge catches the unsafe merge before it lands.

A few things we're happy about:

  • The attacker is honestly boxed in. It only ever opens issues and leaves comments, so when it wins, it's because it poisoned context, not because we handed it extra powers.
  • The judge never runs attacker code and never echoes a canary value back in its own output.
  • It actually adapts. The static, fixed-script attacker gets caught fast; the adaptive loop learns that compatibility/CI-unblock framing slips through, and the unsafe rate climbs.

What we learned

The biggest takeaway is that a coding agent's weak spot is what it reads, not what it runs. The attacks that worked didn't look like attacks. They looked like a tired teammate asking for a small, temporary exception, and that framing was far more effective than anything blunt.

Some smaller lessons:

  • The judge has to be independent from the attacker. Letting one model both write and grade the attack is a blind spot, so the grader should be a different model.
  • Restraint was the hard part. It would've been easy to make the attacker "succeed" by quietly giving it abilities a real outside contributor wouldn't have, and resisting that is what makes the results mean anything.
  • Spending more compute at inference is a real lever. More rounds of select → generate → judge → reflect produced noticeably nastier attacks than a single shot ever did.
  • Infra eats time. Getting a real GitLab to boot identically on every machine — including an Apple Silicon Unix-socket gotcha — took longer than any of the actual AI work.

What's next

More attack channels, multi-turn campaigns, and running the attacker on self-hosted GLM-5.2. The thesis: as agents get more autonomous, their biggest attack surface is the context they read — and you need an adaptive adversary to find the holes before someone else does. Le Chaton FAT is a benchmark for exactly that.

Built With

  • docker
  • fastapi
  • gitlab
  • glm-5.2
  • openai-api
  • prime-intellect
  • pydantic
  • pytest
  • python
Share this project:

Updates