Agent-Hardener

Architecture

About the project

agent-harden started with a simple idea:

If CI can stop broken code from reaching production, it should also stop broken agents.

This began as two guys from Brisbane, an OpenClaw in a Discord server, and a shared sense that agent security was going to matter a lot more than most teams realized. We kept coming back to the same problem: teams already gate merges on tests, secrets, and dependencies, but almost nobody gates merges on whether an AI agent just became easier to jailbreak, socially engineer, or trick into leaking internal instructions.

So we built agent-harden to make that a normal GitLab CI check.

What inspired us

Most prompt-security tools feel static. Real attackers are not.

They adapt. They retry. They rephrase. They push on whatever almost worked.

We wanted a tool that did the same thing: attack a live agent, judge the response, mutate promising attacks into stronger ones, and fail the pipeline when the failures are real.

How we built it

We built agent-harden as a Go CLI that plugs directly into GitLab CI.

The flow is simple:

Seed corpus -> Run attacks -> Heuristic score -> LLM judge -> Mutate strong attacks -> Store variants -> Emit JUnit -> Pass/Fail CI

It attacks a live agent endpoint, filters responses with fast heuristics, escalates ambiguous cases to an LLM judge, evolves near-successful prompts into stronger attacks, stores the most effective variants for future runs, and emits JUnit so GitLab surfaces the results as native test failures.

We also added a GitLab Duo skill so installation itself can be automated inside a repo.

Challenges we faced

The hardest part was avoiding a fake-feeling demo.

We did not want a static checklist dressed up as agent security. We wanted something adaptive enough to feel real, but still fast and cheap enough to fit into CI.

The second challenge was compression: there is a lot going on under the hood, but judges do not have time for a wall of text or a long explainer. So we kept pushing the project toward one clear outcome:

an unsafe agent should fail CI before it ships.