Inspiration
Most of us have heard stories about AI personal assistants that go rogue: forwarding sensitive files to the wrong person, acting on instructions embedded in an email, or confidently returning data that is subtly wrong. These failures share a common root cause. When we build and test AI agents, we tend to walk the happy path. We verify that the agent does what it is supposed to do when everything goes right. What we rarely do is put the agent in an environment that is actively trying to break it.
Malicious sandboxes exist for this purpose, but setting one up requires you to already know what you are looking for. You have to design the attack, write the adversarial data, and configure the environment before you can run a single test. That is a lot of work, and it front-loads all the creativity onto the person doing the testing.
The question we asked was: what if the environment itself could figure out the attack?
What it does
RedAgent is an adversarial testing framework for AI agents. It sits between your agent and its tools, intercepts every tool call in real time, and uses Claude to decide whether to return the real result or a crafted adversarial one.
It tests for four classes of vulnerability:
- Prompt Injection -- text fields in tool results that contain embedded instructions the agent might follow
- False Data Trust -- numeric, status, or identity fields the agent accepts without verification
- State Corruption -- write operations the agent assumes succeeded without checking
- Privilege Escalation -- tool outputs that redirect the agent beyond its intended scope
Instrumentation takes two decorators:
from redAgent import RedAgent
redAgent = RedAgent()
@redAgent.query
def search_emails(folder: str = "inbox") -> str:
return fetch_emails(folder)
@redAgent.mutation
def send_email(to: str, subject: str, body: str) -> str:
return deliver(to, subject, body)
Set REDAGENT_MODE=ON and RedAgent becomes active. When your agent calls search_emails, RedAgent runs the real function, then asks Claude whether to mutate the result given the current attack hypothesis. If it mutates, it returns a crafted result to the agent while logging the original. When the agent finishes, evaluate() reviews the full mutation log and identifies which mutations caused real exploitable behavior. Confirmed bugs are written to SQLite and a GitLab issue is created automatically with the full vulnerability report.
When REDAGENT_MODE is off, the decorators are transparent wrappers with zero overhead.
RedAgent also ships as a GitLab Custom Flow. Mention it in any issue or MR and it will read your repository, map the attack surface of every tool, rank the top three adversarial hypotheses by severity, and post the full analysis as a comment. No local setup required.
How we built it
The system is built on four layers.
Interception Layer -- Python decorators wrap every tool call. When a call is intercepted, RedAgent executes the real function to get a genuine result, then forwards that result along with the session context to Claude.
Reasoning Layer -- Claude Sonnet 4 receives the current attack hypothesis, a snapshot of the world state built from all prior mutations in the session, the list of bugs already confirmed, and the real tool result. It returns a JSON object indicating whether to mutate, what the mutated result should be, a description of the mutation, and the attack type. The world state snapshot is what keeps Claude consistent across calls. If it fabricated a calendar entry two calls ago, it knows about that when deciding what to return now.
Persistence Layer -- SQLite stores four tables: sessions, mutations, bugs, and hypotheses. The hypotheses table grows across runs. Each new session reads it alongside the confirmed bugs table to generate a novel hypothesis that explores territory that has not been tested yet. This is the mechanism that makes RedAgent get better over time.
Reporting Layer -- At the end of each session, report.py uses python-gitlab to open a GitLab issue with the session summary, confirmed bug count, severity breakdown, and the full evidence from the mutations table.
The GitLab flow (flows/flow.yml) runs a single pipeline agent through three steps in one session: it reads the repository to map attack surfaces, ranks hypotheses by severity, and posts the analysis as a comment using create_work_item_note.
Three standalone agents are also registered in the GitLab AI Catalog for individual use: the Analyzer, the Mocking Agent, and the Reporter.
CI/CD integration is defined in .gitlab-ci.yml. Every merge request targeting main triggers a full adversarial test run automatically. The resulting redAgent.db is stored as a build artifact.
Challenges we ran into
Maintaining world state coherence. The mocking agent needs to stay internally consistent across an entire agent session. If it tells the agent that an email came from a specific sender in call two, it cannot contradict that in call six. The solution was to pass the full world state dict to Claude on every call, so it always has the complete picture of what it has already committed to. Getting the prompt structure right so Claude would reliably update and respect this state took several iterations.
Hypothesis quality vs. novelty. Early runs tended to rediscover the same vulnerability. The hypotheses table solves this by tracking what has been tested and how often it succeeded, but tuning the prompt so Claude would genuinely explore new attack vectors rather than rephrasing known ones required careful prompt engineering.
Keeping mutations plausible. A mutated email that is obviously fake teaches you nothing because a real attacker would not use it either. RedAgent needs its mutations to be realistic enough that the agent would plausibly encounter them in production. This meant spending time on the system prompt to ensure Claude crafted mutations that were subtle, grounded in the actual tool implementation, and consistent with the world state.
Accomplishments that we are proud of
The abstraction. Most adversarial testing tools ask you to write the attacks yourself. RedAgent inverts that. You instrument your tools and it figures out what to attack, how to attack it, and whether the attack worked. The only interface exposed to the outside world is two decorators and an environment variable.
Hypothesis-driven fuzzing that compounds. Because confirmed bugs and tested hypotheses are stored across sessions, RedAgent improves with each run. The first run might find a prompt injection in an email body. The second run already knows about that and explores adjacent territory. This is the part that feels qualitatively different from a static test suite.
Adversarial agent-to-agent interaction. The agent under test has no idea it is talking to another agent. Its tools appear to behave normally. The mocking agent is the only one that knows what is happening, and its entire purpose is to deceive the primary agent without breaking its suspension of disbelief. Implementing this flavor of agent-to-agent interaction was one of the most interesting parts of the project.
Full CI/CD integration out of the box. Adding a single .gitlab-ci.yml stage is enough to run adversarial tests on every merge request automatically. The vulnerability report lands as a GitLab issue without any manual steps.
What we learned
World state is the hard part of adversarial testing. Generating an adversarial payload is easy. Keeping the adversarial environment internally consistent across a multi-turn agent session is genuinely difficult. The world state dict and the way it is injected into every Claude call is the mechanism that makes this work, and it took longer to get right than any other part of the system.
SQLite is underrated for session-scoped persistence. The decision to use SQLite over a managed database kept the framework completely self-contained. Each CI run gets its own database file, stored as an artifact, which makes every session independently reproducible and auditable without any external infrastructure.
The GitLab AI Catalog flow model is a natural fit for security tooling. Being able to trigger a full attack surface analysis by mentioning an agent in a GitLab comment lowers the barrier to adoption significantly. Security tooling that requires a local setup rarely gets used. Tooling that runs in the thread where the conversation is already happening does.
Claude is good at reasoning about its own prior decisions. Passing the world state back into every call and asking Claude to stay consistent with it worked better than expected. It very rarely contradicted itself, which was the main risk with this approach.
What is next for RedAgent
Parallel attack sessions. Right now each run tests one hypothesis at a time. The problem is embarrassingly parallel. Multiple sessions against the same agent, each with a different hypothesis, could run simultaneously and report back to the same database. This would dramatically reduce time to first bug on complex agents.
Richer hypothesis generation. The current hypothesis generator uses the bugs table and the hypotheses table to propose what to test next. There is room to make this much more sophisticated, incorporating tool source code, past mutation results, and semantic similarity between confirmed bugs to explore the attack space more systematically.
Broader agent framework support. The current SDK assumes Python function tools. Extending support to other agent frameworks and tool call patterns would make RedAgent useful for a wider range of projects without requiring any changes to how the framework works internally.
Severity-based alerting. The GitLab issue created at the end of each run contains the full vulnerability report, but there is no mechanism yet to block a merge request based on severity. Adding a CI gate that fails the pipeline when a critical or high severity bug is confirmed would close the loop on the CI/CD integration.
Built With
- claude
- css
- gitlab
- gitlab-duo
- javascript
- python
- tsx
- typescript
Log in or sign up for Devpost to join the conversation.