Preface

There is a shorter version of this writeup in about 400 words at the bottom. This main writeup serves as a reflection and technical deep dive into how the system was developed. But the requirements listed a ~400 word writeup, so I included it as well. Also you can test out the system here: https://gauntlet-mu.vercel.app/

Inspiration

Most of us have heard of OpenClaw, the personal AI assistant that has been going viral recently. There's a chance, you've also heard of the security hazards that come with giving general access to agents like OpenClaw. Sometimes they forget what they're not supposed to do, or they were not aware in the first place. That happens because when we test these AI models, we usually try the happy path, the path where it works as intended or close to it at least. What we don't usually do is coming up with creative ways to break it.

Gauntlet catching a prompt injection attack, the agent fell for a prompt injection attack in email to retrieve API keys from Notion

Now, imagine putting your agent in a sandbox where the environment is actively trying to break your agent. A malicious sandbox essentially. It gives your agent a list of emails containing a prompt injection to see how it would react, or feeds it false info from the Internet to see if it would believe it. These sandboxes exist but they're quite difficult to set up.

What it does

So, I had this idea: what if instead of trying to create a sandbox, we mimic the existence of one using another agent (let's call this agent the mocking agent). When your primary agent makes tool calls, the mocking agent (it's built using Agent Builder) would intercept them to find creative ways to break your agent. This takes the sandbox idea a step further and makes it creative. The goal of the mocking agent is to be adversarial and creative all while not letting your primary agent know that it is in a simulated environment.

Interception flow

Two problems emerge: (1) is the fidelity of the mocking agent, it needs to maintain a coherent model of the world throughout the conversation and (2) is the creativity of the mocking agent, it needs to find novel bugs that have not been found all while being grounded on the implementation of the tools the agent could use. Both of these are inherently search problems, (1) searches for relevant memories of the world, and (2) searches for something in the Goldilock's zone between exploration and exploitation.

And that is exactly why Elasticsearch becomes important here. It lets us search! Believe it or not, we can build the entirety of the mocking agent within Elasticsearch Agent Builder, it's a self-contained circuit that manages data on its own without external scripts.

The only interface that is exposed to the world is the interception interface, and it's simple, just decorate your functions and the mocking agent will use its abilities (to be explained later) to precisely mock the result for you. Here's an example:

from gauntlet import Gauntlet

gauntlet = Gauntlet()
gauntlet.init()

@function_tool
@gauntlet.query
def search_emails(folder: str = "inbox") -> str:
    """Search emails in the given folder."""
    return json.dumps(fetch_emails(folder))

@function_tool
@gauntlet.mutation
def send_email(to: str, subject: str, body: str) -> str:
    """Send an email to the specified recipient."""
    return json.dumps({"to": to, "subject": subject, "body": body, "status": "sent"})

with gauntlet.session() as session:
    session.hypothesis = gauntlet.hypothesize()
    task = gauntlet.get_input()
    result = await Runner.run(agent, task)
    gauntlet.evaluate(result.final_output)

That's it. @gauntlet.query and @gauntlet.mutation are the only decorators you need. When your agent calls search_emails, the mocking agent intercepts the result and decides whether to mutate it, maybe injecting a prompt injection into an email body, or returning subtly wrong data. After the run, evaluate() reviews what happened and stores any confirmed bugs.

Once the bug is recorded, you can view it on your Dashboard on Kibana, which makes it easy to know what your agent is most susceptible to. This is useful, especially in sensitive sectors like finance and healthcare, we simply can't afford to have an AI agent send hundreds of healthcare records to their email.

Kibana dashboard showing bug metrics, severity breakdown, and discovery trends

How we built it

Conceptually, it sounds heavy: you kinda need to mimic the whole world, but with the right abstractions and built in abilities within Elasticsearch, this became much easier.

The mocking agent consists of two circuits:

  1. The Short-term Memory circuit: it keeps a recollection of everything that has happened within the current session, so that it can produce a coherent model of the world. For example, if you sent an email, and later retrieve all sent emails, that email would be there.
  2. The Long-term Memory circuit: this is sophisticated because this is where the creativity of the agent comes from. The index contains all the bugs that have been found so far, all the tool implementations as well as real-life samples of what results look like.

Piece them together and it looks like this:

Architecture

As you can tell from that diagram we use the Elasticsearch Agent Builder, along with 4 tools. Before I dive deeper into the memory structure, I'll pay tribute to the features that made this possible in the first place:

  • Elasticsearch Agent Builder: The entire mocking agent lives inside Agent Builder. It has its own instructions, tool bindings, and multi-turn conversation state. When Gauntlet intercepts a tool call, it sends the context to the mocking agent via the Converse API, and the agent autonomously decides whether and how to mutate the result. No external orchestration needed.

  • ES|QL: Every tool the mocking agent uses is an ES|QL query. We use FROM, WHERE, SORT, KEEP, and LIMIT for retrieving mutations and past results. We use SAMPLE to randomly select bugs, EVAL and CONCAT to build structured strings, STATS and MV_CONCAT to aggregate them, and COMPLETION to call an LLM inline — all within a single query. The generate-hypothesis tool is a single ES|QL statement that samples bugs, summarizes them, and generates a novel hypothesis.

  • Kibana Workflows: The store-bug tool is a Kibana workflow rather than a direct API call. It takes in bug metadata as inputs and uses an elasticsearch.index step to write the document into gauntlet-ltm-bugs. This keeps the bug storage logic declarative and self-contained within Kibana.

  • ES|QL COMPLETION: This is probably the most powerful feature we use. The generate-hypothesis tool calls COMPLETION inline to generate a novel hypothesis — the LLM reasons about what bugs exist and proposes new ones to explore, all within the ES|QL pipeline.

  • Inference Endpoints: We register two inference endpoints — a GPT-5-mini completion endpoint (used by ES|QL COMPLETION) and a text-embedding-3-small endpoint (used to embed bug descriptions for similarity search). These are configured once during gauntlet.init() and reused across all runs.

  • Kibana Dashboard: We programmatically create a full Kibana dashboard with Lens visualizations — bug count metrics, severity breakdowns (pie chart), bugs over time (bar chart), a bug pattern heatmap, and a detailed data table. All imported via the Saved Objects API so it's ready to use immediately after gauntlet.init().

Short-term Memory

Short-term Memory

The short-term memory is a single index (gauntlet-stm) that tracks everything the mocking agent has done during the current run, every tool call it intercepted, what the original result was, and what it mutated it to. Two tools interact with it:

  • find-relevant-mutations: An ES|QL query that retrieves all mutations from the current run, sorted chronologically. The mocking agent calls this before deciding on a new mutation so it stays consistent with everything it's already told the primary agent. If it said an email was from Alice earlier, it won't contradict that later.

  • find-relevant-queries: Pulls past tool call results from previous runs for a given tool. This gives the mocking agent a sense of what realistic responses look like, so its mutations stay plausible rather than obviously fake.

Long-term Memory

Long-term Memory

Long-term memory spans three indices. gauntlet-ltm-bugs stores every confirmed bug with its description, pattern, severity, and a dense vector embedding. gauntlet-ltm-func stores the source code and docstrings of every tool the primary agent uses. gauntlet-ltm-queries archives real tool call results across all runs. Three tools interact with them:

  • get-tool-implementations: Reads tool source code from gauntlet-ltm-func so the mocking agent understands what each tool does and can reason about failure modes, even when no bugs have been found yet.

  • generate-hypothesis: The most interesting tool. It samples random bugs from gauntlet-ltm-bugs, aggregates them with MV_CONCAT, builds a prompt, and calls COMPLETION inline to propose a novel hypothesis — all in one ES|QL query. This is how the mocking agent stays creative across runs.

  • store-bug: A Kibana workflow that writes a confirmed bug into gauntlet-ltm-bugs. The mocking agent calls this during evaluate() when it determines that a mutation actually caused the primary agent to fail. The workflow takes in all the bug metadata and indexes it directly via an elasticsearch.index step.

How it all works together

Hypothesize-Prove-Store cycle

The mocking agent can hypothesise the existence of bugs, prove its existence by engineering circumstances where it takes place, and then adds it to its own inventory. That's a closed circuit. Everything is within the Agent Builder.

Time to first bug

The core value proposition of Gauntlet is speed and creativity. Setting up a malicious sandbox means designing the attack yourself, you need to know what you're looking for, build the environment, seed it with adversarial data, and then run your agent through it. Gauntlet skips all of that. You decorate your functions and it figures out the attack on its own.

We tested this across 6 attack scenarios on a personal assistant agent with access to email, calendar, Notion, and internet search:

Scenario Manual sandbox setup (estimated) Gauntlet (measured)
Prompt injection via email body ~45 min ~3 min
Data exfiltration through tool chaining ~90 min ~8 min
Calendar event spoofing with fake data ~30 min ~2 min
Notion page content poisoning ~40 min ~4 min
Search result manipulation leading to wrong actions ~60 min ~5 min
Cross-tool state corruption (email → calendar) ~120 min ~10 min

Manual setup time includes: understanding the attack vector, writing the adversarial data, configuring the sandbox to serve it, and running the test. Gauntlet time is from gauntlet.init() to the first confirmed bug in gauntlet-ltm-bugs.

The gap widens over time. After the first few runs, Gauntlet's long-term memory kicks in, it knows what worked before and explores adjacent hypotheses. The manual approach doesn't compound like that.

Challenges we ran into

  1. Knowing the boundaries of where to intercept. It made more sense to have tight coupling with the agent frameworks to simulate the environment. After all, the agent is in that environment. But this wouldn't scale for other agent frameworks. So, I decided to intercepts at the tool call boundary instead, since that is the eyes and the limbs of the agent into the environment.

  2. This project was a last minute pivot. I realised my previous idea didn't really hold up because of some assumptions I made. You can read my breakdown at: github.com/kavishsathia/rehearse. The idea was to make the agent rehearse in a sandbox that is mocked by another agent before it executes anything. The problem that's so extremely obvious in hindsight is that the environment itself might change in between rehearsal and execution. So I came up with Gauntlet instead, which makes more sense, because the stochasticity of the environment doesn't matter.

  3. Before even pivoting, the biggest challenge was coming up with a novel idea. Usually if you think about search, first thing that comes to mind is storing the unit of work in your domain in the index, and then let the agent search over it. By unit of work, I mean contracts for lawyers, health records for doctors and incident records for software engineers. These are obvious because they follow the pattern I described. They are definitely impactful, but I would like my contribution to be partially in terms of breaking that pattern and showing that search is not just about searching documents.

So how did I come up with this idea? I first noted down patterns I want in my idea:

  • I want it to be an automatic data flywheel, it creates its own contents. I don't need to feed it documents. The complexity is emergent.
  • I want it to be context-specific and hence, dysfunctional without the searching layer. That is to say, just like humans can't function without memory, I want this to not be able to function without searching. That makes searching a vital part of the system. The agent simply cannot guess what to mock without any searching, because it is so incredibly context-specific.
  • I want the idea to play into the zeitgeist. Not some idea that should've existed last year, or can exist next year, but needs to exist now.

A lot of thought went into it, and I came up with this idea.

  1. Since I mentioned 2 challenges, I'll also mention what was not a challenge. Using Elasticsearch! When I realised I could simply build everything within Elasticsearch and there's no need to pass data in and out, I was overjoyed. That's the main thing that makes this system so good.

Accomplishments that we're proud of

  1. Coming up with this abstraction. Usually, I would've stopped at the "malicious sandbox" layer. Maybe I'll realise that's hard and I would use an agent to design a malicious sandbox each time before the primary agent starts running. But this time, I took it a step further, and realised that I could make a mocking agent that is winging it on the fly. A truly creative hacker. Then I took it another step further by making the agent hypothesise on its own on what could be a potential attack vector.

  2. Pivoting in 2 days, but that's only possible because of the simplicity of Elasticsearch and the ease of the abstraction implementation. The complexity is all emergent within Elasticsearch. Which is the elegant part.

  3. Implementing a new agent-to-agent (A2A) strategy. One agent doesn't even know it's talking to another agent. And another agent's sole purpose is to trick the other agent. It was extremely fun implementing this flavour of A2A, let's call it adversarial A2A.

What we learned

  1. ES|QL is far more expressive than I expected. I went in thinking I'd need to shuttle data between Elasticsearch and an external script, but between COMPLETION, STATS, MV_CONCAT, and SAMPLE, I could build entire reasoning pipelines as single queries. The generate-hypothesis tool is a good example, it samples, aggregates, prompts an LLM, and returns a result, all without leaving the ES|QL pipeline.

  2. Agent Builder's multi-turn conversation state is what makes this whole thing work. The mocking agent needs to remember what it's already told the primary agent within a single run, and the Converse API handles that natively. I didn't need to build any conversation management logic, just keep calling converse and it stays coherent.

  3. The hardest part of adversarial testing isn't the attack, it's maintaining plausibility. A mutated email that's obviously fake teaches you nothing. The short-term memory circuit exists entirely to solve this: by giving the mocking agent access to everything it's already committed, it can stay internally consistent while still being adversarial.

What's next for Gauntlet

  1. Well, first of all, Gauntlet needs to live up to its name. Right now it's like 1v1, we need the primary agent to actually run the gauntlet. Maybe 20 attacks all at once. This problem is embarrassingly parallel. You can just run it on another session and it'll just work. So, I'll be thinking about ways to scale this.

  2. The long term memory is core to Gauntlet. Without it, we won't even know what hypotheses to test. There could honestly be some innovation in making the agent balance between exploration and exploitation that could go beyond this hackathon, this project and even agents entirely.

  3. Thinking of other ways to apply Gauntlet, you might have already sensed it but Gauntlet is a special case of Rehearse. I arrived at the idea for Rehearse first and realised its assumptions didn't apply for most scenarios, but it did apply for fuzz testing AI agents. So, I almost immediately jumped into this. There could be other fields where the lack of stochasticity applies, and I want to explore those fields.

In About 400 Words

Problem Solved

As AI agents gain access to real-world tools like email, calendars, databases, and web search, they become vulnerable to attacks like prompt injection, data exfiltration, and content poisoning. Testing for these vulnerabilities typically requires manually building malicious sandbox environments: designing attack vectors, seeding adversarial data, and configuring test infrastructure. This is slow, doesn't scale, and only finds bugs you already thought of. Gauntlet solves this by replacing the manual sandbox with an autonomous adversarial agent, a "mocking agent," that intercepts your agent's tool calls in real time and creatively manipulates results to discover security flaws you didn't anticipate. Developers simply decorate their tool functions with @gauntlet.query or @gauntlet.mutation, and Gauntlet automatically finds bugs, often within minutes.

Features Used

The entire mocking agent is built within Elasticsearch Agent Builder, which manages its instructions, tool bindings, and multi-turn conversation state via the Converse API with no external orchestration required. All tool logic runs as ES|QL queries, leveraging SAMPLE, STATS, MV_CONCAT, and crucially ES|QL COMPLETION, which calls an LLM inline to generate novel attack hypotheses within a single query pipeline. Bug storage uses Kibana Workflows, keeping the write logic declarative and self-contained. Two Inference Endpoints (GPT-5-mini for completion, text-embedding-3-small for embeddings) power hypothesis generation and similarity search across stored bugs. Finally, a programmatically created Kibana Dashboard with Lens visualizations provides real-time visibility into bug counts, severity breakdowns, discovery trends, and attack pattern heatmaps.

The system operates on a closed hypothesize-prove-store cycle across two memory circuits: short-term memory ensures consistency within a session, while long-term memory stores confirmed bugs, tool implementations, and historical results to drive increasingly creative attacks over time.

Features Liked and Challenges

The standout feature was ES|QL COMPLETION. The ability to sample data, aggregate it, prompt an LLM, and return results all within a single query eliminated the need to shuttle data between Elasticsearch and external scripts. The generate-hypothesis tool is a single ES|QL statement that does all of this, which felt remarkably elegant.

Agent Builder's multi-turn state management was equally impressive. The mocking agent needs to remember everything it has already told the primary agent within a run, and the Converse API handled this natively without any custom conversation management logic.

The biggest challenge was maintaining plausibility. An obviously fake mutation teaches nothing. The short-term memory circuit exists entirely to solve this, giving the mocking agent access to all its prior commitments so it stays internally consistent while remaining adversarial. Balancing creativity with coherence was the hardest design problem throughout the project.

Built With

Share this project:

Updates