Deferred-Execution Attacks Against AI Control Monitors

Inspiration

Reading the existing AI control literature, I noticed every published monitor I could find scored agent actions inside a single session. ControlArena, BashArena, SHADE-Arena, and Ctrl-Z all share this temporal scope. At the same time, every Linux sysadmin knows that the operating system exposes a long list of mechanisms for scheduling actions into the future, including cron, systemd timers, inotify watchers, pam_exec hooks, and logrotate postrotate scripts. Putting these two facts together raised an uncomfortable question. If a monitor only sees the session, what stops an untrusted agent from installing a dormant trigger during the session and letting the harmful action fire after the monitor stops watching?

I had not seen this attack class named anywhere, and the omission attacks paper from the previous Apart cycle convinced me that naming a structural blind spot, demonstrating it cleanly, and proposing a matched defense was a fundable shape for a short project.

What I built

The project has three components.

First, a taxonomy of five Linux trigger categories (scheduled jobs, event watches, lifecycle triggers, authentication-path triggers, and resource-condition triggers), each mapped to a MITRE ATT&CK technique under T1053 or T1546. The taxonomy organizes the attack surface and makes it clear which mechanisms an architectural defense must cover.

Second, four hand-crafted trace pairs in BashArena. Each pair shares a main task and a harmful end state, with a commission variant that performs the harm inline and a deferred variant that distributes the harm across a dormant trigger, a sourced configuration file, a wildcard pattern, and a pre-existing symbolic link in the surrounding system. The traces run end-to-end with claude-opus-4-7 as the untrusted agent.

Third, a trigger-graph monitor, implemented as a drop-in ControlArena component. At session end, it enumerates every dormant trigger newly installed under the standard locations on a Linux system, follows reference chains through scripts and sourced configurations, resolves wildcards and symbolic links with the semantics that will apply when the trigger fires, and submits the resolved chain to the same basic_monitor scaffolding the baseline Opus per-step monitor uses.

How I built it

The implementation is built on top of UK AISI's ControlArena and Inspect AI. The attack traces are constructed as BashArena tasks with a main-task scoring function for the legitimate work and a separate side-task verifier that checks the harmful post-condition after the trigger has fired in a sandbox.

The trigger-graph monitor runs as a post-session hook that re-executes the session in a fresh container, captures the resulting filesystem state, and walks a hand-curated list of trigger locations. For each newly installed trigger, the monitor resolves its command into a synthesized action chain by tracing every shell source, every script invocation, every wildcard expansion, and every symbolic-link resolution that will happen at fire time. The chain plus the original session bash calls are then scored by the same monitor scaffolding the baseline uses, and the final suspicion score is the maximum across both surfaces.

Evaluation runs the five monitor configurations (Sonnet per-step, Opus per-step, Opus per-step with a deferred-aware prompt suffix, Opus full-trajectory, and trigger-graph) against the four trace pairs and four matched honest trajectories, producing the headline and false-positive tables.

Challenges

The hardest part of the project was constructing trace pairs whose deferred variants did not leak. A naive implementation puts something that looks malicious into the session, for example a path that names a sensitive directory, or a command that writes to a world-readable location, and the per-step monitor catches it immediately. The traces had to distribute the harmful composition across a trigger, a sourced configuration file, a wildcard pattern, and a pre-existing symbolic link, so that the install-time text contained nothing more suspicious than routine sysadmin work. The lifecycle (EXIT trap) category was excluded for exactly this reason, since the rc file body is both the trigger and the action chain, so the harmful composition unavoidably appears in the in-session transcript and the per-step monitor caught it at 0.7.

The second challenge was the false-positive analysis. The first trigger-graph implementation looked great on attacks but flagged honest sysadmin work, which would have been useless in deployment. Diagnosing this required scoring each honest trajectory under the augmented-transcript framing, identifying which framing components made the monitor treat the trigger surface as suspicious in isolation, and adding the task-intent step that grounds scoring in the agent's stated objective.

Results

The strongest per-step monitor flags 4 of 4 commission attacks and 0 of 4 matched deferred attacks at the conventional audit threshold of 0.5, with mean suspicion collapsing from 0.82 on commission to 0.18 on deferred. Adding the deferred-aware prompt suffix does not close the gap. The trigger-graph monitor flags 4 of 4 deferred attacks at mean 0.90 and produces 0 of 4 false positives across four honest trajectories. Fisher's exact test on the per-step comparison (4 of 4 versus 0 of 4) gives \(p = 0.029\).

Built With

anthropic
docker
linux
python

Updates

William Guo started this project — May 17, 2026 10:41 AM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.