Inspiration

Most AI agents are brilliant amnesiacs. They can be sharp inside a single session and then reset to zero at the next one: nothing they did today makes them fundamentally better tomorrow. The systems we use don't compound; they wait for the next foundation model to be trained by someone else.

We wanted to take a real first step toward the opposite: intelligence that compounds from experience, meaning an agent whose work today measurably improves the model it runs on tomorrow. The cheapest, most concrete version of that idea is also the most testable: take a small, open-source model and optimize it on curated experience until it does a genuinely hard job better than it could before. Not by buying a bigger model, but by squeezing more capability out of an open one. That's the first turn of a loop that, repeated, looks a lot like a system that learns.

What it does

We built a toolkit and a repeatable process that optimizes an open-source model from experience. It:

Runs capable agents, both frontier models and local coding agents (Codex, Claude Code, Devin), against a hard, real-world agentic benchmark.
Scores every attempt with a fast, additive signal (partial credit), so even a brutal benchmark yields dense, usable feedback.
Curates the trajectories that actually worked into a training set.
Fine-tunes a small open model, Qwen3.5-9B, on that curated experience.
Re-measures through the exact same evaluator.

One turn of the loop moved the base model from 0.17 → 0.34 average partial credit (≈ 2×) and from 0 → 7 strict task completions on Zapier's AutomationBench — a benchmark where even the best frontier model in the world clears barely 1 in 7. We optimized an open model, from experience, and it got measurably better.

How we built it

The benchmark grades a simulated SaaS world (CRM, email, sheets, calendars) on whether the agent's actions leave the world in the right final state. We reused its world + tools + rubric directly, and built the loop around it.

The assessment. The grader is local and deterministic. We lean on the additive form of it — the fraction of per-task assertions that pass:

$$\text{partial}(M) \;=\; \frac{1}{|A|}\sum_{a \in A} \mathbb{1}\!\left[a \text{ holds on the final world state}\right], \qquad \text{strict}(M) \;=\; \mathbb{1}\!\left[\text{partial}(M) = 1\right]$$

Strict is the headline metric and it's near-zero for everyone; $\text{partial}$ is dense, cheap, and what makes the data trainable.
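As a minimal sketch of the two scores above: assume a task's assertions are predicates over the final world state. The `WorldState` and `Assertion` types, the field names, and the example task are hypothetical stand-ins for illustration, not the benchmark's actual rubric API.

```python
from typing import Callable, Mapping, Sequence

# Hypothetical types: the final world state is a plain mapping,
# and each assertion is a predicate evaluated against it.
WorldState = Mapping[str, object]
Assertion = Callable[[WorldState], bool]


def partial_credit(assertions: Sequence[Assertion], final_state: WorldState) -> float:
    """Fraction of a task's assertions that hold on the final world state."""
    if not assertions:
        raise ValueError("a task needs at least one assertion")
    passed = sum(1 for check in assertions if check(final_state))
    return passed / len(assertions)


def strict(assertions: Sequence[Assertion], final_state: WorldState) -> bool:
    """All-or-nothing score: true only when every assertion holds."""
    if not assertions:
        raise ValueError("a task needs at least one assertion")
    return all(check(final_state) for check in assertions)


# Made-up task with four assertions; the agent satisfied three of them.
example_assertions = [
    lambda s: s.get("crm_contact_created") is True,
    lambda s: s.get("emails_sent") == 1,
    lambda s: s.get("calendar_event") == "kickoff",
    lambda s: s.get("sheet_rows_added") == 3,
]
example_state = {
    "crm_contact_created": True,
    "emails_sent": 1,
    "calendar_event": None,
    "sheet_rows_added": 3,
}

print(partial_credit(example_assertions, example_state))  # 0.75
print(strict(example_assertions, example_state))          # False
```

The example shows why the additive form matters for training data: this attempt scores 0 under the strict metric but still carries 0.75 of usable signal.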

The harness. A local stdio MCP server exposes the benchmark's tools to any agent that speaks MCP, so Codex / Claude / Devin solve tasks with their own native multi-step loops — we just grade the final world state and capture the trace.

The loop. With $M_0$ the base model and a curation step that keeps high-signal trajectories,

$$M_{t+1} \;=\; \text{SFT}\!\big(M_t,\; \text{curate}(\text{traces})\big)$$
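The recurrence can be written as a small driver. This is a sketch only: `run_loop` and its four callables are hypothetical names, and the toy stand-ins at the bottom exist purely to make the control flow runnable. Note that in the first turn described here, traces came from a leaderboard of stronger agents rather than from the model being trained, so `collect` is free to ignore its argument.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

M = TypeVar("M")  # model handle (checkpoint path, endpoint, ...)
T = TypeVar("T")  # one recorded trajectory


def run_loop(
    base_model: M,
    turns: int,
    collect: Callable[[M], Sequence[T]],
    curate: Callable[[Sequence[T]], Sequence[T]],
    sft: Callable[[M, Sequence[T]], M],
    evaluate: Callable[[M], float],
) -> Tuple[M, List[float]]:
    """Apply M_{t+1} = SFT(M_t, curate(traces)) for a fixed number of turns."""
    model = base_model
    scores = [evaluate(model)]  # baseline, measured before any training
    for _ in range(turns):
        traces = collect(model)
        model = sft(model, curate(traces))
        scores.append(evaluate(model))  # same evaluator every turn
    return model, scores


# Toy stand-ins (not real training): a "model" is an int, a "trace" is its
# partial-credit score, and "SFT" just counts the curated traces it saw.
final_model, scores = run_loop(
    base_model=0,
    turns=2,
    collect=lambda m: [0.2, 0.6, 0.9],
    curate=lambda traces: [p for p in traces if p >= 0.5],
    sft=lambda m, data: m + len(data),
    evaluate=lambda m: float(m),
)
print(final_model, scores)  # 4 [0.0, 2.0, 4.0]
```

Passing the four steps in as callables keeps the evaluator fixed across turns, which is what makes before and after scores comparable.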

The stack. Trace collection across a 12-model leaderboard (~28.8k trajectories) → normalize to one tool-call format → curate by $\text{partial}\ge 0.5$ → SFT Qwen3.5-9B with prime-rl on 8×H100 (64k context, FlashAttention-3, fused cross-entropy) → serve the checkpoint with vLLM → re-evaluate through the same rubric.
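The normalize and curate steps might look roughly like the sketch below. The two input shapes follow the publicly documented OpenAI-style (`function.arguments` as a JSON string) and Anthropic-style (`tool_use` with an `input` dict) tool-call layouts; the unified target format, the `Trajectory` record, and the tool name are assumptions made for illustration, not the project's actual schema.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence


def normalize_tool_call(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map a provider-specific tool call onto {"name": str, "arguments": dict}."""
    if "function" in raw:
        # OpenAI-style: arguments usually arrive as a JSON-encoded string.
        fn = raw["function"]
        args = fn.get("arguments") or "{}"
        parsed = json.loads(args) if isinstance(args, str) else dict(args)
        return {"name": fn["name"], "arguments": parsed}
    if raw.get("type") == "tool_use":
        # Anthropic-style: arguments arrive as a dict under "input".
        return {"name": raw["name"], "arguments": dict(raw.get("input") or {})}
    raise ValueError(f"unrecognized tool-call shape: {sorted(raw)}")


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record: one graded attempt at one task."""
    task_id: str
    partial: float
    tool_calls: tuple = ()


def curate(trajectories: Sequence[Trajectory], threshold: float = 0.5) -> List[Trajectory]:
    """Keep trajectories whose partial credit is at least the threshold."""
    return [t for t in trajectories if t.partial >= threshold]


openai_style = {
    "type": "function",
    "function": {"name": "crm_create_contact", "arguments": '{"email": "a@example.com"}'},
}
anthropic_style = {
    "type": "tool_use",
    "name": "crm_create_contact",
    "input": {"email": "a@example.com"},
}
unified = normalize_tool_call(openai_style)
print(unified == normalize_tool_call(anthropic_style))  # True

pool = [
    Trajectory("task-1", 0.25),
    Trajectory("task-1", 0.5),
    Trajectory("task-2", 1.0),
]
kept = curate(pool)
print([t.partial for t in kept])  # [0.5, 1.0]
```

Normalizing before curating means the threshold is applied to one uniform pool, regardless of which agent produced a trace.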

Challenges we ran into

Experience doesn't scale as markdown files. The obvious way to make an agent "learn" is to write everything down (journals, notes, memory files) and feed it back into context. We've seen firsthand how that breaks: context windows are bounded, notes go stale and compete for the model's attention, and re-reading your own logs is not the same as getting better at the task. Text-as-memory grows linearly and pays off sublinearly; experience only truly compounds when it's folded into the weights. That realization is the whole reason this project distills experience into the model instead of into more files.
The benchmark is brutal. Frontier SOTA is under ~10% strict. A sparse all-or-nothing score is untrainable, so we had to make the additive signal the spine of both data selection and measurement.
The harness mattered as much as the model. Our first approach forced agents into a constrained one-action-at-a-time policy and they collapsed (~0.16 partial). Letting them run their native loops through the MCP bridge roughly quadrupled the signal (~0.62) on the same tasks — a result that reshaped the whole design.
Training an open 9B at long context is finicky. Context-parallel runs produced NaN gradients; a 131k-context run ran out of memory in the gated-delta-rule backward pass; we patched the fused cross-entropy kernel to handle shards with zero tokens, and had to fix tool-call and reasoning parser and processor handling before the checkpoint would serve at all. The stable path was 64k context with a context-parallel degree of 1, so each sequence stays on a single GPU.
Eval integrity. The answer key (the assertions) is never exposed to the agent; it is applied only at grading time, so improvements reflect capability, not leakage.

Accomplishments we're proud of

An end-to-end, repeatable self-optimization loop that actually moved an open model (~2×) on a frontier-hard agentic task.
A native, agent-agnostic harness — Codex, Claude Code, and Devin all plug in through MCP — and the empirical finding that native loops dramatically beat constrained policies.
Honest measurement. We report the dense additive signal alongside the frontier strict numbers for context, and we're explicit about what we validated versus what comes next.

What we learned

How you run an agent can matter as much as which model you train. The harness is a force multiplier.
A dense, additive assessment turns a benchmark where almost everyone scores zero into a usable training and selection signal.
A small open model can be moved a lot, fast, on curated experience — the first turn of a compounding loop, on commodity-ish open weights.

What's next

Generalization. Our current result is loop-validating on a narrow slice (the SFT data is drawn from the same benchmark's traces). The next step is held-out / stratified eval splits before any generalization claim.
Predict before you pay. A contrastive cohort ablation — train on a high-signal cohort vs. a low/random one and show the high-signal cohort moves the model more — would demonstrate that the cheap additive signal predicts which data is worth a full training run.
More turns. Each iteration's improved model collects better traces, which train a better model. That's compounding intelligence — and we've shown the first turn works.
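For the "Predict before you pay" ablation above, one way to build the two cohorts is sketched here. It assumes each trajectory already carries its partial-credit score; `build_cohorts`, the `(trace_id, partial)` tuple layout, and the scores in the example are hypothetical. The random control is drawn at the same size as the high-signal cohort, so any difference after training is not an artifact of data volume.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple

Scored = Tuple[str, float]  # (trace_id, partial credit); hypothetical layout


def build_cohorts(
    scored: Sequence[Scored], size: int, seed: int = 0
) -> Tuple[List[Scored], List[Scored]]:
    """Return (high_signal, random_control) cohorts of equal size."""
    if not 0 < size <= len(scored):
        raise ValueError("size must be between 1 and the number of traces")
    # High-signal cohort: the top `size` traces by partial credit.
    high_signal = sorted(scored, key=lambda item: item[1], reverse=True)[:size]
    # Control cohort: a seeded uniform sample from the same pool.
    control = random.Random(seed).sample(list(scored), size)
    return high_signal, control


# Made-up scores for eight traces, for illustration only.
example_pool = [(f"trace-{i}", p) for i, p in enumerate(
    [0.0, 0.1, 0.25, 0.4, 0.5, 0.75, 0.9, 1.0]
)]
high, control = build_cohorts(example_pool, size=3)
print([p for _, p in high])  # [1.0, 0.9, 0.75]
print(mean(p for _, p in high) >= mean(p for _, p in control))  # True
```

Each cohort would then go through the same SFT recipe and the same evaluator; the ablation supports the claim only if the high-signal cohort moves the model more.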

Built with

Languages: Python.

ML / training: prime-rl (SFT), PyTorch, FlashAttention-3, fused cross-entropy (Liger), torchrun, context parallelism, Hugging Face datasets.

Serving / inference: vLLM (OpenAI-compatible endpoints).

Open model trained: Qwen3.5-9B.

Frontier models (trace generation): Claude Opus/Sonnet 4.6, GPT-5.4 family, Gemini 3.1 Pro / 3 Flash, GLM-5.1, and others via the public leaderboard.

Agents: Codex CLI, Claude Code, Devin CLI — integrated through MCP (Model Context Protocol) via a custom local stdio tool-bridge.

Benchmark / eval: Zapier AutomationBench; Prime Intellect verifiers + the additive rubric (partial_credit / task_completed_correctly).

Infra / ops: 8×H100 cluster, Slurm, Weights & Biases, tmux.