Inspiration
Most AI agents are brilliant amnesiacs. They can be sharp inside a single session and then reset to zero at the next one: nothing they did today makes them fundamentally better tomorrow. The systems we use don't compound; they wait for the next foundation model to be trained by someone else.
We wanted to take a real first step toward the opposite: intelligence that compounds from experience, an agent whose work today measurably improves the model it runs on tomorrow. The cheapest, most concrete version of that idea is also the most testable: take a small, open-source model and optimize it on curated experience until it does a genuinely hard job better than it could before. Not by buying a bigger model, but by squeezing more capability out of an open one. That's the first turn of a loop that, repeated, looks a lot like a system that learns.
What it does
We built a toolkit and a repeatable process that optimizes an open-source model from experience. It:
- Runs capable agents (frontier models plus the local coding agents Codex, Claude Code, and Devin) against a hard, real-world agentic benchmark.
- Scores every attempt with a fast, additive signal (partial credit), so even a brutal benchmark yields dense, usable feedback.
- Curates the trajectories that actually worked into a training set.
- Fine-tunes a small open model, Qwen3.5-9B, on that curated experience.
- Re-measures through the exact same evaluator.
One turn of the loop moved the base model from 0.17 → 0.34 average partial credit (≈ 2×) and from 0 → 7 strict task completions on Zapier's AutomationBench — a benchmark where even the best frontier model in the world clears barely 1 in 7. We optimized an open model, from experience, and it got measurably better.
How we built it
The benchmark grades a simulated SaaS world (CRM, email, sheets, calendars) on whether the agent's actions leave the world in the right final state. We reused its world + tools + rubric directly, and built the loop around it.
The assessment. The grader is local and deterministic. We lean on the additive form of it — the fraction of per-task assertions that pass:
$$\text{partial}(M) \;=\; \frac{1}{|A|}\sum_{a \in A} \mathbb{1}\!\left[a \text{ holds on the final world state}\right], \qquad \text{strict}(M) \;=\; \mathbb{1}\!\left[\text{partial}(M) = 1\right]$$
Strict is the headline metric and it's near-zero for everyone; \(\text{partial}\) is dense, cheap, and what makes the data trainable.
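As a rough illustration of the two metrics above, here is a minimal sketch in Python. It assumes each assertion can be modeled as a predicate over a dictionary-shaped world state; the names `partial_credit`, `strict`, and the example state are ours for illustration and are not the benchmark's actual API.

```python
from typing import Any, Callable, Dict, List

# Assumption: the final world state is a plain dict and each assertion
# is a predicate over it. The real benchmark's types may differ.
WorldState = Dict[str, Any]
Assertion = Callable[[WorldState], bool]


def partial_credit(assertions: List[Assertion], final_state: WorldState) -> float:
    """Fraction of per-task assertions that hold on the final world state."""
    if not assertions:
        raise ValueError("a task needs at least one assertion")
    passed = sum(1 for check in assertions if check(final_state))
    return passed / len(assertions)


def strict(assertions: List[Assertion], final_state: WorldState) -> bool:
    """All-or-nothing completion: true only when partial credit is exactly 1."""
    return partial_credit(assertions, final_state) == 1.0


# Hypothetical task: add a CRM contact AND send one email.
state = {"crm": {"contacts": ["ada@example.com"]}, "emails_sent": 0}
checks = [
    lambda s: "ada@example.com" in s["crm"]["contacts"],  # holds
    lambda s: s["emails_sent"] == 1,                      # fails
]

score = partial_credit(checks, state)  # 0.5
done = strict(checks, state)           # False
```

The agent in this example did half the job: strict scoring reports a flat failure, while the additive score still separates it from an attempt that did nothing.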
The harness. A local stdio MCP server exposes the benchmark's tools to any agent that speaks MCP, so Codex / Claude / Devin solve tasks with their own native multi-step loops — we just grade the final world state and capture the trace.
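The record-and-grade idea behind the harness can be sketched without the MCP transport. The following is a simplified, in-process stand-in, not the actual bridge: `ToolBridge`, `add_contact`, and the dict-shaped world are hypothetical names chosen for illustration. The point is only that the agent drives its own loop, while the bridge logs each tool call and leaves a final world state to grade.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToolBridge:
    """Dispatches tool calls against a shared world and records each one."""

    world: Dict[str, Any]
    tools: Dict[str, Callable[..., Any]]
    trace: List[Dict[str, Any]] = field(default_factory=list)

    def call(self, name: str, **arguments: Any) -> Any:
        if name not in self.tools:
            raise KeyError(f"unknown tool: {name}")
        # Run the tool first, then log the call together with its result.
        result = self.tools[name](self.world, **arguments)
        self.trace.append({"tool": name, "arguments": arguments, "result": result})
        return result


def add_contact(world: Dict[str, Any], email: str) -> Dict[str, bool]:
    """Hypothetical benchmark tool: mutates the simulated CRM."""
    world.setdefault("contacts", []).append(email)
    return {"ok": True}


bridge = ToolBridge(world={}, tools={"add_contact": add_contact})
bridge.call("add_contact", email="ada@example.com")

# After the agent finishes, bridge.world is what gets graded and
# bridge.trace is what gets saved as a candidate training trajectory.
```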
The loop. With \(M_0\) the base model and a curation step that keeps high-signal trajectories,
$$M_{t+1} \;=\; \text{SFT}\!\big(M_t,\; \text{curate}(\{\text{traces}\})\big)$$
The stack. Trace collection across a 12-model leaderboard (~28.8k trajectories) → normalize to one tool-call format → curate by \(\text{partial}\ge 0.5\) → SFT Qwen3.5-9B with prime-rl on 8×H100 (64k context, FlashAttention-3, fused cross-entropy) → serve the checkpoint with vLLM → re-evaluate through the same rubric.
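The curation step in that pipeline is simple enough to sketch. The version below assumes each normalized trajectory carries its additive score; the `Trajectory` fields and the `curate` signature are illustrative guesses, and only the threshold of 0.5 comes from the pipeline described above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Trajectory:
    task_id: str
    model: str
    partial: float                   # additive score from the grader, in [0, 1]
    messages: Tuple[str, ...] = ()   # normalized tool-call transcript (format assumed)


def curate(traces: Iterable[Trajectory], threshold: float = 0.5) -> List[Trajectory]:
    """Keep trajectories whose partial credit meets the threshold."""
    return [t for t in traces if t.partial >= threshold]


traces = [
    Trajectory("task-1", "model-a", 0.2),
    Trajectory("task-1", "model-b", 0.5),
    Trajectory("task-2", "model-a", 1.0),
]

kept = curate(traces)
# kept holds the 0.5 and 1.0 trajectories; the 0.2 attempt is dropped.
```

A threshold filter keeps every attempt that clears the bar, including several attempts at the same task. Whether to deduplicate or weight by score is a separate design choice that this sketch leaves open.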
Challenges we ran into
- Experience doesn't scale as markdown files. The obvious way to make an agent "learn" is to write everything down (journals, notes, memory files) and feed it back into context. We've seen firsthand how that breaks: context windows are bounded, notes go stale and compete for the model's attention, and re-reading your own logs is not the same as getting better at the task. Text-as-memory grows linearly and pays off sublinearly; experience only truly compounds when it's folded into the weights. That realization is the whole reason this project distills experience into the model instead of into more files.
- The benchmark is brutal. Frontier SOTA is under ~10% strict. A sparse all-or-nothing score is untrainable, so we had to make the additive signal the spine of both data selection and measurement.
- The harness mattered as much as the model. Our first approach forced agents into a constrained one-action-at-a-time policy and they collapsed (~0.16 partial). Letting them run their native loops through the MCP bridge roughly quadrupled the signal (~0.62) on the same tasks — a result that reshaped the whole design.
- Training an open 9B at long context is finicky. Context-parallel runs produced NaN gradients; a 131k-context run OOM'd in the gated-delta-rule backward; we patched fused-CE zero-token shards and sorted out tool-call/reasoning parser + processor handling to serve the checkpoint at all. The stable path was 64k, single-GPU context parallelism.
- Eval integrity. The answer key (the assertions) is never exposed to the agent; it is applied only at grading time, so improvements reflect capability, not leakage.
Accomplishments we're proud of
- An end-to-end, repeatable self-optimization loop that actually moved an open model (~2×) on a frontier-hard agentic task.
- A native, agent-agnostic harness — Codex, Claude Code, and Devin all plug in through MCP — and the empirical finding that native loops dramatically beat constrained policies.
- Honest measurement. We report the dense additive signal alongside the frontier strict numbers for context, and we're explicit about what we validated versus what comes next.
What we learned
- How you run an agent can matter as much as which model you train. The harness is a force multiplier.
- A dense, additive assessment turns a benchmark where almost everyone scores zero into a usable training and selection signal.
- A small open model can be moved a lot, fast, on curated experience — the first turn of a compounding loop, on commodity-ish open weights.
What's next
- Generalization. Our current result is loop-validating on a narrow slice (the SFT data is drawn from the same benchmark's traces). The next step is held-out / stratified eval splits before any generalization claim.
- Predict before you pay. A contrastive cohort ablation — train on a high-signal cohort vs. a low/random one and show the high-signal cohort moves the model more — would demonstrate that the cheap additive signal predicts which data is worth a full training run.
- More turns. Each iteration's improved model collects better traces, which train a better model. That's compounding intelligence — and we've shown the first turn works.
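The cohort ablation mentioned in the list above could start from a split like the one below. This is a sketch under our own assumptions: `build_cohorts` is a hypothetical helper, trajectories are represented as `(trace_id, partial)` pairs, and the control cohort is a seeded random sample of the same size so that the two training runs differ only in how the data was selected.

```python
import random
from typing import List, Sequence, Tuple


def build_cohorts(
    scored: Sequence[Tuple[str, float]], size: int, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Return (high_signal_ids, random_control_ids), each of length `size`."""
    if size > len(scored):
        raise ValueError("cohort size exceeds the number of trajectories")
    # High-signal cohort: the top `size` trajectories by partial credit.
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    high = [trace_id for trace_id, _ in ranked[:size]]
    # Control cohort: a reproducible random sample of the same size.
    rng = random.Random(seed)
    control = [trace_id for trace_id, _ in rng.sample(list(scored), size)]
    return high, control


scored = [("t1", 0.9), ("t2", 0.1), ("t3", 0.7), ("t4", 0.3), ("t5", 0.0)]
high, control = build_cohorts(scored, size=2)
# high is ["t1", "t3"]; control is two ids drawn at random from all five.
```

Each cohort would then go through the same SFT recipe and the same evaluator. If the high-signal run improves the model more, that supports using the cheap score to pick data before committing to a full run.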
Built with
Languages: Python.
ML / training: prime-rl (SFT), PyTorch, FlashAttention-3, fused cross-entropy (Liger), torchrun, context parallelism, Hugging Face datasets.
Serving / inference: vLLM (OpenAI-compatible endpoints).
Open model trained: Qwen3.5-9B.
Frontier models (trace generation): Claude Opus/Sonnet 4.6, GPT-5.4 family, Gemini 3.1 Pro / 3 Flash, GLM-5.1, and others via the public leaderboard.
Agents: Codex CLI, Claude Code, Devin CLI — integrated through MCP (Model Context Protocol) via a custom local stdio tool-bridge.
Benchmark / eval: Zapier AutomationBench; PrimeIntellect verifiers + the additive rubric (partial_credit / task_completed_correctly).
Infra / ops: 8×H100 cluster, Slurm, Weights & Biases, tmux.