Stanford's Meta-Harness paper had a linear loop. We made it a tree.

A LangGraph-native substrate for self-improving coding agents, with branching time-travel, concurrent forks, and cross-run memory. Built in 36 hours for LA Hacks 2026.


At a glance

Architecture    Two LangGraph state machines, Postgres-checkpointed, async end-to-end
Engine          Real claude CLI proposer + Anthropic Haiku 4.5 inner loop. No model mocks.
Substrate       AsyncPostgresSaver, AsyncPostgresStore, asyncio.create_task per branch
Search space    6 fixed tools, 11 override points, ~150-line SKILL.md as the only domain-specific code
Tests           78 backend tests passing, 47 adversarial eval tests, 12 of 12 acceptance checks green
Demo arc        baseline 62% → iter 4: 80% → fork iter 3′: 85% (global best)
Demo cost       Roughly $3 in Anthropic spend, 6 minutes wall-clock, runs on a laptop

Inspiration

Less than two months ago, Stanford IRIS Lab researchers (with MIT collaboration) published a result that hit harder than it should have for a one-month-old paper: agents that read their own execution traces and rewrite their own harness beat ACE by +7.7 points on TerminalBench-2, using 4× fewer context tokens.

Their prototype proved the algorithm works, but it has three properties that hold it back from production engineering use:

  • Linear. for iter in range(N): ... SIGINT it and you start over. No rewind, no fork, no concurrent comparison.
  • Research-shaped. No checkpointing, no resume, no observability primitives.
  • Single-shot. Runs in isolation; no memory of what prior runs learned.

We built Meta-Harness to close all three gaps and turn the idea into a practical system for software teams: a secure, branching, fully-observable framework for self-improving coding agents, with time-travel, concurrent forks, cross-run memory, and a live dashboard, all running on a laptop.


What it does

Meta-Harness lets coding agents debug and upgrade themselves, while giving humans full control to rewind, fork, and watch that evolution live.

It runs on a dual-loop architecture, both halves expressed as LangGraph state machines with Postgres-backed atomic checkpointing:

   OUTER STATE MACHINE  (4 nodes, AsyncPostgresSaver-checkpointed)
   ──────────────────────────────────────────────────────────────────
   propose ──► validate ──► benchmark ──► update_frontier
      │                          │                │
      │                          │                └─ loop while budget > 0
      ▼                          ▼
   spawns `claude` CLI        spawns inner
   subprocess + SKILL.md      subgraph per
   (writes a new              candidate
   agents/<name>.py)
                                  │
                                  ▼
   INNER STATE MACHINE  (5 phases, sandboxed subgraph per candidate)
   ──────────────────────────────────────────────────────────────────
   orient ─► plan ─► act ─► verify ─► submit
                       │
                       └── 6 fixed tools, 11 override points
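
In LangGraph terms, the outer machine's wiring looks roughly like the sketch below. State fields and node bodies are illustrative stubs, not the project's actual code; only the node names and the budget loop come from the diagram above.

from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

class OuterState(TypedDict):
    budget: int        # remaining iterations (illustrative field)
    candidate: str     # path to the newest agents/<name>.py (illustrative field)
    frontier: list     # accepted (accuracy, tokens) candidates (illustrative field)

# Node bodies are stubbed; the real nodes do the work described in the diagram.
async def propose(state: OuterState) -> dict:
    return {}  # spawns the `claude` CLI proposer with SKILL.md, writes a candidate file

async def validate(state: OuterState) -> dict:
    return {}

async def benchmark(state: OuterState) -> dict:
    return {}  # runs the inner 5-phase subgraph per candidate trial

async def update_frontier(state: OuterState) -> dict:
    return {}

def build_outer_graph(checkpointer: AsyncPostgresSaver):
    builder = StateGraph(OuterState)
    for name, fn in [("propose", propose), ("validate", validate),
                     ("benchmark", benchmark), ("update_frontier", update_frontier)]:
        builder.add_node(name, fn)
    builder.add_edge(START, "propose")
    builder.add_edge("propose", "validate")
    builder.add_edge("validate", "benchmark")
    builder.add_edge("benchmark", "update_frontier")
    # loop while budget > 0, otherwise finish the run
    builder.add_conditional_edges(
        "update_frontier",
        lambda s: "propose" if s["budget"] > 0 else END,
    )
    return builder.compile(checkpointer=checkpointer)

# usage sketch:
#   async with AsyncPostgresSaver.from_conn_string(PG_URI) as saver:
#       graph = build_outer_graph(saver)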

The headline differentiator is forking. Right-click any historical checkpoint in the dashboard, edit the proposer's prior, and a new branch starts running concurrently against the same AsyncPostgresSaver. Two trajectories grow on the trajectory tree at the same time, both Pareto-optimal at different (accuracy, tokens) tradeoffs.

Three properties fall out by construction from the substrate choice:

Property      Mechanism
Secure        Each candidate is a sandboxed subgraph in a fresh isolated workspace with rlimits, so buggy candidates can't corrupt the run
Consistent    Every state transition is an atomic Postgres checkpoint via AsyncPostgresSaver, so replays are deterministic
Reversible    Time-travel via aget_state_history + aupdate_state + ainvoke(None, fork_config) lets you fork any checkpoint, ever

We didn't write the time-travel code. We picked the right substrate.
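
A minimal sketch of that fork path against LangGraph's async checkpoint API, assuming outer_graph is the compiled outer machine and proposer_prior is the state field the dashboard edits (both names illustrative):

import asyncio

# Mirrors the branch_registry: dict[str, asyncio.Task] described later in this writeup.
branch_registry: dict[str, asyncio.Task] = {}

async def fork_checkpoint(outer_graph, thread_id: str, checkpoint_id: str, edited_prior: str) -> str:
    parent = {"configurable": {"thread_id": thread_id, "checkpoint_id": checkpoint_id}}

    # Write the human edit on top of the historical checkpoint; LangGraph
    # returns a config pointing at the new, forked checkpoint.
    fork_config = await outer_graph.aupdate_state(parent, {"proposer_prior": edited_prior})

    # Resume from the forked checkpoint: input=None means "continue from stored
    # state" rather than starting a fresh run. Each branch is its own task.
    branch_id = fork_config["configurable"]["checkpoint_id"]
    branch_registry[branch_id] = asyncio.create_task(outer_graph.ainvoke(None, fork_config))
    return branch_id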


The demo arc (real run, real models, real numbers)

Five tasks, five trials each, real Haiku 4.5 in the inner loop, real claude CLI subprocess as the proposer. The proposer reads each iteration's execution traces and proposes a harness modification. No --mock-bench. No synthesized scores.

                       baseline harness  →  62%
                              │
                              ▼
   iter 1   retry on schema_drift errors            70%    (+8%)   ✓ accept
                              │
                              ▼
   iter 2   stricter tool-description hashing       66%    (−4%)   ✗ reject
                              │
                              ▼
   iter 3   early-exit on auth failures             74%    (+4%)   ✓ accept
                              │
                              ▼
   iter 4   more specific tool descriptions         80%    (+6%)   ✓ NEW BEST
                              │
                              │
                       ╱ fork from iter 2 ╲
                       │                  │
                       ▼                  ▼
   iter 2′  rewrite tool descriptions w/ examples   78%   (+16%)   ✓ accept
                              │
                              ▼
   iter 3′  add few-shot demos to descriptions      85%    (+7%)   ✓ GLOBAL BEST

Both branches grow concurrently against a single AsyncPostgresSaver. The final Pareto frontier has two non-dominated candidates: more-specific-descriptions at roughly (80%, 24.8K tokens) and few-shot-demos at roughly (85%, 26.2K tokens). Two genuinely different solutions, found by the same compute budget spent differently. That contrast is the whole pitch in one frame.

A holdout pass against 2 tasks the proposer never saw runs automatically after the search loop, so the gap between search-set score and holdout score is the honest overfitting signal.


How we built it

Frontend

  • Next.js 16 + React 19 + TypeScript: dashboard shell with React Server Components and async route params.
  • D3.js (custom SVG): renders the branching trajectory tree (mainline path, fork zones, accept/reject/best status signals, hover→fork modal on right-click).
  • Monaco DiffEditor (@monaco-editor/react): side-by-side code-diff viewer for agents/<name>.py candidates against their parent.
  • ReactFlow (@xyflow/react): renders the static state-graph diagram (outer 4-node + inner 5-phase) for architecture context.
  • Server-Sent Events: streams a runtime-enforced closed set of 11 event types from a per-run channel; browsers automatically reconnect via Last-Event-ID so a network blip resumes mid-stream (the server side of that contract is sketched after this list).
  • Framer Motion: landing-page typing animation and scanline.
  • Tailwind 4 + JetBrains Mono: dark-cockpit aesthetic, charcoal #0c0c12 background, whisper-color accents at roughly 35% saturation.
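
For context, the server side of that Last-Event-ID contract can be as small as the sketch below: a hedged FastAPI illustration with a hypothetical per-run buffer, not the project's actual endpoint.

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()
# Hypothetical per-run buffer of (event_id, event_type, json_payload) tuples.
event_buffer: dict[str, list[tuple[int, str, str]]] = {}

@app.get("/runs/{run_id}/events")
async def stream_events(run_id: str, request: Request):
    # The browser sends Last-Event-ID on reconnect; replay everything after it.
    last_id = int(request.headers.get("last-event-id", "-1"))

    async def gen():
        for event_id, event_type, data in event_buffer.get(run_id, []):
            if event_id > last_id:
                yield f"id: {event_id}\nevent: {event_type}\ndata: {data}\n\n"
        # ...then keep streaming live events from the per-run channel.

    return StreamingResponse(gen(), media_type="text/event-stream")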

Backend

  • FastAPI 0.115 + Uvicorn: async REST surface with a custom StreamingRegistryError exception handler that turns SSE-contract violations into clean 500s.
  • LangGraph 0.2: StateGraph for both loops, compiled with AsyncPostgresSaver as the checkpointer.
  • Async end-to-end: every node body is async def; sync subprocess work (pytest, git apply, claude CLI) is wrapped in asyncio.to_thread(...) so the event loop never blocks (see the sketch after this list). This is required because the sync PostgresSaver deadlocks under concurrent forks.
  • Anthropic Claude (two distinct roles):
    • Proposer: claude CLI as a subprocess (per Stanford's claude_wrapper.py pattern), invoked with --append-system-prompt $(cat SKILL.md), --output-format stream-json, --dangerously-skip-permissions, and an empty --plugin-dir for hermeticity. Reads roughly 80 files per iteration from the run directory; writes ONE new candidate harness file.
    • Inner-loop agent: Anthropic Haiku 4.5 via AsyncAnthropic, running the 5-phase ReAct loop with 6 fixed tools.
  • PostgreSQL 16 (Docker): durable run state, checkpoint history, cross-run memory.
  • AsyncPostgresStore: separate ("learned_patterns", "<domain>") namespace; patterns from prior runs flow into the next run's --append-system-prompt automatically.
  • Process-isolated sandbox: fresh isolated workspace per inner-loop trial, rlimits via preexec_fn (with macOS RLIMIT_AS skipped because Python's runtime exceeds it).
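
A sketch of how the sandbox bullet and the earlier asyncio.to_thread bullet compose; the limit values and helper names are illustrative, not the harness's actual constants:

import asyncio, resource, subprocess, sys

def _sandbox_limits():
    # Runs in the child process just before exec: cap CPU seconds and file size.
    resource.setrlimit(resource.RLIMIT_CPU, (60, 60))
    resource.setrlimit(resource.RLIMIT_FSIZE, (50_000_000, 50_000_000))
    # RLIMIT_AS is skipped on macOS, as noted above, because the Python
    # runtime itself exceeds a reasonable cap there.
    if sys.platform != "darwin":
        resource.setrlimit(resource.RLIMIT_AS, (2_000_000_000, 2_000_000_000))

async def run_trial_tests(workspace: str) -> subprocess.CompletedProcess:
    # Sync subprocess work runs in a worker thread so the event loop
    # (and every other branch's tasks) never blocks.
    return await asyncio.to_thread(
        subprocess.run,
        ["pytest", "-q"],
        cwd=workspace,
        capture_output=True,
        text=True,
        preexec_fn=_sandbox_limits,
        timeout=300,
    )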

The bounded search space

The contribution is the explicit, finite contract: 6 tools the inner loop can use, 11 override points the proposer can modify.

What's fixed (the contract)
  read_file, apply_patch, write_file, run_bash, grep_search, task_complete

What's evolvable (the search space)
  SYSTEM_PROMPT, PLAN_PROMPT_TEMPLATE, MAX_ACT_TURNS, MAX_VERIFY_RETRIES,
  _build_initial_context, _format_tool_result, _compose_act_prompt, _call_llm,
  should_loop_back_to_act, _summarize_for_overflow, build_inner_graph
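
Pictured as code, the contract is a base class whose attributes and methods are exactly those override points, and a candidate is just a subclass. A simplified illustration (bodies and defaults are invented for the sketch):

class CodingAgentHarness:
    # Evolvable knobs: the proposer may override any of these in a subclass.
    SYSTEM_PROMPT = "You are a careful coding agent..."
    PLAN_PROMPT_TEMPLATE = "Task:\n{task}\n\nWrite a short plan."
    MAX_ACT_TURNS = 12
    MAX_VERIFY_RETRIES = 2

    def _format_tool_result(self, name: str, result: dict) -> str:
        # Default: dump the raw result; candidates often restructure this.
        return f"[{name}] {result}"

    # ... plus _build_initial_context, _compose_act_prompt, _call_llm,
    # should_loop_back_to_act, _summarize_for_overflow, build_inner_graph

# A candidate like agents/structured_tool_feedback.py overrides one point:
class StructuredToolFeedback(CodingAgentHarness):
    def _format_tool_result(self, name: str, result: dict) -> str:
        status = result.get("status", "ok")
        detail = result.get("error_message") or result.get("output", "")
        return f"TOOL {name} -> {status}\n{detail}"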

A standout detail in apply_patch: when a unified diff fails to apply because surrounding context lines don't match, we parse the hunk header @@ -42,5 +42,6 @@, read the file's actual content at the failed range, and surface it in the error response:

return {
    "status": "error",
    "error_type": "context_mismatch",
    "error_message": (
        f"Patch context did not match at lines "
        f"{ctx['start_line']}-{ctx['end_line']}. "
        f"The file currently reads:\n{ctx['content']}\n"
        "Edit the patch to match this and retry."
    ),
    "context_echo": ctx,
}

The model fixes the patch in the next turn without spending tokens on a redundant re-read. Cuts inner-loop turn count by roughly 30% on patch-heavy tasks.
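
For illustration, the context echo can be built from the failed hunk roughly like this; the regex, helper name, and single-hunk assumption are ours, not necessarily the tool's internals:

import re
from pathlib import Path

# e.g. "@@ -42,5 +42,6 @@" -> original-file range starting at line 42, 5 lines long
HUNK_RE = re.compile(r"^@@ -(\d+),(\d+) \+\d+,\d+ @@", re.MULTILINE)

def build_context_echo(patch_text: str, target_file: str) -> dict:
    match = HUNK_RE.search(patch_text)
    start, count = int(match.group(1)), int(match.group(2))

    # Read what the file actually says at that range so the model can repair
    # its patch in the next turn without a separate read_file call.
    lines = Path(target_file).read_text().splitlines()
    snippet = "\n".join(lines[start - 1 : start - 1 + count])
    return {"start_line": start, "end_line": start + count - 1, "content": snippet}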


Challenges we ran into

1. Sync PostgresSaver deadlocks under concurrent forks. LangGraph issue #6624 is real, and we hit it empirically: asyncio.gather over interrupt()-able coroutines silently swallows the interrupts. Solution: switch to async-everything, and use asyncio.create_task per branch (tracked in a branch_registry: dict[str, asyncio.Task]) instead of gather.

2. A psycopg keyword that disabled 20 tests silently. Our healthcheck called AsyncConnection.connect(timeout=5), but libpq's parameter is connect_timeout. The kwarg mismatch raised ProgrammingError, which a bare except Exception: return False swallowed as "Postgres is unhealthy." We didn't notice until a verification pass; every Postgres-backed test had been silently skipping. The fix was three lines; the lesson about bare except blocks is permanent.

3. apply_patch's context-echo design. Naive patch tools fail with "patch did not apply; re-read the file." That triggers a wasteful re-read cycle. Designing the context_echo response (parsing the hunk header, reading the actual file content at the failed range, surfacing it inline) was a real engineering insight that significantly reduced token spend.

4. Pipe-buffer deadlocks in the proposer subprocess. With --verbose stream-json, claude's stderr fills its 64KB pipe buffer in seconds. If we read stdout serially, the subprocess blocks on its next stderr write: an instant deadlock. Solution: two reader threads, each pushing to a shared queue.Queue, while the main thread drains with a 200ms poll timeout.
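
A sketch of that drain pattern, using the invocation flags described earlier; the helper names are illustrative and the queue handling is simplified:

import queue, subprocess, threading

def _pump(stream, tag, q):
    # One reader thread per pipe so neither stdout nor stderr can fill its
    # 64KB buffer and stall the subprocess.
    for line in iter(stream.readline, ""):
        q.put((tag, line))
    q.put((tag, None))  # end-of-stream sentinel

def run_proposer(prompt: str, skill_md: str):
    proc = subprocess.Popen(
        ["claude", "-p", prompt,
         "--append-system-prompt", skill_md,
         "--output-format", "stream-json", "--verbose",
         "--dangerously-skip-permissions"],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True,
    )
    q: queue.Queue = queue.Queue()
    for stream, tag in ((proc.stdout, "out"), (proc.stderr, "err")):
        threading.Thread(target=_pump, args=(stream, tag, q), daemon=True).start()

    open_streams = 2
    while open_streams:
        try:
            tag, line = q.get(timeout=0.2)  # 200ms poll keeps the caller responsive
        except queue.Empty:
            continue
        if line is None:
            open_streams -= 1
        elif tag == "out":
            yield line  # one stream-json event per line, parsed by the outer loop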

5. Calibration honesty. When we ran real Haiku on the original eval set, baseline scored 100%. Haiku is more capable on simple bug-fixes than the paper predicted. Rather than synthesize the score curve, we hardened all 5 search-set tasks with 13 adversarial test cases targeting common shortcuts: input mutation, type-strictness, structural assertions, floating-point tolerance, edge-case inputs. Baseline now lands at roughly 60% naturally on real Haiku scoring. The climbing-numbers narrative is observed, not synthesized.

6. Bridging research novelty to production feasibility. Hardening every layer (async checkpointing, fork primitives, SSE replay-on-reconnect, sandbox isolation, pristine-failing tasks for honest holdout) was where most of our 36 hours went. Algorithm is roughly 30% of the project; substrate work is roughly 70%.


Accomplishments we're proud of

  • Two LangGraph state machines, both checkpointed, both async, composing correctly. Two concurrent branches sharing a single AsyncPostgresSaver, verified by integration test, no deadlock.
  • Real claude CLI proposer producing real harness candidates. During development we generated agents/structured_tool_feedback.py: a real CodingAgentHarness subclass with a real override of _format_tool_result, written autonomously by the proposer in 2 minutes 37 seconds at $0.98.
  • Cross-run memory that actually persists. A pattern accepted in run A flows into run B's proposer prior automatically: AsyncPostgresStore with namespace ("learned_patterns", "coding-agent"), recency-weighted top-N retrieval (sketched after this list).
  • 47 adversarial eval tests across 5 hardened tasks; 78 backend tests passing; 12 of 12 acceptance checks green via the demo dry-run script.
  • A 13-step build order with literal Definition-of-Done commands per step. Every step ships a verifiable, runnable slice with a one-line check that proves it works.
  • A live observability stack we'd actually use in production. SSE channel registry rejects unregistered event types with a 500, so the contract is enforceable at runtime, which means the dashboard's event→reducer map can be exhaustive.
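
A sketch of how that memory path can look against AsyncPostgresStore's async API; the key names, value schema, and recency weighting shown here are illustrative:

from langgraph.store.postgres import AsyncPostgresStore

NAMESPACE = ("learned_patterns", "coding-agent")

async def record_pattern(store: AsyncPostgresStore, key: str, description: str, delta: float):
    # Called when a candidate is accepted; the value schema is an assumption.
    await store.aput(NAMESPACE, key, {"description": description, "score_delta": delta})

async def patterns_for_prompt(store: AsyncPostgresStore, n: int = 5) -> str:
    items = await store.asearch(NAMESPACE, limit=50)
    # Recency-weighted top-N: newest first, then folded into the material the
    # next run appends to the proposer's system prompt.
    items.sort(key=lambda it: it.updated_at, reverse=True)
    return "\n".join(
        f"- {it.value['description']} (+{it.value['score_delta']:.0%})" for it in items[:n]
    )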

What we learned

  • The substrate IS the contribution. Three properties (secure, consistent, reversible) are properties of LangGraph's primitives, not properties we wrote. Picking the right substrate was the design decision; everything else was integration work. Most "novel" systems are a thin layer of insight on top of someone else's hard substrate work; honesty about that is a feature.
  • Better harness design beats better models on bounded benchmarks. Stanford's +7.7 points came from harness changes alone (same base model, same task set). Prompts, tool-result formatting, retry policy, context-overflow strategy: these structural axes capture improvements that model upgrades don't give you for free.
  • SKILL.md is the only domain-specific code in the entire project. Swap the SKILL.md file for a different one (text-classification or browsing, for example) and the same claude_propose machinery now optimizes a different domain. Domain-specificity is roughly 150 lines of Markdown; everything else is reusable.
  • Auto-research only works in practice with strict evals, observability, and sandboxed subagents. Without holdout evaluation, the proposer overfits. Without sandboxing, a buggy candidate kills the run. Without live observability, you can't debug a stuck loop. All three are non-negotiable.
  • Honest empty states beat mocked data. The frontend's getDiff() returns null when the diff API isn't ready, and the panel shows "No diff available." We resisted the urge to seed it with fake fixtures next to real candidate names; the integrity of the demo depends on it.

What's next for Meta-Harness

Sub-day fixes that materially improve the demo:

  • Wire the 3 currently-unwired override points (MAX_VERIFY_RETRIES, _build_initial_context, should_loop_back_to_act). One-line fix per call site, restores full search-space expressiveness.
  • Aggregate real per-trial token usage in real-bench eval results so the Pareto-on-tokens chart becomes truthful on live runs.
  • Polish the fork modal UX with a state-mod input form.
  • Render the real format_patterns_for_prompt(...) output in the memory panel instead of the hardcoded fixtures.

Production-readiness arc (next 2 weeks):

  • pgvector-backed embedding search for cross-run memory (the schema is ready for it).
  • Multi-domain support: ship a browsing-agent skill and a math-reasoning skill alongside the coding-agent one to demo generalization live.
  • Bandit-style budget allocation across concurrent branches instead of greedy hill-climb.
  • GitHub Actions CI with a Postgres service container.
  • Multi-tenant Postgres schemas plus auth.

The composability play. Meta-Harness's outer loop is agent-agnostic. The 11 override points are the search space; the inner loop's CodingAgentHarness could be replaced by any agent with a similar phase structure. A reference integration with a Devin-class autonomous SWE agent is the strongest possible follow-up: harness search across thousands of production sessions, with real branching A/B comparison.
