Baton — Verified AI Agent Handoffs
Baton compiles the noisy, half-finished work of one AI agent into the smallest verified state another agent needs to continue. It transfers work between independent coding tools (Claude Code and Codex CLI) without re-explaining the task, and at no extra API cost.
Inspiration
Anyone who codes with AI agents has hit the same wall. You spend forty minutes with Claude Code on a hard bug. It finally has the context: the repo layout, the failing test, the approaches you already ruled out. Then it hits a usage limit, and you are starting over in a fresh Codex window, retyping the whole story from memory.
The industry has spent two years making each agent smarter, with larger context windows and better reasoning. No one built the part that lets one agent pass the work to another. As the number of usable models grows and rate limits stay real, the problem stops being "is the model smart enough" and becomes "can I keep working when this one runs out." That gap, continuity, is what we set out to close.
What It Does
When an agent hits a usage limit, crashes, or stalls mid-task, Baton does four things:
- Freeze. It snapshots the workspace from facts, not narration: `git diff`, test exit codes, terminal output, and a regex code skeleton.
- Compress. It distills that into a small, schema-validated handoff packet, roughly 95 percent smaller than the raw session. The packet keeps the one thing a summary throws away: failure memory, the "do not re-run this migration blindly, the first attempt half-succeeded" knowledge.
- Switch. It launches a different agent (Claude Code or Codex CLI) in the same repository, working from that packet alone.
- Verify. It runs your real verification command, such as `npm test`, and reports the actual exit code and final diff. It does not claim success. It proves it or it does not.
The developer never re-explains the task during the transfer. And Baton drives your already-authenticated local CLIs (claude -p, codex exec) rather than metered REST APIs, so a handoff adds nothing to your bill.
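The Verify step can be sketched in a few lines. This is a minimal illustration, not Baton's actual verifier: `verify` and `VerifyResult` are hypothetical names, and the only real API used is Node's `spawnSync`. The point it shows is that the pass/fail decision reads the exit code and nothing else.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical result shape; field names are illustrative, not Baton's schema.
interface VerifyResult {
  command: string;
  exitCode: number | null; // null when the process was killed by a signal
  passed: boolean;
  outputTail: string;
}

// Run the stored verification command through the shell and judge it
// only by its exit code. Nothing the agent said is consulted.
function verify(command: string, cwd: string = process.cwd()): VerifyResult {
  const result = spawnSync(command, { cwd, shell: true, encoding: "utf8" });
  const output = `${result.stdout ?? ""}${result.stderr ?? ""}`;
  return {
    command,
    exitCode: result.status,
    passed: result.status === 0,
    // Keep only the last lines so the report stays small.
    outputTail: output.split("\n").slice(-20).join("\n"),
  };
}
```

Calling `verify("npm test")` in the repository root would return the real exit code of the test run, whatever the agent reported about it.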
How this differs from what already exists
| Category | What it does | Why Baton is different |
|---|---|---|
| Cursor, Windsurf, Copilot | AI editors built around one model and one session | They have no concept of handing off to a rival tool. Baton is not an editor. It is the layer between tools. |
| Claude Code / Codex resume flags | Resume a session, but only their own, and only locally | There is no cross-vendor resume. Baton is provider-neutral by design. |
| Memory tools (Mem0, MCP memory servers) | Persist the conversation | Agents report progress optimistically, including on failing tests. Baton compresses executable evidence, not chat. |
| Agent frameworks (LangGraph, CrewAI, AutoGen) | Orchestrate agents you build, using metered API keys | Baton rescues the human-driven session you are already in, for free, on local auth. |
Most of the market has been competing on the model axis. Baton works on a different one, continuity, and that axis becomes more valuable as the set of available models keeps splitting.
How We Built It
The architecture is provider-neutral, with shared contracts as the only dependency boundary.
- Shared contracts (`packages/shared`). Runtime-validated Zod schemas (`RelayEvent`, `HandoffPacket`). One definition produces both the validator and the TypeScript type, so no layer can emit a malformed packet.
- Node and TypeScript server. A guarded session state machine, a generic process runner built on `node:child_process` that turns any CLI's lifecycle into typed events, provider adapters for Claude and Codex, the orchestrator that runs detect, freeze, distill, launch, and verify, and a WebSocket broadcaster for the live timeline.
- The compressor. It gathers evidence deterministically (git, regex skeleton, exit codes), then gives the language model exactly one job: distill. The output is schema-validated, and on any failure a deterministic fallback packet is built from the raw evidence, so compression is never a single point of failure.
- The verifier. It runs the stored command through the shell and treats the exit code as the only source of truth.
- Event store. Every event streams to Redis so the timeline survives a refresh, or to an in-memory store with the same interface. The engine never imports Redis; adapters emit into a sink and do not know who is listening.
- Front end. A React and Vite dashboard, plus an Electron desktop companion that docks to a screen edge and adds a native folder picker.
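The compressor's validate-or-fall-back behaviour can be sketched as follows. Everything here is an assumption made for illustration: the packet and evidence fields are invented, and `parsePacket` is a hand-written guard standing in for the Zod schema the project actually uses, so that the sketch runs without dependencies.

```typescript
// Hypothetical packet shape. The real HandoffPacket schema is defined with
// Zod in packages/shared; these fields are illustrative.
interface HandoffPacket {
  intent: string;
  decisions: string[];
  pitfalls: string[];
  source: "distilled" | "fallback";
}

// Raw evidence gathered deterministically before the model is involved.
interface Evidence {
  task: string;
  diffStat: string;
  testExitCode: number;
}

// Hand-written stand-in for schema validation (the project uses Zod here).
function parsePacket(raw: string): HandoffPacket | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const record = value as Record<string, unknown>;
  const intent = record["intent"];
  const decisions = record["decisions"];
  const pitfalls = record["pitfalls"];
  const isStrings = (x: unknown): x is string[] =>
    Array.isArray(x) && x.every((item) => typeof item === "string");
  if (typeof intent !== "string" || intent.length === 0) return null;
  if (!isStrings(decisions) || !isStrings(pitfalls)) return null;
  return { intent, decisions, pitfalls, source: "distilled" };
}

// Built only from evidence, so it does not depend on the model at all.
function fallbackPacket(e: Evidence): HandoffPacket {
  return {
    intent: e.task,
    decisions: [`Working tree changes: ${e.diffStat}`],
    pitfalls: e.testExitCode === 0 ? [] : [`Tests last exited with code ${e.testExitCode}`],
    source: "fallback",
  };
}

// The model gets one job (distill). Invalid output or a thrown error
// both land on the deterministic fallback.
function compress(e: Evidence, distill: (e: Evidence) => string): HandoffPacket {
  try {
    return parsePacket(distill(e)) ?? fallbackPacket(e);
  } catch {
    return fallbackPacket(e);
  }
}
```

The design choice worth noting is that `compress` always returns a packet: a model timeout, malformed JSON, or a missing field degrades the handoff instead of blocking it.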
Built with: TypeScript, Node.js, React, Vite, Electron, Redis, WebSocket, Zod, Git, Claude, Codex.
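The sink boundary described for the event store can be sketched like this. The names (`EventSink`, `MemoryEventStore`, `fanOut`, `recordFreeze`) and the event fields are hypothetical, and the sketch is synchronous for brevity; a Redis-backed sink would most likely be asynchronous.

```typescript
// Hypothetical event shape; the real RelayEvent schema lives in packages/shared.
interface RelayEvent {
  sessionId: string;
  type: string;
  at: number;
}

// The only surface the engine and adapters see.
interface EventSink {
  emit(event: RelayEvent): void;
}

// In-memory store: keeps a per-session timeline so a refreshed UI can replay it.
class MemoryEventStore implements EventSink {
  private readonly timelines = new Map<string, RelayEvent[]>();

  emit(event: RelayEvent): void {
    const timeline = this.timelines.get(event.sessionId) ?? [];
    timeline.push(event);
    this.timelines.set(event.sessionId, timeline);
  }

  history(sessionId: string): RelayEvent[] {
    // Return a copy so callers cannot mutate the stored timeline.
    return [...(this.timelines.get(sessionId) ?? [])];
  }
}

// Combine several sinks (a store, a WebSocket broadcaster) behind one.
function fanOut(...sinks: EventSink[]): EventSink {
  return { emit: (event) => sinks.forEach((sink) => sink.emit(event)) };
}

// Engine-side code depends on EventSink only, never on a concrete store.
function recordFreeze(sink: EventSink, sessionId: string): void {
  sink.emit({ sessionId, type: "freeze.started", at: Date.now() });
}
```

Because `recordFreeze` only knows about `EventSink`, swapping the in-memory store for a Redis-backed one would not touch engine code.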
Challenges We Ran Into
- Deciding what not to store. The git diff is re-derivable because the next agent reads the repo itself. So the packet had to carry only the parts that cannot be recovered: intent, decisions, and failure memory. Drawing that line took several rewrites.
- Headless agents that would not act. Run with `-p` or `exec`, real Claude and Codex would hang waiting for a stdin EOF, and would refuse to edit files because headless mode denies writes with no human to approve them. The agent would describe the correct fix and change nothing. Closing stdin and adding two permission flags (`--permission-mode acceptEdits`, `--sandbox workspace-write`) turned a frozen screen into an agent that actually edits the repo.
- Trusting evidence over claims. We repeatedly caught agents reporting success on red tests. That pushed the design toward an exit-code-only verifier and toward compressing facts instead of summaries.
- Regex instead of an AST, on purpose. The next agent reads the real code, so we only need to point at the surface, not parse it. A line-anchored regex skeleton was the right cost-for-precision trade, and resisting the heavier option was its own discipline.
- Provider neutrality as a constraint. Keeping the engine from importing Redis, an adapter, or the UI meant continually refusing convenient shortcuts.
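To make the regex-over-AST trade concrete, here is a minimal sketch of a line-anchored skeleton extractor. The pattern and the `skeleton` function are illustrative assumptions; the patterns Baton actually uses are not shown in this write-up.

```typescript
// Matches top-level TypeScript/JavaScript declarations only. The ^ anchor
// means indented (nested) declarations are skipped on purpose.
const DECLARATION =
  /^(?:export\s+(?:default\s+)?)?(?:async\s+)?(function|class|interface|type|enum|const)\s+([A-Za-z_$][\w$]*)/;

interface SkeletonEntry {
  line: number; // 1-based line number in the file
  kind: string;
  name: string;
}

// One pass over the lines, no parsing: enough to tell the next agent
// where the surface of the code is.
function skeleton(source: string): SkeletonEntry[] {
  const entries: SkeletonEntry[] = [];
  source.split(/\r?\n/).forEach((text, index) => {
    const match = DECLARATION.exec(text);
    if (match) {
      const [, kind = "", name = ""] = match;
      entries.push({ line: index + 1, kind, name });
    }
  });
  return entries;
}
```

A pattern like this misses things a parser would catch (generator functions, declarations split across lines), which is the accepted cost: the next agent reads the real file anyway.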
Accomplishments That We're Proud Of
- A handoff that crosses both a vendor boundary and a usage limit: Claude's session compressed into a portable packet, Codex resuming the same repository from it, and a passing test confirming the result.
- No added cost. Reusing local CLI auth means switching models is free, with no API meter and no surprise bill.
- Failure memory. The packet's pitfalls field is what makes the second agent start ahead of a cold session rather than from zero.
- A closed loop. Most handoff tools stop at "here is the context." Baton stops at a real exit code.
- A clean contract spine. Because Zod schemas are the single boundary, adding a new provider is just another adapter.
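As a rough picture of what "just another adapter" could mean, the sketch below models an adapter as a function from prompt to launch arguments. The `ProviderAdapter` and `LaunchSpec` shapes are hypothetical. The commands and flags are the ones named in this write-up; pairing `--permission-mode acceptEdits` with `claude -p` and `--sandbox workspace-write` with `codex exec` is our reading of it, not a copy of Baton's code.

```typescript
// Hypothetical contract; the real adapter interface is not shown in the write-up.
interface LaunchSpec {
  command: string;
  args: string[];
  closeStdin: boolean; // headless CLIs hung until stdin was closed
}

interface ProviderAdapter {
  readonly name: string;
  launch(prompt: string): LaunchSpec;
}

const claudeAdapter: ProviderAdapter = {
  name: "claude",
  launch: (prompt) => ({
    command: "claude",
    args: ["-p", prompt, "--permission-mode", "acceptEdits"],
    closeStdin: true,
  }),
};

const codexAdapter: ProviderAdapter = {
  name: "codex",
  launch: (prompt) => ({
    command: "codex",
    args: ["exec", "--sandbox", "workspace-write", prompt],
    closeStdin: true,
  }),
};

const adapters: ProviderAdapter[] = [claudeAdapter, codexAdapter];

// Pick any registered provider other than the one that just stalled.
function successor(stalled: string): ProviderAdapter | undefined {
  return adapters.find((adapter) => adapter.name !== stalled);
}
```

Under this shape, supporting another CLI means adding one more object to `adapters`; nothing that consumes `LaunchSpec` needs to change.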
What We Learned
- As models commoditize, the scarce resource is not a smarter agent but uninterrupted work across agents. The connective layer is the product.
- Self-reported progress is unreliable; executable evidence is not. Designing around that made the whole system more robust.
- The value of compression is in what you can safely discard. A good packet is small because the repository on disk is the source of truth.
- The unglamorous layer decides everything. Stdin handling, permission flags, and path resolution were the difference between a good idea and a working handoff.
What's Next for Baton
- Hardened real multi-CLI runs with authenticated Claude and Codex, streaming each step live instead of a single JSON blob.
- A context firewall: deterministic redaction of secrets (API keys, tokens, `.env` assignments, private keys) before any evidence is distilled, stored, or shown.
- More providers behind the same neutral adapter, including Gemini and local models, plus controlled multi-hop handoffs.
- RelayBench: measured comparisons of task completion with and without a clean handoff.
- A hosted, sandboxed demo so anyone can watch a handoff happen without installing anything.
- Session persistence across restarts, so a frozen session can be resumed later, on another machine, with a different model.
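The context firewall is not built yet, so the following is only a sketch of what deterministic redaction could look like. The rules, labels, and the `redact` function are all assumptions, and a real rule set would need to be far broader and audited.

```typescript
interface RedactionRule {
  label: string;
  pattern: RegExp;
  replacement: string; // may reference capture groups such as $1
}

// Illustrative rules only, applied in a fixed order so output is deterministic.
const RULES: RedactionRule[] = [
  {
    label: "private-key",
    pattern: /-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?-----END [A-Z ]*PRIVATE KEY-----/g,
    replacement: "[REDACTED:private-key]",
  },
  {
    // .env-style assignments whose name suggests a secret; the name is kept.
    label: "env",
    pattern: /^([A-Z0-9_]*(?:KEY|TOKEN|SECRET|PASSWORD)[A-Z0-9_]*)=.+$/gm,
    replacement: "$1=[REDACTED:env]",
  },
  {
    label: "bearer",
    pattern: /\b(Bearer\s+)[A-Za-z0-9._~+\/-]{16,}=*/g,
    replacement: "$1[REDACTED:bearer]",
  },
];

// Pure function of its input: the same evidence always redacts the same way.
function redact(text: string): string {
  return RULES.reduce((out, rule) => out.replace(rule.pattern, rule.replacement), text);
}
```

Running redaction before distillation would mean neither the model, the event store, nor the dashboard ever sees the raw values.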
Built With
- claude
- codex
- electron
- node.js
- react
- redis
- typescript
- vite