Walkie-Talkie -- Make Devin succeed at harder tasks

Multi-agent harness communication
SWE-bench Pro eval: 20% improvement
Demo project building frontend and backend at the same time

Walkie-Talkie (`wt`)

Walkie-Talkie is an orchestration backend that lets agent harnesses multiply and collaborate. In our evaluation, Devin + Opus 4.8 performed 20% better on the SWE-bench Pro benchmark when the wt CLI was installed to tackle hard tasks.

Inspiration

Agents are great solo but have no clean way to talk to each other or coordinate work.
We wanted a primitive where one agent can decompose a goal, dispatch orthogonal pieces to child agents, and integrate the result — a closed control loop.

What it does

Lets the prime harness create child harnesses and delegate decomposed tasks to them.
Lets the prime agent communicate with its child harnesses to coordinate on hard problems (frontend/backend separation, parallel backtesting, etc.).
Lets the prime agent launch multiple child harnesses in clean git worktree environments to test solutions in isolation, which improved Devin's performance on SWE-bench Pro.
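
The decompose, dispatch, integrate loop described above can be sketched in a few lines of Rust. This is only an illustration and not wt's actual code: `Task`, `Report`, `decompose`, `run_child`, and `dispatch_and_integrate` are hypothetical names, child harnesses are stood in for by threads, and the "decomposition" is a simple split on `;`.

```rust
use std::sync::mpsc;
use std::thread;

/// A unit of delegated work (hypothetical shape, not wt's wire format).
#[derive(Debug, Clone, PartialEq)]
pub struct Task {
    pub id: usize,
    pub description: String,
}

/// What a child reports back to the prime (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct Report {
    pub task_id: usize,
    pub summary: String,
}

/// Split a goal into independent pieces. Assumption: pieces are separated by ';'.
pub fn decompose(goal: &str) -> Vec<Task> {
    goal.split(';')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .enumerate()
        .map(|(id, s)| Task { id, description: s.to_string() })
        .collect()
}

/// Stand-in for a child harness. A real child would be an agent process
/// working in its own git worktree.
fn run_child(task: Task) -> Report {
    Report {
        task_id: task.id,
        summary: format!("done: {}", task.description),
    }
}

/// Dispatch every task to its own child thread, then integrate the reports
/// in task order.
pub fn dispatch_and_integrate(tasks: Vec<Task>) -> Vec<Report> {
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for task in tasks {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            // A send error only means the prime stopped listening.
            let _ = tx.send(run_child(task));
        }));
    }
    // Drop the prime's sender so the receive loop ends once all children finish.
    drop(tx);
    let mut reports: Vec<Report> = rx.iter().collect();
    for handle in handles {
        handle.join().expect("child thread panicked");
    }
    reports.sort_by_key(|r| r.task_id);
    reports
}
```

Calling `dispatch_and_integrate(decompose("build frontend; build backend"))` yields one report per piece, ordered by task id, regardless of which child finished first.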

How we built it

Rust workspace, single wt binary: wt-proto (wire/IPC types, no I/O), wt-core (identity, SQLite store, auth, transport, services), wt-daemon (accept loop, delivery worker, IPC, mDNS, harness supervisor), wt-cli (clap client).
Transport: iroh (QUIC + relays + DNS discovery), one Ed25519 identity per install used for both transport mTLS and token signing.
Persistence: one SQLite DB (WAL) with a combined outbox/inbox message log; receiver-side dedup via composite PK; delivery worker resumes after restart.
Orchestration: in-daemon message bus, per-child supervisor over Claude Code stream-json (kill_on_drop lifecycle), per-session worktree/new-folder workspaces.
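
To illustrate the persistence bullet, here is a minimal in-memory sketch of receiver-side dedup on a composite key combined with an outbox that retries unacknowledged messages after a restart. It is an assumption-laden stand-in: in wt the SQLite primary key does this work, whereas here a `HashSet` plays that role, and the `(sender, seq)` key layout and the `Inbox`/`Outbox` types are hypothetical.

```rust
use std::collections::{BTreeMap, HashSet};

/// Composite key: (sender id, per-sender sequence number). Hypothetical layout.
type MsgKey = (String, u64);

/// Receiver side: stores each message at most once.
#[derive(Default)]
pub struct Inbox {
    seen: HashSet<MsgKey>,
    pub delivered: Vec<String>,
}

impl Inbox {
    /// Returns true if the message was stored, false if it was a duplicate.
    pub fn receive(&mut self, sender: &str, seq: u64, body: &str) -> bool {
        // `insert` returns false when the key already exists, which mirrors
        // a primary-key conflict in a database table.
        if !self.seen.insert((sender.to_string(), seq)) {
            return false;
        }
        self.delivered.push(body.to_string());
        true
    }
}

/// Sender side: messages stay pending until acknowledged, so a restarted
/// delivery worker can simply resend whatever is still here.
#[derive(Default)]
pub struct Outbox {
    pending: BTreeMap<u64, String>,
    next_seq: u64,
}

impl Outbox {
    /// Queue a message and return its sequence number.
    pub fn enqueue(&mut self, body: &str) -> u64 {
        let seq = self.next_seq;
        self.pending.insert(seq, body.to_string());
        self.next_seq += 1;
        seq
    }

    /// Messages not yet acknowledged, in sequence order.
    pub fn pending(&self) -> Vec<(u64, String)> {
        self.pending.iter().map(|(s, b)| (*s, b.clone())).collect()
    }

    /// Mark a message as delivered so it is never resent.
    pub fn ack(&mut self, seq: u64) {
        self.pending.remove(&seq);
    }
}
```

The design point is that the sender may safely deliver a message more than once (at-least-once delivery), because the receiver's key check makes a repeated delivery a no-op.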

Challenges we ran into

Eval: we used Harbor + Devin for the eval. Running SWE-bench Pro consumed a tremendous number of tokens and took a long time, so we had to use Daytona to run the eval efficiently in parallel.
Testing: we had to debug a multi-agent harness system where there is no single point of failure to inspect, and we had to sort and render the multi-agent message traffic to confirm that agent-to-agent communication succeeded.

Accomplishments that we're proud of

Cracked SWE-bench Pro by 20%: we improved Devin's performance on SWE-bench Pro by 20%, evidence that wt improves the problem-solving capability of an agent harness.
Real cross-internet exchange: bidirectional messaging between a laptop behind residential NAT and a cloud sandbox, hole-punched direct (no relay), macOS↔Linux.
A single binary that is both daemon and CLI, with green unit + subprocess e2e tests and CI gates (build/test/clippy/fmt).
A genuinely harness-agnostic orchestration model with a written, closed-loop operating discipline.

What we learned

Multi-harness beats single harness on hard tasks

Built With

Updates

Xinyu Zhang started this project — Jun 20, 2026 07:57 PM EDT
