Whisperers

Inspiration

We build StudierAI, a voice tutor — and we kept hitting the same wall: long calls drift and get expensive. It's not just us. OpenAI's own docs warn that "instruction adherence can drift" as context grows and that "turns later in the session will be more expensive." Managed platforms paper over it with hard limits — Retell caps context around 32k, OpenAI documents blind truncation — so by minute 10 the agent has quietly forgotten what the caller said at minute 1.

The picture that stuck with us was a grandmother calling a customer-care line: in the first minute she explains that the gift is a watch for her grandson's graduation, that it must arrive before the 20th, and that it should be delivered to the neighbor at interno 3. Ten minutes later the agent asks her to repeat everything. Today every team fixes this by hand, rewriting its own context handling. We wanted to package the fix so anyone building a voice agent could just bolt it on.

What it does

Whisperer is a drop-in, model-agnostic memory layer that sits next to a voice agent. Run the same agent (same prompt, same voice) twice over one long call: base (no layer) vs Whisperer (layer on). The base forgets the minute-1 fact; Whisperer recalls it — and a judge proves it.

Three real pieces:

The distiller — every 4 turns a cheap model (gpt-4o-mini, structured output) reads the live transcript and writes a typed, append-only state ledger: identity, objective, facts, commitments. Every entry cites the transcript turn that proves it, and a deterministic reconcile step drops anything that doesn't point to a real caller turn → no hallucinated memory.
Compact re-grounding — each turn, instead of resending the whole conversation, Whisperer sends [compact ledger + the current question]. The agent answers from distilled state, so it remembers minute 1 without dragging ten minutes of context behind it.
The judge + harness — a binary judge reads a recorded call plus the seeded fact and returns remembers: yes / no with the turn citation. In batch (N = 10 per side) it produces the number.
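
A minimal sketch of the distiller's reconcile step described above, assuming a simple in-memory shape for turns and ledger entries. The names `Turn`, `LedgerEntry`, `reconcile`, and `append_entries` are hypothetical and chosen for illustration; they are not taken from the project's code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Turn:
    index: int      # position of the turn in the transcript
    speaker: str    # "caller" or "agent" (assumed labels)
    text: str


@dataclass(frozen=True)
class LedgerEntry:
    kind: str        # "identity", "objective", "fact" or "commitment"
    text: str
    cites_turn: int  # transcript turn offered as evidence


def reconcile(candidates: List[LedgerEntry], transcript: List[Turn]) -> List[LedgerEntry]:
    """Keep only entries whose cited turn exists and was spoken by the caller."""
    caller_turns = {t.index for t in transcript if t.speaker == "caller"}
    return [e for e in candidates if e.cites_turn in caller_turns]


def append_entries(ledger: List[LedgerEntry], accepted: List[LedgerEntry]) -> List[LedgerEntry]:
    """Append-only merge: existing entries are never rewritten, exact duplicates are skipped."""
    merged = list(ledger)
    seen = set(ledger)
    for entry in accepted:
        if entry not in seen:
            merged.append(entry)
            seen.add(entry)
    return merged
```

The check is plain set membership, so it behaves the same on every run regardless of what the distilling model proposed. An entry citing an agent turn or a turn that does not exist is dropped before it reaches the ledger.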

The measured result: base 0 / 10, Whisperer 10 / 10, every verdict backed by a citation — no vibes. On top: a split-screen demo and a live HUD that writes the ledger to screen by itself.
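
As a rough illustration of how a batch of binary verdicts becomes a score like the one above, here is a sketch under stated assumptions. The `Verdict` shape and the `tally` helper are hypothetical, and the rule that a verdict without a citation is rejected is our reading of "every verdict backed by a citation", not the harness's confirmed behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Verdict:
    call_id: str
    remembers: bool            # the judge's binary answer
    cited_turn: Optional[int]  # turn the judge points to as evidence


def tally(verdicts: List[Verdict]) -> str:
    """Return 'passed / total', refusing any verdict that lacks a citation."""
    for v in verdicts:
        if v.cited_turn is None:
            raise ValueError(f"verdict for {v.call_id} has no turn citation")
    passed = sum(1 for v in verdicts if v.remembers)
    return f"{passed} / {len(verdicts)}"
```

Running `tally` once per side (base, Whisperer) over the same seeded fact gives two directly comparable scores.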

How we built it

Three folders — server/ (the layer), harness/ (the judge), web/ (the dashboard) — coupled only through JSON contracts, so each could be built in parallel against fixtures. The dashboard runs mock-first with no backend at all.

The mechanism is the differentiator: it isn't "summarize and re-inject" (that's already in the OpenAI cookbook). It's a typed ledger with evidence — each fact cites the turn that proves it, which is what makes the HUD readable and the memory verifiable.
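
To make the difference from a free-text summary concrete, here is a small sketch of how a typed ledger with evidence could be rendered into the compact per-turn input. The `Entry` type, the function names, and the prompt wording are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    kind: str        # "identity", "objective", "fact" or "commitment"
    text: str
    cites_turn: int  # caller turn that proves the entry


def render_ledger(entries: List[Entry]) -> str:
    """One line per entry, each carrying its type and its evidence turn."""
    return "\n".join(f"- [{e.kind}] {e.text} (turn {e.cites_turn})" for e in entries)


def build_turn_input(entries: List[Entry], question: str) -> str:
    """Compact payload: the ledger plus the current question, with no raw history."""
    return (
        "Ledger of verified caller facts; answer directly from it:\n"
        f"{render_ledger(entries)}\n\n"
        f"Caller: {question}"
    )
```

Because every rendered line keeps its turn number, the same structure can feed both the agent and an on-screen view, and any entry can be checked against the transcript.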

Why the cost shape is real: the base re-pays its growing context every turn, so its cumulative input tokens grow quadratically, while Whisperer sends a bounded ledger of size $\le k$:

$$C_{\text{base}} \approx \sum_{t=1}^{T}\bigl(b + s\,t\bigr) = bT + s\frac{T(T+1)}{2} = O(T^2), \qquad C_{\text{whisperer}} \approx \sum_{t=1}^{T}\bigl(b + k\bigr) = (b+k)\,T = O(T)$$

where $b$ is the fixed system/tools baseline, $s$ the tokens added per turn, $k$ the ledger cap. Our recordings confirm the shape: the base's per-turn tokens_in climbs ($249 \to 876$) while Whisperer's stays nearly flat ($375 \to 495$).
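
The two sums can be checked numerically with a short sketch. The parameter values used in the usage note are invented for illustration and are not fitted to our recordings; the Whisperer figure treats the ledger as always at its cap $k$, so it is an upper bound.

```python
def cumulative_base(b: int, s: int, T: int) -> int:
    """Cumulative input tokens when turn t resends baseline b plus s*t tokens of history."""
    return sum(b + s * t for t in range(1, T + 1))


def cumulative_whisperer(b: int, k: int, T: int) -> int:
    """Cumulative input tokens when every turn sends baseline b plus a ledger of at most k."""
    return (b + k) * T


def crossover_turn(b: int, s: int, k: int, max_T: int = 10_000) -> int:
    """First call length T at which the base costs more than the layer, or -1 if none."""
    for T in range(1, max_T + 1):
        if cumulative_base(b, s, T) > cumulative_whisperer(b, k, T):
            return T
    return -1
```

With illustrative values `b=200`, `s=25`, `k=300`, the bounded layer costs more than the base over a 10-turn call and less over a 280-turn call. The same model therefore also reproduces the short-call cost inversion: below the crossover the fixed ledger overhead dominates, above it the quadratic term does.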

Stack: OpenAI Agents SDK (built on the openai-voice-agent-sdk-sample), structured outputs for the distiller and the judge, ElevenLabs for the real grandmother voice, and Codex as the primary builder — every feature kicked off with "read the spec, propose the integration plan," one closed task at a time. The Codex-signed commit trail is our proof of build.

Challenges we ran into

The cost inversion. Our first honest meter showed the wrong direction: on a short 28-turn call the item-capped base is cheap, so a naïve Whisperer that resent history looked more expensive. We discovered that the two goals are coupled and irreconcilable on a short call: reliable forgetting needs a small window (cheap base), while a dramatic cost gap needs the base to send a lot (large context). At ~640 tokens of dialogue you can't have both. We resolved it with honesty rather than a faked number (see below).
Making the base forget without cheating. We capped its context to reproduce the 32k hard cap that managed platforms ship and the blind truncation that OpenAI documents: a real condition, not an invented handicap.
Getting to 10/10. The first live pass scored 0/10 vs 5/10: the distiller captured everything, but the agent hedged at recall ("shall I check?"). Reinforcing the injection prompt to answer directly from the ledger took it to a real 10/10.
No hallucinated memory. Making the ledger trustworthy meant a deterministic reconcile that rejects any fact not anchored to a real caller turn.

Accomplishments that we're proud of

A number you can defend: base 0/10 vs Whisperer 10/10, every verdict carrying a turn citation, judged against the real workflow.
Verifiable, non-hallucinated memory — the typed-ledger-with-evidence design is the part we haven't seen elsewhere.
Intellectual honesty as a feature. The on-screen cost is labeled a projection, never "measured"; the magnitude (≈7.6×) comes from official Realtime rates and our StudierAI bill (€4.50 vs €0.13), while the direction is backed by a real 1.30× measurement we kept as Q&A evidence. Under-promising made us unattackable.
A demo that lands: real grandmother audio, a HUD writing itself, and the recall moment where the base flounders and Whisperer answers exactly.

What we learned

The measurement is the product. A binary judge with citations turns "it feels better" into a number a technical reviewer can't dismiss.
Forgetting and cost are coupled on short calls: a non-obvious result that forced us to separate what we measured (direction, recall) from what we project (magnitude). Naming that boundary out loud is a sign of maturity, not a weakness.
Structured outputs + determinism beat clever prompting for trust: the reconcile step, not the model, is what guarantees no hallucinated memory.
Tight folder contracts + mock-first let three of us build fully in parallel, with Codex driving each closed task.

What's next for Whisperers

The watchdog (designed, not yet shipped): instead of periodic re-grounding, detect when the agent is about to violate a committed fact and inject only that fact, just in time.
The long-call measured cost win: run a real 10–20 minute call where forgetting and the cost gap both hold, turning the projection into a measurement.
Count the distiller's own cost end-to-end (a cheap text model, but we want it on the ledger too) and broaden provider support — Vapi, Retell, ElevenLabs Agents — plus multi-language ledgers.
Ship it into StudierAI in production as customer zero.