ReCompress — Devpost submission

Paste each section into the matching Devpost field. This covers the whole project: both acts and all five distillation experiments. The tagline follows the project name; "Built With" and links are at the bottom.


Project name

ReCompress: rewrite-don't-delete context compression, distilled to 1.5B and extended to flat-context multi-turn

Tagline (Devpost "tagline" field)

A query-aware rewriting layer that extends compression into the regime deletion can't reach — distilled into a 1.5B model, then carried into multi-turn conversations to keep a 12-turn chat flat (184 tok) while a naive agent balloons to 1,482.


Inspiration

The Token Company's bear-2 compresses prompts by deleting low-value tokens — fast and lossless-by-design, but blind in two ways deletion structurally can't fix: it can't read your question, and it can't rewrite. We wanted to measure what those two abilities are worth — and prove you don't need a giant model to get them.

Compression isn't only a single-prompt problem, though. In a long conversation the context grows every turn, so the total tokens processed over n turns scale as O(n²). So we built the project in two acts. Act 1: a query-aware compressor distilled into a small offline model. Act 2: a multi-turn memory ("Re:Zero") that uses that compressor to keep context flat however long the conversation runs.
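
The quadratic growth mentioned above can be checked with a few lines of arithmetic. This is a minimal sketch, not our benchmark code: the per-turn size of 120 tokens and the function name are assumptions chosen for illustration, and the 300-token cap mirrors the budget described for Act 2.

```python
from typing import Optional


def cumulative_tokens(turns: int, per_turn: int, flat_budget: Optional[int] = None) -> int:
    """Total context tokens sent to the solver over a whole conversation."""
    total = 0
    for t in range(1, turns + 1):
        context = per_turn * t  # a growing history resends every earlier turn
        if flat_budget is not None:
            context = min(context, flat_budget)  # a flat memory caps the context
        total += context
    return total


# Hypothetical numbers: 12 turns, 120 tokens added per turn.
growing = cumulative_tokens(12, 120)                  # 120 * (1 + 2 + ... + 12)
flat = cumulative_tokens(12, 120, flat_budget=300)    # capped at 300 per turn
print(growing, flat)
```

The growing history sums an arithmetic series, which is where the n² comes from; the capped version grows linearly once the cap is reached.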

What it does

ReCompress is one system in two acts.

Act 1 — single-shot compression. Given a long context + a question, it drops the passages irrelevant to that question and rewrites the rest densely; a downstream LLM then answers. We distilled this behavior (from a DeepSeek teacher) into Qwen2.5-1.5B + LoRA so it runs offline and cheap.

Act 2 — Re:Zero multi-turn memory. A fixed ~300-token budget per turn — protected facts + a compressed checkpoint of older turns + the recent raw delta — so context stays flat instead of growing. The checkpoints are compressed by the Act 1 model, so the whole agent runs on our distilled compressor.
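
The three-part budget described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Re:Zero source: `FlatMemory`, `count_tokens`, and `toy_compress` are hypothetical names, the whitespace token count stands in for a real tokenizer, and `toy_compress` stands in for the Act 1 model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count as a crude stand-in for a real tokenizer.
    return len(text.split())


@dataclass
class FlatMemory:
    budget: int = 300
    protected: List[str] = field(default_factory=list)  # facts that are never compressed
    checkpoint: str = ""                                 # compressed summary of older turns
    recent: List[str] = field(default_factory=list)      # raw turns since the last checkpoint

    def build_context(self) -> str:
        parts = [*self.protected, self.checkpoint, *self.recent]
        return "\n".join(p for p in parts if p)

    def add_turn(self, turn: str, compress: Callable[[str], str]) -> None:
        self.recent.append(turn)
        if count_tokens(self.build_context()) > self.budget:
            # Fold everything except the newest turn into the checkpoint.
            older = self.recent[:-1]
            merged = "\n".join([self.checkpoint, *older]).strip()
            self.checkpoint = compress(merged)
            self.recent = self.recent[-1:]


def toy_compress(text: str) -> str:
    # Hypothetical compressor: keeps the first 20 words.
    return " ".join(text.split()[:20])


memory = FlatMemory(budget=100, protected=["user name is Sam"])
sizes = []
for i in range(10):
    memory.add_turn(" ".join(f"t{i}w{j}" for j in range(50)), toy_compress)
    sizes.append(count_tokens(memory.build_context()))
print(sizes)
```

In this toy run the context settles at a constant size after the first checkpoint. The sketch only stays under budget if the compressor actually shrinks its input enough, which is an assumption here rather than a guarantee.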

Headline results.

Act 1 — distilled 1.5B vs bear-2, same compression instruction (ours realizes ~8.5× fewer tokens: ~48 vs ~409 on HotpotQA), QA-F1, paired bootstrap 95% CI:

| Benchmark | ReCompress 1.5B | bear-2 | Δ vs bear | Significant? |
|---|---|---|---|---|
| HotpotQA | 0.704 | 0.452 | +0.252 (+56%) | ✅ yes |
| 2Wiki (never trained on) | 0.570 | 0.390 | +0.180 (+46%) | ✅ yes |
| MuSiQue | 0.297 | 0.186 | +0.111 | directional |
| SQuAD v2 | 0.593 | 0.471 | +0.123 | directional |
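
For readers unfamiliar with the two statistics named above, here is a minimal sketch of a token-level QA-F1 and a paired bootstrap confidence interval. It is a simplified illustration, not our evaluation harness: the F1 here only lowercases and splits on whitespace (standard QA scoring usually also strips punctuation and articles), and the function names, resample count, and example scores are assumptions.

```python
import random
from collections import Counter


def qa_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap_ci(ours, baseline, n_resamples=2000, alpha=0.05, seed=0):
    """Mean per-example difference with a percentile bootstrap interval."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(ours, baseline)]  # paired: same example, two systems
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n  # resample examples with replacement
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


# Made-up per-example scores, for illustration only.
mean, lo, hi = paired_bootstrap_ci([0.9, 0.8, 0.7, 0.9, 0.8], [0.5, 0.4, 0.5, 0.6, 0.3])
print(mean, lo, hi)
```

A difference is reported as significant when the interval excludes zero, and as directional when it does not.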

Act 2 — multi-turn HotpotQA, 6 turns, n=20:

| Strategy | Final F1 | Context tokens |
|---|---|---|
| Naive (growing history) | 0.485 | 846 |
| Re:Zero + DeepSeek API | 0.455 | 198 |
| Re:Zero + our distilled model | 0.501 | 174 |
| Re:Zero + bear | 0.472 | 257 |

In this run (n=20), Re:Zero powered by our distilled model leads on both axes: the best answer quality at the fewest tokens. Over 12 turns a naive agent grows to 1,482 context tokens while Re:Zero stays flat at ~184 (8.1× less, and the gap keeps widening with every added turn).

How we built it — and the five experiments behind the winning model

The Act 1 distilled model is the result of five distillation experiments, each a deliberate test. We name them so the trajectory is legible (not "v1…v5"):

| # | Experiment | What we changed | Outcome |
|---|---|---|---|
| 1 | Spark | first distill: 261 examples, LoRA r=16, 3 epochs | Wash vs bear (Δ=+0.06, CI includes 0): too little data |
| 2 | Bonfire | scaled hard: 2,500 examples, r=64, 6 epochs | Overfit: eval loss bottomed at epoch 2 then climbed every epoch |
| 3 | Hearth | the balance: 5,000 examples, r=32, dropout 0.1, weight-decay, early-stop | The winner: +0.252 F1 (+56%) on HotpotQA, generalizes to 2Wiki |
| 4 | Oracle | answer-grounded: best-of-4 candidates, kept only ones the solver answers right | Lost to Hearth (−0.05 F1): selecting by a frozen judge overfits the judge |
| 5 | Oracle-Lite | answer-grounded but greedy (no best-of-N), to test if #4's loss was selection noise | Also lost, proving it wasn't a best-of-N artifact; answer-grounding just doesn't beat imitation |
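
The difference between Oracle (best-of-4) and Oracle-Lite (greedy) comes down to how many candidates pass through the same answer-grounded filter. A rough sketch of that filter, assuming exact-match scoring; `answer_grounded_filter` and `toy_solver` are hypothetical names, and the toy solver stands in for the frozen solver LLM:

```python
from typing import Callable, List


def answer_grounded_filter(
    candidates: List[str],
    question: str,
    gold: str,
    solver: Callable[[str, str], str],
) -> List[str]:
    """Keep only candidate compressions from which the solver recovers the gold answer."""
    target = gold.strip().lower()
    return [c for c in candidates if solver(c, question).strip().lower() == target]


def toy_solver(context: str, question: str) -> str:
    # Hypothetical stand-in for the frozen solver: it "answers" only if the span survived.
    return "Paris" if "Paris" in context else "unknown"


question = "What is the capital of France?"
sampled = [
    "Paris is the capital of France.",
    "France is in Europe.",
    "The capital, Paris, hosts the Louvre.",
    "The Louvre is a museum.",
]

oracle_kept = answer_grounded_filter(sampled, question, "Paris", toy_solver)           # best-of-4
oracle_lite_kept = answer_grounded_filter(sampled[:1], question, "Paris", toy_solver)  # greedy, one candidate
print(len(oracle_kept), len(oracle_lite_kept))
```

Either way the training set is shaped by what one fixed solver happens to answer correctly, which is the property the two experiments were testing.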

Plus a sixth idea we designed, tested, and dropped: "Bear-Booster", which trains the small model to make bear's output better (optimize bear(model(text))). Our own data showed it is dominated: the standalone model beats model→bear on every benchmark, and a pre-processor in front of bear always costs more than bear alone. A clean negative result that sharpened the thesis: within a single pipeline, rewriting has to stand in for deletion, not be stacked in front of it.

Pipeline: DeepSeek teacher → 5,000 query-aware compression pairs → LoRA fine-tune Qwen2.5-1.5B (4-bit, Unsloth) on a Modal H100 → eval vs bear (TTC SDK) under the same compression instruction with bootstrap CIs across 4 benchmarks → wire the winner (Hearth) into Re:Zero as a pluggable backend → custom multi-turn benchmark. ~$10–15 total compute.

Challenges we ran into

  • Distillation failed twice before Hearth worked (Spark wash → Bonfire overfit). Fixing it took the full anti-overfitting playbook: more data, lower rank, dropout, weight-decay, early-stopping on best-eval.
  • A "smarter" idea (answer-grounded, Oracle/Oracle-Lite) lost — twice. Optimizing against downstream answer success overfit the frozen solver and hurt out-of-distribution generalization. Documented as a negative result.
  • The Modal + Unsloth + trl stack is version-brittle — ~7 distinct runtime failures before the first clean train (dependency conflicts, container data paths, the formatting_func contract, eval-time CUDA OOM, a packing incompatibility). All written up as reproducibility notes.
  • Integration: wiring the Act 1 Modal model into Act 2's synchronous checkpoint loop without breaking either codebase.
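
The "early-stopping on best-eval" step from the first bullet can be sketched in plain Python. This is a minimal illustration, not our training loop: the function name and the patience value are assumptions, and the loss curve is invented, shaped only to resemble the Bonfire pattern of bottoming at epoch 2 and then climbing.

```python
def best_epoch_with_early_stop(eval_losses, patience=2):
    """Return (best_epoch, stopped_at_epoch), both 1-indexed."""
    best_loss, best_epoch, bad_epochs = float("inf"), None, 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch  # stop; restore the best checkpoint
    return best_epoch, len(eval_losses)


# Invented eval-loss curve, for illustration only.
curve = [1.20, 0.95, 1.00, 1.08, 1.15, 1.22]
best, stopped = best_epoch_with_early_stop(curve)
print(best, stopped)
```

With a patience of 2, training on this curve would halt at epoch 4 and keep the epoch-2 weights instead of running all six epochs.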

Accomplishments we're proud of

  • A 1.5B model that recovers the query-aware regime bear cedes — beating bear with statistical significance while emitting ~8.5× fewer tokens, and transferring to a near-in-distribution benchmark it never trained on (2Wiki, +46%; directional-but-unproven on the dissimilar OOD sets). It complements deletion rather than replacing it: deletion stays best for fast/verbatim/reusable; rewriting adds the query-specific case.
  • We stress-tested our own headline before a judge could. The teacher and solver are both DeepSeek (a circularity a sharp reviewer attacks first), so we re-scored with an independent solver (Claude Sonnet). The gap is essentially unchanged: Δ vs bear = +0.288 (independent) vs +0.285 (in-family), and the CI excludes zero both ways. The result is not a same-family artifact.
  • A unified system: the same distilled compressor works as a single-shot compressor and as a multi-turn memory engine (the strongest backend of the three we tested, vs DeepSeek and bear).
  • Research-grade rigor + intellectual honesty in 24h: bootstrap CIs on every claim, a cross-solver audit, a mask-the-answer audit (measured against ourselves), 5 named experiments with a clear winner, three documented negative results, a conceptual finding (the "deletion ceiling"), a 13-figure visualization suite, and a custom multi-turn benchmark.
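
The mask-the-answer audit mentioned above can be sketched as: remove the gold answer span from the compressed context, re-score, and report the relative drop. A minimal illustration, assuming case-insensitive string masking; the function names, the placeholder token, and the example F1 values are hypothetical, not our measured numbers.

```python
import re


def mask_answer(compressed: str, gold: str, placeholder: str = "[MASKED]") -> str:
    """Replace every case-insensitive occurrence of the gold answer in the compression."""
    return re.sub(re.escape(gold), placeholder, compressed, flags=re.IGNORECASE)


def relative_drop(f1_before: float, f1_after: float) -> float:
    """Fraction of the original F1 lost once the answer span is masked."""
    return (f1_before - f1_after) / f1_before if f1_before else 0.0


masked = mask_answer("Born in Paris, France.", "paris")
drop = relative_drop(0.50, 0.25)  # illustrative values only
print(masked, drop)
```

A large drop means the compressor's advantage comes mostly from keeping the answer-bearing span, which is the span-selection reading of the result.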

What we learned

  • Query-aware rewriting beats blind *deletion* at far fewer tokens — most where there are distractors (multi-hop QA). On purely abstractive QA (MS MARCO) it ties bear — an honest boundary we report.
  • The win survives an independent judge (Claude Sonnet, +0.288) — it's not teacher↔solver affinity.
  • Much of the margin is span-selection, not reasoning — and we proved it on ourselves. Masking the gold answer from the compression drops our F1 by 65% (vs bear's 31%). So our edge is largely "query-aware compression keeps the answer-bearing span at a 3.5% budget where bear's deletion at 30% truncates it" — a real, useful property, stated precisely rather than oversold.
  • We measured our multi-turn overhead honestly and found that our own expensive component was useless. The "8.1× flatter context" figure covers only the solver-context axis; once the per-turn compression LLM calls are counted, the system actually costs more in total tokens at short horizons. Digging in, the LLM "Echidna" checkpoint trigger chose to checkpoint 98.3% of the time, so it was making no real decision. We replaced it with a free rule (2.6× cheaper, same F1) and swept conversation length: with the cheap trigger the system beats an uncached growing-history agent on total tokens from ~6 turns, widening monotonically to ~4× by 20 turns (6,886 vs 28,838). The LLM-trigger version was so expensive that it was overtaken by the naive agent around turn 11: it was counterproductive, not just wasteful. Against a cached agent we still don't win on raw tokens, and we say so.
  • You can't fix deletion by stacking it after rewriting (model→bear < model everywhere) — the "deletion ceiling."
  • Downstream-grounded distillation isn't free — selecting by a frozen judge overfits it (Oracle/Oracle-Lite both lost to imitation-based Hearth).
  • The same compressor composes — single-shot quality transfers to keeping conversations flat.
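
The total-token accounting behind the overhead finding can be sketched as a simple cost model: a flat-context agent pays a fixed context plus a compression call every turn, while an uncached growing-history agent resends everything. This is an illustration under assumptions, not our sweep: the function names are hypothetical, and the 120 tokens per turn and 300 tokens per compression call are placeholder values, not measurements (only the ~184-token flat context comes from the results above).

```python
def naive_total(turns: int, tokens_per_turn: int) -> int:
    """Uncached growing history: turn t resends all t turns so far."""
    return sum(tokens_per_turn * t for t in range(1, turns + 1))


def flat_total(turns: int, flat_context: int, compress_cost_per_turn: int) -> int:
    """Flat memory: fixed solver context plus one compression call per turn."""
    return turns * (flat_context + compress_cost_per_turn)


def crossover_turn(tokens_per_turn, flat_context, compress_cost_per_turn, max_turns=100):
    """First conversation length at which the flat agent is cheaper in total tokens."""
    for t in range(1, max_turns + 1):
        if flat_total(t, flat_context, compress_cost_per_turn) < naive_total(t, tokens_per_turn):
            return t
    return None


# Placeholder inputs: 120 tokens/turn, 184-token flat context, 300 tokens per compression call.
first_win = crossover_turn(120, 184, 300)
print(first_win, flat_total(8, 184, 300), naive_total(8, 120))
```

The model shows why short conversations favor the naive agent and long ones favor the flat memory: the overhead is linear in turns while the growing history is quadratic, so a cheaper trigger moves the crossover earlier.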

What's next

  • Direct comparison to LLMLingua-2 (closest prior work — they delete/classify tokens; we rewrite and distilled it into a generative 1.5B).
  • Scale teacher data + ratios to push MuSiQue/SQuAD into significance.
  • A live demo of Re:Zero holding a long conversation flat in real time.

"Built With"

python · qwen2.5-1.5b · lora · unsloth · modal · h100 · deepseek-api · the-token-company-sdk · huggingface-datasets · hotpotqa · 2wikimultihop · musique · squad · ms-marco · matplotlib

Links

One-line differentiator (keep ready for the pitch + Q&A)

"LLMLingua deletes tokens; we rewrite them — distilled into a 1.5B (after five experiments — Hearth won) that extends The Token Company's bear into the query-aware regime it cedes, measured head-to-head with CIs across 4 benchmarks, and powers a multi-turn memory that keeps a 12-turn chat flat at 184 tokens while a naive agent hits 1,482."
