Eat Tokens Wisely AI

Inspiration

LLM agents don't fail because the model is weak; they fail on context bandwidth. Every call carries bloated, repetitive input: tool outputs, search results, docs, logs, conversations. That input is expensive, slow, and often hurts answers. Summarizing it with another LLM just trades tokens for hallucination risk. We wanted compression that's provable, not a model guessing a shorter version.

What it does

A drop-in compression layer with two modes, picked automatically by input type:

Lossless structural codec (JSON tool output, logs): factors repeated subtrees and string values into shared definitions. The round trip is byte-exact (decode(encode(x)) == x), and the model reads the compact form natively: fewer tokens, zero information lost.
Lossy extractive selector (prose, docs, search, transcripts): a CPU keep-scorer picks the smallest set of original, verbatim spans under a token budget. No generated text, so nothing is fabricated.
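
As a rough illustration of the lossless mode, here is a minimal sketch that factors repeated string values into a shared definitions table and restores them exactly. It is not the project's codec: the "@" reference syntax, the defs/body layout, and the min_len threshold are assumptions made for this example, and it handles only repeated string values, not repeated subtrees.

```python
import json
from collections import Counter


def _walk_strings(node):
    """Yield every string value (not dict keys) in a JSON-like structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, list):
        for item in node:
            yield from _walk_strings(item)
    elif isinstance(node, dict):
        for value in node.values():
            yield from _walk_strings(value)


def encode(obj, min_len=4):
    """Replace repeated string values with '@<index>' references (assumed syntax)."""
    counts = Counter(_walk_strings(obj))
    defs = [s for s, n in counts.items() if n > 1 and len(s) >= min_len]
    index = {s: i for i, s in enumerate(defs)}

    def enc(node):
        if isinstance(node, str):
            if node in index:
                return f"@{index[node]}"
            # Escape literal strings that start with '@' so they never look like references.
            return "@" + node if node.startswith("@") else node
        if isinstance(node, list):
            return [enc(item) for item in node]
        if isinstance(node, dict):
            return {key: enc(value) for key, value in node.items()}
        return node  # numbers, booleans, None pass through unchanged

    return {"defs": defs, "body": enc(obj)}


def decode(packed):
    """Invert encode(): expand references and undo the '@' escaping."""
    defs = packed["defs"]

    def dec(node):
        if isinstance(node, str):
            if node.startswith("@@"):
                return node[1:]
            if node.startswith("@"):
                return defs[int(node[1:])]
            return node
        if isinstance(node, list):
            return [dec(item) for item in node]
        if isinstance(node, dict):
            return {key: dec(value) for key, value in node.items()}
        return node

    return dec(packed["body"])


# Made-up tool output with heavy repetition, plus one literal '@' string.
sample = {
    "owner": "@x",
    "results": [
        {"status": "completed", "region": "us-east-1", "note": "@home"}
        for _ in range(8)
    ],
}
packed = encode(sample)
original_bytes = json.dumps(sample, separators=(",", ":"))
packed_bytes = json.dumps(packed, separators=(",", ":"))
restored_bytes = json.dumps(decode(packed), separators=(",", ":"))
```

The comparison here is in characters of minified JSON, which is only a proxy for token counts; a real measurement would run both strings through the tokenizer of the target model.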

One line in front of any model call: context = compress(task=question, raw=tool_output_or_docs, budget=240)["compressed_text"].

How we built it

Compression is pure CPU: a scikit-learn logistic-regression keep-scorer, budgeted selection, and near-duplicate suppression. There is no LLM in the compression path, so the savings are real, not circular; it runs with the model API key unset.
Measurement, not vibes. A frozen reader (Claude Haiku, temperature 0) answers from the full context and from the compressed context; we score both against gold labels, never with an LLM judge. Every headline number has a paired-bootstrap confidence interval.
Proven on real inputs, including live MCP servers (Context7 docs, Perplexity search) over the standard protocol, plus Deepgram STT for the voice case.
FastAPI + a dependency-free dashboard where judges can edit any question and re-run live; nothing is hardcoded.
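
The selection pipeline described above (score, select under a budget, suppress near-duplicates) can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: overlap_score is a simple lexical stand-in for the trained logistic-regression keep-scorer, tokens are counted with a regex rather than tiktoken, and the names select_spans and dup_threshold are hypothetical.

```python
import re


def tokens(text):
    """Crude word tokenizer (assumption: a real system would use a model tokenizer)."""
    return re.findall(r"\w+", text.lower())


def overlap_score(query, span):
    """Stand-in keep-score: fraction of query words that appear in the span."""
    q, s = set(tokens(query)), set(tokens(span))
    return len(q & s) / len(q) if q else 0.0


def jaccard(a, b):
    """Word-set Jaccard similarity, used to detect near-duplicate spans."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def select_spans(query, spans, budget, dup_threshold=0.8):
    """Greedily keep the highest-scoring verbatim spans that fit the token budget."""
    ranked = sorted(
        range(len(spans)),
        key=lambda i: overlap_score(query, spans[i]),
        reverse=True,
    )
    kept, used = [], 0
    for i in ranked:
        cost = len(tokens(spans[i]))
        if used + cost > budget:
            continue  # span does not fit in the remaining budget
        if any(jaccard(spans[i], spans[j]) >= dup_threshold for j in kept):
            continue  # near-duplicate of a span we already kept
        kept.append(i)
        used += cost
    # Return spans unmodified and in their original document order.
    return [spans[i] for i in sorted(kept)]


spans = [
    "The Eiffel Tower is in Paris.",
    "The Eiffel Tower is in Paris.",
    "Bananas are rich in potassium.",
    "Paris is the capital of France.",
]
selected = select_spans("Where is the Eiffel Tower?", spans, budget=12)
```

Because the output is a subset of the input spans, copied unchanged, nothing in the compressed context can be text that was not in the original.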

Challenges we ran into

The honest failure modes were the hard part. Lossy compression gives wrong answers on exact-counting tasks, so we restricted the "same answer" claim to byte-exact (lossless) or agreement-verified cases. We caught ourselves inflating lossless ratios with pretty-printed baselines and switched to minified ones. And we made the compression strictly LLM-free to keep the savings non-circular: the scorer is logistic regression over lexical/structural features, not a language model.
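
To see why the choice of baseline matters, the sketch below compares the same compact output against a pretty-printed and a minified serialization of one payload. The payload, the savings helper, and the compact size are all made up for illustration, and character counts stand in for token counts.

```python
import json


def savings(compact_len, baseline_len):
    """Fraction of the baseline that the compact form removes."""
    return 1 - compact_len / baseline_len


# Made-up tool output; not data from the project.
payload = {
    "rows": [{"id": i, "status": "ok", "tags": ["a", "b"]} for i in range(20)]
}

pretty = json.dumps(payload, indent=2)                 # whitespace-heavy baseline
minified = json.dumps(payload, separators=(",", ":"))  # no optional whitespace

# Hypothetical compact encoding that is 20% smaller than the minified form.
compact_len = int(len(minified) * 0.8)

vs_pretty = savings(compact_len, len(pretty))
vs_minified = savings(compact_len, len(minified))
```

The gap between vs_pretty and vs_minified is whitespace that any serializer can remove for free, so only the figure against the minified baseline reflects what the codec itself contributes.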

Accomplishments that we're proud of

5.3× fewer tokens → 76% of full-context F1 on HotpotQA; random selection craters, proving selection does real work.
Lossless codec is byte-exact across 4,000 adversarial fuzz cases (0 failures); the model reads the compact form as accurately as full JSON.
Query-conditioning beats the query-agnostic paradigm by +0.29 F1, and the CI clears zero.
Generalizes unchanged across modalities: conversations (beats BM25), web search (~81%), library docs (~67%), and voice, as well as real MCP tool output, with ≈ $5k saved per 1M queries at frontier-model prices.

What we learned

We pre-registered our experiments and report two ideas that didn't work: reader-grounded labels don't beat human labels (68% of "supporting facts" are redundant, not load-bearing), and a learned variable-rate budget doesn't beat a fixed one (the headroom is reader stochasticity, not learnable structure). Honesty is the moat — every claim survives a judge re-running it.

What's next for Eat Tokens Wisely AI

A pip-installable package, an inline proxy that compresses MCP/tool responses transparently, and per-domain fine-tuning of the scorer.

Benchmarks

Reader = claude-haiku-4-5 (temp 0), scored with SQuAD EM/token-F1 against gold (no LLM judge), paired-bootstrap 95% CIs.
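
For readers unfamiliar with the metrics, here is a minimal sketch of SQuAD-style exact match and token-F1, following the normalization used by the standard SQuAD v1.1 evaluation script (lowercase, strip punctuation, drop articles, collapse whitespace). It is an illustration, not the project's scoring harness.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Both functions compare against a gold string only, which is what makes the score independent of any LLM judge.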

HotpotQA distractor (n=150): full-context F1 0.79 → 0.60 at 5.3× fewer tokens (76% retained); query-conditioning beats query-agnostic by +0.29 F1 (CI clears 0); in-domain, we roughly match BM25.
Cross-task (scorer unchanged): SQuAD ~102%, CoQA ~100% (beats BM25), 2WikiMultiHop ~57%, NarrativeQA ~59% (BM25 wins out-of-domain — reported honestly).
Lossless codec (12 tool-output bundles, n=60): 48% fewer tokens vs pretty JSON, reader EM 1.00 on full JSON and 1.00 on the compact form, byte-exact round-trip.
Real MCP output: Context7 docs ~67%, Perplexity search ~81% (extractive, answer preserved), each verified live.
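
A paired-bootstrap interval of the kind quoted above can be sketched as follows. This is one common percentile variant, written as an assumption about the general method rather than a copy of the project's procedure; the function name, the resample count, and the example scores are all made up.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean per-example difference between two systems."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping each pair together.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Made-up per-question F1 scores for two systems on the same six questions.
system_a = [0.9, 0.8, 1.0, 0.7, 0.6, 0.9]
system_b = [0.5, 0.6, 0.7, 0.4, 0.5, 0.6]
mean_diff, lower, upper = paired_bootstrap_ci(system_a, system_b)

# Retention as reported for HotpotQA: compressed F1 divided by full-context F1.
retained = 0.60 / 0.79
```

"CI clears 0" then means the lower bound of the interval for the difference is above zero. The retained figure is simply 0.60 / 0.79, which rounds to the 76% quoted for HotpotQA.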

References

LLMLingua-2: Pan et al., Findings of ACL 2024 (arXiv:2403.12968). The query-agnostic paradigm we benchmark against.
RECOMP: Xu, Shi & Choi, ICLR 2024 (arXiv:2310.04408). Extractive RAG compression.
FILCO: Wang et al., 2023 (arXiv:2311.08377). Learned context filtering (closest to our reader-grounded null).
BM25: Robertson & Zaragoza, 2009 (baseline).
MMR: Carbonell & Goldstein, 1998 (diversity selection).
Datasets: HotpotQA, SQuAD, CoQA, 2WikiMultiHopQA, NarrativeQA. Protocol: Model Context Protocol (Anthropic).

Built With

anthropic
claude
deepgram
fastapi
hugging-face
javascript
model-context-protocol
numpy
perplexity-api
python
scikit-learn
scipy
sqlite
tiktoken
uvicorn

Updates

Rajashekar Vennavelli started this project — Jun 21, 2026 01:37 PM EDT
