Inspiration
What it does
How we built it
Challenges we ran into
Accomplishments that we're proud of
What we learned
What's next for Compressgram
Two frustrations collided into one project.
First, AI course generators have a retention problem nobody is solving. The whole market races to generate courses faster — "courses in seconds," "11x faster content." But a course you forget in a week is a fast way to waste time. Speed became table stakes; learning got left behind. We kept coming back to the Feynman technique — learn by explaining simply and answering questions until the gaps close — a proven, retention-first method that's barely used in schools and almost absent from AI course tools. That became our product: an agent that builds courses designed to stick, grounded in the learner's own material.
Second, grounding courses in real source material (via RAG) means feeding large amounts of retrieved text to an LLM on every generation — which is slow and expensive, and at scale, that cost is the business. When we saw The Token Company's compression challenge, the two problems clicked together: the bloated, retrieval-heavy context our product produces is exactly what compression is built to shrink. One build could serve both.
🧠 What we learned
The biggest lesson was conceptual: token reduction and downstream quality are a pair, never a single number. Anyone can delete tokens: delete them all and you've "compressed" 100% and destroyed the output. The real bar, and the one The Token Company's challenge actually sets, is to reduce tokens while preserving the quality of what the model produces. That reframed our entire benchmark.
We also went deep on the prompt-compression literature and learned there's a clean taxonomy behind it:
- Selective-Context: the ancestor. Score each unit by informativeness and drop the least informative. Simple, but one-directional and prone to dropping things that matter.
- LLMLingua: adds a budget controller (different compression budgets for different parts of the prompt) and iterative, dependency-aware compression. Up to ~20× compression with little loss.
- LongLLMLingua: makes compression question-aware (keep what's relevant to the query) and reorders key content to fight the "lost in the middle" effect. Crucially, it showed compression can improve downstream performance, not just preserve it.
- LLMLingua-2: reframes compression as a token classification task (preserve/discard), trained on GPT-4-distilled labels. Fast and task-agnostic.
- SCOPE: a generative approach (rewrite/summarize rather than delete).
The mechanism that ties it together — and that we leaned on — is signal-to-noise: stripping redundant filler makes the tokens that matter a larger fraction of what the model sees, so a shorter prompt can be easier for the model to use, not harder.
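As a purely illustrative calculation (the numbers are made up, not measurements from this project): suppose a retrieved context has $N = 1000$ tokens, of which $R = 200$ are relevant to the query. If compression keeps every relevant token and trims the context to $N' = 400$ tokens, the relevant fraction rises:

$$\frac{R}{N} = \frac{200}{1000} = 0.20 \quad\longrightarrow\quad \frac{R}{N'} = \frac{200}{400} = 0.50$$

The prompt is 60% shorter, yet the share of tokens that matter is 2.5 times higher. The gain only holds while $R$ is preserved; once compression starts dropping relevant tokens, quality falls with it.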
The system has two halves. One teammate built the RAG course agent — retrieval, agent orchestration, and discoverability via ASI:One (so the agent is reachable with an @ mention). The other built the compression layer that sits between retrieval and generation. This is that layer:
- Our own compression algorithm. Instead of merely calling a compression API, we wrote our own query-aware extractive compressor, synthesized from the research above. For each piece of retrieved material it:
  - splits text into sentences,
  - scores each sentence by relevance to the course topic or the learner's current question (embedding similarity, the LongLLMLingua question-aware idea),
  - boosts sentences containing definitions, facts, entities, and numbers,
  - removes near-duplicate sentences,
  - keeps the highest-scoring sentences within a token budget (the keep/discard framing from LLMLingua-2),
  - moves high-value sentences toward the front (to counter "lost in the middle"), and
  - deletes without ever rewriting, so facts and numbers stay exact (a deliberate choice against SCOPE-style generative rewriting, which could alter a figure and corrupt course accuracy).
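Roughly, each sentence $s$ gets a score

$$\text{score}(s) = w_1 \cdot \text{rel}(s, q) + w_2 \cdot \text{info}(s) - w_3 \cdot \text{dup}(s)$$

where $\text{rel}(s, q)$ is the relevance of $s$ to the query $q$, $\text{info}(s)$ rewards definitions/entities/numbers, and $\text{dup}(s)$ penalizes redundancy. We keep top-scoring sentences until the token budget is met.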
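The scoring and selection steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: word overlap stands in for embedding similarity, word counts stand in for a real tokenizer, and the weights, regexes, and all function names (`overlap`, `info`, `compress`) are assumptions. Near-duplicates are handled here through the `dup` penalty rather than a separate removal pass.

```typescript
// Illustrative sketch only: weights, heuristics, and names are assumptions.
type Sentence = { text: string; index: number; score: number; tokens: number };

// Crude stand-ins for a tokenizer and an embedding model.
const words = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];
const unique = (xs: string[]): string[] => xs.filter((x, i) => xs.indexOf(x) === i);

// Jaccard word overlap in [0, 1]; a placeholder for embedding similarity.
function overlap(a: string, b: string): number {
  const A = unique(words(a));
  const B = unique(words(b));
  const shared = A.filter((w) => B.indexOf(w) !== -1).length;
  const union = A.length + B.length - shared;
  return union === 0 ? 0 : shared / union;
}

// info(s): small bonus for digits or definition-like phrasing.
function info(s: string): number {
  return (/\d/.test(s) ? 0.5 : 0) + (/\b(is|are|means|refers to)\b/i.test(s) ? 0.5 : 0);
}

function compress(text: string, query: string, budget: number,
                  w1 = 1.0, w2 = 0.3, w3 = 0.8): string {
  // Naive sentence split: break after ., ! or ? when followed by whitespace.
  const raw = text.replace(/([.!?])\s+/g, "$1\n").split("\n")
    .map((t) => t.trim()).filter((t) => t.length > 0);

  const scored: Sentence[] = raw.map((t, index) => {
    // dup(s): highest overlap with any earlier sentence.
    const dup = Math.max(0, ...raw.slice(0, index).map((prev) => overlap(t, prev)));
    const score = w1 * overlap(t, query) + w2 * info(t) - w3 * dup;
    return { text: t, index, score, tokens: words(t).length };
  });

  // Greedy selection: best score first, skipping anything that would exceed the budget.
  const kept: Sentence[] = [];
  let used = 0;
  for (const s of [...scored].sort((a, b) => b.score - a.score)) {
    if (used + s.tokens > budget) continue;
    kept.push(s);
    used += s.tokens;
  }
  // kept is ordered by score, so high-value sentences lead the output.
  // Sentences are copied verbatim and never rewritten.
  return kept.map((s) => s.text).join(" ");
}
```

Because the output is assembled only from original sentences, any number or definition that survives is guaranteed to match the source text exactly.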
Domain-aware split logic. Our layer knows it's compressing course material. It decides what is safe to compress (retrieved chunks, prior Q&A, carried context) versus what must stay exact (system prompt, course schema, the learner's current question). A generic compression API has no idea about these boundaries — send it the whole prompt and it may corrupt what should stay precise. Our layer protects them by design.
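One way to picture the split is a prompt made of tagged parts, where only some tags are ever handed to the compressor. The sketch below is hypothetical: the part names mirror the categories in the paragraph above, but the types and the `compressPrompt` function are assumptions made for illustration.

```typescript
// Hypothetical types: the part names mirror the prose, the shapes are assumptions.
type PartKind = "system" | "schema" | "question" | "retrieved" | "priorQA" | "carried";
type PromptPart = { kind: PartKind; text: string };

// Parts that must reach the model exactly as written.
const PROTECTED: PartKind[] = ["system", "schema", "question"];

// Compress only the parts that are safe to shrink; pass protected parts through untouched.
function compressPrompt(
  parts: PromptPart[],
  compress: (text: string) => string
): PromptPart[] {
  return parts.map((p) =>
    PROTECTED.indexOf(p.kind) !== -1 ? p : { kind: p.kind, text: compress(p.text) }
  );
}
```

Keeping the protected list in one place makes the boundary explicit and easy to surface elsewhere, for example in a dashboard panel.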
A three-mode framework. Everything runs behind one interface with three swappable modes: none (baseline), local (our own algorithm), and token-company (the commercial API). This means we are not locked to any vendor — we can run entirely on our own compressor with zero external calls — and it lets us benchmark all three head-to-head.
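A minimal sketch of what such a mode switch could look like. The three mode names come from the text; the `Compressor` type and `makeCompressor` function are hypothetical, and the commercial API is represented only by a caller-supplied function because its real client interface is not described here.

```typescript
// Mode names come from the text; everything else is a hypothetical sketch.
type Mode = "none" | "local" | "token-company";
type Compressor = (context: string) => Promise<string>;

function makeCompressor(mode: Mode, local: Compressor, remote: Compressor): Compressor {
  switch (mode) {
    case "none":
      // Baseline: pass the context through untouched.
      return async (context) => context;
    case "local":
      // In-house algorithm, no external calls.
      return local;
    case "token-company":
      // Wrapper around the commercial API, supplied by the caller.
      return remote;
  }
}
```

Since every mode has the same signature, a benchmark can loop over the three mode names and run the same generation pipeline for each.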
Resilience. The layer never throws and never blocks generation. If compression fails (API slow/down), it falls back to the original full context and flags it; if retrieval returns nothing, it returns cleanly without a wasted call. The product never breaks because of a compression hiccup.
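The fallback behavior might be wrapped along these lines. This is a sketch under assumptions: the result shape, the `safeCompress` and `withTimeout` names, and the 2000 ms default timeout are all invented for illustration.

```typescript
// Hypothetical wrapper: the result shape and the timeout default are assumptions.
type CompressResult = { context: string; fallback: boolean; skipped: boolean };

// Reject if the promise does not settle within ms milliseconds.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("compression timed out")), ms);
    p.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); }
    );
  });
}

async function safeCompress(
  context: string,
  compress: (context: string) => Promise<string>,
  timeoutMs = 2000
): Promise<CompressResult> {
  // Nothing retrieved: return immediately without spending a compression call.
  if (context.trim().length === 0) {
    return { context, fallback: false, skipped: true };
  }
  try {
    const compressed = await withTimeout(compress(context), timeoutMs);
    return { context: compressed, fallback: false, skipped: false };
  } catch {
    // Any failure (error or timeout) falls back to the full context and flags it.
    return { context, fallback: true, skipped: false };
  }
}
```

The caller always gets a usable context back, and the `fallback` flag lets the UI or logs show when compression was bypassed.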
The benchmark harness. To prove "quality held," we run paired generations of the same course — once with compressed context, once without — at temperature 0 so compression is the only variable. A blind LLM-as-judge then scores both outputs on accuracy, coverage, and question quality without knowing which is which. We sweep the compression aggressiveness to find the point where quality starts to drop — the safe operating point.
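Two pieces of that harness lend themselves to a short sketch: hiding which output is which from the judge, and picking the safe operating point from a sweep. Everything below is hypothetical, including the field names, the score scale, and the tolerance; the judge call itself is left out.

```typescript
// Hypothetical shapes: field names, score scale, and tolerance are assumptions.
type SweepPoint = {
  keepRatio: number;       // fraction of context tokens kept (1 = no compression)
  baselineScore: number;   // judge score for the uncompressed run
  compressedScore: number; // judge score for the compressed run
};

// Randomize presentation order so the judge cannot infer which output used compression.
function blindPair(baseline: string, compressed: string, coin: () => number = Math.random) {
  const compressedFirst = coin() < 0.5;
  return {
    first: compressedFirst ? compressed : baseline,
    second: compressedFirst ? baseline : compressed,
    compressedFirst, // kept by the harness, never shown to the judge
  };
}

// Walk from least to most aggressive compression and return the last setting
// whose quality drop stays within the tolerance.
function safeOperatingPoint(points: SweepPoint[], tolerance: number): SweepPoint | undefined {
  const ordered = [...points].sort((a, b) => b.keepRatio - a.keepRatio);
  let safe: SweepPoint | undefined;
  for (const p of ordered) {
    if (p.baselineScore - p.compressedScore > tolerance) break;
    safe = p;
  }
  return safe;
}
```

Stopping at the first setting that breaks the tolerance keeps the result conservative: a lucky score at a more aggressive setting does not override an earlier quality drop.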
A live telemetry dashboard. It shows tokens before/after, % saved paired with quality-held, savings broken down by source, the protected-content panel, and the fallback state, making the otherwise-invisible compression visible in real time.
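The numbers behind such a dashboard reduce to a small summary computation. The sketch below assumes hypothetical field names and a one-decimal rounding choice; it is not the project's real data model.

```typescript
// Hypothetical data model: field names and rounding are assumptions.
type SourceStat = { source: string; tokensBefore: number; tokensAfter: number };

// Percent saved, rounded to one decimal place; 0 when there was nothing to compress.
const percentSaved = (before: number, after: number): number =>
  before === 0 ? 0 : Math.round(((before - after) / before) * 1000) / 10;

function summarize(stats: SourceStat[], qualityHeld: boolean, fallback: boolean) {
  const tokensBefore = stats.reduce((sum, s) => sum + s.tokensBefore, 0);
  const tokensAfter = stats.reduce((sum, s) => sum + s.tokensAfter, 0);
  return {
    tokensBefore,
    tokensAfter,
    percentSaved: percentSaved(tokensBefore, tokensAfter),
    qualityHeld, // reported next to the savings, never on its own
    fallback,
    bySource: stats.map((s) => ({
      source: s.source,
      percentSaved: percentSaved(s.tokensBefore, s.tokensAfter),
    })),
  };
}
```

Returning `percentSaved` and `qualityHeld` in the same object reflects the lesson above: the savings figure is only meaningful alongside the quality result.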
Built With
- nextjs
- token
- typescript