SuperCompress

AI wastes energy by reading junk. SuperCompress fixes that.

Every agent workflow sends massive context to the GPU: old logs, irrelevant RAG chunks, boilerplate files, stale tool calls, and filler conversation turns. The model pays for every token in API cost, GPU time, KV-cache memory, electricity, cooling, and carbon.

SuperCompress removes the context that does not matter before it ever reaches the GPU.

Not by blind truncation.
Not by summarizing everything.
By learning what is safe to forget.


Why This Matters

AI scaling is becoming an energy problem.

Every unnecessary token sent to a model consumes compute. At small scale, that looks like a few wasted cents. At agent scale, it becomes millions of useless tokens processed, stored, cooled, and billed.

SuperCompress turns context compression into an environmental primitive:

Do not make the GPU read what the answer does not need.

Every token removed is:

  • Less GPU prefill work
  • Less KV-cache memory
  • Less electricity
  • Less cooling demand
  • Less CO₂ emitted
  • Lower inference cost

The goal is simple:

Make AI cheaper, faster, and cleaner without making it dumber.


The Core Idea

Most compression methods optimize for how much text they delete.

SuperCompress optimizes for how much value it preserves.

A normal truncation method might delete the exact line that answers the question just because it appears in the middle of the context.

SuperCompress instead scores context against the active query and keeps what matters.

Long context + user question
        ↓
Tokenize context
        ↓
Score each token with a tiny CPU model
        ↓
Remove low-value lines
        ↓
Preserve answer-critical entities
        ↓
Send compressed prompt to the LLM
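
The flow above can be sketched end to end in a few lines. This is a minimal illustration, not the shipped implementation: the learned scorer is replaced by a simple query-overlap score, scoring happens per line rather than per token, and the rule that identifier-like query terms (those containing an underscore) count as answer-critical entities is an assumption made for the example. The names `tokenize` and `compress` are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase word-like tokens; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def compress(context, query, keep_ratio=0.35):
    lines = context.splitlines()
    query_terms = set(tokenize(query))
    # Assumption: identifier-like query terms are the answer-critical entities.
    entities = {t for t in query_terms if "_" in t}

    def score(line):
        # Stand-in for the learned scorer: share of tokens that match the query.
        tokens = tokenize(line)
        return sum(t in query_terms for t in tokens) / len(tokens) if tokens else 0.0

    budget = max(1, round(len(lines) * keep_ratio))
    ranked = sorted(range(len(lines)), key=lambda i: score(lines[i]), reverse=True)
    keep = set(ranked[:budget])
    # Protect any line that mentions an entity, even if it missed the budget.
    keep |= {i for i, line in enumerate(lines) if entities & set(tokenize(line))}
    # Emit the surviving lines in their original order.
    return "\n".join(lines[i] for i in sorted(keep))

context = "\n".join([
    "INFO boot ok",
    "DEBUG cache warm",
    "def fetch_user(uid): return db.get(uid)",
    "INFO heartbeat",
    "DEBUG gc pause",
    "INFO heartbeat",
])
query = "What does fetch_user return?"
compressed = compress(context, query)
```

With six input lines and a 0.35 keep ratio, the budget is two lines, and the line defining `fetch_user` is among them.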

Environmental Impact

SuperCompress estimates avoided CO₂ using the idea that fewer tokens create fewer KV-cache slots and less GPU work.

$$ \text{CO}_2 \text{ avoided} \approx \Delta tokens \cdot \alpha_{\text{KV}} \cdot \frac{1}{\text{tokens/GPU-s}} \cdot \frac{W_{\text{GPU}}}{3600} \cdot \frac{\text{kg CO}_2}{\text{kWh}} $$

Where:

$$ \alpha_{\text{KV}} \approx 0.55 $$

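This represents the share of prefill compute attributed to context and KV-cache work. For the units to balance, GPU power is expressed in kW (150 W is 0.15 kW), so the result comes out in kg CO₂.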

At scale, the impact compounds.

For example:

$$ \Delta tokens \approx 800\text{M} $$

$$ \Delta E \approx 29\text{ kWh} $$

$$ \Delta \text{CO}_2 \approx 12\text{ kg} $$

These are illustrative estimates, not live datacenter telemetry. The assumptions are documented in the Environment Guide.

SuperCompress also estimates water savings using:

$$ W_{\text{water,mL}} \approx E_{\text{kWh}} \times 1.8 \times 10^3 $$

The live demo labels all impact numbers clearly as estimates.

No greenwashing.
No fake precision.
Just honest math.


What We Learned

1. Compression budget is not compression intelligence

FIFO, truncation, and summarization can all hit around 65% KV savings at a 35% keep budget.

But they often delete the wrong things.

On the seeded benchmark scenarios, those baselines fall to around 25% oracle recall.

SuperCompress reaches 100% oracle recall at the same budget because it scores context against the actual question.

The real metric is not:

How much did you delete?

It is:

Did you delete the right things?
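
The two metrics can be made concrete with a small sketch. The definitions here are assumptions, since the text does not spell them out: KV savings is taken as the fraction of tokens removed, and oracle recall as the fraction of answer-critical ("oracle") lines that survive compression. The function names are hypothetical.

```python
def kv_savings(total_tokens, kept_tokens):
    """Fraction of tokens removed from the context."""
    return 1.0 - kept_tokens / total_tokens

def oracle_recall(kept_lines, oracle_lines):
    """Fraction of answer-critical lines that are still present."""
    oracle = set(oracle_lines)
    if not oracle:
        return 1.0  # nothing critical to lose
    return len(oracle & set(kept_lines)) / len(oracle)

# Toy case: 20 lines, the answer lives on lines 9 and 10.
oracle = {9, 10}
fifo_kept = list(range(13, 20))              # keep only the 7 most recent lines
query_aware_kept = [0, 3, 9, 10, 15, 18, 19]  # same budget, different choice

same_savings = kv_savings(20, 7)             # identical for both policies
fifo_recall = oracle_recall(fifo_kept, oracle)
query_aware_recall = oracle_recall(query_aware_kept, oracle)
```

Both policies keep 7 of 20 lines, so the savings figure is identical, but only the second one retains the lines that carry the answer.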


2. Small models can gate big models

The EvictionPolicyNetwork is tiny: roughly 5K parameters, CPU-only, and sub-millisecond on laptop hardware.

It does not generate text.

It decides what text the big model should not have to read.

That tiny decision can save far more GPU work than it costs.
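
To get a feel for what "roughly 5K parameters" means, here is a parameter count for a small fully connected network. The layer sizes are hypothetical and chosen only because they land near 5K with a 9-dimensional input; the actual EvictionPolicyNetwork architecture is not specified here.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical shape: 9 input features, two hidden layers of 64, one score out.
assumed_layers = [9, 64, 64, 1]
param_count = mlp_param_count(assumed_layers)
```

A network of this size needs only a few thousand multiply-adds per token, which is why it can plausibly run on a CPU ahead of a much larger model.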


3. Honest environmental claims matter

The demo shows:

  • Tokens saved
  • KV savings
  • Electricity saved
  • Water saved
  • CO₂ avoided

But every number is presented as an estimate.

The assumptions include:

| Assumption | Value |
| --- | --- |
| GPU power | 150 W |
| Throughput | 2,500 tokens/GPU-second |
| Grid carbon intensity | 0.417 kg CO₂/kWh |
| KV share of prefill | 55% |
| Cooling water estimate | 1.8 L/kWh |
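
Plugging the assumptions above into the estimate can be sketched as follows. This is an illustrative calculator, not the project's own code: the names `Assumptions` and `estimate_impact` are hypothetical, and GPU power is converted from watts to kWh explicitly (watt-seconds divided by 3,600,000) so the units line up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Assumptions:
    gpu_power_w: float = 150.0
    tokens_per_gpu_s: float = 2500.0
    grid_kg_co2_per_kwh: float = 0.417
    kv_share: float = 0.55
    water_l_per_kwh: float = 1.8

def estimate_impact(tokens_saved, a=Assumptions()):
    """Rough energy, CO2, and water estimates for a number of removed tokens."""
    gpu_seconds = tokens_saved * a.kv_share / a.tokens_per_gpu_s
    energy_kwh = gpu_seconds * a.gpu_power_w / 3_600_000  # W*s -> kWh
    return {
        "energy_kwh": energy_kwh,
        "co2_kg": energy_kwh * a.grid_kg_co2_per_kwh,
        "water_ml": energy_kwh * a.water_l_per_kwh * 1000,  # L -> mL
    }

impact = estimate_impact(1_000_000)
```

For one million removed tokens, these inputs give about 0.0092 kWh, about 0.0038 kg CO₂, and about 16.5 mL of water. Changing any single assumption scales the result linearly, which is why the outputs are best read as estimates.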

The point is not to pretend we have live datacenter telemetry.

The point is to make the cost of wasted context visible.


How It Works

Each token becomes a compact 9-dimensional feature vector.

| Index | Signal |
| --- | --- |
| 0–1 | Attention mass and layer attention mean |
| 2 | Recency |
| 3 | Question entity match |
| 4–8 | Semantic type: code, comment, chat, boilerplate, or other |
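
One way to assemble such a vector is sketched below. The layout follows the index table, but the details are assumptions: the semantic type is encoded one-hot across indices 4 to 8, the entity match is a 0/1 flag, and unknown types fall back to "other". The function and argument names are hypothetical.

```python
SEMANTIC_TYPES = ("code", "comment", "chat", "boilerplate", "other")

def token_features(attention_mass, layer_attention_mean, recency,
                   entity_match, semantic_type):
    """Build a 9-dimensional feature vector for a single token."""
    if semantic_type not in SEMANTIC_TYPES:
        semantic_type = "other"  # assumption: unknown types map to "other"
    one_hot = [1.0 if semantic_type == t else 0.0 for t in SEMANTIC_TYPES]
    return [
        float(attention_mass),          # index 0
        float(layer_attention_mean),    # index 1
        float(recency),                 # index 2
        1.0 if entity_match else 0.0,   # index 3
        *one_hot,                       # indices 4-8
    ]

features = token_features(0.12, 0.08, 0.9, True, "code")
```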

The model scores each token, groups by line, and evicts the lowest-value lines while protecting answer-critical entities.
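
The grouping and eviction step might look like the sketch below. It assumes a line's value is the mean of its token scores and that protected lines are supplied as a set of line numbers. Neither detail is stated in the text, and `evict_lines` is a hypothetical name.

```python
from collections import defaultdict

def evict_lines(token_scores, token_lines, protected_lines, keep_ratio=0.35):
    """Return the sorted line numbers to keep.

    token_scores[i] is the score of token i; token_lines[i] is its line number.
    """
    per_line = defaultdict(list)
    for score, line_no in zip(token_scores, token_lines):
        per_line[line_no].append(score)

    # Assumption: a line is worth the mean of its token scores.
    line_value = {ln: sum(v) / len(v) for ln, v in per_line.items()}

    budget = max(1, round(len(line_value) * keep_ratio))
    ranked = sorted(line_value, key=line_value.get, reverse=True)

    # Protected lines survive even when they fall outside the budget.
    keep = set(ranked[:budget]) | (set(protected_lines) & set(line_value))
    return sorted(keep)

kept = evict_lines(
    token_scores=[0.1, 0.2, 0.9, 0.8, 0.05, 0.3, 0.3],
    token_lines=[0, 0, 1, 1, 2, 3, 3],
    protected_lines={2},
)
```

Here line 1 wins on score, and line 2 is kept only because it is protected, which pushes the kept set slightly over the nominal budget.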


Benchmarks

Eight seeded scenarios were tested at a budget ratio of 0.35.

| Policy | KV Savings | Oracle Recall |
| --- | --- | --- |
| FIFO | ~65% | ~25% |
| Truncation | ~65% | ~25% |
| Summarization | ~65% | ~25% |
| H2O | ~65% | ~90% |
| SuperCompress | ~65% | 100% |

Same savings. Very different answer quality.

That is the whole story.


Product

SuperCompress ships as more than an algorithm.

It includes:

  • Python library
  • Hosted API
  • Vercel deployment
  • Firebase auth
  • Blob-backed API key storage
  • Live demo
  • Mintlify docs
  • Usage tracking
  • Environmental impact estimates
  • 65 automated tests

Stack

| Layer | Technology |
| --- | --- |
| Policy + training | Python 3.10+, PyTorch, ~5K-parameter MLP |
| Public API | Python compress_context() |
| Production API | Vercel serverless routes |
| Key storage | @vercel/blob |
| Browser engine | compress-engine.js + model.json |
| Auth | Firebase Admin |
| Site | Static web app on Vercel |
| Docs | Mintlify |

Usage

Install

pip install git+https://github.com/arjunkshah/supercompress.git

Hosted API

curl -X POST https://trysupercompress.vercel.app/api/v1/compress \
  -H "X-API-Key: sc_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "context": "...",
    "query": "What does fetch_user return?"
  }'
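
The same request can be built from Python with only the standard library. This sketch mirrors the curl call above (same URL, headers, and JSON fields) and makes no assumption about the response format. The helper name `build_compress_request` is hypothetical, and the key shown is a placeholder.

```python
import json
import urllib.request

API_URL = "https://trysupercompress.vercel.app/api/v1/compress"

def build_compress_request(api_key, context, query):
    """Prepare the POST request; sending it is left to the caller."""
    body = json.dumps({"context": context, "query": query}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

request = build_compress_request(
    "sc_live_placeholder",  # placeholder, not a real key
    "...",
    "What does fetch_user return?",
)

# To send it:
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```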

Challenges

1. Fixed savings looked fake

A fixed 0.35 keep budget always produced roughly 65% savings, which made the demo feel hardcoded.

We fixed this with adaptive learned eviction. Savings now vary based on the actual input and query.

Examples:

  • Around 74% savings on agent logs
  • Around 96% savings on long filler with a small answer core
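
The difference between a fixed budget and an input-dependent one can be shown with a toy sketch. It assumes one simple form of adaptivity, keeping every line whose score clears a threshold, which is not necessarily how the learned policy decides. The scores are made up, and savings are counted in lines rather than tokens.

```python
def adaptive_keep(line_scores, threshold=0.5):
    """Keep every line scoring at or above the threshold (at least one line)."""
    keep = [i for i, s in enumerate(line_scores) if s >= threshold]
    if not keep and line_scores:
        # Never return an empty context: fall back to the single best line.
        keep = [max(range(len(line_scores)), key=lambda i: line_scores[i])]
    return keep

def line_savings(line_scores, keep):
    """Fraction of lines removed."""
    return 1.0 - len(keep) / len(line_scores)

mixed_log = [0.9, 0.2, 0.7, 0.1, 0.6, 0.3, 0.1, 0.2]  # several useful lines
mostly_filler = [0.05] * 19 + [0.95]                   # one useful line

mixed_savings = line_savings(mixed_log, adaptive_keep(mixed_log))
filler_savings = line_savings(mostly_filler, adaptive_keep(mostly_filler))
```

The same rule removes 62.5% of the first input and 95% of the second, so the savings figure now reflects the content instead of a preset ratio.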

2. Truncation failed where agents need memory most

Head-and-tail truncation often loses answers buried in the middle of context.

We built a failure-case demo showing truncation achieving high savings but near-zero answer quality.

SuperCompress keeps the critical line.
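
The failure mode is easy to reproduce on toy data. The sketch below implements a plain head-and-tail truncation (the split between head and tail is an assumption) and runs it on a made-up 20-line context whose only useful line sits in the middle.

```python
def head_tail_truncate(lines, keep_ratio=0.35):
    """Keep the first and last lines of the context, dropping the middle."""
    budget = max(2, round(len(lines) * keep_ratio))
    head = (budget + 1) // 2
    tail = budget - head
    return lines[:head] + lines[len(lines) - tail:]

answer_line = "fetch_user returns a User dict"  # made-up answer for the demo
lines = [f"filler line {i}" for i in range(20)]
lines[10] = answer_line

truncated = head_tail_truncate(lines)
answer_survived = answer_line in truncated
```

Truncation keeps 7 of 20 lines (65% savings), yet the answer line is gone, which is the high-savings, low-quality outcome described above.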


3. Production infra was part of the product

The first version had broken backend wiring, unreliable Blob overwrites, and incorrect API imports.

We moved to:

  • Vercel-native serverless routes
  • Same-origin /api/*
  • Versioned Blob snapshots
  • One deploy
  • One domain

Infra was not separate from the product.

It was the product becoming real.


Why SuperCompress Exists

We are not asking models to think less.

We are asking them to read less junk.

SuperCompress gives developers a way to reduce cost, latency, and environmental impact while preserving the answer.

Cleaner prompts.
Lower GPU load.
Less wasted energy.
Same answer.

