Pied Piper
Inspiration
Tokens are expensive, and most of them are wasted. Text is full of filler; video is packed with static frames and repeated backgrounds. At scale, this context bloat compounds into millions lost on inference. A long document burns through a context window before the useful signal arrives, and a two-minute demo might carry five seconds of real visual change.
Pied Piper strips out tokens that don't carry information before they reach a downstream model. Compression runs at millisecond scale across both text and vision modalities, producing output directly usable as context for any LLM or VLM. Our experiments show that pruning noisy tokens can improve performance on long-context tasks, because the filler that dilutes the actual signal is gone before the model sees it. You only pay when we save tokens, so the economics are aligned from day one.
What It Does
One endpoint, one SDK call: pied_piper.compress(input). Hand it text, documents, images, video, or a mixed batch, and get back compressed context ready for inference.
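In practice that call might look like the sketch below. Only pied_piper.compress(input) is named in this writeup; the API-key hook, the batch form, and the result fields are illustrative assumptions.

```python
# Hypothetical usage sketch: only pied_piper.compress() comes from the
# writeup; the api_key hook and result fields are assumptions.
import pied_piper

pied_piper.api_key = "pp-..."  # assumed configuration hook

result = pied_piper.compress([
    "quarterly_report.pdf",        # document
    "product_demo.mp4",            # video
    "A very long transcript ...",  # raw text
])

for item in result.items:  # assumed per-item metrics
    print(item.tokens_in, item.tokens_out, item.compression_ratio)
```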
For text, custom-trained BERT encoders classify token-level importance, dropping structural filler while preserving meaning. PDFs, slides, and markdown are extracted, chunked with paragraph-aware overlap, and compressed in a single pass. For video, TransNetV2 segments the footage into shots, representative frames are sampled per clip, and CLIP embeddings score each clip for novelty and coverage. A budgeted selector keeps the clips that maximize information density under a fidelity target, then stitches them back into a compressed MP4 artifact. Images pass through as first-class inputs without compression.
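As an illustration of the document path, a paragraph-aware chunker with overlap might look like this minimal sketch; the chunk size and overlap count are assumptions, not our production values.

```python
def chunk_paragraphs(text: str, max_chars: int = 4000, overlap_paras: int = 1):
    """Greedy paragraph-aware chunking: fill each chunk with whole
    paragraphs up to max_chars, then start the next chunk by repeating
    the last overlap_paras paragraphs for continuity."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks, current, size = [], [], 0
    for para in paragraphs:
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap_paras:]  # carry overlap forward
            size = sum(len(p) for p in current)
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```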
The system runs as a Modal-hosted FastAPI service on a shared GPU runtime holding the text compressor, TransNetV2, and CLIP in memory. The SDK is a lightweight httpx client, pip-installable as piedpiper-sdk, returning structured results with per-item metrics: token counts, compression ratios, and clip-level metadata for video.
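The per-item metrics could be modeled roughly as below; every field name here is an assumption about the response shape, not the SDK's documented schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ClipMeta:                 # video-only metadata (assumed names)
    start_s: float              # clip start within the source video
    end_s: float                # clip end
    novelty: float              # CLIP-based novelty score

@dataclass
class CompressedItem:           # one entry per input in a batch
    modality: str               # "text" | "document" | "image" | "video"
    tokens_in: int
    tokens_out: int
    compression_ratio: float
    clips: list[ClipMeta] = field(default_factory=list)
    artifact_url: Optional[str] = None  # e.g. the compressed MP4
```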
How We Built It
The core stack is PyTorch. Text compression uses a fine-tuned DistilBERT encoder for token-level importance scoring, integrated through LLMLingua-2 for adaptive rate control. The video pipeline combines TransNetV2 for shot boundary detection with CLIP ViT-B/32 for semantic scoring.
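In the spirit of LLMLingua-2-style token classification, the scoring step might look like this sketch; the checkpoint name and keep-threshold are placeholders, not our fine-tuned model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

CHECKPOINT = "your-org/distilbert-token-importance"  # hypothetical
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForTokenClassification.from_pretrained(CHECKPOINT).eval()

def compress_text(text: str, keep_threshold: float = 0.5) -> str:
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits            # (1, seq, 2): drop/keep
    keep_prob = logits.softmax(-1)[0, :, 1]     # P(keep) per token
    kept = [int(t) for t, p in zip(enc["input_ids"][0], keep_prob)
            if p >= keep_threshold]
    return tokenizer.decode(kept, skip_special_tokens=True)
```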
The inference service runs through Modal's ASGI integration with automatic scale-to-zero and GPU-backed containers. A single container loads all model runtimes at startup and serves requests sequentially. The SDK is a separate pyproject.toml-packaged Python library with one runtime dependency on httpx, keeping all ML inference server-side.
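A minimal shape for the service is sketched below; the app name, GPU type, and route are assumptions, and model loading is stubbed out.

```python
import modal

image = modal.Image.debian_slim().pip_install("fastapi", "torch")
app = modal.App("pied-piper")  # assumed app name

@app.function(image=image, gpu="A10G")  # GPU choice is an assumption
@modal.asgi_app()
def serve():
    from fastapi import FastAPI

    web = FastAPI()
    # The real service loads the text compressor, TransNetV2, and CLIP
    # here, once per container; placeholders stand in for this sketch.
    models = {"text": None, "transnet": None, "clip": None}

    @web.post("/compress")
    def compress(payload: dict):
        # Route each item to its modality's pipeline (elided here).
        return {"items": [], "models_loaded": list(models)}

    return web
```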
Challenges We Ran Into
The core training tension was semantic preservation versus aggressive cutting. Push too hard toward compression and meaningful tokens get dropped; stay too conservative and the ratio isn't worth the overhead. Getting both required careful iteration on the loss function to keep training from diverging.
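One plausible shape for that trade-off, purely as an illustration and not the loss we actually shipped, is a keep/drop cross-entropy plus a penalty that pulls the average keep probability toward a target rate:

```python
import torch
import torch.nn.functional as F

def compression_loss(logits, keep_labels, target_keep_rate=0.4, lam=0.1):
    # logits: (batch, seq, 2); keep_labels: (batch, seq) in {0, 1}
    ce = F.cross_entropy(logits.transpose(1, 2), keep_labels)
    keep_prob = logits.softmax(-1)[..., 1].mean()
    rate_penalty = (keep_prob - target_keep_rate) ** 2
    return ce + lam * rate_penalty  # lam trades fidelity against cutting
```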
On video, shot-level segmentation doesn't map cleanly to semantic importance. TransNetV2 gives clean shot boundaries, but visually distinct shots can be semantically redundant. CLIP-based novelty scoring on top of structural segmentation solved this.
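Concretely, novelty under a budget can be sketched as a greedy pass over clip embeddings; the scoring rule and budget handling here are simplified assumptions, not the production selector.

```python
import numpy as np

def select_clips(embeddings, durations, budget_s):
    """embeddings: (n_clips, d) L2-normalized CLIP vectors;
    durations: seconds per clip; budget_s: seconds to keep."""
    selected, remaining = [], budget_s
    candidates = set(range(len(embeddings)))

    def novelty(i):
        if not selected:
            return 1.0
        sims = embeddings[selected] @ embeddings[i]  # cosine similarities
        return 1.0 - float(sims.max())  # distance from the selected set

    while candidates:
        best = max(candidates, key=novelty)
        if durations[best] <= remaining:
            selected.append(best)
            remaining -= durations[best]
        candidates.remove(best)
    return sorted(selected)  # restore temporal order for stitching
```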
Holding all three model runtimes on a single GPU required careful memory management, and Modal's cold-start behavior forced us to tune the SDK's timeout profile so the client survives the first request after a scale-to-zero event.
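The resulting timeout profile looks roughly like the following; the specific numbers and base URL are illustrative, not the SDK's shipped defaults.

```python
import httpx

# Generous connect/read windows so the first request after scale-to-zero
# survives container boot and model loading.
timeout = httpx.Timeout(connect=30.0, read=300.0, write=60.0, pool=30.0)
client = httpx.Client(base_url="https://api.example.com", timeout=timeout)
```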
Accomplishments That We're Proud Of
Pied Piper is a live, deployed inference service with a pip-installable SDK: run pip install piedpiper-sdk, set an API key, and call pied_piper.compress(input). Text, documents, images, and video all go through a unified API contract, with the backend routing each modality to the appropriate compression pipeline. The contract is stable across modalities, so adding future capabilities won't break existing integrations.
What We Learned
The gap between a model that works in isolation and one that handles arbitrary user input at sub-second latency behind an HTTP API is wide. Much of the real work was input normalization, error handling, and graceful degradation when individual items fail in a mixed batch.
We also gained a deeper appreciation for context window management. Less context often means better results when the context that remains is the right context.
What's Next for Pied Piper
We're exploring smaller, faster text encoder architectures that match current quality at lower inference cost, and moving from clip-level selection to frame-level pruning within clips for more aggressive video compression. The longer-term goal is making the total cost of compression plus downstream inference strictly less than sending uncompressed context, across every modality, at every scale.