Inspiration
Open-weight models are already strong. The wall now is serving them over long context. Two things break at once. Inference gets slow: when the KV cache outgrows a single GPU, recovering it after a miss dominates latency. And memory gets lossy: when a conversation overflows the context window, today's systems either summarize everything (blurring the details) or cut by recency (dropping the old fact you still needed). We wanted to attack both, under a fixed budget, without bigger hardware.
What it does
- Falcon KV (speed): a DRAM warm-tier KV recovery path for vLLM. When KV state spills off the GPU, it recovers from DRAM instead of recomputing from scratch.
- SalienceCompact (accuracy): an importance-aware memory layer. It scores each chunk of conversation history, keeps the load-bearing facts verbatim, compresses the routine, and drops the noise, building one compact memory that answers many future questions under a fixed token budget.
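To make the Falcon KV idea concrete, here is a minimal toy sketch of a DRAM warm tier with a reuse-biased eviction rule. It is not vLLM code and does not use any vLLM API: `WarmTier`, `spill`, `restore`, the block-id strings, and the "fewest restores first" rule are all assumptions chosen for illustration, with a plain Python dict standing in for host memory.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Entry:
    payload: bytes   # stand-in for a KV block copied to host memory
    hits: int = 0    # number of times this block has been restored


class WarmTier:
    """Toy DRAM warm tier (hypothetical): holds KV blocks spilled from the GPU."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block_id -> Entry, oldest use first

    def spill(self, block_id: str, payload: bytes) -> None:
        """Called when a block leaves the GPU."""
        if block_id in self.blocks:
            self.blocks[block_id].payload = payload
            self.blocks.move_to_end(block_id)
            return
        if len(self.blocks) >= self.capacity:
            self._evict()
        self.blocks[block_id] = Entry(payload)

    def restore(self, block_id: str):
        """Return the payload on a warm hit, or None so the caller recomputes."""
        entry = self.blocks.get(block_id)
        if entry is None:
            return None
        entry.hits += 1
        self.blocks.move_to_end(block_id)
        return entry.payload

    def _evict(self) -> None:
        # Reuse-biased: drop the block with the fewest restores.
        # min() scans oldest-first, so ties fall to the least recently used.
        victim = min(self.blocks, key=lambda k: self.blocks[k].hits)
        del self.blocks[victim]
```

In this sketch a block that has been restored at least once outlives a never-reused block of any age, which is one simple way to bias a cache toward reuse.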
How we built it
- Falcon KV: we kept the model (Gemma) and hardware (1x H100) fixed and ran stock vLLM as an honest baseline, then layered on a DRAM-backed warm tier with a reuse-biased policy and ran it against the same long-context traffic. We measured TTFT p95, end-to-end p95, post-TTFT token latency, and reload/spill volume.
- SalienceCompact: a training-free, model-agnostic, question-agnostic pipeline. We chunk the conversation, score each chunk for importance with a cheap model (Claude Haiku), map scores to actions (preserve exact, compress, drop), and reassemble a compact context under a budget. We evaluated on the LoCoMo long-conversation memory benchmark using its own F1 and exact-match metrics, plus an objective, model-free evidence-retention metric, with Mistral-7B as the answer model.
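The SalienceCompact steps (score, map to an action, reassemble under a budget) can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual pipeline: the thresholds `keep_at` and `drop_below`, the whitespace token counter, the greedy highest-score-first admission, and the caller-supplied `compress` function are all hypothetical, and the scores are assumed to have been produced already by the scoring model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    text: str
    score: float  # importance in [0, 1], assumed precomputed by a cheap scorer


def count_tokens(text: str) -> int:
    # Crude whitespace count; a real run would use the answer model's tokenizer.
    return len(text.split())


def compact(
    chunks: List[Chunk],
    budget: int,
    compress: Callable[[str], str],
    keep_at: float = 0.7,      # assumed threshold: keep verbatim at or above
    drop_below: float = 0.3,   # assumed threshold: drop below
) -> str:
    # 1. Map each score to an action: preserve exact, compress, or drop.
    planned = []
    for index, chunk in enumerate(chunks):
        if chunk.score >= keep_at:
            planned.append((index, chunk.score, chunk.text))
        elif chunk.score >= drop_below:
            planned.append((index, chunk.score, compress(chunk.text)))
        # else: dropped as noise

    # 2. Admit the most important chunks first until the budget is spent.
    kept, used = [], 0
    for index, _score, text in sorted(planned, key=lambda p: -p[1]):
        cost = count_tokens(text)
        if used + cost <= budget:
            kept.append((index, text))
            used += cost

    # 3. Reassemble in original conversation order.
    return "\n".join(text for _index, text in sorted(kept))
```

Because nothing here depends on a question, the same compact context can be built once and reused for many later queries, which is the question-agnostic property described above.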
Challenges we ran into
The model server crashed mid-run once, and our code silently turned every failed call into an empty answer, producing garbage F1 scores that looked real. We caught it, added a health check (empty-answer rate and scorer health), and made the runs trustworthy. We also hit a credential boundary: you should never put API keys on a leased GPU pod. We solved it by scoring with the Anthropic API locally and passing only a scores file to the pod, so the key never left our machine. On the KV side, the honest result is that the first warm-tier policy ties stock vLLM. The real lever turned out to be tail restore latency and smarter admission/eviction, not more plumbing.
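A health check of the kind described above might look like the sketch below. The names `RunHealth` and `check_health`, the convention that `None` marks a failed call, and the 5% empty-answer threshold are assumptions for illustration rather than our exact code. The point is that failures are counted and surfaced instead of being scored as empty strings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunHealth:
    total: int
    empty: int    # calls that returned only whitespace
    failed: int   # calls that raised and were recorded as None

    @property
    def empty_rate(self) -> float:
        # An empty run is treated as fully unhealthy.
        return self.empty / self.total if self.total else 1.0


def check_health(
    answers: List[Optional[str]],
    max_empty_rate: float = 0.05,  # assumed threshold
) -> RunHealth:
    """Refuse to score a run that contains failed calls or too many empty answers."""
    failed = sum(a is None for a in answers)
    empty = sum(a is not None and not a.strip() for a in answers)
    health = RunHealth(total=len(answers), empty=empty, failed=failed)
    if failed or health.empty_rate > max_empty_rate:
        raise RuntimeError(
            f"unhealthy run: {failed} failed calls, "
            f"empty-answer rate {health.empty_rate:.1%}"
        )
    return health
```

Running this before computing F1 means a crashed model server stops the evaluation loudly rather than producing a plausible-looking score.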
Accomplishments that we're proud of
SalienceCompact wins every cell on LoCoMo (multi-hop and single-hop, every budget), and the lead widens as the budget grows: summarization plateaus, selection scales. It retains 2 to 3 times the evidence of the baselines on the objective metric. The headline: using the same Mistral-7B, SalienceCompact at a 4K budget beats the LoCoMo paper's own Mistral running at 8K (single-hop 22.4 vs 10.2, multi-hop 22.1 vs 12.8, F1 x100). Better answers with half the context. On the systems side, we built the first working DRAM warm-tier KV recovery path on vLLM end to end and pinpointed exactly where the next gain lives.
What we learned
Smarter memory beats more hardware. The bottleneck is not raw context size; it is what you keep and how fast you can get it back. We also learned to never trust silent failures (always measure health), to keep our compaction question-agnostic so it reflects the real agent setting rather than retrieval, and that handcrafted cache heuristics are workload-fragile (the Bitter Lesson favors a scalable warm tier plus an adaptive policy).
What's next for Falcon
The two halves compose. SalienceCompact keeps high-value tokens verbatim, which is exactly what a KV warm tier can cache and reuse, where summarization cannot. Next steps: an adaptive admission/eviction policy to cut tail restore latency, measuring the combined accuracy-per-token-of-context against latency, and validating SalienceCompact on real agent traces with tool calls, not just conversation.