AI-powered applications increasingly rely on LLMs for planning, reasoning, and task execution across repetitive workflows such as customer service agents, SRE incident diagnosis, web navigation agents, and data-entry pipelines. These workflows routinely send structurally similar prompts that vary only in specific parameters (e.g., item name, region, incident ID), yet each request today incurs a full LLM inference call (~3.5s) and significant token cost.

Existing caching approaches fail to address this:

• Exact-match caching achieves a 0% hit rate whenever prompts vary at all.
• Semantic caching (GPTCache, LangCache) returns verbatim cached responses for "similar" prompts, yielding 100% incorrect results when prompt variations require corresponding response variations.
• KV/prefix caching (Anthropic, OpenAI, AWS Bedrock) reduces token recomputation cost but does not eliminate the LLM inference call.

GenCache is a generative caching layer that closes this gap. It identifies structural patterns across clusters of prompt-response pairs and synthesizes lightweight Python programs that generate correct, variation-aware responses locally, without calling the LLM. Published at NeurIPS 2025 (Microsoft Research / UIUC), it demonstrated 83–98% cache hit rates and up to 34% end-to-end latency reduction on real agentic workflows.
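
To make the idea concrete, here is a minimal, hypothetical sketch of a generative cache: it induces a prompt pattern and a response template from a cluster of structurally similar calls, then answers new variants locally. The `CacheEntry` and `induce_entry` names and the word-level alignment are illustrative assumptions, not GenCache's actual API; the real system's program synthesis is far more general than this.

```python
import re
from dataclasses import dataclass

# Illustrative sketch of generative caching (not GenCache's actual API):
# align a cluster of similar (prompt, response) pairs, turn varying spans
# into named slots, and serve future prompts by filling the template locally.

@dataclass
class CacheEntry:
    prompt_pattern: re.Pattern  # matches the cluster's prompts, capturing parameters
    response_template: str      # the cluster's response with {slot} placeholders

    def try_generate(self, prompt: str) -> str | None:
        """Return a locally generated response, or None on a cache miss."""
        m = self.prompt_pattern.fullmatch(prompt)
        if m is None:
            return None
        out = self.response_template
        for name, value in m.groupdict().items():
            out = out.replace("{" + name + "}", value)
        return out

def induce_entry(examples: list[tuple[str, str]]) -> CacheEntry:
    """Naively induce an entry from aligned (prompt, response) examples.

    Assumes (for this sketch only) that the clustered prompts tokenize to the
    same length and that varying parameter values appear verbatim in the
    responses.
    """
    prompts = [p.split() for p, _ in examples]
    parts, slots = [], {}  # slots: slot name -> value seen in the first example
    for i, words in enumerate(zip(*prompts)):
        if len(set(words)) == 1:            # constant across the cluster
            parts.append(re.escape(words[0]))
        else:                               # varying -> capture as a parameter
            name = f"p{i}"
            slots[name] = words[0]
            parts.append(rf"(?P<{name}>\S+)")
    template = examples[0][1]
    for name, value in slots.items():
        template = template.replace(value, "{" + name + "}")
    return CacheEntry(re.compile(r"\s+".join(parts)), template)

# Induce a cluster from two observed calls, then serve a third one locally:
entry = induce_entry([
    ("restart service auth in region eu-west", "Restarting auth in eu-west now"),
    ("restart service billing in region us-east", "Restarting billing in us-east now"),
])
print(entry.try_generate("restart service web in region ap-south"))
# -> Restarting web in ap-south now
```

On a miss (`try_generate` returns `None`), the request would fall through to the LLM as usual, and the new prompt-response pair could join a cluster for future synthesis.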
