Cheetah.ai

Inspiration

Modern multi-agent pipelines - legal reviewers, financial analysts, document auditors - share the same documents across many agents. But every agent is stateless. Each one rebuilds its context from scratch, re-reading the same document the model just processed 30 seconds ago. We called this the amnesia tax.

The root cause is mathematical. Standard prefix caches like RadixAttention reuse KV state only up to the first token mismatch. The moment an agent's system prompt diverges at token 0 - different role, different task wrapping the same document - the entire shared context is a cold miss. Cache hits per pipeline: zero. We wanted to fix it not by changing the model or the cache engine, but by adding the layer that was missing: a control plane that reads the workflow graph and prepares the cache before the next agent asks.
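To make the cold-miss argument concrete, here is a minimal sketch of prefix reuse. It models prompts as word lists and is not RadixAttention itself; the function name `reusable_prefix_len` and the toy prompts are assumptions for illustration.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Count the leading tokens two prompts share, i.e. what a prefix cache can reuse."""
    shared = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        shared += 1
    return shared


# Toy "tokens": whole words stand in for real tokenizer output.
doc = ["clause", "1", ":", "payment", "due", "in", "30", "days"]
role_a = ["You", "are", "a", "legal", "reviewer", "."]
role_b = ["Act", "as", "a", "financial", "analyst", "."]

# Role text first: the prompts diverge at token 0, so nothing is reusable.
role_first = reusable_prefix_len(role_a + doc, role_b + doc)

# Document first: the whole document block is a shared prefix.
doc_first = reusable_prefix_len(doc + role_a, doc + role_b)

print(role_first, doc_first)
```

With the role text first the shared prefix is empty, while moving the document to the front makes every document token reusable.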

What We Built

Cheetah.ai is a shared context bridge for multi-agent inference. It sits between your agent framework and your serving engine and rests on four pillars:

1. The Bridge

Prefix caches miss the moment a single byte changes at the start of a prompt. The Bridge restructures each agent's prompt into a canonical shape: [DOC] + [SYS] + [TASK]. The heavy document block is injected first as a byte-stable prefix, and agent-specific instructions are appended downstream. A SHA-256 fingerprint over the document block proves byte-identity across agents, even when system prompts drift. The result: every agent in the pipeline hits the same cached document prefix, with no retraining and no changes to the serving engine.
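A minimal sketch of the restructuring step, assuming the document block goes first as the text above describes. The function `bridge_prompt`, the `BridgedPrompt` type, and the literal section markers are hypothetical and are not taken from the Cheetah.ai code.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BridgedPrompt:
    text: str             # the restructured prompt sent to the serving engine
    doc_fingerprint: str  # SHA-256 hex digest of the document block


def bridge_prompt(document: str, system: str, task: str) -> BridgedPrompt:
    """Place the document first so it forms a byte-stable shared prefix."""
    doc_block = f"[DOC]\n{document}\n"
    # Fingerprint only the document block: agent-specific text must not affect it.
    fingerprint = hashlib.sha256(doc_block.encode("utf-8")).hexdigest()
    text = f"{doc_block}[SYS]\n{system}\n[TASK]\n{task}"
    return BridgedPrompt(text=text, doc_fingerprint=fingerprint)


reviewer = bridge_prompt("Lease text", "You are a legal reviewer.", "List the risks.")
analyst = bridge_prompt("Lease text", "You are a financial analyst.", "Sum the fees.")

# Different agents, same document: the fingerprints match.
print(reviewer.doc_fingerprint == analyst.doc_fingerprint)
```

Two agents with different roles and tasks produce different prompts but the same fingerprint, which is the condition under which a prefix cache can serve both from one prefill.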

2. The Orchestrator

Between agent calls, the GPU sits idle. We use that dead time. The orchestrator reads the workflow manifest - a YAML DAG declaring which agents run, in what order, reading which documents. After Agent (i) completes, the orchestrator looks ahead to Agent (i+1), identifies its document, and fires a keep_resident warmup request before Agent (i+1) dispatches. This works for any document because the orchestrator knows what is coming from the DAG, not from traffic history. By the time Agent (i+1)'s request lands, the cache is already hot.
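The lookahead can be sketched as follows. This is a simplification that assumes a linear chain of agents rather than a general DAG; the manifest layout, agent names, and function names are hypothetical, and the dictionary stands in for whatever the YAML manifest parses to.

```python
from typing import Callable, Optional

# Hypothetical manifest, shaped like the result of parsing a small YAML file.
MANIFEST = {
    "agents": [
        {"name": "legal_review", "document": "contract.txt"},
        {"name": "financial_analysis", "document": "ledger.txt"},
        {"name": "audit", "document": "policy.txt"},
    ]
}


def next_document(manifest: dict, finished_agent: str) -> Optional[str]:
    """Return the document the following agent will read, or None at the end."""
    names = [agent["name"] for agent in manifest["agents"]]
    position = names.index(finished_agent)
    if position + 1 >= len(names):
        return None
    return manifest["agents"][position + 1]["document"]


def on_agent_complete(manifest: dict, finished_agent: str,
                      warm: Callable[[str], None]) -> None:
    """Completion hook: warm the next agent's document before it dispatches."""
    upcoming = next_document(manifest, finished_agent)
    if upcoming is not None:
        warm(upcoming)


warmed = []
for agent in MANIFEST["agents"]:
    # The agent's own inference call would run here.
    on_agent_complete(MANIFEST, agent["name"], warmed.append)

print(warmed)
```

Here `warm` is just `list.append` so the order of warmups is visible; in a real deployment it would issue the keep_resident request instead.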

3. Robustness

Real documents get amended. A 64-bit SimHash over the document block catches near-duplicates - whitespace edits, swapped numbers, reordered clauses - using a Hamming distance threshold of at most 10 of the 64 bits. A near-match still triggers keep-resident, so a minor edit does not become a cold miss.

Eviction is also forward-looking. Standard LRU drops the least recently used entry. Ours drops what the next agent in the DAG will not need, keeping the upcoming document resident regardless of recency.
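A minimal sketch of the near-duplicate check. The feature choice (lowercased, unweighted word tokens) and the per-token hash (the first 8 bytes of SHA-256) are assumptions made here for illustration; the text above only fixes the 64-bit width and the threshold of 10.

```python
import hashlib
import re


def simhash64(text: str) -> int:
    """64-bit SimHash over lowercased word tokens (unweighted)."""
    counts = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(64):
            counts[bit] += 1 if (token_hash >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, threshold: int = 10) -> bool:
    return hamming(simhash64(a), simhash64(b)) <= threshold


original = "Payment is due in 30 days. Late fees apply after 45 days."
reflowed = "Payment is due in 30 days.\n\n  Late fees apply   after 45 days."
print(is_near_duplicate(original, reflowed))
```

Because this variant treats the text as a bag of words, whitespace edits and reordered clauses yield an identical fingerprint. How far a changed number moves the fingerprint depends on document length, so the threshold is best tuned on real documents.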

4. Snowflake Analytics & AI Layer

We turned Snowflake from a passive data warehouse into the active AI and observability backbone of our control plane. Every orchestrator decision, cache hit, and eviction is asynchronously streamed to a live Snowflake sink. Instead of just drawing static charts, we use Snowflake Dynamic Tables to power a real-time leaderboard aggregating TTFT speedups and GPU-seconds saved - no external schedulers needed. We also use Snowflake Cortex AI (llama3.1-70b) directly within the database to read the orchestrator's decision log and automatically generate a natural-language run summary narrating how the pipeline performed.

Results

Tested on an Apple M4 Pro (16 GB) with vllm-mlx:

Scenario	Total TTFT
Stateless agents, UUID-busted prefixes	104 s
Cheetah.ai, 3 different documents	36 s

About 2.9x faster by the totals above (104 s / 36 s ≈ 2.9) - on a harder workload. Agents 2 and 3 landed in under 0.5 seconds each. The orchestrator absorbed their cold prefills between agents, where the user cannot feel them. GPU-seconds saved ≈ 68 s per pipeline (104 s - 36 s).

Challenges

The hardest problem was the prompt restructuring guarantee. Agents in the wild prepend role text, session IDs, and metadata before the document - exactly what breaks prefix caching. Getting the Bridge to enforce [DOC] at position 0 without breaking downstream task instructions required careful prompt boundary detection.

The second challenge was timing the orchestrator warmups correctly. Fire too early and the warmup competes with the active agent for GPU memory. Fire too late and Agent (i+1) dispatches before the cache is warm. We solved this by hooking into the agent completion event and issuing warmups with max_tokens=1 - enough to force a full prefill of the document block, but cheap enough not to interfere.
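A warmup of this kind might look like the sketch below. It assumes the serving engine exposes an OpenAI-compatible `/v1/completions` endpoint; the base URL, model name, and function name are placeholders, not details of the Cheetah.ai code. The request is built but not sent.

```python
import json
import urllib.request


def build_warmup_request(base_url: str, model: str,
                         doc_block: str) -> urllib.request.Request:
    """Build a one-token completion request whose only purpose is the prefill."""
    payload = {
        "model": model,
        "prompt": doc_block,  # must be byte-identical to the prefix the agent will send
        "max_tokens": 1,      # generate a single token: prefill cost only
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder endpoint and model name, for illustration only.
request = build_warmup_request("http://localhost:8000", "my-model", "[DOC]\nLease text\n")
print(request.get_method(), request.full_url)

# To actually fire the warmup from the completion hook:
# urllib.request.urlopen(request, timeout=60)
```

The warmup only helps if its prompt matches the leading bytes of the next agent's prompt exactly, which is why it sends the document block alone.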

Finally, integrating the live Snowflake telemetry without slowing down the hot inference path was a hurdle. We had to build a non-blocking, batched queue system that fires observability data to Snowflake asynchronously.
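One way to build such a queue with the standard library is sketched below. The class name `TelemetrySink` and its parameters are hypothetical; `flush` is any callable that writes a batch, and in production it would wrap the Snowflake write, which is omitted here.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the worker to drain and exit


class TelemetrySink:
    """Buffers events off the hot path and flushes them in batches."""

    def __init__(self, flush, batch_size=50, flush_interval=1.0, max_pending=10_000):
        self._queue = queue.Queue(maxsize=max_pending)
        self._flush = flush
        self._batch_size = batch_size
        self._interval = flush_interval
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def emit(self, event: dict) -> bool:
        """Never blocks the caller; returns False if the buffer is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            return False

    def close(self) -> None:
        """Flush whatever is left and stop the worker."""
        self._queue.put(_STOP)
        self._worker.join()

    def _run(self) -> None:
        batch, running = [], True
        while running:
            try:
                item = self._queue.get(timeout=self._interval)
            except queue.Empty:
                pass  # idle period: fall through and flush a partial batch
            else:
                if item is _STOP:
                    running = False
                else:
                    batch.append(item)
                    if len(batch) < self._batch_size:
                        continue
            if batch:
                self._flush(batch)
                batch = []


sent_batches = []
sink = TelemetrySink(flush=sent_batches.append, batch_size=2, flush_interval=5.0)
for seq in range(5):
    sink.emit({"event": "cache_hit", "seq": seq})
sink.close()
print(sum(len(batch) for batch in sent_batches))
```

Dropping events when the buffer is full is a deliberate trade in this sketch: losing a telemetry row is cheaper than stalling inference.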

Future Improvements

A key goal is for the system to work across virtually any open-source model architecture. Depending on which model an agent uses, the runtime would automatically select and manage the appropriate KV cache format and attention implementation for that architecture. The orchestrator would still coordinate the cache between agent calls, creating entries ahead of time and minimizing cache misses and unnecessary recomputation, so the user experiences a seamless low-latency workflow even when different agents or models are involved.

What We Learned

The biggest insight was that the workflow graph is the missing input. Serving engines are powerful but blind to intent. The moment you give the cache a lookahead - even one step - the problem changes completely. The cold prefill does not disappear; it just moves to where the user cannot feel it.
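The same one-step lookahead can drive eviction, as the Robustness pillar describes. A minimal sketch, assuming the cache is keyed by document name and ordered least recently used first; `choose_victim` and the cache contents are hypothetical.

```python
from collections import OrderedDict


def choose_victim(cache: OrderedDict, upcoming_docs: set) -> str:
    """Pick the entry to evict from a non-empty cache ordered oldest-first."""
    for doc_id in cache:
        if doc_id not in upcoming_docs:
            return doc_id        # oldest entry the DAG does not need next
    return next(iter(cache))     # everything is needed: fall back to plain LRU


cache = OrderedDict()
cache["policy.txt"] = "kv-A"     # least recently used
cache["contract.txt"] = "kv-B"
cache["ledger.txt"] = "kv-C"     # most recently used

lru_victim = next(iter(cache))                                    # plain LRU choice
dag_victim = choose_victim(cache, upcoming_docs={"policy.txt"})   # lookahead choice

print(lru_victim, dag_victim)
```

Plain LRU would evict policy.txt, the very document the next agent is about to read, while the lookahead policy skips it and evicts contract.txt instead.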

Built With

python
pyyaml
sha-256
simhash
snowflake
streamlit
vllm-mlx

Submitted to

Uncommon Hacks 2026
- Winner Best Use of Snowflake

Created by

I helped drive Cheetah.ai from concept to completion. I focused heavily on the initial idea generation and keeping our team organized with logistics. During the build, I helped with general execution and took the lead on preparing our final demo. It was a great experience balancing both the operational and hands-on sides of the project!

Md Samad
I worked across the project end-to-end from the initial idea and planning out the overall architecture to thinking through how the different components should fit together cleanly. I also helped with testing, validating results, debugging issues, and making sure everything worked properly during the demo flow.

I also contributed to the slides, demo creation, and overall communication/story of the project.

It was a pretty collaborative build so I ended up contributing across a mix of technical, product, and presentation work.

abdullahyousaf1798
Ali Mardan

Updates

Ali Mardan started this project — May 17, 2026 04:37 AM EDT
