Tessera

Every answer is a tile your team already cut.

Inspiration

It started with a small, familiar moment: asking an AI assistant something and realizing you'd asked nearly the same thing yesterday but had forgotten the answer. You just paid for that answer twice.

One of our teammates kept watching a bigger version of this play out at their company. The same questions, again and again — "How do I get staging access?" "Which service owns the billing webhook?" "What's the deploy command for the legacy repo?" Sometimes a senior engineer answered for the fifth time that month; increasingly, an AI assistant answered instead. But the assistant had no memory that the organization had already solved it. Every repeat was a brand-new call — full cost, full latency — to regenerate an answer the company already had.

At the scale AI now runs, that waste is enormous. Google alone processes 3.2 quadrillion tokens a month; the whole LLM API market runs roughly 1.5 quadrillion tokens a month — about 50 trillion a day. Enterprises spent $37 billion on AI in 2025, and budgets are already breaking: Uber's CTO said the company burned through its entire 2026 AI budget in four months. The problem isn't that the models are bad. It's that we keep paying them to regenerate what we already know. That gap is what Tessera closes.

What it does

Tessera is a shared, segmented knowledge layer that sits in front of a team's AI assistant and reuses an answer the moment someone has already given it — instead of regenerating it.

The flow is straightforward:

A developer asks a question.
Before the question reaches the LLM, Tessera embeds it and checks Redis for a semantically similar question that has already been answered.
Cache hit — Tessera returns the institutional answer instantly, with context such as "3 engineers at your level asked this last week — here's what worked."
Cache miss — the question goes to the LLM, and the new answer is stored as a fresh tile for everyone who comes after.
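The flow above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not our Redis implementation: the bag-of-words embed function, the TileCache class, the 0.3 threshold, and the call_llm callback are all hypothetical stand-ins for the embedding model, the Redis index, and the real LLM client.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().replace("?", "").split())


def cosine_distance(a, b):
    """Cosine distance between two sparse word-count vectors (0 = identical direction)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


class TileCache:
    """In-memory stand-in for the semantic cache (hypothetical, not the Redis-backed one)."""

    def __init__(self, distance_threshold=0.3):
        self.distance_threshold = distance_threshold
        self.tiles = []  # list of (embedding, answer)

    def check(self, question):
        """Return the closest stored answer if it is within the threshold, else None."""
        query = embed(question)
        best = min(self.tiles, key=lambda t: cosine_distance(query, t[0]), default=None)
        if best is not None and cosine_distance(query, best[0]) <= self.distance_threshold:
            return best[1]
        return None

    def store(self, question, answer):
        """Save a new tile for everyone who asks later."""
        self.tiles.append((embed(question), answer))


def ask(question, cache, call_llm):
    """Check the cache first; only call the LLM on a miss, then store the new tile."""
    cached = cache.check(question)
    if cached is not None:
        return cached, "hit"
    answer = call_llm(question)
    cache.store(question, answer)
    return answer, "miss"
```

Asking the same question twice through ask should call call_llm only once: the first call is a miss that stores a tile, and the second is served from the cache.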

This wins on two fronts at once. The infrastructure win: a cache hit skips the entire LLM call, cutting cost and returning in milliseconds instead of seconds. The business win: the same engine means engineers stop re-answering each other, with fewer interruptions and smoother onboarding. Each answer is a single tessera — a tile — and together they form a living mosaic of what the team already knows.

How we built it

Tessera intercepts a question before it ever reaches the model, rather than answering it after the fact like a typical search tool or chatbot.

Redis + RedisVL power the vector search and semantic cache. RedisVL's SemanticCache provided meaning-based matching, tunable distance thresholds, and TTLs out of the box. Its filterable_fields enabled one of our favorite features — answers segmented by tenure and seniority, so a new hire and a staff engineer asking the "same" question receive answers pitched at the right level.
Python ties the embedding, similarity search, and fallback-to-LLM path together.
Sentry catches the failure modes that matter most in a cache: incorrect matches and errors in the answer path.

Redis was the natural core: semantic caching needs both fast vector similarity search and a key-value store to fetch the stored response, and Redis handles both in one place.
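To illustrate the segmentation idea, here is a small pure-Python sketch in which a lookup only considers tiles stored for the asker's own segment. It does not use the RedisVL API: SegmentedCache, the seniority labels, the word-overlap distance, and the 0.4 threshold are hypothetical and chosen only to keep the example self-contained.

```python
def jaccard_distance(a, b):
    """Word-overlap distance (assumption: stands in for vector distance)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return 1.0 - (len(wa & wb) / len(union) if union else 0.0)


class SegmentedCache:
    """Hypothetical in-memory cache whose tiles are tagged with a seniority segment."""

    def __init__(self, distance_threshold=0.4):
        self.distance_threshold = distance_threshold
        self.tiles = []  # dicts with question, answer, seniority

    def store(self, question, answer, seniority):
        self.tiles.append({"question": question, "answer": answer, "seniority": seniority})

    def check(self, question, seniority):
        """Filter by segment first, then pick the closest question within the threshold."""
        candidates = [t for t in self.tiles if t["seniority"] == seniority]
        best = min(
            candidates,
            key=lambda t: jaccard_distance(question, t["question"]),
            default=None,
        )
        if best is not None and jaccard_distance(question, best["question"]) <= self.distance_threshold:
            return best["answer"]
        return None


cache = SegmentedCache()
cache.store("how do i get staging access", "Step-by-step guide with screenshots", "new_hire")
cache.store("how do i get staging access", "Request the staging role from the platform team", "staff")
```

With these two tiles stored, the same question returns a different answer depending on the segment passed to check, and a segment with no tiles returns None. Filtering before matching is also what keeps one segment's answers from being served to another.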

Challenges we ran into

Threshold tuning. Too loose, and the cache matches a question that is merely similar and returns the wrong answer; too strict, and it almost never hits. Finding the right distance threshold, and adding a confidence gate on top, was our hardest correctness problem.
Staleness. A cached answer can go out of date the moment a service is renamed or a process changes. We rely on TTLs and tile invalidation to keep the mosaic trustworthy.
Privacy and segmentation. A shared cache must not leak answers across permission boundaries. Segmenting by team and seniority had to respect access, not just adjust the tone of the response.
Sizing the value honestly. Semantic caching's savings are real but depend on workload: repetitive, FAQ-style traffic hits 40–70% of the time, while creative or multi-turn work barely caches at all. We were careful to claim savings only where the data supports them.
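One way to approach the threshold trade-off is to sweep candidate thresholds over a small labelled set of question pairs and watch how the hit rate and the wrong-hit rate move together. The sketch below assumes such a labelled set exists; the sweep function and the distances in pairs are made up for illustration and are not measurements from our system.

```python
def sweep(pairs, thresholds):
    """pairs: list of (distance, is_same_intent). Returns one summary row per threshold."""
    rows = []
    for t in thresholds:
        # A pair counts as a cache hit when its distance is within the threshold.
        hits = [same for dist, same in pairs if dist <= t]
        wrong = sum(1 for same in hits if not same)
        rows.append({
            "threshold": t,
            "hit_rate": len(hits) / len(pairs),
            "wrong_hit_rate": wrong / len(hits) if hits else 0.0,
        })
    return rows


# Hypothetical labelled pairs: (distance between two questions, do they share an answer?)
pairs = [
    (0.05, True), (0.10, True), (0.18, True), (0.22, False),
    (0.30, True), (0.35, False), (0.50, False), (0.70, False),
]

for row in sweep(pairs, [0.1, 0.2, 0.3, 0.4]):
    print(row)
```

On this toy data, loosening the threshold from 0.2 to 0.3 raises the hit rate but also starts returning answers to questions that only look alike, which is the failure a confidence gate is meant to catch.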

Accomplishments that we're proud of

We built a working semantic cache that reuses real answers before the LLM is called — not just a search box that retrieves them afterward.
We made segmentation by tenure and seniority a first-class feature, so the same question returns the right answer for the right person.
We grounded the whole pitch in independent research and a transparent model, rather than optimistic claims.
We delivered a clear, demoable moment — the "3 engineers at your level asked this last week" experience — that makes the value obvious in seconds.
We built it on Redis as core infrastructure, using the sponsor's technology for exactly what it does best.

What we learned

Before writing a line of code, we checked whether this was a real, measurable problem. It is — on both the cost side and the human side.

The cost is exploding. The LLM API market processes ~1.5 quadrillion tokens a month, enterprises spent $37 billion on AI in 2025 (up 3.2x in a year), and companies are already hitting budget ceilings. Provider prompt caching helps, but only partially — it discounts the input tokens (Anthropic charges 0.1x on cache reads) while still regenerating every output. Reusing the whole answer requires a semantic cache.

Semantic caching works, and it's fast. Redis LangCache reports up to ~73% cost reduction on high-repetition workloads, with hits returning in milliseconds versus seconds. In one benchmark, a 7-second model call became a 27 ms cache hit, a speedup of roughly 260x. A peer-reviewed, Redis-based semantic cache reported 61–69% hit rates with over 97% accuracy on repetitive queries.
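As a rough back-of-the-envelope check, the figures above can be combined into an average latency per question. This is only a sketch: it mixes the latencies from one benchmark with hit rates from a different study, and it ignores the lookup overhead paid on a miss, so treat the output as illustrative. The function name blended_latency_s is our own.

```python
def blended_latency_s(hit_rate, hit_s=0.027, miss_s=7.0):
    """Average latency when a fraction hit_rate of questions is served from cache.

    Assumption: hits cost 27 ms and misses cost 7 s, as in the benchmark quoted above.
    """
    return hit_rate * hit_s + (1.0 - hit_rate) * miss_s


for h in (0.0, 0.61, 0.69):
    print(f"hit rate {h:.0%}: {blended_latency_s(h):.2f} s average")
```

Under these assumptions, a 61% hit rate brings the average from 7 seconds down to about 2.7 seconds per question.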

The human cost is just as real. The average knowledge worker spends 8.2 hours a week finding, recreating, and duplicating information (APQC). Three out of four developers re-answer questions they've answered before (Stack Overflow). Developers spend only about 16% of their week actually coding (Atlassian). And individual AI tools don't fix the team problem — only 17% of agent users said agents improved team collaboration.

The model, in plain terms. For a team of $n$ engineers, each asking $q$ repeated questions per week, where every question costs $m$ minutes of engineer time at a loaded hourly cost $c$, the annual cost of repeated questions is:

$$\text{Annual cost} = n \times q \times \frac{m}{60} \times c \times 52$$

and the value Tessera recovers at capture rate $r$ is:

$$\text{Value saved} = \text{Annual cost} \times r$$

For 50 engineers asking 5 questions a week at 10 minutes each and $100/hr, that's about $217,000 a year; capturing 30% recovers roughly $65,000 — well above the cost of the tool.
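The same arithmetic can be checked in a few lines of Python. The function names are ours, and the inputs are the example values from the paragraph above rather than measured data.

```python
def annual_cost(n, q, m, c, weeks=52):
    """Annual cost of repeated questions: n engineers, q questions each per week,
    m minutes per question, loaded hourly cost c."""
    return n * q * (m / 60) * c * weeks


def value_saved(cost, r):
    """Value recovered at capture rate r (a fraction between 0 and 1)."""
    return cost * r


cost = annual_cost(n=50, q=5, m=10, c=100)
saved = value_saved(cost, r=0.30)

print(f"Annual cost: ${cost:,.0f}")   # Annual cost: $216,667
print(f"Value saved: ${saved:,.0f}")  # Value saved: $65,000
```

Because every term is multiplied, the result scales linearly: doubling the team size or the minutes per question doubles both figures.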

This gave us our framing: token savings are the infrastructure win, and engineering time is the business win — and Tessera delivers both from one cache.

What's next for Tessera

Expanding to high-volume and customer-facing AI workloads, where token savings compound into hundreds of thousands of dollars a year.
Automatically promoting frequently-hit tiles into a curated, human-verified FAQ board.
Smarter staleness detection tied to repository and infrastructure changes.
Deeper segmentation and routing, with richer "who asked this and what worked" context.
Pilots with real teams to measure capture rates, dollars saved, and time recovered in practice.

Tessera started with one engineer answering the same question for the fifth time. That frustration turned out to be shared across the entire industry — and the fix isn't a smarter chatbot. It's a system that stops paying to regenerate what it already knows.

Built With

arize
devin
python
redis
redisvl
semantic-cache
vector-search

Updates

Altair Adilkhan started this project — Jun 21, 2026 01:58 PM EDT
