KVStream: A Decode-Side Attention Datapath for Long-Context Inference

Inspiration

Modern inference hardware is very good at dense linear algebra. Prefill, projections, and MLPs expose large matrix operations, so they map naturally onto systolic arrays and GEMM engines. Decode is a different problem.

In long-context decode, every generated token has to scan a growing KV cache. At low batch sizes, that stops looking like a dense matrix multiply and starts looking like a streaming memory-and-reduction problem: one query vector, a long KV cache, online normalization, and a final attention output.

KVStream was built around that observation. Rather than trying to build a full transformer accelerator, we focused on one critical decode primitive: a streaming attention datapath that sits next to dense compute units and specializes the KV-cache scan.

We are not claiming online softmax or FlashAttention-style streaming attention as new. The contribution is translating that known structure into measured hardware, identifying where it actually breaks down, and exploring how a decode-specific ASIC datapath can push beyond the general-purpose baseline.

What it does

KVStream implements a streaming attention tile for single-query decode. It takes the current query vector, streams cached K/V vectors, and produces the final per-head attention output.

Instead of storing the full attention score matrix or softmax probabilities, the tile keeps only the running online-softmax state: the running max m, the denominator l, and the weighted value accumulator acc[D]. At the end, it emits the attention output O[D], which would feed the normal output projection path.
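As a rough illustration of that state, here is a minimal pure-Python sketch of single-query online-softmax attention next to a two-pass reference. It is not the project's HLS code or its NumPy reference. The function names are hypothetical, and any score scaling such as 1/sqrt(D) is assumed to be folded into q.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reference_attention(q, keys, values):
    """Two-pass softmax attention for one query; materializes every score."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def streaming_attention(q, keys, values):
    """One pass over the KV cache, holding only (m, l, acc) as state."""
    m = -math.inf                    # running max of the scores seen so far
    l = 0.0                          # running softmax denominator
    acc = [0.0] * len(values[0])     # running weighted sum of V rows
    for k, v in zip(keys, values):
        s = dot(q, k)
        m_new = max(m, s)
        rescale = math.exp(m - m_new)   # exp(-inf) is 0.0 on the first token
        p = math.exp(s - m_new)
        l = l * rescale + p
        acc = [a * rescale + p * vd for a, vd in zip(acc, v)]
        m = m_new                    # every update depends on the previous state
    return [a / l for a in acc]      # final attention output O[D]
```

Each loop iteration reads the m, l, and acc written by the previous one. That loop-carried dependency is what a hardware pipeline has to contend with.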

The project has three main parts:

A measured serial HLS/RTL streaming attention baseline.
A block-parallel online-softmax model that breaks the long recurrence across KV blocks.
A Skip-Softmax-style HLS extension that gates V reads, exp calls, and PV MACs when a block is negligible.

How we built it

We started with a serial streaming attention tile in HLS. The tile computes exact single-query attention by streaming through the KV cache and maintaining online-softmax state. We validated it against a NumPy reference, ran HLS synthesis, passed RTL cosimulation, and ran Vivado out-of-context synthesis for the serial baseline.

The serial baseline matched reference attention with around 1e-6 error, passed RTL cosim, and met timing in Vivado OOC at 200 MHz. However, the datapath could not accept a new KV token every cycle: each token's update to the online-softmax state depends on the state left by the previous token, so the loop carries a dependency from one iteration to the next.

That measurement shaped the rest of the project. The first bottleneck was not just memory traffic; it was the recurrence through m, l, and acc[D].

From there, we built a block-parallel architecture model. Each KV block computes its own local summary: m_b, l_b, and acc_b[D]. A reducer can merge those block summaries without revisiting the individual tokens. This changes attention from one long serial chain into parallel block computation followed by a reduction tree.
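The merge step can be sketched in a few lines of pure Python. This is an illustrative model under assumptions, not the project's architecture model: the function names are hypothetical, scores are plain q·k with any scaling folded into q, and the pairwise tree is just one possible reduction topology.

```python
import math

def block_summary(q, keys, values):
    """Local (m_b, l_b, acc_b) for one KV block, independent of other blocks."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m_b = max(scores)
    probs = [math.exp(s - m_b) for s in scores]
    l_b = sum(probs)
    dim = len(values[0])
    acc_b = [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]
    return m_b, l_b, acc_b

def merge(left, right):
    """Combine two block summaries without revisiting their tokens."""
    m1, l1, a1 = left
    m2, l2, a2 = right
    m = max(m1, m2)
    w1, w2 = math.exp(m1 - m), math.exp(m2 - m)   # rescale both to the shared max
    return m, l1 * w1 + l2 * w2, [x * w1 + y * w2 for x, y in zip(a1, a2)]

def tree_reduce(summaries):
    """Pairwise reduction; the merges within one level are independent."""
    level = list(summaries)
    while len(level) > 1:
        nxt = [merge(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:           # an odd summary passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

def block_parallel_attention(q, keys, values, block_size):
    summaries = [block_summary(q, keys[i:i + block_size], values[i:i + block_size])
                 for i in range(0, len(keys), block_size)]
    m, l, acc = tree_reduce(summaries)
    return [a / l for a in acc]
```

Because the merge is associative up to floating-point rounding, the result should not depend on the block size or the shape of the tree, which is what lets the blocks run in parallel.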

We also added a Skip-Softmax-style HLS extension. For each K block, the hardware computes the maximum QK score in that block. If the block max is far below the running max, the block is unlikely to contribute meaningfully to the final softmax output. In that case, the tile skips the V block read, the exp calls, and the PV multiply-accumulate path. It still computes QK for every token, so this is value-path gating, not skipping all attention work.
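A minimal software sketch of that gate, using the streaming policy (comparison against the running max), might look like the following. The function name, the `threshold` parameter, and its value in any example are assumptions for illustration. The writeup does not state the actual skip criterion or threshold used in the HLS extension.

```python
import math

def gated_attention(q, keys, values, block_size, threshold):
    """Streaming attention that skips the value path of negligible blocks.

    Returns (output, skipped_blocks). QK is computed for every token;
    only the V read, the exp calls, and the PV accumulation are gated.
    """
    m, l = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    skipped = 0
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        block_max = max(scores)
        if block_max < m - threshold:
            skipped += 1
            continue                     # no V read, no exp, no PV MACs
        v_blk = values[start:start + block_size]   # V is read only for kept blocks
        m_new = max(m, block_max)
        rescale = math.exp(m - m_new)    # exp(-inf) is 0.0 for the first block
        l *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * vd for a, vd in zip(acc, v)]
        m = m_new
    return [a / l for a in acc], skipped
```

With `threshold=math.inf` the gate never fires and the function reduces to exact streaming attention, which gives a convenient baseline for measuring the error introduced by a finite threshold.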

Bottlenecks

The main bottlenecks are:

The serial online-softmax recurrence through m, l, and acc[D].
KV memory traffic. In our 4K-context projection, each generated token requires reading about 268 MB of KV data.
The area cost of aggressive block-level parallelism.
Memory interface packing. The initial HLS interface used raw int8_t Q/K/V values, which wastes most of the AXI bandwidth when each transfer carries only 8 useful bits on a wider bus.
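The packing bottleneck comes down to simple arithmetic. The sketch below uses hypothetical helper names, and the 128-bit bus width in the usage note is an assumed example, not the width of the project's interface.

```python
def bus_utilization(useful_bits, bus_bits):
    """Fraction of each transfer that carries useful payload."""
    return useful_bits / bus_bits

def beats_needed(n_elements, element_bits, bus_bits, packed):
    """Bus transfers required to move n_elements, packed or one per beat."""
    per_beat = bus_bits // element_bits if packed else 1
    return -(-n_elements // per_beat)        # ceiling division

def pack_int8(values, bus_bytes):
    """Pack signed int8 values into bus-wide words, zero-padding the tail."""
    words = []
    for i in range(0, len(values), bus_bytes):
        chunk = values[i:i + bus_bytes]
        raw = bytes(v & 0xFF for v in chunk)  # two's-complement byte per value
        words.append(raw.ljust(bus_bytes, b"\x00"))
    return words
```

On an assumed 128-bit bus, `bus_utilization(8, 128)` is 0.0625, and moving 4096 int8 values takes 4096 beats unpacked versus 256 beats packed.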

How KVStream works around them

The serial streaming tile avoids materializing the attention matrix. It keeps only online-softmax state locally and emits the final attention output.

The block-parallel model attacks the recurrence. Instead of processing every token through one dependency chain, each block computes a local attention summary and a reducer merges the summaries.

The Skip-Softmax gate attacks the value side of the memory and compute path. If a block’s max score is too small, the datapath skips loading its V rows and avoids the exp/PV work for that block.

Together, these ideas move decode attention toward a better hardware regime: streamed KV access, local state, hierarchical reduction, and physical gating of unnecessary value-path activity.

Results

The measured serial tile is real hardware evidence: HLS synthesis completed, RTL cosimulation passed, and Vivado out-of-context synthesis met timing for the serial baseline.

Using that measured baseline, we built a sequence-length-corrected ASIC projection for 4K-context decode. After correcting the bandwidth model for packed AXI usage, the serial projection reaches about 9.9K attention-only decode tokens per second. The H100 attention-only roofline in the same setup is about 7.5K tokens per second, so the projected serial datapath is about 1.32x the H100 attention-only roofline.

The block-parallel model improves the architecture by breaking the recurrence across KV blocks. Before Skip-Softmax gating, the latency-optimized block model reaches the KV bandwidth wall at about 29.8K tokens per second, or 3.98x the H100 attention-only roofline. This is modeled, not synthesized, and the aggressive low-latency configuration is area-heavy.

The Skip-Softmax extension adds value-path gating on top of that. For the 4K peaked-attention projection, the before/after looks like this:

Design point	Projected attention-only decode	H100 comparison
H100 roofline	7.5K tok/s	1.00x
Serial streaming tile	9.9K tok/s	1.32x
Block-parallel model before Skip-Softmax	29.8K tok/s	3.98x
Block-parallel + Skip-Softmax, streaming policy	43.8K tok/s	5.86x
Block-parallel + Skip-Softmax, two-pass policy	57.8K tok/s	7.72x

The two Skip-Softmax rows are modeled projections on a peaked-attention distribution. The streaming policy is more hardware-natural because it uses the running max as blocks arrive. The two-pass policy makes better skip decisions because it first finds the global max, so it should be read as an upper-bound model.
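One way to build intuition for why value-path gating can exceed the pre-gating bandwidth wall is a simple traffic model. This is a sketch under stated assumptions, not the project's projection model: it assumes K and V are the same size, that K is always read, that throughput is purely KV-bandwidth-bound at the 29.8K tok/s baseline, and that the skip fractions passed in are illustrative inputs rather than measured values.

```python
def projected_tok_per_s(base_tok_per_s, v_skip_fraction):
    """Bandwidth-bound throughput when a fraction of V reads is gated off.

    K accounts for half of the baseline traffic and is always read;
    V accounts for the other half and is read only for kept blocks.
    """
    if not 0.0 <= v_skip_fraction <= 1.0:
        raise ValueError("v_skip_fraction must be in [0, 1]")
    relative_traffic = 0.5 + 0.5 * (1.0 - v_skip_fraction)
    return base_tok_per_s / relative_traffic

# Illustrative sweep over hypothetical skip fractions.
for frac in (0.0, 0.5, 1.0):
    print(frac, round(projected_tok_per_s(29_800.0, frac)))
```

Under these assumptions the ceiling is twice the pre-gating figure, about 59.6K tok/s, reached only if every V block is skipped. Both Skip-Softmax rows in the table fall below that ceiling.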

The Skip-Softmax HLS extension also produced a valid Vivado utilization report with no black boxes. It used about 17.5K LUTs and 87 DSPs. The timing report for that extension was unconstrained, so we do not claim Fmax for it yet.

How it fits into a broader inference ASIC

KVStream is not a full transformer accelerator. It is a decode-side attention datapath that would sit next to systolic or GEMM arrays.

A broader inference chip would naturally split prefill and decode. Prefill has high arithmetic intensity and lots of parallelism across sequence positions, so dense compute fabrics are the right tool. Decode has a different shape: one query vector scans a long KV cache. A systolic array can compute the dot products, but the bottleneck becomes KV streaming, online normalization, and reduction.

In that architecture, the dense units handle QKV projection, output projection, and MLP. KVStream handles the decode attention path. The Q vector comes from the projection unit, the KV address generator streams cached K/V into KVStream, KVStream produces the per-head attention output, and that output returns to the dense compute path for the output projection.

KVStream does not replace systolic arrays. It complements them by specializing the part of decode that looks less like dense matrix multiplication and more like streaming reduction over memory.

Applications

KVStream is most relevant for low-batch, latency-sensitive, long-context decode.

That includes agentic systems with long memory, codebase and document assistants, retrieval-heavy chat systems, personalized learned-KV workloads, and latency-sensitive financial or trading agents.

These workloads share a pattern: small batch size, long context, and repeated decode over a large KV cache.

Accomplishments that we're proud of

We built a real hardware baseline, not just a model. The serial streaming attention tile matches a NumPy reference, passes RTL cosimulation, and meets timing in Vivado out-of-context synthesis.

The measured baseline gave us reason to explore block-level parallelism instead of just assuming streaming attention would be enough.

The block-parallel model gave us a clean architecture direction: summarize KV blocks with (m, l, acc[D]), then merge those summaries in a tree. That is the core dependency-breaking idea.

The Skip-Softmax extension gave us a concrete control-path story. A small block-level predicate can physically gate V memory reads, exp calls, and PV MAC activity when attention is block-sparse.

We are also proud that the project is explicit about what is measured and what is modeled. The serial tile is synthesized. The block-parallel reducer is an architecture model. The Skip-Softmax extension has hardware utilization evidence, but its timing needs a constrained run before we can make Fmax claims.

What we learned

We learned that online softmax alone is not the interesting hardware contribution. The interesting part is what happens when online softmax becomes a decode datapath with real scheduling, memory, and recurrence constraints.

The serial design showed that decode attention can be recurrence-bound, not just bandwidth-bound. The block-parallel model showed that the recurrence can be broken by treating each block’s (m, l, acc[D]) as a mergeable summary. The Skip-Softmax extension showed that a kernel-level idea can become a physical hardware gate for memory and compute activity.

We also learned that data layout matters as much as datapath design. If Q/K/V are exposed as raw int8 values over an underpacked AXI interface, the design can waste most of the available bus width. A serious decode engine needs the memory interface, local buffering, and compute datapath to be co-designed.

What's next for KVStream

The broader goal is to turn KVStream from a measured tile and architecture model into a full decode-attention subsystem.

That means exploring the design space across memory bandwidth, local SRAM staging, block size, lane count, numeric precision, reduction topology, and scheduler integration. The next generation would combine packed KV streaming, block-parallel online-softmax reduction, skip-based value gating, and ASIC-realistic softmax approximations into a unified decode engine.

Long term, KVStream points toward a specialized attention path for long-context inference: dense arrays handle the matrix-heavy parts of the transformer, while a decode-side streaming engine handles KV-cache attention with local state, hierarchical reduction, and hardware-level memory gating.