Research Explainer · Hu et al. (2025)

EPIC reuses KV caches across any prefix, by recomputing only a handful of tokens per chunk

Position-Independent Caching lets language models reuse document KV vectors regardless of what comes before them. EPIC's LegoLink algorithm fixes the resulting attention sink with O(kN) work instead of O(N²).

Published May 2025

lower Time-To-First-Token vs CacheBlend-15 under multi-request workloads

higher throughput than CacheBlend-15 on Llama 3.1 8B at matched context cache ratios

O(kN) LegoLink recomputation cost (k initial tokens per chunk) versus CacheBlend's O(N²)

Modern LLM prompts are mostly boilerplate. A multi-document QA request might be ten thousand tokens of retrieved passages followed by a fifty-token question. Existing serving systems cache Key-Value (KV) vectors for repeated prefixes so the prefill stage does not have to redo that work. That trick only fires when the prefix matches exactly, token for token.

In retrieval-augmented generation and few-shot setups, the same document keeps appearing in different prompts, but rarely in the same position. A passage that arrived second yesterday arrives fifth today, behind a different system message. Standard prefix caching invalidates the whole thing and recomputes from scratch. The immutable content is right there in cache, and the system cannot touch it.

TTFT vs context length: linear LegoLink, quadratic everyone else

Hu et al. (2025), Figure 9. Synchronous workload on a single A100-80GB with a fixed 512-token chunk size. CacheBlend-15 OOMs near 35k tokens; LegoLink-16 keeps going past 50k. Values traced from the published figure.

Comparison of four position-independent caching algorithms (Naive, Fully Recompute, CacheBlend, LegoLink), each shown as cached KV chunks plus a query, the tokens it recomputes, its answer with a correct/incorrect mark, and its resulting attention-map heatmap.
How the four caching strategies differ. Naive reuses every chunk's cache untouched and gets the answer wrong ('Derek'); LegoLink recomputes only the first few tokens of each chunk and recovers the correct answer ('Chrysan Company') at a fraction of the cost. Adapted from Hu et al. (2025), Figure 4.

EPIC formalises Position-Independent Caching (PIC) as a compile-and-link pipeline, borrowed from how C source files become executables. The compile step submits each immutable chunk on its own, with position IDs starting from zero, and stores the resulting KV vectors under a cache ID. This is analogous to producing a position-independent object file.

The link step is where the real work happens. When a request arrives with a list of cache IDs and a user query, EPIC fetches the cached KV vectors, stitches them together at the correct positions, and recomputes a subset of vectors so attention behaves correctly. The question is which subset, and how few you can get away with.

Figure 3 of the paper shows the system architecture: a KVCompile module that handles the offline pre-computation, a KVCache store (HBM, DRAM, and local filesystem, in the style of Mooncake), and a KVLink prefill path that runs alongside standard prefill and houses the LegoLink algorithm.

Compiling each chunk in isolation creates a peculiar pathology. Because position IDs always start at zero, the first few tokens of every chunk become disproportionately important to the attention mechanism, a phenomenon known as the attention sink. Figure 4 of the paper visualises this directly: in the Naive reuse case, four bright vertical stripes appear at the start of each chunk, and attention never reaches the actual answer further down.

LegoLink's insight is that you do not need to fix the whole prompt to fix this. You need to fix the sinks. The algorithm recomputes only the first k tokens of each chunk (except the first chunk, which is genuinely at the start). Those k tokens learn that they are not, in fact, the beginning of the sequence, lose their sink behaviour, and let later tokens attend to the right places.

The result is recomputation cost of O(kN) where k is small and fixed (typically 2 to 32) rather than CacheBlend's O(N²). LegoLink also uses static sparsity: the tokens to recompute are known up front, with no first-layer probe required. CacheBlend's dynamic selection of 15% of tokens runs an entire first-layer pass to pick them, which Appendix A of the paper shows costs 16.4% to 63.6% of total TTFT on its own.

The runtime cost of dynamic sparsity is not a small constant. Figure 10 of the paper breaks down CacheBlend-15's TTFT layer by layer across six datasets and three models. The second transformer layer, which is where CacheBlend picks its 15% of tokens by comparing attention maps, dominates everywhere it matters.

DatasetMistral 7BLlama 3.1 8BYi 9B
Needle in a Haystack32.33%32.31%28.82%
HotpotQA58.40%51.96%52.11%
MultiNews20.05%17.68%16.37%
SAMSum49.49%42.31%44.11%
MuSiQue63.56%58.79%58.39%
2WikiMQA40.67%35.97%35.68%
Hu et al. (2025), Figure 10. Share of total TTFT spent on CacheBlend-15's first-layer token-selection pass, by dataset and model.

The complexity argument shows up directly when you stretch the prompt. Figure 9 of the paper sends synchronous requests of increasing length to all three algorithms with a fixed 512-token chunk size. Fully Recompute and CacheBlend-15 both follow the expected quadratic curve. LegoLink-16 grows close to linearly, since k is fixed and only N changes.

The memory consequences are equally stark. CacheBlend-15 runs out of memory on a single A100-80GB at roughly 35,000 tokens, because its intermediate tensors scale with the quadratic recomputation. LegoLink-16 keeps running comfortably past 50,000.

The trade is favourable but not free. Across six datasets and three models, LegoLink-2 (just two recomputed tokens per chunk) keeps accuracy drops within 0 to 7% of Fully Recompute, while cutting TTFT by up to 3×. CacheBlend-1 or CacheBlend-5, recomputing a similar absolute number of tokens, can lose up to 80% of FR's accuracy on the same workloads.

An even more extreme variant, LegoLink-0, prepends four dummy tokens during compile and discards their KV vectors, then skips link-step recomputation entirely. It works surprisingly often: the attention sink is neutralised at compile time, and the runtime overhead is zero. The authors flag one remaining failure mode across all sparsity-based methods, including StreamingLLM, H2O, and Quest: outputs sometimes start correctly and then ramble, dragging F1 and ROUGE-L scores down without being technically wrong.

WHY IT MATTERS

The accepted wisdom for PIC was that the link step needed quadratic work to maintain accuracy. EPIC shows it only needs to neutralise the attention sink at each chunk boundary, which is a constant per chunk. That collapses an O(N²) problem into O(kN) and turns long-context RAG from a memory-management nightmare into ordinary serving.

Reference

Hu, J., Huang, W., Wang, W., Wang, H., Hu, T., Zhang, Q., Feng, H., Chen, X., Shan, Y., & Xie, T. (2025). EPIC: Efficient Position-Independent Caching for Serving Large Language Models. Proceedings of the 42nd International Conference on Machine Learning, PMLR 267. https://arxiv.org/abs/2410.15332