2026

Chunk-Level KV Cache Reuse for Efficient RAG Serving

A serving-system approach that restructures retrieved chunks to unlock direct prefix-cache reuse in retrieval-augmented generation.

Figure: High chunk overlap, weak prefix overlap. Our proposed design reorders retrieved chunks so existing prefix caching can reuse KV directly.
Original RAG requests: Req 1: C7 C9 C5 C6; Req 2: C5 C7 C9 C10 (no prefix match, KV recomputed).
After frequency-guided reordering: Req 1: C7 C9 C5 C6; Req 2: C7 C9 C5 C10 (shared prefix C7 C9 C5, KV reused directly).

Our method converts position-independent chunk overlap into prefix-aligned KV reuse without modifying the prefix-cache mechanism.

Abstract

Retrieval-augmented generation (RAG) improves answer quality by adding external documents to the prompt, but the retrieved context increases prefill cost, GPU memory pressure, and time-to-first-token (TTFT). Although many RAG requests retrieve overlapping document chunks, standard prefix caching can reuse the KV cache only when the token prefix is identical, so reuse opportunities are lost when the same chunks appear in different positions.

We observe that retrieved chunks are typically semantically self-contained, making chunk order a systems-level optimization variable. Our proposed design reorders chunks using recent access-frequency statistics, places commonly shared chunks earlier in the prompt, indexes reusable chunk-prefix KV in a Chunk-Prefix Tree (CP-Tree), and removes redundant chunks across turns in a conversation. Implemented on vLLM, it improves TTFT by 1.2–1.6× across evaluated workloads without measurable degradation in response quality.

Key Observations

1 High Redundancy, Low Prefix Reuse

Across realistic RAG workloads, 43% of retrieved chunks are shared with prior requests, yet only 11% are prefix-aligned and reusable under standard prefix caching. The gap comes not from a lack of overlap but from positional misalignment: shared chunks appear at different positions, so they do not form a common prefix.

Chunk redundancy across requests is high, but prefix-aligned chunk-KV reuse is limited due to positional misalignment.
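
The gap between total and prefix-aligned overlap can be illustrated with a minimal sketch. It assumes each request is a list of chunk IDs and compares a request against a single earlier request; the helper name `overlap_stats` is hypothetical and is not part of the system described here.

```python
def overlap_stats(prev, cur):
    """Return (shared, prefix_aligned) chunk counts for `cur` against one earlier request.

    shared:         chunks present in both requests, at any position
    prefix_aligned: length of the common leading run of chunks, which is
                    what identical-prefix caching can actually reuse
    """
    shared = len(set(prev) & set(cur))
    prefix_aligned = 0
    for a, b in zip(prev, cur):
        if a != b:
            break
        prefix_aligned += 1
    return shared, prefix_aligned


# Requests taken from the overview figure.
req1 = ["C7", "C9", "C5", "C6"]
req2 = ["C5", "C7", "C9", "C10"]
req2_reordered = ["C7", "C9", "C5", "C10"]

original = overlap_stats(req1, req2)            # (3, 0): three shared chunks, none reusable
aligned = overlap_stats(req1, req2_reordered)   # (3, 3): the same three chunks, all reusable
```

The chunk sets are identical in both cases; only the order changes the reusable prefix length.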
Figure: Chunk overlap across requests, prefix-aligned vs. total shared, on HotpotQA, 2WikiMQA, SQuAD, and TriviaQA (y-axis: percentage). Average: 11% prefix-aligned, 43% total.

2 Chunk Order Doesn't Affect Quality

Figure: BERTScore per request (request IDs 0–18), baseline order vs. 100 random permutations (mean ± std). Lines overlap closely across all models.

Each retrieved chunk (128–512 tokens) is semantically self-contained. Experiments across LLaMA-3-8B, Mistral-7B, and Qwen2.5-7B show negligible BERTScore and F1 change over 100 random chunk permutations, making chunk order a free optimization variable.

Chunk order has negligible impact on generation quality when chunks are semantically self-contained (128–512 tokens per chunk).

3 Multi-Turn RAG Amplifies Redundancy

In multi-turn conversations, 21% of chunks per turn already appeared in an earlier turn and are needlessly recomputed. Removing these duplicates causes no statistically significant quality drop (paired t-test, p = 0.47), providing a free efficiency gain.

In multi-turn RAG, repeated chunk retrieval leads to redundant KV computation that can be eliminated without affecting generation quality.
Figure: Repeated chunks across turns (top; example: Turn 1 retrieves C1 C2 C3 and Turn 2 retrieves C1 C2 C7, so C1 and C2 are repeated) and the share of repeated chunks per turn (bottom): 24% on MTRAG, 18% on ShareGPT, 21% on average.

Design

Pipeline: Retrieve Chunks → Update CA-Table → Reorder by Hotness → Lookup CP-Tree → Reuse KV + Suffix

Chunk-Access Table (CA-Table): Recent access statistics

Tracks chunk frequencies over a sliding window. Frequently reused chunks are promoted to earlier positions so similar requests share a common prefix.
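
A minimal sketch of this idea, assuming the window is measured in requests and that ties are broken by chunk ID so equally hot chunks always land in the same order. The class name `ChunkAccessTable`, its methods, and the tie-breaking rule are illustrative assumptions, not the actual implementation. The two middle requests in the example are made up to give the chunks different counts.

```python
from collections import Counter, deque


class ChunkAccessTable:
    """Hypothetical CA-Table: counts chunk accesses over the last `window` requests."""

    def __init__(self, window=1000):
        self.window = window
        self.recent = deque()    # chunk-ID lists of the requests still inside the window
        self.counts = Counter()  # chunk ID -> number of accesses inside the window

    def record(self, chunks):
        """Add one request and evict the oldest request once the window is full."""
        self.recent.append(list(chunks))
        self.counts.update(chunks)
        if len(self.recent) > self.window:
            for c in self.recent.popleft():
                self.counts[c] -= 1
                if self.counts[c] == 0:
                    del self.counts[c]

    def reorder(self, chunks):
        """Hottest chunks first; ties broken by chunk ID (an assumption of this sketch)."""
        return sorted(chunks, key=lambda c: (-self.counts[c], c))


table = ChunkAccessTable(window=1000)
req1 = ["C7", "C9", "C5", "C6"]
req2 = ["C5", "C7", "C9", "C10"]

table.record(req1)
table.record(["C7", "C9", "C1"])  # made-up traffic
table.record(["C7", "C2"])        # made-up traffic
table.record(req2)

req1_order = table.reorder(req1)  # ['C7', 'C9', 'C5', 'C6']
req2_order = table.reorder(req2)  # ['C7', 'C9', 'C5', 'C10']
```

With these counts, both requests start with C7 C9 C5, so the second one can reuse the first one's cached prefix under ordinary prefix caching.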

Chunk-Prefix Tree (CP-Tree): Reusable prefix metadata

Indexes cached chunk-prefixes with KV metadata (storage location, token span). Supports longest-prefix lookup and sub-prefix reuse from longer cached paths.
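
One way to picture such an index is a trie keyed by chunk ID, sketched below. This is an illustration under stated assumptions, not the actual CP-Tree: the names `Node`, `ChunkPrefixTree`, `insert`, and `longest_prefix` are hypothetical, the storage location is an opaque string, token spans count only chunk tokens, and the chunk lengths in the example are made up.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    children: dict = field(default_factory=dict)  # chunk ID -> Node
    kv_location: Optional[str] = None             # where the KV for this prefix is stored
    token_end: int = 0                            # the prefix covers tokens [0, token_end)


class ChunkPrefixTree:
    """Hypothetical trie over chunk-ID sequences with per-prefix KV metadata."""

    def __init__(self):
        self.root = Node()

    def insert(self, chunks, chunk_lens, kv_location):
        """Register a cached chunk sequence; every prefix of it becomes reusable."""
        node, end = self.root, 0
        for chunk, length in zip(chunks, chunk_lens):
            node = node.children.setdefault(chunk, Node())
            end += length
            node.token_end = end
            if node.kv_location is None:  # keep the first cached copy
                node.kv_location = kv_location

    def longest_prefix(self, chunks):
        """Return (matched_chunks, kv_location, token_end) for the longest cached prefix."""
        node, depth = self.root, 0
        for chunk in chunks:
            child = node.children.get(chunk)
            if child is None:
                break
            node, depth = child, depth + 1
        return depth, node.kv_location, node.token_end


tree = ChunkPrefixTree()
tree.insert(["C7", "C9", "C5", "C6"], [200, 150, 300, 128], "kv-block-0")

hit = tree.longest_prefix(["C7", "C9", "C5", "C10"])   # (3, 'kv-block-0', 650)
miss = tree.longest_prefix(["C5", "C7", "C9", "C10"])  # (0, None, 0)
```

The hit shows sub-prefix reuse: only the four-chunk path was cached, yet its first three chunks (650 tokens) serve a request that diverges at the fourth chunk.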

Conversation Deduplication: Multi-turn reuse

Removes chunks already seen in earlier turns of the same conversation, eliminating repeated KV computation for persistent context.
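
A minimal sketch of per-conversation deduplication, assuming chunks are identified by stable IDs and that a set of seen IDs is kept for each conversation. The function name `dedup_turn` is hypothetical.

```python
def dedup_turn(seen, retrieved):
    """Return the chunks of this turn not seen in earlier turns, preserving order.

    `seen` is the conversation's set of chunk IDs and is updated in place.
    """
    fresh = []
    for chunk in retrieved:
        if chunk not in seen:
            seen.add(chunk)
            fresh.append(chunk)
    return fresh


# Turns taken from the multi-turn figure.
seen = set()
turn1 = dedup_turn(seen, ["C1", "C2", "C3"])  # ['C1', 'C2', 'C3']
turn2 = dedup_turn(seen, ["C1", "C2", "C7"])  # ['C7']
```

In the second turn only C7 needs new KV computation, since C1 and C2 are already part of the conversation context.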

Compatibility: Built on prefix caching

Reshapes inputs so more requests satisfy existing prefix-cache reuse conditions. Compatible with vLLM and complementary to CacheBlend.

Evaluation Highlights

1.2–1.6× TTFT speedup
19–38% KV computation eliminated
No measurable quality loss

Evaluated on HotpotQA, 2WikiMQA, SQuAD, TriviaQA, ShareGPT, and MTRAG across LLaMA-3-8B, Mistral-7B, and Qwen2.5-7B.

Takeaway

Our proposed method treats retrieved chunks as a reorderable systems abstraction. By aligning shared chunks into common prefixes and deduplicating repeated conversation context, it improves the practical effectiveness of KV prefix caching for RAG serving without requiring changes to the LLM architecture or the prefix-cache mechanism itself.

Citation

To appear...