Our method converts position-independent chunk overlap into prefix-aligned KV reuse without modifying the prefix-cache mechanism.
Abstract
Retrieval-augmented generation improves answer quality by adding external documents to the prompt, but the retrieved context increases prefill cost, GPU memory pressure, and time-to-first-token (TTFT). Although many RAG requests retrieve overlapping document chunks, standard prefix caching can reuse KV cache only when the token prefix is identical, so reuse opportunities are lost when the same chunks appear in different positions.
We observe that retrieved chunks are typically semantically self-contained, making chunk order a systems-level optimization variable. Our method reorders chunks using recent access-frequency statistics, places commonly shared chunks earlier in the prompt, indexes reusable chunk-prefix KV in a CP-Tree, and removes redundant chunks across turns in a conversation. Implemented on vLLM, our proposed design improves TTFT by 1.2–1.6× across evaluated workloads without measurable degradation in response quality.
Key Observations
Across realistic RAG workloads, 43% of retrieved chunks are shared with prior requests, yet only 11% are prefix-aligned and reusable under standard prefix caching. The gap comes not from a lack of overlap but from positional misalignment that prevents the overlap from being reused.
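The difference between chunk overlap and prefix-aligned overlap can be illustrated with a small sketch. The chunk IDs and the helper `shared_prefix_len` below are hypothetical and only model prompts as lists of chunk IDs; they are not taken from the implementation.

```python
def shared_prefix_len(a, b):
    """Count the leading chunk IDs that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Two requests that retrieve mostly the same chunks, in different orders.
earlier = ["c1", "c2", "c3", "c4"]
later = ["c3", "c1", "c5", "c2"]

# Overlap ignoring position: chunks retrieved by both requests.
overlap = len(set(earlier) & set(later))

# Overlap that prefix caching can exploit: identical leading chunks only.
reusable = shared_prefix_len(earlier, later)

# Placing the shared chunks first, in the earlier request's order,
# turns the same overlap into a reusable prefix.
shared_first = [c for c in earlier if c in later]
aligned = shared_first + [c for c in later if c not in shared_first]
reusable_after = shared_prefix_len(earlier, aligned)
```

Here three chunks overlap, none of them is reusable in the original order, and all three become reusable once the later request is reordered.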
Each retrieved chunk (128–512 tokens) is semantically self-contained. Experiments across LLaMA-3-8B, Mistral-7B, and Qwen-7B show negligible change in BERTScore and F1 over 100 random chunk permutations, which makes chunk order a free optimization variable.
In multi-turn conversations, 21% of chunks per turn already appeared in an earlier turn and are needlessly recomputed. Removing these duplicates causes no statistically significant quality drop (paired t-test, p = 0.47), providing a free efficiency gain.
Design
[Figure: design overview. Labels: Chunks, CA-Table, Hotness, CP-Tree, Suffix]
The reordering step tracks chunk access frequencies over a sliding window. Frequently reused chunks are promoted to earlier positions so that similar requests share a common prefix.
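A minimal sketch of sliding-window frequency tracking and hotness-based reordering follows. The class name `HotnessTracker`, the window size, and the tie-break by chunk ID are assumptions made for illustration; the source does not specify how ties or window eviction are handled.

```python
from collections import Counter, deque


class HotnessTracker:
    """Hypothetical tracker: counts chunk accesses over the last `window` accesses."""

    def __init__(self, window=1000):
        self.window = window
        self.recent = deque()    # chunk IDs in access order, oldest first
        self.counts = Counter()  # access count per chunk within the window

    def record(self, chunk_ids):
        for cid in chunk_ids:
            self.recent.append(cid)
            self.counts[cid] += 1
            if len(self.recent) > self.window:
                old = self.recent.popleft()
                self.counts[old] -= 1
                if self.counts[old] == 0:
                    del self.counts[old]

    def reorder(self, chunk_ids):
        # Hotter chunks first; ties broken by chunk ID (an assumption)
        # so that different requests order equal-hotness chunks the same way.
        return sorted(chunk_ids, key=lambda cid: (-self.counts[cid], cid))


tracker = HotnessTracker(window=100)
tracker.record(["a", "b", "c"])
tracker.record(["b", "a", "d"])
tracker.record(["a", "e"])

first = tracker.reorder(["d", "b", "a"])
second = tracker.reorder(["c", "a", "b"])
```

Both reordered requests now begin with the same two hot chunks, so the second one can reuse the prefix computed for the first.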
The CP-Tree indexes cached chunk prefixes together with their KV metadata (storage location, token span). It supports longest-prefix lookup and sub-prefix reuse from longer cached paths.
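One way to picture such an index is a trie keyed by chunk ID, where each node stores the metadata for the prefix ending at that chunk. The sketch below is an assumption about the general shape, not the actual CP-Tree: the names `ChunkPrefixTree`, `KVMeta`, and the `location` string are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KVMeta:
    location: str      # hypothetical storage handle for the cached KV
    token_span: tuple  # (start, end) token offsets of this chunk in the prefix


@dataclass
class Node:
    meta: Optional[KVMeta] = None
    children: dict = field(default_factory=dict)


class ChunkPrefixTree:
    def __init__(self):
        self.root = Node()

    def insert(self, chunk_ids, chunk_lens, location):
        """Register a cached prefix; every node on the path becomes reusable."""
        node, start = self.root, 0
        for cid, n in zip(chunk_ids, chunk_lens):
            node = node.children.setdefault(cid, Node())
            if node.meta is None:
                node.meta = KVMeta(location, (start, start + n))
            start += n

    def longest_prefix(self, chunk_ids):
        """Return metadata for the longest cached prefix of `chunk_ids`."""
        node, matched = self.root, []
        for cid in chunk_ids:
            child = node.children.get(cid)
            if child is None or child.meta is None:
                break
            matched.append(child.meta)
            node = child
        return matched


tree = ChunkPrefixTree()
tree.insert(["a", "b", "c"], [128, 256, 512], "block-0")

hit = tree.longest_prefix(["a", "b", "x"])  # reuses a sub-prefix of the cached path
miss = tree.longest_prefix(["b", "a"])      # same chunks, misaligned: no reuse
```

Because every node on an inserted path carries metadata, a request matching only the first two chunks of a longer cached path can still reuse that sub-prefix.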
Cross-turn deduplication removes chunks already seen in earlier turns of the same conversation, eliminating repeated KV computation for persistent context.
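The deduplication step can be sketched as a per-conversation set of already-seen chunk IDs. The function name `dedup_turn` and the use of chunk IDs as the identity key are assumptions for illustration.

```python
def dedup_turn(retrieved, seen):
    """Return the chunks not yet seen in this conversation, preserving order.

    `seen` is a per-conversation set of chunk IDs and is updated in place.
    """
    fresh = []
    for cid in retrieved:
        if cid not in seen:
            seen.add(cid)
            fresh.append(cid)
    return fresh


seen = set()
turn1 = dedup_turn(["a", "b", "c"], seen)       # first turn: everything is new
turn2 = dedup_turn(["b", "d", "a", "e"], seen)  # "a" and "b" were already in context
```

In the second turn only the two new chunks are added to the prompt, so the chunks carried over from the first turn are not recomputed.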
The method reshapes inputs so that more requests satisfy existing prefix-cache reuse conditions. It is compatible with vLLM and complementary to CacheBlend.
Evaluation Highlights
Evaluated on HotpotQA, 2WikiMQA, SQuAD, TriviaQA, ShareGPT, and MTRAG across LLaMA-3-8B, Mistral-7B, and Qwen2.5-7B.
Takeaway
Our proposed method treats retrieved chunks as a reorderable systems abstraction. By aligning shared chunks into common prefixes and deduplicating repeated conversation context, it improves the practical effectiveness of KV prefix caching for RAG serving without requiring changes to the LLM architecture or the prefix-cache mechanism itself.
Citation
To appear...