Reuse the shared opening — one guess at a time.
Your requests mostly start the same way — same system prompt, same RAG context, same chat history. Re-prefilling that shared opening every time feels wasteful. Can you avoid it?
Prefix caching = the image-layer cache (shared base layers built once). Cache-aware routing = session affinity / consistent-hash to the replica with the warm cache — round-robin scatters and misses, exactly like sending a stateful session to the wrong pod.
--enable-prefix-caching is ON, and your traffic is 19–47:1
prefill-heavy (big shared RAG prompts) — the ideal case. Same mechanism as the prompt caching
bridge in Lesson 12, minus the TTL/price. Capture more with cache-aware routing (Lesson 25).
Reuse the KV of a shared prefix — and route so the cache actually hits.
Tons of your requests start the same way — the same system prompt, the same RAG context, the same chat history so far. Re-prefilling that shared opening every time is like re-chopping the same onions for every order. Prefix caching preps the shared base once and reuses its KV for everyone who starts the same way.
| Kitchen analogy | Inference concept |
| --- | --- |
| the shared base sauce, made once | cached prefix KV (reused across requests) |
| fridge → pantry → cold storage | KV hierarchy: VRAM → host RAM → SSD |
| send the order to the station with the sauce | cache-aware routing |
System prompts, retrieved RAG documents, and multi-turn history all repeat across requests. Since the KV entry for a token depends only on that token and the tokens before it, identical prefixes produce identical KV. So you can compute it once and reuse it. [1]
The engine hashes each incoming prefix; on a hit, it reuses the cached KV blocks and
skips re-prefilling them, so the user's time
to first token (TTFT) drops to just processing the new suffix. vLLM does this with
--enable-prefix-caching; SGLang generalizes it with RadixAttention (a radix
tree of all live prefixes, so even partial overlaps share). [2]
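To make the hash-and-reuse step concrete, here is a minimal sketch of a block-level prefix cache. It is a toy model, not vLLM's or SGLang's actual implementation: the names `PrefixCache`, `block_hashes`, and `prefill`, the block size of 4, and the string stand-ins for KV blocks are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per KV block (hypothetical; chosen small for the demo)


def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash every full block chained with the hash of the block before it.

    Chaining means a block's hash identifies the whole prefix up to and
    including that block, not just the block's own tokens.
    """
    hashes: List[str] = []
    parent = ""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Toy cache: maps a chained block hash to a stand-in for its KV block."""

    def __init__(self) -> None:
        self.blocks: Dict[str, str] = {}

    def lookup(self, tokens: List[int]) -> int:
        """Return how many leading tokens already have cached KV."""
        hit = 0
        for h in block_hashes(tokens):
            if h not in self.blocks:
                break  # first miss ends the reusable prefix
            hit += BLOCK_SIZE
        return hit

    def insert(self, tokens: List[int]) -> None:
        for i, h in enumerate(block_hashes(tokens)):
            self.blocks.setdefault(h, f"kv-block-{i}")


def prefill(cache: PrefixCache, tokens: List[int]) -> Tuple[int, int]:
    """Return (tokens reused from cache, tokens that still need prefill)."""
    reused = cache.lookup(tokens)
    cache.insert(tokens)
    return reused, len(tokens) - reused


if __name__ == "__main__":
    cache = PrefixCache()
    system_prompt = list(range(10))                      # shared opening
    print(prefill(cache, system_prompt + [100, 101]))    # (0, 12): cold cache
    print(prefill(cache, system_prompt + [200, 201, 202]))  # (8, 5): two blocks reused
```

In this sketch the second request reuses only the two full blocks that match exactly; the third block mixes shared and new tokens, so it is recomputed. That block granularity is a property of this toy, and real engines may share prefixes at a different granularity.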
Prefix caching is the image-layer cache: identical base layers are pulled and built once, and every image that shares them starts fast — only the top (changed) layer is new work. A cache miss is a cold pull. Which is exactly why the next piece matters…
Cached prefixes compete for scarce VRAM. So engines tier them: hottest in GPU VRAM, spilled to host RAM, then local SSD, then networked storage, each bigger but slower. There's a race: if fetching a cached block from a lower tier is slower than just re-prefilling it, you re-prefill instead. [1]
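The fetch-versus-recompute race can be sketched as a simple time comparison. Every number below (bandwidths, latencies, the 0.5 GB prefix size, the 10,000 tokens/s prefill rate) is a made-up placeholder rather than a measurement, and the `Tier` type and function names are hypothetical; substitute figures from your own hardware.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    bandwidth_gb_s: float  # assumed effective read bandwidth toward the GPU
    latency_ms: float      # assumed fixed access latency


def fetch_ms(tier: Tier, size_gb: float) -> float:
    """Estimated time to pull a cached KV span of size_gb from this tier."""
    return tier.latency_ms + size_gb / tier.bandwidth_gb_s * 1000.0


def recompute_ms(num_tokens: int, prefill_tokens_per_s: float) -> float:
    """Estimated time to re-prefill the same span from scratch."""
    return num_tokens / prefill_tokens_per_s * 1000.0


def should_fetch(tier: Tier, size_gb: float, num_tokens: int,
                 prefill_tokens_per_s: float) -> bool:
    """Fetch only if it beats recomputing; otherwise re-prefill."""
    return fetch_ms(tier, size_gb) < recompute_ms(num_tokens, prefill_tokens_per_s)


if __name__ == "__main__":
    tiers = [
        Tier("host RAM", bandwidth_gb_s=25.0, latency_ms=0.05),
        Tier("local SSD", bandwidth_gb_s=3.0, latency_ms=0.1),
        Tier("network storage", bandwidth_gb_s=0.5, latency_ms=1.0),
    ]
    for tier in tiers:
        choice = "fetch" if should_fetch(tier, 0.5, 4000, 10_000.0) else "re-prefill"
        print(f"{tier.name}: {choice}")
```

With these placeholder numbers, the two faster tiers win the race and the slowest tier loses it. The crossover point moves with prefix length, model size, and GPU speed, so treat the comparison as the idea, not the thresholds.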
Here's the operational catch you'll feel: prefix caches are per replica. If your load balancer sends a request to a replica that doesn't hold its prefix, it's a miss: full re-prefill, no savings. So routing must be cache-aware: hash the prefix and send matching requests to the same replica (covered in Lesson 25). [2]
This is session affinity / consistent-hash routing. A round-robin Service
scatters requests and tanks your hit rate; sessionAffinity on the Service, or an Ingress that
hashes on the prefix, pins related requests to the replica that already has the warm cache. It's the
same reason you route a user's session to the pod holding their state.
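As a rough sketch of "hash the prefix and send matching requests to the same replica", the snippet below uses rendezvous (highest-random-weight) hashing, one simple member of the consistent-hashing family. This is an illustrative choice, not necessarily what Lesson 25 or any particular router uses. The function names, the replica names, and the 8-token routing prefix are assumptions.

```python
import hashlib
from typing import List, Sequence


def prefix_key(tokens: Sequence[int], prefix_len: int = 8) -> str:
    """Hash only the first prefix_len tokens, so suffixes don't affect routing."""
    head = tokens[:prefix_len]
    return hashlib.sha256(",".join(map(str, head)).encode()).hexdigest()


def pick_replica(key: str, replicas: List[str]) -> str:
    """Rendezvous hashing: score every replica for this key, highest wins.

    Removing a replica only remaps the keys that were assigned to it;
    all other keys keep their replica and its warm cache.
    """
    def score(replica: str) -> bytes:
        return hashlib.sha256(f"{key}|{replica}".encode()).digest()

    return max(replicas, key=score)


def route(tokens: Sequence[int], replicas: List[str], prefix_len: int = 8) -> str:
    return pick_replica(prefix_key(tokens, prefix_len), replicas)


if __name__ == "__main__":
    replicas = ["replica-0", "replica-1", "replica-2", "replica-3"]  # hypothetical
    shared = list(range(8))  # stands in for a shared system prompt
    first = route(shared + [1, 2, 3], replicas)
    second = route(shared + [9, 9], replicas)
    print(first == second)  # True: same prefix, same replica
```

A production router would also need to weigh load: pinning every request with a popular prefix to one replica can create a hot spot, so hashing is usually combined with some balancing signal.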
Your vLLM runs --enable-prefix-caching (confirmed in the server
args). And your traffic is 19–47:1 prefill-heavy (classic RAG with big shared system prompts) —
exactly the workload where prefix caching is a huge TTFT win. This is the same mechanism as the
prompt caching bridge in Lesson 12,
just without the TTL/billing wrapper. Next step to capture more of it: cache-aware routing (Lesson 25).