Split prefill and decode — one guess at a time.
Prefill is compute-bound; decode is memory-bound (Lessons 9 & 11). They run on the same GPU today.
Two workloads on separate node pools tuned for different profiles (compute-optimized for prefill jobs, memory-optimized for decode services); the KV handoff is cross-pool data shipping — worth it only if pod-to-pod bandwidth beats co-locating.
You co-locate prefill and decode on single GPUs with --enable-chunked-prefill, which is the right choice for 4 GPUs. Disaggregation (NVIDIA Dynamo) is the next tier, at larger scale with a fast fabric. Your NVLink pair that is currently down would slow KV transfer, so fix that first.
Put prefill and decode on different hardware — and ship the KV between them.
Prefill is compute-heavy (big prep burst); decode is bandwidth-heavy (steady plating). Forcing both at one station means neither is ideal — a giant prep order blocks the plating line. Disaggregation gives prep its own station and plating its own, each tuned for its job. The cost: you must ship the prepped ingredients (the KV cache) from prep to plating.
| Kitchen analogy | Inference serving |
| --- | --- |
| the prep kitchen (heavy, bursty) | prefill workers (compute-bound) |
| the plating line (steady, fast) | decode workers (memory-bound) |
| shipping prepped trays between them | KV-cache transfer over the interconnect |
Prefill saturates compute; decode saturates memory bandwidth (Lesson 11). On one GPU they interfere: a long prefill stalls everyone's decode. The --enable-chunked-prefill flag (Lesson 24) softens this by interleaving prefill chunks with decode steps, but the two phases still share the same silicon, which is tuned for neither. [1]
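To make the interleaving concrete, here is a toy model of the stall a running decode sees while one long prompt is prefilled. It is only a sketch: it assumes a step costs time proportional to the tokens it processes, and the per-token cost, prompt length, and chunk size are made-up illustrative values, not measurements or vLLM defaults.

```python
# Toy model (assumption): one scheduler step costs time proportional to the
# number of prompt tokens it processes. All numbers are illustrative.

def decode_stall_us(prompt_tokens, chunk_tokens, us_per_token=50):
    """Worst-case gap (microseconds) a running decode sees during one prefill.

    chunk_tokens=None models unchunked prefill: the whole prompt runs in one
    step. Otherwise decode tokens ride along in every step, so the gap is
    bounded by the cost of a single chunk.
    """
    if chunk_tokens is None:
        step_tokens = prompt_tokens
    else:
        step_tokens = min(chunk_tokens, prompt_tokens)
    return step_tokens * us_per_token


def prefill_steps(prompt_tokens, chunk_tokens):
    """Number of scheduler steps needed to finish the prefill."""
    if chunk_tokens is None:
        return 1
    return -(-prompt_tokens // chunk_tokens)  # ceiling division


prompt = 16_000  # hypothetical long RAG prompt
for chunk in (None, 2_048):
    stall_ms = decode_stall_us(prompt, chunk) / 1000
    print(f"chunk={chunk}: steps={prefill_steps(prompt, chunk)}, worst stall={stall_ms} ms")
```

Under these assumed numbers the worst-case decode stall drops from 800 ms to about 102 ms, while the prefill now takes 8 steps instead of 1. The stall shrinks, but both phases still compete for the same GPU.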
Disaggregated serving runs dedicated prefill workers and dedicated decode workers. Prefill workers chew through prompts; decode workers stream tokens. Each pool is sized and tuned for its own bottleneck, so neither drags the other. [2]
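Sizing each pool for its own bottleneck can be sketched as a back-of-the-envelope calculation. The function below is a hypothetical helper, not part of Dynamo or vLLM; the throughput figures and the 70% utilization target are assumptions you would replace with your own benchmarks.

```python
import math


def size_pools(req_per_s, prompt_tokens, output_tokens,
               prefill_tok_per_s, decode_tok_per_s, headroom=0.7):
    """Return (prefill_workers, decode_workers) for a steady-state load.

    headroom is the target utilization: 0.7 means plan to run each
    worker at about 70% of its measured throughput.
    """
    prefill_load = req_per_s * prompt_tokens   # prompt tokens/s to prefill
    decode_load = req_per_s * output_tokens    # output tokens/s to generate
    prefill_workers = math.ceil(prefill_load / (prefill_tok_per_s * headroom))
    decode_workers = math.ceil(decode_load / (decode_tok_per_s * headroom))
    return max(1, prefill_workers), max(1, decode_workers)


# Illustrative load: 10 req/s, 4,000-token prompts, 300-token answers.
# Assumed per-worker throughput: 20,000 prefill tok/s, 2,500 decode tok/s.
print(size_pools(10, 4_000, 300, 20_000, 2_500))
```

The point of the sketch is that the two counts move independently: longer prompts grow only the prefill pool, and longer answers grow only the decode pool.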
Here's what makes or breaks it: the prefill worker's KV cache must be transferred to the decode worker over NVLink, InfiniBand, or a layer like NIXL. If that transfer costs more than just decoding in place, disaggregation loses. So it pays off only with a fast interconnect and prompts big enough that the prefill savings outweigh the shipping. [2]
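A rough way to estimate the shipping cost: the KV cache holds one key and one value vector per layer, per KV head, per token. The sketch below assumes an fp16 cache and a model shape in the range of an 8B-class model with grouped-query attention (32 layers, 8 KV heads, head dimension 128). The link bandwidths are illustrative round numbers, not specifications of any particular NVLink or InfiniBand generation, and the estimate ignores latency and protocol overhead.

```python
def kv_cache_bytes(prompt_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Size of the KV cache for one prompt (bytes_per_elem=2 means fp16)."""
    # Leading 2 = one K tensor and one V tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * prompt_tokens


def transfer_ms(num_bytes, link_gb_per_s):
    """Time to move num_bytes over a link of link_gb_per_s (GB/s, 1e9 bytes)."""
    return num_bytes / (link_gb_per_s * 1e9) * 1e3


# Assumed model shape and prompt length (hypothetical, for illustration).
size = kv_cache_bytes(prompt_tokens=8_192, num_layers=32,
                      num_kv_heads=8, head_dim=128)
print(f"KV cache: {size / 2**30:.2f} GiB")

for gb_per_s in (50, 12.5):  # assumed effective bandwidths
    print(f"{gb_per_s} GB/s link: {transfer_ms(size, gb_per_s):.1f} ms")
```

With these assumptions an 8,192-token prompt produces 1 GiB of KV cache, which takes roughly 21 ms at 50 GB/s and roughly 86 ms at 12.5 GB/s. Compare that figure against the decode stall you would otherwise absorb by prefilling in place; if the transfer is slower, co-locating wins.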
This is running two workloads on separate node pools tuned for different profiles — a compute-optimized pool for the bursty prefill jobs and a memory/bandwidth-optimized pool for the long-lived decode services. The KV transfer is cross-pool data shipping over the cluster network, so the whole thing only wins if your pod-to-pod bandwidth is fat enough to beat doing it in one place.
Disaggregation shines at large scale: many GPUs, long prompts, and a fast fabric. That is exactly what NVIDIA Dynamo orchestrates (dynamic prefill/decode pools, KV routing) on systems like the GB200 NVL72 (72 GPUs as one NVLink domain). At small scale, co-located GPUs with chunked prefill are simpler and often enough. [1]
Your single-GPU vLLM deployments co-locate prefill and decode and lean on --enable-chunked-prefill to keep big RAG prefills from stalling decode, which is the right call at 4 GPUs. Disaggregation (NVIDIA Dynamo) is the next tier: worth it if you scale to many GPUs with a fast fabric and your very-long-prompt prefills start dominating. Note that your NVLink fault (see findings) would directly hurt KV transfer, so fix that first.