Disaggregated Serving

Split prefill and decode — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.

Today's win: you'll predict why prefill and decode fight on one GPU, how disaggregation fixes it, and the transfer cost that gates the win.

The setup

Prefill is compute-bound; decode is memory-bound (Lessons 9 & 11). They run on the same GPU today.

Step 1 — the conflict

Step 2 — the fix

Recall — cover the screen: what is disaggregated serving?
Run prefill and decode on separate, dedicated worker pools, each tuned for its bottleneck (compute vs memory). The prefill worker produces the KV cache and ships it to a decode worker, which streams the tokens. Each pool scales independently, written xPyD for x prefill workers and y decode workers.

Step 3 — the gating cost

Step 4 — when it pays off

Step 5 — where you are

In Kubernetes terms · infra bridge

Two workloads on separate node pools tuned for different profiles (compute-optimized for prefill jobs, memory-optimized for decode services); the KV handoff is cross-pool data shipping — worth it only if pod-to-pod bandwidth beats co-locating.

On YOUR cluster · context

You co-locate prefill+decode on single GPUs with --enable-chunked-prefill, which is right for 4 GPUs. Disaggregation (NVIDIA Dynamo) is the next tier at larger scale with a fast fabric; the NVLink pair that is currently down would hurt KV transfer, so fix that first.

Read this next — primary source DistServe · runnable: day12 (Dynamo), day18.

Final check — teach it back

Explain to a colleague: "Disaggregation helps when…"
…you're at scale with a fast interconnect and long prompts: prefill (compute-bound) and decode (memory-bound) each get dedicated, independently scaled hardware, so a big prefill no longer stalls decodes. The catch: shipping the KV cache between them must be cheaper than staying co-located, so it needs NVLink/InfiniBand-class bandwidth.


References

day12 Dynamo (notebook) / day18; DistServe.

Disaggregated Serving

Put prefill and decode on different hardware — and ship the KV between them.

Today's win: you'll explain why running prefill and decode on the same GPU forces a compromise, how disaggregation gives each its own tuned hardware, and the KV-transfer cost that decides whether it's worth it.

The picture: a prep kitchen and a plating line

Prefill is compute-heavy (big prep burst); decode is bandwidth-heavy (steady plating). Forcing both at one station means neither is ideal — a giant prep order blocks the plating line. Disaggregation gives prep its own station and plating its own, each tuned for its job. The cost: you must ship the prepped ingredients (the KV cache) from prep to plating.

Kitchen analogy	Serving system
the prep kitchen (heavy, bursty)	prefill workers (compute-bound)
the plating line (steady, fast)	decode workers (memory-bound)
shipping prepped trays between them	KV-cache transfer over the interconnect

1 · The conflict: two jobs, one GPU

Prefill saturates compute; decode saturates memory bandwidth (Lesson 11). On one GPU they interfere: a long prefill stalls every in-flight decode. --enable-chunked-prefill (Lesson 24) softens this by interleaving prefill chunks with decode steps, but both phases still share the same silicon, tuned for neither. [1]
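A toy timeline makes the stall concrete. The sketch below assumes a GPU that does one kind of work at a time, and the millisecond figures (an 800 ms prefill, a 20 ms decode step, 100 ms chunks) are invented for illustration, not measurements. The function names are hypothetical helpers, not part of vLLM.

```python
import math
from typing import Optional


def worst_decode_gap_ms(prefill_ms: float, decode_step_ms: float,
                        chunk_ms: Optional[float] = None) -> float:
    """Longest wait between two consecutive tokens of a running decode."""
    # Without chunking, the whole prefill runs before the next decode step.
    blocking_ms = prefill_ms if chunk_ms is None else min(chunk_ms, prefill_ms)
    return blocking_ms + decode_step_ms


def prefill_finish_ms(prefill_ms: float, decode_step_ms: float,
                      chunk_ms: Optional[float] = None) -> float:
    """Time until the prefill completes when a decode step follows each chunk."""
    if chunk_ms is None:
        return prefill_ms
    chunks = math.ceil(prefill_ms / chunk_ms)
    # One decode step is interleaved after every chunk except the last.
    return prefill_ms + (chunks - 1) * decode_step_ms


PREFILL_MS, DECODE_STEP_MS, CHUNK_MS = 800.0, 20.0, 100.0  # assumed numbers

print(worst_decode_gap_ms(PREFILL_MS, DECODE_STEP_MS))            # 820.0
print(worst_decode_gap_ms(PREFILL_MS, DECODE_STEP_MS, CHUNK_MS))  # 120.0
print(prefill_finish_ms(PREFILL_MS, DECODE_STEP_MS, CHUNK_MS))    # 940.0
```

In this toy model chunking shrinks the worst token gap from 820 ms to 120 ms, but the prefill itself now finishes at 940 ms instead of 800 ms. Both phases are still paying for sharing one GPU; a dedicated decode worker would see only its own 20 ms step.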

2 · Disaggregation: a station for each

Disaggregated serving runs dedicated prefill workers and dedicated decode workers. Prefill workers chew through prompts; decode workers stream tokens. Each pool is sized and tuned for its own bottleneck, so neither drags the other. [2]

Prefill and decode become separate pools, each tuned and scaled for its own bottleneck. The handoff is the prompt's KV cache.
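Sizing each pool on its own can be sketched as two separate capacity calculations. This is a back-of-envelope model, assuming x prefill workers and y decode workers, steady traffic, and per-worker throughputs you would have to measure yourself; `size_pools` and every number in the example are hypothetical.

```python
import math
from typing import Tuple


def size_pools(requests_per_s: float, prompt_tokens: int, output_tokens: int,
               prefill_tokens_per_s: float, decode_tokens_per_s: float,
               target_utilization: float = 0.5) -> Tuple[int, int]:
    """Return (x, y): prefill and decode worker counts for the offered load."""
    prefill_load = requests_per_s * prompt_tokens   # prompt tokens per second
    decode_load = requests_per_s * output_tokens    # generated tokens per second
    # Each pool is sized against its own throughput, independently of the other.
    x = math.ceil(prefill_load / (prefill_tokens_per_s * target_utilization))
    y = math.ceil(decode_load / (decode_tokens_per_s * target_utilization))
    return max(x, 1), max(y, 1)


# Assumed workload: 10 req/s, 300 output tokens, workers that sustain
# 20,000 prefill tokens/s and 2,000 decode tokens/s.
print(size_pools(10, 4000, 300, 20000, 2000))    # (4, 3)
print(size_pools(10, 16000, 300, 20000, 2000))   # (16, 3)
```

Quadrupling the prompt length grows only the prefill pool; the decode pool stays at three workers. On co-located GPUs the same change would force you to add whole replicas that carry both roles.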

3 · The catch: shipping the KV cache

Here's what makes or breaks it: the prefill worker's KV cache must be transferred to the decode worker over NVLink or InfiniBand, typically through a transfer library such as NIXL. If that transfer costs more than the interference it removes, disaggregation loses to simply decoding in place. So it pays off only with a fast interconnect and prompts big enough that the prefill savings outweigh the shipping. [2]
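To get a feel for the shipping cost, you can estimate the KV cache size and divide by link bandwidth. The sketch below uses the standard per-token accounting (keys plus values, per layer, per KV head); the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the two link speeds are assumptions chosen as round numbers, and the estimate ignores protocol overhead and any overlap of transfer with compute.

```python
def kv_cache_bytes(prompt_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache produced by prefilling `prompt_tokens` tokens."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * prompt_tokens


def transfer_seconds(num_bytes: int, link_gigabytes_per_s: float) -> float:
    """Idealized transfer time at the link's full bandwidth."""
    return num_bytes / (link_gigabytes_per_s * 1e9)


# Assumed model shape and an 8,192-token prompt.
size = kv_cache_bytes(8192, num_layers=32, num_kv_heads=8, head_dim=128)
print(size)                                            # 1073741824 (1 GiB)
print(round(transfer_seconds(size, 50.0) * 1000, 1))   # 21.5 ms at an assumed 50 GB/s
print(round(transfer_seconds(size, 1.25) * 1000, 1))   # 859.0 ms at an assumed 1.25 GB/s
```

With these assumptions the same 1 GiB handoff costs about 21 ms on the fast link and most of a second on the slow one, which is the difference between a negligible hop and a latency hit larger than many prefills.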

In Kubernetes terms · infra bridge

This is running two workloads on separate node pools tuned for different profiles — a compute-optimized pool for the bursty prefill jobs and a memory/bandwidth-optimized pool for the long-lived decode services. The KV transfer is cross-pool data shipping over the cluster network, so the whole thing only wins if your pod-to-pod bandwidth is fat enough to beat doing it in one place.

4 · When it wins

Disaggregation shines at large scale: many GPUs, long prompts, and a fast fabric. That is the setting NVIDIA Dynamo orchestrates (dynamic prefill/decode pools, KV routing) on systems like the GB200 NVL72 (72 GPUs in one NVLink domain). At small scale, co-located GPUs with chunked prefill are simpler and often enough. [1]
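One rough screening number is the KV transfer time as a fraction of the prefill time it follows. The sketch below is a simplification, not a decision rule: it assumes prefill time grows linearly with prompt length, a fixed per-transfer setup cost, and invented values for throughput, link speed, and KV bytes per token. `transfer_overhead_ratio` is a hypothetical helper.

```python
def transfer_overhead_ratio(prompt_tokens: int, kv_bytes_per_token: int,
                            link_gigabytes_per_s: float,
                            prefill_tokens_per_s: float,
                            setup_s: float = 0.005) -> float:
    """KV transfer time divided by prefill time for one request."""
    transfer_s = setup_s + prompt_tokens * kv_bytes_per_token / (link_gigabytes_per_s * 1e9)
    prefill_s = prompt_tokens / prefill_tokens_per_s
    return transfer_s / prefill_s


KV_BYTES_PER_TOKEN = 131072   # assumed: 128 KiB of KV per token
PREFILL_TOKENS_PER_S = 20000  # assumed prefill throughput of one worker

for link in (50.0, 1.25):          # assumed link speeds in GB/s
    for prompt in (512, 8192):     # short and long prompts
        ratio = transfer_overhead_ratio(prompt, KV_BYTES_PER_TOKEN, link,
                                        PREFILL_TOKENS_PER_S)
        print(f"{link:>5} GB/s  {prompt:>5} tokens  overhead = {ratio:.2f}x prefill")
```

Under these assumptions the fast link keeps the overhead well below the prefill time, and it shrinks further for long prompts because the setup cost is amortized. On the slow link the transfer takes longer than the prefill itself at either prompt length, so co-location with chunked prefill is the safer choice.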

On YOUR cluster: you're at the simpler tier (for now) · context

Your single-GPU vLLM deployments co-locate prefill and decode and lean on --enable-chunked-prefill to keep big RAG prefills from stalling decode, which is the right call at 4 GPUs. Disaggregation (NVIDIA Dynamo) is the next tier: worth it if you scale to many GPUs with a fast fabric and your very-long-prompt prefills start dominating. Note that the NVLink fault in your findings would directly hurt KV transfer, so fix that first.

Read this next — primary source DistServe — Zhong et al. (the disaggregation case). Runnable companion: day12 notebook (Dynamo) & day18.

Check yourself (recall, don't peek)


References

[1] NVIDIA Dynamo: disaggregated serving (day12, nvidia-dynamo.ipynb).
[2] Zhong et al., DistServe: disaggregating prefill and decoding (arXiv:2401.09670); day18.