The 30× Mystery

Prefill vs decode — but this time, you work it out before I tell you.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.

Today's win: by the end you'll explain why decode is ~30× slower than prefill on your own H100s — and you'll have predicted almost the whole answer yourself, one step at a time.

The mystery

Here are two real numbers from your cluster right now (Qwen3.5-27B, same GPUs, same model, one request):

prefill (reading the prompt): ~3,288 tokens/sec
decode (writing the answer): ~108 tokens/sec — about 30× slower

Step 1 — what does decode actually do?

Decode writes the answer one token at a time. Think about what the GPU needs for each new token.

Step 2 — so what's the bottleneck?

You've established that every token re-reads the whole model. Now: what limits how fast that can go?

Recall — cover the screen: in one line, why is decode memory-bound?
Each token re-reads all the weights (huge bytes from HBM) but does only one token's worth of math, so bandwidth, not FLOPs, is the ceiling.
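A back-of-envelope check on that ceiling, offered as a sketch rather than a measurement. The figures are assumptions that this lesson does not state: BF16 weights (2 bytes per parameter), about 3.35 TB/s of HBM bandwidth per H100 (the SXM spec-sheet figure), weights split evenly across 2 GPUs, and KV-cache reads and inter-GPU traffic ignored.

```python
# Rough upper bound on single-request decode speed when weight reads dominate.
# All constants are illustrative assumptions, not measurements from the cluster.

PARAMS = 27e9              # parameter count of a 27B model
BYTES_PER_PARAM = 2        # assumes BF16/FP16 weights
HBM_BANDWIDTH = 3.35e12    # bytes/s per GPU, assumed H100 SXM spec-sheet value
NUM_GPUS = 2               # assumes weights sharded evenly across two GPUs


def decode_ceiling_tokens_per_sec(params, bytes_per_param, bandwidth, num_gpus):
    """Tokens/s if each token requires streaming every weight exactly once."""
    weight_bytes = params * bytes_per_param
    aggregate_bandwidth = bandwidth * num_gpus
    seconds_per_token = weight_bytes / aggregate_bandwidth
    return 1.0 / seconds_per_token


ceiling = decode_ceiling_tokens_per_sec(PARAMS, BYTES_PER_PARAM, HBM_BANDWIDTH, NUM_GPUS)
print(f"bandwidth-only decode ceiling: ~{ceiling:.0f} tokens/s")
```

Under these assumptions the ceiling lands in the low hundreds of tokens per second, the same order of magnitude as the measured ~108 tokens/sec. That is what you would expect if bandwidth, not math, sets the limit.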

Step 3 — the test that proves it

If decode is really memory-bound, here's a prediction that pins it down. Imagine swapping in a GPU with double the compute (FLOPs) but the same memory bandwidth.

That's the whole asymmetry: prefill is compute-bound (big matrix multiplies over the entire prompt, tensor cores saturated), decode is memory-bandwidth-bound (re-read everything, do almost no math). Same GPU, opposite ceilings, and that's your 30×. [1]
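The "double the compute" thought experiment can be sketched with a minimal roofline-style model, where a step takes as long as its slower resource. Everything numeric here is an assumption for illustration: 2-byte weights, about 2 FLOPs per parameter per token, roughly 1 PFLOP/s of dense BF16 compute, 3.35 TB/s of bandwidth, and a 2,048-token prompt. Attention FLOPs, activations, and KV-cache traffic are ignored.

```python
# Minimal roofline sketch: a step takes as long as its slower resource.
# Hardware and workload numbers are illustrative assumptions.

def step_time(flops, bytes_moved, peak_flops, bandwidth):
    """Idealized time for one step: max of compute time and memory time."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / bandwidth
    return max(compute_time, memory_time)


PARAMS = 27e9
WEIGHT_BYTES = PARAMS * 2          # assumes 2-byte (BF16) weights
PEAK_FLOPS = 1.0e15                # assumed ~1 PFLOP/s dense BF16
BANDWIDTH = 3.35e12                # assumed bytes/s of HBM bandwidth


def forward_flops(num_tokens):
    """Common approximation: ~2 FLOPs per parameter per token."""
    return 2 * PARAMS * num_tokens


# Decode: 1 token per weight read. Prefill: a 2,048-token prompt per weight read.
decode_base = step_time(forward_flops(1), WEIGHT_BYTES, PEAK_FLOPS, BANDWIDTH)
decode_2x = step_time(forward_flops(1), WEIGHT_BYTES, 2 * PEAK_FLOPS, BANDWIDTH)
prefill_base = step_time(forward_flops(2048), WEIGHT_BYTES, PEAK_FLOPS, BANDWIDTH)
prefill_2x = step_time(forward_flops(2048), WEIGHT_BYTES, 2 * PEAK_FLOPS, BANDWIDTH)

print(f"decode speedup from 2x compute:  {decode_base / decode_2x:.2f}x")
print(f"prefill speedup from 2x compute: {prefill_base / prefill_2x:.2f}x")
```

In this idealized model, doubling compute leaves decode time unchanged while halving prefill time. Real hardware will not be this clean, but the direction of the result is the point.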

Step 4 — the payoff: batching

One more prediction — this is where it connects to your day job. Instead of one request, serve 8 at once. The expensive part of decode was re-reading the weights…

Recall — say it from memory: prefill is ___-bound; decode is ___-bound; batching mainly helps ___.
Prefill = compute-bound · decode = memory-bound · batching mainly helps decode throughput (one weight-read serves the whole batch).

So continuous batching, routing, autoscaling — your ops notebooks — are all really about keeping the memory-bound decode phase full, because a full batch multiplies throughput for almost no extra cost. (Up to memory and compute limits — that's Lessons 10 and 11.)
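A small sketch of why a fuller batch is nearly free, under the same kind of idealized assumptions: the weight read is paid once per decode step, and only the (small) compute term grows with the batch. The constants are illustrative, not measured, and per-sequence KV-cache reads are deliberately left out.

```python
# Idealized decode throughput vs. batch size: one weight read serves the batch.
# Constants are illustrative assumptions; KV-cache traffic is ignored.

PARAMS = 27e9
WEIGHT_BYTES = PARAMS * 2      # assumes 2-byte (BF16) weights
PEAK_FLOPS = 1.0e15            # assumed dense BF16 FLOP/s
BANDWIDTH = 3.35e12            # assumed HBM bytes/s


def decode_throughput(batch_size):
    """Tokens/s for one decode step that emits one token per sequence."""
    memory_time = WEIGHT_BYTES / BANDWIDTH               # paid once per step
    compute_time = 2 * PARAMS * batch_size / PEAK_FLOPS  # grows with the batch
    return batch_size / max(memory_time, compute_time)


baseline = decode_throughput(1)
for batch_size in (1, 8, 32):
    gain = decode_throughput(batch_size) / baseline
    print(f"batch {batch_size:>2}: {gain:.1f}x the single-request throughput")
```

The scaling comes out exactly linear only because the sketch ignores KV-cache reads, which grow with every sequence in the batch. Expect real gains to be somewhat smaller.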

Read this next (primary source): Mastering LLM Techniques: Inference Optimization, NVIDIA. Read the prefill/decode and KV-cache sections to confirm the answers you just predicted.

Final check — teach it back

The real test of the predict-first format: can you reconstruct it cold?

Explain to a colleague (out loud): "Our 27B is 30× slower at writing than reading because…"
…reading the prompt (prefill) is one big parallel matmul that saturates the GPU's compute, while writing (decode) re-streams the entire model from memory for every single token, so it's capped by memory bandwidth, not compute. Batching many requests shares that one weight-read, which is why throughput scales with batch.

I'm your teacher — ask me anything. Did the predict-first format help, or get in the way? Want the supporting diagrams from the main version added back as reveals? Tell me and I'll tune it before redoing Lessons 2+.

← Lesson 8 · What Is an Inference Engine? | Lesson 10 · The KV Cache & Its Memory →

References

Prefill is Compute-Bound, Decode is Memory-Bound (Towards Data Science). towardsdatascience.com
Mastering LLM Techniques (NVIDIA)

Prefill vs. Decode

The one asymmetry that explains your whole serving stack.

Today's win: by the end you'll be able to say, from memory, why prefill is compute-bound and decode is memory-bandwidth-bound — and use that single fact to predict how a serving system behaves under load.

First, a picture for your head: the faraway pantry

You're a lightning-fast line cook. Your ingredients sit in a pantry across town; one road and one van connect you. Watch the van — it hauls the entire pantry for every order. Hold this image; each technical term below maps onto it.

pantry across town	GPU memory (HBM) — weights + KV cache live here
the road & the van	memory bandwidth — how fast bytes move
the cook's stoves & hands	compute — FLOPs / tensor cores


1 · Two phases, two shapes

Every request runs in two phases. Prefill reads your whole prompt in one pass and emits the first token. Decode then emits the rest one token at a time, each step feeding back the token it just made. [1]

Pantry: prefill = one big catering order cooked in a single go. Decode = à la carte, one bite at a time.

[Animation: prefill (left) is a single wide pass; decode (right) lights up tokens one after another, strictly sequential.]
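The two shapes can be seen in a toy generation loop. This is only a sketch: `toy_next_token` is a hypothetical stand-in for a model forward pass, the list named `cache` stands in for the KV cache, and nothing here corresponds to a real inference API.

```python
# Toy illustration of the two phases. `toy_next_token` is a hypothetical
# stand-in for a model forward pass, not a real inference API.

VOCAB_SIZE = 100


def toy_next_token(cache):
    """Pretend model: picks the next token from everything cached so far."""
    return sum(cache) % VOCAB_SIZE


def generate(prompt_tokens, max_new_tokens):
    """Generate max_new_tokens (>= 1) tokens and count forward passes."""
    cache = []           # stands in for the KV cache
    forward_passes = 0

    # Prefill: the whole prompt enters the cache in ONE pass.
    cache.extend(prompt_tokens)
    forward_passes += 1
    token = toy_next_token(cache)
    output = [token]

    # Decode: each pass consumes only the token produced by the previous pass.
    while len(output) < max_new_tokens:
        cache.append(token)
        forward_passes += 1
        token = toy_next_token(cache)
        output.append(token)

    return output, forward_passes


tokens, passes = generate([3, 14, 15, 92, 65], max_new_tokens=4)
print(f"generated {len(tokens)} tokens in {passes} forward passes")
```

The five prompt tokens cost one pass in total, while every generated token costs a pass of its own, and each decode pass cannot start until the previous one has produced its token.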

2 · Same GPU, opposite bottleneck

Here's the crux. Prefill does huge matrix-matrix multiplies and saturates the tensor cores, so it is limited by compute. Decode, to make a single token, must re-read all the model weights plus the whole KV cache from memory while doing almost no math, so it is limited by memory bandwidth. [2][3]

Pantry: the catering cook is limited by how fast he can cook (compute). The à-la-carte cook is limited by the van (bandwidth) — the cooking is trivial.

[Animation: prefill maxes out the compute bar, decode maxes out the memory bar.] Speeding up the idle resource buys nothing.

3 · Why one decode token is so expensive

It feels backwards that generating one token is the slow part. The reason: per token you move gigabytes of weights + cache out of memory, but do only one token's worth of arithmetic. That ratio (FLOPs per byte) is the arithmetic intensity, and for decode it's tiny. [3]

Pantry: the van hauls the whole pantry across town just so you can make one knife-cut. Watch it haul below.

The fat pipe (bytes) is the bottleneck; the math is trivial. That's what "memory-bandwidth-bound" looks like.
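To put a number on "tiny", here is a hedged sketch of arithmetic intensity for the weight reads alone. It assumes about 2 FLOPs per parameter per token and 2-byte (BF16) weights, and it ignores activation and KV-cache traffic; the 2,048-token prompt is an arbitrary example.

```python
# Arithmetic intensity = FLOPs performed per byte moved from memory.
# Assumes ~2 FLOPs per parameter per token and 2-byte (BF16) weights;
# activation and KV-cache traffic are ignored for simplicity.

BYTES_PER_PARAM = 2
FLOPS_PER_PARAM_PER_TOKEN = 2


def arithmetic_intensity(tokens_per_weight_read):
    """FLOPs per weight byte when one weight read serves this many tokens."""
    return FLOPS_PER_PARAM_PER_TOKEN * tokens_per_weight_read / BYTES_PER_PARAM


print("decode, 1 token:            ", arithmetic_intensity(1), "FLOPs/byte")
print("prefill, 2,048-token prompt:", arithmetic_intensity(2048), "FLOPs/byte")
```

The parameter count cancels out, so under these assumptions the intensity is simply the number of tokens that share one weight read: about 1 for single-request decode, and about the prompt length for prefill.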

4 · The payoff — batching

This is where it ties back to your repo. Because that one expensive weight-read can serve many sequences at once, batching multiplies decode throughput almost for free. Continuous batching, routing, autoscaling: your ops notebooks are all strategies to keep the memory-bound decode phase fed. [4]

Pantry: the van is crossing town anyway with the whole pantry — so plate bites for 32 customers from that one trip.

One weight-read, many sequences served in turn. Bigger batch → more tokens per byte moved → higher throughput. This is why continuous batching exists.

One nuance to file away: decode is memory-bound at the small/medium batch sizes most serving runs at. Past a crossover batch size (roughly 32 as a rule of thumb; the exact value depends on the GPU's compute-to-bandwidth ratio and the weight precision), the decode matrix multiplies themselves turn compute-bound, while attention stays bandwidth-bound. We'll derive that crossover with the roofline model in a later lesson. [3]

In Kubernetes terms (infra bridge)

Prefill is a request's heavy init / cold-start work — one big compute burst — while decode is its long-running steady-state loop. Same split as an initContainer (expensive, runs once) versus the main container (runs forever): two different resource profiles, which is exactly why some stacks even schedule them on separate node pools (Lesson 20).

On YOUR cluster (measured, live)

This isn't just theory — it's running on your 4×H100 lab right now. Qwen3.5-27B (tensor-parallel across 2 H100s), single request:

prefill (prompt) throughput up to ~3,288 tok/s — compute-bound, devouring the prompt
decode (generation) throughput ~108 tok/s — memory-bound, one token at a time

That ≈30× gap is the asymmetry of this lesson, on your own GPUs. And your traffic is prefill-heavy (prompt:generation ≈ 19:1 to 47:1, classic RAG), which is exactly why these servers run --enable-chunked-prefill. Watch it live: bash learning/tools/cluster-probe.sh
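A quick worked example of what those ratios mean in wall-clock time, using the two throughputs quoted above. The 100-token answer length is a hypothetical picked for illustration, and the sketch treats both throughputs as constant for a single request, which is a simplification.

```python
# Where does wall-clock time go for one request? Uses the throughputs quoted
# above; the 100-token answer length is a hypothetical chosen for illustration.

PREFILL_TOKENS_PER_SEC = 3288
DECODE_TOKENS_PER_SEC = 108
GENERATED_TOKENS = 100   # assumption, not a measured value


def phase_seconds(prompt_to_generation_ratio):
    """Return (prefill_seconds, decode_seconds) for one request."""
    prompt_tokens = prompt_to_generation_ratio * GENERATED_TOKENS
    prefill = prompt_tokens / PREFILL_TOKENS_PER_SEC
    decode = GENERATED_TOKENS / DECODE_TOKENS_PER_SEC
    return prefill, decode


for ratio in (19, 47):
    prefill, decode = phase_seconds(ratio)
    print(f"{ratio}:1 -> prefill {prefill:.2f}s, decode {decode:.2f}s")
```

Even with prompts 19 to 47 times longer than the answers, the short decode phase still takes time comparable to the long prefill phase in this sketch, which is the 30× gap showing up in request latency.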

Read this next (primary source): Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog. The highest-trust single overview; read the prefill/decode and KV-cache sections.

Check yourself (recall, don't peek)

Picture the van and the diagrams above, then answer from memory.

I'm your teacher — ask me anything. Want a diagram redrawn, the van to move differently, a worked example with a real model + GPU, or the KV-cache memory math next? Say the word and I'll go deeper or build the next lesson around it.


References

1. Prefill-decode disaggregation, LLM Inference Handbook (BentoML). bentoml.com
2. Mastering LLM Techniques: Inference Optimization (NVIDIA Technical Blog). developer.nvidia.com
3. Prefill is Compute-Bound, Decode is Memory-Bound (Towards Data Science). towardsdatascience.com
4. vLLM Explained: PagedAttention and Continuous Batching (RunPod). runpod.io