Keeping the GPU Full

Two fixes for wasted decode — predicted, then revealed.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.

Today's win: you'll predict why naïve batching idles the GPU and why contiguous KV wastes memory — and how continuous batching + PagedAttention fix both.

The setup

You batch 4 requests together. They finish at very different lengths — one wants 20 tokens, another 800.

Step 1 — why does static batching idle?

Step 2 — the fix

If the problem is waiting for the slowest, what should you change about when you schedule?

Recall — cover the screen: in one line, what is continuous batching?
Schedule one decode step at a time: after every step, evict finished requests and admit waiting ones, so the running batch never drains. (Iteration-level scheduling, from Orca.)

Step 3 — but can you fit the batch?

Continuous batching only helps if the sequences fit in memory. Early servers reserved one contiguous KV region per request, sized to its maximum possible length.

Step 4 — the memory fix

Recall — say it: what does PagedAttention borrow from operating systems, and what does it buy?
Virtual memory / paging: store KV in small fixed-size blocks allocated on demand and non-contiguously. ~0 fragmentation → far more sequences fit, so continuous batching can actually keep the batch full.

On YOUR cluster (real config)

vLLM is this engine. --max-num-seqs 8/64 sets the continuous-batch width; --enable-prefix-caching shares KV blocks across requests with the same prompt prefix (big for your RAG); --enable-chunked-prefill keeps long prefills from blocking decodes. Live gauge: num_requests_running vs num_requests_waiting.

Read this next (primary source): PagedAttention (vLLM), Kwon et al., SOSP'23, and Orca (continuous batching), OSDI'22.

Final check — teach it back

Explain to a colleague: "Naïve LLM serving wastes the GPU two ways…"
…static batching idles it waiting for the slowest request in the batch (fixed by continuous batching: swap requests every step), and contiguous max-length KV reservation wastes 60–80% of memory (fixed by PagedAttention: small on-demand blocks).

I'm your teacher — ask me anything. Want a diagram of prefix-caching block sharing, or how chunked prefill works?


References

PagedAttention — Kwon et al., SOSP'23 (2309.06180) · Orca — Yu et al., OSDI'22 (usenix).

PagedAttention & Continuous Batching

Two fixes that keep the memory-bound decode phase fed.

Today's win: you'll explain why naïve static batching idles the GPU and why contiguous KV allocation wastes most of your memory — and how continuous batching + PagedAttention fix both, turning the batch cap you computed in Lesson 10 into real throughput.

Pantry recap: decode is the van crossing town. From Lesson 9 you know the van is cheapest when it's full. This lesson is about two ways a naïve kitchen keeps the van half-empty, and how to fix them.

1 · Problem A — static batching idles the GPU

Group N requests, run them together, wait for all to finish, then start the next batch. The catch: requests finish at wildly different lengths, so the whole batch stalls on the slowest one while finished slots sit idle. On real traffic that leaves roughly 60% of the GPU idle [3].

Pantry: one shared van that refuses to leave until every customer's order is done — the quick orders just wait on the slow one.

A, B, C finish early but their slots idle until D ends — only then can new work start. The hatched area is wasted GPU.
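To put a number on the hatched area, here is a minimal sketch that counts wasted slot-steps in one static batch. The function name and the four output lengths are made up for illustration; only the idea (the batch runs as long as its longest request) comes from the lesson.

```python
def static_batch_idle_fraction(output_lengths):
    """Fraction of slot-steps wasted when a batch runs until its longest request ends."""
    steps = max(output_lengths)                # the batch runs this many decode steps
    total_slots = steps * len(output_lengths)  # slot-steps the batch occupies
    useful = sum(output_lengths)               # slot-steps that actually produced a token
    return 1 - useful / total_slots


# Hypothetical output lengths (in tokens) for requests A, B, C, D.
lengths = [20, 150, 400, 800]
print(f"idle fraction: {static_batch_idle_fraction(lengths):.2f}")  # idle fraction: 0.57
```

With this spread the batch is idle for 57% of its slot-steps, in the same range as the roughly 60% figure quoted above. The wider the spread in lengths, the worse it gets.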

2 · Fix A — continuous batching (iteration-level scheduling)

Instead of scheduling whole requests, schedule one decode step at a time. After every step, evict finished requests and admit waiting ones, so the batch never drains. This is iteration-level scheduling, introduced by Orca, and it is reported to be worth 4–8× in throughput on variable-length workloads [1, 4].

Pantry: the van never waits — it drops finished orders and picks up waiting ones on every trip, so it's always full.

req A finishes and leaves; req E immediately takes the freed slot. No waiting for the whole batch — the running set stays full step after step.
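The difference between the two schedulers fits in a few lines. This is a toy simulation under simplifying assumptions (every request is already prefilled, each step costs the same, and `simulate` is a hypothetical name), not how Orca or vLLM is implemented.

```python
from collections import deque


def simulate(output_lengths, width, continuous):
    """Return (steps, utilization) for decoding all requests with `width` slots."""
    waiting = deque(output_lengths)
    running = []            # tokens still to generate for each in-flight request
    steps = busy = 0
    while waiting or running:
        # Static batching admits only once the batch has fully drained;
        # continuous batching refills free slots before every step.
        if continuous or not running:
            while waiting and len(running) < width:
                running.append(waiting.popleft())
        busy += len(running)                         # slots producing a token this step
        steps += 1
        running = [r - 1 for r in running if r > 1]  # evict requests that just finished
    return steps, busy / (steps * width)


# Hypothetical workload: eight requests, four slots.
lengths = [20, 800, 30, 50, 40, 700, 25, 60]
for mode in (False, True):
    steps, util = simulate(lengths, width=4, continuous=mode)
    print(f"continuous={mode}: {steps} steps, utilization {util:.0%}")
```

On this workload static batching needs 1500 steps at about 29% utilization, while continuous batching finishes in 800 steps at about 54%. The remaining idle time comes from the queue running dry near the end, not from waiting on the slowest request.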

3 · Problem B — contiguous KV wastes most of your memory

Continuous batching only helps if you can fit the sequences. Early servers reserved one contiguous KV region per request, sized to its maximum possible length. Most requests never reach it, so 60–80% of the reserved memory sits empty, and you fit far fewer sequences than Lesson 10's math promised [2].

Pantry: reserving a whole shelf per customer in case they order the maximum — shelves fill up while mostly empty, and you turn customers away.

The blue is the KV you really use; the hatch is reserved-for-max waste. Fragmentation, not the formula, is why so few requests fit.
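A quick worked example of the reservation waste, as a sketch. The sequence lengths, the 2048-token maximum, and the memory budget are all hypothetical, and the budget is counted in token slots rather than bytes to keep the arithmetic visible.

```python
def reservation_waste(final_lengths, max_len):
    """Fraction of reserved KV slots never written when each request reserves max_len."""
    reserved = max_len * len(final_lengths)
    return 1 - sum(final_lengths) / reserved


def sequences_that_fit(budget_tokens, tokens_per_sequence):
    """How many sequences fit if each one claims tokens_per_sequence KV slots."""
    return budget_tokens // tokens_per_sequence


final_lengths = [220, 512, 900, 1300]   # hypothetical prompt + output lengths
max_len = 2048                          # hypothetical per-request maximum
budget = 16 * max_len                   # hypothetical KV budget, in token slots

mean_len = sum(final_lengths) // len(final_lengths)
print(f"waste under max-length reservation: {reservation_waste(final_lengths, max_len):.0%}")
print(f"fit when reserving max_len: {sequences_that_fit(budget, max_len)}")
print(f"fit if only used slots counted: {sequences_that_fit(budget, mean_len)}")
```

Here 64% of the reserved slots are never written, and the same budget holds 16 sequences under max-length reservation versus 44 if only the slots actually used were counted.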

4 · Fix B — PagedAttention

Borrow the operating system's trick: virtual memory. Split the KV cache into small fixed-size blocks and hand them out on demand, non-contiguous, as each sequence grows. Waste drops to nearly zero, so you finally fit the batch your Lesson 10 math allowed. This is PagedAttention; together with continuous batching it gave vLLM up to 24× the throughput of naïve HuggingFace Transformers [2].

Pantry: ditch per-customer shelves. Use small bins from a shared rack, assigned only as each order actually grows — almost no empty space.

Like OS paging: many requests share one pool of small blocks. Almost no empty space, so continuous batching can actually keep the batch full.
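The bookkeeping behind this can be sketched as a per-sequence block table over a shared free list. This is a toy illustration, not vLLM's allocator: the class and method names are hypothetical, and the block size of 16 tokens is an assumed value.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence maps to a list of physical block ids."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # shared pool of physical blocks
        self.tables = {}                     # seq_id -> physical block ids, in logical order
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Store one more token's KV, taking a new block only when the last one is full."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:         # first token, or the last block is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self):
        """Allocated but unused token slots: fewer than block_size per sequence."""
        return sum(len(t) * self.block_size - self.lengths[s]
                   for s, t in self.tables.items())


alloc = PagedKVAllocator(num_blocks=8)
for _ in range(20):
    alloc.append_token("A")   # A grows to 20 tokens: 2 blocks
for _ in range(5):
    alloc.append_token("B")   # B grows to 5 tokens: 1 block
print(alloc.tables, alloc.wasted_slots())
alloc.release("A")            # A's blocks go straight back to the pool
print(len(alloc.free))
```

The key property is that waste is bounded by one partly filled block per sequence, regardless of the maximum length a request could have reached.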

5 · Putting it together — two requests, A and B

Now the synthesis. When person A and person B hit the same model at once, what actually produces each full reply? Everything above shows up here on one timeline — prefill (L1), per-sequence KV (L2), continuous batching (§2) and paging (§4).

Pantry: two customers, one van. The van still hauls the whole pantry per trip — but on that single trip the cook plates a bite for A and a bite for B. Each customer's own order ticket, though, is read separately.

A prefills and starts decoding; B is admitted the moment it's prefilled and joins the same forward passes; when A hits EOS its slot frees and a queued request E reuses it — the batch never drains.

Reading one column at a time — what the GPU's single forward pass holds each iteration:

iter	what the one forward pass contains	tokens out
1	prefill A (whole prompt)	A: +tok 1
2	prefill B + decode A — mixed (chunked prefill)	B: +tok 1 · A: +tok 2
3–38	decode A + decode B — fused, one weight-read	A & B: +1 each
39	decode A + decode B (A hits EOS)	A: ✓ done · B: +tok
40	decode B (+ prefill E from the queue)	B: +tok · E: +tok 1

Inside one fused step: weights shared, KV separate

The reason B is almost free to add: the model's weight matrices are read once from memory and drive both sequences in the same pass (the Lesson 9 batching win). But the attention step reads each sequence's own KV cache (Lesson 10) — which is exactly why attention gets no cross-batch reuse and stays memory-bound (Lesson 11).

One weight-read, two sequences — but two separate KV caches. Shared weights are the throughput win; the separate KV is why each request keeps its own context (and its own memory cost).
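A rough way to see why B is almost free: count the bytes one fused step reads. The sketch below is a deliberately crude model (it ignores activations and assumes the whole KV cache of each sequence is read every step), and the 16 GB and 0.5 GB figures are hypothetical.

```python
def bytes_read_per_step(weight_gb, kv_gb_per_seq):
    """Memory traffic for one fused decode step: weights once, KV once per sequence."""
    return weight_gb + sum(kv_gb_per_seq)


def gb_per_token(weight_gb, kv_gb_per_seq):
    """Traffic per generated token: each sequence in the batch emits one token per step."""
    return bytes_read_per_step(weight_gb, kv_gb_per_seq) / len(kv_gb_per_seq)


weights = 16.0                               # hypothetical model weights, GB
print(gb_per_token(weights, [0.5]))          # A alone
print(gb_per_token(weights, [0.5, 0.5]))     # A and B in one fused step
```

Under these numbers A alone costs 16.5 GB of reads per token, while A and B together cost 8.5 GB per token. The weight term is amortized across the batch, and the KV term is not.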

So B never waits for A. It joins the instant it's prefilled (continuous batching, §2), rides every shared step after that, and when A finishes, A's KV blocks free immediately (PagedAttention, §4) for the next request in the queue. Because your servers run --enable-chunked-prefill (Lesson 24), B's prompt also slips in over a few steps without freezing A's token stream.

In Kubernetes terms (infra bridge)

Continuous batching is the scheduler bin-packing pods: admit new requests onto the GPU (node) the instant others finish, keeping utilization high instead of draining the whole batch first. And PagedAttention is memory overcommit with pages — hand out small KV blocks on demand instead of reserving each request's max up front, the same way you overcommit node memory rather than reserving every pod's limit.

On YOUR cluster: this engine is vLLM (real config)

Both fixes ship in the servers you're running. The flags are the dials:

--max-num-seqs 8 (qwen36) / 64 (qwen35) — the continuous-batch width: how many sequences decode together.
--enable-prefix-caching — PagedAttention blocks shared across requests with the same prompt prefix (huge for your RAG system prompts).
--enable-chunked-prefill — splits long prefills so they don't block decodes — vital since your traffic is 19–47:1 prefill-heavy.
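To make the prefix-caching flag concrete, here is a toy sketch of block sharing: a full block is keyed by the entire token prefix that ends at it, so two requests with the same leading tokens land on the same physical blocks. This illustrates the idea only. The names are hypothetical, the block size of 4 is chosen for readability, and a real engine would use compact hashes rather than whole tuples as keys.

```python
class PrefixCache:
    """Toy prefix cache: maps a token prefix (as a tuple) to a physical block id."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}   # prefix tuple -> physical block id
        self.hits = 0      # blocks reused instead of recomputed

    def block_table(self, token_ids):
        """Return physical block ids for every full block of this prompt."""
        table = []
        for end in range(self.block_size, len(token_ids) + 1, self.block_size):
            key = tuple(token_ids[:end])           # the block's identity includes its prefix
            if key in self.blocks:
                self.hits += 1                     # KV for this block already exists
            else:
                self.blocks[key] = len(self.blocks)
            table.append(self.blocks[key])
        return table


system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]           # hypothetical shared RAG system prompt
cache = PrefixCache()
print(cache.block_table(system_prompt + [9, 10, 11, 12]))   # [0, 1, 2]
print(cache.block_table(system_prompt + [20, 21, 22, 23]))  # [0, 1, 3]
print(cache.hits)                                           # 2
```

The second request reuses blocks 0 and 1 and only computes KV for its own final block, which is why a long shared system prompt benefits so much.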

The live gauge is num_requests_running vs num_requests_waiting (the "Running: N reqs" line). Watch it: bash learning/tools/cluster-probe.sh. Want the real throughput-vs-concurrency curve measured on your H100s? That needs a small opt-in load test; say the word and I'll run a bounded one.

You use this every day: prompt caching (bridge to the API)

Why is a Claude Code (or any LLM-API) session cheap on repeated turns but pricey to resume tomorrow? Same mechanism: prompt caching is prefix caching made cross-request, time-limited, and billed — it stores the prefill KV for a stable prefix so the next request skips re-prefilling it. It's --enable-prefix-caching with a clock and a price tag.

Write the cache once (1.25×), then ride cheap reads (0.1×) while the session stays warm. Go idle past the TTL and the cache evicts — the next turn re-prefills the whole history from scratch.

cache read (hit)	≈ 0.1× input price (~90% off)
cache write (cold / after expiry)	1.25× (5-min TTL) · 2× (1-hour TTL)
TTL	5 min default (refreshed on each use) · 1 hour option

Resume tomorrow? Idle past the TTL → the cache is gone → your first turn is a miss: re-prefill the entire history at the 1.25× cache-write rate (full price plus a 25% write premium), then cheap reads resume. So continuing now is cheap; coming back tomorrow re-pays once to rebuild the cache. The only cost beyond an uncached request is that write premium on the one turn; after it, the savings resume. (Multipliers/TTL are Anthropic's; the mechanism, prefix-KV reuse plus eviction, is universal.)
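The arithmetic is easy to check with the multipliers from the table above. This sketch simplifies a session to a fixed prefix re-sent on every turn, ignoring the new tokens each turn adds, and reports cost in base-price token equivalents. The function names and the 100,000-token prefix are hypothetical.

```python
def uncached_cost(prefix_tokens, turns):
    """Every turn re-prefills the whole prefix at base price."""
    return prefix_tokens * turns


def cached_cost(prefix_tokens, turns, write_mult=1.25, read_mult=0.1):
    """One cache write on the first turn, then cache reads while the session stays warm."""
    return prefix_tokens * (write_mult + read_mult * (turns - 1))


prefix, turns = 100_000, 10
print(round(uncached_cost(prefix, turns)))   # 1000000
print(round(cached_cost(prefix, turns)))     # 215000
print(round(cached_cost(prefix, 1)))         # 125000: a single cold turn after expiry
```

Ten warm turns cost about a fifth of the uncached total, while a single cold turn costs 1.25× the uncached price for that turn.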

Pantry: the recipe binder stays open on the counter while you keep cooking (warm reads); leave the kitchen idle and the staff put it away — tomorrow they re-read the whole binder from scratch before the first dish.

Read this next (primary source): Efficient Memory Management for LLM Serving with PagedAttention, Kwon et al., SOSP 2023. The paper behind vLLM; read §1 and the memory-management section. Pair it with the Orca paper (OSDI'22) for continuous batching.

Check yourself (recall, don't peek)

Picture the idle bars and the packed blocks, then answer from memory.

I'm your teacher — ask me anything. Want the opt-in live batching benchmark on your H100s, a diagram of prefix-caching block sharing, or to connect max-num-seqs back to an SLO in your routing notebooks? Just ask.


References

1. Orca: A Distributed Serving System for Transformer-Based Generative Models. Yu et al., OSDI'22. usenix.org
2. Efficient Memory Management for LLM Serving with PagedAttention. Kwon et al., SOSP'23 (arXiv 2309.06180). arxiv.org
3. vLLM Explained: PagedAttention and Continuous Batching. RunPod. runpod.io
4. LLM Batching: Static vs Continuous. PremAI. blog.premai.io