The Faraway Pantry

One kitchen story we'll reuse in every lesson — so the hard terms always have something concrete to hang on.

You're a lightning-fast line cook. Your ingredients live in a pantry across town. One road connects them, and a van hauls the entire pantry for every order.

The story

You're a line cook with blazing-fast hands and stoves — you can cook almost anything in seconds (that's compute). But your ingredients aren't in the kitchen. They sit in a giant pantry across town (the GPU's memory, HBM), reachable only by a single road with one delivery van (the memory bandwidth).

Now watch the two ways orders come in:

A big catering order arrives — cook 500 portions at once. You haul one big load and then cook flat-out, every burner blazing. You're limited by how fast you can cook. That's prefill: compute-bound.
À la carte — a customer orders one bite. To make it, the van must drive across town and bring back your whole pantry plus every recipe binder you've filled so far. You make one knife-cut, plate one bite, and the next order means the van does the entire round-trip again. You're limited by the van and the road, not your cooking. That's decode: memory-bandwidth-bound.
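To put rough numbers on the two kinds of order, here is a minimal back-of-the-envelope sketch. Everything in it is an assumption for illustration, not a measurement: a hypothetical 7-billion-parameter model stored at 2 bytes per weight, a GPU with 1e15 FLOP/s of peak compute and 3e12 bytes/s of memory bandwidth, and the common approximation of about 2 FLOPs per weight per token. It ignores the recipe binders and all other traffic, and only compares "time spent cooking" with "time spent waiting for the van".

```python
def step_times(n_tokens, n_params, bytes_per_param, peak_flops, bandwidth):
    """Return (compute_seconds, memory_seconds) for one forward pass."""
    flops = 2 * n_params * n_tokens            # assumed ~2 FLOPs per weight per token
    bytes_moved = n_params * bytes_per_param   # the weights are hauled once per pass
    return flops / peak_flops, bytes_moved / bandwidth


def bound(n_tokens, **hw):
    """Name whichever side is slower: the cook (compute) or the van (memory)."""
    compute_s, memory_s = step_times(n_tokens, **hw)
    return "compute-bound" if compute_s > memory_s else "memory-bandwidth-bound"


# Hypothetical hardware and model figures, chosen only for illustration.
HW = dict(n_params=7e9, bytes_per_param=2, peak_flops=1e15, bandwidth=3e12)

print("catering order, 500 tokens:", bound(500, **HW))
print("a la carte, 1 token:       ", bound(1, **HW))
```

With these assumed figures the 500-token catering order spends longer cooking than hauling, while the single-bite order spends almost all of its time waiting for the van.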

Two more pieces fall out of the story. The stack of recipe binders grows with every bite you cook this session: that's the KV cache, and it makes each van trip a little heavier. And the obvious fix: since the van is already hauling the whole pantry, don't serve just one customer. Plate bites for 32 customers from that single trip. Each customer's own binders still ride along, but the pantry itself is hauled only once, so the cost of the trip is shared 32 ways. That's batching, and it's why throughput soars.
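The batching argument can be sketched as simple arithmetic. The numbers below are assumptions for illustration only (14 GB of weights, 0.5 GB of KV cache per customer, 3e12 bytes/s of bandwidth), and the sketch assumes decode stays limited by the van the whole time, which stops being true once the batch gets large enough for cooking to become the bottleneck.

```python
def decode_tokens_per_second(batch, weight_bytes, kv_bytes_per_seq, bandwidth):
    """Tokens produced per second when every decode step is limited by memory traffic."""
    # One trip hauls the pantry (weights) once, plus each customer's own binders.
    bytes_per_step = weight_bytes + batch * kv_bytes_per_seq
    seconds_per_step = bytes_per_step / bandwidth
    return batch / seconds_per_step            # one token per customer per trip


# Hypothetical figures, chosen only for illustration.
WEIGHT_BYTES = 14e9
KV_BYTES_PER_SEQ = 0.5e9
BANDWIDTH = 3e12

single = decode_tokens_per_second(1, WEIGHT_BYTES, KV_BYTES_PER_SEQ, BANDWIDTH)
batched = decode_tokens_per_second(32, WEIGHT_BYTES, KV_BYTES_PER_SEQ, BANDWIDTH)
speedup = batched / single

print(f"1 customer:   {single:,.0f} tokens/s")
print(f"32 customers: {batched:,.0f} tokens/s ({speedup:.1f}x)")
```

Under these assumptions the gain is well short of 32x, because the 32 stacks of binders make the trip heavier even though the pantry is hauled only once.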

The mapping — keep the real terms

The kitchen story	In the GPU (the term to keep)
Pantry across town	GPU memory (HBM) — where weights + KV cache live
The single road + the van	Memory bandwidth — how fast bytes move
The cook's hands & stoves	Compute — FLOPs / tensor cores
Growing stack of recipe binders	KV cache — grows one entry per token
The panel of specialist tasters	Attention heads — `kv_heads` file K/V notes per token
Descriptors on one index card	head_dim — length of each head's Key/Value vector
Big catering order, cooked in one go	Prefill — compute-bound
À la carte: one bite per round-trip	Decode — memory-bandwidth-bound
Cooking done per pound hauled	Arithmetic intensity — ops/byte
One van trip, plate for 32 customers	(Continuous) batching → throughput ↑
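As a rough guide to how heavy the recipe binders get, the KV cache size can be estimated from `kv_heads` and `head_dim`. This is a sketch under stated assumptions: `n_layers` and `bytes_per_value` are parameters introduced here for illustration, and the example shape (32 layers, 8 KV heads, `head_dim` of 128, 2-byte values) is hypothetical rather than taken from any particular model.

```python
def kv_cache_bytes(n_tokens, n_layers, kv_heads, head_dim, bytes_per_value=2):
    """Estimate the KV cache size in bytes for one sequence of n_tokens."""
    # Each token stores one Key and one Value vector (hence the factor 2),
    # each of length head_dim, for every KV head in every layer.
    per_token = 2 * n_layers * kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token


# Hypothetical model shape, chosen only for illustration.
per_token = kv_cache_bytes(1, n_layers=32, kv_heads=8, head_dim=128)
session = kv_cache_bytes(4096, n_layers=32, kv_heads=8, head_dim=128)

print(per_token / 1024, "KiB per token")
print(session / 2**20, "MiB after 4096 tokens")
```

The cache grows linearly with the number of tokens, which is the "one entry per token" row in the table expressed in bytes.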
