One kitchen story we'll reuse in every lesson — so the hard terms always have something concrete to hang on.
You're a line cook with blazing-fast hands and stoves — you can cook almost anything in seconds (that's compute). But your ingredients aren't in the kitchen. They sit in a giant pantry across town (the GPU's memory, HBM), reachable only by a single road with one delivery van (the memory bandwidth).
Now watch the two ways orders come in:
Two more pieces fall out of the story. The stack of recipe binders grows with every bite you cook this session — that's the KV cache, and it makes each van trip a little heavier. And the obvious fix: since the van is already hauling the whole pantry, don't serve one customer — plate bites for 32 customers from that single trip. That's batching, and it's why throughput soars.
| The kitchen story | In the GPU (the term to keep) |
|---|---|
| Pantry across town | GPU memory (HBM) — where weights + KV cache live |
| The single road + the van | Memory bandwidth — how fast bytes move |
| The cook's hands & stoves | Compute — FLOPs / tensor cores |
| Growing stack of recipe binders | KV cache — grows one entry per token |
| The panel of specialist tasters | Attention heads — kv_heads file K/V notes per token |
| Descriptors on one index card | head_dim — length of each head's Key/Value vector |
| Big catering order, cooked in one go | Prefill — compute-bound |
| À la carte: one bite per round-trip | Decode — memory-bandwidth-bound |
| Cooking done per pound hauled | Arithmetic intensity — ops/byte |
| One van trip, plate for 32 customers | (Continuous) batching → throughput ↑ |