
The Faraway Pantry

One kitchen story we'll reuse in every lesson — so the hard terms always have something concrete to hang on.

[Diagram: The Faraway Pantry]
You're a lightning-fast line cook. Your ingredients live in a pantry across town. One road connects them, and a van hauls the entire pantry for every order.

The story

You're a line cook with blazing-fast hands and stoves — you can cook almost anything in seconds (that's compute). But your ingredients aren't in the kitchen. They sit in a giant pantry across town (the GPU's memory, HBM), reachable only by a single road with one delivery van (the memory bandwidth).
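To put rough numbers on the gap between the cook and the van, here is a minimal back-of-the-envelope sketch for a single small order (one generated token). Every figure in it is a hypothetical placeholder, not a spec for any real GPU or model, and the "about 2 FLOPs per parameter per token" line is a common rule of thumb rather than an exact count.

```python
def van_trip_seconds(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to haul every weight out of the pantry once (one van trip)."""
    return weight_bytes / bandwidth_bytes_per_s


def cooking_seconds(flops_needed: float, peak_flops_per_s: float) -> float:
    """Time the cook needs for the arithmetic itself, at peak speed."""
    return flops_needed / peak_flops_per_s


# Hypothetical numbers, chosen only for illustration.
PARAMS = 7e9            # model parameters (assumed)
BYTES_PER_PARAM = 2     # 16-bit weights (assumed)
BANDWIDTH = 2e12        # bytes per second the road can carry (assumed)
PEAK_FLOPS = 1e15       # FLOPs per second the cook can do (assumed)

haul = van_trip_seconds(PARAMS * BYTES_PER_PARAM, BANDWIDTH)
# Rule of thumb: roughly one multiply and one add per weight per token.
cook = cooking_seconds(2 * PARAMS, PEAK_FLOPS)

print(f"van trip: {haul * 1e3:.1f} ms, cooking: {cook * 1e3:.3f} ms")
```

With these made-up numbers the van trip takes hundreds of times longer than the cooking, which is the whole point of the story: the road, not the stove, sets the pace.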

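Now watch the two ways orders come in:

Big catering order, cooked in one go. The whole order arrives at once, so a single van trip feeds a lot of cooking. That's prefill, and it's compute-bound: your hands are the limit, not the road.

À la carte, one bite per round trip. Each bite needs its own van trip, so you spend most of your time waiting on the road. That's decode, and it's memory-bandwidth-bound.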

Two more pieces fall out of the story. You keep a stack of recipe binders with notes on every bite you've cooked this session, and the stack grows with each new bite. That's the KV cache, and it makes each van trip a little heavier. And the obvious fix: since the van is already hauling the whole pantry, don't serve just one customer. Plate bites for 32 customers from that single trip. That's batching, and it's why throughput soars: the cost of the trip is shared by everyone being served.
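How heavy do the binders get? A minimal sketch, assuming the usual layout where every layer stores one Key vector and one Value vector per KV head for each token. The function name and the model shape below (32 layers, 8 KV heads, head_dim of 128, 16-bit values) are hypothetical and only there to make the arithmetic concrete.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one sequence of `tokens` tokens."""
    # The leading 2 counts one Key vector plus one Value vector.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * tokens


# Hypothetical model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for tokens in (1, 1_000, 8_000):
    mib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**20
    print(f"{tokens:>5} tokens -> {mib:,.1f} MiB")
```

For this assumed shape each token adds 128 KiB, so the stack grows linearly with the length of the session, and every customer you batch brings their own stack.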

The mapping — keep the real terms

The kitchen story | In the GPU (the term to keep)
Pantry across town | GPU memory (HBM): where weights + KV cache live
The single road + the van | Memory bandwidth: how fast bytes move
The cook's hands & stoves | Compute: FLOPs / tensor cores
Growing stack of recipe binders | KV cache: grows one entry per token
The panel of specialist tasters | Attention heads; kv_heads file K/V notes per token
Descriptors on one index card | head_dim: length of each head's Key/Value vector
Big catering order, cooked in one go | Prefill: compute-bound
À la carte: one bite per round-trip | Decode: memory-bandwidth-bound
Cooking done per pound hauled | Arithmetic intensity: ops/byte
One van trip, plate for 32 customers | (Continuous) batching → throughput ↑
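The last two rows can be tied together with a rough sketch: cooking done per pound hauled goes up as you plate for more customers from the same trip. The code below is a simplified estimate that counts only weight traffic and ignores the KV cache entirely. The helper names and hardware figures are hypothetical, and "2 FLOPs per parameter per token" is a rule of thumb.

```python
def decode_intensity(batch_size: int, params: float,
                     bytes_per_param: int = 2) -> float:
    """Approximate ops/byte for one decode step (weights only)."""
    flops = 2 * params * batch_size          # each customer needs ~2 FLOPs per weight
    bytes_moved = params * bytes_per_param   # weights are hauled once and shared
    return flops / bytes_moved


def ridge_point(peak_flops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """Ops/byte at which the cook, not the road, becomes the bottleneck."""
    return peak_flops_per_s / bandwidth_bytes_per_s


# Hypothetical numbers, for illustration only.
PARAMS = 7e9
RIDGE = ridge_point(peak_flops_per_s=1e15, bandwidth_bytes_per_s=2e12)

for batch in (1, 32, 512):
    intensity = decode_intensity(batch, PARAMS)
    bound = "compute-bound" if intensity >= RIDGE else "memory-bandwidth-bound"
    print(f"batch {batch:>3}: {intensity:6.1f} ops/byte -> {bound}")
```

Under these assumptions a single customer gets about 1 op per byte hauled, far below what the cook can absorb, and each extra customer on the same trip raises that ratio until the stove finally becomes the limit.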