Inference Engineering · Lesson 5 · Attention

Attention

How tokens read each other — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict how attention works (Q/K/V, scores, the causal mask, heads) — the exact operation the KV cache, GQA, and FlashAttention later optimize.

The setup

Embeddings (Lesson 4) gave each token a vector that knows only itself. But "the pod crashed because it ran out of memory" — to handle "it", the model must use the other tokens.

Step 1 — what each token produces

Step 2 — how relevance is scored

Recall — cover the screen: the attention recipe in one line.
Each token makes a Query, Key, Value. Score every earlier token by Query·Key, softmax the scores into weights, and output the weighted sum of their Values. So "it" ends up carrying mostly "pod"'s value.

Step 3 — what can a token look at?

Step 4 — what's a "head"?

Step 5 — the cost, and the fixes (real)

Recall — say it: three optimizations that all target attention.
KV cache: don't recompute past Keys/Values (Lesson 10). GQA: many query heads share a few K/V heads, so the cache is smaller. FlashAttention: compute the n×n scores in on-chip SRAM, never materializing the full matrix (Lesson 13).
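The "never materializing the full matrix" idea rests on a softmax that can be computed in a streaming fashion. The sketch below is a plain-Python illustration of that idea only, not the FlashAttention kernel: it handles one query, walks over the keys in small blocks, and keeps a running maximum and running sum so earlier partial results can be rescaled. The function names, the block size, and the toy vectors are all assumptions made for illustration, and the scores are raw Q·K with no scaling.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend_full(q, keys, values):
    """Reference version: compute every score first, then softmax, then blend."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[c] for e, v in zip(exps, values)) / total
            for c in range(len(values[0]))]

def attend_streaming(q, keys, values, block=2):
    """Same result, but only one block of scores exists at any time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])           # unnormalized weighted sum of Values
    for start in range(0, len(keys), block):
        scores = [dot(q, k) for k in keys[start:start + block]]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first block
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Toy data, invented for illustration.
q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

full = attend_full(q, keys, values)
streamed = attend_streaming(q, keys, values, block=2)
```

Both functions should agree up to floating-point rounding for any block size, which is why the blocked version can skip storing the full score matrix.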

On YOUR cluster (from config.json)

Qwen3.6: 24 query heads, 4 KV heads (head_dim 256) = GQA. 6 query heads share each K/V head, so the KV cache is 6× smaller than with full multi-head attention (you'll do the math in Lesson 10). It's also a hybrid: cheap linear attention in most layers, full attention in every 4th.

Watch this next (primary source): Attention, visually (3Blue1Brown) · runnable: day04 notebook · Attention Is All You Need.

Final check — teach it back

Explain to a colleague: "Attention lets a token…"
…pull information from the relevant earlier tokens: it forms a Query, matches it against every earlier token's Key (Q·K → softmax), and blends their Values. Multiple heads do this in parallel, each tracking a different relationship; a causal mask keeps it from seeing the future.

Attention

How each token reads the others — the operation everything later optimizes.

Today's win: you'll explain attention — Query/Key/Value, the relevance scores, the causal mask, and what a "head" is — and see why this is exactly the operation the KV cache (Lesson 10), GQA, and FlashAttention all exist to make cheaper.

The picture: which earlier items matter right now?

A token's embedding (Lesson 4) only knows itself. To understand context, the cook plating the current bite asks: which earlier items on the order ticket matter for this one? Each token asks a Query ("what am I looking for?"), every token advertises a Key ("here's what I offer"), and carries a Value ("here's my content"). Match query to keys → mix the matching values.

the question this item asks → Query (Q)
the label each item advertises → Key (K)
the content each item carries → Value (V)
one specialist doing this lookup → an attention head (many run in parallel)

1 · Why attention: a token needs the others

Take "the pod crashed because it ran out of memory". To handle "it", the model must look back and find that "it" = "pod". Attention is the mechanism that lets every token pull in information from the relevant earlier tokens. [1]

[Figure: "the pod crashed because it ran out of memory", with a strong link from "it" back to "pod". "it" looks back; the thickest link wins the most attention weight.]
Resolving "it" = "pod" is attention at work — every token builds its meaning from the relevant earlier ones.

2 · Query · Key · Value — the lookup

From each token's vector, three learned projections produce its Q, K, and V. The relevance of one token to another is Query · Key (a dot product, divided by √d_k in the original Transformer to keep scores in a stable range); those scores go through softmax to become weights; the output is the weighted sum of the Values. [1]

Pantry: the cook holds up the current item's question (Q) against every earlier item's label (K), sees which match best, and blends those items' contents (V) into the answer.
[Figure: Q of "it" is scored against K of "pod", "crashed", "because", and "the". Q·K = scores; softmax → weights 0.71 (pod), 0.12, 0.10, 0.07; weighted sum of V ≈ mostly "pod"'s value.]
Score every earlier token by Q·K, softmax into weights, blend their Values. "it" ends up carrying mostly "pod"'s content. That's one attention computation.
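The score, softmax, blend recipe can be sketched for a single query in plain Python. This is a minimal illustration under stated assumptions: the 2-d query and keys, the one-hot values, and the function names `softmax` and `attend` are all invented here, and the scores are divided by √d_k as in the original Transformer paper.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """One attention lookup: score by Q·K, softmax, blend the Values."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy vectors, invented for illustration (not real model weights).
query_it = [2.0, 0.0]                               # Q of "it"
keys = [[0.1, 1.0], [3.0, 0.0], [0.5, 0.5]]         # K of "the", "pod", "crashed"
values = [[1.0, 0.0, 0.0],                          # V of "the"
          [0.0, 1.0, 0.0],                          # V of "pod"
          [0.0, 0.0, 1.0]]                          # V of "crashed"

weights, output = attend(query_it, keys, values)
print(weights)   # the largest weight sits on "pod" (index 1)
print(output)    # with one-hot Values, the output mirrors the weights
```

Because the toy Values are one-hot, the output vector makes the blend visible: each coordinate is simply the weight given to that token.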

3 · Two essentials: the causal mask, and heads

Two things make it work for generation. Causal mask: a token may attend only to itself and the tokens before it; it can't peek at the future it's trying to predict. Multi-head: the model runs several attentions in parallel, each a head with its own Q/K/V projections, learning a different kind of relationship (one tracks what-refers-to-what, another tracks code syntax, …). Their outputs are concatenated (and, in the original Transformer, mixed by a final output projection). [1]

Pantry: several specialist cooks each scan the ticket for a different pattern (one for allergies, one for timing, one for sauces) — then combine notes. Each specialist is a head.
[Figure: multi-head: each head is its own Q/K/V lookup, run in parallel (head 1 · Q/K/V, head 2 · Q/K/V, head 3 · Q/K/V), then concatenated; each head = a different "lens". Causal mask: attend only to earlier tokens (masked vs. allowed cells).]
Heads = parallel specialists, each its own Q/K/V. The causal triangle keeps each token from seeing the future — essential, since generation predicts that future.

4 · The cost — and what comes next

Every token attends to every earlier token, so attention is O(n²) in sequence length, the expensive part of long contexts. Three optimizations flow directly from this lesson, and each is covered later: the KV cache (Lesson 10), GQA, and FlashAttention (Lesson 13). [2]
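The quadratic growth is easy to check by counting scores. The sketch below assumes each token scores itself plus every earlier token, and counts for a single head in a single layer; the helper name `causal_score_count` is made up for this example.

```python
def causal_score_count(n):
    """Number of Q·K scores for n tokens under a causal mask: 1 + 2 + ... + n."""
    return n * (n + 1) // 2

for n in (1_000, 2_000, 4_000):
    print(n, causal_score_count(n))

# Doubling the sequence length roughly quadruples the work.
ratio = causal_score_count(2_000) / causal_score_count(1_000)
print(round(ratio, 3))
```

The causal mask halves the count compared with a full n×n grid, but the growth is still quadratic.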

On YOUR cluster: GQA, set in the config (from config.json)

Qwen3.6 has 24 query heads but only 4 key/value heads (head_dim 256) — that's Grouped-Query Attention: 6 query heads share each K/V head, so the KV cache is 6× smaller than full multi-head would be. This single config choice is why you can fit long contexts; you'll do the memory math in Lesson 10.
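The 6× figure can be checked with a few lines of arithmetic. The head counts and head_dim come from the text above; two things are assumptions for illustration: that cache entries take 2 bytes each (a 16-bit format), and that query heads are grouped contiguously (heads 0 to 5 share K/V head 0, and so on), which is a common convention but is not confirmed here.

```python
N_QUERY_HEADS = 24
N_KV_HEADS = 4
HEAD_DIM = 256
BYTES_PER_VALUE = 2   # assumption: 16-bit cache entries

group_size = N_QUERY_HEADS // N_KV_HEADS   # query heads per K/V head

def kv_head_for(query_head):
    """Which K/V head a query head reads from (assumed contiguous grouping)."""
    return query_head // group_size

def kv_bytes_per_token_per_layer(n_kv_heads):
    """One Key and one Value vector per K/V head, for one token in one layer."""
    return 2 * n_kv_heads * HEAD_DIM * BYTES_PER_VALUE

gqa_bytes = kv_bytes_per_token_per_layer(N_KV_HEADS)      # 4 K/V heads
mha_bytes = kv_bytes_per_token_per_layer(N_QUERY_HEADS)   # if every query head had its own K/V

print(group_size)                                  # 6
print([kv_head_for(h) for h in range(N_QUERY_HEADS)])
print(gqa_bytes, mha_bytes, mha_bytes // gqa_bytes)   # 4096 24576 6
```

The ratio does not depend on the byte width or head_dim; it is just the number of query heads divided by the number of K/V heads.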

Advanced note: Qwen3.6 is a hybrid: most layers use a cheap linear attention, with full attention only every 4th layer, a frontier trick to dodge the O(n²) wall. The foundation here (full attention) is what those variants optimize.

Read/watch this next (primary source): Attention in transformers, visually (3Blue1Brown). Runnable companion: day04 notebook, which builds attention from scratch with the causal mask.

Check yourself (recall, don't peek)

References
  1. Attention Is All You Need — Vaswani et al. (1706.03762); day04 (attention.ipynb).
  2. FlashAttention — Dao et al. (2205.14135); GQA — Ainslie et al. (2305.13245).