Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict how a bare token ID becomes a vector that
carries meaning and position — the real input to the transformer.
The setup
Tokenization gave you integers (e.g. token 70135). But the integers themselves carry no meaning: 70135 isn't
"more like a fruit" than 70134 just because it is the larger number. So how does the model get meaning out of a number?
Step 1 — the mechanism
Step 2 — what does the vector encode?
Step 3 — where does the meaning come from?
Recall (cover the screen): what is an embedding?
A learned vector for each token, fetched by indexing the embedding table (token ID → row → vector). Related tokens land near each other in this space; the map is learned during training. It's the real input the transformer computes on.
Step 4 — what's still missing?
Step 5 — your model's table (real)
Recall (say it): input vector = ? + ?
Input = meaning (the token embedding, looked up) + position (a learned position vector in GPT-2, or a RoPE rotation in Qwen/LLaMA). The model needs both: what the token is, and where it sits.
On YOUR cluster from config.json
Qwen3.6 embedding table: 248,320 × 5,120 (a 5,120-dim vector per token). Position via
RoPE (rope_theta 1e7, partial factor 0.25), enabling a 262,144-token context.
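To get a feel for those dimensions, the arithmetic below simply multiplies out the table shape quoted above. The 2-bytes-per-value figure is an assumption (a 16-bit weight format); the storage dtype is not stated here, so treat the size as a rough estimate.

```python
# Shape quoted in the text above.
vocab_size, hidden_dim = 248_320, 5_120

# One stored value per (token, dimension) cell of the table.
n_params = vocab_size * hidden_dim

# Assumption: 2 bytes per value (a 16-bit dtype); the real dtype may differ.
bytes_per_value = 2
table_gib = n_params * bytes_per_value / 2**30

print(f"{n_params:,} values, roughly {table_gib:.2f} GiB at 2 bytes each")
```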
Explain to a colleague: "An embedding turns a token ID into…"
…a learned dense vector (via a table lookup) that places the token in a meaning-space where related tokens are near each other; then position info is added so the model knows order. That sequence of vectors is what attention reads.
How a bare integer token becomes a vector that means something.
Today's win: you'll explain how a token ID — just a number — becomes a dense
vector that carries meaning and position, which is the real input the transformer blocks work on.
The picture: a menu number → a full flavor profile
From Lesson 3, a token is a bare menu number (e.g. 70135). A number alone tells the cook nothing
about the dish. So the first thing inside the model is a lookup: turn each number into its
flavor profile, a long list of learned attributes (a vector). Similar dishes get similar profiles.
Looking the number up on the menu → the embedding lookup (ID → vector)
The dish's flavor profile → the embedding vector (768 dims in GPT-2, 5,120 in Qwen)
Where the item sits in the order → position information (added, or RoPE)
1 · ID → vector is just a table lookup
The model holds one giant embedding table (the tensor wte from
Lesson 2): one row per vocabulary token. To embed
a token, you index its row. Token 70135 → row 70135 → a vector. No math on the digits — pure
lookup.1
Pantry: the cook reads "item 70135" and flips to that row of the menu to
get its full description. The number was just an address.
An embedding is a row of a learned table. The integer is just the row index; the vector is
what the model actually computes with.
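As a minimal sketch of that lookup, the NumPy snippet below indexes rows out of a tiny stand-in table. The sizes, the random values, and the token IDs are all made up for illustration; a trained model's `wte` holds learned values and is far larger.

```python
import numpy as np

# Toy sizes for illustration; a real table is far larger.
vocab_size, dim = 10, 4
rng = np.random.default_rng(0)

# Stand-in for the embedding table: one row per token ID.
# (Random here; in a trained model these values are learned.)
wte = rng.normal(size=(vocab_size, dim))

token_ids = np.array([7, 1, 3])  # hypothetical IDs from a tokenizer

# Embedding = row indexing. No arithmetic is done on the ID itself.
vectors = wte[token_ids]  # shape (3, 4): one vector per token

# Equivalent to multiplying one-hot rows by the table, just cheaper.
one_hot = np.eye(vocab_size)[token_ids]
assert np.allclose(one_hot @ wte, vectors)
```

The one-hot check is only there to show why the lookup counts as a layer of the network: it is a matrix multiply in disguise, so gradients can flow back into the selected rows during training.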
2 · The vector carries meaning — a learned map
The values in each vector aren't arbitrary: training places tokens with similar meaning near each other in
the vector space. "pod", "node", "container" cluster together; "king", "queen" sit elsewhere. Closeness =
relatedness.1
Pantry: dishes with similar flavor profiles end up on the same shelf —
so the cook can reason "this is like that" without being told.
Training learns this map. Because related tokens are near each other, the model can generalize —
the heart of why embeddings work.
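One common way to measure "near each other" is cosine similarity. The sketch below uses three hand-made toy vectors, not real learned embeddings, purely to show the comparison; the `toy` dictionary and its values are invented for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made toy vectors, NOT real learned embeddings.
toy = {
    "pod":       np.array([0.9, 0.8, 0.1]),
    "container": np.array([0.8, 0.9, 0.2]),
    "queen":     np.array([0.1, 0.2, 0.9]),
}

near = cosine_similarity(toy["pod"], toy["container"])
far = cosine_similarity(toy["pod"], toy["queen"])
print(f"pod~container: {near:.2f}, pod~queen: {far:.2f}")
```

With real embeddings you would pull the two rows out of the model's table instead of typing them by hand; the comparison itself stays the same.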
3 · Order matters — add position
"pod restarts node" means something different from "node restarts pod" — same tokens, different order.
So the model also injects position. GPT-2 adds a learned position vector
(wpe) to each token embedding; modern models like Qwen use RoPE (rotary embeddings), which rotates the
query and key vectors inside attention by an angle that depends on position. Either way: input = meaning + where-it-sits.2
Pantry: the flavor profile says what the item is; the position
tag says when in the order it arrives. The cook needs both.
Meaning + position = the per-token input. A whole prompt becomes a sequence of these
vectors — what attention reads next.
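The GPT-2-style variant (token row plus position row) can be sketched in a few lines. Everything here is a toy stand-in: the table sizes, the random values, and the token IDs assigned to "pod", "restarts", and "node" are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_positions, dim = 10, 8, 4  # toy sizes

wte = rng.normal(size=(vocab_size, dim))     # token table (meaning)
wpe = rng.normal(size=(max_positions, dim))  # position table (order)

def embed(token_ids):
    """GPT-2-style input: token row plus position row, per slot."""
    token_ids = np.asarray(token_ids)
    positions = np.arange(len(token_ids))
    return wte[token_ids] + wpe[positions]

# Hypothetical IDs: 2 = "pod", 5 = "restarts", 7 = "node".
a = embed([2, 5, 7])  # "pod restarts node"
b = embed([7, 5, 2])  # "node restarts pod"

# Same tokens, different order: the two input sequences differ.
print(np.allclose(a, b))  # False
```

Only the middle slot matches between the two sequences, because "restarts" sits at position 1 in both.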
On YOUR cluster from config.json
Qwen3.6's embedding table is 248,320 × 5,120 — a row for every vocab token,
each a 5,120-dim vector (vs GPT-2's 50,257 × 768). It uses RoPE (rotary, rope_theta
10,000,000, a partial rotary factor of 0.25), which is what lets it stretch to a
262,144-token context. Position isn't stored per-row; it's applied on the fly.
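To make "applied on the fly" concrete, here is a rough sketch of a partial rotary rotation using the two config values quoted above. Several things are assumptions: the toy head size of 16, the hypothetical `rope` helper itself, and the "split halves" pairing of dimensions (some implementations pair adjacent dimensions instead). In practice the rotation is applied to the query and key vectors inside attention.

```python
import numpy as np

def rope(x, position, theta=10_000_000.0, partial_factor=0.25):
    """Rotate the first `partial_factor` share of x by position-dependent angles.

    Sketch only: pairs dimension i with i + rot_dim/2 ("split halves").
    """
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * partial_factor)  # dims that get rotated
    half = rot_dim // 2

    # One frequency per pair, from fast (1.0) down to slow.
    inv_freq = theta ** (-np.arange(half) * 2.0 / rot_dim)
    angles = position * inv_freq

    x1, x2, rest = x[:half], x[half:rot_dim], x[rot_dim:]
    rotated = np.concatenate([
        x1 * np.cos(angles) - x2 * np.sin(angles),
        x1 * np.sin(angles) + x2 * np.cos(angles),
    ])
    return np.concatenate([rotated, rest])  # `rest` passes through unchanged

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)  # toy head_dim = 16

# The q.k score depends only on the offset between the two positions
# (both pairs below are 3 apart), not on the absolute positions.
s1 = rope(q, 5) @ rope(k, 2)
s2 = rope(q, 105) @ rope(k, 102)
print(np.isclose(s1, s2))  # True
```

Nothing is looked up per position here: the angles are computed from the position number, which is why no position table needs to be stored.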