Embeddings

From integer to meaning — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.

Today's win: you'll predict how a bare token ID becomes a vector that carries meaning and position — the real input to the transformer.

The setup

Tokenization gave you integers (token 70135). But integers carry no meaning — 70135 isn't "bigger" or "more like a fruit" than 70134. So how does the model get meaning out of a number?

Step 1 — the mechanism

Step 2 — what does the vector encode?

Step 3 — where does the meaning come from?

Recall — cover the screen: what is an embedding?
A learned vector for each token, fetched by indexing the embedding table (token ID → row → vector). Related tokens land near each other in this space; the map is learned during training. It's the real input the transformer computes on.

Step 4 — what's still missing?

Step 5 — your model's table (real)

Recall — say it: input vector = ? + ?
input = meaning (token embedding, looked up) + position (a learned position vector added in GPT-2, or a RoPE rotation applied inside attention in Qwen/LLaMA). The model needs both: what the token is, and where it sits.

On YOUR cluster from config.json

Qwen3.6 embedding table: 248,320 × 5,120 (a 5,120-dim vector per token). Position via RoPE (rope_theta 1e7, partial rotary factor 0.25), supporting a 262,144-token context.
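To get a feel for those shapes, here is a small arithmetic sketch. The two table shapes are the ones quoted in this lesson; the 2-bytes-per-value figure is an assumption (bf16 or fp16 storage), not something read from config.json.

```python
# Table shapes quoted in the lesson.
QWEN_VOCAB, QWEN_DIM = 248_320, 5_120
GPT2_VOCAB, GPT2_DIM = 50_257, 768
BYTES_PER_VALUE = 2  # assumption: weights stored as bf16/fp16


def table_params(vocab_size: int, dim: int) -> int:
    """Number of learned values in a vocab_size x dim embedding table."""
    return vocab_size * dim


qwen_params = table_params(QWEN_VOCAB, QWEN_DIM)
gpt2_params = table_params(GPT2_VOCAB, GPT2_DIM)

print(f"Qwen table:  {qwen_params:,} values")   # 1,271,398,400 values
print(f"GPT-2 table: {gpt2_params:,} values")   # 38,597,376 values

# Rough memory footprint of the Qwen table under the 2-byte assumption.
qwen_gib = qwen_params * BYTES_PER_VALUE / 2**30
print(f"Qwen table at {BYTES_PER_VALUE} bytes/value: {qwen_gib:.2f} GiB")
```

The takeaway: the embedding table alone is over a billion learned numbers, before any transformer block has run.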

Read this next — primary source HF: embeddings · runnable: day03 notebook · RoPE paper.

Final check — teach it back

Explain to a colleague: "An embedding turns a token ID into…"
…a learned dense vector (via a table lookup) that places the token in a meaning-space where related tokens are near each other. Position info is then added so the model knows order. That sequence of vectors is what attention reads.

References

day03 — Embeddings (notebook); HF embeddings; RoPE.

Embeddings

How a bare integer token becomes a vector that means something.

Today's win: you'll explain how a token ID — just a number — becomes a dense vector that carries meaning and position, which is the real input the transformer blocks work on.

The picture: a menu number → a full flavor profile

From Lesson 3 a token is a bare menu number (e.g. 70135). A number alone tells the cook nothing about the dish. So the first thing inside the model is a lookup: turn each number into its flavor profile — a long list of learned attributes (a vector). Similar dishes get similar profiles.

looking the number up on the menu	embedding lookup — ID → vector
the dish's flavor profile	the embedding vector (768 dims in GPT-2, 5,120 in Qwen)
where the item sits in the order	position information (added, or RoPE)

1 · ID → vector is just a table lookup

The model holds one giant embedding table (the tensor wte from Lesson 2): one row per vocabulary token. To embed a token, you index its row. Token 70135 → row 70135 → a vector. No math on the digits, just a lookup. [1]

Pantry: the cook reads "item 70135" and flips to that row of the menu to get its full description. The number was just an address.

An embedding is a row of a learned table. The integer is just the row index; the vector is what the model actually computes with.
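The lookup can be sketched in a few lines of NumPy. This is a toy under stated assumptions: the table here is random rather than learned, the sizes are tiny, and the token IDs are hypothetical. Only the name wte comes from the lesson.

```python
import numpy as np

# Toy sizes so the example runs instantly; real tables are far larger.
VOCAB_SIZE, DIM = 10, 4

rng = np.random.default_rng(0)
# Stand-in for the learned table (wte): one row per vocabulary token.
wte = rng.normal(size=(VOCAB_SIZE, DIM))


def embed(token_ids, table):
    """Return one row of `table` per token ID (pure indexing, no arithmetic)."""
    return table[np.asarray(token_ids)]


token_ids = [7, 2, 7]  # hypothetical IDs; the repeated 7 fetches the same row twice
vectors = embed(token_ids, wte)

print(vectors.shape)                            # (3, 4)
print(np.array_equal(vectors[0], wte[7]))       # True: the ID was just an address
print(np.array_equal(vectors[0], vectors[2]))   # True: same ID, same vector
```

Note that the same ID always returns the same row, wherever it appears in the prompt. The lookup by itself knows nothing about order.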

2 · The vector carries meaning — a learned map

Those numbers aren't random: training places tokens with similar meaning near each other in the vector space. "pod", "node", "container" cluster together; "king", "queen" sit elsewhere. The smaller the distance (or the higher the cosine similarity), the more related the tokens. [1]

Pantry: dishes with similar flavor profiles end up on the same shelf — so the cook can reason "this is like that" without being told.

Training learns this map. Because related tokens are near each other, the model can generalize — the heart of why embeddings work.
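"Near each other" is usually measured with cosine similarity. The sketch below uses hand-made 3-dim vectors purely for illustration; they are not real model embeddings, and the values were chosen so that "pod" and "node" point in a similar direction.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hand-made toy vectors (hypothetical, NOT taken from any model).
toy = {
    "pod":   [0.9, 0.8, 0.1],
    "node":  [0.8, 0.9, 0.2],
    "queen": [0.1, 0.2, 0.9],
}

sim_related = cosine_similarity(toy["pod"], toy["node"])
sim_unrelated = cosine_similarity(toy["pod"], toy["queen"])

print(f"pod vs node:  {sim_related:.2f}")
print(f"pod vs queen: {sim_unrelated:.2f}")
print(sim_related > sim_unrelated)   # True
```

With real embedding rows the same function applies unchanged; only the vectors come from the trained table instead of being typed in by hand.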

3 · Order matters — add position

"pod restarts node" means something different from "node restarts pod": same tokens, different order. So the model also injects position. GPT-2 adds a learned position vector (wpe) to each token embedding; modern models like Qwen use RoPE (rotary position embeddings), which rotates the query and key vectors inside attention by an angle that depends on position. Either way, the model gets both: meaning + where-it-sits. [2]

Pantry: the flavor profile says what the item is; the position tag says when in the order it arrives. The cook needs both.

Meaning + position = the per-token input. A whole prompt becomes a sequence of these vectors — what attention reads next.
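The GPT-2 flavor of "meaning + position" is a plain vector sum. A minimal sketch, assuming random stand-ins for the learned tables wte and wpe, toy sizes, and hypothetical token IDs:

```python
import numpy as np

VOCAB_SIZE, MAX_POSITIONS, DIM = 10, 8, 4   # toy sizes
rng = np.random.default_rng(0)
wte = rng.normal(size=(VOCAB_SIZE, DIM))      # token table: what the token is
wpe = rng.normal(size=(MAX_POSITIONS, DIM))   # position table: where it sits


def gpt2_style_input(token_ids):
    """Per-token input = token embedding row + position embedding row."""
    ids = np.asarray(token_ids)
    positions = np.arange(len(ids))
    return wte[ids] + wpe[positions]


a = gpt2_style_input([3, 5, 1])   # hypothetical IDs standing in for "pod restarts node"
b = gpt2_style_input([1, 5, 3])   # same tokens, reversed order

print(a.shape)                    # (3, 4)
print(np.allclose(a[1], b[1]))    # True: same token at the same position
print(np.allclose(a[0], b[2]))    # False: same token (3) at a different position
```

The last line is the point: once position is mixed in, the same token produces a different input vector depending on where it sits, so the two orderings are no longer interchangeable.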

On YOUR cluster from config.json

Qwen3.6's embedding table is 248,320 × 5,120: a row for every vocab token, each a 5,120-dim vector (vs GPT-2's 50,257 × 768). It uses RoPE (rotary, rope_theta 10,000,000, a partial rotary factor of 0.25, so only a quarter of each attention head's dimensions are rotated), and the large rope_theta is part of what lets it stretch to a 262,144-token context. Position isn't stored per-row; it's applied on the fly.
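For the curious, here is a rough sketch of what "rotate by position" means. It is a simplified illustration, not the model's actual implementation: the head size of 16 is a toy value, the vectors are random, and pairing adjacent dimensions is an assumption (real libraries often pair dimensions differently). Only rope_theta and the partial rotary factor come from the numbers above.

```python
import numpy as np


def rope(x, position, theta=10_000_000.0, partial_rotary_factor=0.25):
    """Rotate the first share of a 1-D vector x by position-dependent angles.

    Dimensions are rotated in adjacent pairs (0,1), (2,3), ...; the remaining
    dimensions pass through unchanged.
    """
    x = np.asarray(x, dtype=float)
    rotary_dim = int(x.shape[-1] * partial_rotary_factor)
    rotary_dim -= rotary_dim % 2                     # pairs need an even count
    # One frequency per pair: theta^(-2i / rotary_dim)
    inv_freq = theta ** (-np.arange(0, rotary_dim, 2) / rotary_dim)
    angles = position * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    out = x.copy()
    even, odd = x[0:rotary_dim:2], x[1:rotary_dim:2]
    out[0:rotary_dim:2] = even * cos - odd * sin     # standard 2-D rotation
    out[1:rotary_dim:2] = even * sin + odd * cos
    return out


rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)   # toy query/key for one head

score_a = rope(q, 5) @ rope(k, 2)        # positions 5 and 2 (offset 3)
score_b = rope(q, 105) @ rope(k, 102)    # both shifted by 100 (offset still 3)

print(np.isclose(score_a, score_b))          # True: only the offset matters
print(np.allclose(rope(q, 0), q))            # True: position 0 means no rotation
print(np.allclose(rope(q, 7)[4:], q[4:]))    # True: 12 of 16 dims left untouched
```

This is why nothing needs to be stored per row: the rotation is computed from the position when attention runs, and the query-key score ends up depending on how far apart two tokens are rather than on their absolute positions.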

Read this next — primary source Getting started with embeddings — Hugging Face. Runnable companion: day03 notebook — builds the lookup and inspects real GPT-2 embedding rows.

Check yourself (recall, don't peek)

References

[1] Embeddings: day03 (embeddings.ipynb); HF: getting started with embeddings.
[2] Rotary Position Embedding (RoPE): Su et al. (2104.09864).