Inference Engineering · Lesson 4 · Embeddings

Embeddings

From integer to meaning — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict how a bare token ID becomes a vector that carries meaning and position — the real input to the transformer.

The setup

Tokenization gave you integers (token 70135). But integers carry no meaning — 70135 isn't "bigger" or "more like a fruit" than 70134. So how does the model get meaning out of a number?

Step 1 — the mechanism

Step 2 — what does the vector encode?

Step 3 — where does the meaning come from?

Recall — cover the screen: what is an embedding?
A learned vector for each token, fetched by indexing the embedding table (token ID → row → vector). Related tokens land near each other in this space; the map is learned during training. It's the real input the transformer computes on.

Step 4 — what's still missing?

Step 5 — your model's table (real)

Recall — say it: input vector = ? + ?
input = meaning (token embedding, looked up) + position (a learned position vector added in GPT-2, or a RoPE rotation applied inside attention in Qwen/LLaMA). The model needs both: what the token is, and where it sits.

On YOUR cluster (from config.json)

Qwen3.6 embedding table: 248,320 × 5,120 (a 5,120-dim vector per token). Position via RoPE (rope_theta 1e7, partial rotary factor 0.25), supporting a 262,144-token context.
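To make the table size concrete, here is a small worked calculation using the two dimensions quoted above. The 2-bytes-per-value figure is an assumption (a 16-bit storage format); the lesson does not say how the weights are stored.

```python
# Dimensions quoted in the lesson for the Qwen3.6 embedding table.
vocab_size = 248_320
hidden_dim = 5_120

# One learned number per (token, dimension) pair.
params = vocab_size * hidden_dim
print(f"{params:,} parameters")          # 1,271,398,400 parameters

# ASSUMPTION: 2 bytes per value (a 16-bit format); not stated in the lesson.
bytes_per_value = 2
size_gib = params * bytes_per_value / 2**30
print(f"{size_gib:.2f} GiB")             # 2.37 GiB
```

Under that assumption, the embedding table alone is roughly 1.27 billion parameters, before any transformer block is counted.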

Read this next — primary source HF: embeddings · runnable: day03 notebook · RoPE paper.

Final check — teach it back

Explain to a colleague: "An embedding turns a token ID into…"
…a learned dense vector (via a table lookup) that places the token in a meaning-space where related tokens are near each other; position information is then injected so the model knows order. That sequence of vectors is what attention reads.

Embeddings

How a bare integer token becomes a vector that means something.

Today's win: you'll explain how a token ID — just a number — becomes a dense vector that carries meaning and position, which is the real input the transformer blocks work on.

The picture: a menu number → a full flavor profile

From Lesson 3 a token is a bare menu number (e.g. 70135). A number alone tells the cook nothing about the dish. So the first thing inside the model is a lookup: turn each number into its flavor profile — a long list of learned attributes (a vector). Similar dishes get similar profiles.

looking the number up on the menu = embedding lookup (ID → vector)
the dish's flavor profile = the embedding vector (768 dims in GPT-2, 5,120 in Qwen)
where the item sits in the order = position information (added, or RoPE)

1 · ID → vector is just a table lookup

The model holds one giant embedding table (the tensor wte from Lesson 2): one row per vocabulary token. To embed a token, you index its row. Token 70135 → row 70135 → a vector. No math on the digits, just a lookup. [1]

Pantry: the cook reads "item 70135" and flips to that row of the menu to get its full description. The number was just an address.
[Figure: token 70135 (" strawberry") indexes row 70135 of the embedding table (wte, 248,320 rows) and returns the token's learned vector, [0.12, -0.04, 0.88, … ], 5,120 numbers.]
An embedding is a row of a learned table. The integer is just the row index; the vector is what the model actually computes with.
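The lookup described above can be sketched in a few lines of plain Python. The names embedding_table and embed, the toy sizes, and the random values are assumptions for illustration; in a real model the rows are learned and the table is far larger.

```python
import random

# Toy sizes, assumed for illustration; the lesson's real table is 248,320 x 5,120.
VOCAB_SIZE = 10
EMBED_DIM = 4

random.seed(0)
# One row per vocabulary token. Real rows are learned in training, not random.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Return one vector per token ID by indexing rows of the table."""
    return [embedding_table[token_id] for token_id in token_ids]

token_ids = [7, 2, 7]                 # hypothetical IDs from a tokenizer
vectors = embed(token_ids)

print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: same ID, same row
```

Note that the ID is never used arithmetically: it only selects a row, which is why the same ID always yields the same vector.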

2 · The vector carries meaning — a learned map

Those numbers aren't random: training places tokens with similar meaning near each other in the vector space. "pod", "node", "container" cluster together; "king", "queen" sit elsewhere. Distance = relatedness. [1]

Pantry: dishes with similar flavor profiles end up on the same shelf — so the cook can reason "this is like that" without being told.
[Figure: embedding space (2-D shadow of 5,120-D), near = related. Three clusters: your ops vocab (pod, node, container), royalty (king, queen), code (def, return).]
Training learns this map. Because related tokens are near each other, the model can generalize — the heart of why embeddings work.
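One common way to measure "near" in this space is cosine similarity. The sketch below uses hand-made 3-D toy vectors, not real model embeddings; the values were chosen only so that the two ops words point in a similar direction and the royalty words point elsewhere.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# ASSUMPTION: hand-made toy vectors, NOT taken from any trained model.
toy = {
    "pod":   [0.9, 0.8, 0.1],
    "node":  [0.8, 0.9, 0.2],
    "king":  [0.1, 0.2, 0.9],
    "queen": [0.2, 0.1, 0.9],
}

related = cosine_similarity(toy["pod"], toy["node"])
unrelated = cosine_similarity(toy["pod"], toy["king"])

print(f"pod vs node: {related:.3f}")
print(f"pod vs king: {unrelated:.3f}")
print(related > unrelated)  # True for these toy vectors
```

With real embeddings the same function applies unchanged; only the vectors come from the model's table instead of being written by hand.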

3 · Order matters — add position

"pod restarts node" means something different from "node restarts pod": same tokens, different order. So the model also injects position. GPT-2 adds a learned position vector (wpe) to each token embedding; modern models like Qwen use RoPE (rotary embeddings), which rotates the query and key vectors inside attention by an angle that depends on position. Either way, the model ends up with meaning + where-it-sits. [2]

Pantry: the flavor profile says what the item is; the position tag says when in the order it arrives. The cook needs both.
[Figure: token vector (meaning) + position (wpe / RoPE) → input vector → blocks. One such vector per token; the sequence flows into attention (Lesson 5).]
Meaning + position = the per-token input. A whole prompt becomes a sequence of these vectors — what attention reads next.
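For the GPT-2 style of position handling, "meaning + position" is literally an element-wise sum of two looked-up rows. The following is a minimal sketch with toy sizes and random tables; only the names wte and wpe come from the lesson, everything else is assumed for illustration.

```python
import random

random.seed(0)
# Toy sizes, assumed for illustration only.
VOCAB_SIZE, MAX_POSITIONS, EMBED_DIM = 10, 8, 4

def random_table(rows, cols):
    """Stand-in for a learned table; real values come from training."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

wte = random_table(VOCAB_SIZE, EMBED_DIM)     # token embeddings: what the token is
wpe = random_table(MAX_POSITIONS, EMBED_DIM)  # position embeddings: where it sits

def embed_with_position(token_ids):
    """Per token: row of wte (by ID) plus row of wpe (by position index)."""
    return [
        [t + p for t, p in zip(wte[token_id], wpe[position])]
        for position, token_id in enumerate(token_ids)
    ]

a = embed_with_position([3, 5, 7])  # hypothetical token IDs
b = embed_with_position([7, 5, 3])  # same tokens, reversed order

print(a[0] == b[2])  # False: token 3 at position 0 vs position 2
print(a[1] == b[1])  # True: token 5 at position 1 in both
```

The same token produces a different input vector at a different position, which is how the model can tell the two orderings apart.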

On YOUR cluster (from config.json)

Qwen3.6's embedding table is 248,320 × 5,120: a row for every vocab token, each a 5,120-dim vector (vs GPT-2's 50,257 × 768). It uses RoPE (rotary, rope_theta 10,000,000, a partial rotary factor of 0.25), which helps it support a 262,144-token context. Position isn't stored per-row; it's applied on the fly.
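To show what "applied on the fly" can look like, here is a rough sketch of a rotary rotation with a partial factor, following the adjacent-pair formulation in the RoPE paper. It is not Qwen's actual code: the function name rope, the 16-dim toy vectors, and the pairing of adjacent dimensions are assumptions (some implementations pair dimensions differently). Only the base 10,000,000 and the factor 0.25 come from the config values above.

```python
import math

def rope(vec, position, base=10_000_000.0, partial_factor=0.25):
    """Rotate the first partial_factor share of dims, pair by pair; keep the rest."""
    rotary_dim = int(len(vec) * partial_factor)
    rotary_dim -= rotary_dim % 2              # pairs need an even count
    out = list(vec)
    for i in range(0, rotary_dim, 2):
        freq = base ** (-i / rotary_dim)      # later pairs rotate more slowly
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s                # standard 2-D rotation
        out[i + 1] = x * s + y * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query/key vectors, assumed for illustration (16 dims, so 4 get rotated).
q = [math.sin(i + 1) for i in range(16)]
k = [math.cos(i + 1) for i in range(16)]

# The attention score depends only on the distance between positions.
score_near_start = dot(rope(q, 5), rope(k, 3))     # positions 5 and 3
score_far_along = dot(rope(q, 105), rope(k, 103))  # same gap of 2, shifted by 100
print(abs(score_near_start - score_far_along) < 1e-9)  # True
```

Because the query-key score depends only on relative offset, nothing position-specific has to be stored in the embedding table; the rotation is computed from the position index when needed.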

Read this next — primary source Getting started with embeddings — Hugging Face. Runnable companion: day03 notebook — builds the lookup and inspects real GPT-2 embedding rows.

Check yourself (recall, don't peek)

References
  1. Embeddings — day03 (embeddings.ipynb); HF: getting started with embeddings.
  2. Rotary Position Embedding (RoPE) — Su et al. (2104.09864).