Tokens — the Unit Underneath Everything

Before prefill or decode — worked out one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.

Today's win: you'll predict what a token is, how byte-level BPE builds one, why "strawberry" trips models, and why every number in this course is counted in tokens — checked against your live Qwen tokenizer.

The setup

A model does math on integers, not letters. So the very first thing that happens to your prompt — before any prefill — is tokenization: text is chopped into pieces and each piece becomes a number.

Step 1 — what does a tokenizer output?

Step 2 — pieces of what size?

Recall — cover the screen: what is a token, in one line?
A piece of text (usually a subword) mapped to an integer ID: its index in the model's fixed vocabulary. That integer is all the model ever sees.

Step 3 — common vs rare words

On your Qwen tokenizer, the everyday word running is one piece. What about a monster like antidisestablishmentarianism?

Step 4 — the famous failure

Ask a model how many r's are in "strawberry" and it often gets it wrong.

Recall — say it: how does byte-level BPE build its vocabulary?
Start from raw bytes; repeatedly merge the most frequent adjacent pair into a new token, saving each rule. The saved merge rules are the tokenizer; tokenizing new text just replays them.

Step 5 — what actually gets prefilled?

You send "What is tokenization?" — 5 tokens of content.

On YOUR cluster: live Qwen3.6

hello=1 · " strawberry" (leading space)=1 (id 70135) but "strawberry" (no space)=3 · tokenization=2 · H100=4 · every digit is its own token · 你好=1 but 🍓=3 byte pieces. Your context window is 131,072 tokens, and the measured 19–47:1 prompt:generation ratio is counted in these tokens. Pull your own with curl localhost:8000/tokenize.

Read this next (primary source): Let's build the GPT Tokenizer, by Andrej Karpathy · Hugging Face Tokenizers course.

Final check — teach it back

Explain to a colleague: "Tokenization is Lesson 3 because…"
…the token is the unit of everything that follows: prefill processes N tokens, decode emits one per step, KV is stored per token, the context window is a token budget, and the bill is per token. No tokens, no units for any later lesson.




Tokens — the Unit Underneath Everything

Before prefill or decode: how your text becomes the integers a model actually runs on.

Today's win: you'll explain what a token is, how byte-level BPE builds one, why "strawberry" trips models — and why every cost, latency, KV, and context number in this whole course is counted in tokens. Verified live on your own Qwen tokenizer.

The picture: the kitchen only cooks by number

Extend the Faraway Pantry: the kitchen works off a fixed numbered menu — every dish has an index. A customer's free-form order ("the faraway pantry…") never reaches the line as words; a host first rewrites it into menu numbers. The cook only ever sees numbers. That host is the tokenizer; the menu is the vocabulary; off-menu requests get written as a combo of the closest menu items.

pantry	model
the numbered menu	vocabulary: the ~248k fixed pieces the model knows
the host translating the order	tokenizer: text → integer IDs (and back)
a combo for an off-menu dish	subword split: rare words become several pieces

1 · A token is a piece of text with a number

A model can't do math on letters. Tokenization chops your text into pieces from a fixed vocabulary, then replaces each piece with its integer ID (its index in that vocabulary). That integer is all the model ever sees. [1]

Pantry: the host reads the order aloud, finds each item on the numbered menu, and writes the ticket as a list of numbers. The kitchen cooks the numbers.

Text → pieces → integer IDs. The model never sees "pantry" — it sees 65949. Everything downstream (prefill, KV, decode) operates on these integers.
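The text → pieces → IDs round trip can be sketched in a few lines. This is a toy under stated assumptions: the four-entry vocabulary and its IDs are invented for illustration, and the text is already split into pieces by hand (a real tokenizer does the splitting itself).

```python
# Hypothetical toy vocabulary: a piece's position in the list is its token ID.
VOCAB = ["the", " far", "away", " pantry"]
PIECE_TO_ID = {piece: i for i, piece in enumerate(VOCAB)}


def encode(pieces):
    """Replace each piece with its integer ID (its index in the vocabulary)."""
    return [PIECE_TO_ID[p] for p in pieces]


def decode(ids):
    """Look each ID back up and glue the pieces together."""
    return "".join(VOCAB[i] for i in ids)


pieces = ["the", " far", "away", " pantry"]  # hand-split for the sketch
ids = encode(pieces)
print(ids)          # [0, 1, 2, 3]
print(decode(ids))  # the faraway pantry
```

Only the list of integers would be handed to the model; the strings exist on the tokenizer's side of the boundary.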

2 · Why subwords — not whole words, not characters

Two obvious schemes both fail. Characters give a tiny vocabulary but enormous sequences (slow, and the model must relearn spelling everywhere). Whole words give short sequences but a giant vocabulary that still breaks on any word it never saw (an "out-of-vocabulary" hole). Subwords are the sweet spot: common words are a single piece, rare ones split into a few, and nothing is ever out-of-vocabulary, because you can always fall back to smaller pieces (down to raw bytes). [1, 2]

Pantry: a menu of single letters = endless tickets. A menu with every possible dish = impossibly long, and still missing tomorrow's special. A menu of components covers anything by combination.

On your Qwen tokenizer, running is a single token, while antidisestablishmentarianism is 6 tokens. A never-before-seen string still tokenizes, because it can always fall back to smaller pieces.
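The three schemes can be compared side by side. This is a rough illustration, not Qwen's tokenizer: the piece list and the word-level vocabulary are hypothetical, and the subword split uses greedy longest-match rather than BPE merges, purely to show how falling back to smaller pieces closes the out-of-vocabulary hole.

```python
WORD = "antidisestablishmentarianism"

# Character-level: tiny vocabulary, long sequence.
char_tokens = list(WORD)

# Word-level: short sequence, but an unseen word has no ID at all.
WORD_VOCAB = {"running": 0}  # hypothetical
is_oov = WORD not in WORD_VOCAB

# Subword-level: hypothetical pieces, with single characters as the fallback.
PIECES = {"running", "anti", "dis", "establish", "ment", "arian", "ism"}


def subword_split(word, pieces):
    """Greedy longest-match split; any unmatched character becomes its own piece."""
    out, i = [], 0
    longest = max(len(p) for p in pieces)
    while i < len(word):
        for size in range(min(longest, len(word) - i), 0, -1):
            candidate = word[i:i + size]
            if size == 1 or candidate in pieces:
                out.append(candidate)
                i += size
                break
    return out


print(len(char_tokens))                  # 28
print(is_oov)                            # True
print(subword_split("running", PIECES))  # ['running']
print(subword_split(WORD, PIECES))       # ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
print(subword_split("zq", PIECES))       # ['z', 'q']
```

The six pieces here are chosen to match the count measured above; the actual pieces your tokenizer produces may differ.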

3 · How it's built — byte-level BPE

The dominant method is byte-level Byte-Pair Encoding. [2] Training: start from raw bytes, then repeatedly find the most frequent adjacent pair and merge it into a new token, recording the rule. Do that thousands of times and you've grown a vocabulary of useful subwords. Tokenizing new text just replays those merge rules, in the order they were learned.

Pantry: the menu wasn't designed top-down — it grew. Whatever pair of items got ordered together most often became its own combo button, over and over.

Each step merges the most frequent adjacent pair and saves the rule. The saved rules are the tokenizer. (Real merges are learned from a huge corpus; this is a toy.)
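The merge loop described above can be written as a short toy. It is a minimal sketch, assuming a tiny corpus and plain left-to-right merging; production tokenizers add pre-tokenization rules and far larger corpora, and the function names here are illustrative only.

```python
from collections import Counter


def apply_merge(seq, a, b):
    """Replace every adjacent (a, b) in seq with the merged piece a + b."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(corpus: bytes, num_merges: int):
    """Learn up to num_merges rules, most frequent adjacent pair first."""
    seq = [bytes([b]) for b in corpus]  # start from raw bytes
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats any more, so stop early
            break
        merges.append((a, b))  # save the rule
        seq = apply_merge(seq, a, b)
    return merges


def encode(text: str, merges):
    """Tokenize new text by replaying the saved rules in learned order."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for a, b in merges:
        seq = apply_merge(seq, a, b)
    return seq


merges = train_bpe(b"aaabdaaabac", num_merges=3)
print(merges[0])               # (b'a', b'a')
print(encode("aaab", merges))  # [b'aaab']
print(encode("xyz", merges))   # [b'x', b'y', b'z']
```

Text that shares nothing with the training corpus, like "xyz", still encodes: it simply stays as single bytes.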

Two consequences you'll see in the raw output. The tokenizer works on bytes, so a leading space is part of the next piece, shown as Ġ (and a newline as Ċ). And anything outside the learned vocabulary still encodes as raw UTF-8 bytes: a common CJK word like 你好 is one learned token, while 🍓 falls back to several byte pieces. [2]
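Where Ġ and Ċ come from can be shown directly. The sketch below rebuilds the byte-to-printable-character table used by GPT-2 style byte-level tokenizers; that your Qwen tokenizer displays bytes with the same table is an assumption here, suggested by the Ġ and Ċ symbols in its output. The function names are illustrative.

```python
def byte_display_table():
    """Map each of the 256 byte values to a printable character (GPT-2 style)."""
    printable = (
        list(range(33, 127))     # '!' .. '~'
        + list(range(161, 173))  # '¡' .. '¬'
        + list(range(174, 256))  # '®' .. 'ÿ'
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (space, newline, ...) are shifted past U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


TABLE = byte_display_table()


def show(text: str) -> str:
    """Render text the way a byte-level tokenizer prints its pieces."""
    return "".join(TABLE[b] for b in text.encode("utf-8"))


print(show(" strawberry"))         # Ġstrawberry
print(show("\n"))                  # Ċ
print(len("你好".encode("utf-8")))  # 6
print(len("🍓".encode("utf-8")))    # 4
```

The byte counts are the raw material the merges start from: 你好 is 6 bytes and 🍓 is 4, and the token counts reported in this lesson (1 and 3) reflect how many merges the tokenizer learned for each.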

4 · Why this is Lesson 3 — the token is the unit of everything

Every quantity in this course is per token. Prefill processes N input tokens (L1); decode emits one token per step (L3); the KV cache stores per token; the context window is a token budget; throughput is tokens/sec; and the bill is per token. Get tokenization, and every later number has a unit.

Tokenization is upstream of the entire course. Every later lesson measures something per token, which is why this one comes before them.

The "count the r's" trap: why it matters

Ask a model how many r's are in "strawberry" and it often miscounts. Now you know why: with a leading space, strawberry is a single token (ID 70135) — the model receives one opaque integer, not the letters s-t-r-a-w-b-e-r-r-y. It can't see inside a token any more than the cook can read the original order off a menu number. Spelling, rhyming, and digit math are all hard for exactly this reason.
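The gap between the two views is easy to see in code. This small sketch uses the ID and the no-space pieces measured in this lesson; it only contrasts what a program sees with what a model receives, and makes no claim about how a model reasons internally.

```python
word = "strawberry"

# With characters in hand, counting is trivial.
print(word.count("r"))  # 3

# What the model receives for " strawberry": one integer, per this lesson.
token_ids = [70135]
# No arithmetic on 70135 recovers the letters; they live in the tokenizer's table.

# Even the no-space split scatters the r's across separate pieces.
pieces = ["str", "aw", "berry"]
per_piece = [p.count("r") for p in pieces]
print(per_piece)       # [1, 0, 2]
print(sum(per_piece))  # 3
```

To answer correctly, the model has to have learned how each token is spelled, because the spelling is never part of its input.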

On YOUR cluster: measured live on the Qwen3.6 tokenizer

text	tokens	pieces
`hello`	1	hello
`strawberry` (with space)	1	Ġstrawberry · id 70135
`strawberry` (no space)	3	str · aw · berry
`tokenization`	2	token · ization
`H100`	4	H · 1 · 0 · 0
`1234567890`	10	every digit is its own token
`你好`	1	one multilingual token
`🍓`	3	byte fallback (3 pieces)

The prompt isn't raw text. Your "What is tokenization?" is 5 tokens of content — but wrapped in Qwen's chat template it becomes 15 tokens actually prefilled: <|im_start|> user … <|im_end|> … <|im_start|> assistant … <think> (Qwen3.6 even auto-opens a reasoning block). That wrapper is fixed overhead on every turn.
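The wrapper and its cost can be sketched as follows. The exact layout (newlines, where the reasoning block opens) is an assumption based on the markers listed above; the real template ships with the tokenizer and should be treated as the source of truth. The token counts are the ones measured in this lesson.

```python
def wrap_chat(user_text: str) -> str:
    """Approximate a single-turn chat wrapper; whitespace placement is assumed."""
    return (
        "<|im_start|>user\n"
        f"{user_text}<|im_end|>\n"
        "<|im_start|>assistant\n"
        "<think>\n"
    )


wrapped = wrap_chat("What is tokenization?")
print(wrapped)

# Counts measured in this lesson, not computed here.
content_tokens = 5
prefilled_tokens = 15
overhead = prefilled_tokens - content_tokens
print(overhead)  # 10
```

For a short prompt like this one the wrapper is most of what gets prefilled; for a long prompt the same overhead becomes negligible.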

Takeaways for you: numbers and code are token-expensive (digits split 1:1); your context window is 131,072 tokens; and your measured 19–47:1 prompt:generation ratio is counted in these tokens. Pull your own: curl localhost:8000/tokenize (the vllm:prompt_tokens_total metric is this step, summed).
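The same check can be scripted. This is a hedged sketch: it assumes the server at localhost:8000 accepts a JSON POST on /tokenize with "model" and "prompt" fields, which should be confirmed against your server's documentation, and the model name in the usage comment is a placeholder.

```python
import json
import urllib.request


def build_tokenize_request(prompt, model, base_url="http://localhost:8000"):
    """Build (but do not send) a POST to /tokenize; field names are assumed."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/tokenize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokenize(prompt, model):
    """Send the request and return the parsed JSON reply as-is."""
    with urllib.request.urlopen(build_tokenize_request(prompt, model)) as resp:
        return json.load(resp)


# Usage, with the server running (model name is a placeholder):
# reply = tokenize("What is tokenization?", "your-model-name")
# print(reply)
```

Printing the whole reply first avoids guessing its field names; once you have seen the shape, pull out the token list and count.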

Read this next (primary source): Let's build the GPT Tokenizer, in which Andrej Karpathy builds byte-level BPE from scratch (the clearest treatment anywhere). Pair with the Hugging Face Tokenizers course.

Check yourself (recall, don't peek)

Picture the menu numbers, the BPE merges, and the strawberry token, then answer from memory.



References

[1] Hugging Face NLP Course, Tokenizers. huggingface.co
[2] Neural Machine Translation of Rare Words with Subword Units (BPE), Sennrich et al., 2016 (1508.07909); byte-level BPE: GPT-2 / Karpathy (video).