Before prefill or decode: how your text becomes the integers a model actually runs on.
Extend the Faraway Pantry: the kitchen works off a fixed numbered menu — every dish has an index. A customer's free-form order ("the faraway pantry…") never reaches the line as words; a host first rewrites it into menu numbers. The cook only ever sees numbers. That host is the tokenizer; the menu is the vocabulary; off-menu requests get written as a combo of the closest menu items.
| in the kitchen | in the model |
|---|---|
| the numbered menu | vocabulary: the ~248k fixed pieces the model knows |
| the host translating the order | tokenizer: text → integer IDs (and back) |
| a combo for an off-menu dish | subword split: rare words become several pieces |
A model can't do math on letters. Tokenization chops your text into pieces from a fixed vocabulary, then replaces each piece with its integer ID (its index in that vocabulary). Those integers are all the model ever sees.[1]
Two obvious schemes both fail. Characters give a tiny vocabulary but enormous sequences (slow, and the model must relearn spelling everywhere). Whole words give short sequences but a giant vocabulary that still breaks on any word it never saw (an "out-of-vocabulary" hole). Subwords are the sweet spot: common words are a single piece, rare ones split into a few, and nothing is ever out-of-vocabulary, because you can always fall back to smaller pieces (down to raw bytes).[1][2]
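To make the three schemes concrete, here is a minimal sketch. Everything in it is invented for illustration: the two tiny vocabularies, the `<unk>` placeholder, and the greedy longest-match splitter, which is a simplified stand-in and not the BPE procedure a real tokenizer uses. It only shows the trade-off: characters never fail but run long, words fail on anything unseen, and subwords with a small-piece fallback never fail.

```python
# Toy comparison of character, word, and subword tokenization.
# The vocabularies below are made up for this example.

def char_tokens(text: str) -> list[str]:
    """One token per character: tiny vocabulary, long sequences."""
    return list(text)

def word_tokens(text: str, vocab: set[str]) -> list[str]:
    """One token per whitespace-separated word; unseen words become <unk>."""
    return [w if w in vocab else "<unk>" for w in text.split()]

def subword_tokens(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split with a single-character fallback.

    Simplification (not BPE): it only demonstrates that falling back to
    smaller pieces means no input is ever out-of-vocabulary.
    """
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first; a single character always matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

WORD_VOCAB = {"the", "cook", "is", "running"}
SUBWORD_VOCAB = {"running", "token", "ization", "ing"}

text = "the cook is tokenizing"
print(len(char_tokens(text)), "character tokens")
print(word_tokens(text, WORD_VOCAB))                   # last word is <unk>
print(subword_tokens("running", SUBWORD_VOCAB))        # common word: one piece
print(subword_tokens("tokenization", SUBWORD_VOCAB))   # rarer word: a few pieces
print(subword_tokens("tokenizing", SUBWORD_VOCAB))     # unseen word still splits
```

The integer ID of each piece would simply be its index in the vocabulary, so the same lookup works in reverse to turn IDs back into text.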
On your Qwen tokenizer, running is a single token, while antidisestablishmentarianism is 6, and a never-before-seen string still tokenizes, because it can always fall back to smaller pieces.
The dominant method is byte-level Byte-Pair Encoding.[2] Training: start from raw bytes, then repeatedly find the most frequent adjacent pair and merge it into a new token, recording the rule. Do that thousands of times and you've grown a vocabulary of useful subwords. Tokenizing new text just replays those merge rules greedily, in the order they were learned.
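The training loop and the replay step can be sketched in a few lines. This is a toy under stated assumptions, not how a production tokenizer is built: it treats the corpus as one byte stream (real tokenizers pre-split text into chunks first), it stops when no pair repeats, and the names `train_bpe`, `apply_merge`, and `encode` are made up for this example.

```python
from collections import Counter

def apply_merge(seq: list[bytes], pair: tuple[bytes, bytes]) -> list[bytes]:
    """Replace each non-overlapping occurrence of `pair` with one merged piece."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(corpus: bytes, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn merge rules by repeatedly merging the most frequent adjacent pair."""
    seq = [bytes([b]) for b in corpus]        # start from raw bytes
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break                             # nothing repeats; stop early
        merges.append(best)                   # record the rule
        seq = apply_merge(seq, best)
    return merges

def encode(text: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Tokenize new text by replaying the learned merges in training order."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq

corpus = b"running running running runner"
merges = train_bpe(corpus, num_merges=10)
print(merges)
print(encode("running", merges))   # seen text collapses into fewer pieces
print(encode("🍓", merges))         # unseen text stays as raw bytes
```

Joining the pieces always gives back the original bytes, which is why nothing is ever out-of-vocabulary: the worst case is one piece per byte.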
Two consequences you'll see in the raw output. The tokenizer works on bytes, so a
leading space is part of the next piece — shown as Ġ (and a newline as Ċ). And
anything outside the learned vocabulary still encodes as raw UTF-8 bytes: a common CJK
word like 你好 is one learned token, while 🍓 falls back to several byte pieces.[2]
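Where do Ġ and Ċ come from? A sketch, assuming your tokenizer uses the GPT-2 style byte-to-printable-character table (an assumption, though it matches the Ġ and Ċ symbols above): every byte gets a visible stand-in character so that spaces and control bytes can appear inside vocabulary entries. The helper names here are made up for the example.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map each of the 256 byte values to a printable character.

    Printable bytes keep their own character; the rest (space, newline,
    control bytes, and so on) are shifted to code points from 256 upward.
    """
    keep = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    byte_values = keep[:]
    code_points = keep[:]
    shifted = 0
    for b in range(256):
        if b not in keep:
            byte_values.append(b)
            code_points.append(256 + shifted)
            shifted += 1
    return dict(zip(byte_values, map(chr, code_points)))

TABLE = bytes_to_unicode()

def visible(text: str) -> str:
    """Show text the way byte-level pieces are usually displayed."""
    return "".join(TABLE[b] for b in text.encode("utf-8"))

print(visible(" strawberry"))         # leading space shows up as Ġ
print(visible("\n"))                  # newline shows up as Ċ
print(len("你好".encode("utf-8")))     # 6 raw bytes behind the word
print(len("🍓".encode("utf-8")))       # 4 raw bytes behind the emoji
```

The byte count is only an upper bound on the number of pieces. Learned merges can fuse bytes, which is presumably how the 6 bytes of 你好 end up as one token and the 4 bytes of 🍓 as fewer than four pieces.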
Every quantity in this course is per token. Prefill processes N input tokens (L1); decode emits one token per step (L3); the KV cache stores per token; the context window is a token budget; throughput is tokens/sec; and the bill is per token. Get tokenization, and every later number has a unit.
Ask a model how many r's are in "strawberry" and it often miscounts. Now you
know why: with a leading space, strawberry is a single token (ID 70135) — the
model receives one opaque integer, not the letters s-t-r-a-w-b-e-r-r-y. It can't see inside a
token any more than the cook can read the original order off a menu number. Spelling, rhyming, and digit
math are all hard for exactly this reason.
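A small sketch of the mismatch, using only the numbers given in this lesson (ID 70135 for the spaced form, and the split str · aw · berry for the unspaced form). The variable names are made up, and the IDs of the three smaller pieces are left out because the lesson does not state them.

```python
word = "strawberry"

# With the characters in front of you, counting is trivial.
print(word.count("r"))

# What the model receives for " strawberry": one opaque integer.
with_space_ids = [70135]
print(with_space_ids)

# Without the leading space the word arrives as three pieces, and the
# r's are spread across them with no letter-level view of any piece.
no_space_pieces = ["str", "aw", "berry"]
r_per_piece = [piece.count("r") for piece in no_space_pieces]
print(r_per_piece, sum(r_per_piece))
```

Python can count inside each piece because it still has the strings. The model only has the integers, so it has to have memorized how each token is spelled.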
| text | tokens | pieces |
|---|---|---|
| hello | 1 | hello |
| strawberry (with space) | 1 | Ġstrawberry · id 70135 |
| strawberry (no space) | 3 | str · aw · berry |
| tokenization | 2 | token · ization |
| H100 | 4 | H · 1 · 0 · 0 |
| 1234567890 | 10 | every digit is its own token |
| 你好 | 1 | one multilingual token |
| 🍓 | 3 | byte fallback (3 pieces) |
The prompt isn't raw text. Your "What is tokenization?" is
5 tokens of content — but wrapped in Qwen's chat
template it becomes 15 tokens actually prefilled:
<|im_start|> user … <|im_end|> … <|im_start|> assistant … <think>
(Qwen3.6 even auto-opens a reasoning block). That wrapper is fixed overhead on every turn.
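The overhead is simple arithmetic on the two numbers above. This sketch assumes the wrapper costs the same number of tokens for any single-turn prompt, which is what "fixed overhead" suggests. The function names and the example prompt sizes are made up.

```python
CONTENT_TOKENS = 5       # "What is tokenization?" on its own, per this lesson
PREFILLED_TOKENS = 15    # the same prompt inside the chat template, per this lesson

# Tokens added by the template wrapper alone.
WRAPPER_OVERHEAD = PREFILLED_TOKENS - CONTENT_TOKENS

def prefilled_tokens(content_tokens: int) -> int:
    """Tokens actually prefilled for one turn (assumes a constant wrapper)."""
    return content_tokens + WRAPPER_OVERHEAD

def overhead_share(content_tokens: int) -> float:
    """Fraction of the prefilled tokens that is template wrapper."""
    return WRAPPER_OVERHEAD / prefilled_tokens(content_tokens)

for n in (5, 50, 500):   # hypothetical prompt sizes
    print(n, prefilled_tokens(n), round(overhead_share(n), 3))
```

For a tiny prompt the wrapper dominates; for a long one it fades into the noise.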
Takeaways for you: numbers and code are token-expensive (digits split 1:1);
your context window is 131,072 tokens; and your measured 19–47:1 prompt:generation ratio
is counted in these tokens. Pull your own: curl localhost:8000/tokenize (the
vllm:prompt_tokens_total metric is this step, summed).
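If you would rather script it than curl it, here is a hedged sketch using only the standard library. It assumes the server's /tokenize endpoint accepts a POST with a JSON body containing `model` and `prompt`, and answers with JSON that includes a `count` field and a `tokens` list; check your server version, since that request and response shape is an assumption here. The model name you pass is whatever your server is serving.

```python
import json
import urllib.request

def build_tokenize_request(
    prompt: str,
    model: str,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a POST request for the /tokenize endpoint (assumed JSON shape)."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/tokenize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def count_tokens(prompt: str, model: str, base_url: str = "http://localhost:8000") -> int:
    """Send the request and return the token count reported by the server.

    Assumes the response JSON has a "count" field; adjust if yours differs.
    """
    request = build_tokenize_request(prompt, model, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return payload["count"]
```

With the server running, calling `count_tokens("What is tokenization?", model=...)` with your served model name should report the content-only count, since this path sends the raw prompt and not the chat-template-wrapped one.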