Inference Engineering · Lesson 25 · Routing, Load Balancing & Queueing

Routing, Load Balancing & Queueing

LLM-aware routing — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict why round-robin fails for LLMs, what to route on instead, and how queueing protects your tail latency.

The setup

You have several vLLM replicas behind a load balancer. Your instinct from stateless web is round-robin.

Step 1 — why round-robin struggles

Step 2 — the prefix-cache angle

Recall — cover the screen: the two LLM-specific routing rules.
Route by WORK not count (token-aware: balance by pending tokens / KV occupancy, since requests vary 100×), and route by PREFIX (KV-cache-aware: send matching prompts to the replica with the warm cache, turning a re-prefill into a hit).

Step 3 — protecting the tail

Step 4 — your highest-value lever

In Kubernetes terms (infra bridge)

Your Service/Ingress/LB layer, with three twists: balance by tokens not connections; use consistent-hash / session affinity on the prompt prefix (not random) to hit the warm KV cache; add priority classes + deadlines so a 10k-token job can't starve interactive chat. (Dynamo bundles all three.)

On YOUR cluster (your traffic)

You're 19–47:1 prefill-heavy RAG with prefix caching ON, so prefix-hash routing is your biggest win: it turns shared system prompts into cache hits, slashing TTFT. Add load-aware balancing and a priority tier, and use per-replica num_requests_waiting as the routing signal.

Read this next (primary source): runnable day24 notebook · Little's Law.

Final check — teach it back

Explain to a colleague: "Our LLM router should differ from a web LB by…"
…balancing by tokens/KV load (requests vary 100×, so count is meaningless), and routing by prompt-prefix to the replica with the warm cache (cache-aware affinity) to cut TTFT, plus priority queues so big jobs don't starve interactive traffic. Keep each replica just below its knee.
References
  1. day24 — routing/LB/queueing (notebook); Little's Law.

Routing, Load Balancing & Queueing

Why LLM routing isn't round-robin — and how queues protect your tail latency.

Today's win: you'll explain why naive round-robin load balancing fails for LLMs, how token-aware and KV-cache-aware routing fix it, and how queueing discipline defends your P99 — your ops world, now with the inference-specific twists.

The picture: seating heterogeneous parties

A round-robin host seats the next party at the next table regardless of size — so one waiter ends up with five big parties and another with none. LLM requests are wildly uneven (10 tokens vs 10,000), so you must route by work, not by count — and seat repeat parties where their order's already prepped (the warm prefix cache).

seat by party size, not headcount → token-aware routing (balance by load)
seat regulars where their prep is ready → KV-cache-aware routing (prefix affinity)
a managed waitlist with priorities → queueing (FIFO / priority / deadlines)

1 · Round-robin fails for LLMs

Requests vary 100× in token count, and decode time scales with length. Round-robin (or least-connections) ignores that, so some replicas get swamped while others idle, and tail latency explodes. Token-aware routing balances by actual load: pending tokens, KV-cache occupancy, or running sequences per replica. [1]

[Figure: two panels. Round-robin (by count, not work): replica A swamped, replica B idle. Load-aware (balanced by tokens/KV): replicas A and B evenly loaded → stable tail latency.]
Because LLM requests are so uneven, balancing by request count overloads replicas. Balance by work (tokens / KV occupancy) instead.
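To make "route by work, not by count" concrete, here is a minimal sketch that compares the two policies on a deliberately unlucky workload. The function name `assign`, the policy labels, and the workload are all made up for illustration, and nothing completes during the run, so treat it as a toy model rather than a real router.

```python
def assign(policy, request_tokens, n_replicas=4):
    """Assign requests to replicas; return pending tokens per replica.

    Simplification (assumption): nothing finishes during the run,
    so each replica's load only grows.
    """
    load = [0] * n_replicas
    for i, tokens in enumerate(request_tokens):
        if policy == "round_robin":
            target = i % n_replicas  # by count: ignores request size
        elif policy == "least_tokens":
            # by work: pick the replica with the fewest pending tokens
            target = min(range(n_replicas), key=load.__getitem__)
        else:
            raise ValueError(f"unknown policy: {policy}")
        load[target] += tokens
    return load


# Hypothetical worst case: every 4th request is a 10,000-token job,
# the rest are 10-token chats.
requests = [10_000 if i % 4 == 0 else 10 for i in range(40)]

rr = assign("round_robin", requests)
lt = assign("least_tokens", requests)
print("round-robin :", rr)  # [100000, 100, 100, 100]
print("least-tokens:", lt)
print("imbalance (max - min):", max(rr) - min(rr), "vs", max(lt) - min(lt))
```

With four replicas and a big job in every fourth slot, round-robin sends every big job to the same replica. The least-tokens policy can never let the gap between the busiest and idlest replica exceed the size of one request.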

2 · KV-cache-aware routing

From Lesson 14: prefix caches are per-replica. So route a request to the replica that already holds its prefix: hash the prefix and send matching requests to the same replica. That converts a cache miss (full re-prefill) into a hit (low TTFT). The same affinity also keeps LoRA adapters local to the replica that already has them loaded. [2]
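A minimal sketch of "hash the prefix and send matching requests to the same replica", using rendezvous (highest-random-weight) hashing so that removing a replica only remaps the prefixes that lived on it. Everything here is an assumption for illustration: the function names, the replica names, and especially the choice to hash the first 256 characters. Real cache-aware routers typically work on token blocks, not characters.

```python
import hashlib


def prefix_key(prompt: str, prefix_chars: int = 256) -> bytes:
    """Use the first prefix_chars characters as the affinity key (assumption)."""
    return prompt[:prefix_chars].encode("utf-8")


def pick_replica(prompt: str, replicas: list[str], prefix_chars: int = 256) -> str:
    """Rendezvous hashing: score every replica against the key, take the max."""
    key = prefix_key(prompt, prefix_chars)

    def score(replica: str) -> bytes:
        return hashlib.sha256(replica.encode("utf-8") + b"|" + key).digest()

    return max(replicas, key=score)


# Hypothetical shared system prompt (longer than 256 characters) and replica names.
SYSTEM = "You are a helpful RAG assistant. " * 20
replicas = ["vllm-0", "vllm-1", "vllm-2"]

a = pick_replica(SYSTEM + "What is our refund policy?", replicas)
b = pick_replica(SYSTEM + "Summarise the Q3 report.", replicas)
print(a, b, a == b)  # same replica, because the 256-char prefix is identical
```

Pure prefix affinity ignores load, so a very popular prefix can still pile onto one replica. In practice you would combine this with a load signal rather than use it alone.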

3 · Queueing protects the tail

When all replicas are near their knee, new requests queue. Discipline matters: FIFO is fair but lets a giant request block small ones; priority queues / SLA tiers / deadlines let you protect interactive traffic. The golden rule (from Lesson 24): keep utilization just below the knee. Past it, the queue and your P99 blow up, as the M/M/c queueing model predicts. [1]
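To see the knee in numbers, here is a small sketch of the standard M/M/c (Erlang C) formulas for the probability of queueing and the mean wait in queue. Treating each replica as a single exponential server is a loud simplification (a vLLM replica batches many sequences at once), and the rates below are made up, so read the shape of the curve, not the absolute values.

```python
from math import factorial


def erlang_c(c: int, offered_load: float) -> float:
    """Probability that an arriving request has to wait in an M/M/c queue.

    offered_load = arrival_rate / service_rate (in Erlangs).
    """
    rho = offered_load / c  # per-server utilization
    if rho >= 1:
        raise ValueError("unstable: utilization must be below 1")
    top = offered_load**c / factorial(c) / (1 - rho)
    bottom = sum(offered_load**k / factorial(k) for k in range(c)) + top
    return top / bottom


def mean_queue_wait(c: int, arrival_rate: float, service_rate: float) -> float:
    """Mean time spent waiting in queue (excluding service) for M/M/c."""
    p_wait = erlang_c(c, arrival_rate / service_rate)
    return p_wait / (c * service_rate - arrival_rate)


# Hypothetical setup: 4 replicas, each serving 1 request/second on average.
c, mu = 4, 1.0
for utilization in (0.5, 0.8, 0.9, 0.95, 0.99):
    lam = utilization * c * mu
    print(f"utilization {utilization:.2f}: mean queue wait {mean_queue_wait(c, lam, mu):.2f}s")
```

The wait grows slowly at moderate utilization and then climbs steeply as utilization approaches 1, which is the behaviour the "stay just below the knee" rule is guarding against.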

In Kubernetes terms (infra bridge)

This is your Service / Ingress / load-balancer layer — with three LLM twists: balance by tokens, not connections (requests aren't fungible); use consistent-hash / session affinity on the prompt prefix (not random) so you hit the warm KV cache; and add priority classes + deadlines at the queue so a 10k-token batch job can't starve interactive chat. A smart router (e.g. NVIDIA Dynamo) bundles all three.
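The "priority classes + deadlines at the queue" twist can be sketched with a heap. This is an illustrative toy, not how any particular router implements it: the class name, the two-tier priority scheme (0 = interactive, 1 = batch), and the request IDs are all hypothetical.

```python
import heapq
import itertools


class PriorityRequestQueue:
    """Strict-priority queue with per-request deadlines.

    Lower priority number = more urgent. Within a tier, order is FIFO.
    """

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves FIFO order

    def push(self, request_id: str, priority: int, deadline: float) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), deadline, request_id))

    def pop(self, now: float):
        """Return the next request that has not expired, or None if empty."""
        while self._heap:
            _priority, _seq, deadline, request_id = heapq.heappop(self._heap)
            if deadline >= now:
                return request_id
            # Expired: drop it here; a real router would answer with a timeout.
        return None


q = PriorityRequestQueue()
q.push("batch-10k-tokens", priority=1, deadline=300.0)  # arrived first
q.push("chat-stale", priority=0, deadline=1.0)
q.push("chat-fresh", priority=0, deadline=10.0)

print(q.pop(now=2.0))  # chat-fresh: chat-stale missed its deadline and was dropped
print(q.pop(now=2.0))  # batch-10k-tokens
print(q.pop(now=2.0))  # None
```

Strict priority protects chat but can starve the batch tier under sustained load; aging or a reserved share for the lower tier are common ways to soften that.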

On YOUR cluster: cache-aware routing is the big lever (your traffic)

You're 19–47:1 prefill-heavy RAG with --enable-prefix-caching on (Lesson 14), so prefix-hash routing is your highest-value routing change: it turns those big shared system prompts into cache hits instead of repeated prefills, slashing TTFT. Pair it with load-aware balancing across your replicas and a priority tier for interactive traffic. Watch num_requests_waiting per replica as your routing signal.
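One way to get that signal: vLLM exposes a Prometheus gauge named vllm:num_requests_waiting on each replica's /metrics endpoint. The sketch below only parses already-scraped text and picks the replica with the shortest wait queue. The function names, replica names, and sample payloads are hypothetical, and the parser assumes plain "name{labels} value" lines without timestamps.

```python
METRIC = "vllm:num_requests_waiting"


def requests_waiting(metrics_text: str) -> float:
    """Sum the num_requests_waiting gauge across all label sets in one scrape."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        # Match the exact metric name, followed by labels or by the value.
        if line.startswith(METRIC) and line[len(METRIC):len(METRIC) + 1] in ("{", " "):
            total += float(line.rsplit(" ", 1)[1])
    return total


def least_loaded(scrapes: dict[str, str]) -> str:
    """Return the replica name whose scrape shows the fewest waiting requests."""
    return min(scrapes, key=lambda name: requests_waiting(scrapes[name]))


# Hypothetical scrape results from two replicas.
scrapes = {
    "vllm-0": (
        "# HELP vllm:num_requests_waiting Number of requests waiting to be processed.\n"
        "# TYPE vllm:num_requests_waiting gauge\n"
        'vllm:num_requests_waiting{model_name="my-model"} 7.0\n'
    ),
    "vllm-1": (
        "# TYPE vllm:num_requests_waiting gauge\n"
        'vllm:num_requests_waiting{model_name="my-model"} 2.0\n'
    ),
}

print(least_loaded(scrapes))  # vllm-1
```

In a real router you would use this as a tie-breaker or overflow check on top of prefix affinity, not instead of it, so a hot prefix spills to another replica only when its home replica's queue is long.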

Read this next (primary source): runnable companion day24 notebook, covering token-aware & KV-aware routing and queueing models. Pair with Little's Law.

Check yourself (recall, don't peek)

← Lesson 24 — latency & SLOs Next: Lesson 26 — autoscaling →
References
  1. Routing, load balancing & queueing — day24 (routing-load-balancing-queueing.ipynb).
  2. KV-cache-aware routing — SGLang / NVIDIA Dynamo (see Lesson 14).