LLM-aware routing — one guess at a time.
You have several vLLM replicas behind a load balancer. Your instinct from stateless web services is round-robin.
This is your Service/Ingress/LB layer, with three twists: balance by tokens, not connections; use consistent-hash or session affinity on the prompt prefix (not random assignment) to hit the warm KV cache; and add priority classes plus deadlines so a 10k-token job can't starve interactive chat. (Dynamo bundles all three.)
Your RAG workload is prefill-heavy (19–47:1) with prefix caching on, so prefix-hash routing
is your biggest win: it turns shared system prompts into cache hits and cuts TTFT. Add load-aware balancing and a
priority tier, and route on per-replica num_requests_waiting.
Why LLM routing isn't round-robin — and how queues protect your tail latency.
A round-robin host seats the next party at the next table regardless of size — so one waiter ends up with five big parties and another with none. LLM requests are wildly uneven (10 tokens vs 10,000), so you must route by work, not by count — and seat repeat parties where their order's already prepped (the warm prefix cache).
| Restaurant analogy | LLM routing concept |
| --- | --- |
| seat by party size, not headcount | token-aware routing (balance by load) |
| seat regulars where their prep is ready | KV-cache-aware routing (prefix affinity) |
| a managed waitlist with priorities | queueing (FIFO / priority / deadlines) |
Requests can vary by 100× or more in token count, and decode time scales with output length. Round-robin (or least-connections) ignores that, so some replicas get swamped while others idle, and tail latency explodes. Token-aware routing balances by actual load: pending tokens, KV-cache occupancy, or running sequences per replica [1].
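To make the contrast concrete, here is a minimal sketch that assigns the same uneven request stream to two replicas, once round-robin and once by least pending tokens. The `Replica` class, the `pending_tokens` counter, and the request sizes are all illustrative assumptions, and nothing drains over time: it is a static snapshot of assignment, not a serving simulation.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    pending_tokens: int = 0  # hypothetical load signal: tokens assigned so far


def pick_round_robin(replicas, i):
    # Ignores request size entirely: the i-th request goes to the i-th slot.
    return replicas[i % len(replicas)]


def pick_least_tokens(replicas):
    # Token-aware: choose the replica with the least pending work.
    return min(replicas, key=lambda r: r.pending_tokens)


def simulate(request_sizes, n_replicas, token_aware):
    replicas = [Replica(f"r{i}") for i in range(n_replicas)]
    for i, size in enumerate(request_sizes):
        if token_aware:
            target = pick_least_tokens(replicas)
        else:
            target = pick_round_robin(replicas, i)
        target.pending_tokens += size
    return [r.pending_tokens for r in replicas]


# Alternating huge and tiny requests: the worst case for round-robin.
sizes = [10_000, 10, 10_000, 10, 10_000, 10]
rr = simulate(sizes, 2, token_aware=False)
ta = simulate(sizes, 2, token_aware=True)
print("round-robin :", rr)
print("token-aware :", ta)
```

With this stream, round-robin piles every large request onto one replica, while the token-aware picker spreads them. A production router would use a live signal (queue depth, KV-cache occupancy) rather than a counter it maintains itself.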
From Lesson 14: prefix caches are per-replica. So route a request to the replica that already holds its prefix: hash the prefix and send matching requests to the same replica. That converts a cache miss (full re-prefill) into a hit (low TTFT). The same affinity trick keeps requests for a given LoRA adapter on a replica that already has it loaded [2].
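One way to sketch prefix affinity is rendezvous (highest-random-weight) hashing, a simple form of consistent hashing. The replica names, the `PREFIX_CHARS` cutoff, and hashing raw characters are assumptions for illustration; a real router would more likely hash token blocks and also check replica load.

```python
import hashlib

PREFIX_CHARS = 64  # hypothetical: how much of the prompt counts as "the prefix"


def _score(replica: str, key: str) -> int:
    # Deterministic pseudo-random weight for a (replica, prefix) pair.
    digest = hashlib.sha256(f"{replica}|{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def route_by_prefix(prompt: str, replicas: list[str]) -> str:
    # Rendezvous hashing: every replica scores the key, highest score wins.
    key = prompt[:PREFIX_CHARS]
    return max(replicas, key=lambda r: _score(r, key))


replicas = ["vllm-0", "vllm-1", "vllm-2"]  # hypothetical replica names
system = (
    "You are a support assistant for ExampleCorp. "
    "Answer using the provided context only.\n"
)

a = route_by_prefix(system + "How do I reset my password?", replicas)
b = route_by_prefix(system + "What is the refund policy?", replicas)
print(a, b)  # both prompts share the system prefix, so they land together
```

Rendezvous hashing is used here because removing a replica only remaps the prefixes that lived on it, so the other replicas keep their warm caches. Pure hashing can create hot spots when one prefix dominates traffic, which is why it is usually combined with a load check.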
When all replicas are near their knee, new requests queue. The queue discipline matters: FIFO is fair but lets a giant request block small ones, while priority queues, SLA tiers, and deadlines let you protect interactive traffic. The golden rule (from Lesson 24): keep utilization just below the knee. Past it, the queue and your P99 blow up, as the M/M/c queueing model predicts [1].
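The knee can be seen numerically with the textbook M/M/c result (the Erlang C formula). This is a rough sketch only: M/M/c assumes Poisson arrivals and exponential service times, which LLM replicas with continuous batching do not strictly satisfy, and the replica count and service rate below are made-up values.

```python
from math import factorial


def erlang_c(c: int, offered_load: float) -> float:
    """Probability an arrival has to wait in an M/M/c queue.

    offered_load is lambda / mu and must be below c for stability.
    """
    a = offered_load
    rho = a / c
    if rho >= 1:
        raise ValueError("utilization must be below 1")
    tail = a**c / factorial(c) / (1 - rho)
    head = sum(a**k / factorial(k) for k in range(c))
    return tail / (head + tail)


def mean_wait(c: int, lam: float, mu: float) -> float:
    """Mean time spent queueing: Wq = C(c, lam/mu) / (c*mu - lam)."""
    return erlang_c(c, lam / mu) / (c * mu - lam)


C_REPLICAS = 4   # hypothetical replica count
MU = 1.0         # hypothetical service rate: 1 request/second per replica

waits = {
    rho: mean_wait(C_REPLICAS, rho * C_REPLICAS * MU, MU)
    for rho in (0.5, 0.8, 0.95)
}
for rho, wq in waits.items():
    print(f"utilization {rho:.2f} -> mean queue wait {wq:.2f}s")
```

The mean wait grows slowly at moderate utilization and then climbs steeply as utilization approaches 1, which is the shape behind the "stay just below the knee" rule.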
This is your Service / Ingress / load-balancer layer — with three LLM twists: balance by tokens, not connections (requests aren't fungible); use consistent-hash / session affinity on the prompt prefix (not random) so you hit the warm KV cache; and add priority classes + deadlines at the queue so a 10k-token batch job can't starve interactive chat. A smart router (e.g. NVIDIA Dynamo) bundles all three.
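The priority-plus-deadline twist can be sketched with a small heap-based admission queue. The class names, the two priority tiers, and the drop-on-expiry policy are assumptions for illustration and do not reflect how Dynamo or any particular router implements it.

```python
import heapq
import itertools
from dataclasses import dataclass, field

INTERACTIVE, BATCH = 0, 1  # hypothetical priority classes; lower runs first


@dataclass(order=True)
class Queued:
    priority: int
    deadline: float  # absolute time by which the request must start
    seq: int         # arrival order, breaks ties FIFO
    request_id: str = field(compare=False)


class AdmissionQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def push(self, request_id, priority, deadline):
        item = Queued(priority, deadline, next(self._seq), request_id)
        heapq.heappush(self._heap, item)

    def pop(self, now):
        """Return the next request id that can still meet its deadline."""
        while self._heap:
            item = heapq.heappop(self._heap)
            if item.deadline >= now:
                return item.request_id
            # Deadline already passed: drop it (a real router would
            # report a timeout to the client instead of silently dropping).
        return None


q = AdmissionQueue()
q.push("batch-10k", BATCH, deadline=600.0)
q.push("chat-1", INTERACTIVE, deadline=2.0)
q.push("chat-2", INTERACTIVE, deadline=1.0)
q.push("chat-stale", INTERACTIVE, deadline=0.1)

order = [q.pop(now=0.5) for _ in range(4)]
print(order)
```

The batch job arrived first but is served after the live chat requests, and the request whose deadline had already passed is skipped. Strict priority like this can starve the batch tier under sustained interactive load, so real systems often add aging or a reserved share for lower tiers.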
Your RAG workload is prefill-heavy (19–47:1) with --enable-prefix-caching on
(Lesson 14), so prefix-hash routing is your
highest-value routing change: it turns those big shared system prompts into cache hits instead of repeated
prefills, cutting TTFT. Pair it with load-aware balancing across your replicas and a priority tier for
interactive traffic. Watch num_requests_waiting per replica as your routing signal.
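As a sketch of using that signal, the snippet below parses Prometheus-format metrics text for the `vllm:num_requests_waiting` gauge and picks the replica with the shortest queue. The replica names, label set, and sample values are made up, and the exact labels vLLM emits can differ between versions, so treat the parsing as an assumption to check against your own /metrics output.

```python
import re

METRIC = "vllm:num_requests_waiting"
# Matches: metric name, optional {labels}, then a numeric value.
_LINE = re.compile(
    r"^" + re.escape(METRIC) + r"(?:\{[^}]*\})?\s+([0-9.eE+-]+)\s*$"
)


def waiting_from_metrics(text: str) -> float:
    """Sum the waiting-requests gauge across label sets for one replica."""
    total = 0.0
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            total += float(match.group(1))
    return total


def least_waiting(scrapes: dict[str, str]) -> str:
    """Pick the replica whose scraped metrics show the shortest queue."""
    return min(scrapes, key=lambda name: waiting_from_metrics(scrapes[name]))


# Hypothetical scrape results; in practice you would fetch each replica's
# /metrics endpoint (or query Prometheus) on a short interval.
scrapes = {
    "vllm-0": (
        "# TYPE vllm:num_requests_waiting gauge\n"
        'vllm:num_requests_waiting{model_name="demo"} 7.0\n'
    ),
    "vllm-1": 'vllm:num_requests_waiting{model_name="demo"} 1.0\n',
    "vllm-2": 'vllm:num_requests_waiting{model_name="demo"} 4.0\n',
}

target = least_waiting(scrapes)
print(target)
```

Queue depth is a lagging signal, so a router that relies on it alone can send a burst to the same replica between scrapes. Combining it with prefix affinity (prefer the cache-warm replica unless its queue is clearly longer) is one reasonable design.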