Where the internals meet your ops layer — predicted step by step.
Two clocks your users feel: TTFT (time to first token) and TPOT (time per output token). And one system-wide number: throughput.
Batching more requests raises throughput — but bigger batches and fuller queues raise latency.
vLLM emits exactly these: time_to_first_token (TTFT SLO), time_per_output_token (TPOT SLO), num_requests_running (live concurrency), and num_requests_waiting (the queue, your cleanest "autoscale now" signal). --max-num-seqs is the Little's Law cap; --enable-chunked-prefill protects TTFT under load.
Where the internals meet your ops layer — the knee that drives routing, autoscaling, and cost.
TTFT (time to first token) is set by prefill: processing the whole prompt. TPOT (time per output token, a.k.a. inter-token latency) is set by decode. Those are per-request latencies; throughput (tokens or requests/sec) is the system-wide rate. They are different axes, and often in tension. [1]
Batching more requests raises throughput, because the memory-bound decode weight read is shared across the batch (Lessons 9–12). But bigger batches and fuller queues also raise latency. Push load up and throughput climbs until it saturates, while latency keeps climbing, steeply. That bend is the knee. You want to run near it, not past it. [1]
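To make the knee concrete, here is a minimal sketch that picks it out of a load sweep. It assumes you have already measured (offered load, throughput) pairs on one replica; the function name find_knee, the flat_ratio heuristic, and the sweep numbers are all illustrative assumptions, not something vLLM provides.

```python
from typing import List, Tuple


def find_knee(sweep: List[Tuple[float, float]], flat_ratio: float = 0.1) -> float:
    """Return the offered load at which throughput gains flatten out.

    sweep: (offered_load, throughput) pairs, sorted by offered_load.
    flat_ratio: a segment counts as "flat" once its slope drops below this
        fraction of the first segment's slope (an assumed heuristic).
    """
    if len(sweep) < 3:
        raise ValueError("need at least three sweep points")
    (x0, y0), (x1, y1) = sweep[0], sweep[1]
    base_slope = (y1 - y0) / (x1 - x0)
    # Walk consecutive segments and stop at the first one that is nearly flat.
    for (xa, ya), (xb, yb) in zip(sweep[1:], sweep[2:]):
        slope = (yb - ya) / (xb - xa)
        if slope < flat_ratio * base_slope:
            return xa  # last load level before the curve went flat
    return sweep[-1][0]  # throughput never saturated within this sweep


# Made-up sweep: throughput tracks load, then saturates around 33 req/s.
sweep = [(10, 10), (20, 20), (30, 29), (40, 33), (50, 33.5), (60, 33.6)]
knee_load = find_knee(sweep)
```

With these made-up points the curve flattens after an offered load of 40, so that is the load you would aim to stay at or below.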
Raw throughput lies: a request served late is a failure, not a success. Goodput is the part of throughput that meets both the TTFT and TPOT SLOs. Past the knee, throughput may look flat while goodput collapses as requests blow their SLO, which is why you optimize for goodput, not throughput. [2]
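The difference between throughput and goodput is easy to see in a few lines. This is a minimal sketch, assuming you have per-request TTFT and TPOT measurements for a time window; the RequestRecord type, the goodput function, and the SLO values are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class RequestRecord:
    ttft_s: float  # time to first token, seconds
    tpot_s: float  # mean time per output token, seconds


def goodput(
    records: Iterable[RequestRecord],
    window_s: float,
    ttft_slo_s: float,
    tpot_slo_s: float,
) -> Tuple[float, float]:
    """Return (throughput, goodput) in requests/sec over the window.

    A request counts toward goodput only if it meets BOTH SLOs.
    """
    records = list(records)
    ok = sum(
        1 for r in records
        if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return len(records) / window_s, ok / window_s


# Four requests finished in a 2-second window (illustrative numbers).
window = [
    RequestRecord(ttft_s=0.2, tpot_s=0.03),  # meets both SLOs
    RequestRecord(ttft_s=0.6, tpot_s=0.03),  # misses TTFT
    RequestRecord(ttft_s=0.3, tpot_s=0.08),  # misses TPOT
    RequestRecord(ttft_s=0.4, tpot_s=0.05),  # meets both (boundary counts)
]
throughput_rps, goodput_rps = goodput(window, 2.0, ttft_slo_s=0.5, tpot_slo_s=0.05)
```

Here throughput is 2 requests/sec but goodput is only 1 request/sec, because half the requests miss one of the two SLOs.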
One equation, Little's Law, turns all this into headcount: [3]
concurrency = throughput × latency
Your --max-num-seqs is the concurrency cap. Divide it by your average request latency and you have a replica's max QPS; divide peak demand by that and you have how many replicas to autoscale to. When the queue grows, you're past the knee: add a replica.
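The sizing arithmetic above can be sketched as two small functions. The function names are hypothetical, and the 8-second average latency and 50 QPS peak are illustrative assumptions; only the concurrency cap of 64 comes from the settings mentioned in this lesson.

```python
import math


def replica_max_qps(max_num_seqs: int, avg_latency_s: float) -> float:
    """Little's Law rearranged: throughput = concurrency / latency."""
    if avg_latency_s <= 0:
        raise ValueError("avg_latency_s must be positive")
    return max_num_seqs / avg_latency_s


def replicas_needed(peak_qps: float, max_num_seqs: int, avg_latency_s: float) -> int:
    """Replicas required so peak demand fits under the per-replica cap."""
    return math.ceil(peak_qps / replica_max_qps(max_num_seqs, avg_latency_s))


# Cap of 64 concurrent sequences, assumed 8 s average end-to-end latency.
per_replica_qps = replica_max_qps(64, 8.0)  # 64 / 8 = 8 requests/sec
replicas = replicas_needed(50.0, 64, 8.0)   # ceil(50 / 8) = 7
```

Rounding up matters: 6 replicas would leave you above the cap at peak, which is exactly where the queue starts to grow.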
This lesson is your world. TTFT/TPOT are your latency SLOs; the knee is where you'd burn your error budget; goodput is SLO-meeting throughput; and Little's Law (concurrency = throughput × latency) is the same math behind an HPA target and replica sizing. The next lessons — routing, autoscaling — are literally your day job, now wired to the internals underneath.
Everything in this lesson is already emitted by your vLLM servers:
time_to_first_token histogram → your TTFT SLO (prefill).
time_per_output_token → your TPOT SLO (decode).
num_requests_running = live concurrency; num_requests_waiting = the queue, the cleanest "past the knee → autoscale now" signal.
--max-num-seqs (8 / 64) = the concurrency cap in Little's Law; bump it only if KV memory allows (Lesson 10).
--enable-chunked-prefill + --max-num-batched-tokens = the Sarathi-Serve token budget that tames the knee, protecting TPOT while big prefills run. You're prefill-heavy (19–47:1), so this guards your TTFT directly.
Ops loop: route to keep each replica below its knee, autoscale on num_requests_waiting / TTFT-P99, size replicas with Little's Law.
Watch it: bash learning/tools/cluster-probe.sh
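The autoscale half of the ops loop can be sketched as follows. This is a minimal sketch under stated assumptions: the inputs are taken to be the num_requests_waiting gauge and the cumulative buckets of the time_to_first_token histogram, however you scrape them; the function names, thresholds, and bucket counts are hypothetical. The quantile estimate uses the same linear interpolation within a bucket that Prometheus's histogram_quantile applies.

```python
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound_seconds, cumulative_count), sorted by bound;
        the last bound may be float("inf"). Lowest bucket starts at 0.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf") or count == prev_count:
                return prev_bound  # nothing to interpolate within
            # Linear interpolation inside the bucket that holds the rank.
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound


def should_scale_up(num_waiting: float, ttft_p99_s: float,
                    ttft_slo_s: float, max_waiting: float = 0) -> bool:
    """Scale up when the queue is non-empty or TTFT P99 breaches the SLO."""
    return num_waiting > max_waiting or ttft_p99_s > ttft_slo_s


# Illustrative TTFT buckets: 100 requests, slowest two between 1 s and 2 s.
ttft_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (2.0, 100), (float("inf"), 100)]
ttft_p99 = histogram_quantile(0.99, ttft_buckets)
scale = should_scale_up(num_waiting=3, ttft_p99_s=ttft_p99, ttft_slo_s=2.0)
```

With these made-up buckets the P99 lands halfway through the 1 s to 2 s bucket, and the non-empty queue alone is enough to trigger a scale-up.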
That closes the loop back to your day job: prefill/decode (L9) set the two latencies; KV memory (L10) and paging/batching (L12) set the concurrency cap; the roofline (L11) says decode is bandwidth-bound; interconnect (L19) sets TP latency; and goodput + Little's Law (L24) turn all of it into routing, autoscaling, and cost.