Inference Engineering · Lesson 28 · Multi-Cloud Capacity

Multi-Cloud Capacity

Fleets across clouds — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict why inference fleets span clouds, the control-plane architecture, and how procurement trades cost for certainty.

The setup

Demand outgrows one cluster/region. You consider spreading inference across clouds.

Step 1 — the drivers

Step 2 — the architecture

Recall — cover the screen: the multi-cloud shape in one line.
A global control plane (policy, routing, capacity) over per-cloud/region workload planes that run the GPUs, with geo-aware load balancing steering each request to the nearest healthy region in its latency budget. Capacity is sourced wherever GPUs are available.

Step 3 — procurement

In Kubernetes terms

Multi-cluster fleet management (a control plane like Karmada over regional clusters) + GSLB/global ingress for geo-routing; spot = preemptible node pools (drain-on-reclaim); data-residency/compliance = scheduling constraints. The GPU version of any global-service playbook.

On YOUR cluster

You're single-site (on-prem OpenShift, 4×H100), which is fine for now. Scale-out path: keep on-prem as the reserved baseline and burst to cloud (on-demand/spot) under a global control plane with geo-aware routing.

Read this next (primary source, runnable): day25 notebook.

Final check — teach it back

Explain to a colleague: "We'd go multi-cloud for inference to…"
…get GPU supply (they're scarce, so chase capacity across providers), cut latency (place near users), and survive any one cloud's outage. To do that, run a global control plane over regional workload planes, geo-route by RTT, and mix reserved baseline + on-demand burst + spot slack for cost vs certainty, within compliance/residency limits.

Multi-Cloud Capacity

Why inference fleets span clouds — for supply, latency, and reliability.

Today's win: you'll explain why large inference fleets run across multiple clouds (GPU supply, latency, reliability), the global-control-plane architecture, and how procurement (reserved / on-demand / spot) trades cost against certainty.

The picture: a restaurant chain across cities

One location can't get enough ovens (GPU scarcity), can't be near every customer (latency), and is one fire away from total outage. So you run branches in many cities with a single head office coordinating them — buying ovens wherever they're available, seating diners at the nearest branch, and surviving any one branch going dark.

head office → global control plane (one brain)
each city branch → per-cloud / per-region workload plane
seat diners at the nearest branch → geo-aware load balancing (by RTT)
owned vs rented vs day-rate ovens → reserved / on-demand / spot

1 · Why span clouds at all

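Three drivers: GPU supply (GPUs are scarce, so capacity is sourced across providers), latency (serve users from a nearby region), and reliability (survive any one cloud's outage). [1]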

2 · The architecture: one brain, many planes

A global control plane holds policy, routing, and capacity state; per-cloud workload planes actually run the GPUs. A geo-aware load balancer sends each request to the nearest healthy region within its latency budget. [1]

[Figure: a global control plane over three workload planes (cloud A · us-east, neocloud · eu, cloud B · apac); users are routed to the nearest healthy region by a geo-aware LB, by RTT.]
One control plane, many regional workload planes. Capacity is sourced wherever it's available; traffic is steered by latency and health.
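The routing rule above (nearest healthy region within the latency budget) can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's implementation: the `Region` type, the `pick_region` function, the region labels, and the RTT numbers are all hypothetical, and the `free_gpus` check is an added assumption about what "capacity state" might mean.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Region:
    name: str        # hypothetical label, e.g. "cloud-a/us-east"
    rtt_ms: float    # round-trip time from the user to this region
    healthy: bool    # health as reported to the control plane
    free_gpus: int   # spare capacity (assumption: tracked by the control plane)


def pick_region(regions: list[Region], latency_budget_ms: float) -> Optional[Region]:
    """Return the nearest healthy region with capacity inside the latency budget."""
    candidates = [
        r for r in regions
        if r.healthy and r.free_gpus > 0 and r.rtt_ms <= latency_budget_ms
    ]
    # "Nearest" means lowest RTT; None means nothing can serve within budget.
    return min(candidates, key=lambda r: r.rtt_ms, default=None)


# Made-up numbers: the closest region is unhealthy, so traffic shifts to the next one.
regions = [
    Region("cloud-a/us-east", rtt_ms=18.0, healthy=False, free_gpus=8),
    Region("neocloud/eu", rtt_ms=95.0, healthy=True, free_gpus=4),
    Region("cloud-b/apac", rtt_ms=210.0, healthy=True, free_gpus=16),
]

chosen = pick_region(regions, latency_budget_ms=120.0)
print(chosen.name if chosen else "no region within budget")
```

With a tighter budget (say 50 ms) the function returns `None`, which is the point where a real system would have to shed load, queue, or relax the budget.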

3 · Procurement: reserved vs on-demand vs spot

Mix purchase types to balance cost and certainty: reserved (committed, discounted per hour in exchange for the commitment, for baseline load), on-demand (flexible, priciest, for bursts), and spot/preemptible (usually the cheapest, but can be reclaimed at any time, so suited to interruptible or buffered work). A typical fleet is reserved baseline + on-demand burst + spot for slack. [2]
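To make the cost-versus-certainty trade concrete, here is a rough sketch of the arithmetic for one hour of a mixed fleet. The prices are invented purely for illustration (they are not quotes from any provider), and `Prices` and `hourly_cost` are hypothetical names. The sketch assumes serving demand runs on reserved capacity first and bursts to on-demand, while interruptible work runs on spot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    # Hypothetical $/GPU-hour figures, chosen only to make the arithmetic visible.
    reserved: float = 2.0
    on_demand: float = 4.0
    spot: float = 1.5


def hourly_cost(serving_gpus: int, interruptible_gpus: int,
                reserved_gpus: int, prices: Prices) -> dict:
    """Cost of one hour: reserved baseline + on-demand burst + spot slack."""
    # Reserved capacity is paid for whether or not it is fully used.
    reserved_cost = reserved_gpus * prices.reserved
    # Serving demand above the reserved baseline bursts to on-demand.
    burst_gpus = max(0, serving_gpus - reserved_gpus)
    on_demand_cost = burst_gpus * prices.on_demand
    # Interruptible or buffered work rides on spot (it may be reclaimed).
    spot_cost = interruptible_gpus * prices.spot
    return {
        "burst_gpus": burst_gpus,
        "total_usd": reserved_cost + on_demand_cost + spot_cost,
    }


prices = Prices()
mixed = hourly_cost(serving_gpus=12, interruptible_gpus=4, reserved_gpus=8, prices=prices)
all_on_demand = (12 + 4) * prices.on_demand
print(mixed["total_usd"], all_on_demand)
```

Under these made-up prices the mixed fleet costs 38.0 per hour against 64.0 for running everything on-demand. The saving is paid for with a commitment on the reserved part and reclaim risk on the spot part.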

In Kubernetes terms

This is multi-cluster fleet management: a control plane (think Karmada / fleet manager) over regional clusters, with a GSLB / global ingress doing geo-routing — the GPU version of what you'd build for any global service. Spot is preemptible/spot node pools (drain-on-reclaim), reserved is committed node groups, and data-residency/compliance are scheduling constraints (affinity to in-region clusters). Same playbook, GPU-flavored.

4 · The constraints that shape it

Latency budgets (per-region RTT), active-active vs active-passive failover, and compliance (SOC 2, HIPAA, data residency) all bound where workloads can run. Cost and resilience are the dials; compliance is the fence. [1]
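One way to picture "compliance is the fence" is as a hard filter applied before any latency or cost ranking. The sketch below assumes each site advertises a jurisdiction and a set of certifications; the `Site` and `Workload` types, the site names, and all values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    jurisdiction: str                # hypothetical label such as "eu" or "us"
    certifications: frozenset[str]   # e.g. frozenset({"SOC 2", "HIPAA"})
    rtt_ms: float


@dataclass(frozen=True)
class Workload:
    name: str
    allowed_jurisdictions: frozenset[str]    # data-residency fence
    required_certifications: frozenset[str]  # compliance fence
    latency_budget_ms: float


def eligible_sites(workload: Workload, sites: list[Site]) -> list[Site]:
    """Apply the fences first, then order the survivors by RTT."""
    inside_fence = [
        s for s in sites
        if s.jurisdiction in workload.allowed_jurisdictions
        and workload.required_certifications <= s.certifications  # subset test
        and s.rtt_ms <= workload.latency_budget_ms
    ]
    return sorted(inside_fence, key=lambda s: s.rtt_ms)


sites = [
    Site("cloud-a/us-east", "us", frozenset({"SOC 2", "HIPAA"}), rtt_ms=20.0),
    Site("neocloud/eu", "eu", frozenset({"SOC 2"}), rtt_ms=90.0),
    Site("cloud-b/apac", "apac", frozenset({"SOC 2"}), rtt_ms=200.0),
]

chat = Workload("eu-chat", frozenset({"eu"}), frozenset({"SOC 2"}), 120.0)
records = Workload("eu-records", frozenset({"eu"}), frozenset({"SOC 2", "HIPAA"}), 120.0)

print([s.name for s in eligible_sites(chat, sites)])
print([s.name for s in eligible_sites(records, sites)])
```

The second workload gets an empty list even though a faster, HIPAA-certified site exists, because that site sits outside the residency fence. No amount of cost or latency tuning can change that result; only adding compliant capacity in-region can.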

On YOUR cluster: you're single-site (the scale-out path)

Today you run one on-prem OpenShift cluster (4× H100). There is no multi-cloud yet, and for many workloads that's fine. This lesson is the scale-out future: if demand outgrows your 4 GPUs or you need geo-presence, the pattern is to keep on-prem as the reserved baseline and burst to cloud (on-demand/spot) under a global control plane with geo-aware routing. It is the same fleet thinking you'd apply to any service, now sized in GPUs.
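As a back-of-the-envelope sketch of that pattern, the split between the on-prem baseline and cloud burst is just an overflow calculation. The function name `plan_capacity` and the demand trace are hypothetical; only the 4-GPU baseline comes from the setup described here.

```python
ON_PREM_GPUS = 4  # the single on-prem cluster described above (4x H100)


def plan_capacity(demand_gpus: int, on_prem_gpus: int = ON_PREM_GPUS) -> dict[str, int]:
    """Fill the on-prem baseline first; anything above it bursts to cloud."""
    if demand_gpus < 0:
        raise ValueError("demand_gpus must be non-negative")
    on_prem = min(demand_gpus, on_prem_gpus)
    return {"on_prem": on_prem, "cloud_burst": demand_gpus - on_prem}


# Made-up hourly demand trace, in GPUs.
for hour, demand in enumerate([2, 4, 7, 5]):
    print(hour, plan_capacity(demand))
```

In this trace only hours 2 and 3 need cloud capacity (3 and 1 GPUs), which is the kind of short, spiky overflow that on-demand or spot is meant to absorb.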

Read this next (primary source, runnable companion): day25 notebook, covering control planes, procurement mix, and geo-aware capacity.

Check yourself (recall, don't peek)

References
  1. Multi-cloud capacity management — day25 (multi-cloud-capacity.ipynb).
  2. Reserved / on-demand / spot procurement for GPU fleets — day25.