Inference Engineering · Lesson 23 · Multi-Instance GPU (MIG)

Multi-Instance GPU (MIG)

Slicing a GPU — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict how MIG partitions a GPU, how it differs from time-slicing, and which your cluster uses.

The setup

A 94 GB GPU running a model that needs only 20 GB wastes most of itself. You want several workloads to share one physical GPU.

Step 1 — what MIG does

Step 2 — MIG vs time-slicing

Recall — cover the screen: MIG vs time-slicing, one line each.
MIG = hardware partition into isolated instances (own SMs/HBM/L2): guaranteed, no noisy neighbor, but rigid. Time-slicing = tenants share the whole GPU in turns: flexible and oversubscribable, but no isolation (a heavy tenant starves others).

Step 3 — when to use which

Step 4 — your cluster (real)

In Kubernetes terms (infra bridge)

MIG = partition a node into Guaranteed-QoS slices with hard quotas (can't overcommit). Time-slicing = overcommit the node (Burstable QoS): more pods than guaranteed capacity, flexible but noisy-neighbor-prone. It's the same isolation-vs-density call you make with requests/limits.

On YOUR cluster (real config)

You run time-slicing ×5 on 4 H100s → 20 logical GPUs, with MIG disabled: flexible and oversubscribed, but no hardware isolation (noisy-neighbor risk to SLOs). Your 27B Qwen wants a whole GPU regardless. MIG would be the switch if you needed guaranteed per-tenant isolation.

Read this next — primary source NVIDIA MIG guide · runnable: day21 notebook.

Final check — teach it back

Explain to a colleague: "We'd switch from time-slicing to MIG if…"
…we needed guaranteed per-tenant isolation. MIG gives each instance its own SMs/HBM in hardware, so no noisy neighbor, at the cost of flexibility (fixed profiles, no oversubscription). Time-slicing (what we run) packs more logical GPUs but lets a heavy tenant degrade others.

Multi-Instance GPU (MIG)

Slice one physical GPU into smaller isolated GPUs — or time-slice it. The tradeoff.

Today's win: you'll explain how MIG hardware-partitions one GPU into several isolated instances for multi-tenant serving, how that differs from time-slicing, and when to reach for each — using your cluster's actual setup as the example.

The picture: private stalls vs a shared kitchen on a timer

A 94 GB GPU is a big kitchen. If your dish only needs a third of it, the rest is wasted. Two ways to share: build permanent walls into separate stalls each with their own oven and counter (MIG — hard isolation), or let several cooks share the one kitchen on a timer (time-slicing — flexible, but they can step on each other).

permanent walls + dedicated equipment → MIG: hardware partition, isolated SMs/HBM
sharing the kitchen on a timer → time-slicing: soft, oversubscribed, no isolation
fitting the dish to the stall → right-sizing by VRAM + KV headroom

1 · Right-sizing: don't waste a whole GPU on a small model

A model needs memory for its weights plus headroom for the KV cache. A small model on a 94 GB GPU leaves most of it idle. For multi-tenant or small-model serving, you want to subdivide the GPU so several workloads share it. [1]
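The right-sizing arithmetic can be sketched as below. This is a minimal estimate, assuming 2-byte (fp16/bf16) weights and a standard transformer KV cache with two tensors (K and V) per layer. The function names and the 7B model shape are hypothetical, chosen for illustration, and are not taken from the lesson's notebook.

```python
def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Weight memory in GB (10^9 bytes), assuming one uniform dtype."""
    return params_billion * bytes_per_param


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache in GB for `tokens` cached tokens in total (all sequences)."""
    # 2 tensors (K and V) per layer, each kv_heads * head_dim wide per token.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 1e9


def fits(gpu_gb: float, params_billion: float, layers: int,
         kv_heads: int, head_dim: int, tokens: int) -> tuple[float, bool]:
    """Return (GB needed, whether it fits in gpu_gb)."""
    need = weights_gb(params_billion) + kv_cache_gb(layers, kv_heads, head_dim, tokens)
    return need, need <= gpu_gb


# Hypothetical 7B model: 32 layers, 8 KV heads, head_dim 128, 32768 cached tokens.
need, ok = fits(94, 7, 32, 8, 128, 32768)
print(f"needs about {need:.1f} GB of 94 GB, fits: {ok}")
```

Under these assumptions the model lands near the 20 GB mark, which is the kind of workload that leaves most of a 94 GB GPU idle. Real deployments also need room for activations and runtime overhead, which this sketch ignores.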

2 · MIG — hardware partitions

Multi-Instance GPU splits one physical GPU into up to 7 isolated instances, each with its own dedicated SMs, HBM slice, and L2. Profile names encode compute slices and memory: 1g.10gb, 3g.40gb, and 7g.80gb are the names on an 80 GB card, and a 94 GB card exposes correspondingly larger memory sizes. The isolation is in hardware: one tenant literally cannot touch another's compute or memory, so performance is guaranteed and there is no noisy neighbor. [1]
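To make the profile naming concrete, here is a small sketch that parses names of the form `<N>g.<M>gb` and checks a proposed layout against two budgets: 7 compute slices and the card's total memory. It assumes an 80 GB card to match the profile names above. It is deliberately simplified: real MIG also restricts where each instance can be placed, which this check does not model. The helper names are hypothetical.

```python
import re

MAX_COMPUTE_SLICES = 7  # the "up to 7 instances" limit from the lesson


def parse_profile(name: str) -> tuple[int, int]:
    """Split a profile name like '3g.40gb' into (compute_slices, memory_gb)."""
    m = re.fullmatch(r"(\d+)g\.(\d+)gb", name)
    if m is None:
        raise ValueError(f"not a '<N>g.<M>gb' profile name: {name!r}")
    return int(m.group(1)), int(m.group(2))


def layout_fits(profiles: list[str], total_mem_gb: int = 80) -> bool:
    """Budget check only: compute slices and memory. Ignores placement rules."""
    parsed = [parse_profile(p) for p in profiles]
    slices = sum(s for s, _ in parsed)
    mem = sum(m for _, m in parsed)
    return slices <= MAX_COMPUTE_SLICES and mem <= total_mem_gb


print(layout_fits(["3g.40gb", "2g.20gb", "1g.10gb", "1g.10gb"]))  # 7 slices, 80 GB
print(layout_fits(["1g.10gb"] * 8))                               # 8 slices
```

The first layout uses exactly 7 compute slices and 80 GB, so it passes the budget check; the second asks for 8 slices and fails it.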

[Figure: MIG hard partitions (3g.40gb, 2g.20gb, 1g, 1g; dedicated SMs/HBM each, isolated) vs time-slicing (A → B → C → A → B → C …, full GPU each turn; flexible and oversubscribed, but no isolation)]
MIG carves dedicated slices with hardware walls; time-slicing lets tenants take turns on the whole GPU. Isolation vs flexibility.

3 · MIG vs time-slicing — the tradeoff

MIG = hard, guaranteed isolation, but rigid (fixed profiles, can't oversubscribe). Time-slicing = soft sharing: flexible, and you can pack more logical GPUs than you have physical ones, but there's no isolation, so a heavy tenant can starve the others (noisy neighbor). [2]
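A toy model can make the tradeoff visible. It assumes time-slicing hands out equal round-robin turns, so each active tenant gets 1/n of GPU time, while a MIG instance always owns a fixed fraction of the compute slices. Both functions are hypothetical simplifications for intuition; the real scheduler and real contention (memory bandwidth, cache) are not modeled.

```python
def time_sliced_share(active_tenants: int) -> float:
    """Toy model: fraction of GPU time per tenant under equal round-robin turns."""
    if active_tenants < 1:
        raise ValueError("need at least one active tenant")
    return 1.0 / active_tenants


def mig_share(compute_slices: int, total_slices: int = 7) -> float:
    """Fixed fraction of compute owned by a MIG instance, whatever neighbors do."""
    if not 1 <= compute_slices <= total_slices:
        raise ValueError("compute_slices must be between 1 and total_slices")
    return compute_slices / total_slices


# One tenant's view as neighbors become active on the same physical GPU.
for n in range(1, 6):
    print(f"{n} active: time-sliced {time_sliced_share(n):.2f}, "
          f"MIG 3g slice {mig_share(3):.2f}")
```

In this model the time-sliced tenant enjoys the whole GPU when alone but drops to a fifth of it with five active tenants, while the MIG tenant's share never moves. That is the flexibility-versus-guarantee trade in one loop.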

In Kubernetes terms (infra bridge)

MIG is partitioning a node into smaller schedulable units with hard resource quotas — Guaranteed-QoS slices that can't be overcommitted, like dedicated nodepools with strict limits. Time-slicing is overcommitting the node (Burstable QoS): more pods than guaranteed capacity, flexible but subject to noisy-neighbor contention. Same isolation-vs-density call you make with resource requests/limits.

4 · When to use which

On YOUR cluster (real config): you time-slice, not MIG

Your 4 H100s run time-slicing ×5 → 20 logical GPUs, with MIG disabled. That's the flexible/oversubscribed choice: great for packing many workloads, but no hardware isolation. A heavy tenant can degrade neighbors sharing the same physical GPU (a real noisy-neighbor risk for your SLOs). If you needed guaranteed isolation per tenant, MIG would be the switch, at the cost of that flexibility. Your big Qwen, though, wants a whole GPU regardless.
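The numbers in this setup can be checked with a short sketch. The GPU count, replica factor, and 94 GB size come from the lesson; the helper `colocated_memory_ok` and the 20 GB tenant sizes are hypothetical. It relies on the fact that time-slicing does not partition memory, so tenants placed on the same physical GPU must fit in its memory together.

```python
PHYSICAL_GPUS = 4      # from the lesson
REPLICAS_PER_GPU = 5   # time-slicing factor from the lesson
GPU_MEM_GB = 94        # from the lesson

# Logical GPUs the scheduler sees; physical capacity is unchanged.
logical_gpus = PHYSICAL_GPUS * REPLICAS_PER_GPU


def colocated_memory_ok(tenant_mem_gb: list[float],
                        gpu_mem_gb: float = GPU_MEM_GB) -> bool:
    """True if tenants sharing one physical GPU fit in its memory together."""
    return sum(tenant_mem_gb) <= gpu_mem_gb


print(logical_gpus)                        # 20
print(colocated_memory_ok([20.0] * 4))     # 80 GB total
print(colocated_memory_ok([20.0] * 5))     # 100 GB total
```

Four hypothetical 20 GB tenants fit on one 94 GB GPU, but a fifth pushes the total to 100 GB, so filling all five replicas of a physical GPU with such tenants would exceed its memory even though the scheduler sees five free logical GPUs.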

Read this next — primary source NVIDIA MIG User Guide. Runnable companion: day21 notebook.

Check yourself (recall, don't peek)

References
  1. Multi-Instance GPU (MIG) — day21 (multi-gpu-instances-mig.ipynb); NVIDIA MIG guide.
  2. MIG vs time-slicing tradeoffs — NVIDIA GPU Operator docs.