Understanding NVIDIA MIG (Multi-Instance GPU)
There's a wealth of technical documentation from NVIDIA about MIG, but much of it dives deep very quickly, and there is no single, simple, structured introduction. The objective of this blog is to make MIG easy to understand, so that when you start reading those technical documents, they will be much easier to follow. This might be a long read, but trust me, it is definitely worth it.
Why the need for MIG? GPUs are both expensive and scarce, and workloads that fail to fully utilize their compute capacity waste resources. To illustrate: if you run a workload that needs only 10 GB on a 40 GB GPU, you waste 30 GB.
To solve this, the Multi-Instance GPU (MIG) feature, introduced with NVIDIA’s Ampere architecture, enables GPUs to be securely partitioned into up to seven independent GPU instances for CUDA applications. This allows multiple users to access dedicated GPU resources, optimizing GPU utilization. MIG is particularly advantageous for workloads that do not fully utilize the GPU’s compute capacity, enabling users to run multiple workloads concurrently to maximize efficiency.
What is MIG? MIG allows a single GPU to be securely partitioned into multiple independent GPU instances. Each instance functions as a separate GPU with its own dedicated memory, cache, and compute resources. This ensures isolation between workloads, making it ideal for scenarios where multiple users or applications need to share a single GPU without interference. One common use case is Cloud Service Providers (CSPs) or companies with multi-tenant requirements: MIG ensures one user cannot impact the work or scheduling of other users, in addition to providing enhanced isolation.
How does MIG work? MIG allows a single GPU to be partitioned into up to seven GPU instances (GIs). For example, an NVIDIA A100 GPU with 40 GB of memory can be divided into eight memory slices and seven compute slices. Each memory slice represents 1/8 of the total GPU memory (5 GB on a 40 GB A100), while each compute slice contains approximately 1/7 of the GPU's streaming multiprocessors (SMs).
For those unfamiliar with GPU architecture, streaming multiprocessors (SMs) are the core units responsible for executing tasks and are thus the primary compute resource of the GPU. The A100 has a total of 108 SMs, and since 108 is not divisible by seven, dividing them evenly into seven instances is not straightforward. Let's explore this further to understand how MIG handles such configurations.
Why can't the GPU be partitioned into eight compute slices? NVIDIA's official reasoning is chip yield: if you design for eight compute slices, the chip yield is not that good; if you go for seven, the yield is acceptable.
Understanding the Partitioning Process. Creating a GPU Instance (GI) requires combining some number of memory slices with some number of compute slices. In the diagram below, a 5 GB memory slice is combined with one compute slice to create a 1g.5gb GI profile ("1g" denotes one GPU compute slice, "5gb" the memory size). You can have up to seven of these. To understand this canonical representation of "1g.5gb", please refer here.
You can also create a 2g.10gb GI profile (Figure 3). Note that a 1g.10gb GI profile is possible as well (not shown here).
You cannot create a 3g.15gb GI profile; it is simply not allowed. You can only create a 3g.20gb profile (Figure 4). This is how NVIDIA designed it.
When you go for 4g, you get the 4g.20gb profile. And you cannot go for 5g or 6g; the next and last option is the 7g.40gb profile.
The rules to remember are:
- You can only provision GPU slices of 1, 2, 3, 4, and 7.
- If your GPU slice count is not a power of two (that is, 3 or 7), you get additional memory slices: the memory rounds up to the next power of two. Since each memory slice is 40/8 = 5 GB, 3g gets 4 x 5 = 20 GB and 7g gets 8 x 5 = 40 GB, which is why the profiles are 3g.20gb and 7g.40gb rather than 3g.15gb and 7g.35gb.
Rules of partitioning when using nvidia-smi: you can create GPU instances using nvidia-smi. The first step is to enable MIG; it is disabled by default.
MIG mode can be enabled on a per-GPU basis with the following command:
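sudo nvidia-smi -i 0 -mig 1

(GPU index 0 here is just an example; the command typically requires root privileges, and on some products the GPU must be idle.)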
The GPUs can be selected using comma-separated GPU indexes, PCI bus IDs, or UUIDs. If no GPU ID is specified, MIG mode is applied to all the GPUs on the system. When MIG is enabled, depending on the GPU product, the driver will attempt to reset the GPU so that MIG mode can take effect.
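To confirm that MIG mode took effect, you can query the current mode (these are standard nvidia-smi query fields):

nvidia-smi --query-gpu=pci.bus_id,mig.mode.current --format=csv

Each selected GPU should report "Enabled".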
The order in which you create partitions matters. Let me explain using the diagram in Figure 9. If you create a 3g.20gb first, it will "lock out" one compute slice, and the 4g.20gb can no longer be created.
Similarly, if you try to create a 3g.20gb and 4 x 1g.5gb, you can only get a 3g.20gb and 3 x 1g.5gb (Figure 10). This is because the 3g.20gb occupies four memory slices but only three compute slices; the leftover compute slice above its memory region is locked, so you cannot provision the fourth 1g.5gb profile.
So the general rule of thumb is: don't partition the 3g.20gb first. Always create the GPU slices whose size is a power of two, i.e. the 1g, 2g, and 4g profiles, before the 3g.20gb. A sketch of this with nvidia-smi is shown below.
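As a concrete sketch of this ordering (the profile IDs below, 5 for 4g.20gb and 9 for 3g.20gb, are the ones an A100-40GB typically reports; verify them on your system first):

# list the supported GI profiles and their IDs
sudo nvidia-smi mig -lgip

# create the power-of-two slice first (4g.20gb), then the 3g.20gb;
# the -C flag also creates the default compute instance inside each GI
sudo nvidia-smi mig -cgi 5,9 -C

# confirm the resulting GPU instances
sudo nvidia-smi mig -lgi

Creating them in the reverse order (-cgi 9,5) may fail for exactly the reason described above.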
You can also further partition GPU instances into Compute Instances (CIs). Compute instances within the same GPU instance share its memory, which means you lose isolation. With GPU instances, you get both memory and compute isolation: a failure or error in one GPU instance will not affect the others. With compute instances, one failing process can affect the others, since they share the same GPU instance and its memory. Figure 12 below shows two 2c.4g.20gb compute instances.
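A hedged sketch of what Figure 12 shows, assuming an A100-40GB (the compute instance profile ID for 2c.4g.20gb varies by product, so check it with -lcip first):

# create a 4g.20gb GPU instance (profile ID 5 on an A100-40GB)
sudo nvidia-smi mig -cgi 5

# list the compute instance profiles available inside the new GI
# (use the GI ID reported by nvidia-smi mig -lgi; 1 is assumed here)
sudo nvidia-smi mig -lcip -gi 1

# create two 2c.4g.20gb compute instances inside that GI,
# replacing 1,1 with the 2c profile ID that -lcip reports
sudo nvidia-smi mig -cci 1,1 -gi 1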
Finally, let's talk about the SMs. Remember, the A100 has 108 SMs, and 108 is not divisible by 7. If you run the command "nvidia-smi mig -lgip", it shows that a 1g.5gb GPU instance has 14 SMs (streaming multiprocessors) and 1 CE (copy engine). If you need more SMs, you can combine compute slices: for example, a 3g.20gb profile gets you 42 SMs, and a 4g.20gb gets you 56.
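For reference, here is where those numbers come from (the values in the comment are what an A100-40GB reports; other products differ):

# the SM column of the profile listing reads 14 (1g.5gb), 28 (2g.10gb),
# 42 (3g.20gb), 56 (4g.20gb) and 98 (7g.40gb)
nvidia-smi mig -lgip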
Why doesn’t each GPU instance get 15 SMs? MIG ensures isolation between instances, which means no two instances can share an SM. This strict isolation is essential for workloads that require determinism and security. As a result, the SMs are distributed in a way that maintains the integrity of the MIG architecture, even if it results in fewer SMs being allocated to smaller instances. Other hardware resources such as memory controllers, L2 cache, and bandwidth are also split among instances. NVIDIA optimizes the SM allocation to align with the needs of the memory and other components assigned to the instance. For instance, allocating 15 SMs to a 1g.5gb instance might exceed the bandwidth or memory constraints of that slice. The smaller number of SMs (14) ensures that the instance operates efficiently within the constraints of the allocated memory and other GPU resources.
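Before wrapping up, one practical note: to undo a MIG configuration, destroy the compute instances first, then the GPU instances, then disable MIG mode (a minimal sketch; without arguments, -dci and -dgi act on all instances):

sudo nvidia-smi mig -dci
sudo nvidia-smi mig -dgi
sudo nvidia-smi -i 0 -mig 0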
That's all for this blog. Feel free to leave a comment!