
GPU becomes unavailable #3680

Closed
rogelioamancisidor opened this issue May 25, 2023 · 22 comments
Labels
action-required · Needs Attention 👋 · question

Comments

@rogelioamancisidor

rogelioamancisidor commented May 25, 2023

I send several jobs from a bash file, e.g.

run my_code.py --some_params
run my_code.py --other_params
....

to a nodepool running an A100 GPU. The first job uses the GPU; I can see the GPU being used in wandb, and I also connected to the VM and confirmed with TensorFlow that the GPU is available. However, when the second run starts, the GPU is unavailable. Again, I confirmed this in wandb and by checking on the VM.

Why is this happening? I have never experienced this before.

UPDATE: I connected to the VM, and running nvidia-smi returns

Failed to initialize NVML: Unknown Error

I found this link and this other one.
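
For reference, a minimal way to reproduce that check from inside the affected pod might look like the following. This is only a sketch: it assumes kubectl access and that TensorFlow is installed in the pod image, and the pod name my-gpu-pod is just a placeholder.

# Placeholder pod name; substitute the pod that lost its GPU.
POD=my-gpu-pod

# Check the driver/NVML state inside the container.
kubectl exec "$POD" -- nvidia-smi

# Ask TensorFlow whether it still sees the GPU.
kubectl exec "$POD" -- python -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'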

@rogelioamancisidor rogelioamancisidor changed the title Why is the GPU becoming unavailable? GPU becomes unavailable May 27, 2023
@olsenme
Contributor

olsenme commented Jun 2, 2023

What SKU are you using?

@rogelioamancisidor
Author

rogelioamancisidor commented Jun 3, 2023

@olsenme It happened with both standard_nc24ads_a100_v4 and standard_nc6s_v3. It was working fine a couple of days ago, but today the error is back.

@alexeldeib
Contributor

This will be fixed in the 202306.07.0 VM image from AKS.

Ensure you deploy the NVIDIA device plugin with PASS_DEVICE_SPECS set to true via an env var or CLI flag.

NVIDIA/nvidia-docker#966 (comment)
NVIDIA/nvidia-docker#1671 (comment)

This deployment (you can disable privileged mode for most scenarios) works fine: https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/deployments/static/nvidia-device-plugin-compat-with-cpumanager.yml

The AKS preview GPU dedicated VHD needs an additional tweak, ETA TBD, probably in the next few weeks.
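
A rough sketch of acting on this advice is below; the commands are not taken from the thread, and the DaemonSet name and namespace are assumptions that may differ in your cluster. The linked compat-with-cpumanager manifest is meant to set PASS_DEVICE_SPECS already.

# Deploy the linked manifest (the compat-with-cpumanager variant of the device plugin).
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/deployments/static/nvidia-device-plugin-compat-with-cpumanager.yml

# If you keep an existing device plugin DaemonSet instead, set the variable on it directly
# (DaemonSet name and namespace assumed; adjust to your deployment).
kubectl -n kube-system set env daemonset/nvidia-device-plugin-daemonset PASS_DEVICE_SPECS=true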

@rogelioamancisidor
Author

rogelioamancisidor commented Jul 5, 2023

@alexeldeib I guess I can simply add PASS_DEVICE_SPECS to the NVIDIA device plugin given in the AKS tutorial and set it to true, right? Or should I use the one you linked? BTW, when is the 202306.07.0 VM image released? Because the GPU in my jobs still becomes unavailable (I deployed the new plugin with the parameter set as you suggested 2 weeks ago).

@ghost ghost added the action-required label Aug 7, 2023
@wjhhuizi

wjhhuizi commented Aug 16, 2023

Found the same issue:

When a container starts on an NC24s_v3 node, nvidia-smi works fine most of the time; however, after a while, if I'm "lucky" enough, I get the Failed to initialize NVML: Unknown Error output from running the nvidia-smi command again.

Node image
kubernetes.azure.com/node-image-version=AKSUbuntu-2204gen2containerd-202304.20.0

Input:

import torch
print("CUDA Version used in Python: " + torch.version.cuda)
print("CUDA Device is Available: " + str(torch.cuda.is_available()))

Output:

CUDA Version used in Python: 11.7
CUDA Device is Available: False

When this happens, the GPU stays unavailable until I restart the Pod.

After restarting the Pod, the nvidia-smi command runs as expected:

Wed Aug 16 21:51:33 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000001:00:00.0 Off |                  Off |
| N/A   30C    P0    24W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000002:00:00.0 Off |                  Off |
| N/A   30C    P0    23W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This is a super annoying issue that is breaking our ML workflow... Please do some investigation into it.

++ Aha, I was in too much of a hurry posting this finding... I just saw the fix provided above :)
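
As a stopgap until the fixed node image rolls out, the "restart the Pod" workaround described above can be done from the CLI; a minimal sketch, where the pod and deployment names are placeholders:

# Restart the pod whose GPU disappeared (placeholder name).
kubectl delete pod my-gpu-pod

# If the pod is managed by a Deployment, a rollout restart achieves the same thing.
kubectl rollout restart deployment/my-gpu-deployment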

@wjhhuizi

This will be fixed in the 202306.07.0 VM image from AKS.

Ensure you deploy the NVIDIA device plugin with PASS_DEVICE_SPECS set to true via an env var or CLI flag.

NVIDIA/nvidia-docker#966 (comment) NVIDIA/nvidia-docker#1671 (comment)

This deployment (you can disable privileged mode for most scenarios) works fine: https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/deployments/static/nvidia-device-plugin-compat-with-cpumanager.yml

The AKS preview GPU dedicated VHD needs an additional tweak, ETA TBD, probably in the next few weeks.

So can you please give a little more information on the root cause? I'm very curious, since this problem has bothered us for several months...

@alexeldeib
Contributor

Take a peek at these links:

NVIDIA/nvidia-container-toolkit#48
NVIDIA/nvidia-docker#966 (comment)
NVIDIA/nvidia-docker#1671 (comment)

It's an issue with systemd as the cgroup manager; we applied the char-device fix mentioned in the first link via udev rules.
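
For anyone hitting this on node images that do not yet carry the fix, the char-device workaround described in those links boils down to recreating the /dev/char symlinks for the NVIDIA devices on the node, so that systemd device-cgroup updates do not revoke the container's access. A hedged sketch follows; it assumes nvidia-ctk from the NVIDIA Container Toolkit is present on the node, and the udev rule is adapted from the NVIDIA guidance linked above (AKS applies an equivalent fix in the fixed image).

# Run on the GPU node (e.g. via a privileged debug pod or node SSH).
# Recreate the /dev/char symlinks for the NVIDIA character devices.
sudo nvidia-ctk system create-dev-char-symlinks --create-all

# Optionally persist it with a udev rule so it re-runs when the driver module reloads
# (path and rule assumed from the NVIDIA workaround, not from this thread).
sudo tee /etc/udev/rules.d/71-nvidia-dev-char.rules <<'EOF'
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"
EOF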

@rogelioamancisidor
Author

@alexeldeib The problem is back (well, I am not sure it ever actually disappeared). As mentioned before, I set PASS_DEVICE_SPECS to true in the NVIDIA device plugin given in the AKS tutorial. I'll use the script that you posted and see if that helps. BTW, I don't need to upgrade the node image, as the nodepool was deployed after 202306.07.0.

@rogelioamancisidor
Author

@wjhhuizi did you solve the problem by following the above solution? I have deployed the suggested NVIDIA device plugin (after deleting the previous one) and updated the VM image on both the agentpool node and the GPU node, but the problem isn't solved! Very frustrating.
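
When debugging a setup like this, it may help to confirm both the node image version and that the running device plugin actually has PASS_DEVICE_SPECS set. A sketch, with the label selector assumed from the static manifest (adjust if yours differs):

# Check which node image the GPU nodes are running.
kubectl get nodes -L kubernetes.azure.com/node-image-version

# Check that the running device plugin pods actually carry PASS_DEVICE_SPECS=true.
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[0].env}{"\n"}{end}'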

Jose-Matsuda added a commit to StatCan/charts that referenced this issue Dec 4, 2023
recommended by Azure/AKS#3680 to fix losing access to GPU
microsoft-github-policy-service bot added the action-required and Needs Attention 👋 labels Feb 20, 2024
@justindavies
Contributor

Can you let me know if you are still seeing this issue?

microsoft-github-policy-service bot removed the action-required and Needs Attention 👋 labels Mar 26, 2024
@rogelioamancisidor
Author

I started using the AKS GPU image instead of the device plugin, as the solution posted here never worked for me.
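
For context, the AKS GPU image mentioned here is the preview GPU-dedicated VHD, which bundles the NVIDIA driver and device plugin in the node image. A hedged sketch of enabling it for a new node pool follows; the resource group, cluster, and pool names are placeholders, and the preview feature and header names are assumed from the AKS documentation of that period and may have changed since.

# One-time preview registration (assumed flow; check the current AKS docs).
az extension add --name aks-preview
az feature register --namespace Microsoft.ContainerService --name GPUDedicatedVHDPreview

# Create a GPU node pool that uses the GPU-dedicated image.
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gpunp \
  --node-count 1 \
  --node-vm-size Standard_NC24ads_A100_v4 \
  --aks-custom-headers UseGPUDedicatedVHD=true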

Action required from @Azure/aks-pm

microsoft-github-policy-service bot added the Needs Attention 👋 label May 1, 2024

Issue needing attention of @Azure/aks-leads

6 similar comments
