
GPU becomes unavailable #3680

Closed
rogelioamancisidor opened this issue May 25, 2023 · 22 comments
Labels
action-required · Needs Attention 👋 · question

Comments

@rogelioamancisidor

rogelioamancisidor commented May 25, 2023

I send several jobs from a bash file, e.g.

run my_code.py --some_params
run my_code.py --other_params
....

to a nodepool running an A100 GPU. The first job uses the GPU; I can see the GPU being used in wandb, and I also connected to the VM and confirmed with TensorFlow that the GPU is available. However, when the second run starts, the GPU is unavailable. Again, I confirmed this in wandb and by checking on the VM.

Why is this happening? I have never experienced this before.

UPDATE: I connected to the VM, and running nvidia-smi returns

Failed to initialize NVML: Unknown Error

I found this link and this other one.
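
For reference, a minimal way to reproduce that check from inside the affected pod might look like the following. This is only a sketch: it assumes kubectl access and that TensorFlow is installed in the pod image, and the pod name my-gpu-pod is just a placeholder.

# Placeholder pod name; substitute the pod that lost its GPU.
POD=my-gpu-pod

# Check the driver/NVML state inside the container.
kubectl exec "$POD" -- nvidia-smi

# Ask TensorFlow whether it still sees the GPU.
kubectl exec "$POD" -- python -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'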

@rogelioamancisidor rogelioamancisidor changed the title Why is the GPU becoming unavailable? GPU becomes unavailable May 27, 2023
@olsenme
Contributor

olsenme commented Jun 2, 2023

What SKU are you using?

@rogelioamancisidor
Author

rogelioamancisidor commented Jun 3, 2023

@olsenme It happened with both standard_nc24ads_a100_v4 and standard_nc6s_v3. It was working fine a couple of days ago, but today the error is back.

@alexeldeib
Contributor

This will be fixed in the 202306.07.0 VM image from AKS.

Ensure you deploy the NVIDIA device plugin with PASS_DEVICE_SPECS set to true via an env var or CLI flag.

NVIDIA/nvidia-docker#966 (comment)
NVIDIA/nvidia-docker#1671 (comment)

This deployment (you can disable privileged mode for most scenarios) works fine: https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/deployments/static/nvidia-device-plugin-compat-with-cpumanager.yml

The AKS preview GPU dedicated VHD needs an additional tweak, ETA TBD, probably in the next few weeks.
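
A rough sketch of acting on this advice is below; the commands are not taken from the thread, and the DaemonSet name and namespace are assumptions that may differ in your cluster. The linked compat-with-cpumanager manifest is meant to set PASS_DEVICE_SPECS already.

# Deploy the linked manifest (the compat-with-cpumanager variant of the device plugin).
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/deployments/static/nvidia-device-plugin-compat-with-cpumanager.yml

# If you keep an existing device plugin DaemonSet instead, set the variable on it directly
# (DaemonSet name and namespace assumed; adjust to your deployment).
kubectl -n kube-system set env daemonset/nvidia-device-plugin-daemonset PASS_DEVICE_SPECS=true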

@rogelioamancisidor
Author

rogelioamancisidor commented Jul 5, 2023

@alexeldeib I guess I can simply add PASS_DEVICE_SPECS to the NVIDIA device plugin given in the AKS tutorial and set it to true, right? Or should I use the one you linked? BTW, when is the 202306.07.0 VM image released? Because the GPU in my jobs still becomes unavailable (I deployed the new plugin with the parameter set as you suggested 2 weeks ago).

@ghost ghost added the action-required label Aug 7, 2023
@wjhhuizi

wjhhuizi commented Aug 16, 2023

Found the same issue:

When a container starts on an NC24s_v3 node, nvidia-smi works fine most of the time; however, after a while, if I'm "lucky" enough, I get the Failed to initialize NVML: Unknown Error output from running the nvidia-smi command again.

Node image
kubernetes.azure.com/node-image-version=AKSUbuntu-2204gen2containerd-202304.20.0

Input:

import torch
print("CUDA Version used in Python: " + torch.version.cuda)
print("CUDA Device is Available: " + str(torch.cuda.is_available()))

Output:

CUDA Version used in Python: 11.7
CUDA Device is Available: False

When this happens, the GPU stays unavailable until I restart the Pod.

After restarting the Pod, the nvidia-smi command runs as expected:

Wed Aug 16 21:51:33 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000001:00:00.0 Off |                  Off |
| N/A   30C    P0    24W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000002:00:00.0 Off |                  Off |
| N/A   30C    P0    23W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This is a super annoying issue that is breaking our ML workflow... Please do some investigation into it.

++ Aha, I was in too much of a hurry posting this finding... I just saw the fix provided above :)
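
As a stopgap until the fixed node image rolls out, the "restart the Pod" workaround described above can be done from the CLI; a minimal sketch, where the pod and deployment names are placeholders:

# Restart the pod whose GPU disappeared (placeholder name).
kubectl delete pod my-gpu-pod

# If the pod is managed by a Deployment, a rollout restart achieves the same thing.
kubectl rollout restart deployment/my-gpu-deployment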

@wjhhuizi

This will be fixed in the 202306.07.0 VM image from AKS.

Ensure you deploy the NVIDIA device plugin with PASS_DEVICE_SPECS set to true via an env var or CLI flag.

NVIDIA/nvidia-docker#966 (comment) NVIDIA/nvidia-docker#1671 (comment)

This deployment (you can disable privileged mode for most scenarios) works fine: https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/deployments/static/nvidia-device-plugin-compat-with-cpumanager.yml

The AKS preview GPU dedicated VHD needs an additional tweak, ETA TBD, probably in the next few weeks.

So can you please give a little more information on the root cause? I'm very curious, since this problem has bothered us for several months...

@alexeldeib
Contributor

Take a peek at these links:

NVIDIA/nvidia-container-toolkit#48
NVIDIA/nvidia-docker#966 (comment)
NVIDIA/nvidia-docker#1671 (comment)

It's an issue with systemd as the cgroup manager; we applied the char-device fix mentioned in the first link via udev rules.
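
For anyone hitting this on node images that do not yet carry the fix, the char-device workaround described in those links boils down to recreating the /dev/char symlinks for the NVIDIA devices on the node, so that systemd device-cgroup updates do not revoke the container's access. A hedged sketch follows; it assumes nvidia-ctk from the NVIDIA Container Toolkit is present on the node, and the udev rule is adapted from the NVIDIA guidance linked above (AKS applies an equivalent fix in the fixed image).

# Run on the GPU node (e.g. via a privileged debug pod or node SSH).
# Recreate the /dev/char symlinks for the NVIDIA character devices.
sudo nvidia-ctk system create-dev-char-symlinks --create-all

# Optionally persist it with a udev rule so it re-runs when the driver module reloads
# (path and rule assumed from the NVIDIA workaround, not from this thread).
sudo tee /etc/udev/rules.d/71-nvidia-dev-char.rules <<'EOF'
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"
EOF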

@rogelioamancisidor
Author

@alexeldeib The problem is back (well, I am not sure it ever actually disappeared). As mentioned before, I set PASS_DEVICE_SPECS to true in the NVIDIA device plugin given in the AKS tutorial. I'll use the script that you posted and see if that helps. BTW, I don't need to upgrade the node image, as the nodepool was deployed after 202306.07.0.

@rogelioamancisidor
Author

@wjhhuizi did you solve the problem by following the above solution? I have deployed the suggested NVIDIA device plugin (after deleting the previous one) and updated the VM image on both the agentpool node and the GPU node, but the problem isn't solved! Very frustrating.
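
When debugging a setup like this, it may help to confirm both the node image version and that the running device plugin actually has PASS_DEVICE_SPECS set. A sketch, with the label selector assumed from the static manifest (adjust if yours differs):

# Check which node image the GPU nodes are running.
kubectl get nodes -L kubernetes.azure.com/node-image-version

# Check that the running device plugin pods actually carry PASS_DEVICE_SPECS=true.
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[0].env}{"\n"}{end}'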

Jose-Matsuda added a commit to StatCan/charts that referenced this issue Dec 4, 2023
recommended by Azure/AKS#3680 to fix losing access to GPU
microsoft-github-policy-service bot added the action-required and Needs Attention 👋 labels Feb 20, 2024
@justindavies
Contributor

Can you let me know if you are still seeing this issue?

microsoft-github-policy-service bot removed the action-required and Needs Attention 👋 labels Mar 26, 2024
@rogelioamancisidor
Author

I started using the AKS GPU image instead of the device plugin, as the solution posted here never worked for me.
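
For context, the AKS GPU image mentioned here is the preview GPU-dedicated VHD, which bundles the NVIDIA driver and device plugin in the node image. A hedged sketch of enabling it for a new node pool follows; the resource group, cluster, and pool names are placeholders, and the preview feature and header names are assumed from the AKS documentation of that period and may have changed since.

# One-time preview registration (assumed flow; check the current AKS docs).
az extension add --name aks-preview
az feature register --namespace Microsoft.ContainerService --name GPUDedicatedVHDPreview

# Create a GPU node pool that uses the GPU-dedicated image.
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gpunp \
  --node-count 1 \
  --node-vm-size Standard_NC24ads_A100_v4 \
  --aks-custom-headers UseGPUDedicatedVHD=true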

Action required from @Azure/aks-pm

microsoft-github-policy-service bot added the Needs Attention 👋 label May 1, 2024

Issue needing attention of @Azure/aks-leads

6 similar comments
