GPU becomes unavailable #3680
Comments
What SKU are you using?

@olsenme It happened with both
Will be fixed in the 202306.07.0 VM image from AKS. Ensure you deploy the NVIDIA device plugin as described in NVIDIA/nvidia-docker#966 (comment). This deployment (you can disable privileged mode for most scenarios) works fine: https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/deployments/static/nvidia-device-plugin-compat-with-cpumanager.yml

The AKS preview GPU dedicated VHD needs an additional tweak; ETA TBD, probably in the next few weeks.
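In practice, applying that manifest looks roughly like this; a minimal sketch, assuming the raw.githubusercontent.com URL is the canonical location of the file linked above and that the DaemonSet and label names match the upstream static manifest:

```bash
# Remove any previously deployed device plugin DaemonSet first (the name may differ in your cluster).
kubectl -n kube-system delete daemonset nvidia-device-plugin-daemonset --ignore-not-found

# Deploy the CPU-manager-compatible variant of the NVIDIA device plugin.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/deployments/static/nvidia-device-plugin-compat-with-cpumanager.yml

# Confirm the plugin pods are running and the node advertises nvidia.com/gpu again.
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds
kubectl describe node <gpu-node-name> | grep -i "nvidia.com/gpu"
```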
@alexeldeib I guess I can simply add
Right, this example worked for me: https://github.com/NVIDIA/k8s-device-plugin/blob/main/deployments/static/nvidia-device-plugin-compat-with-cpumanager.yml. The fixed VHD should be out now; you have to trigger a node image upgrade: https://learn.microsoft.com/en-us/azure/aks/node-image-upgrade#check-for-available-node-image-upgrades
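For reference, checking for and triggering the node image upgrade with the Azure CLI looks roughly like this (resource group, cluster, and node pool names are placeholders):

```bash
# Check whether a newer node image is available for the GPU node pool.
az aks nodepool get-upgrades \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --nodepool-name <gpu-nodepool>

# Upgrade only the node image, leaving the Kubernetes version unchanged.
az aks nodepool upgrade \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --name <gpu-nodepool> \
  --node-image-only
```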
Found the same issue: when the container started on the NC24s_v3 node, most of the time the GPU was unavailable inside it.

Input:
```python
import torch

print("CUDA Version used in Python: " + torch.version.cuda)
print("CUDA Device is Available: " + str(torch.cuda.is_available()))
```

Output: torch.cuda.is_available() returns False.

When this happens, the GPU stays unavailable until I restart the Pod. After restarting the pod, the nvidia-smi command runs as expected.

This is a super annoying issue that keeps breaking our ML workflow... Please investigate. ++ Aha, I was in too much of a hurry posting this finding... just saw the fix provided above :)
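For anyone hitting the same symptom, a quick cluster-side check and workaround looks roughly like this (namespace and pod names are placeholders):

```bash
# Check whether the GPU is still visible from inside the running pod.
kubectl exec -n <namespace> <pod-name> -- nvidia-smi

# If nvidia-smi fails, recreate the pod; its controller (Deployment/Job)
# will schedule a replacement, which gets working GPU access again.
kubectl delete pod -n <namespace> <pod-name>
```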
So can you please give a little more information on the root cause? I was so curious, since this problem has bothered us for several months...
Take a peek at these links: NVIDIA/nvidia-container-toolkit#48. It's an issue with systemd as the cgroup manager; we applied the char device fix mentioned in the first link via udev rules.
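For reference, the char-device workaround described in the linked NVIDIA issue boils down to recreating the /dev/char symlinks for the NVIDIA devices whenever the driver is loaded, so that systemd's device cgroup updates don't revoke the container's GPU access. A rough sketch of such a udev rule, assuming the nvidia-ctk binary from the NVIDIA Container Toolkit is installed at /usr/bin/nvidia-ctk (the exact rule shipped on AKS nodes may differ):

```bash
# Install a udev rule that recreates /dev/char symlinks for NVIDIA devices
# whenever the nvidia driver binds to the PCI device.
cat <<'EOF' | sudo tee /etc/udev/rules.d/71-nvidia-dev-char.rules
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"
EOF

# Reload udev rules so the new rule takes effect.
sudo udevadm control --reload-rules
```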
@alexeldeib The problem is back (well, I am not sure it actually disappeared). As mentioned before, I set
@wjhhuizi did you solve the problem by following the above solution? I have deployed the suggested NVIDIA device plugin (after deleting the previous one) and updated the VM image on both the agentpool node and the GPU node, but the problem isn't solved! Very frustrating.
recommended by Azure/AKS#3680 to fix losing access to GPU
Can you let me know if you are still seeing this issue?
I started using the AKS GPU image instead of the device plugin, as the solution posted here never worked for me.
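For anyone else going that route, at the time this roughly meant creating the node pool with the preview GPU-dedicated image header; a sketch under those assumptions (names and sizes are placeholders, and the preview flow may have changed since):

```bash
# Add a GPU node pool that uses the (preview) AKS GPU-dedicated image,
# which ships the NVIDIA driver and device plugin preinstalled.
az aks nodepool add \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --name gpunp \
  --node-count 1 \
  --node-vm-size Standard_NC24s_v3 \
  --aks-custom-headers UseGPUDedicatedVHD=true
```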
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
Issue needing attention of @Azure/aks-leads
6 similar comments
I send several jobs from a bash file, e.g., to a nodepool running an A100 GPU (a sketch of what such a script might look like is below). The first job uses the GPU; I can see the GPU being used in wandb, and I also pinged the VM and confirmed that the GPU is available using TensorFlow. However, when the second run starts, the GPU is unavailable. Again, I confirmed that in wandb and by pinging the VM.

Why is this happening? I have never experienced this before.
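Since the script itself did not make it into the thread, here is a purely hypothetical sketch of the kind of bash file meant above, submitting several training runs back to back (train.py, the config names, and the wandb logging are assumptions for illustration):

```bash
#!/usr/bin/env bash
# Hypothetical sketch: run several training jobs in sequence on the A100 node.
set -euo pipefail

for config in run_a.yaml run_b.yaml run_c.yaml; do
  # Each run requests the GPU and logs its metrics to wandb.
  python train.py --config "${config}"
done
```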
UPDATE: I pinged the VM, and running `nvidia-smi` returns an error. I found this link and this other one link.