Failed to initialize NVML: Unknown Error #430
@hoangtnm Can you confirm the OS version you are using, along with the runtime (containerd, docker) version? Also, is cgroup v2 enabled on the nodes? (i.e. is the systemd.unified_cgroup_hierarchy=1 kernel command line passed, and does /sys/fs/cgroup/cgroup.controllers exist?)
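For reference, a quick way to answer these questions on a node is with a few standard commands (a generic sketch; these are the standard cgroup v2 paths):

# cgroup v2 is in use if the root cgroup filesystem is cgroup2fs
stat -fc %T /sys/fs/cgroup/
# on cgroup v2 this file exists and lists the available controllers
cat /sys/fs/cgroup/cgroup.controllers
# check whether the unified hierarchy was forced via the kernel command line
grep -o 'systemd.unified_cgroup_hierarchy=[01]' /proc/cmdline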
@shivamerla I'm using Ubuntu 22.04.1 LTS and docker. This is my docker daemon's config along with its version:
docker-ce-cli/jammy,now 5:20.10.20~3-0~ubuntu-jammy amd64 [installed,upgradable to: 5:20.10.21~3-0~ubuntu-jammy]
docker-ce-rootless-extras/jammy,now 5:20.10.20~3-0~ubuntu-jammy amd64 [installed,upgradable to: 5:20.10.21~3-0~ubuntu-jammy]
docker-ce/jammy,now 5:20.10.20~3-0~ubuntu-jammy amd64 [installed,upgradable to: 5:20.10.21~3-0~ubuntu-jammy]
docker-compose-plugin/jammy,now 2.12.0~ubuntu-jammy amd64 [installed,upgradable to: 2.12.2~ubuntu-jammy]
docker-scan-plugin/jammy,now 0.17.0~ubuntu-jammy amd64 [installed,upgradable to: 0.21.0~ubuntu-jammy]
{
"default-runtime": "nvidia",
"exec-opts": [
"native.cgroupdriver=systemd"
],
"log-driver": "json-file",
"log-opts": {
"max-size": "100m"
},
"runtimes": {
"nvidia": {
"args": [],
"path": "/usr/local/nvidia/toolkit/nvidia-container-runtime"
},
"nvidia-experimental": {
"args": [],
"path": "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
}
},
"storage-driver": "overlay2"
}
Btw, I don't think
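A quick way to confirm which cgroup driver and default runtime the daemon actually picked up (generic docker CLI usage):

docker info | grep -iE 'cgroup|default runtime'

With the config above, this should report the systemd cgroup driver and nvidia as the default runtime.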
The default in Ubuntu 22.04 is cgroup v2.
I have
I tried to set
The containerd version is v1.6.6-k3s1 (rke2), Kubernetes 1.24.8.
I checked: when upgrading from 1.24.2 to 1.24.8, I got this error. Versions later than 1.24.2 require
Whole runtime configuration:
Any chance for a quick fix? The problem is that all
All of this is happening because systemd removes the device from the cgroup, since device access needs to be set via systemd and not written directly into the cgroup file.
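A minimal sketch of how to observe this, assuming a docker-managed container scope (the unit name below is hypothetical):

# show the device rules systemd tracks for a (hypothetical) container scope
systemctl show -p DeviceAllow docker-<container-id>.scope
# a `systemctl daemon-reload` makes systemd re-apply exactly this list; device nodes
# it cannot resolve (e.g. missing /dev/char/<major:minor> symlinks) lose access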
I found out that this is connected with the static CPU manager policy.
The issue with CPUManager compatibility is a well-known one that had an (until recently) stable workaround: using the PASS_DEVICE_SPECS option on the device plugin. Unfortunately, it seems that recent combinations of systemd / containerd / runc no longer allow this workaround to work.

As mentioned in the link above, the underlying issue is due to a flaw in the design of the existing nvidia-container-stack, and not something that is easily worked around. We have been working on a redesign of the nvidia-container-stack (based on CDI) for a few years now that architects this problem away, but it is not yet enabled by default. For many use cases, it is already a stable / better solution than what is provided today, but it does not have full feature parity with the existing stack yet, which is why we can't just make the switch.

That said, for most (possibly all) GPU operator use cases it should have feature parity, and we plan on switching to this new approach as the default in the next couple of releases (likely the March release). In the meantime, I will see if we can slip an option into the next operator release (coming out in 2 weeks) to at least provide the ability to enable CDI as the default mechanism for device injection, so that those of you facing this problem have a way out of it.
Thank you for the explanation! I can also disable the static manager for a while, but I will also test CDI.
@xhejtman Just out of curiosity, can you try to apply the following to see if it resolves your issue? We would like to understand whether creating these symlinks on a system that exhibits the issue is enough to work around it.
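A minimal sketch of what creating those symlinks can look like, assuming the standard /dev/nvidia* device nodes exist on the host (run as root; the exact set of nodes depends on the driver setup):

#!/bin/bash
# create /dev/char/<major>:<minor> symlinks for the NVIDIA device nodes on the host
set -e
cd /dev/char
for dev in /dev/nvidia*; do
  [ -c "$dev" ] || continue      # skip anything that is not a character device
  major=$(stat -c %t "$dev")     # major number, hexadecimal
  minor=$(stat -c %T "$dev")     # minor number, hexadecimal
  ln -sf "$dev" "$((16#$major)):$((16#$minor))"
done
# note: MIG capability devices under /dev/nvidia-caps are not covered by this simple loop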
It seems it works as well.
Btw, does it work correctly if you have multiple cards in the system and request only one?
However, with the latest nvidia driver (520), there are no /dev/nvidia* device nodes in the host /dev.
What do you mean there are no /dev/nvidia* device nodes?
Based on this bug, option 3 won't work anymore, but options 1 and 2 still should.
So I mean that the /dev/nvidia* node files are only in the /run chroot and in the container; I do not see them in the host /dev. But that's my bad, they are actually also missing with the older driver. I thought they were created by loading the nvidia module.
@xhejtman This is the expected behavior with the driver container root under /run/nvidia/driver.
I am definitely using SystemdCgroup and a static cpu-manager-policy, FWIW.
Could you see if manually creating the /dev/char symlinks resolves the issue? Regardless of whether you are running with the driver container or not, these char devices will need to be created under /dev/char in the root namespace.
Should it be fixed in version 22.9.1?
No, unfortunately not.
@klueska I followed NVIDIA/nvidia-docker#1671, and the conclusion seems to be that gpu-operator won't be compatible with runc versions newer than 1.1.3 (containerd versions newer than 1.6.7). I think this issue should definitely be added to the known issues in the release notes; otherwise, people who upgrade their containerd version in production will face detrimental consequences.
I was able to reproduce this and verify that manually creating symlinks to the various nvidia devices in /dev/char resolves the issue. I need to talk to our driver team to determine why these are not automatically created and how to get them created going forward. At least we seem to fully understand the problem now and know what is necessary to resolve it. In the meantime, I would recommend creating these symlinks manually to work around this issue.
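To sanity-check the workaround on a node, the kernel's registered NVIDIA character-device majors and the resulting symlinks can be inspected with standard tools:

# the NVIDIA character-device majors registered with the kernel
grep nvidia /proc/devices
# the symlinks created by the workaround should show up here
ls -l /dev/char | grep nvidia

Note that /dev is typically a devtmpfs, so the symlinks need to be recreated after every reboot, e.g. from a boot-time script or unit.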
The existence of these symlinks is required to address the following bug: #430. This bug impacts container runtimes configured with systemd cgroup management enabled.
Hello, it seems that the char dev symlink workaround does not solve this issue with MIG devices; nvidia-smi complains about:
Should it be fixed in newer versions?
opencontainers/runc@bf7492e: upgrading runc resolves this.
Set the device-plugin parameter PASS_DEVICE_SPECS to true.
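If the device plugin is deployed through the gpu-operator Helm chart, one possible way to pass that environment variable is via the chart's device-plugin env values; the devicePlugin.env key shown below is an assumption and may be named differently in your chart version:

# hypothetical values path: check your chart's values.yaml for the device-plugin env list
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --set devicePlugin.env[0].name=PASS_DEVICE_SPECS \
  --set-string devicePlugin.env[0].value=true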
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- Are i2c_core and ipmi_msghandler loaded on the nodes?
- Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
1. Issue or feature description
Hi, I'm deploying Kubeflow v1.6.1 along with nvidia/gpu-operator for training DL models. It works great, but after a random amount of time (maybe 1-2 days, I guess), I cannot use nvidia-smi to check GPU status anymore. When this happens, it raises:
(base) jovyan@agm-0:~/vol-1$ nvidia-smi
Failed to initialize NVML: Unknown Error
I'm not so sure why this happens because it runs training without any problem for several epochs, and when I come back the next day, this error happens. Do you have any idea?
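A commonly reported way to trigger the same failure immediately, rather than waiting 1-2 days, is a systemd daemon reload on the host; shown here as a diagnostic sketch, matching the systemd cgroup behaviour discussed elsewhere in this thread:

# inside a GPU pod: works at first
nvidia-smi
# on the host: systemd re-applies its device cgroup rules
sudo systemctl daemon-reload
# inside the GPU pod again: now fails with "Failed to initialize NVML: Unknown Error"
nvidia-smi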
2. Steps to reproduce the issue
This is how I deploy nvidia/gpu-operator: