
Failed to initialize NVML: Unknown Error #430

Open

hoangtnm opened this issue Nov 1, 2022 · 28 comments

hoangtnm commented Nov 1, 2022

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description

Hi, I'm deploying Kubeflow v1.6.1 along with nvidia/gpu-operator for training DL models. It works great, but after a random amount of time (maybe 1-2 days, I guess), I can no longer use nvidia-smi to check GPU status. When this happens, it raises:

(base) jovyan@agm-0:~/vol-1$ nvidia-smi
Failed to initialize NVML: Unknown Error

I'm not sure why this happens: training runs without any problem for several epochs, but when I come back the next day this error appears. Do you have any idea?

2. Steps to reproduce the issue

This is how I deploy nvidia/gpu-operator:

sudo snap install helm --classic
helm repo add nvidia https://nvidia.github.io/gpu-operator \
  && helm repo update \
  && helm install \
  --version=v22.9.0 \
  --generate-name \
  --create-namespace \
  --namespace=gpu-operator-resources \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
  --set devicePlugin.env[0].value="volume-mounts" \
  --set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
  --set-string toolkit.env[0].value=false \
  --set toolkit.env[1].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS \
  --set-string toolkit.env[1].value=true

shivamerla (Contributor) commented:

@hoangtnm Can you confirm the OS version you are using, along with the runtime (containerd, docker) version? Also, is cgroup v2 enabled on the nodes? (i.e. the systemd.unified_cgroup_hierarchy=1 kernel command-line argument is passed and /sys/fs/cgroup/cgroup.controllers exists?)
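
One way to check this on the node is a quick shell session like the following (a minimal sketch; it assumes the standard cgroup mount point and that the docker/containerd CLIs are on the PATH):

# Reports "cgroup2fs" on cgroup v2 and "tmpfs" on the legacy v1 hierarchy
stat -fc %T /sys/fs/cgroup/
# Present only on cgroup v2; lists the available controllers
cat /sys/fs/cgroup/cgroup.controllers
# Runtime versions
docker version --format '{{.Server.Version}}'
containerd --version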

hoangtnm (Author) commented Nov 2, 2022

@shivamerla I'm using Ubuntu 22.04.1 LTS and Docker. This is my Docker daemon's config along with its version:

docker-ce-cli/jammy,now 5:20.10.20~3-0~ubuntu-jammy amd64 [installed,upgradable to: 5:20.10.21~3-0~ubuntu-jammy]
docker-ce-rootless-extras/jammy,now 5:20.10.20~3-0~ubuntu-jammy amd64 [installed,upgradable to: 5:20.10.21~3-0~ubuntu-jammy]
docker-ce/jammy,now 5:20.10.20~3-0~ubuntu-jammy amd64 [installed,upgradable to: 5:20.10.21~3-0~ubuntu-jammy]
docker-compose-plugin/jammy,now 2.12.0~ubuntu-jammy amd64 [installed,upgradable to: 2.12.2~ubuntu-jammy]
docker-scan-plugin/jammy,now 0.17.0~ubuntu-jammy amd64 [installed,upgradable to: 0.21.0~ubuntu-jammy]

{
    "default-runtime": "nvidia",
    "exec-opts": [
        "native.cgroupdriver=systemd"
    ],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m"
    },
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime"
        },
        "nvidia-experimental": {
            "args": [],
            "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
        }
    },
    "storage-driver": "overlay2"
}

Btw, I don't think cgroup v2 is configured on my system. I only installed a fresh Ubuntu, Docker with the config above, and then deployed the gpu-operator.

klueska (Contributor) commented Nov 2, 2022

The default in Ubuntu 22.04 is cgroupv2. Just to confirm though, can you show us the contents of this folder:

/sys/fs/cgroup/

xhejtman commented Dec 4, 2022

I have cgroup v2 on Ubuntu 22.04 and have the same problem. Does that mean cgroup v2 is not supported here?

xhejtman commented Dec 4, 2022

I tried setting systemd.unified_cgroup_hierarchy=0, but the result is the same. I guess it could be related to SystemdCgroup = true in the containerd config.toml?
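
A quick way to check is to grep the containerd config on the node (a minimal sketch; the path shown is the stock containerd location, while rke2/k3s keep their generated config elsewhere):

# Show any SystemdCgroup settings in the runtime options
grep -n SystemdCgroup /etc/containerd/config.toml
# Confirm which cgroup hierarchy the kernel was booted with
tr ' ' '\n' < /proc/cmdline | grep -i cgroup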

xhejtman commented Dec 4, 2022

containerd version is v1.6.6-k3s1 (rke2), Kubernetes 1.24.8.

xhejtman commented Dec 4, 2022

I checked: when upgrading from 1.24.2 to 1.24.8, I got this error. Versions later than 1.24.2 require SystemdCgroup = true, which seems to be incompatible with the NVIDIA toolkit. I tried both the 1.11.0 and 22.9.0 operator versions.

Whole runtime configuration:

      [plugins.cri.containerd.runtimes]

        [plugins.cri.containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"

          [plugins.cri.containerd.runtimes.nvidia.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
            Runtime = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
            SystemdCgroup = true

        [plugins.cri.containerd.runtimes.nvidia-experimental]
          runtime_type = "io.containerd.runc.v2"

          [plugins.cri.containerd.runtimes.nvidia-experimental.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
            Runtime = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
            SystemdCgroup = true

        [plugins.cri.containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"

          [plugins.cri.containerd.runtimes.runc.options]
            SystemdCgroup = true

Any chance of a quick fix?

The problem is that none of the /dev/nvidia* nodes in containers are accessible:

cat /dev/nvidiactl 
cat: /dev/nvidiactl: Operation not permitted

xhejtman commented Dec 5, 2022

All of this is happening because systemd removes the device from the cgroup: device access needs to be set via systemd, not written directly into the cgroup file.
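
A rough way to observe this on an affected node (a sketch; the scope name below is a placeholder, look up the real one with systemd-cgls):

# List the container scopes that systemd is tracking
systemd-cgls --no-pager | grep -i cri-containerd
# Show the device policy systemd will re-apply for a given container scope.
# Device permissions that libnvidia-container wrote directly into the cgroup
# are not part of this list, so they are dropped when systemd reconciles the unit.
systemctl show 'cri-containerd-<container-id>.scope' -p DevicePolicy -p DeviceAllow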

xhejtman commented Dec 5, 2022

I found out that this is connected with the static cpu-manager-policy. If both SystemdCgroup and the static cpu-manager-policy are used, then access rights to the device are removed and the GPU is unusable. May be related to #455.
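
For reference, this is one way to check whether a node is running with the static policy (a small sketch; the kubelet config path is the common default and may differ on rke2/k3s):

# Static CPU manager policy, either as a kubelet flag ...
ps aux | grep -o 'cpu-manager-policy=[a-z]*'
# ... or as cpuManagerPolicy: static in the KubeletConfiguration file
grep -n cpuManagerPolicy /var/lib/kubelet/config.yaml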

klueska (Contributor) commented Dec 5, 2022

The issue with CPUManager compatibility is a well-known one that had an (until recently) stable workaround: using the --compatWithCPUManager option to the device plugin helm chart (or, more specifically, passing the --pass-device-specs flag directly to the plugin binary). Please see NVIDIA/nvidia-docker#966 for a discussion of why this is an issue and of this workaround.
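
For the operator-deployed plugin, that option maps to the plugin's PASS_DEVICE_SPECS environment variable; a hedged sketch of passing it through the gpu-operator chart (the release name and env index are illustrative and must not collide with env entries you already set):

helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator-resources \
  --set devicePlugin.env[0].name=PASS_DEVICE_SPECS \
  --set-string devicePlugin.env[0].value=true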

Unfortunately, it seems that recent combinations of systemd / containerd / runc do not allow this workaround to work anymore. As mentioned in the link above, the underlying issue is due to a flaw in the design of the existing nvidia-container-stack, and not something that is easily worked around.

We have been working on a redesign of the nvidia-container-stack (based on CDI) for a few years now that architects this problem away, but it is not yet enabled by default. For many use cases it is already a stable / better solution than what is provided today, but it does not have full feature parity with the existing stack yet, which is why we can't just make the switch.

That said, for most (possibly all) GPU operator use cases it should have feature parity, and we plan on switching to this new approach as the default in the next couple of releases (likely the March release).

In the meantime, I will see if we can slip an option into the next operator release (coming out in 2 weeks) to provide the ability to enable CDI as the default mechanism for device injection, so that those of you facing this problem at least have a way out of it.

xhejtman commented Dec 5, 2022

Thank you for the explanation! I can disable the static CPU manager for a while, but I will also test pass-device-specs to see whether it works.

klueska (Contributor) commented Dec 5, 2022

@xhejtman Just out of curiosity, can you try to apply the following to see if it resolves your issue:
NVIDIA/nvidia-container-toolkit#251

We would like to understand whether creating these symlinks on a system that exhibits the issue is enough to work around it.

xhejtman commented Dec 5, 2022

It seems to work as well.

xhejtman commented Dec 5, 2022

Btw, does it work correctly if you have multiple cards in the system and request only one?

xhejtman commented Dec 5, 2022

However, with the latest NVIDIA driver (520), there are no /dev/nvidia* nodes on the host, so the workaround with ln -s is not applicable. Version 520 is required for the H100 card.

klueska (Contributor) commented Dec 6, 2022

What do you mean there are no /dev/nvidia* nodes on the host? Nothing has changed in that regard with respect to the driver. That said, it has never been the case that these nodes get created by the driver itself (due to GPL limitations). They typically get created in one of three ways:

  1. Running nvidia-smi on the host once the driver installation has completed (which will create all device nodes)
  2. Manually running nvidia-modprobe telling it which specific device nodes to create
  3. Relying on the nvidia container stack (and libnvidia-container specifically) to create them for you before injecting them into a container

Based on this bug, 3 won't work anymore, but 1 and 2 still should.
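
A minimal sketch of options 1 and 2 (the nvidia-modprobe invocation is illustrative; exact flags can vary by driver version):

# Option 1: running nvidia-smi on the host creates any missing device nodes
nvidia-smi
# Option 2: ask nvidia-modprobe for specific nodes, e.g. /dev/nvidiactl and
# /dev/nvidia0 plus the unified-memory node /dev/nvidia-uvm
sudo nvidia-modprobe -c 0 -u
ls -l /dev/nvidia*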

xhejtman commented Dec 6, 2022

root@kub-b10:~# ls /dev/nvid*
ls: cannot access '/dev/nvid*': No such file or directory
root@kub-b10:~#
root@kub-b10:~# chroot /run/nvidia/driver
root@kub-b10:/# ls /dev/nvid*
/dev/nvidia-modeset  /dev/nvidia-uvm  /dev/nvidia-uvm-tools  /dev/nvidia0  /dev/nvidiactl

/dev/nvidia-caps:
nvidia-cap1  nvidia-cap2
root@kub-b10:/# 

So what I mean is that the /dev/nvidia* node files are only in the /run chroot and in the container; I do not see them in the host /dev.

But that's my bad, they are actually also missing with the older driver; I thought they were created by loading the nvidia module.

shivamerla (Contributor) commented:

@xhejtman This is the expected behavior with the driver container root under /run/nvidia/driver. If the driver is installed directly on the node, then we would see the /dev/nvidia* device nodes.

benlsheets commented:

May be related to NVIDIA/nvidia-docker#455

I am definitely using SystemdCgroup and the static cpu-manager-policy, FWIW.

klueska (Contributor) commented Dec 12, 2022

Could you see if manually creating the /dev/char devices as described here helps to resolve your issue:
NVIDIA/nvidia-container-toolkit#251

Regardless of whether you are running with the driver container or not, these char devices will need to be created in the root /dev/char folder.

xhejtman commented:

Should it be fixed in version 22.9.1?

klueska (Contributor) commented Jan 3, 2023

No, unfortunately, not.

we10710aa commented:

@klueska I followed NVIDIA/nvidia-docker#1671, and the conclusion seems to be that gpu-operator won't be compatible with runc versions newer than 1.1.3 (containerd versions newer than 1.6.7).

This Failed to initialize NVML: Unknown Error will happen even if cpuManager is not set (at least, this is our case).

I think this issue should definitely be added to the known issues in the release notes; otherwise, people who upgrade their containerd version in production will face detrimental consequences.

klueska (Contributor) commented Jan 13, 2023

I was able to reproduce this and verify that manually creating symlinks to the various nvidia devices in /dev/char resolves the issue. I need to talk to our driver team to determine why these are not automatically created and how to get them created going forward.

At least we seem to fully understand the problem now, and know what is necessary to resolve it. In the meantime, I would recommend creating these symlinks manually to work around this issue.
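
A minimal sketch of that manual workaround, assuming the device nodes already exist under /dev on the host (the /dev/char entries are named by decimal major:minor, which is where systemd-managed runtimes resolve device cgroup rules; the operator's own implementation may differ in detail):

#!/bin/bash
# Create /dev/char/<major>:<minor> symlinks for every NVIDIA character device.
set -e
for dev in /dev/nvidia*; do
    [ -c "$dev" ] || continue                     # skip non-device entries such as /dev/nvidia-caps
    major=$((16#$(stat -c '%t' "$dev")))          # stat reports major/minor in hex
    minor=$((16#$(stat -c '%T' "$dev")))
    ln -sf "$dev" "/dev/char/${major}:${minor}"
done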

shivamerla pushed a commit that referenced this issue Jan 27, 2023
The existence of these symlinks is required to address the following bug:
#430
This bug impacts container runtimes configured with systemd cgroup management enabled.
shivamerla pushed a commit that referenced this issue Jan 30, 2023
The existence of these symlinks is required to address the following bug:
#430
This bug impacts container runtimes configured with systemd cgroup management enabled.

cdesiniotis (Contributor) commented Feb 2, 2023

We just released GPU Operator 22.9.2, which contains a workaround for this issue. After the driver is installed, we create symlinks under /dev/char pointing to all NVIDIA character devices.

@hoangtnm @xhejtman would you be able to verify 22.9.2 resolves this issue?
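
One way to verify the workaround took effect on a node (a small sketch):

# The NVIDIA device nodes should now be reachable via /dev/char/<major>:<minor>
ls -l /dev/char/ | grep -i nvidia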

xhejtman commented May 2, 2023

Hello, it seems that the char dev symlinks do not solve this issue with MIG devices; nvidia-smi complains about:

517971 openat(AT_FDCWD, "/proc/driver/nvidia/capabilities/gpu0/mig/gi13/access", O_RDONLY) = -1 ENOENT (No such file or directory)

Should it be fixed in newer versions?

wangzhipeng commented:

Upgrading runc resolves it: opencontainers/runc@bf7492e

zlianzhuang commented:

Set the device-plugin param PASS_DEVICE_SPECS to true.
