Failed to initialize NVML: Unknown Error without any kubelet update (cpu-manager-policy is default none) #1618
Comments
I'd really appreciate any help so that I can dig deeper. |
@ReyRen Did you try adding systemd.unified_cgroup_hierarchy=0 to your boot cmdline? That solved the issue for me. |
@su-tjones Thanks for your suggestion; I will give it a try. But I really want to know what's the |
unified_cgroup_hierarchy=0 means switching the cgroup from v2 to v1. |
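To see which cgroup version a host is currently on before and after changing the boot cmdline, something like the following sketch may help. It assumes a Linux host with GNU `stat`, where `/sys/fs/cgroup` reports the filesystem type `cgroup2fs` under the unified (v2) hierarchy and `tmpfs` under v1. The function names `cgroup_version_from_fstype` and `cmdline_forces_v1` are made up for illustration.

```shell
#!/bin/sh
# Map the filesystem type of /sys/fs/cgroup to a cgroup version.
# cgroup2fs => unified hierarchy (v2); tmpfs => legacy or hybrid (v1).
cgroup_version_from_fstype() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1" ;;
        *)         echo "unknown" ;;
    esac
}

# Report whether a kernel command line string carries the v1 switch.
cmdline_forces_v1() {
    case " $1 " in
        *" systemd.unified_cgroup_hierarchy=0 "*) echo "yes" ;;
        *) echo "no" ;;
    esac
}

# Inspect the running host, if the relevant paths exist.
if [ -d /sys/fs/cgroup ]; then
    fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo unknown)
    echo "cgroup version: $(cgroup_version_from_fstype "$fstype")"
fi
if [ -r /proc/cmdline ]; then
    echo "cmdline forces v1: $(cmdline_forces_v1 "$(cat /proc/cmdline)")"
fi
```

The parameter only takes effect after it is added to the kernel command line and the host is rebooted, so checking `/proc/cmdline` confirms that the running kernel actually picked it up.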
@kentwelcome Thanks.
and cgroup v1 in docker
The host version is |
I've actually heard that switching to cgroupv2 (i.e. flipping |
@klueska Thanks a lot. I'll give it a shot |
1. Issue or feature description
Yes, @klueska already described this in #1469, and I have tried all of those methods, including using nvidia-device-plugin-compat-with-cpumanager.yml. But the error is still there, so let me give more details.
Failed to initialize NVML: Unknown Error does not occur when the NVIDIA docker container is first created, nor within a few seconds (my Kubernetes config file uses the default nodeStatusUpdateFrequency, which is 10s). It happens after a couple of days (sometimes after a few hours).
2. What I Found
The docker container with the error:
The healthy docker container:
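Because the failure only shows up hours or days after the container starts, it can help to record when a given container first loses NVML. The following is a rough sketch, not part of the original report: it polls `docker exec <container> nvidia-smi` and prints a UTC timestamp the first time the output contains the error string. The helper names `is_nvml_failure` and `watch_container`, the container name, and the interval are all hypothetical.

```shell
#!/bin/sh
# Return success if the given output contains the NVML initialization error.
is_nvml_failure() {
    case "$1" in
        *"Failed to initialize NVML"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Poll a container until nvidia-smi reports the NVML error, then print
# a UTC timestamp and stop.
# $1: container name or ID (hypothetical), $2: poll interval in seconds.
watch_container() {
    while :; do
        out=$(docker exec "$1" nvidia-smi 2>&1)
        if is_nvml_failure "$out"; then
            echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) NVML failure in $1"
            return 0
        fi
        sleep "$2"
    done
}

# Example (not run automatically): watch_container my-gpu-container 10
```

The recorded timestamp can then be compared against host-side events around the same time, such as entries in dmesg or the systemd journal.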
3. Information to attach (optional if deemed irrelevant)
nvidia-container-cli -k -d /dev/tty info
uname -a
dmesg
nvidia-smi -a
docker version
Client: Docker Engine - Community
 Version:           20.10.5
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        55c4c88
 Built:             Tue Mar 2 20:18:05 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true
Server: Docker Engine - Community
 Engine:
  Version:          20.10.5
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       363e9a8
  Built:            Tue Mar 2 20:16:00 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 nvidia:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=====================================-=======================-=======================-===============================================================================
un libgldispatch0-nvidia (no description available)
ii libnvidia-container-tools 1.6.0~rc.2-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.6.0~rc.2-1 amd64 NVIDIA container runtime library
un nvidia-304 (no description available)
un nvidia-340 (no description available)
un nvidia-384 (no description available)
un nvidia-common (no description available)
ii nvidia-container-runtime 3.6.0~rc.1-1 amd64 NVIDIA container runtime
un nvidia-container-runtime-hook (no description available)
ii nvidia-container-toolkit 1.6.0~rc.2-1 amd64 NVIDIA container runtime hook
un nvidia-docker (no description available)
ii nvidia-docker2 2.7.0~rc.2-1 all nvidia-docker CLI wrapper
ii nvidia-prime 0.8.16~0.18.04.1 all Tools to enable NVIDIA's Prime
nvidia-container-cli -V
version: 1.6.0~rc.2
build date: 2021-11-05T14:19+00:00
build revision: badec1fa4a2c085aa9396f95b6bb1d69f1c7996b
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections