This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

"Failed to initialize NVML: Unknown Error" after random amount of time #1671

Closed
7 tasks done
iFede94 opened this issue Aug 31, 2022 · 79 comments


@iFede94

iFede94 commented Aug 31, 2022

1. Issue or feature description

After a random amount of time (hours or days), the GPUs become unavailable inside all running containers, and nvidia-smi returns "Failed to initialize NVML: Unknown Error".
Restarting all the containers fixes the issue, and the GPUs become available again.
Outside the containers, the GPUs keep working correctly.
I searched the open and closed issues but could not find a solution.
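
As a stopgap while the root cause is unknown, the restart workaround described above could be automated. The following is only a minimal sketch: the function names and the DOCKER override variable are hypothetical helpers introduced for illustration, and it assumes that nvidia-smi is on the PATH inside the container and that restarting the container is acceptable.

```shell
# Succeed (exit 0) when the given text contains the NVML initialization error.
nvml_failed() {
  printf '%s\n' "$1" | grep -q 'Failed to initialize NVML'
}

# Run nvidia-smi inside the named container; restart the container if NVML is broken.
# DOCKER is a hypothetical override so the function can be dry-run with a stub.
check_container() {
  out=$("${DOCKER:-docker}" exec "$1" nvidia-smi 2>&1)
  if nvml_failed "$out"; then
    echo "restarting $1"
    "${DOCKER:-docker}" restart "$1" >/dev/null
  fi
}
```

It could be called periodically, for example from cron, as `check_container <container-name>` for each GPU container. This only hides the symptom; it does not explain why the GPUs disappear.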

2. Steps to reproduce the issue

All the containers are run with docker run --gpus all -it tensorflow/tensorflow:latest-gpu /bin/bash

3. Information to attach

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --

I0831 10:36:45.129762 2174149 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I0831 10:36:45.129878 2174149 nvc.c:350] using root /
I0831 10:36:45.129892 2174149 nvc.c:351] using ldcache /etc/ld.so.cache
I0831 10:36:45.129906 2174149 nvc.c:352] using unprivileged user 1000:1000
I0831 10:36:45.129960 2174149 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0831 10:36:45.130411 2174149 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0831 10:36:45.132458 2174150 nvc.c:273] failed to set inheritable capabilities
W0831 10:36:45.132555 2174150 nvc.c:274] skipping kernel modules load due to failure
I0831 10:36:45.133242 2174151 rpc.c:71] starting driver rpc service
I0831 10:36:45.141625 2174152 rpc.c:71] starting nvcgo rpc service
I0831 10:36:45.144941 2174149 nvc_info.c:766] requesting driver information with ''
I0831 10:36:45.146226 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.515.48.07
I0831 10:36:45.146379 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.48.07
I0831 10:36:45.146563 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.48.07
I0831 10:36:45.146792 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.515.48.07
I0831 10:36:45.146986 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.515.48.07
I0831 10:36:45.147178 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.515.48.07
I0831 10:36:45.147375 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.48.07
I0831 10:36:45.147400 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.48.07
I0831 10:36:45.147598 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.48.07
I0831 10:36:45.147777 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.48.07
I0831 10:36:45.147986 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.48.07
I0831 10:36:45.148258 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.515.48.07
I0831 10:36:45.148506 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.515.48.07
I0831 10:36:45.148699 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.48.07
I0831 10:36:45.148915 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.515.48.07
I0831 10:36:45.148942 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.515.48.07
I0831 10:36:45.149219 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.515.48.07
I0831 10:36:45.149467 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.515.48.07
I0831 10:36:45.149591 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.515.48.07
I0831 10:36:45.149814 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.515.48.07
I0831 10:36:45.149996 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.515.48.07
I0831 10:36:45.150224 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.515.48.07
I0831 10:36:45.150437 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.515.48.07
I0831 10:36:45.150772 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.515.48.07
I0831 10:36:45.150978 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.515.48.07
I0831 10:36:45.151147 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.515.48.07
I0831 10:36:45.151335 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.515.48.07
I0831 10:36:45.151592 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.515.48.07
I0831 10:36:45.151786 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.515.48.07
I0831 10:36:45.151970 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.515.48.07
I0831 10:36:45.152225 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.515.48.07
I0831 10:36:45.152480 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.515.48.07
I0831 10:36:45.152791 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.515.48.07
I0831 10:36:45.152999 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.515.48.07
I0831 10:36:45.153254 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.515.48.07
I0831 10:36:45.153580 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.515.48.07
I0831 10:36:45.153853 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libcuda.so.515.48.07
I0831 10:36:45.154063 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.515.48.07
I0831 10:36:45.154259 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.515.48.07
I0831 10:36:45.154473 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.515.48.07
I0831 10:36:45.154696 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.515.48.07
W0831 10:36:45.154723 2174149 nvc_info.c:399] missing library libnvidia-nscq.so
W0831 10:36:45.154726 2174149 nvc_info.c:399] missing library libcudadebugger.so
W0831 10:36:45.154729 2174149 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W0831 10:36:45.154731 2174149 nvc_info.c:399] missing library libnvidia-pkcs11.so
W0831 10:36:45.154733 2174149 nvc_info.c:399] missing library libvdpau_nvidia.so
W0831 10:36:45.154735 2174149 nvc_info.c:399] missing library libnvidia-ifr.so
W0831 10:36:45.154737 2174149 nvc_info.c:399] missing library libnvidia-cbl.so
W0831 10:36:45.154739 2174149 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W0831 10:36:45.154741 2174149 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W0831 10:36:45.154743 2174149 nvc_info.c:403] missing compat32 library libcudadebugger.so
W0831 10:36:45.154746 2174149 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W0831 10:36:45.154748 2174149 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W0831 10:36:45.154750 2174149 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W0831 10:36:45.154752 2174149 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W0831 10:36:45.154754 2174149 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W0831 10:36:45.154756 2174149 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W0831 10:36:45.154758 2174149 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W0831 10:36:45.154760 2174149 nvc_info.c:403] missing compat32 library libnvoptix.so
W0831 10:36:45.154762 2174149 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0831 10:36:45.154919 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I0831 10:36:45.154945 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I0831 10:36:45.154954 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I0831 10:36:45.154970 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I0831 10:36:45.154980 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W0831 10:36:45.155027 2174149 nvc_info.c:425] missing binary nv-fabricmanager
I0831 10:36:45.155044 2174149 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/515.48.07/gsp.bin
I0831 10:36:45.155058 2174149 nvc_info.c:529] listing device /dev/nvidiactl
I0831 10:36:45.155061 2174149 nvc_info.c:529] listing device /dev/nvidia-uvm
I0831 10:36:45.155063 2174149 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I0831 10:36:45.155065 2174149 nvc_info.c:529] listing device /dev/nvidia-modeset
I0831 10:36:45.155080 2174149 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
W0831 10:36:45.155092 2174149 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W0831 10:36:45.155100 2174149 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I0831 10:36:45.155102 2174149 nvc_info.c:822] requesting device information with ''
I0831 10:36:45.161039 2174149 nvc_info.c:713] listing device /dev/nvidia0 (GPU-13fd0930-06c3-5975-8720-72c72ee7a823 at 00000000:01:00.0)
I0831 10:36:45.166471 2174149 nvc_info.c:713] listing device /dev/nvidia1 (GPU-a76d37d7-5ed0-58d9-6087-b18fee984570 at 00000000:02:00.0)
NVRM version:   515.48.07
CUDA version:   11.7

Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce RTX 2080 Ti
Brand:          GeForce
GPU UUID:       GPU-13fd0930-06c3-5975-8720-72c72ee7a823
Bus Location:   00000000:01:00.0
Architecture:   7.5

Device Index:   1
Device Minor:   1
Model:          NVIDIA GeForce RTX 2080 Ti
Brand:          GeForce
GPU UUID:       GPU-a76d37d7-5ed0-58d9-6087-b18fee984570
Bus Location:   00000000:02:00.0
Architecture:   7.5
I0831 10:36:45.166493 2174149 nvc.c:434] shutting down library context
I0831 10:36:45.166540 2174152 rpc.c:95] terminating nvcgo rpc service
I0831 10:36:45.166751 2174149 rpc.c:135] nvcgo rpc service terminated successfully
I0831 10:36:45.167790 2174151 rpc.c:95] terminating driver rpc service
I0831 10:36:45.167907 2174149 rpc.c:135] driver rpc service terminated successfully
  • Kernel version from uname -a
Linux wds-co-ml 5.15.0-43-generic #46-Ubuntu SMP Tue Jul 12 10:30:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Driver information from nvidia-smi -a
==============NVSMI LOG==============

Timestamp                                 : Wed Aug 31 12:42:55 2022
Driver Version                            : 515.48.07
CUDA Version                              : 11.7

Attached GPUs                             : 2
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce RTX 2080 Ti
    Product Brand                         : GeForce
    Product Architecture                  : Turing
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-13fd0930-06c3-5975-8720-72c72ee7a823
    Minor Number                          : 0
    VBIOS Version                         : 90.02.0B.00.C7
    MultiGPU Board                        : No
    Board ID                              : 0x100
    GPU Part Number                       : N/A
    Module ID                             : 0
    Inforom Version
        Image Version                     : G001.0000.02.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1E0710DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x150319DA
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 0 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 11264 MiB
        Reserved                          : 244 MiB
        Used                              : 1 MiB
        Free                              : 11018 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 3 MiB
        Free                              : 253 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 30 C
        GPU Shutdown Temp                 : 94 C
        GPU Slowdown Temp                 : 91 C
        GPU Max Operating Temp            : 89 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 20.87 W
        Power Limit                       : 260.00 W
        Default Power Limit               : 260.00 W
        Enforced Power Limit              : 260.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 300 MHz
        SM                                : 300 MHz
        Memory                            : 405 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2160 MHz
        SM                                : 2160 MHz
        Memory                            : 7000 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Processes                             : None

GPU 00000000:02:00.0
    Product Name                          : NVIDIA GeForce RTX 2080 Ti
    Product Brand                         : GeForce
    Product Architecture                  : Turing
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-a76d37d7-5ed0-58d9-6087-b18fee984570
    Minor Number                          : 1
    VBIOS Version                         : 90.02.17.00.58
    MultiGPU Board                        : No
    Board ID                              : 0x200
    GPU Part Number                       : N/A
    Module ID                             : 0
    Inforom Version
        Image Version                     : G001.0000.02.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x02
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1E0710DE
        Bus Id                            : 00000000:02:00.0
        Sub System Id                     : 0x150319DA
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 35 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 11264 MiB
        Reserved                          : 244 MiB
        Used                              : 1 MiB
        Free                              : 11018 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 27 MiB
        Free                              : 229 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 28 C
        GPU Shutdown Temp                 : 94 C
        GPU Slowdown Temp                 : 91 C
        GPU Max Operating Temp            : 89 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 6.66 W
        Power Limit                       : 260.00 W
        Default Power Limit               : 260.00 W
        Enforced Power Limit              : 260.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 300 MHz
        SM                                : 300 MHz
        Memory                            : 405 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2160 MHz
        SM                                : 2160 MHz
        Memory                            : 7000 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Processes                             : None
  • Docker version from docker version
Client: Docker Engine - Community
 Version:           20.10.17
 API version:       1.41
 Go version:        go1.17.11
 Git commit:        100c701
 Built:             Mon Jun  6 23:02:46 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.17
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.11
  Git commit:       a89b842
  Built:            Mon Jun  6 23:00:51 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.6
  GitCommit:        10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
 runc:
  Version:          1.1.2
  GitCommit:        v1.1.2-0-ga916309
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
ii  libnvidia-cfg1-515:amd64                   515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-515                       515.48.07-0ubuntu0.22.04.2 all          Shared files used by the NVIDIA libraries
ii  libnvidia-compute-515:amd64                515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA libcompute package
ii  libnvidia-compute-515:i386                 515.48.07-0ubuntu0.22.04.2 i386         NVIDIA libcompute package
ii  libnvidia-container-tools                  1.10.0-1                   amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                 1.10.0-1                   amd64        NVIDIA container runtime library
ii  libnvidia-decode-515:amd64                 515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-decode-515:i386                  515.48.07-0ubuntu0.22.04.2 i386         NVIDIA Video Decoding runtime libraries
ii  libnvidia-egl-wayland1:amd64               1:1.1.9-1.1                amd64        Wayland EGL External Platform library -- shared library
ii  libnvidia-encode-515:amd64                 515.48.07-0ubuntu0.22.04.2 amd64        NVENC Video Encoding runtime library
ii  libnvidia-encode-515:i386                  515.48.07-0ubuntu0.22.04.2 i386         NVENC Video Encoding runtime library
ii  libnvidia-extra-515:amd64                  515.48.07-0ubuntu0.22.04.2 amd64        Extra libraries for the NVIDIA driver
ii  libnvidia-fbc1-515:amd64                   515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-fbc1-515:i386                    515.48.07-0ubuntu0.22.04.2 i386         NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-515:amd64                     515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-gl-515:i386                      515.48.07-0ubuntu0.22.04.2 i386         NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  linux-modules-nvidia-515-5.15.0-43-generic 5.15.0-43.46               amd64        Linux kernel nvidia modules for version 5.15.0-43
ii  linux-modules-nvidia-515-generic-hwe-22.04 5.15.0-43.46               amd64        Extra drivers for nvidia-515 for the generic-hwe-22.04 flavour
ii  linux-objects-nvidia-515-5.15.0-43-generic 5.15.0-43.46               amd64        Linux kernel nvidia modules for version 5.15.0-43 (objects)
ii  linux-signatures-nvidia-5.15.0-43-generic  5.15.0-43.46               amd64        Linux kernel signatures for nvidia modules for version 5.15.0-43-generic
ii  nvidia-compute-utils-515                   515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA compute utilities
ii  nvidia-container-toolkit                   1.10.0-1                   amd64        NVIDIA container runtime hook
ii  nvidia-docker2                             2.11.0-1                   all          nvidia-docker CLI wrapper
ii  nvidia-driver-515                          515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA driver metapackage
ii  nvidia-kernel-common-515                   515.48.07-0ubuntu0.22.04.2 amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-515                   515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA kernel source package
ii  nvidia-prime                               0.8.17.1                   all          Tools to enable NVIDIA's Prime
ii  nvidia-settings                            510.47.03-0ubuntu1         amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-utils-515                           515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA driver support binaries
ii  xserver-xorg-video-nvidia-515              515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA binary Xorg driver
  • NVIDIA container library version from nvidia-container-cli -V
cli-version: 1.10.0
lib-version: 1.10.0
build date: 2022-06-13T10:39+00:00
build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
  • Docker command, image and tag used
docker run --gpus all -it tensorflow/tensorflow:latest-gpu /bin/bash
@elezar
Member

elezar commented Sep 2, 2022

The nvidia-smi output shows persistence mode as disabled. Does the behaviour still occur when it is enabled?
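
For anyone who wants to test this, here is a rough sketch of how the persistence mode could be checked on the host. The function names and the NVIDIA_SMI override variable are hypothetical; the query itself uses nvidia-smi's standard --query-gpu and --format options.

```shell
# Print the persistence mode of each GPU, one line per GPU ("Enabled" or "Disabled").
# NVIDIA_SMI is a hypothetical override so the function can be exercised with a stub.
persistence_modes() {
  "${NVIDIA_SMI:-nvidia-smi}" --query-gpu=persistence_mode --format=csv,noheader
}

# Succeed only if the query worked and every GPU reports "Enabled".
all_persistent() {
  modes=$(persistence_modes) || return 1
  [ -n "$modes" ] || return 1
  ! printf '%s\n' "$modes" | grep -qv 'Enabled'
}
```

If `all_persistent` fails, persistence mode can be turned on with `sudo nvidia-smi -pm 1` before repeating the experiment. Whether this changes the behaviour is exactly the open question here.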

@kevin-bockman

Hey, I have the same problem.

2. Steps to reproduce the issue

docker run --gpus all --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
root@098b49afe624:/# nvidia-smi 
Fri Sep  2 21:54:31 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.68.02    Driver Version: 510.68.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+

This works until systemctl daemon-reload runs, either manually or triggered automatically by the OS (I assume the latter happens, since the container eventually fails on its own).

(on host):
systemctl daemon-reload

(inside same running container):

root@098b49afe624:/# nvidia-smi 
Failed to initialize NVML: Unknown Error

Running the container again will work fine until you do another systemctl daemon-reload.
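
The steps above can be collected into one function for easier reproduction. This is a sketch under assumptions: the function name and the DOCKER and SYSTEMCTL override variables are hypothetical, the container is kept alive with `sleep infinity` instead of an interactive shell, and systemctl daemon-reload normally needs root on the host.

```shell
# Returns 0 if nvidia-smi works before the daemon-reload and fails with the
# NVML error afterwards, 1 if the error does not appear, 2 on setup problems.
# DOCKER and SYSTEMCTL are hypothetical overrides for dry runs with stubs.
repro_daemon_reload() {
  image=${1:-nvidia/cuda:11.4.2-base-ubuntu18.04}
  cid=$("${DOCKER:-docker}" run -d --gpus all "$image" sleep infinity) || return 2
  before=$("${DOCKER:-docker}" exec "$cid" nvidia-smi 2>&1)
  "${SYSTEMCTL:-systemctl}" daemon-reload
  after=$("${DOCKER:-docker}" exec "$cid" nvidia-smi 2>&1)
  "${DOCKER:-docker}" rm -f "$cid" >/dev/null
  # The GPU must be usable before the reload for the test to mean anything.
  case $before in *'Failed to initialize NVML'*) return 2 ;; esac
  case $after in *'Failed to initialize NVML'*) return 0 ;; esac
  return 1
}
```

Running `repro_daemon_reload` as root and checking `$?` gives a yes/no answer for a given host configuration.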

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
I0902 21:40:53.603015 2836338 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I0902 21:40:53.603083 2836338 nvc.c:350] using root /                                                                  
I0902 21:40:53.603093 2836338 nvc.c:351] using ldcache /etc/ld.so.cache                
I0902 21:40:53.603100 2836338 nvc.c:352] using unprivileged user 1000:1000                
I0902 21:40:53.603133 2836338 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0902 21:40:53.603287 2836338 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0902 21:40:53.607634 2836339 nvc.c:273] failed to set inheritable capabilities        
W0902 21:40:53.607692 2836339 nvc.c:274] skipping kernel modules load due to failure
I0902 21:40:53.608141 2836340 rpc.c:71] starting driver rpc service              
I0902 21:40:53.620107 2836341 rpc.c:71] starting nvcgo rpc service                  
I0902 21:40:53.621514 2836338 nvc_info.c:766] requesting driver information with ''     
I0902 21:40:53.623204 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.510.68.02
I0902 21:40:53.623384 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.510.68.02
I0902 21:40:53.623470 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.510.68.02  
I0902 21:40:53.623534 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.510.68.02
I0902 21:40:53.623599 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02 
I0902 21:40:53.623686 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.510.68.02
I0902 21:40:53.623774 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02
I0902 21:40:53.623838 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.510.68.02
I0902 21:40:53.623900 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02
I0902 21:40:53.623987 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.510.68.02
I0902 21:40:53.624046 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.510.68.02
I0902 21:40:53.624105 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.510.68.02                                                                                                                               
I0902 21:40:53.624167 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.510.68.02                                                                                                                                  
I0902 21:40:53.624270 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.510.68.02                                                                                                                               
I0902 21:40:53.624362 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.510.68.02                                                                                                                              
I0902 21:40:53.624430 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02                                                                                                                             
I0902 21:40:53.624507 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02                                                                                                                                  
I0902 21:40:53.624590 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02                                                                                                                            
I0902 21:40:53.624684 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.510.68.02                                                                                                                                     
I0902 21:40:53.624959 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02
I0902 21:40:53.625088 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.510.68.02
I0902 21:40:53.625151 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.510.68.02        
I0902 21:40:53.625213 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.510.68.02     
I0902 21:40:53.625277 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.510.68.02           
W0902 21:40:53.625310 2836338 nvc_info.c:399] missing library libnvidia-nscq.so                                        
W0902 21:40:53.625322 2836338 nvc_info.c:399] missing library libcudadebugger.so                                       
W0902 21:40:53.625330 2836338 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W0902 21:40:53.625340 2836338 nvc_info.c:399] missing library libnvidia-pkcs11.so                                      
W0902 21:40:53.625349 2836338 nvc_info.c:399] missing library libnvidia-ifr.so                                         
W0902 21:40:53.625359 2836338 nvc_info.c:399] missing library libnvidia-cbl.so                                         
W0902 21:40:53.625368 2836338 nvc_info.c:403] missing compat32 library libnvidia-ml.so                                 
W0902 21:40:53.625376 2836338 nvc_info.c:403] missing compat32 library libnvidia-cfg.so                                
W0902 21:40:53.625386 2836338 nvc_info.c:403] missing compat32 library libnvidia-nscq.so                               
W0902 21:40:53.625394 2836338 nvc_info.c:403] missing compat32 library libcuda.so                                      
W0902 21:40:53.625404 2836338 nvc_info.c:403] missing compat32 library libcudadebugger.so                              
W0902 21:40:53.625413 2836338 nvc_info.c:403] missing compat32 library libnvidia-opencl.so                             
W0902 21:40:53.625422 2836338 nvc_info.c:403] missing compat32 library libnvidia-ptxjitcompiler.so                     
W0902 21:40:53.625432 2836338 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so                    
W0902 21:40:53.625441 2836338 nvc_info.c:403] missing compat32 library libnvidia-allocator.so                          
W0902 21:40:53.625450 2836338 nvc_info.c:403] missing compat32 library libnvidia-compiler.so                           
W0902 21:40:53.625459 2836338 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so                             
W0902 21:40:53.625468 2836338 nvc_info.c:403] missing compat32 library libnvidia-ngx.so                                
W0902 21:40:53.625477 2836338 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W0902 21:40:53.625486 2836338 nvc_info.c:403] missing compat32 library libnvidia-encode.so
W0902 21:40:53.625495 2836338 nvc_info.c:403] missing compat32 library libnvidia-opticalflow.so
W0902 21:40:53.625505 2836338 nvc_info.c:403] missing compat32 library libnvcuvid.so
W0902 21:40:53.625514 2836338 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so
W0902 21:40:53.625523 2836338 nvc_info.c:403] missing compat32 library libnvidia-glcore.so                             
W0902 21:40:53.625532 2836338 nvc_info.c:403] missing compat32 library libnvidia-tls.so                  
W0902 21:40:53.625541 2836338 nvc_info.c:403] missing compat32 library libnvidia-glsi.so                               
W0902 21:40:53.625551 2836338 nvc_info.c:403] missing compat32 library libnvidia-fbc.so                                
W0902 21:40:53.625561 2836338 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W0902 21:40:53.625570 2836338 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so                             
W0902 21:40:53.625579 2836338 nvc_info.c:403] missing compat32 library libnvoptix.so                                                                                                                                                          
W0902 21:40:53.625588 2836338 nvc_info.c:403] missing compat32 library libGLX_nvidia.so                                
W0902 21:40:53.625598 2836338 nvc_info.c:403] missing compat32 library libEGL_nvidia.so
W0902 21:40:53.625607 2836338 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so
W0902 21:40:53.625616 2836338 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so                                                                                                                                                 
W0902 21:40:53.625625 2836338 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so                   
W0902 21:40:53.625631 2836338 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0902 21:40:53.626022 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-smi         
I0902 21:40:53.626055 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I0902 21:40:53.626088 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I0902 21:40:53.626139 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I0902 21:40:53.626172 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server                             
W0902 21:40:53.626281 2836338 nvc_info.c:425] missing binary nv-fabricmanager                            
I0902 21:40:53.626333 2836338 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/510.68.02/gsp.bin
I0902 21:40:53.626375 2836338 nvc_info.c:529] listing device /dev/nvidiactl                                    
I0902 21:40:53.626385 2836338 nvc_info.c:529] listing device /dev/nvidia-uvm                                                                                                                                                                  
I0902 21:40:53.626395 2836338 nvc_info.c:529] listing device /dev/nvidia-uvm-tools                                  
I0902 21:40:53.626404 2836338 nvc_info.c:529] listing device /dev/nvidia-modeset                               
W0902 21:40:53.626447 2836338 nvc_info.c:349] missing ipc path /var/run/nvidia-persistenced/socket          
W0902 21:40:53.626483 2836338 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket        
W0902 21:40:53.626510 2836338 nvc_info.c:349] missing ipc path /tmp/nvidia-mps                                    
I0902 21:40:53.626521 2836338 nvc_info.c:822] requesting device information with ''                          
I0902 21:40:53.633742 2836338 nvc_info.c:713] listing device /dev/nvidia0 (GPU-9c416c82-d801-d28f-0867-dd438d4be914 at 00000000:04:00.0)                                                                                                      
I0902 21:40:53.640730 2836338 nvc_info.c:713] listing device /dev/nvidia1 (GPU-32a56b8c-943e-03e7-d539-3e97e5ef5f7a at 00000000:05:00.0)                                                                                                      
I0902 21:40:53.647954 2836338 nvc_info.c:713] listing device /dev/nvidia2 (GPU-a0e33485-87cd-ceb1-2702-2c58a64a9dbe at 00000000:08:00.0)                                                                                                      
I0902 21:40:53.655371 2836338 nvc_info.c:713] listing device /dev/nvidia3 (GPU-1ab2485c-121c-77db-6719-0b616d1673f4 at 00000000:09:00.0)                                                                                                      
I0902 21:40:53.663009 2836338 nvc_info.c:713] listing device /dev/nvidia4 (GPU-e7e3d7b6-ddce-355a-7988-80c4ba18319c at 00000000:0b:00.0)                                                                                                      
I0902 21:40:53.670891 2836338 nvc_info.c:713] listing device /dev/nvidia5 (GPU-c16444fb-bedb-106d-c188-1f330773cf39 at 00000000:84:00.0)                                                                                                      
I0902 21:40:53.679015 2836338 nvc_info.c:713] listing device /dev/nvidia6 (GPU-2545ac9e-3ff1-8b38-8ad6-b8c82fea6cd0 at 00000000:85:00.0)                                                                                                      
I0902 21:40:53.687078 2836338 nvc_info.c:713] listing device /dev/nvidia7 (GPU-fcc35ab7-1afd-e678-b5f0-d1e1f8842d28 at 00000000:89:00.0)                                                                                                      
NVRM version:   510.68.02                                                                                              
CUDA version:   11.6                                                                                                   
                                                                                                                      
Device Index:   0                                                                                                      
Device Minor:   0                                                                                                      
Model:          NVIDIA TITAN X (Pascal)                                                                                
Brand:          TITAN                                                                                                  
GPU UUID:       GPU-9c416c82-d801-d28f-0867-dd438d4be914                                                               
Bus Location:   00000000:04:00.0                                                                                       
Architecture:   6.1                                                                                                    
                                                                                                                      
Device Index:   1                                                                                                      
Device Minor:   1                                                                                                      
Model:          NVIDIA TITAN X (Pascal)                                                                                
Brand:          TITAN                                                                                                  
GPU UUID:       GPU-32a56b8c-943e-03e7-d539-3e97e5ef5f7a                                                               
Bus Location:   00000000:05:00.0                                                                                       
Architecture:   6.1                                                                                                    
                                                                                                                      
Device Index:   2                                                                                                      
Device Minor:   2                                                                                                      
Model:          NVIDIA TITAN X (Pascal)                                                                                
Brand:          TITAN                                                                                                  
GPU UUID:       GPU-a0e33485-87cd-ceb1-2702-2c58a64a9dbe                                                               
Bus Location:   00000000:08:00.0                                                                                       
Architecture:   6.1                                                                                                    
                                                                                                                      
Device Index:   3                                                                                                      
Device Minor:   3                                                                                                      
Model:          NVIDIA TITAN X (Pascal)                                                                                
Brand:          TITAN                                                                                                  
GPU UUID:       GPU-1ab2485c-121c-77db-6719-0b616d1673f4                                                               
Bus Location:   00000000:09:00.0                                                                                       
Architecture:   6.1                                                                                                    
                                                                                                                                                                                                                                             
Device Index:   4                                                                                                      
Device Minor:   4                                                                                                      
Model:          NVIDIA TITAN X (Pascal)                                                                                
Brand:          TITAN                                                                                                                                                                                                                         
GPU UUID:       GPU-e7e3d7b6-ddce-355a-7988-80c4ba18319c                                                               
Bus Location:   00000000:0b:00.0                                                                                       
Architecture:   6.1                                                                                                    
                                                                                                                      
Device Index:   5                                                                                                      
Device Minor:   5                                                                                                      
Model:          NVIDIA TITAN X (Pascal)                                                                                
Brand:          TITAN                                                                                                  
GPU UUID:       GPU-c16444fb-bedb-106d-c188-1f330773cf39                                                               
Bus Location:   00000000:84:00.0                                                                                       
Architecture:   6.1                                                                                                                                                                                                                           
                                                                                                                      
Device Index:   6                                                                                                      
Device Minor:   6                                                                                                      
Model:          NVIDIA TITAN X (Pascal)                                                                                
Brand:          TITAN                                                                                                  
GPU UUID:       GPU-2545ac9e-3ff1-8b38-8ad6-b8c82fea6cd0                                                               
Bus Location:   00000000:85:00.0                                                                                                                                                                                                              
Architecture:   6.1                                                                                                                                                                                                                           
                                                                                                                                                                                                                                             
Device Index:   7                                                                                                                                                                                                                             
Device Minor:   7                                                                                                                                                                                                                             
Model:          NVIDIA TITAN X (Pascal)                                                                                                                                                                                                       
Brand:          TITAN                                                                                                                                                                                                                         
GPU UUID:       GPU-fcc35ab7-1afd-e678-b5f0-d1e1f8842d28                                                                                                                                                                                      
Bus Location:   00000000:89:00.0                                                                                       
Architecture:   6.1                                                                                                    
I0902 21:40:53.687293 2836338 nvc.c:434] shutting down library context                                                 
I0902 21:40:53.687347 2836341 rpc.c:95] terminating nvcgo rpc service                                                  
I0902 21:40:53.687881 2836338 rpc.c:135] nvcgo rpc service terminated successfully                                     
I0902 21:40:53.692819 2836340 rpc.c:95] terminating driver rpc service                                                 
I0902 21:40:53.693046 2836338 rpc.c:135] driver rpc service terminated successfully                                                                                                                    
  • Kernel version from uname -a
    Linux node5-4 5.15.0-46-generic #49-Ubuntu SMP Thu Aug 4 18:03:25 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

  • Any relevant kernel output lines from dmesg
    Nothing relevant in dmesg. The only relevant journalctl entry appears when I run systemctl daemon-reload:
    Sep 02 21:17:56 node5-4 systemd[1]: Reloading.

  • Driver information from nvidia-smi -a

Fri Sep  2 21:22:32 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.68.02    Driver Version: 510.68.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN X ...  On   | 00000000:04:00.0 Off |                  N/A |
| 23%   23C    P8     8W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN X ...  On   | 00000000:05:00.0 Off |                  N/A |
| 23%   26C    P8     9W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA TITAN X ...  On   | 00000000:08:00.0 Off |                  N/A |
| 23%   22C    P8     7W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA TITAN X ...  On   | 00000000:09:00.0 Off |                  N/A |
| 23%   24C    P8     8W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA TITAN X ...  On   | 00000000:0B:00.0 Off |                  N/A |
| 23%   26C    P8     9W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA TITAN X ...  On   | 00000000:84:00.0 Off |                  N/A |
| 23%   25C    P8     8W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA TITAN X ...  On   | 00000000:85:00.0 Off |                  N/A |
| 23%   22C    P8     8W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA TITAN X ...  On   | 00000000:89:00.0 Off |                  N/A |
| 23%   23C    P8     7W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  • Docker version from docker version
Client: Docker Engine - Community
 Version:           20.10.17
 API version:       1.41
 Go version:        go1.17.11
 Git commit:        100c701
 Built:             Mon Jun  6 23:02:46 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.17
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.11
  Git commit:       a89b842
  Built:            Mon Jun  6 23:00:51 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.4
  GitCommit:        212e8b6fa2f44b9c21b2798135fc6fb7c53efc16
 runc:
  Version:          1.1.1
  GitCommit:        v1.1.1-0-g52de29d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                          Version      Architecture Description
+++-=============================-============-============-=====================================================
ii  libnvidia-container-tools     1.10.0-1     amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64    1.10.0-1     amd64        NVIDIA container runtime library
ii  nvidia-container-runtime      3.10.0-1     all          NVIDIA container runtime
un  nvidia-container-runtime-hook <none>       <none>       (no description available)
ii  nvidia-container-toolkit      1.10.0-1     amd64        NVIDIA container runtime hook
  • NVIDIA container library version from nvidia-container-cli -V
cli-version: 1.10.0
lib-version: 1.10.0
build date: 2022-06-13T10:39+00:00
build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
I0902 22:11:39.880399 2840718 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)                                                                                                        I0902 22:11:39.880483 2840718 nvc.c:350] using root /                                                                                                                                                                                         I0902 22:11:39.880501 2840718 nvc.c:351] using ldcache /etc/ld.so.cache                   
I0902 22:11:39.880514 2840718 nvc.c:352] using unprivileged user 65534:65534                                                                                                                                                                  
I0902 22:11:39.880559 2840718 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)                                                                                                           I0902 22:11:39.880751 2840718 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment                                                                                                                              I0902 22:11:39.884769 2840724 nvc.c:278] loading kernel module nvidia                  
I0902 22:11:39.884931 2840724 nvc.c:282] running mknod for /dev/nvidiactl                                                                                                                                                                     
I0902 22:11:39.884991 2840724 nvc.c:286] running mknod for /dev/nvidia0                 
I0902 22:11:39.885033 2840724 nvc.c:286] running mknod for /dev/nvidia1                                                                                                                                                                       
I0902 22:11:39.885071 2840724 nvc.c:286] running mknod for /dev/nvidia2                   
I0902 22:11:39.885109 2840724 nvc.c:286] running mknod for /dev/nvidia3                                                                                                                                                                       
I0902 22:11:39.885147 2840724 nvc.c:286] running mknod for /dev/nvidia4                            
I0902 22:11:39.885185 2840724 nvc.c:286] running mknod for /dev/nvidia5                                                                                                                                                                       
I0902 22:11:39.885222 2840724 nvc.c:286] running mknod for /dev/nvidia6                      
I0902 22:11:39.885260 2840724 nvc.c:286] running mknod for /dev/nvidia7                                                                                                                                                                       
I0902 22:11:39.885298 2840724 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0902 22:11:39.892775 2840724 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0902 22:11:39.892935 2840724 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0902 22:11:39.899624 2840724 nvc.c:296] loading kernel module nvidia_uvm
I0902 22:11:39.899673 2840724 nvc.c:300] running mknod for /dev/nvidia-uvm
I0902 22:11:39.899778 2840724 nvc.c:305] loading kernel module nvidia_modeset
I0902 22:11:39.899820 2840724 nvc.c:309] running mknod for /dev/nvidia-modeset
I0902 22:11:39.900186 2840725 rpc.c:71] starting driver rpc service
I0902 22:11:39.911718 2840726 rpc.c:71] starting nvcgo rpc service
I0902 22:11:39.912892 2840718 nvc_container.c:240] configuring container with 'compute utility supervised'
I0902 22:11:39.913283 2840718 nvc_container.c:88] selecting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libcuda.so.470.129.06
I0902 22:11:39.913368 2840718 nvc_container.c:88] selecting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libnvidia-ptxjitcompiler.so.470.129.06
I0902 22:11:39.915116 2840718 nvc_container.c:262] setting pid to 2840712
I0902 22:11:39.915147 2840718 nvc_container.c:263] setting rootfs to /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged
I0902 22:11:39.915160 2840718 nvc_container.c:264] setting owner to 0:0
I0902 22:11:39.915171 2840718 nvc_container.c:265] setting bins directory to /usr/bin
I0902 22:11:39.915182 2840718 nvc_container.c:266] setting libs directory to /usr/lib/x86_64-linux-gnu
I0902 22:11:39.915193 2840718 nvc_container.c:267] setting libs32 directory to /usr/lib/i386-linux-gnu
I0902 22:11:39.915204 2840718 nvc_container.c:268] setting cudart directory to /usr/local/cuda
I0902 22:11:39.915215 2840718 nvc_container.c:269] setting ldconfig to @/sbin/ldconfig.real (host relative)
I0902 22:11:39.915228 2840718 nvc_container.c:270] setting mount namespace to /proc/2840712/ns/mnt
I0902 22:11:39.915240 2840718 nvc_container.c:272] detected cgroupv2
I0902 22:11:39.915271 2840718 nvc_container.c:273] setting devices cgroup to /sys/fs/cgroup/system.slice/docker-5fff6f80850791d3858cb511015581375d55ae42df5eb98262ceae31ed47a7d5.scope
I0902 22:11:39.915292 2840718 nvc_info.c:766] requesting driver information with ''
I0902 22:11:39.916901 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.510.68.02
I0902 22:11:39.917076 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.510.68.02
I0902 22:11:39.917165 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.510.68.02
I0902 22:11:39.917236 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.510.68.02
I0902 22:11:39.917318 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02                                                                                                                       
I0902 22:11:39.917411 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.510.68.02
I0902 22:11:39.917503 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02
I0902 22:11:39.917574 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.510.68.02
I0902 22:11:39.917639 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02
I0902 22:11:39.917730 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.510.68.02
I0902 22:11:39.917794 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.510.68.02                                                                                                                                 
I0902 22:11:39.917859 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.510.68.02                                                                                                                               
I0902 22:11:39.917926 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.510.68.02
I0902 22:11:39.918018 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.510.68.02
I0902 22:11:39.918109 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.510.68.02
I0902 22:11:39.918176 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02                                                                                                                             
I0902 22:11:39.918243 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02                                                                                                                                  
I0902 22:11:39.918335 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02                                                                                                                            
I0902 22:11:39.918429 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.510.68.02                                                                                                               
I0902 22:11:39.918628 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02                                                                                                                  
I0902 22:11:39.918758 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.510.68.02                                                                                                            
I0902 22:11:39.918827 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.510.68.02                                                                                                          
I0902 22:11:39.918896 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.510.68.02                                                                                                              
I0902 22:11:39.918968 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.510.68.02
W0902 22:11:39.919005 2840718 nvc_info.c:399] missing library libnvidia-nscq.so
W0902 22:11:39.919022 2840718 nvc_info.c:399] missing library libcudadebugger.so
W0902 22:11:39.919035 2840718 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W0902 22:11:39.919049 2840718 nvc_info.c:399] missing library libnvidia-pkcs11.so
W0902 22:11:39.919061 2840718 nvc_info.c:399] missing library libnvidia-ifr.so
W0902 22:11:39.919074 2840718 nvc_info.c:399] missing library libnvidia-cbl.so
W0902 22:11:39.919088 2840718 nvc_info.c:403] missing compat32 library libnvidia-ml.so
W0902 22:11:39.919107 2840718 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W0902 22:11:39.919119 2840718 nvc_info.c:403] missing compat32 library libnvidia-nscq.so                                                                                                                                                      
W0902 22:11:39.919131 2840718 nvc_info.c:403] missing compat32 library libcuda.so       
W0902 22:11:39.919144 2840718 nvc_info.c:403] missing compat32 library libcudadebugger.so                                                                                                                                                     
W0902 22:11:39.919156 2840718 nvc_info.c:403] missing compat32 library libnvidia-opencl.so
W0902 22:11:39.919168 2840718 nvc_info.c:403] missing compat32 library libnvidia-ptxjitcompiler.so                                                                                                                                            
W0902 22:11:39.919192 2840718 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W0902 22:11:39.919206 2840718 nvc_info.c:403] missing compat32 library libnvidia-allocator.so                                                                                                                                                 
W0902 22:11:39.919218 2840718 nvc_info.c:403] missing compat32 library libnvidia-compiler.so 
W0902 22:11:39.919230 2840718 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so                                                                                                                                                    
W0902 22:11:39.919242 2840718 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W0902 22:11:39.919254 2840718 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W0902 22:11:39.919266 2840718 nvc_info.c:403] missing compat32 library libnvidia-encode.so
W0902 22:11:39.919279 2840718 nvc_info.c:403] missing compat32 library libnvidia-opticalflow.so
W0902 22:11:39.919291 2840718 nvc_info.c:403] missing compat32 library libnvcuvid.so
W0902 22:11:39.919304 2840718 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so
W0902 22:11:39.919317 2840718 nvc_info.c:403] missing compat32 library libnvidia-glcore.so
W0902 22:11:39.919329 2840718 nvc_info.c:403] missing compat32 library libnvidia-tls.so
W0902 22:11:39.919341 2840718 nvc_info.c:403] missing compat32 library libnvidia-glsi.so
W0902 22:11:39.919353 2840718 nvc_info.c:403] missing compat32 library libnvidia-fbc.so
W0902 22:11:39.919365 2840718 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W0902 22:11:39.919377 2840718 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W0902 22:11:39.919388 2840718 nvc_info.c:403] missing compat32 library libnvoptix.so
W0902 22:11:39.919401 2840718 nvc_info.c:403] missing compat32 library libGLX_nvidia.so
W0902 22:11:39.919413 2840718 nvc_info.c:403] missing compat32 library libEGL_nvidia.so
W0902 22:11:39.919426 2840718 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so
W0902 22:11:39.919438 2840718 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so
W0902 22:11:39.919451 2840718 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so
W0902 22:11:39.919463 2840718 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0902 22:11:39.919856 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I0902 22:11:39.919895 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I0902 22:11:39.919931 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I0902 22:11:39.919985 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I0902 22:11:39.920022 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W0902 22:11:39.920096 2840718 nvc_info.c:425] missing binary nv-fabricmanager
I0902 22:11:39.920152 2840718 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/510.68.02/gsp.bin
I0902 22:11:39.920200 2840718 nvc_info.c:529] listing device /dev/nvidiactl
I0902 22:11:39.920215 2840718 nvc_info.c:529] listing device /dev/nvidia-uvm                                   
I0902 22:11:39.920228 2840718 nvc_info.c:529] listing device /dev/nvidia-uvm-tools                                                                                                                                                            
I0902 22:11:39.920240 2840718 nvc_info.c:529] listing device /dev/nvidia-modeset                                    
W0902 22:11:39.920281 2840718 nvc_info.c:349] missing ipc path /var/run/nvidia-persistenced/socket             
W0902 22:11:39.920324 2840718 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket         
W0902 22:11:39.920355 2840718 nvc_info.c:349] missing ipc path /tmp/nvidia-mps                             
I0902 22:11:39.920371 2840718 nvc_info.c:822] requesting device information with ''                               
I0902 22:11:39.927586 2840718 nvc_info.c:713] listing device /dev/nvidia0 (GPU-9c416c82-d801-d28f-0867-dd438d4be914 at 00000000:04:00.0)                                                                                                      
I0902 22:11:39.934626 2840718 nvc_info.c:713] listing device /dev/nvidia1 (GPU-32a56b8c-943e-03e7-d539-3e97e5ef5f7a at 00000000:05:00.0)                                                                                                      
I0902 22:11:39.941796 2840718 nvc_info.c:713] listing device /dev/nvidia2 (GPU-a0e33485-87cd-ceb1-2702-2c58a64a9dbe at 00000000:08:00.0)
I0902 22:11:39.949011 2840718 nvc_info.c:713] listing device /dev/nvidia3 (GPU-1ab2485c-121c-77db-6719-0b616d1673f4 at 00000000:09:00.0)
I0902 22:11:39.956304 2840718 nvc_info.c:713] listing device /dev/nvidia4 (GPU-e7e3d7b6-ddce-355a-7988-80c4ba18319c at 00000000:0b:00.0)
I0902 22:11:39.963862 2840718 nvc_info.c:713] listing device /dev/nvidia5 (GPU-c16444fb-bedb-106d-c188-1f330773cf39 at 00000000:84:00.0)                                                                                                      
I0902 22:11:39.971543 2840718 nvc_info.c:713] listing device /dev/nvidia6 (GPU-2545ac9e-3ff1-8b38-8ad6-b8c82fea6cd0 at 00000000:85:00.0)                                                                                                      
I0902 22:11:39.979406 2840718 nvc_info.c:713] listing device /dev/nvidia7 (GPU-fcc35ab7-1afd-e678-b5f0-d1e1f8842d28 at 00000000:89:00.0)                                                                                                      
I0902 22:11:39.979522 2840718 nvc_mount.c:366] mounting tmpfs at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia                                    
I0902 22:11:39.980084 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-smi at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-smi                      
I0902 22:11:39.980181 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-debugdump at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-debugdump          
I0902 22:11:39.980273 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-persistenced at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-persistenced     
I0902 22:11:39.980360 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-control at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-cuda-mps-control    
I0902 22:11:39.980443 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-server at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-cuda-mps-server     
I0902 22:11:39.980696 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02                                                                                                                                                                                              
I0902 22:11:39.980795 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02                                                                                                                                                                                            
I0902 22:11:39.980919 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02
I0902 22:11:39.981004 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02
I0902 22:11:39.981090 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02
I0902 22:11:39.981182 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02
I0902 22:11:39.981272 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02
I0902 22:11:39.981314 2840718 nvc_mount.c:527] creating symlink /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
I0902 22:11:39.981482 2840718 nvc_mount.c:134] mounting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libcuda.so.470.129.06 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libcuda.so.470.129.06
I0902 22:11:39.981569 2840718 nvc_mount.c:134] mounting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libnvidia-ptxjitcompiler.so.470.129.06 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.129.06
I0902 22:11:39.981887 2840718 nvc_mount.c:85] mounting /usr/lib/firmware/nvidia/510.68.02/gsp.bin at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/lib/firmware/nvidia/510.68.02/gsp.bin with flags 0x7
I0902 22:11:39.981971 2840718 nvc_mount.c:230] mounting /dev/nvidiactl at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidiactl
I0902 22:11:39.982876 2840718 nvc_mount.c:230] mounting /dev/nvidia-uvm at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia-uvm
I0902 22:11:39.983470 2840718 nvc_mount.c:230] mounting /dev/nvidia-uvm-tools at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia-uvm-tools
I0902 22:11:39.983976 2840718 nvc_mount.c:230] mounting /dev/nvidia0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia0
I0902 22:11:39.984099 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:04:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:04:00.0
I0902 22:11:39.984695 2840718 nvc_mount.c:230] mounting /dev/nvidia1 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia1
I0902 22:11:39.984812 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:05:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:05:00.0
I0902 22:11:39.985425 2840718 nvc_mount.c:230] mounting /dev/nvidia2 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia2
I0902 22:11:39.985541 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:08:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:08:00.0
I0902 22:11:39.986207 2840718 nvc_mount.c:230] mounting /dev/nvidia3 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia3
I0902 22:11:39.986322 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:09:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:09:00.0
I0902 22:11:39.986963 2840718 nvc_mount.c:230] mounting /dev/nvidia4 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia4
I0902 22:11:39.987076 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:0b:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:0b:00.0
I0902 22:11:39.987794 2840718 nvc_mount.c:230] mounting /dev/nvidia5 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia5
I0902 22:11:39.987907 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:84:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:84:00.0
I0902 22:11:39.988593 2840718 nvc_mount.c:230] mounting /dev/nvidia6 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia6
I0902 22:11:39.988707 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:85:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:85:00.0
I0902 22:11:39.989388 2840718 nvc_mount.c:230] mounting /dev/nvidia7 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia7
I0902 22:11:39.989515 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:89:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:89:00.0
I0902 22:11:39.990197 2840718 nvc_ldcache.c:372] executing /sbin/ldconfig.real from host at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged
I0902 22:11:40.012422 2840718 nvc.c:434] shutting down library context
I0902 22:11:40.012510 2840726 rpc.c:95] terminating nvcgo rpc service
I0902 22:11:40.013110 2840718 rpc.c:135] nvcgo rpc service terminated successfully
I0902 22:11:40.018693 2840725 rpc.c:95] terminating driver rpc service
I0902 22:11:40.018995 2840718 rpc.c:135] driver rpc service terminated successfully
  • Docker command, image and tag used
docker run --gpus all --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
nvidia-smi 

Other open issues

NVIDIA/nvidia-container-toolkit#251 but this is using cgroup v1
#1661 there isn't any information posted and it's on Ubuntu 20.04 instead of 22.04

Important notes / workaround

containerd.io v1.6.7 or v1.6.8 even with no-cgroups = true in /etc/nvidia-container-runtime/config.toml and specifying the devices to docker run gives Failed to initialize NVML: Unknown Error after a systemctl daemon-reload.

Downgrading containerd.io to 1.6.6 works as long as you specify no-cgroups = true in /etc/nvidia-container-runtime/config.toml and specify the devices to docker run like docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
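
The workaround above lists every NVIDIA device node by hand. A minimal sketch of assembling the same docker run command in Python follows; it assumes the device nodes sit directly under /dev and are named nvidia*, and the helper names nvidia_device_nodes and docker_run_args are hypothetical. It only builds the argument list and does not touch config.toml, so no-cgroups = true still has to be set separately.

```python
import glob
import os
import shlex


def nvidia_device_nodes(dev_dir="/dev"):
    """Return NVIDIA device node paths found directly under dev_dir (sorted)."""
    paths = glob.glob(os.path.join(dev_dir, "nvidia*"))
    # Skip directories such as /dev/nvidia-caps; only plain nodes are passed.
    return sorted(p for p in paths if not os.path.isdir(p))


def docker_run_args(devices, image, command="bash"):
    """Build a docker run argv that maps each device to the same path inside."""
    args = ["docker", "run", "--gpus", "all"]
    for dev in devices:
        args += ["--device", f"{dev}:{dev}"]
    args += ["--rm", "-it", image, command]
    return args


if __name__ == "__main__":
    argv = docker_run_args(nvidia_device_nodes(), "nvidia/cuda:11.4.2-base-ubuntu18.04")
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(argv))
```

Mapping each node to the identical path inside the container avoids typos in the container-side path, which are easy to make when the flags are typed out manually.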

@kevin-bockman

@elezar Previously persistence mode was off, so this happens either way.

Also, on k8s-device-plugin/issues/289 @klueska said:
The only thing we've seen that fully resolves the issue is to upgrade to an "experimental" version of our NVIDIA container runtime that bypasses the need for libnvidia-container to change cgroup permissions out from underneath runC.
Was that merged, or is it something I should try?

@elezar
Member

elezar commented Sep 5, 2022

@kevin-bockman the experimental mode is still a work in progress and we don't have a concrete timeline on when this will be available for testing. I will update the issue here as soon as I have more information.

@klueska
Contributor

klueska commented Sep 5, 2022

The other option is to move to cgroupv2. Since devices are not an actual subsystem in cgroupv2, there is no chance for containerd to undo what libnvidia-container has done under the hood after a restart.

@kevin-bockman

kevin-bockman commented Sep 6, 2022

@klueska Sorry, with all of the information, it wasn't really clear. The problem is that it's already on cgroupv2 AFAIK; at least, docker info says it is. I started from a fresh install of Ubuntu 22.04.1.

The only way I could get this to work after a systemctl daemon-reload is downgrading containerd.io to 1.6.6 and specifying no-cgroups. The other interesting thing is that with containerd v1.6.7 or v1.6.8, even specifying no-cgroups still had the issue, so I'm wondering if there's more than one issue here. I know cgroup v2 has 'fixed' the issue for some people, or so they think (this can look like an intermittent issue if you don't know that the reload triggers it), but it doesn't seem to have fixed it for everyone. Unless I'm missing something, it doesn't work on a fresh install after doing a daemon reload, or after just waiting for something to be triggered by the OS.

$ docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.8.2-docker)

Server:
 Containers: 4
  Running: 4
  Paused: 0
  Stopped: 0
 Images: 4
 Server Version: 20.10.17
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.0-46-generic
 Operating System: Ubuntu 22.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 94.36GiB
 Name: node5-4
 ID: PPB6:APYD:PKMA:BIOZ:2Y3H:LZUV:TPHD:SBZE:XRSL:NJCB:PWMX:ZVBY
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
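
Since the discussion keeps coming back to which cgroup version and cgroup driver a host uses, here is a small sketch for pulling just those two fields out of docker info text like the output above. The function name cgroup_settings is hypothetical, and the parsing assumes the plain "Key: value" layout shown in this comment.

```python
def cgroup_settings(docker_info_text):
    """Extract 'Cgroup Driver' and 'Cgroup Version' from `docker info` text output."""
    settings = {}
    for line in docker_info_text.splitlines():
        # Each relevant line looks like " Cgroup Driver: systemd".
        key, sep, value = line.strip().partition(":")
        if sep and key in ("Cgroup Driver", "Cgroup Version"):
            settings[key] = value.strip()
    return settings


sample = """\
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
"""
print(cgroup_settings(sample))  # {'Cgroup Driver': 'systemd', 'Cgroup Version': '2'}
```

Feeding it the output of docker info from each node makes it quick to compare hosts that show the problem with hosts that do not.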

@mf-giwoong-lee

mf-giwoong-lee commented Sep 8, 2022

@kevin-bockman I had a similar experience.

In my case,

docker run -it --device /dev/nvidiactl:/dev/nvidiactl \
--device /dev/nvidia-uvm:/dev/nvidia-uvm  \
--device /dev/nvidia0:/dev/nvidia0 \
--device /dev/nvidia1:/dev/nvidia1 \
--device /dev/nvidia2:/dev/nvidia2 \
--device /dev/nvidia3:/dev/nvidia3 \
--name <container_name> <image_name>
(Replace/repeat nvidia0 with other/more devices as needed.)

This setting works on some machines and not on others.
Finally, I found that the working machines have containerd.io version 1.4.6-1 (Ubuntu 18.04)!
On an Ubuntu 20.04 machine, containerd.io version 1.5.2-1 makes it work.

I tried downgrading and upgrading containerd.io to check whether this strategy works.
It works for me.

@mf-giwoong-lee

The above is not the answer...

It prevents the NVML error caused by a docker resource update, but the NVML error still occurs after a random amount of time.

@theluke

theluke commented Oct 15, 2022

Same issue. Ubuntu 22, docker ce. I will just end up writing a cron job script to check for the error and restart the container.
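
A minimal sketch of what such a check-and-restart script could look like, assuming the docker CLI is on the PATH and that nvidia-smi exists inside the container. The function names gpu_lost and check_and_restart are hypothetical, and the run parameter exists only so the logic can be exercised without Docker. This papers over the symptom and is not a fix.

```python
import subprocess

NVML_ERROR = "Failed to initialize NVML"


def gpu_lost(container, run=subprocess.run):
    """Return True if nvidia-smi inside the container reports the NVML error."""
    result = run(
        ["docker", "exec", container, "nvidia-smi"],
        capture_output=True,
        text=True,
    )
    # Check both streams, since the message may land on either one.
    return NVML_ERROR in (result.stdout + result.stderr)


def check_and_restart(container, run=subprocess.run):
    """Restart the container if its GPUs are gone; return True if restarted."""
    if not gpu_lost(container, run=run):
        return False
    run(["docker", "restart", container], check=True)
    return True

# Example (hypothetical container name), e.g. called from a cron-driven script:
#   check_and_restart("my-gpu-container")
```

Restarting interrupts whatever the container was running, so this only suits workloads that can resume on their own.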

@iFede94
Author

iFede94 commented Oct 16, 2022

The solution proposed by @kevin-bockman has been working without any problem for more than a month now.

Downgrading containerd.io to 1.6.6 works as long as you specify no-cgroups = true in /etc/nvidia-container-runtime/config.toml and specify the devices to docker run like docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash

@theluke

theluke commented Oct 25, 2022

I am using docker-ce on Ubuntu 22, so I opted for this approach, working fine so far.

@myron

myron commented Nov 2, 2022

same issue on Nvidia 3090
Ubuntu 22.04.1 LTS, Driver Version: 510.85.02 CUDA Version: 11.6

@fradsj

fradsj commented Nov 10, 2022

Hello there.

I'm hitting the same issue here, but with containerd rather than docker.

Here's my configuration:

  • GPUs:

     # lspci | grep -i nvidia
     00:04.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
  • OS:

     # cat /etc/lsb-release
     DISTRIB_ID=Ubuntu
     DISTRIB_RELEASE=22.04
     DISTRIB_CODENAME=jammy
     DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
  • containerd release:

     # containerd --version
     containerd containerd.io 1.6.8 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
  • nvidia-container-toolkit version:

     # nvidia-container-toolkit -version
     NVIDIA Container Runtime Hook version 1.11.0
     commit: d9de4a0
  • runc version:

    # runc --version
    runc version 1.1.4
    commit: v1.1.4-0-g5fd4c4d
    spec: 1.0.2-dev
    go: go1.17.13
    libseccomp: 2.5.1

Note that Nvidia's container toolkit was installed by Nvidia's GPU operator on Kubernetes (v1.25.3).

I attached the containerd configuration file and the nvidia-container-runtime configuration file to my comment.
containerd.txt
nvidia-container-runtime.txt

How I reproduce this bug:

I run the following command on my host:

# nerdctl run -n k8s.io --runtime=/usr/local/nvidia/toolkit/nvidia-container-runtime --network=host --rm -ti --name ubuntu --gpus all -v /run/nvidia/driver/usr/bin:/tmp/nvidia-bin docker.io/library/ubuntu:latest bash

After some time, the nvidia-smi command exits with the error Failed to initialize NVML: Unknown Error.

Traces, logs, etc...

  • Here are the devices listed in the state.json file:
      {
         "type": 99,
         "major": 195,
         "minor": 255,
         "permissions": "",
         "allow": false,
         "path": "/dev/nvidiactl",
         "file_mode": 438,
         "uid": 0,
         "gid": 0
       },
       {
         "type": 99,
         "major": 234,
         "minor": 0,
         "permissions": "",
         "allow": false,
         "path": "/dev/nvidia-uvm",
         "file_mode": 438,
         "uid": 0,
         "gid": 0
       },
       {
         "type": 99,
         "major": 234,
         "minor": 1,
         "permissions": "",
         "allow": false,
         "path": "/dev/nvidia-uvm-tools",
         "file_mode": 438,
         "uid": 0,
         "gid": 0
       },
       {
         "type": 99,
         "major": 195,
         "minor": 254,
         "permissions": "",
         "allow": false,
         "path": "/dev/nvidia-modeset",
         "file_mode": 438,
         "uid": 0,
         "gid": 0
       },
       {
         "type": 99,
         "major": 195,
         "minor": 0,
         "permissions": "",
         "allow": false,
         "path": "/dev/nvidia0",
         "file_mode": 438,
         "uid": 0,
         "gid": 0
       }
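
The numeric fields in these entries are easier to read once decoded. The sketch below assumes, as in runc's libcontainer, that type is a character code (99 is ASCII 'c', a character device) and that file_mode is printed in decimal (438 is octal 666). The function name describe_device is hypothetical.

```python
import json


def describe_device(entry):
    """Render one state.json device entry in a human-readable form."""
    dev_type = chr(entry["type"])   # 99 -> 'c' (character device), assumed encoding
    mode = oct(entry["file_mode"])  # 438 -> '0o666'
    return (
        f'{entry["path"]} {dev_type} {entry["major"]}:{entry["minor"]} '
        f'mode={mode} allow={entry["allow"]}'
    )


entry = json.loads("""
{"type": 99, "major": 195, "minor": 255, "permissions": "", "allow": false,
 "path": "/dev/nvidiactl", "file_mode": 438, "uid": 0, "gid": 0}
""")
print(describe_device(entry))  # /dev/nvidiactl c 195:255 mode=0o666 allow=False
```

Read this way, every NVIDIA node above is a world-readable and world-writable character device that is recorded with allow set to false and an empty permissions string.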

Thank you very much for your help. 🙏

@gengwg

gengwg commented Nov 22, 2022

Here I wrote up the detailed steps for how I fixed this issue in our env with cgroup v2. Let me know if it works in your env.

https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc

@GuillaumeSmaha

GuillaumeSmaha commented Nov 23, 2022

Here I wrote up the detailed steps for how I fixed this issue in our env with cgroup v2. Let me know if it works in your env.

https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc

@gengwg Can you check whether your solution still works after calling sudo systemctl daemon-reload on the host? In my case (cgroupv1), it immediately breaks the pod; from the pod, nvidia-smi returns Failed to initialize NVML: Unknown Error.

@gengwg

gengwg commented Nov 23, 2022

Yes. That's actually the first thing I tested when I upgraded v1 --> v2. It's easy to test, because you don't need to wait a few hours/days.

To double check, I just tested it again right now.

Before:

$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-212a30ad-0ea4-8201-1be0-cdc575e55034)

Do the reload on that node itself:

# systemctl daemon-reload

After:

$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-212a30ad-0ea4-8201-1be0-cdc575e55034)

I will update the note to reflect this test too.
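
The before/after comparison above can be automated. Here is a rough sketch that parses nvidia-smi -L output and reports whether the same GPUs are still visible after the reload. The names parse_gpu_list and survived_reload are hypothetical, and the regular expression assumes the line format shown in this comment.

```python
import re

GPU_LINE = re.compile(r"^GPU (\d+): (.+) \(UUID: (GPU-[0-9a-f-]+)\)$")


def parse_gpu_list(output):
    """Parse `nvidia-smi -L` output into {index: (name, uuid)}."""
    gpus = {}
    for line in output.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus[int(match.group(1))] = (match.group(2), match.group(3))
    return gpus


def survived_reload(before, after):
    """True if at least one GPU was visible before and the same set is visible after."""
    seen_before = parse_gpu_list(before)
    return bool(seen_before) and seen_before == parse_gpu_list(after)


before = "GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-212a30ad-0ea4-8201-1be0-cdc575e55034)"
print(survived_reload(before, before))                                      # True
print(survived_reload(before, "Failed to initialize NVML: Unknown Error"))  # False
```

Capturing the two outputs around a systemctl daemon-reload on the node gives a reproducer that does not depend on waiting hours or days.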

@gengwg

gengwg commented Nov 23, 2022

And I can also confirm that's what I saw on our cgroupv1 nodes too, i.e. sudo systemctl daemon-reload immediately breaks nvidia-smi.

@panli889

Here I wrote up the detailed steps for how I fixed this issue in our env with cgroup v2. Let me know if it works in your env.

https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc

Hi, what's your cgroup driver for kubelet and containerd? We hit the same problem on cgroup v2. Our cgroup driver is systemd, but if we switch the cgroup driver to cgroupfs, the problem disappears. I think it's the systemd cgroup driver that causes the problem.

Switching the cgroup driver of docker to cgroupfs also solves the problem.

@panli889

Important notes / workaround

containerd.io v1.6.7 or v1.6.8 even with no-cgroups = true in /etc/nvidia-container-runtime/config.toml and specifying the devices to docker run gives Failed to initialize NVML: Unknown Error after a systemctl daemon-reload.

Downgrading containerd.io to 1.6.6 works as long as you specify no-cgroups = true in /etc/nvidia-container-runtime/config.toml and specify the devices to docker run like docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash

I've also tried this. The reason containerd 1.6.7 doesn't work is that runc has been updated to 1.1.3. In this version, runc ignores char devices that can't be os.Stat'ed (see this PR). Unfortunately, the GPU-related devices are that kind of device, so it goes wrong.

@fradsj

fradsj commented Nov 24, 2022

@gengwg Thanks for sharing your document. As I run my kubernetes cluster on ubuntu 22.04, cgroupv2 is the default cgroup subsystem used.

I deployed two environments to help me make some comparisons:

  • One environment is running kubernetes v1.25.3, with Nvidia's GPU operator
  • One environment with only containerd & nvidia-container-toolkit

Interestingly, I never hit this issue in the second environment; everything runs perfectly well.

The first environment, though, runs into this issue after some time.

That probably means that Nvidia's container runtime isn't the faulty component here, but it needs more investigation on my side to be sure that I'm not missing anything.

I'll have a look at the cgroup driver as @panli889 mentioned.

Thanks again for your help

@gengwg

gengwg commented Nov 24, 2022

cgroup driver for kubelet, docker and containerd are all systemd. In fact, in cgroupv1 we used to use cgroupfs, but kubelet wouldn't start, complaining about a mismatch between the kubelet and docker cgroup drivers. After I changed the docker (and containerd) cgroup driver to systemd, kubelet was able to start.

# cat /etc/systemd/system/kubelet.service | grep -i cgroup
  --runtime-cgroups=/systemd/system.slice \
  --kubelet-cgroups=/systemd/system.slice \
  --cgroup-driver=systemd \

We are in the middle of migrating from docker to containerd, so we have both docker and containerd nodes. This seems to have fixed it for BOTH.

Docker nodes:

# docker info | grep -i cgroup
WARNING: No swap limit support
 Cgroup Driver: systemd
 Cgroup Version: 2
  cgroupns

Containerd nodes:

$ sudo crictl info | grep -i cgroup
            "SystemdCgroup": true
            "SystemdCgroup": true
    "systemdCgroup": false,
    "disableCgroup": false,

Here is our k8s version:

$ k version --short
Client Version: v1.21.3
Server Version: v1.22.9

@gengwg

gengwg commented Nov 24, 2022

@gengwg Thanks for sharing your document. As I run my kubernetes cluster on ubuntu 22.04, cgroupv2 is the default cgroup subsystem used.

I deployed two environments to help me make some comparisons:

  • One environment is running kubernetes v1.25.3, with Nvidia's GPU operator
  • One environment with only containerd & nvidia-container-toolkit

Interestingly, I never hit this issue in the second environment; everything runs perfectly well.

The first environment, though, runs into this issue after some time.

That probably means that Nvidia's container runtime isn't the faulty component here, but it needs more investigation on my side to be sure that I'm not missing anything.

I'll have a look at the cgroup driver as @panli889 mentioned.

Thanks again for your help

I think ours is similar to your 2nd env, i.e. containerd & nvidia-container-toolkit. We are on k8s v1.22.9.

# containerd --version
containerd containerd.io 1.6.6 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1

# dnf info nvidia-container-toolkit | grep Version
Version      : 1.11.0

I posted the cgroup driver info above.

@panli889

@gengwg thx for your reply!

cgroup driver for kubelet, docker and containerd are all systemd.

Hmm, that's interesting, it's quite different from my situation. Would you please share your systemd version?

I can share the problem we hit. When we create a pod with a GPU, a related systemd scope such as cri-containerd-xxxxxx.scope is created at the same time, and it records the cgroup info. If we run systemctl status to check it, we see:

Warning: The unit file, source configuration file or drop-ins of cri-containerd-xxxxx.scope changed on disk. Run 'systemctl daemon-reload' to  reload units.
● cri-containerd-xxx.scope - libcontainer container xxxx
     Loaded: loaded (/run/systemd/transient/cri-containerd-xxxx.scope; transient)
  Transient: yes
    Drop-In: /run/systemd/transient/cri-containerd-xxxxx.scope.d
             └─50-DevicePolicy.conf, 50-DeviceAllow.conf, 50-CPUWeight.conf, 50-CPUQuotaPeriodSec.conf, 50-CPUQuota.conf, 50-AllowedCPUs.conf
     Active: active (running) since Fri 2022-11-25 12:13:33 +08; 1min 47s ago
         IO: 404.0K read, 0B written
      Tasks: 1
     Memory: 528.0K
        CPU: 2.562s
     CGroup: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podb6b36d39_ef5b_4eb9_850d_d710bbd06096.slice/cri-containerd-xxx.scope>
             └─61265 sleep infinity

If we check the content of the file 50-DeviceAllow.conf, we find no GPU device info in there. Then, if we run systemctl daemon-reload, it regenerates an eBPF cgroup program for the devices, and that program blocks access to the GPU devices.

So would you please also take a look at the content of 50-DeviceAllow.conf for the systemd scope of one of your pods? What's in there?

@numerical2017

Same issue with 2 x Nvidia 3090 Ti, Ubuntu 22.04.1 LTS, Driver Version: 510.85.02, CUDA Version: 11.6
I adopted the solution proposed by @kevin-bockman downgrading containerd.io from 1.6.10 to 1.6.6. After running systemctl daemon-reload on the host machine the nvidia-smi within the container still works properly. I will check how long it lasts and I'll keep you updated.

@fradsj

fradsj commented Nov 29, 2022

@panli889 I checked the scope unit with systemctl status, and this message popped up:

Warning: The unit file, source configuration file or drop-ins of cri-containerd-d35333ac42f1e08a33632fccd63028a28443f95f3c126860a8c9da20b6d27102.scope changed on disk. Run 'systemctl daemon-reload' to reload units.

After running systemctl daemon-reload, I get the error on my container:

root@ubuntu:/# nvidia-smi
Failed to initialize NVML: Unknown Error

Here's the content of the 50-DeviceAllow.conf file:

[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m

There's indeed no reference to nvidia's devices here:

crw-rw-rw- 1 root root 195, 254 Nov 29 10:18 nvidia-modeset
crw-rw-rw- 1 root root 234,   0 Nov 29 10:18 nvidia-uvm
crw-rw-rw- 1 root root 234,   1 Nov 29 10:18 nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Nov 29 10:18 nvidia0
crw-rw-rw- 1 root root 195, 255 Nov 29 10:18 nvidiactl

nvidia-caps:
total 0
cr-------- 1 root root 237, 1 Nov 29 10:18 nvidia-cap1
cr--r--r-- 1 root root 237, 2 Nov 29 10:18 nvidia-cap2
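
The comparison between the drop-in and the device listing can be done mechanically. The sketch below parses the explicit /dev/char/MAJOR:MINOR entries of a 50-DeviceAllow.conf and reports which of the needed devices have no entry. The function names are hypothetical, the sample is a shortened copy of the file above, and class entries such as char-pts or char-* are deliberately ignored, which is a simplification.

```python
PREFIX = "DeviceAllow=/dev/char/"


def allowed_char_devices(conf_text):
    """Return {(major, minor): perms} for explicit /dev/char/MAJOR:MINOR entries."""
    allowed = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line.startswith(PREFIX):
            continue  # skips [Scope], the reset line, and class entries like char-pts
        spec, _, perms = line[len(PREFIX):].partition(" ")
        major, _, minor = spec.partition(":")
        allowed[(int(major), int(minor))] = perms
    return allowed


def missing_devices(conf_text, needed):
    """List device names whose (major, minor) has no explicit DeviceAllow= entry."""
    allowed = allowed_char_devices(conf_text)
    return [name for name, dev in needed.items() if dev not in allowed]


conf = """\
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
"""
# Major/minor numbers taken from the device listing in this comment.
nvidia = {"nvidiactl": (195, 255), "nvidia-uvm": (234, 0), "nvidia0": (195, 0)}
print(missing_devices(conf, nvidia))  # ['nvidiactl', 'nvidia-uvm', 'nvidia0']
```

A non-empty result for a running GPU container matches the situation described here: the unit files do not list the GPU nodes, so a reload that reapplies them can cut off access.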

@panli889

@fradsj thanks for your reply, seems the same problem as us.

Here is how we solve it, hope it will help:

@Navino16

Hi,

Any official way to fix this error ?

@pomodorox

See https://github.com/NVIDIA/k8s-device-plugin#setting-other-helm-chart-values (which needs an update for a discussion on the options and setting up privileged). Privileged mode is required when passing the device specs so that the device plugin can see all the required device nodes. Otherwise it would not have the required access (even though this is also provided by the nvidia container toolkit).

Using privileged mode for the DP didn't work. But using privileged mode for the user workload Pod did work. Also, it seems that as long as the user workload Pod is privileged, there aren't any problems: the DP doesn't need to be privileged, and no symlinks for the char devices need to be created.

@klueska
Contributor

klueska commented May 10, 2023

That is true, but most users don't want to run their user pods as privileged (and they shouldn't have to if everything else is set up properly).

@gaopeiliang

Hmm... our k8s cluster uses an old gpu-device-plugin that does not support pass-device-specs, so we should test it.

Which version are you using?

The no-cgroups option is used to control whether the NVIDIA Container Library should update the cgroups for a container to allow access to a device. For the rootless case, where a user does not have permissions to manage cgroups, this must be disabled. I don't have enough experience to know whether your proposed combination would work as expected.

device-plugin version 1.0.0-beta

runc will also write to the cgroup fs if it has a device list, so pass-device-specs + no-cgroups = true always sets it successfully in my tests.

@breakingflower

Some relevant comments & solutions from @cdesiniotis at nvidia on the matter: #1730

@punkerpunker

punkerpunker commented May 18, 2023

Some relevant comments & solutions from @cdesiniotis at nvidia on the matter: #1730

Thanks @breakingflower, that's very useful.

FYI: From the Notice:

Deploying GPU Operator 22.9.2 will automatically fix the issue on all K8s nodes of the cluster (the fix is integrated inside the validator pod which will run when a new node is deployed or at every reboot of the node).

Does sound very promising but unfortunately doesn't solve the issue.

I can confirm that using the new version of GPU Operator resolves the issue when CDI is enabled in gpu-operator config:

  cdi:
    enabled: true
    default: true

However, I am facing the issue where nvidia-container-toolkit-daemonset couldn't start properly after the reboot of the machine:

  Warning  Failed          4m34s (x4 over 6m10s)  kubelet          Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices management.nvidia.com/gpu=all: unknown

@zhanwenchen

Any update on this?

@klueska
Contributor

klueska commented Aug 4, 2023

Please see this notice from February:
#1730

@leemingeer

leemingeer commented Aug 23, 2023

@klueska

Please see this notice from February: #1730
I have read it in detail. Could you explain how to get the correct {{NVIDIA_DRIVER_ROOT}} in cases where the driver container is also in use?
It is not clear to me; the default value in nvidia-ctk is /

@pcanas

pcanas commented Sep 5, 2023

Is there any timeline for a solution besides the workarounds described in #1730?

@rogelioamancisidor

rogelioamancisidor commented Sep 9, 2023

I tried the suggested approach in #6380, but it didn't solve the problem. It is quite frustrating as I cannot rely on AKS at the moment. I hope this issue is solved soon.

@klueska
Contributor

klueska commented Sep 12, 2023

@rogelioamancisidor we've heard that AKS ships with a really old version of the k8s-device-plugin (from 2019!) which doesn't support the PASS_DEVICE_SPECS flag. You will need to update the plugin to a newer one and pass this flag for things to work on AKS.

@rogelioamancisidor

rogelioamancisidor commented Sep 13, 2023

@klueska Here is the plugin that I got suggested in the other discussion plugin. Do you have a link for a newer k8s-device-plugin? I'll really appreciate it as I have tried different things without any luck.

@elezar
Member

elezar commented Sep 13, 2023

@klueska Here is the plugin that I got suggested in the other discussion plugin and I just noticed, as you mentioned, that the plugin dates 2019. Do you have a link for a newer k8s-device-plugin? I'll really appreciate it as I have tried different things without any luck.

The plugin is available at https://github.com/NVIDIA/k8s-device-plugin and the README should cover a variety of deployment options, of which helm is the recommended one.

The latest version of the plugin is v0.14.1.

@rogelioamancisidor

I deployed a DaemonSet for the NVIDIA device plugin using the yaml manifest in the link that I posted. The manifest in the link includes this line - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1. Isn't that manifest deploying the latest version then? PASS_DEVICE_SPECS is also set to true as suggested by AKS.

@homjay

homjay commented Sep 29, 2023

Here is the official solution:

#1730 (comment)

modify /etc/docker/docker.json

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}

it is working.
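
Editing the file by hand risks dropping other settings that are already there. A small sketch of merging the cgroupfs option into an existing daemon configuration instead follows; the function name use_cgroupfs is hypothetical, and it works on an already-loaded dict so that reading and writing the file stays up to the caller.

```python
import json

DRIVER_PREFIX = "native.cgroupdriver="


def use_cgroupfs(config):
    """Return a copy of a Docker daemon config with the cgroupfs cgroup driver set."""
    updated = dict(config)
    # Drop any existing cgroup driver option, keep unrelated exec-opts.
    opts = [o for o in updated.get("exec-opts", []) if not o.startswith(DRIVER_PREFIX)]
    opts.append(DRIVER_PREFIX + "cgroupfs")
    updated["exec-opts"] = opts
    return updated


existing = {"default-runtime": "nvidia", "exec-opts": ["native.cgroupdriver=systemd"]}
print(json.dumps(use_cgroupfs(existing), indent=4))
```

Docker has to be restarted for a changed daemon configuration to take effect. As noted earlier in this thread, kubelet refuses to start when its cgroup driver does not match the container runtime's, so on Kubernetes nodes this change cannot be made in isolation.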

@YochayTzur

modify /etc/docker/docker.json

Isn't it /etc/docker/daemon.json?

@rogelioamancisidor

@homjay I don't think that solution works on K8s.

@elezar
Member

elezar commented Nov 19, 2023

This is an issue as described in NVIDIA/nvidia-container-toolkit#48

Since this issue has a number of different failure modes discussed, I'm going to close this issue and ask that those still having a problem open new issues in the respective repositories.

We are looking to migrate all issues in this repo to https://github.com/NVIDIA/nvidia-container-toolkit in the near term.
