
Containerd: Getting random SIGQUIT #28

Closed
olevitt opened this issue Aug 1, 2022 · 2 comments

@olevitt

olevitt commented Aug 1, 2022

Running a basic nginx container (created by Kubernetes) on a containerd + nvidia-container-toolkit host leads to random SIGQUIT signals, most of the time very soon after startup (startup + ~1 second). This happens every time (I'm at restart number 1000+). Running the same container on the same host WITHOUT nvidia-container-runtime (i.e. with the default runc runtime) works fine.
I'm kind of lost on how to debug this further. Hopefully someone can point me towards the next debugging step. I've tried to include as much context as I could below.

Container logs:

/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: /etc/nginx/conf.d/default.conf is not a file or does not exist
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2022/08/01 13:26:56 [notice] 1#1: using the "epoll" event method
2022/08/01 13:26:56 [notice] 1#1: nginx/1.21.4
2022/08/01 13:26:56 [notice] 1#1: built by gcc 10.2.1 20210110 (Debian 10.2.1-6) 
2022/08/01 13:26:56 [notice] 1#1: OS: Linux 5.10.0-16-amd64
2022/08/01 13:26:56 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2022/08/01 13:26:56 [notice] 1#1: start worker processes
2022/08/01 13:26:56 [notice] 1#1: start worker process 22
2022/08/01 13:26:56 [notice] 1#1: start worker process 23
2022/08/01 13:26:58 [notice] 1#1: signal 3 (SIGQUIT) received, shutting down
2022/08/01 13:26:58 [notice] 22#22: gracefully shutting down
2022/08/01 13:26:58 [notice] 23#23: gracefully shutting down
2022/08/01 13:26:58 [notice] 23#23: exiting
2022/08/01 13:26:58 [notice] 23#23: exit
2022/08/01 13:26:58 [notice] 1#1: signal 17 (SIGCHLD) received from 23
2022/08/01 13:26:58 [notice] 1#1: worker process 23 exited with code 0
2022/08/01 13:26:58 [notice] 1#1: signal 29 (SIGIO) received
2022/08/01 13:27:08 [notice] 22#22: exiting
2022/08/01 13:27:08 [notice] 22#22: exit
2022/08/01 13:27:08 [notice] 1#1: signal 17 (SIGCHLD) received from 22
2022/08/01 13:27:08 [notice] 1#1: worker process 22 exited with code 0
2022/08/01 13:27:08 [notice] 1#1: exit

Container toolkit logs:

nvidia-container-toolkit.log

Container runtime logs:

nvidia-container-runtime.log

Versions:

root@boss10:~# nvidia-container-toolkit --version
NVIDIA Container Runtime Hook version 1.10.0
commit: 7cfd3bd
root@boss10:~# nvidia-container-runtime --version
NVIDIA Container Runtime version 1.10.0
commit: 7cfd3bd
spec: 1.0.2-dev

runc version 1.1.1
commit: v1.1.0-20-g52de29d7
spec: 1.0.2-dev
go: go1.17.6
libseccomp: 2.5.3

Containerd config:

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "k8s.gcr.io/pause:3.3"
    max_container_log_line_size = -1
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      snapshotter = "overlayfs"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
          runtime_engine = ""
          runtime_root = ""
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            systemdCgroup = true
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"
          runtime_engine = ""
          runtime_root = ""
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
root@boss10:~# nvidia-smi
Mon Aug  1 15:34:03 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A2           Off  | 00000000:17:00.0 Off |                    0 |
|  0%   35C    P8     8W /  60W |      0MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A2           Off  | 00000000:18:00.0 Off |                    0 |
|  0%   34C    P8     8W /  60W |      0MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A2           Off  | 00000000:B1:00.0 Off |                    0 |
|  0%   34C    P8     8W /  60W |      0MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A2           Off  | 00000000:CA:00.0 Off |                    0 |
|  0%   35C    P8     8W /  60W |      0MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+  

Additional info:

containerd github.com/containerd/containerd v1.6.4 212e8b6fa2f44b9c21b2798135fc6fb7c53efc16
Kubernetes v1.23.7 (Kubespray v2.19.0)  
Debian 11 (kernel 5.10.0-16-amd64)

crictl inspect (relevant excerpt):

"runtimeOptions": {
      "binary_name": "/usr/bin/nvidia-container-runtime"
    }

Thanks!

@klueska
Contributor

klueska commented Aug 1, 2022

It looks like your containerd config differs between the runc and nvidia runtimes on this setting:

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            systemdCgroup = true

Can you add this setting to the nvidia runtime as well and see if there is any change?
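
Concretely, a sketch based on the config you posted (keeping the same key spelling), the nvidia runtime section would become:

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"
          runtime_engine = ""
          runtime_root = ""
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            # same cgroup setting as the runc runtime above
            systemdCgroup = true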

@olevitt
Author

olevitt commented Aug 1, 2022

Looks like it worked!
I missed this option when adding the nvidia runtime, good catch!
Interesting fact: it worked fine without this option on other nodes equipped with Tesla T4 GPUs and earlier versions of the toolkit / driver. I will make sure to apply it on every node.
Thanks!
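
For anyone else hitting this: assuming the config lives at the default /etc/containerd/config.toml and containerd runs as a systemd service, the change only takes effect after a restart:

root@boss10:~# systemctl restart containerd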
