
Containerd: Getting random SIGQUIT #28

Closed
olevitt opened this issue Aug 1, 2022 · 2 comments

@olevitt

olevitt commented Aug 1, 2022

Running a basic nginx container (created by Kubernetes) on a containerd + nvidia-container-toolkit host leads to random SIGQUIT signals, most of the time very soon after startup (startup + ~1 second). This happens every time (I'm at restart number 1000+). Running the same container on the same host WITHOUT nvidia-container-runtime (i.e. with the default runc runtime) works fine.
I'm kind of lost on how to debug this further. Hopefully someone can point me towards the next debugging step. I've tried to include as much context as I could below.

Container logs:

/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: /etc/nginx/conf.d/default.conf is not a file or does not exist
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2022/08/01 13:26:56 [notice] 1#1: using the "epoll" event method
2022/08/01 13:26:56 [notice] 1#1: nginx/1.21.4
2022/08/01 13:26:56 [notice] 1#1: built by gcc 10.2.1 20210110 (Debian 10.2.1-6) 
2022/08/01 13:26:56 [notice] 1#1: OS: Linux 5.10.0-16-amd64
2022/08/01 13:26:56 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2022/08/01 13:26:56 [notice] 1#1: start worker processes
2022/08/01 13:26:56 [notice] 1#1: start worker process 22
2022/08/01 13:26:56 [notice] 1#1: start worker process 23
2022/08/01 13:26:58 [notice] 1#1: signal 3 (SIGQUIT) received, shutting down
2022/08/01 13:26:58 [notice] 22#22: gracefully shutting down
2022/08/01 13:26:58 [notice] 23#23: gracefully shutting down
2022/08/01 13:26:58 [notice] 23#23: exiting
2022/08/01 13:26:58 [notice] 23#23: exit
2022/08/01 13:26:58 [notice] 1#1: signal 17 (SIGCHLD) received from 23
2022/08/01 13:26:58 [notice] 1#1: worker process 23 exited with code 0
2022/08/01 13:26:58 [notice] 1#1: signal 29 (SIGIO) received
2022/08/01 13:27:08 [notice] 22#22: exiting
2022/08/01 13:27:08 [notice] 22#22: exit
2022/08/01 13:27:08 [notice] 1#1: signal 17 (SIGCHLD) received from 22
2022/08/01 13:27:08 [notice] 1#1: worker process 22 exited with code 0
2022/08/01 13:27:08 [notice] 1#1: exit

Container toolkit logs:

nvidia-container-toolkit.log

Container runtime logs:

nvidia-container-runtime.log

Versions:

root@boss10:~# nvidia-container-toolkit --version
NVIDIA Container Runtime Hook version 1.10.0
commit: 7cfd3bd
root@boss10:~# nvidia-container-runtime --version
NVIDIA Container Runtime version 1.10.0
commit: 7cfd3bd
spec: 1.0.2-dev

runc version 1.1.1
commit: v1.1.0-20-g52de29d7
spec: 1.0.2-dev
go: go1.17.6
libseccomp: 2.5.3

Containerd config:

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "k8s.gcr.io/pause:3.3"
    max_container_log_line_size = -1
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      snapshotter = "overlayfs"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
          runtime_engine = ""
          runtime_root = ""
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            systemdCgroup = true
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"
          runtime_engine = ""
          runtime_root = ""
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
root@boss10:~# nvidia-smi
Mon Aug  1 15:34:03 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A2           Off  | 00000000:17:00.0 Off |                    0 |
|  0%   35C    P8     8W /  60W |      0MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A2           Off  | 00000000:18:00.0 Off |                    0 |
|  0%   34C    P8     8W /  60W |      0MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A2           Off  | 00000000:B1:00.0 Off |                    0 |
|  0%   34C    P8     8W /  60W |      0MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A2           Off  | 00000000:CA:00.0 Off |                    0 |
|  0%   35C    P8     8W /  60W |      0MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+  

Additional info:

containerd github.com/containerd/containerd v1.6.4 212e8b6fa2f44b9c21b2798135fc6fb7c53efc16
Kubernetes v1.23.7 (Kubespray v2.19.0)  
Debian 11 (kernel 5.10.0-16-amd64)

crictl inspect (relevant excerpt):

"runtimeOptions": {
      "binary_name": "/usr/bin/nvidia-container-runtime"
    }

Thanks!

@klueska
Contributor

klueska commented Aug 1, 2022

It looks like your containerd config differs between the runc and nvidia runtimes on this setting:

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            systemdCgroup = true

Can you add this setting to the nvidia runtime as well and see if there is any change?
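
Concretely, a sketch based on the config you posted (keeping the same key spelling), the nvidia runtime section would become:

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"
          runtime_engine = ""
          runtime_root = ""
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            # same cgroup setting as the runc runtime above
            systemdCgroup = true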

@olevitt
Author

olevitt commented Aug 1, 2022

Looks like it worked!
I missed this option when adding the nvidia runtime, good catch!
Interesting fact: it worked fine without this option on other nodes equipped with Tesla T4 GPUs and earlier versions of the toolkit / driver. I will make sure to apply it on every node.
Thanks!
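
For anyone else hitting this: assuming the config lives at the default /etc/containerd/config.toml and containerd runs as a systemd service, the change only takes effect after a restart:

root@boss10:~# systemctl restart containerd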
