Running a basic nginx container (created by Kubernetes) on containerd with the nvidia-container-toolkit leads to seemingly random SIGQUIT signals, most of the time very quickly (about one second after startup). This happens every time (I'm past restart number 1000). Running the same container on the same host WITHOUT nvidia-container-runtime (i.e. with the default runc runtime) works fine.
I'm at a loss on how to debug this further. Hopefully someone can point me towards the next debugging step. I've tried to include as much context as I could below.
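For reference, a pod of roughly this shape is enough to exercise both paths. This is a minimal sketch only: the RuntimeClass name `nvidia` and the pinned node name are assumptions, not taken from this issue, and on nodes where the NVIDIA runtime is the containerd default the `runtimeClassName` line isn't needed at all. Omitting it runs the same image under the default runc runtime, which is the working case.

```yaml
# Minimal sketch of a "basic nginx" pod used to compare the two runtimes.
# Assumptions (not from this issue): the NVIDIA runtime is exposed as a
# RuntimeClass named "nvidia", and the affected GPU node is pinned explicitly.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-sigquit-repro
spec:
  runtimeClassName: nvidia   # remove this line to fall back to the default runc runtime
  nodeName: gpu-node-1       # placeholder for the node running containerd + nvidia-container-toolkit
  containers:
    - name: nginx
      image: nginx:1.21.4    # same version as in the logs below
```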
Container logs:
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: /etc/nginx/conf.d/default.conf is not a file or does not exist
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2022/08/01 13:26:56 [notice] 1#1: using the "epoll" event method
2022/08/01 13:26:56 [notice] 1#1: nginx/1.21.4
2022/08/01 13:26:56 [notice] 1#1: built by gcc 10.2.1 20210110 (Debian 10.2.1-6)
2022/08/01 13:26:56 [notice] 1#1: OS: Linux 5.10.0-16-amd64
2022/08/01 13:26:56 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2022/08/01 13:26:56 [notice] 1#1: start worker processes
2022/08/01 13:26:56 [notice] 1#1: start worker process 22
2022/08/01 13:26:56 [notice] 1#1: start worker process 23
2022/08/01 13:26:58 [notice] 1#1: signal 3 (SIGQUIT) received, shutting down
2022/08/01 13:26:58 [notice] 22#22: gracefully shutting down
2022/08/01 13:26:58 [notice] 23#23: gracefully shutting down
2022/08/01 13:26:58 [notice] 23#23: exiting
2022/08/01 13:26:58 [notice] 23#23: exit
2022/08/01 13:26:58 [notice] 1#1: signal 17 (SIGCHLD) received from 23
2022/08/01 13:26:58 [notice] 1#1: worker process 23 exited with code 0
2022/08/01 13:26:58 [notice] 1#1: signal 29 (SIGIO) received
2022/08/01 13:27:08 [notice] 22#22: exiting
2022/08/01 13:27:08 [notice] 22#22: exit
2022/08/01 13:27:08 [notice] 1#1: signal 17 (SIGCHLD) received from 22
2022/08/01 13:27:08 [notice] 1#1: worker process 22 exited with code 0
2022/08/01 13:27:08 [notice] 1#1: exit
Container toolkit logs:
nvidia-container-toolkit.log
Container runtime logs:
nvidia-container-runtime.log
Versions:
Containerd config:
Additional info:
crictl inspect
Thanks!

Looks like it worked!
I missed this option when adding the nvidia runtime, good catch!
Interesting fact: it worked fine without this option on other nodes equipped with tegra T4 and earlier versions of the toolkit/driver. I will make sure to apply it on every node.
Thanks!
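The option itself isn't named in the thread, so as a hedged illustration only: one common cause of containers under an added nvidia runtime entry being torn down shortly after start is a `SystemdCgroup` setting that doesn't match the default runc runtime. A containerd config sketch (section names are the standard CRI plugin layout; paths and values are assumed, not copied from this issue):

```toml
# /etc/containerd/config.toml -- illustrative sketch, not the poster's actual config.
# If the default runc runtime uses SystemdCgroup = true, the added nvidia runtime
# entry must set it as well; otherwise systemd may clean up the container's cgroup
# and the container gets shut down shortly after it starts.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
    SystemdCgroup = true
```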