
Idle Jupyter server cause open file limit exhaustion #67

Closed
tlvu opened this issue Sep 3, 2020 · 15 comments · Fixed by #177
Assignees
Labels
bug Something isn't working

Comments

tlvu (Collaborator) commented Sep 3, 2020

This morning, none of the docker commands were responding because we had exhausted the open file limit for the user that runs the PAVICS platform (the user that does `./pavics-compose.sh up -d`).

```sh
$ ./pavics-compose.sh ps
-bash: fork: retry: No child processes
-bash: fork: retry: No child processes
-bash: fork: retry: No child processes
-bash: fork: retry: No child processes
-bash: fork: Resource temporarily unavailable
```

The immediate work-around is to increase the soft and hard nofile limits for the corresponding user in the /etc/security/limits.conf file and apply the new limit immediately with `ulimit -n NEW_LIMIT` (`ulimit -n` shows the current effective limit). Find the current number of open files with `sudo lsof -u $USER | wc -l` and put something higher than that in the limits.conf file. Reference: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/tuning_and_optimizing_red_hat_enterprise_linux_for_oracle_9i_and_10g_databases/chap-oracle_9i_and_10g_tuning_guide-setting_shell_limits_for_the_oracle_user
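For illustration, a minimal sketch of those steps; the account name and the 65536 value are placeholders, pick something higher than the current lsof count:

```sh
# Count the files currently open by the PAVICS user (placeholder account name):
sudo lsof -u PAVICS_USER | wc -l

# Raise the limits in /etc/security/limits.conf (placeholder value):
#   PAVICS_USER  soft  nofile  65536
#   PAVICS_USER  hard  nofile  65536

# Apply the new limit in the current shell and confirm it:
ulimit -n 65536
ulimit -n
```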

Then, using a different user than the one that usually runs PAVICS (because that user cannot do anything else), restart the docker daemon (`sudo systemctl restart docker`). Back as the regular user running PAVICS, if the containers have problems restarting, destroy them and re-create them from scratch (`./pavics-compose.sh down && sleep 10 && ./pavics-compose.sh up -d`).

For a more permanent solution than increasing the limit each time we burst it, set up culling of idle Jupyter servers as described here: https://discourse.jupyter.org/t/jupyterhub-doesnt-kill-processes-and-threads-when-notebooks-are-closed-or-user-log-out/2244/2
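Concretely, the culling described in that post boils down to a few options passed to the single-user notebook server; the thresholds below are illustrative placeholders, not values we have settled on:

```sh
# Illustrative culling options for the single-user Jupyter server:
#   shutdown_no_activity_timeout  stop the whole server after 1 h without activity
#   cull_idle_timeout             cull kernels idle for more than 1 h
#   cull_connected                cull kernels even if a browser tab is still connected
start-notebook.sh \
  --NotebookApp.shutdown_no_activity_timeout=3600 \
  --MappingKernelManager.cull_idle_timeout=3600 \
  --MappingKernelManager.cull_connected=True
```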

Also need to add monitoring of the open file count so we are alerted in advance of near exhaustion, to avoid having to restart the entire docker daemon.

Ping @moulab88 @tlogan2000 if you guys have anything to add.

Edit:

  • add command to see current effective limit and set new limit immediately
  • add reference to redhat docs
tlogan2000 (Collaborator) commented

The multilevel culling control could be interesting (assuming I understand it correctly!). It seems it would be possible to cull individual user kernels that are idle for, say, a few hours (level 1), and also entire user servers after being idle for a few days or even a week (level 2). I can't speak for all deployments, but on our end I think we want to give users the chance to run 'long' calculations (level 2 threshold of up to a week) while still catching notebook kernels that are left open accidentally.
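A sketch of what that two-level setup could look like with the same single-user server options; the thresholds (3 hours for kernels, 7 days for the whole server) are placeholders, not a decision:

```sh
# Hypothetical two-level culling:
#   level 1: cull idle kernels after 3 hours (10800 s)
#   level 2: shut down the whole single-user server after 7 idle days (604800 s)
start-notebook.sh \
  --MappingKernelManager.cull_idle_timeout=10800 \
  --NotebookApp.shutdown_no_activity_timeout=604800
```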

tlvu changed the title from "Idle Jupyter server caused open file limit exhaustion" to "Idle Jupyter server cause open file limit exhaustion" on Sep 18, 2020
tlvu (Collaborator, Author) commented Sep 18, 2020

Duplicated by Ouranosinc/pavics-sdi#158.

tlvu (Collaborator, Author) commented Sep 26, 2020

The list of built-in metrics exposed by https://github.com/prometheus/node_exporter does not include the open file metric we need.
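One possible workaround, assuming node_exporter's textfile collector is enabled (--collector.textfile.directory): a small cron script that counts the PAVICS user's open files and writes a custom gauge for Prometheus to alert on. The metric name, account name and paths below are hypothetical.

```sh
#!/bin/sh
# Hypothetical cron script: export the PAVICS user's open-file count as a
# Prometheus gauge through node_exporter's textfile collector.
PAVICS_USER=pavics                               # placeholder account name
TEXTFILE_DIR=/var/lib/node_exporter/textfile     # placeholder collector dir

count=$(lsof -u "$PAVICS_USER" 2>/dev/null | wc -l)

# Write to a temporary file then rename, so node_exporter never reads a partial file.
cat > "$TEXTFILE_DIR/pavics_open_files.prom.$$" <<EOF
# HELP pavics_user_open_files Open files held by the PAVICS user.
# TYPE pavics_user_open_files gauge
pavics_user_open_files $count
EOF
mv "$TEXTFILE_DIR/pavics_open_files.prom.$$" "$TEXTFILE_DIR/pavics_open_files.prom"
```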

tlvu (Collaborator, Author) commented Sep 30, 2020

Triggered this issue again during Ouranosinc/raven#251, with 25 bogus users logged in, each running 1 notebook and staying logged in for 24 hours.

A weird observation: closing the Jupyter servers of some (6) of those 25 test users did not release the open files.

fmigneault added the bug label on Jan 22, 2021
tlvu (Collaborator, Author) commented Feb 10, 2021

@tlogan2000 Just got this problem again, so I had to restart the docker daemon. You might hear complaints from Jupyter users.

tlvu (Collaborator, Author) commented Feb 22, 2021

@tlogan2000 Just got this problem again, so I had to restart the docker daemon. You might hear complaints from Jupyter users.

tlvu (Collaborator, Author) commented Mar 26, 2021

@tlogan2000 Just got this problem again today, March 26 2021, so I had to restart the docker daemon. You might hear complaints from Jupyter users.

Once the other pending tasks on my plate are done, I would very much like to attack this one.

tlvu (Collaborator, Author) commented Jun 7, 2021

@tlogan2000 Just got this problem again today, June 7 2021, so I had to restart the docker daemon. You might hear complaints from Jupyter users.

tlvu added a commit that referenced this issue Jun 9, 2021
This is basically the same as `ENABLE_JUPYTERHUB_MULTI_NOTEBOOKS` but at
the bottom of the file so it can override everything.
`ENABLE_JUPYTERHUB_MULTI_NOTEBOOKS` is kept for backward-compat.

First useful application is to enable server culling for auto shutdown
of idle kernels and jupyter single server (fixes #67).
tlvu closed this as completed in #177 on Jun 10, 2021
tlvu added a commit that referenced this issue Jun 10, 2021
jupyterhub: allow config override via env.local

## Overview

This is basically the same as `ENABLE_JUPYTERHUB_MULTI_NOTEBOOKS` but at the bottom of the file so it can override everything.

`ENABLE_JUPYTERHUB_MULTI_NOTEBOOKS` is kept for backward-compat.

The first useful application is to enable server culling for auto-shutdown of idle kernels and the idle Jupyter single-user server, which hopefully fixes #67.

The culling settings will only take effect the next time users restart their personal Jupyter server, because it seems the Jupyter server is the one culling itself. JupyterHub does not perform the culling; it simply forwards the culling settings to the Jupyter server.

```sh
$ docker inspect jupyter-lvu --format '{{ .Args }}'
[run -n birdy /usr/local/bin/start-notebook.sh --ip=0.0.0.0 --port=8888 --notebook-dir=/notebook_dir --SingleUserNotebookApp.default_url=/lab --debug --disable-user-config --NotebookApp.terminals_enabled=False --NotebookApp.shutdown_no_activity_timeout=180 --MappingKernelManager.cull_idle_timeout=180 --MappingKernelManager.cull_connected=True]
```
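A possible way to check which running single-user containers have already picked up the culling flags and which still need a restart (sketch; assumes the jupyter-<user> container naming shown above):

```sh
# Sketch: flag containers started before the culling settings were added.
for c in $(docker ps --format '{{.Names}}' | grep '^jupyter-'); do
  if docker inspect "$c" --format '{{ .Args }}' | grep -q 'cull_idle_timeout'; then
    echo "$c: culling enabled"
  else
    echo "$c: needs a restart to pick up the culling settings"
  fi
done
```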

## Changes

**Non-breaking changes**
- jupyterhub: allow config override via env.local

## Tests

Deployed to https://lvupavicsdev.ouranos.ca/jupyter (timeout set to 5 mins)
tlvu (Collaborator, Author) commented Jul 23, 2021

Just got this problem again today, July 23 2021, so I had to restart the docker daemon.

tlvu (Collaborator, Author) commented Jul 23, 2021

Documenting system status when this happened on July 23 2021: no CPU spike; below are some of the active containers:

All containers globally:
[Screenshot: Docker and system monitoring - Grafana, 2021-07-23 11:44]

Geoserver:
[Screenshot: geoserver - Docker and system monitoring - Grafana, 2021-07-23 11:45]

jupyter-labonte:
[Screenshot: jupyter-labonte - Docker and system monitoring - Grafana, 2021-07-23 11:46]

jupyter-lizee:
[Screenshot: jupyter-lizee - Docker and system monitoring - Grafana, 2021-07-23 11:46]

tlvu (Collaborator, Author) commented Nov 12, 2021

FYI @tlogan2000 @moulab88 Just got this problem again today, Nov 12 2021, so I had to restart the docker daemon.

Documenting system status when this happened:

No spike (CPU, memory) anywhere in the past 6 hours:

[Screenshot: 2021-11-12 10-11-46]

Memory spike a few days ago (jupyter-logan and jupyter-barbeau):

[Screenshot: 2021-11-12 10-17-50]

jupyter-logan memory spike between Nov 9 and Nov 11:

[Screenshot: 2021-11-12 10-22-44]

jupyter-barbeau memory spike between Nov 8 and Nov 9:

[Screenshot: 2021-11-12 10-23-00]

[Screenshot: 2021-11-12 10-19-19]

tlvu (Collaborator, Author) commented Jan 24, 2022

Documenting this issue again today, Monday 24 January 2022.

So this problem happened again:

```sh
$ docker ps
-bash: fork: retry: No child processes
-bash: fork: retry: No child processes
```

Checking the current open file count from another user, with sudo:

```sh
$ sudo lsof -u $PAVICS_USER | wc -l
[sudo] password for admin:
10361
```

Checking the current open file limit of $PAVICS_USER:

```sh
$ ulimit -n
40960
```

So either we have not hit the limit yet, or the command used to count the current open files is not complete.
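As a cross-check, one could count file descriptors directly from /proc instead of counting lsof output lines (lsof also lists memory-mapped files, the current directory and the binary itself, so its line count can differ from the real fd count); this is only a sketch:

```sh
# Sketch: sum the /proc/<pid>/fd entries of all PAVICS_USER processes.
total=0
for pid in $(pgrep -u "$PAVICS_USER"); do
  n=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
  total=$((total + n))
done
echo "open file descriptors for $PAVICS_USER: $total"
```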

@moulab88 any extra ideas?

@tlogan2000 FYI

tlvu (Collaborator, Author) commented Jan 24, 2022

I might have found it. There is probably another limit we need to bump: the "max user processes" limit.

Find the current number of threads for PAVICS_USER (this has to be done from another user, since no new command can be started as PAVICS_USER):

```sh
[admin ~]$ ps -eLf > ~/pseLf.txt

[admin ~]$ cat ~/pseLf.txt | grep $PAVICS_USER | wc -l
4509
```

Show all limits of PAVICS_USER; notice that "max user processes" is very close to the current number of threads above:

```sh
[PAVICS_USER ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 515196
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 40960
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
```

Double the "max user processes" for the PAVICS_USER:

```sh
[PAVICS_USER ~]$ ulimit -u 8192

[PAVICS_USER ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 515196
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 40960
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 8192
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
```

Confirm it works; commands can be started again:

```sh
[PAVICS_USER ~]$ docker ps | grep jupyter-
c846d88a11c6   pavics/workflow-tests:211123-update211216   "conda run -n birdy …"   28 minutes ago   Up 28 minutes   8888/tcp   jupyter-braun
ffb9031b5cd8   pavics/workflow-tests:211123-update211216   "conda run -n birdy …"   30 minutes ago   Up 30 minutes   8888/tcp   jupyter-tojik

(...)
```

tlvu (Collaborator, Author) commented Jan 24, 2022

@moulab88 made the "max user processes" change permanent across reboots:

```sh
$ cat /etc/security/limits.d/20-nproc.conf
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.

*               soft    nproc     4096
root            soft    nproc     unlimited
PAVICS_USER     soft    nproc     <higher limit than 4096>
PAVICS_USER     hard    nproc     <higher limit than 4096>
```

tlvu (Collaborator, Author) commented Jan 24, 2022

The Docker daemon has been restarted as well to ensure the new limits are effective, since there is no way to run `ulimit -u NEW_LIMIT` in each of the running containers.
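To confirm that the restarted daemon, and therefore newly started containers, picked up the new limits, one could inspect the dockerd process limits directly (sketch):

```sh
# Sketch: show the limits the running dockerd inherited after the restart;
# containers get these unless overridden (e.g. with a per-container --ulimit).
dockerd_pid=$(pidof dockerd)
grep -E 'Max (open files|processes)' "/proc/$dockerd_pid/limits"
```

dockerd also has a --default-ulimit option to set default ulimits for new containers; whether it would help here is left as an untested suggestion.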
