
Idle Jupyter server cause open file limit exhaustion #67

Closed
tlvu opened this issue Sep 3, 2020 · 15 comments · Fixed by #177
Assignees
Labels
bug Something isn't working

Comments

tlvu (Collaborator) commented Sep 3, 2020

This morning, none of the docker commands were responding because we had exhausted the open file limit for the user that runs the PAVICS platform (the user that does `./pavics-compose.sh up -d`).

```sh
$ ./pavics-compose.sh ps
-bash: fork: retry: No child processes
-bash: fork: retry: No child processes
-bash: fork: retry: No child processes
-bash: fork: retry: No child processes
-bash: fork: Resource temporarily unavailable
```

The immediate work-around is to increase the soft and hard nofile limits for the corresponding user in the /etc/security/limits.conf file and apply the new limit immediately with `ulimit -n NEW_LIMIT` (`ulimit -n` shows the current effective limit). Find the current number of open files with `sudo lsof -u $USER | wc -l` and put something higher than that in the limits.conf file. Reference: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/tuning_and_optimizing_red_hat_enterprise_linux_for_oracle_9i_and_10g_databases/chap-oracle_9i_and_10g_tuning_guide-setting_shell_limits_for_the_oracle_user
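For illustration, a minimal sketch of those steps; the account name and the 65536 value are placeholders, pick something higher than the current lsof count:

```sh
# Count the files currently open by the PAVICS user (placeholder account name):
sudo lsof -u PAVICS_USER | wc -l

# Raise the limits in /etc/security/limits.conf (placeholder value):
#   PAVICS_USER  soft  nofile  65536
#   PAVICS_USER  hard  nofile  65536

# Apply the new limit in the current shell and confirm it:
ulimit -n 65536
ulimit -n
```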

Then, using a different user than the one that usually runs PAVICS (because that user cannot do anything else), restart the docker daemon (`sudo systemctl restart docker`). Back as the regular user running PAVICS, if the containers have problems restarting, destroy them and re-create them from scratch (`./pavics-compose.sh down && sleep 10 && ./pavics-compose.sh up -d`).

For a more permanent solution than increasing the limit each time we burst it, set up culling of idle Jupyter servers as described here: https://discourse.jupyter.org/t/jupyterhub-doesnt-kill-processes-and-threads-when-notebooks-are-closed-or-user-log-out/2244/2
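Concretely, the culling described in that post boils down to a few options passed to the single-user notebook server; the thresholds below are illustrative placeholders, not values we have settled on:

```sh
# Illustrative culling options for the single-user Jupyter server:
#   shutdown_no_activity_timeout  stop the whole server after 1 h without activity
#   cull_idle_timeout             cull kernels idle for more than 1 h
#   cull_connected                cull kernels even if a browser tab is still connected
start-notebook.sh \
  --NotebookApp.shutdown_no_activity_timeout=3600 \
  --MappingKernelManager.cull_idle_timeout=3600 \
  --MappingKernelManager.cull_connected=True
```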

Also need to add monitoring of the open file count so we are alerted in advance of near exhaustion, to avoid having to restart the entire docker daemon.

Ping @moulab88 @tlogan2000 if you guys have anything to add.

Edit:

  • add command to see current effective limit and set new limit immediately
  • add reference to redhat docs
tlogan2000 (Collaborator) commented

The multilevel culling control could be interesting (assuming I understand it correctly!). It seems it would be possible to cull individual user kernels that are idle for, say, a few hours (level 1), and also entire user servers after being idle for a few days or even a week (level 2). I can't speak for all deployments, but on our end I think we want to give users the chance to run 'long' calculations (level 2 threshold of up to a week) while still catching notebook kernels that are left open accidentally.
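A sketch of what that two-level setup could look like with the same single-user server options; the thresholds (3 hours for kernels, 7 days for the whole server) are placeholders, not a decision:

```sh
# Hypothetical two-level culling:
#   level 1: cull idle kernels after 3 hours (10800 s)
#   level 2: shut down the whole single-user server after 7 idle days (604800 s)
start-notebook.sh \
  --MappingKernelManager.cull_idle_timeout=10800 \
  --NotebookApp.shutdown_no_activity_timeout=604800
```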

tlvu changed the title from "Idle Jupyter server caused open file limit exhaustion" to "Idle Jupyter server cause open file limit exhaustion" on Sep 18, 2020
tlvu (Collaborator, Author) commented Sep 18, 2020

Duplicated by Ouranosinc/pavics-sdi#158.

tlvu (Collaborator, Author) commented Sep 26, 2020

The list of built-in metrics exposed by https://github.com/prometheus/node_exporter does not include the open file metric we need.
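One possible workaround, assuming node_exporter's textfile collector is enabled (--collector.textfile.directory): a small cron script that counts the PAVICS user's open files and writes a custom gauge for Prometheus to alert on. The metric name, account name and paths below are hypothetical.

```sh
#!/bin/sh
# Hypothetical cron script: export the PAVICS user's open-file count as a
# Prometheus gauge through node_exporter's textfile collector.
PAVICS_USER=pavics                               # placeholder account name
TEXTFILE_DIR=/var/lib/node_exporter/textfile     # placeholder collector dir

count=$(lsof -u "$PAVICS_USER" 2>/dev/null | wc -l)

# Write to a temporary file then rename, so node_exporter never reads a partial file.
cat > "$TEXTFILE_DIR/pavics_open_files.prom.$$" <<EOF
# HELP pavics_user_open_files Open files held by the PAVICS user.
# TYPE pavics_user_open_files gauge
pavics_user_open_files $count
EOF
mv "$TEXTFILE_DIR/pavics_open_files.prom.$$" "$TEXTFILE_DIR/pavics_open_files.prom"
```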

tlvu (Collaborator, Author) commented Sep 30, 2020

Triggered this issue again during Ouranosinc/raven#251, with 25 bogus users logged in, each running 1 notebook and staying logged in for 24 hours.

A weird observation: closing the Jupyter servers of some (6) of those 25 test users did not release the open files.

fmigneault added the bug label on Jan 22, 2021
tlvu (Collaborator, Author) commented Feb 10, 2021

@tlogan2000 Just got this problem again, so I had to restart the docker daemon. You might hear complaints from Jupyter users.

tlvu (Collaborator, Author) commented Feb 22, 2021

@tlogan2000 Just got this problem again, so I had to restart the docker daemon. You might hear complaints from Jupyter users.

tlvu (Collaborator, Author) commented Mar 26, 2021

@tlogan2000 Just got this problem again today, March 26 2021, so I had to restart the docker daemon. You might hear complaints from Jupyter users.

Once the other pending tasks on my plate are done, I would very much like to attack this one.

tlvu (Collaborator, Author) commented Jun 7, 2021

@tlogan2000 Just got this problem again today, June 7 2021, so I had to restart the docker daemon. You might hear complaints from Jupyter users.

tlvu added a commit that referenced this issue Jun 9, 2021
This is basically the same as `ENABLE_JUPYTERHUB_MULTI_NOTEBOOKS` but at
the bottom of the file so it can override everything.
`ENABLE_JUPYTERHUB_MULTI_NOTEBOOKS` is kept for backward-compat.

First useful application is to enable server culling for auto shutdown
of idle kernels and jupyter single server (fixes #67).
tlvu closed this as completed in #177 on Jun 10, 2021
tlvu added a commit that referenced this issue Jun 10, 2021
jupyterhub: allow config override via env.local

## Overview

This is basically the same as `ENABLE_JUPYTERHUB_MULTI_NOTEBOOKS` but at the bottom of the file so it can override everything.

`ENABLE_JUPYTERHUB_MULTI_NOTEBOOKS` is kept for backward-compat.

The first useful application is to enable server culling for auto-shutdown of idle kernels and the idle Jupyter single-user server, which hopefully fixes #67.

The culling settings will only take effect the next time users restart their personal Jupyter server, because it seems the Jupyter server is the one culling itself. JupyterHub does not perform the culling; it simply forwards the culling settings to the Jupyter server.

```sh
$ docker inspect jupyter-lvu --format '{{ .Args }}'
[run -n birdy /usr/local/bin/start-notebook.sh --ip=0.0.0.0 --port=8888 --notebook-dir=/notebook_dir --SingleUserNotebookApp.default_url=/lab --debug --disable-user-config --NotebookApp.terminals_enabled=False --NotebookApp.shutdown_no_activity_timeout=180 --MappingKernelManager.cull_idle_timeout=180 --MappingKernelManager.cull_connected=True]
```
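A possible way to check which running single-user containers have already picked up the culling flags and which still need a restart (sketch; assumes the jupyter-<user> container naming shown above):

```sh
# Sketch: flag containers started before the culling settings were added.
for c in $(docker ps --format '{{.Names}}' | grep '^jupyter-'); do
  if docker inspect "$c" --format '{{ .Args }}' | grep -q 'cull_idle_timeout'; then
    echo "$c: culling enabled"
  else
    echo "$c: needs a restart to pick up the culling settings"
  fi
done
```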

## Changes

**Non-breaking changes**
- jupyterhub: allow config override via env.local

## Tests

Deployed to https://lvupavicsdev.ouranos.ca/jupyter (timeout set to 5 mins)
tlvu (Collaborator, Author) commented Jul 23, 2021

Just got this problem again today, July 23 2021, so I had to restart the docker daemon.

tlvu (Collaborator, Author) commented Jul 23, 2021

Documenting system status when this happened on July 23 2021: no CPU spike; below are some of the active containers:

All containers globally:
[Screenshot: Docker and system monitoring - Grafana, 2021-07-23 11:44]

Geoserver:
[Screenshot: geoserver - Docker and system monitoring - Grafana, 2021-07-23 11:45]

jupyter-labonte:
[Screenshot: jupyter-labonte - Docker and system monitoring - Grafana, 2021-07-23 11:46]

jupyter-lizee:
[Screenshot: jupyter-lizee - Docker and system monitoring - Grafana, 2021-07-23 11:46]

tlvu (Collaborator, Author) commented Nov 12, 2021

FYI @tlogan2000 @moulab88 Just got this problem again today, Nov 12 2021, so I had to restart the docker daemon.

Documenting system status when this happened:

No spike (CPU, memory) anywhere in the past 6 hours:

[Screenshot: 2021-11-12 10-11-46]

Memory spike a few days ago (jupyter-logan and jupyter-barbeau):

[Screenshot: 2021-11-12 10-17-50]

jupyter-logan memory spike between Nov 9 and Nov 11:

[Screenshot: 2021-11-12 10-22-44]

jupyter-barbeau memory spike between Nov 8 and Nov 9:

[Screenshot: 2021-11-12 10-23-00]

[Screenshot: 2021-11-12 10-19-19]

tlvu (Collaborator, Author) commented Jan 24, 2022

Documenting this issue again today, Monday 24 January 2022.

So this problem happened again:

```sh
$ docker ps
-bash: fork: retry: No child processes
-bash: fork: retry: No child processes
```

Checking the current open file count from another user, with sudo:

```sh
$ sudo lsof -u $PAVICS_USER | wc -l
[sudo] password for admin:
10361
```

Checking the current open file limit of $PAVICS_USER:

```sh
$ ulimit -n
40960
```

So either we have not hit the limit yet, or the command used to count the current open files is not complete.
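As a cross-check, one could count file descriptors directly from /proc instead of counting lsof output lines (lsof also lists memory-mapped files, the current directory and the binary itself, so its line count can differ from the real fd count); this is only a sketch:

```sh
# Sketch: sum the /proc/<pid>/fd entries of all PAVICS_USER processes.
total=0
for pid in $(pgrep -u "$PAVICS_USER"); do
  n=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
  total=$((total + n))
done
echo "open file descriptors for $PAVICS_USER: $total"
```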

@moulab88 any extra ideas?

@tlogan2000 FYI

tlvu (Collaborator, Author) commented Jan 24, 2022

I might have found it. There is probably another limit we need to bump: the "max user processes" limit.

Find the current number of threads for PAVICS_USER (this has to be done from another user, since no new command can be started as PAVICS_USER):

```sh
[admin ~]$ ps -eLf > ~/pseLf.txt

[admin ~]$ cat ~/pseLf.txt | grep $PAVICS_USER | wc -l
4509
```

Show all limits of PAVICS_USER; notice that "max user processes" is very close to the current number of threads above:

```sh
[PAVICS_USER ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 515196
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 40960
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
```

Double the "max user processes" for the PAVICS_USER:

```sh
[PAVICS_USER ~]$ ulimit -u 8192

[PAVICS_USER ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 515196
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 40960
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 8192
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
```

Confirm it works; commands can be started again:

```sh
[PAVICS_USER ~]$ docker ps | grep jupyter-
c846d88a11c6   pavics/workflow-tests:211123-update211216   "conda run -n birdy …"   28 minutes ago   Up 28 minutes   8888/tcp   jupyter-braun
ffb9031b5cd8   pavics/workflow-tests:211123-update211216   "conda run -n birdy …"   30 minutes ago   Up 30 minutes   8888/tcp   jupyter-tojik

(...)
```

tlvu (Collaborator, Author) commented Jan 24, 2022

@moulab88 made the "max user processes" change permanent across reboots:

```sh
$ cat /etc/security/limits.d/20-nproc.conf
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.

*               soft    nproc     4096
root            soft    nproc     unlimited
PAVICS_USER     soft    nproc     <higher limit than 4096>
PAVICS_USER     hard    nproc     <higher limit than 4096>
```

tlvu (Collaborator, Author) commented Jan 24, 2022

The Docker daemon has been restarted as well to ensure the new limits are effective, since there is no way to run `ulimit -u NEW_LIMIT` in each of the running containers.
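To confirm that the restarted daemon, and therefore newly started containers, picked up the new limits, one could inspect the dockerd process limits directly (sketch):

```sh
# Sketch: show the limits the running dockerd inherited after the restart;
# containers get these unless overridden (e.g. with a per-container --ulimit).
dockerd_pid=$(pidof dockerd)
grep -E 'Max (open files|processes)' "/proc/$dockerd_pid/limits"
```

dockerd also has a --default-ulimit option to set default ulimits for new containers; whether it would help here is left as an untested suggestion.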
