Idle Jupyter servers cause open file limit exhaustion #67
The multilevel culling control could be interesting (as long as I understand it correctly!). It seems like it would be possible to cull individual user kernels that are idle for, say, a few hours (level 1), and also entire user servers after being idle for maybe a few days or even a week (level 2). I can't speak for all deployments, but on our end I think we want to give users the chance to run long calculations (level 2 threshold of up to a week) while still being able to track notebook kernels that are left open accidentally.
Duplicated by Ouranosinc/pavics-sdi#158.
The list of built-in metrics exposed by https://github.com/prometheus/node_exporter does not include the open-file metric we need.
Triggered this issue again during Ouranosinc/raven#251, with 25 bogus users logged in, each running 1 notebook and staying logged in for 24 hours. A weird detail: closing the Jupyter servers for some (6) of those 25 test users did not release the open files.
@tlogan2000 Just got this problem again so had to restart the docker daemon. You might hear complaints from Jupyter users.
@tlogan2000 Just got this problem again today, March 26 2021, so had to restart the docker daemon. You might hear complaints from Jupyter users. After all the other pending tasks on my plate, I would very much like to attack this one.
@tlogan2000 Just got this problem again today, June 7 2021, so had to restart the docker daemon. You might hear complaints from Jupyter users.
This is basically the same as `ENABLE_JUPYTERHUB_MULTI_NOTEBOOKS` but placed at the bottom of the file so it can override everything. `ENABLE_JUPYTERHUB_MULTI_NOTEBOOKS` is kept for backward compatibility. The first useful application is to enable server culling for auto shutdown of idle kernels and of the idle Jupyter single-user server (fixes #67).
jupyterhub: allow config override via env.local

## Overview

This is basically the same as `ENABLE_JUPYTERHUB_MULTI_NOTEBOOKS` but placed at the bottom of the file so it can override everything. `ENABLE_JUPYTERHUB_MULTI_NOTEBOOKS` is kept for backward compatibility. The first useful application is to enable server culling for auto shutdown of idle kernels and of the idle Jupyter single-user server, which hopefully fixes #67.

The culling settings will only take effect the next time users restart their personal Jupyter server, because it seems the Jupyter server is the one culling itself. JupyterHub does not perform the culling; it simply forwards the culling settings to the Jupyter server.

```sh
$ docker inspect jupyter-lvu --format '{{ .Args }}'
[run -n birdy /usr/local/bin/start-notebook.sh --ip=0.0.0.0 --port=8888 --notebook-dir=/notebook_dir --SingleUserNotebookApp.default_url=/lab --debug --disable-user-config --NotebookApp.terminals_enabled=False --NotebookApp.shutdown_no_activity_timeout=180 --MappingKernelManager.cull_idle_timeout=180 --MappingKernelManager.cull_connected=True]
```

## Changes

**Non-breaking changes**
- jupyterhub: allow config override via env.local

## Tests

Deployed to https://lvupavicsdev.ouranos.ca/jupyter (timeout set to 5 mins)
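A quick way to check whether a given user's running server has already picked up the new settings (a sketch; `jupyter-<username>` is a placeholder following the container naming pattern shown above):

```sh
# List only the culling-related arguments of a user's single-user container;
# an empty result suggests that server was started before the config change.
docker inspect jupyter-<username> --format '{{ .Args }}' \
  | tr ' ' '\n' \
  | grep -E 'cull|shutdown_no_activity'
```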
Just got this problem again today, July 23 2021, so had to restart the docker daemon.
FYI @tlogan2000 @moulab88 Just got this problem again today, Nov 12 2021, so had to restart the docker daemon. Documenting the system status when this happened (screenshots not captured here):
- No spike (CPU, memory) anywhere in the past 6 hours.
- Memory spikes a few days ago (jupyter-logan and jupyter-barbeau):
  - jupyter-logan memory spike between Nov 9 and Nov 11
  - jupyter-barbeau memory spike between Nov 8 and Nov 9
Documenting this issue again today, Monday 24 January 2022. So this problem happened again.
Checking the current open files from another user with sudo:
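(The original output was not captured here; a sketch of the check, reusing the `lsof` count from the issue description, with `PAVICS_USER` assumed to hold the account name:)

```sh
# Count open files for the PAVICS user, run from another account with sudo.
sudo lsof -u "$PAVICS_USER" | wc -l
```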
Checking the current limits of $PAVICS_USER:
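(Output missing; a plausible sketch, run from a shell already owned by $PAVICS_USER:)

```sh
# Show all effective soft limits of the current shell;
# `ulimit -Hn` would show the hard open-files limit specifically.
ulimit -a
```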
So we have not busted the limit yet, or the command to find the current open files is not complete. @moulab88 any extra ideas? @tlogan2000 FYI
I might have found it. There is probably another limit, the number of "max user processes", that we need to bump. Find the current number of threads for the PAVICS_USER (need to use another user, since no command can be started as the PAVICS_USER):
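(Output missing; a sketch of one way to count the threads, with `PAVICS_USER` assumed to hold the account name:)

```sh
# Count all threads (one line per LWP) owned by the PAVICS user,
# run from another account.
ps --no-headers -L -u "$PAVICS_USER" | wc -l
```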
Show all limits of the PAVICS_USER; notice the "max user processes" is very close to the current number of threads above:
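(Output missing; a sketch that avoids starting a command as the PAVICS user by reading the limits of one of its already-running processes:)

```sh
# Pick any PID owned by the PAVICS user and dump its limits,
# including "Max processes" (the "max user processes" limit).
cat /proc/"$(pgrep -u "$PAVICS_USER" | head -n 1)"/limits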
Double the "max user processes" for the PAVICS_USER:
Confirm it works; a new command can now be started:
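(Output missing; any trivial command that forks works as a check, for example:)

```sh
# From the PAVICS user's existing shell, this should now succeed instead of
# failing with "fork: retry: Resource temporarily unavailable".
ls /tmp > /dev/null && echo "fork OK"
```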
@moulab88 made the "max user processes" change persistent across reboots:
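(The exact change was not captured here; a sketch of how it could be persisted via `/etc/security/limits.conf`, with example values:)

```sh
# Add nproc ("max user processes") entries for the PAVICS user so the
# higher limit survives reboots; 8192 is an illustrative value.
echo "$PAVICS_USER soft nproc 8192" | sudo tee -a /etc/security/limits.conf
echo "$PAVICS_USER hard nproc 8192" | sudo tee -a /etc/security/limits.conf
```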
The docker daemon has been restarted as well, to ensure the new limits are effective, since there is no way to run
This morning, none of the docker commands was responding because we have exhausted the open file limit on the user that runs the PAVICS platform (the user that does `./pavics-compose.sh up -d`).

The immediate work-around is to increase the soft and hard nofile limit for the corresponding user in the `/etc/security/limits.conf` file and apply that new limit immediately with `ulimit -n NEW_LIMIT` (`ulimit -n` shows the current effective limit). Find the current number of open files this way: `sudo lsof -u $USER|wc -l`, and put something higher than that in the `limits.conf` file. Reference: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/tuning_and_optimizing_red_hat_enterprise_linux_for_oracle_9i_and_10g_databases/chap-oracle_9i_and_10g_tuning_guide-setting_shell_limits_for_the_oracle_user

Then, using a different user than the user that usually runs PAVICS (because that user can not do anything else), restart the docker daemon (`sudo systemctl restart docker`). Back to the regular user running PAVICS: if the containers have problems restarting, destroy them and re-create them from scratch (`./pavics-compose.sh down && sleep 10 && ./pavics-compose.sh up -d`).
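The same work-around, collected into one place for quick reference (`NEW_LIMIT` is a placeholder, as above):

```sh
ulimit -n                          # current effective open-files (soft) limit
sudo lsof -u $USER | wc -l         # current number of open files for this user
# Raise the soft and hard nofile limits in /etc/security/limits.conf above that count, then:
ulimit -n NEW_LIMIT                # apply the new limit in the current shell
sudo systemctl restart docker      # run from a different user account
./pavics-compose.sh down && sleep 10 && ./pavics-compose.sh up -d   # only if containers fail to restart
```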
A more permanent solution, rather than increasing the limit each time we exhaust it, is to set up culling of idle Jupyter servers as described here: https://discourse.jupyter.org/t/jupyterhub-doesnt-kill-processes-and-threads-when-notebooks-are-closed-or-user-log-out/2244/2
We also need to add monitoring for the open file limit so we are alerted in advance of near exhaustion, to avoid having to restart the entire docker daemon.
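One possible way to get that alerting, assuming node_exporter's textfile collector is enabled (the metric name, output path, and `PAVICS_USER` variable below are illustrative, not existing configuration):

```sh
# Cron-able sketch: export the PAVICS user's open-file count as a Prometheus
# metric via node_exporter's textfile collector, so an alert rule can compare
# it against the nofile limit.
OUT=/var/lib/node_exporter/textfile_collector/pavics_open_files.prom
COUNT=$(sudo lsof -u "$PAVICS_USER" | wc -l)
# Write atomically: build in a temp file, then rename over the target.
echo "pavics_user_open_files $COUNT" > "$OUT.$$" && mv "$OUT.$$" "$OUT"
```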
Ping @moulab88 @tlogan2000 if you guys have anything to add.