
Voila process is leaking threads #896

Closed
roman-kouzmenko opened this issue May 26, 2021 · 17 comments

@roman-kouzmenko commented May 26, 2021

Each time a notebook is accessed, Voila adds three threads that don't go away even after the kernels are culled, eventually exhausting all server resources.

Is anyone else observing the same behaviour, or does anyone have a fix?

@roman-kouzmenko (Author) commented May 27, 2021

Debugging this a bit, it seems these threads are client channels that are not being disposed of properly. Below is the output of lsof | grep [VOILA_PID] | grep TCP:

voila     146700                jovyan    4u     IPv4          753391854       0t0        TCP *:8813 (LISTEN)
voila     146700                jovyan   34u     IPv4          755101912       0t0        TCP localhost:59017->localhost:59017 (ESTABLISHED)
voila     146700                jovyan   47u     IPv4          755260227       0t0        TCP localhost:41223->localhost:41223 (ESTABLISHED)
voila     146700                jovyan   50u     IPv4          754614991       0t0        TCP localhost:49283->localhost:49283 (ESTABLISHED)
ZMQbg/Rea 146700 165411         jovyan    4u     IPv4          753391854       0t0        TCP *:8813 (LISTEN)
ZMQbg/Rea 146700 165411         jovyan   34u     IPv4          755101912       0t0        TCP localhost:59017->localhost:59017 (ESTABLISHED)
ZMQbg/Rea 146700 165411         jovyan   47u     IPv4          755260227       0t0        TCP localhost:41223->localhost:41223 (ESTABLISHED)
ZMQbg/Rea 146700 165411         jovyan   50u     IPv4          754614991       0t0        TCP localhost:49283->localhost:49283 (ESTABLISHED)
ZMQbg/IO/ 146700 165412         jovyan    4u     IPv4          753391854       0t0        TCP *:8813 (LISTEN)
ZMQbg/IO/ 146700 165412         jovyan   34u     IPv4          755101912       0t0        TCP localhost:59017->localhost:59017 (ESTABLISHED)
ZMQbg/IO/ 146700 165412         jovyan   47u     IPv4          755260227       0t0        TCP localhost:41223->localhost:41223 (ESTABLISHED)
ZMQbg/IO/ 146700 165412         jovyan   50u     IPv4          754614991       0t0        TCP localhost:49283->localhost:49283 (ESTABLISHED)
ZMQbg/Rea 146700 165413         jovyan    4u     IPv4          753391854       0t0        TCP *:8813 (LISTEN)
ZMQbg/Rea 146700 165413         jovyan   34u     IPv4          755101912       0t0        TCP localhost:59017->localhost:59017 (ESTABLISHED)
ZMQbg/Rea 146700 165413         jovyan   47u     IPv4          755260227       0t0        TCP localhost:41223->localhost:41223 (ESTABLISHED)
ZMQbg/Rea 146700 165413         jovyan   50u     IPv4          754614991       0t0        TCP localhost:49283->localhost:49283 (ESTABLISHED)
ZMQbg/IO/ 146700 165414         jovyan    4u     IPv4          753391854       0t0        TCP *:8813 (LISTEN)
ZMQbg/IO/ 146700 165414         jovyan   34u     IPv4          755101912       0t0        TCP localhost:59017->localhost:59017 (ESTABLISHED)
ZMQbg/IO/ 146700 165414         jovyan   47u     IPv4          755260227       0t0        TCP localhost:41223->localhost:41223 (ESTABLISHED)
ZMQbg/IO/ 146700 165414         jovyan   50u     IPv4          754614991       0t0        TCP localhost:49283->localhost:49283 (ESTABLISHED)
voila     146700 165415         jovyan    4u     IPv4          753391854       0t0        TCP *:8813 (LISTEN)
voila     146700 165415         jovyan   34u     IPv4          755101912       0t0        TCP localhost:59017->localhost:59017 (ESTABLISHED)
voila     146700 165415         jovyan   47u     IPv4          755260227       0t0        TCP localhost:41223->localhost:41223 (ESTABLISHED)
voila     146700 165415         jovyan   50u     IPv4          754614991       0t0        TCP localhost:49283->localhost:49283 (ESTABLISHED)
ZMQbg/Rea 146700 165417         jovyan    4u     IPv4          753391854       0t0        TCP *:8813 (LISTEN)
ZMQbg/Rea 146700 165417         jovyan   34u     IPv4          755101912       0t0        TCP localhost:59017->localhost:59017 (ESTABLISHED)
ZMQbg/Rea 146700 165417         jovyan   47u     IPv4          755260227       0t0        TCP localhost:41223->localhost:41223 (ESTABLISHED)
ZMQbg/Rea 146700 165417         jovyan   50u     IPv4          754614991       0t0        TCP localhost:49283->localhost:49283 (ESTABLISHED)
ZMQbg/IO/ 146700 165418         jovyan    4u     IPv4          753391854       0t0        TCP *:8813 (LISTEN)
ZMQbg/IO/ 146700 165418         jovyan   34u     IPv4          755101912       0t0        TCP localhost:59017->localhost:59017 (ESTABLISHED)
ZMQbg/IO/ 146700 165418         jovyan   47u     IPv4          755260227       0t0        TCP localhost:41223->localhost:41223 (ESTABLISHED)
ZMQbg/IO/ 146700 165418         jovyan   50u     IPv4          754614991       0t0        TCP localhost:49283->localhost:49283 (ESTABLISHED)
voila     146700 165419         jovyan    4u     IPv4          753391854       0t0        TCP *:8813 (LISTEN)
voila     146700 165419         jovyan   34u     IPv4          755101912       0t0        TCP localhost:59017->localhost:59017 (ESTABLISHED)
voila     146700 165419         jovyan   47u     IPv4          755260227       0t0        TCP localhost:41223->localhost:41223 (ESTABLISHED)
voila     146700 165419         jovyan   50u     IPv4          754614991       0t0        TCP localhost:49283->localhost:49283 (ESTABLISHED)
ZMQbg/Rea 146700 165518         jovyan    4u     IPv4          753391854       0t0        TCP *:8813 (LISTEN)
ZMQbg/Rea 146700 165518         jovyan   34u     IPv4          755101912       0t0        TCP localhost:59017->localhost:59017 (ESTABLISHED)
ZMQbg/Rea 146700 165518         jovyan   47u     IPv4          755260227       0t0        TCP localhost:41223->localhost:41223 (ESTABLISHED)
ZMQbg/Rea 146700 165518         jovyan   50u     IPv4          754614991       0t0        TCP localhost:49283->localhost:49283 (ESTABLISHED)
ZMQbg/IO/ 146700 165519         jovyan    4u     IPv4          753391854       0t0        TCP *:8813 (LISTEN)
ZMQbg/IO/ 146700 165519         jovyan   34u     IPv4          755101912       0t0        TCP localhost:59017->localhost:59017 (ESTABLISHED)
ZMQbg/IO/ 146700 165519         jovyan   47u     IPv4          755260227       0t0        TCP localhost:41223->localhost:41223 (ESTABLISHED)
ZMQbg/IO/ 146700 165519         jovyan   50u     IPv4          754614991       0t0        TCP localhost:49283->localhost:49283 (ESTABLISHED)
voila     146700 165520         jovyan    4u     IPv4          753391854       0t0        TCP *:8813 (LISTEN)
voila     146700 165520         jovyan   34u     IPv4          755101912       0t0        TCP localhost:59017->localhost:59017 (ESTABLISHED)
voila     146700 165520         jovyan   47u     IPv4          755260227       0t0        TCP localhost:41223->localhost:41223 (ESTABLISHED)
voila     146700 165520         jovyan   50u     IPv4          754614991       0t0        TCP localhost:49283->localhost:49283 (ESTABLISHED)

@clydebw commented May 28, 2021

@roman-kouzmenko I think you're right that these are client channels not getting cleaned up.

This problem of leaking threads was pointed out to me this morning, and I spent some time today trying to track this down as well. It seems two threads get created here:

kernel_id = await ensure_async(self.kernel_manager.start_kernel(

I think these stick around for the app's entire lifecycle, though, and so are not the problem. However, three threads get added each time the notebook is accessed here, and these don't seem to be cleaned up:

await ensure_async(self.executor.kc.start_channels())

I'm going to keep looking, but any help from someone more familiar with the codebase would be much appreciated.
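
Not Voila's actual code, but a minimal sketch with jupyter_client (the kernel name, the scaffolding, and the thread counting are illustrative assumptions) of the pairing one would expect: start_channels() is what brings the extra channel threads up, and stop_channels() plus shutting the kernel down is what should release them:

import asyncio
import threading

from jupyter_client.manager import AsyncKernelManager


async def run_once():
    # Assumes a default "python3" kernelspec (ipykernel) is installed.
    km = AsyncKernelManager()
    await km.start_kernel()
    kc = km.client()
    kc.start_channels()  # extra channel threads (e.g. the heartbeat) appear here
    print("python-level threads after start_channels:", threading.active_count())
    try:
        pass  # ... execute the notebook / serve the request ...
    finally:
        # Without these two calls the channel threads (and the kernel) are leaked.
        kc.stop_channels()
        await km.shutdown_kernel(now=True)
    print("python-level threads after cleanup:", threading.active_count())


asyncio.run(run_once())

Note that threading.active_count() only sees Python-level threads; the ZMQbg/* threads in the lsof output above are libzmq I/O threads and show up under ps -eLf instead.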

@roman-kouzmenko (Author) commented Jun 2, 2021

FYI, I mitigated the leak on Kubernetes by:

  1. creating a liveness probe that counts the number of leaked threads and marks the voila container as unhealthy above a certain threshold (a sketch of such a probe follows this list)
  2. increasing the termination grace period to a large value to avoid shutting down active connections immediately
  3. adding a preStop hook that counts active kernels and waits for them to finish; if they do not terminate before the grace period from 2), they get terminated automatically
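
For illustration only (not the actual probe; the process matching, the threshold, and the exec-probe wiring are assumptions), such a liveness check could be a small script that counts the threads of the voila process via /proc and exits non-zero above a threshold:

#!/usr/bin/env python3
# Hypothetical liveness probe: report failure once the voila process
# accumulates too many threads. Assumes Linux (/proc) and a single
# voila process in the container.
import os
import pathlib
import sys

THREAD_LIMIT = 100  # illustrative threshold


def find_voila_pid():
    for entry in pathlib.Path("/proc").iterdir():
        if not entry.name.isdigit() or entry.name == str(os.getpid()):
            continue
        try:
            cmdline = (entry / "cmdline").read_bytes().replace(b"\0", b" ")
        except OSError:
            continue
        if b"voila" in cmdline:
            return entry.name
    return None


def thread_count(pid):
    # Every entry under /proc/<pid>/task is one thread, including the ZMQbg/* ones.
    return len(list(pathlib.Path(f"/proc/{pid}/task").iterdir()))


pid = find_voila_pid()
if pid is None:
    sys.exit(1)  # no voila process at all -> unhealthy
count = thread_count(pid)
print(f"voila pid {pid}: {count} threads")
sys.exit(0 if count <= THREAD_LIMIT else 1)

Kubernetes would run this as an exec livenessProbe and restart the container once the script starts failing.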

@dou-du commented Jul 6, 2021

We are running a Voila app with dokku. Voila keeps eating up the CPUs, and we have to restart our dokku server every few days. I believe it is caused by this problem. Could anyone fix this issue? Thanks.

@ltalirz commented Jul 7, 2021

This is a longstanding issue; see the closely related issues:
#209
#479
#849

Some work on this has started, though; see the PRs linked in #849 (comment)

@giovannipizzi commented:

I would like to report the same issue.
We are running voila inside Docker.
After a few hours of the website being online, if I grep all "Kernel started" and "Kernel shutdown" messages from the log file, every kernel has been shut down, either because its web socket was closed or by culling.
However, if I run ps -eLf | grep voil[a] | wc -l I get more than 400 threads. The (single) voila process is constantly at 50% CPU usage, and the number of threads keeps increasing with every new connection to the website (which, with so many leaked threads, is unresponsive or very slow).

The only solution for us at the moment is to restart the Docker container, but we are now at the point where we need to do it twice a day :-(

I wanted to ask whether any work is being done to address this issue, and whether you already have a timeline for solving it: we rely on voila a lot, but this issue is making our service unreliable and we might need to switch technology. (I hope that won't be needed, as we really like the work being done on voila! :-) )

@choldgraf (Contributor) commented Sep 7, 2021

Just a quick thought here - I wonder why this is not something that the Voila Gallery has run into: https://voila-gallery.org/ (repo here: https://github.com/voila-gallery/gallery)

That is (I believe) a persistent JupyterHub serving Voila dashboards via Jupyter servers. If they are not running into this "runaway threads" issue, then perhaps there is a workaround?

I guess another question is whether this is an issue with Voila itself, or depends on something deeper in the Jupyter machinery.

edit: ahh, I might be mistaken - I think the Voila Gallery may use BinderHub, in which case it's probably shutting down the whole pod after inactivity.

@dou-du commented Sep 7, 2021

> Just a quick thought here - I wonder why this is not something that the Voila Gallery has run into: https://voila-gallery.org/ (repo here: https://github.com/voila-gallery/gallery)
>
> That is (I believe) a persistent JupyterHub serving Voila dashboards via Jupyter servers. If they are not running into this "runaway threads" issue, then perhaps there is a workaround?
>
> I guess another question is whether this is an issue with Voila itself, or depends on something deeper in the Jupyter machinery.

Voila can be run in two ways. One is via the Jupyter server extension, by adding "/voila/render/" to the URL. On the dokku server, we run the standalone voila program from the terminal (voila --template=osscar --enable_nbextensions=True mynotebook.ipynb). I think the problem comes from running the standalone voila program.

@choldgraf (Contributor) commented:

Interesting - so this doesn't occur when running Voila via an already-running Jupyter Server, but it does occur when you run it via the command line?

@dou-du commented Sep 7, 2021

> Interesting - so this doesn't occur when running Voila via an already-running Jupyter Server, but it does occur when you run it via the command line?

I am not sure about the voila Jupyter server extension, but I call the server extension a lot every day and have not experienced any CPU overload related to voila. Our dokku app's problem is definitely from the standalone voila program, since we call it in the "Procfile".

@jtpio (Member) commented Sep 9, 2021

> edit: ahh, I might be mistaken - I think the Voila Gallery may use BinderHub, in which case it's probably shutting down the whole pod after inactivity.

Yes, that's right. It used to be a JupyterHub running on a single instance, but it switched to using BinderHub as a backend so the gallery can be deployed as a static site on GitHub Pages: voila-dashboards/tljh-voila-gallery#83

@martinRenou (Member) commented:

Hopefully this is fixed now in Voila 0.2.14. Feel free to re-open the issue if you can still reproduce.

@dou-du commented Sep 27, 2021

We tested Voila 0.2.14 for one app on our dokku server, and the new Voila performs much better.
However, it seems the problem still exists: CPU usage grows from 0% to about 5% in one week,
whereas the older version of Voila would reach 120% in a week. The data are shown in the figure below; one can clearly see a slight linear increase in CPU usage.
[Figure 2: CPU usage of the app over one week]

@martinRenou (Member) commented:

Do you have kernel culling enabled? If so, what is your culling config?

@dou-du commented Sep 28, 2021

> Do you have kernel culling enabled? If so, what is your culling config?

Here is my voila configuration:

{
  "VoilaConfiguration": {
    "enable_nbextensions": true,
    "template": "osscar",
    "strip_sources": true
  },
  "VoilaExecutePreprocessor": {
    "timeout": 180
  },
  "NotebookApp": {
    "shutdown_no_activity_timeout": 60
  },
  "MappingKernelManager": {
    "cull_idle_timeout": 900,
    "cull_interval": 60,
    "cull_busy": true
  }
}

@martinRenou (Member) commented:

Would setting "cull_connected": true be possible for you? It might help in case users keep the tab open and never reboot their machine.
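
For illustration, assuming the same JSON configuration file as above, the culling section would then look something like:

{
  "MappingKernelManager": {
    "cull_idle_timeout": 900,
    "cull_interval": 60,
    "cull_busy": true,
    "cull_connected": true
  }
}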

@martinRenou (Member) commented:

It's definitely possible that we missed another issue in Voila, though.
