
Kernel not properly shutdown after NotebookApp receives SIGTERM. #606

Closed
jameskoch opened this issue Jan 15, 2021 · 7 comments · Fixed by #607

@jameskoch
Member

Describe the bug
Kernel not properly shut down after NotebookApp receives SIGTERM.

I am seeing this w/ the Almond Scala kernel, w/ the Notebook stack running in an automated testing environment. It's an atypical environment where many things other than Notebook could be at fault. Still, I wanted to file something in case others see it.

I'm sending SIGTERM to the Notebook server to shut things down after testing. NotebookApp handles the signal and shuts down the kernel. Then, often but not always, KernelRestarter kicks in and starts it right back up.

Logs:

[I 10:17:24.628 NotebookApp] Starting buffering for 1e26f233-b160-4b1e-b6f1-ebbe2a9a56e6:749d9fd9-8d9c-45f4-8f33-e17cdc78a0d5
[D 10:17:24.628 NotebookApp] Clearing buffer for 1e26f233-b160-4b1e-b6f1-ebbe2a9a56e6
[C 10:17:25.034 NotebookApp] received signal 15, stopping
[I 10:17:25.035 NotebookApp] Shutting down 1 kernel
[I 10:17:25.436 NotebookApp] Kernel shutdown: 1e26f233-b160-4b1e-b6f1-ebbe2a9a56e6
[I 10:17:25.436 NotebookApp] Shutting down 0 terminals
[I 10:17:25.437 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
[D 10:17:25.437 NotebookApp] Starting kernel: ['...', '--connection-file', '/home/user/.local/share/jupyter/runtime/kernel-1e26f233-b160-4b1e-b6f1-ebbe2a9a56e6.json']
[D 10:17:25.440 NotebookApp] Connecting to: tcp://127.0.0.1:38573

My kernel logs that it receives shutdown_request, sends shutdown_reply, and then does some final cleanup. I don't have exact timestamps to speak to timeliness, nor firm evidence that the process actually terminates.

Versions:

jupyter core     : 4.6.3
jupyter-notebook : 6.1.4
qtconsole        : not installed
ipython          : 7.16.1
ipykernel        : 5.2.1
jupyter client   : 6.1.7
jupyter lab      : not installed
nbconvert        : 5.6.1
ipywidgets       : not installed
nbformat         : 5.0.7
traitlets        : 4.3.3
@kevin-bates
Member

Hi @jameskoch - first, thank you for the great description of the issue!

This smells like a timing issue, given that the kernel shutdown and the restart are logged a millisecond apart. I don't know the intimate details of tornado's PeriodicCallback, but the first thing shutdown_kernel does is stop the restarter (which is a PeriodicCallback instance). It then sends the shutdown request message, followed by any necessary (and explicit) termination of the process.

I believe the kernel process is shut down; the problem is that the restarter's first action is to check whether the kernel is alive, and it finds (via a call to Popen.poll()) that it is not. The restarter therefore concludes that the kernel terminated in error and restarts it.
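
For illustration, here's a minimal sketch of that check, using hypothetical class and attribute names rather than the actual jupyter_client KernelRestarter:

```python
from tornado.ioloop import PeriodicCallback


class SketchRestarter:
    """Illustrative only; not the real jupyter_client KernelRestarter."""

    def __init__(self, kernel_manager, interval_ms=3000):
        self.kernel_manager = kernel_manager
        # The real restarter is likewise driven by a tornado PeriodicCallback
        # firing on a fixed interval (3 seconds by default).
        self._pcallback = PeriodicCallback(self.poll, interval_ms)

    def start(self):
        self._pcallback.start()

    def stop(self):
        self._pcallback.stop()

    def poll(self):
        # is_alive() ultimately comes down to Popen.poll() on the kernel
        # process: None means still running, an exit code means it's gone.
        if not self.kernel_manager.is_alive():
            # The process is gone, so the restarter assumes it died
            # unexpectedly and restarts it -- even if a deliberate
            # shutdown_kernel() is what terminated it.
            self.kernel_manager.restart_kernel(now=True, newports=False)
```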

So the timing window appears to be that:

  1. the restarter is activated (every 3 seconds) and subsequently suspended
  2. the kernel manager stops the callback mechanism (I'm not sure about the details here, but it might be that the restarter gets invoked during the stop of the callback)
  3. the kernel manager sends the shutdown request and terminates the process
  4. the restarter detects the process died "unexpectedly"
  5. the restarter issues the kernel's restart

I'm thinking we may want to record within the shutdown_kernel method that we're currently in that method (i.e., an entry/exit marker). The restarter would then check this value (or perhaps just have the kernel manager's restart_kernel method check it) and skip the restart if the kernel manager's state indicates it's within shutdown_kernel, roughly as sketched below.
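
A rough sketch of that guard, again with hypothetical attribute and method names (the real change, if made, would live in jupyter_client's kernel manager):

```python
class SketchKernelManager:
    """Illustrative only; names here are hypothetical, not jupyter_client's API."""

    def __init__(self):
        self._shutting_down = False  # entry/exit marker for shutdown_kernel

    def shutdown_kernel(self, now=False, restart=False):
        self._shutting_down = True
        try:
            self.stop_restarter()                   # stop the PeriodicCallback
            self.request_shutdown(restart=restart)  # send shutdown_request
            self.finish_shutdown()                  # wait, then terminate if needed
        finally:
            self._shutting_down = False

    def restart_kernel(self, now=False, newports=False, **kw):
        if self._shutting_down:
            # A deliberate shutdown is in flight; ignore the restarter's
            # request instead of resurrecting the kernel.
            return
        # ... normal restart path ...
```

The try/finally keeps it a true entry/exit marker, so a failed shutdown doesn't leave restarts permanently disabled.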

This issue and any changes relative to it would be applied within jupyter_client, so I'm inclined to transfer this issue there, although I'd prefer your (or others') feedback prior to transferring.

It's strange that you tend to see this with this particular kernel, but since it runs on a JVM, there might be something there that exacerbates things, although the only window I see here is the one surrounding the callback's stop and its invocation.

@jameskoch
Member Author

Nice to meet you @kevin-bates !

On my end, I've been exploring the idea that I may be inducing this behavior by sending SIGTERM to both Notebook and the kernel simultaneously, from my build script. If that's the case, then the restarter might be doing very sensible things in response to the kernel terminating unexpectedly (as far as it knows).

I don't know that I will get to the bottom of this on my end, and do not want to send anyone else on a wild goose chase. I'd suggest we close this, and reopen if anyone else reports something similar.

@kevin-bates
Member

On my end, I've been exploring the idea that I may be inducing this behavior by sending SIGTERM to both Notebook and the kernel simultaneously, from my build script.

Notebook catches SIGTERM (and SIGINT) and will terminate any outstanding kernel or terminal sessions it knows about, which is all consistent with the log messages. However, if you're explicitly sending SIGTERM to the process group, and your script creates its own process group (directly or indirectly) for the notebook server, then I suppose your scenario could occur. In that case, I suspect the shutdown messaging (request/reply) would not be as smooth, since your kernel would be busy honoring the SIGTERM.
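
To illustrate the distinction, a hedged sketch of the two ways a build script might deliver the signal (the PID plumbing here is made up for the example):

```python
import os
import signal
import sys

# PID of the notebook server, supplied by the build tooling (hypothetical).
notebook_pid = int(sys.argv[1])


def term_server_only(pid):
    # Signals only the notebook server; it catches SIGTERM and shuts its
    # kernels down itself, which is the intended path.
    os.kill(pid, signal.SIGTERM)


def term_process_group(pid):
    # Signals the whole process group. If the kernel shares the group, it
    # receives SIGTERM at the same moment as the server, so the server's
    # shutdown_request/shutdown_reply exchange races with the kernel's own
    # handling of the signal.
    os.killpg(os.getpgid(pid), signal.SIGTERM)


term_server_only(notebook_pid)
```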

I may still spend a bit of time looking into this. It will likely mean strategically (and temporarily) injecting some delays into the restarter and the notebook app's signal handling to better expose the (theoretical) race condition. No need to do anything on your end. Given I haven't seen this before, I doubt any goose chases will occur unless there's a recent change that's starting to surface this kind of thing.

Question: in the cases where this occurs, are you left with a zombie kernel process, or does the notebook fail to shut down entirely? I suspect the former.

@jameskoch
Member Author

jameskoch commented Jan 15, 2021

I don't have great observability into what exactly is left alive once this occurs. My build tooling ends up timing out because of a still-running subprocess, but I can't say for sure whether that's the new kernel, the notebook, or both. I can say that the new kernel's startup seems to go normally, and that notebook (which is running w/ debug-level logging) doesn't produce any output after what I pasted above.

On my end I've switched to sending SIGTERM just to the notebook server. I'll report back in a few days if we see any further instances of KernelRestarter kicking in.

@jameskoch
Member Author

Well, that was fast. I kicked off ten runs and got one to reproduce the issue again. I think this points to this being a real bug.

Annotated log:

Sending SIGTERM to 8799 # jupyter
[C 17:01:33.521 NotebookApp] received signal 15, stopping
[I 17:01:33.522 NotebookApp] Shutting down 1 kernel
[I 17:01:33.923 NotebookApp] Kernel shutdown: 782f6f7d-3fe1-4aad-816d-2f62efbdff60
[I 17:01:33.923 NotebookApp] Shutting down 0 terminals
[I 17:01:33.924 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
[D 17:01:33.925 NotebookApp] Starting kernel: ['...', '--connection-file', '/home/user/.local/share/jupyter/runtime/kernel-782f6f7d-3fe1-4aad-816d-2f62efbdff60.json']
[D 17:01:33.928 NotebookApp] Connecting to: tcp://127.0.0.1:38217

# Stdout from the new kernel visible here; omitted

# Sleeps for 10s

# Goes on to SIGKILL jupyter and any descendants that it observed _before_ sending SIGTERM.
Sending SIGKILL to 8799 # jupyter - Does not imply it was still alive.
Sending SIGKILL to 8866 # bash wrapper around old kernel
Sending SIGKILL to 8870 # old kernel

Then I'm left w/ some unidentified subprocess still running, which must be the new kernel.

@kevin-bates
Member

Right on - thank you for spending time on this. If I were to build a candidate fix (it would be in jupyter_client), might you be able to place it into your environment (or install from a GitHub branch), or do you rely on pulling things from PyPI, etc.?

@kevin-bates
Member

In case you're able to try it (if not, no worries), I've pushed a candidate fix to my jupyter_client branch at https://github.com/kevin-bates/jupyter_client/tree/restarter-race-condition if you'd like to clone and build jupyter_client yourself.

I've also attached a dev wheel file, jupyter_client-6.1.11.dev0-py3-none-any.zip. Just download the zip, change the extension from .zip to .whl, and pip install --upgrade jupyter_client-6.1.11.dev0-py3-none-any.whl into your notebook Python environment.
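
In case it helps confirm the dev build actually got picked up, one way to check (assuming the usual version metadata) is:

```python
import jupyter_client
# Should report 6.1.11.dev0 if the attached wheel was installed.
print(jupyter_client.__version__)
```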

If you run notebook with --debug, you should see the debug statement Kernel shutdown in progress..., which would indicate that the run in question reproduced the issue.

Thank you for your help!
