-
Notifications
You must be signed in to change notification settings - Fork 286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Kernel not properly shutdown after NotebookApp receives SIGTERM. #606
Comments
Hi @jameskoch - first, thank you for the great description of the issue! This smells like a timing issue based on the fact that the kernel's posting of shutdown and restart differ by a millisecond. I don't know the intimate details of tornado's I believe the kernel process is shut down because the restarter's first action is to check if the kernel is alive and it finds that it is not based on a call to the Popen.poll() method. Thus the restarter feels the kernel terminated in error and restarts the kernel. So the timing window appears to be that:
I'm thinking we may want to indicate within the shutdown_kernel method that we're currently in that method (i.e., an entry/exit thing). The restarter would then check this value (or perhaps just the kernel manager restart_kernel method check this value) and not perform the restart if the kernel manager's state indicates it's within the shutdown_kernel method. This issue and any changes relative to it would be applied within jupyter_client - so I'm inclined to transfer this issue there - although I'd prefer your (or other's) feedback prior to transferring. It's strange that you tend to find this happening with this particular kernel, but since it's a JVM, there might be something there that exacerbates things - although the only window I see here is the one surrounding the callback's stop and its invocation. |
Nice to meet you @kevin-bates ! On my end, I've been exploring the idea that I may be inducing this behavior by sending SIGTERM to both Notebook and the kernel simultaneously, from my build script. If that's the case, then the restarter might be doing very sensible things in response to the kernel terminating unexpectedly (as far as it knows). I don't know that I will get to the bottom of this on my end, and do not want to send anyone else on a wild goose chase. I'd suggest we close this, and reopen if anyone else reports something similar. |
Notebook catches SIGTERM (and SIGINT) and will terminate any outstanding kernel or terminal sessions it knows about - which is all consistent with the log messages. However, if you're explicitly sending SIGTERM to the process group and your script creates its own process group (directly or indirectly) for the notebook server, then I suppose that your scenario could occur. In that case, I doubt the shutdown messaging (request/reply) would not be as smooth since your kernel would be busy honoring the SIGTERM. I may still spend a bit looking into this. It will likely mean strategically (and temporarily) injecting some delays into the restarter and notebook app's signal handling to better-expose the (theoretical) race condition. No need to do anything on your end. Given I haven't seen this before, I doubt any goose chases will occur unless there's recent change that's starting to surface this kind of thing. Question. In these cases where this occurs, are you left with a zombie kernel process or does the notebook fail to shutdown entirely? I suspect the former. |
I don't have great observability of what exactly is left alive once this occurs. My build tooling ends up timing out because of a still-running subprocess, but I can't say for sure if that's the new kernel, notebook, or both. I can say that new kernel startup seems to be going normally, and that notebook (which is running w/ debug level logging) doesn't produce any output after what I pasted above. On my end I've switched to sending SIGTERM just to notebook. I'll report back in a few days if we see any instances of KernelRestarter kicking in. |
Well, that was fast. I kicked off ten runs and got one to reproduce the issue again. I think this points to this being a real bug. Annotated log:
Then I'm left w/ some unidentified subprocess still running, which must be the new kernel. |
Right on - thank you for spending time on this. If I were to build a candidate fix (it would be jupyter_client), might you be able to place it into your environment (or install from a github branch) - or do you rely on pulling things from PYPI, etc? |
In case you're able (if not, no worries), I've pushed a candidate fix to my jupyter_client branch: https://github.com/kevin-bates/jupyter_client/tree/restarter-race-condition to clone/build jupyter_client yourself. I've also attached this dev wheel file jupyter_client-6.1.11.dev0-py3-none-any.zip. Just download the zip, change the extension from If you run notebook with Thank you for your help! |
Describe the bug
Kernel not properly shutdown after NotebookApp receives SIGTERM.
I am seeing this w/ the Almond Scala kernel, w/ the Notebook stack running in automated testing environment. An atypical environment where many things other than Notebook could be at fault. Still, I wanted to file something in case others see it.
I'm sending SIGTERM to the Notebook server to shut things down after testing. NotebookApp handles the signal, shuts down the kernel. Then, many times, but not always, KernelRestarter kicks in and starts it right back up.
Logs:
My kernels logs that it receives shutdown_request, sends shutdown_reply, then does some final cleanup. I don't have exact timestamps to speak to timeliness, nor firm evidence that the process actually terminates.
Versions:
The text was updated successfully, but these errors were encountered: