Kernel creation blocked when other Pod is pending in Kubernetes #784
Comments
Hi @janvdvegt - thanks for opening this issue. Unfortunately, the Jupyter stack's kernel startup sequence is synchronous at the moment. Issue #86 has been long-standing, as has PR #580. On the bright side, the dependent PRs referenced in #580 are seeing some life and may be merged (although, due to the layering, it may still be a while before things are available to EG).

Do you have the KernelImagePuller configured? I suppose even that isn't sufficient in auto-scaled environments, since a new node is added on demand to address a kernel pod startup request, so the kernel pod and the KernelImagePuller daemonset are probably competing to get the kernel image pulled.

I believe I created a version of the EG image that contains the code corresponding to the async startup PRs. If you pull the image with the dev_async tag, you might find other kernel starts are not blocked. That code is over a year old, but if you see some success, I could try to spend some time updating that image with the latest layers.

I understand your pain, but given our dependence on the rest of the stack, there isn't much we can do about this. I believe you'll also find that those initial kernel start requests (when a new node is added) will likely time out, so running with a higher KERNEL_LAUNCH_TIMEOUT value may be required. I wish I had a more positive answer for you.
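For later readers, a minimal sketch of what raising KERNEL_LAUNCH_TIMEOUT can look like when starting a kernel through the gateway's REST API. The endpoint URL, kernelspec name, and timeout value are hypothetical placeholders, and it assumes the gateway honors env entries included in the body of the kernel start request:

```python
# Sketch only: start a kernel against an assumed Enterprise Gateway endpoint
# while passing a larger KERNEL_LAUNCH_TIMEOUT so node scale-up has time to finish.
import requests

EG_URL = "http://enterprise-gateway.example.com:8888"  # hypothetical endpoint

resp = requests.post(
    f"{EG_URL}/api/kernels",
    json={
        "name": "python_kubernetes",               # hypothetical kernelspec name
        "env": {"KERNEL_LAUNCH_TIMEOUT": "600"},   # allow time for autoscaling
    },
    timeout=650,  # keep the HTTP timeout above the kernel launch timeout
)
resp.raise_for_status()
print("kernel id:", resp.json()["id"])
```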
Hi Kevin, thank you for your answer. The KernelImagePuller does not work well due to the elastic nature of our setup, as you mentioned already. Are you familiar with this project: https://github.com/kubeflow/kubeflow/tree/master/components/notebook-controller? It's difficult for me to fully grasp the differences between the projects. Do you think this project fits better with our requirements?
Hi @janvdvegt - The notebook-controller would be analogous to using JupyterHub on k8s because it spins up the actual Jupyter Lab server instance. Unless EG is plugged in, all notebook kernels will run within the notebook-controller pod, just like in the notebook pods that JH spins up. If you're looking to create a NotebookServerAsAService, then the notebook-controller or JH would be sufficient, and you won't encounter the kernel start contention (unless you attempt to start multiple kernels simultaneously within the same notebook/lab pod instance). If either of those satisfies your requirements, I would highly recommend JupyterHub, since it has a strong user community and excellent devs maintaining it. That's not to say the notebook-controller doesn't as well; I just don't know much about it, having only heard about it from you.
Hi Kevin, thank you for your analysis and your update. We are going to look into the right solution for us in a week or two. I do think the NotebookServerAsAService approach is what we need.
Is there any update regarding this issue? I see that #794 has not been merged yet, although its dependencies have. Being able to have non-blocking and parallel kernel creation would be a great feature, and essential for a project that I am currently involved in.
Your observation is correct. Once the Notebook 6.1 release is created, we will merge #794.
@borremosch - you might also try pulling the dev_async image mentioned above.
Async kernel management has been available throughout the Jupyter stack for the last several releases - closing.
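For anyone arriving here later: on a plain Notebook 6.1+ server the async kernel manager can be opted into explicitly if it isn't already the default in your deployment. A sketch of a jupyter_notebook_config.py entry, assuming Notebook >= 6.1 (Enterprise Gateway releases that include #794 wire this up themselves):

```python
# jupyter_notebook_config.py -- sketch, assuming Notebook >= 6.1.
# AsyncMappingKernelManager launches kernels without blocking the server's
# event loop, so one slow kernel start no longer stalls other starts.
c.NotebookApp.kernel_manager_class = (
    "notebook.services.kernels.kernelmanager.AsyncMappingKernelManager"
)
```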
Description
New kernels spin up as Pods in our Kubernetes cluster when started. Because we use autoscaling, it can take a few minutes before a node is ready. It looks like no other kernels can be started during this time. Even worse, if the autoscaling fails (for a specific nodeSelector, for example) and the Pod remains pending, this blocks any new notebook kernels from being created. Is this intended behaviour? Can we do something about it? Even with 2 or 3 users, this could significantly hurt the "Notebooks-as-a-service" application.
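To make the contention easier to reproduce, here is a small sketch (not part of the original report) that fires two kernel start requests at the gateway concurrently and times them; the gateway URL and kernelspec name are assumptions:

```python
# Sketch: time two concurrent kernel start requests against an assumed gateway.
import concurrent.futures
import time

import requests

EG_URL = "http://enterprise-gateway.example.com:8888"  # assumed endpoint
KERNELSPEC = "python_kubernetes"                        # assumed kernelspec

def start_kernel(label):
    t0 = time.monotonic()
    resp = requests.post(f"{EG_URL}/api/kernels",
                         json={"name": KERNELSPEC}, timeout=600)
    resp.raise_for_status()
    return label, time.monotonic() - t0

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    for label, elapsed in pool.map(start_kernel, ["first", "second"]):
        # With synchronous kernel management, the second request tends to take
        # roughly as long as the first launch plus its own startup time.
        print(f"{label} kernel ready after {elapsed:.1f}s")
```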
Environment