
Kernel creation blocked when other Pod is pending in Kubernetes #784

Closed
janvdvegt opened this issue Feb 26, 2020 · 9 comments

Comments

@janvdvegt

Description

When a kernel is started, a new Pod spins up in our Kubernetes cluster. Because we use autoscaling, it can take a few minutes before a node is ready. It looks like during this time, it is impossible to start other kernels. Even worse, if the autoscaling fails (for a specific nodeSelector, for example) and the Pod remains Pending, no new notebook kernels can be created at all. Is this intended behaviour? Can we do something about it? Even with 2 or 3 users, this could significantly hurt a "Notebooks-as-a-service" application.

Environment

  • Enterprise Gateway Version 2.1.0
  • Platform: Kubernetes
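For anyone diagnosing the same symptom, a quick way to confirm that a stuck kernel Pod is the culprit is to look for Pods in the Pending phase. This is only a sketch: the `enterprise-gateway` namespace is an assumption, so adjust it to wherever your kernel Pods are created.

```shell
# List kernel pods stuck in Pending (namespace is an assumption; adjust as needed)
kubectl get pods -n enterprise-gateway --field-selector=status.phase=Pending

# Show why a specific pod is Pending (scheduling events appear at the bottom)
kubectl describe pod <kernel-pod-name> -n enterprise-gateway
```

The `describe` output's Events section will typically show the scheduling failure (e.g. no node matching the nodeSelector) that leaves the Pod Pending.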
@kevin-bates
Member

Hi @janvdvegt - thanks for opening this issue.

Unfortunately, the jupyter stack kernel startup sequence is synchronous at the moment. Issue #86 has been long-standing, as has PR #580. On the bright side, the dependent PRs referenced in #580 are seeing some life and may be merged (although due to the layering, it may still be a while before things are available to EG).

Do you have KernelImagePuller configured? I suppose even that isn't sufficient in auto-scaled environments since a new node is added on-demand to address a kernel pod startup request, so both it and the KernelImagePuller daemonset are probably competing to get the kernel image pulled.

I believe I created a version of the EG image that contains the code corresponding to the async startup PRs. If you pull the image with the dev_async tag, you might find other kernel starts are not blocked. That code is over a year old, but if you see some success, I could try to spend some time updating that image with the latest layers.
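For anyone wanting to try this, pulling the tag would look like the following sketch. The `elyra/enterprise-gateway` repository name is taken from the `elyra/enterprise-gateway:async` image mentioned later in this thread; the exact tag to use is the `dev_async` tag referenced above.

```shell
# Pull the experimental async-startup build of the EG image
docker pull elyra/enterprise-gateway:dev_async
```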

I understand your pain, but given our dependence on the rest of the stack, there isn't much we can do about this. I believe you'll also find that those initial kernel start requests (when a new node is added) will likely time out. As a result, running with a higher KERNEL_LAUNCH_TIMEOUT value may be required.
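Raising the timeout could be done, for example, by setting the environment variable on the Enterprise Gateway deployment. This is a sketch only: the deployment name and namespace are assumptions based on a typical EG install, and 300 seconds is an arbitrary example value chosen to outlast a node scale-up.

```shell
# Bump the kernel launch timeout to 5 minutes
# (deployment name, namespace, and value are assumptions; adjust to your install)
kubectl set env deployment/enterprise-gateway -n enterprise-gateway \
  KERNEL_LAUNCH_TIMEOUT=300
```

Note that clients can also pass KERNEL_LAUNCH_TIMEOUT per request, so the value effectively in play is whichever the client sends, bounded by the server-side configuration.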

I wish I had a more positive answer for you.

@janvdvegt
Author

Hi Kevin,

Thank you for your answer. The KernelImagePuller does not work well due to the elastic nature of our setup as you mentioned already.

Are you familiar with this project: https://github.com/kubeflow/kubeflow/tree/master/components/notebook-controller ? It's difficult for me to fully grasp the differences between the projects. Do you think this project fits better with our requirements?

@kevin-bates
Member

Hi @janvdvegt - The notebook-controller would be analogous to using JupyterHub on k8s because it spins up the actual Jupyter Lab server instance. Unless EG is plugged in, all notebook kernels will run within the notebook pod it creates, just like the notebook pods that JH spins up. If you're looking to create a NotebookServerAsAService, then the notebook-controller or JH would be sufficient and you won't encounter the kernel-start contention (unless attempting to start multiple kernels simultaneously within the same notebook/lab pod instance).

If either of those satisfies your requirements, I would highly recommend JupyterHub since it has a strong user community and excellent devs maintaining it. Not to say notebook-controller doesn't as well; I just don't know much about it, having only just heard of it from you.

@kevin-bates
Member

This issue will be resolved once #794 (formerly #580) is merged. The merge first requires a new Notebook release.

@janvdvegt
Author

Hi Kevin, thank you for your analysis and your update. We are going to look into the right solution for us in a week or two. I do think the NotebookServerAsAService approach is what we need.

@borremosch

Is there any update regarding this issue? I see that #794 has not been merged yet, although its dependencies have. Being able to have non-blocking and parallel kernel creation would be a great feature, and essential for a project that I am currently involved in.

@kevin-bates
Member

Your observation is correct. Once the Notebook 6.1 release is created, we will merge #794.

@kevin-bates
Member

@borremosch - you might also try pulling elyra/enterprise-gateway:async and see how it goes.

@kevin-bates
Member

Async kernel management has been available throughout the Jupyter stack for the last several releases - closing.
