
Kernel creation blocked when other Pod is pending in Kubernetes #784

Closed
janvdvegt opened this issue Feb 26, 2020 · 9 comments

Comments

@janvdvegt

Description

When a kernel is started, a new Pod spins up in our Kubernetes cluster. Because we use autoscaling, it can take a few minutes before a node is ready. It looks like during this time, it is impossible to start other kernels. Even worse, if the autoscaling fails (for a specific nodeSelector, for example) and the Pod remains Pending, no new notebook kernels can be created at all. Is this intended behaviour? Can we do something about it? Even with 2 or 3 users, this could significantly hurt a "Notebooks-as-a-service" application.

Environment

  • Enterprise Gateway Version 2.1.0
  • Platform: Kubernetes
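For anyone diagnosing the same symptom, a quick way to confirm that a stuck kernel Pod is the culprit is to look for Pods in the Pending phase. This is only a sketch: the `enterprise-gateway` namespace is an assumption, so adjust it to wherever your kernel Pods are created.

```shell
# List kernel pods stuck in Pending (namespace is an assumption; adjust as needed)
kubectl get pods -n enterprise-gateway --field-selector=status.phase=Pending

# Show why a specific pod is Pending (scheduling events appear at the bottom)
kubectl describe pod <kernel-pod-name> -n enterprise-gateway
```

The `describe` output's Events section will typically show the scheduling failure (e.g. no node matching the nodeSelector) that leaves the Pod Pending.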
@kevin-bates
Member

Hi @janvdvegt - thanks for opening this issue.

Unfortunately, the jupyter stack kernel startup sequence is synchronous at the moment. Issue #86 has been long-standing, as has PR #580. On the bright side, the dependent PRs referenced in #580 are seeing some life and may be merged (although due to the layering, it may still be a while before things are available to EG).

Do you have KernelImagePuller configured? I suppose even that isn't sufficient in auto-scaled environments since a new node is added on-demand to address a kernel pod startup request, so both it and the KernelImagePuller daemonset are probably competing to get the kernel image pulled.

I believe I created a version of the EG image that contains the code corresponding to the async startup PRs. If you pull the image with the dev_async tag, you might find other kernel starts are not blocked. That code is over a year old, but if you see some success, I could try to spend some time updating that image with the latest layers.
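For anyone wanting to try this, pulling the tag would look like the following sketch. The `elyra/enterprise-gateway` repository name is taken from the `elyra/enterprise-gateway:async` image mentioned later in this thread; the exact tag to use is the `dev_async` tag referenced above.

```shell
# Pull the experimental async-startup build of the EG image
docker pull elyra/enterprise-gateway:dev_async
```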

I understand your pain, but given our dependence on the rest of the stack, there isn't much we can do about this. I believe you'll also find that those initial kernel start requests (when a new node is added) will likely time out. As a result, running with a higher KERNEL_LAUNCH_TIMEOUT value may be required.
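Raising the timeout could be done, for example, by setting the environment variable on the Enterprise Gateway deployment. This is a sketch only: the deployment name and namespace are assumptions based on a typical EG install, and 300 seconds is an arbitrary example value chosen to outlast a node scale-up.

```shell
# Bump the kernel launch timeout to 5 minutes
# (deployment name, namespace, and value are assumptions; adjust to your install)
kubectl set env deployment/enterprise-gateway -n enterprise-gateway \
  KERNEL_LAUNCH_TIMEOUT=300
```

Note that clients can also pass KERNEL_LAUNCH_TIMEOUT per request, so the value effectively in play is whichever the client sends, bounded by the server-side configuration.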

I wish I had a more positive answer for you.

@janvdvegt
Author

Hi Kevin,

Thank you for your answer. The KernelImagePuller does not work well due to the elastic nature of our setup as you mentioned already.

Are you familiar with this project: https://github.com/kubeflow/kubeflow/tree/master/components/notebook-controller ? It's difficult for me to fully grasp the differences between the projects. Do you think this project fits better with our requirements?

@kevin-bates
Member

Hi @janvdvegt - The notebook-controller would be analogous to using JupyterHub on k8s because it spins up the actual Jupyter Lab server instance. Unless EG is plugged in, all notebook kernels will run within the notebook pod it creates, just like the notebook pods that JH spins up. If you're looking to create a NotebookServerAsAService, then the notebook-controller or JH would be sufficient and you won't encounter the kernel-start contention (unless attempting to start multiple kernels simultaneously within the same notebook/lab pod instance).

If either of those satisfies your requirements, I would highly recommend JupyterHub since it has a strong user community and excellent devs maintaining it. Not to say notebook-controller doesn't as well; I just don't know much about it, having only just heard of it from you.

@kevin-bates
Member

This issue will be resolved once #794 (formerly #580) is merged. The merge first requires a new Notebook release.

@janvdvegt
Author

Hi Kevin, thank you for your analysis and your update. We are going to look into the right solution for us in a week or two. I do think the NotebookServerAsAService approach is what we need.

@borremosch

Is there any update regarding this issue? I see that #794 has not been merged yet, although its dependencies have. Being able to have non-blocking and parallel kernel creation would be a great feature, and essential for a project that I am currently involved in.

@kevin-bates
Member

Your observation is correct. Once the Notebook 6.1 release is created, we will merge #794.

@kevin-bates
Member

@borremosch - you might also try pulling elyra/enterprise-gateway:async and see how it goes.

@kevin-bates
Member

Async kernel management has been available throughout the Jupyter stack for the last several releases - closing.
