Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove blocking Kernel Startup #86

Closed
aazhou1 opened this issue Jul 21, 2017 · 12 comments · Fixed by #794
Closed

Remove blocking Kernel Startup #86

aazhou1 opened this issue Jul 21, 2017 · 12 comments · Fixed by #794
Assignees
Milestone

Comments

@aazhou1
Copy link

aazhou1 commented Jul 21, 2017

Single-threaded kernel startup causing notebook front end to lose connection to gateway as a lot of kernels are being started--so that notebook has to re-pull kernel specs from gateway after Elyra finishes starting kernels.

@LK-Tmac1 LK-Tmac1 added this to the Backlog milestone Jul 21, 2017
@kevin-bates
Copy link
Member

I believe this issue is a result of the duration it takes to "start" a kernel relative to a remote cluster. In normal circumstances (local kernels) the duration is on the order of a second (to invoke the process and have the kernel respond). For remote kernels running as yarn-cluster, that duration is an order of magnitude longer (10-15 seconds) which exacerbates as blocking startups on the front end.

I tried to look into this early on but couldn't make any headway because the Jupyter framework assumes the kernel is communicating immediately on return from the startup logic (so using a thread doesn't really help).

@kevin-bates kevin-bates modified the milestones: Sprint 8, Backlog Aug 21, 2017
@kevin-bates kevin-bates self-assigned this Aug 21, 2017
@kevin-bates kevin-bates modified the milestones: Backlog, Sprint 8 Aug 28, 2017
@kevin-bates
Copy link
Member

kevin-bates commented Aug 28, 2017

I haven't been able to make any headway into this. The MappingKernelManager is async in that it uses the @gen.coroutine decorator. This "multi kernel manager" instantiates a "single kernel manager" - which is responsible for starting the kernel - that, when remote - takes several seconds.

I tried introducing asynchronous behaviors at various levels within this framework - either at the point of launch, higher up at the start of the "single kernel manager", or even higher within the "multi-kernel manager" sequence - as well as replacing time.sleep with yield gen.sleep. In all cases, any form of yield returns to the caller - as if the call did not occur. This causes control to be returned back to the user too soon.

In addition, I suspect the current asynchronous intentions are not working as expected, but that has gone unnoticed because regular kernel startups are on the order of one, maybe two, seconds.

Here are some links I found that seemed applicable to what we need:
http://blog.trukhanov.net/Running-synchronous-code-on-tornado-asynchronously/
https://stackoverflow.com/questions/18088176/calling-tornado-coroutines-from-synchronous-code
https://stackoverflow.com/questions/13051591/tornado-blocking-asynchronous-requests

Placing back on the backlog for now. It would be great if someone else could spend some time with this. I'd be happy to share more about what I tried.

@kevin-bates kevin-bates removed their assignment Aug 29, 2017
@lresende lresende self-assigned this Aug 29, 2017
@lresende lresende modified the milestones: Sprint 9, Backlog Aug 29, 2017
@lresende lresende modified the milestones: Sprint 10, Sprint 9 Sep 11, 2017
@lresende lresende modified the milestones: Sprint 10, Sprint 11 Sep 21, 2017
@lresende lresende modified the milestones: Sprint 11, Backlog Oct 7, 2017
@lresende lresende removed their assignment Oct 14, 2017
@lresende lresende self-assigned this Nov 6, 2017
@lresende lresende removed this from the Backlog milestone Nov 6, 2017
@kevin-bates kevin-bates added this to the v0.8 Backlog milestone Nov 7, 2017
@kevin-bates kevin-bates removed this from the v0.8 Backlog milestone Mar 26, 2018
@kevin-bates kevin-bates changed the title Single-threaded Kernel Startup Remove blocking Kernel Startup Apr 3, 2018
@kevin-bates kevin-bates added this to the v1.0 milestone May 11, 2018
@kevin-bates kevin-bates modified the milestones: v1.0, v2.0 Jun 1, 2018
@Carreau
Copy link
Contributor

Carreau commented Nov 12, 2018

As promised in another thread, some comments on this; Apologies for the delay, I've been pretty busy on other things (like deleting root's home on one of the cluster I manage because i'm stupid...)

I believe there is a path forward that keep backward compatibility but will allow us to slowly migrate to async-almost-everything and not block kernel launching. I'm going to use python 3 syntax, but it should be possible to do that on Python 2 with @gen.coroutine/yield but it's less clear in explanations.

  1. We need slowly migrate all the current API from def function() to async def function(), and prefer the second one. The nice thing is that async def function() can be blocking, that is not a problem, but it's hard to make def function() non-blocking.

For client that do not run in an event-loop, and are "quasi sync", you can fake an event loop that just advance everything like in there

  1. We need to provide backward compatibility, that is to say we need to provide, and be able to call call 2 kinds of methods:
  • Synchone def foo()
  • Asynchrone async def foo().

In python 3 at least, you can figure out whether a function is synchrone using inspect.iscoroutinefunction. If it's sync then we just call it, if it's async we yield from it.

I suggest the following for every method def XXX(...) that participate in kernel management we define:

  • Make a async def XXX_async() version that is provisional but async.

For every method foo we currently call check anc call in order:

  • async def foo
  • async def foo_async
  • def foo

Everytime we encounter a non-async function, or a non-async function get called have extensive deprecation warning that ask users to:

  • make their function async if it's sync.
  • Provide them with a simple way of upgrading their code to call both async and non-async function.

I've been toying with that in jupyter/jupyter_client#402, and jupyter/notebook#4126, I just need to find the right abstraction and pattern to allow the transition to be as seamless as possible for users.

As far as I can tell, we should be able to do a transition where all of the new API are compatible with the future ones with minimal disruptions and code change. I believe with the right tooling, documentation and example that should be something relatively systematic to do, so the would just be a question of time and outreach to review all the possible places that need to be updated.

Apologies if this is a bit unclear, I'm still trying to arrange things in my head, and I still need to dive into kernel and enterprise gateway to see how it affects or would simplify those.

@kevin-bates
Copy link
Member

@Carreau - thank you for your comment. Yes, I've been monitoring your WIP PRs and agree that its best to see how the dust settles on those. In addition, @takluyver's WIP PR may essentially eliminate all kernel management from Enterprise Gateway anyway - so we need to wait to see where his effort takes us. I would recommend you coordinate with Thomas, although I suspect you have already done that. Looks like Thomas is accounting for async kernel providers also - all good.

In the meantime, we will continue to monitor these kernel management PRs and contribute where we can. Our primary interest is also in preserving the existing functionality we provide and to minimize any migration challenges that lie ahead when adopting these new architectures.

@takluyver
Copy link

Looks like Thomas is accounting for async kernel providers also - all good.

To be honest, not really. I tried to figure this out, and there are launch_async methods in the API, but it all got too messy, and I've wound up using only the blocking APIs to launch and manage kernels. I have a working async client built on the tornado interface, but not the provider/manager part.

Part of the muddle I got into was that I was planning to build everything directly on asyncio, but then there's a nice ZMQ 'stream' interface built on tornado, and then I started reading about trio, and...

@kevin-bates kevin-bates assigned kevin-bates and unassigned lresende Feb 19, 2019
kevin-bates added a commit to kevin-bates/enterprise_gateway that referenced this issue Feb 19, 2019
Updated start_kernel and restart_kernel methods to coroutines.

Converted various `time.sleep()` to `yield gen.sleep()` which enables
tornado to "slice-out" to a different request. As a result, converting
`confirm_remote_startup()` and `handle_timeout()` to coroutines was key.

Fixes jupyter-server#86
@esevan
Copy link
Contributor

esevan commented Feb 20, 2019

@kevin-bates Would you mind if I ask nb2kg image for this patch?

I'll appreciate it if any guide is given as well for upgrading jupyter notebook and client in nb2kg Dockerfile.

@kevin-bates
Copy link
Member

@esevan - Although I performed validation of the changes via the REST API, an update to the nb2kg image shouldn't be necessary since the actual changes to enable concurrency really take place in EG via #580. The notebook and jupyter_client changes essentially facilitate the changes in EG. If you're finding that not to be the case, please let me know along with how you go about starting kernels concurrently via the NB2KG image and I'd be happy to take a look.

@esevan
Copy link
Contributor

esevan commented Feb 20, 2019

@kevin-bates I'm using this image as my notebook environment. https://github.com/jupyter/enterprise_gateway/blob/master/etc/docker/nb2kg/Dockerfile

I expected this included notebook and client, that It should be changed to facilitate #580.

kevin-bates added a commit to kevin-bates/enterprise_gateway that referenced this issue Feb 20, 2019
Updated start_kernel and restart_kernel methods to coroutines.

Converted various `time.sleep()` to `yield gen.sleep()` which enables
tornado to "slice-out" to a different request. As a result, converting
`confirm_remote_startup()` and `handle_timeout()` to coroutines was key.

Fixes jupyter-server#86
@kevin-bates
Copy link
Member

That's correct that the image uses notebook and jupyter_client, but the actual kernel startup and management gets proxied (via NB2KG) to the Enterprise Gateway server. That's where the practical portion of the concurrency changes reside. The NB and JC changes are merely to turn some existing methods into coroutines, but there weren't any additional changes made that trigger significant concurrency. The primary change was to replace the time.sleep() calls with yield gen.sleep() within confirm_remote_startup() and handle_timeout(), but to do that various calling methods (that include superclass methods in NB/JC) required they be decorated with @gen.coroutine.

I've attached the 3 dev-built whl files for EG, NB and JC. You must pip install EG before NB since the NB version violates EG's constraints.
async-startup-wheels.zip

If you only need the applicable EG image, I've pushed one that can be pulled using: docker pull kbates/enterprise-gateway:async

I hope that helps.

@esevan
Copy link
Contributor

esevan commented Feb 20, 2019

@kevin-bates So you just modified EG process embedding JC and NB. I thought client side was also changed, and now I got it. Thanks a lot!

@esevan
Copy link
Contributor

esevan commented Feb 24, 2019

I've studied kernel startup source codes working in my environment (k8s) on my free time.
Here's just a graffiti as a result of it. Just wanna share this for anyone studying enterprise_gateway in source code level like me. Now I fully understood why I don't need nb2kg image.
image

@kevin-bates
Copy link
Member

@esevan - this is excellent!

kevin-bates added a commit to kevin-bates/enterprise_gateway that referenced this issue Feb 28, 2019
Updated start_kernel and restart_kernel methods to coroutines.

Converted various `time.sleep()` to `yield gen.sleep()` which enables
tornado to "slice-out" to a different request. As a result, converting
`confirm_remote_startup()` and `handle_timeout()` to coroutines was key.

Fixes jupyter-server#86
kevin-bates added a commit to kevin-bates/enterprise_gateway that referenced this issue Mar 10, 2019
Updated start_kernel and restart_kernel methods to coroutines.

Converted various `time.sleep()` to `yield gen.sleep()` which enables
tornado to "slice-out" to a different request. As a result, converting
`confirm_remote_startup()` and `handle_timeout()` to coroutines was key.

Fixes jupyter-server#86
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment