Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Changes necessary to get async kernel startup working #425

Closed
wants to merge 3 commits into from

Conversation

kevin-bates
Copy link
Member

@kevin-bates kevin-bates commented Feb 16, 2019

Inspired by @Carreau's work in #402, this pull request essentially applies the same model used in Notebook for supporting coroutines with appropriately placed yield statements in order to start multiple kernels simultaneously within the same server instance. The reason I needed to adopt this approach is because Enterprise Gateway has a long-standing issue with concurrent kernel startup requests and runs in both python 2.7 and 3.x.

Since EG supports running remote kernels launched by pluggable resource managers (Hadoop YARN, Kubernetes, Docker Swarm, etc.) kernel startup requests can take anywhere from 5 to 15 seconds, depending on the platform. Moreover, because EG is a headless web server, essentially exposing "Kernel As A Service" behaviors, it is not unusual for simultaneous startup requests to occur - unlike typical single-user Notebook servers.

These changes have been tested with applicable changes in Enterprise Gateway and "single-threaded" POSTs are now eliminated.

I have also run the Notebook nosetests using the applicable dev build of jupyter_client. In addition, I have used a previous version of EG (our recently released beta) running against jupyter_client. No issues were found using either of those "clients" - so backwards compatibility appears to be present.

Because local kernels are so fast to start, it is difficult to determine if these changes are all that are necessary for concurrent starts outside of EG. EG uses poll loops for discovery and startup confirmation and it was this area where blocking was occurring. This, coupled with large start windows make it much easier to determine that the blocking has been addressed. That said, I believe there are similar issues on shutdown (see the timing loop in finish_shutdown()) where further progress can be made. However, kernel shutdowns are not nearly as urgent for end-users.

Given there don't appear to be compatibility issues, I'm hoping these changes can be merged to master and 5.x since EG has existing customers that would like this behavior.

@kevin-bates kevin-bates changed the title Changes necessary to get async kernel startup working [WIP] Changes necessary to get async kernel startup working Feb 18, 2019
@kevin-bates
Copy link
Member Author

Restarts aren't behaving properly - so I'm looking into that.

Inspired by jupyter#402, but needing support for python 2.7, these changes
essentially apply the same model used in Notebook for supporting
coroutines with appropriately placed yield statements in order to
start (and restart) multiple kernels simultaneously within the same
server instance.
@kevin-bates kevin-bates changed the title [WIP] Changes necessary to get async kernel startup working Changes necessary to get async kernel startup working Feb 19, 2019
@kevin-bates
Copy link
Member Author

Similar "plumbing" was required for restart_kernel. Note that a similar change is required from notebook/services/kernels/kernelmanager.py. However, the two changes do not affect notebook kernels and can be published independently.

Only projects that require additional coroutine support (like Enterprise Gateway) will require BOTH sets of changes.

@Carreau
Copy link
Member

Carreau commented Feb 21, 2019

That is awesome.. I'm guessing @minrk and @yuvipanda may want to have a look.

@Carreau Carreau self-requested a review February 21, 2019 01:44
@kevin-bates
Copy link
Member Author

kevin-bates commented Feb 25, 2019

Found that adding support for shutdowns was straightforward. This requires that shutdown_kernel() be exposed as a coroutine in Notebook's MappingKernelManager.

I'm happy to squash these commits if you like.

Removed shutdown change. It works great for clean shut downs but was side-affecting restarts. Following the restart, the kernel would crash. I suspect it was due to the immediately shutdown logic not having a yield or something like that. Need to revisit shut downs and ensure restarts aren't side affected.

@kevin-bates kevin-bates changed the title Changes necessary to get async kernel startup working Changes necessary to get async kernel support working Feb 25, 2019
@kevin-bates
Copy link
Member Author

@Carreau @minrk - would it be possible to get this included in a 5.2.5 release soon. Enterprise Gateway really needs this for its 1.x line (along with the corresponding Notebook PRs). Thanks in advance.

@kevin-bates
Copy link
Member Author

I may be seeing some issues relative to the shutdown changes during restarts. Need to take a closer look.

@kevin-bates kevin-bates changed the title Changes necessary to get async kernel support working [WIP] Changes necessary to get async kernel support working Mar 4, 2019
@kevin-bates
Copy link
Member Author

Removed WIP - need to revisit shutdowns later.

@kevin-bates kevin-bates changed the title [WIP] Changes necessary to get async kernel support working Changes necessary to get async kernel support working Mar 4, 2019
minrk added a commit to jupyter/notebook that referenced this pull request Mar 5, 2019
Converted `MappingKernelManager.restart_kernel` to a coroutine so that projects that take advantage of async kernel startup can also realize appropriate behavior relative to restarts.  To take full advantage of async kernel startup (with restarts), this PR along with its "[sibling PR](jupyter/jupyter_client#425)" will be required.

Please note, however, that each of these PRs can be independently installed w/o affecting today's notebook applications (as far as I can determine via testing the possible combinations).  That said, BOTH of these PRs will be required for usage by Enterprise Gateway - which incurs very long kernel startup times - or other projects that require concurrent kernel starts.

It would be ideal to have this PR back-ported to python 2.7-relative release since EG has an immediate need.

Signed-off-by: Min RK <benjaminrk@gmail.com>
@minrk
Copy link
Member

minrk commented Mar 5, 2019

This can't go into a patch release since it's a major backward-incompatible API change. This change alone forces a major revision due to the change from synchronous to asynchronous methods. This is why #402 defines a new method with a new API, so that it could go in a minor release instead of a major one. We can start trying to release 6.0 soon, though.

@minrk minrk added this to the 6.0 milestone Mar 5, 2019
@minrk
Copy link
Member

minrk commented Mar 5, 2019

The alternative that would let us release this in 5.3 is to take the approach in #402, and define a suite of *_async APIs that return coroutines. The default implementation of these from the base class could be:

def start_kernel_async(self, *args, **kwargs):
    f = Future()
    f.set_result(self.start_kernel(*args, **kwargs))
    return f

so that any subclass would work with the async API, and truly async subclasses could implement the async methods directly.

@kevin-bates
Copy link
Member Author

OK - thanks for the explanation. I suppose the same approach goes for jupyter/notebook#4412.

Would I post the PRs against their relative back-ported branches once I get these in place?

@kevin-bates
Copy link
Member Author

@minrk - Before I go down this road of exposing async kernel management to 5.x, I need to have my understanding confirmed.

You're saying that for each method that should be made async (start_kernel, restart_kernel, shutdown_kernel for starters), a set of parallel async methods need to be implemented. Any "guts" would then need to be placed into local methods (e.g., _start_kernel) as #402 demonstrates, before the B/C release is cut (e.g., juptyer_client 5.3). @Carreau's PR makes use of the async method implicit, based on the version of python, but this question is more about B/C for python 2 users. As a result, I believe you get the following cascading...

Enterprise Gateway has direct dependencies on Kernel Gateway, Notebook and Jupyter Client via inheritance, so before EG can think about having this functionality, a set of parallel methods would be required in Notebook and Notebook 5.x cut (with proper JC dependency in setup.py). Then, after both JC and NB are done, Kernel Gateway needs to have its subclasses updated with applicable parallel methods and JKG 2.x cut (with proper JC and Notebook dependencies in setup.py). Then, Enterprise Gateway can update its subclasses to the async-based methods and JEG 2.x cut (with property JC, Notebook and JKG dependencies in setup.py).

Is my understanding correct?

If yes, then I think a) we (JEG) may not want to make this investment and b) we should get whatever methods are needed in place in jupyter-client for 6.0 so that these parallel methods aren't required post 6.0.

@Carreau
Copy link
Member

Carreau commented Mar 5, 2019

I think it make a lot of sens to develop patches purely in an asynchronous manner to get things to work, but before merging we need to provide backward compatibility with sync methods.

In #402 I tried to do this (likely imperfectly).

You're saying that for each method that should be made async (start_kernel, restart_kernel, shutdown_kernel for starters), a set of parallel async methods need to be implemented. Any "guts" would then need to be placed into local methods (e.g., _start_kernel) as #402 demonstrates, before the B/C release is cut (e.g., juptyer_client 5.3)

If we want a smooth transition then yes. We could do :
5.x : provide sync
6.x : provide sync and async:
7.x : provide async only.

The hard part if for each step of the stack:

  • If dependencies have async: use async as much as possible,
  • otherwise: use sync.

One direction is easy:

  • an async method can alway call a sync one.

There is some difficulties in how some packages call each-other.

This discussion is one of the topics for a meeting next week in DC. I'll see if I can work on some of this.

@Carreau's PR makes use of the async method implicit, based on the version of python, but this question is more about B/C for python 2 users. As a result, I believe you get the following cascading...

Yes and something at some point need to make the decision to start async. I forced it in my PR mostly for testing, sorry if that was confusing.

@kevin-bates
Copy link
Member Author

kevin-bates commented Mar 5, 2019

Ok - thanks for confirming my understanding. Was kinda hoping for a different response - but you're correct, there are extra steps.

5.x : provide sync
6.x : provide sync and async:
7.x : provide async only.

Couldn't we provide async methods in the next 5.x builds of Notebook and JupyterClient? This would provide a vehicle for those Python 2 clients for getting to async. Then 6.x would be the same, with the mantra being "you should move to async methods", where finally, 7.x pulls the plug on the sync methods.

5.x : provide sync and async (where async is optional and uses: @gen.coroutine/yield?)
6.x : provide sync and async (where async is recommended and uses: async def/await?)
7.x : provide async only (where async is required)

Does the switch to async/await need to be coordinated across all classes? Ie., can a subclass use gen.coroutine while the superclass uses async def?

@minrk
Copy link
Member

minrk commented Mar 6, 2019

Does the switch to async/await need to be coordinated across all classes? Ie., can a subclass use gen.coroutine while the superclass uses async def?

In general, no, as long as you use tornado >= 5. I think there can be complications working with tornado 4 and asyncio, but it's okay with tornado 5 or 6. The main thing is dropping the deprecated gen.maybe_future, in favor of a more complete version that supports asyncio coroutine objects, as in jupyter/notebook#4449.

I think getting async to work in a smooth transition that works for both Python 2 and 3, tornado and asyncio, is going to be tricky. Part of the issue is that the notebook repo has already dropped Python 2 support, so any further Python 2 development there is going to have to happen in a forked manner on a backport branch. I think developing async API is also the right time to do the same for this repo. I don't think we have the resources to devote to maintenance of two active development branches of the notebook repo (we aren't currently able to keep up with just one).

Personally, I don't believe users on Python 2, which is itself EOL in 9 months, have a reasonable expectation of new features from any software. I think Python 2 should only be getting security and major bugfix backports at this point. We already took this leap in IPython and JupyterHub, and have done so in notebook master as well. async is one of the very biggest benefits of Python 3, and I think it is extremely reasonable to require Python 3 for such features.

Here's what I would do:

  1. release 6.0 (requiring Python 3.5) with async def ..._async methods, which default to simple wrappers around the synchronous implementations, so that client implementations that are synchronous-only will continue to work in the async API. Preserve the existing method names as always-sync
  2. notebook 6.0 requires jupyter-client 6.0 and uses the ._async methods exclusively

If the _async methods are too cumbersome, we can define new AsyncKernelManager classes (not change the existing ones) where the canonical start_kernel methods, etc. are coroutines. But this is ultimately the same as new methods on the existing classes. The wrapping of sync classes can still be done automatically, but it would probably be more complicated than the async_method in the base class case.

This should be maximally backward-compatible (all synchronous implementations will continue to work, and all consumers of synchronous APIs continue to work), and require no two-stage release for a breaking change.

@kevin-bates
Copy link
Member Author

Great point regarding the impending EOL of Python 2 and expectations of new features. That keeps slipping my mind. Ok, here's how I'd like to proceed based on your responses (both @minrk and @Carreau).

  1. I will revert this commit rather than create a new PR so we retain our conversation.
  2. For each of the public KernelManager methods (i.e., those decorated in MultiKernelManager with @kernel_method) along with those already marked with @gen.coroutine, as well any known others (e.g., I think _launch_kernel should probably support async) a ..._async method will be produced.
  3. Most ..._async methods will simply be a "future wrapper" like suggested above. In addition, a new decorator @kernel_method_async will be introduced to perform a similar function as today's @kernel_method.
  4. Existing @gen.coroutine will have parallel methods that perform async/await functionality. I will also add new async def methods to address the original intent of this PR.

These changes will go into the 6.0 release, after-which we can do the same for Notebook and its 6.0 release. That change will apply use of the async methods to the handlers. Those handlers, presumably, will be converted to async def and call the kernel manager ..._async methods.

I hesitate to comment regarding other services in Notebook outside of the kernels regarding the need for similar changes.

Does this sound right?

@minrk
Copy link
Member

minrk commented Mar 6, 2019

I forgot to mention that perhaps the best plan to move to async is @takluyver’s new async-native replacement APIs for Jupyter-client. Maybe it’s better to try to finish moving notebook to those (jupyter/notebook#4170) and get EG onto those APIs as well. Then we don’t have to try to figure out a safe way to transition these APIs.

@kevin-bates
Copy link
Member Author

Yes, I'm aware of @takluyver's kernel provider proposal and, in principle, agree with the approach since EG's process proxies are essentially providers themselves. However, there are items that require much further discussion. Once we move to this kind of model (where we essentially have pluggable kernel managers and the MultiKernelManager facilitates lifecycle actions), then most of EG will probably go away - although I think it will remain for things like multi-tenancy, HA, provider repo, etc.

At any rate, I think the kernel provider stuff warrants further design discussion and I don't know how that kind of thing gets facilitated across the projects. I'm all for working on this asap and would be perfectly happy to talk to you during your working hours - but that's the kind of thing that needs to happen verbally.

In the meantime, EG users require async kernel startup (due to its multi-tenant behavior, coupled with long startup times) and it seems the most expedient way to get there w/o breaking clients is with dual methods in jupyter_client and notebook.

Do you mind if we attempt the dual method approach? I'd rather not expend the effort if things aren't going to get merged (assuming the changes are within reason).

@minrk
Copy link
Member

minrk commented Mar 6, 2019

I definitely don’t mind if you want to give it a go. I just wanted to make sure we aren’t forgetting the main path forward in the new APIs, which has been underway for some time.

This reverts commit 5103959.
Plan is to introduce parallel start_kernel_async methods on
KernelManager and MultiKernelManager for now.
@kevin-bates kevin-bates changed the title Changes necessary to get async kernel support working Changes necessary to get async kernel startup working Mar 7, 2019
@kevin-bates
Copy link
Member Author

@minrk - I've taken a step back and am only focusing on kernel startups. Restarts are a bit dicey when we're talking about parallel methods and shutdown is more tricky. Startup is where the biggest bang for buck is located so I reverted the previous commit and added the update. If necessary, I'd be happy to create a different PR and/or squash these commits and cleanup the description - but wanted to run this past you first.

I've introduced parallel methods on MultiKernelManager and KernelManager with the async method being start_kernel_async. I've also added a "parallel" method to _launch_kernel named launch_kernel_async (note the lack of '_'). This is because EG needs to override this method so I'm taking the liberty to rename this now.

I'd like to share how the MultiKernelManager.start_kernel_async` method gets invoked as this may seem strange, but I felt it the best way for these to co-exist without having to introduce version dependencies.

Here are the proposed changes in Notebook (PR pending): kevin-bates/notebook@bb40684

I chose to make the use of the async method configurable because just checking if the method exists is not sufficient if KernelManager is sub-classed and the sub-classes don't implement the parallel method (i.e., they'd essentially lose the functionality they intend to provide). What's nice about this is that Notebook and JKG don't implement KernelManager at all. As a result, EG can set this trait unconditionally and trigger its async kernel support via its override of start_kernel_async (and launch_kernel_async).

Btw, restarts will wind up using the synchronous start method for its start phase - which seems fine.

I'm also (still) hoping these changes could be back-ported since they work fine in python 2.7 envs.

@kevin-bates
Copy link
Member Author

This is proving difficult. We definitely need restarts supported but that would require deeper plumbing. I think I'm going to pause this PR and take a different approach via subclasses - @minrk had suggested this in a brief comment previously. I view the tricks for that to be the restarter code and the multiple layers of hierarchy involved without blindly duplicating everything.

I'm going to mark this as WIP, but will likely just close this PR if a sub-classed approach proves practical.

@kevin-bates kevin-bates changed the title Changes necessary to get async kernel startup working [WIP] Changes necessary to get async kernel startup working Mar 10, 2019
@kevin-bates
Copy link
Member Author

I'm closing this in favor of #428 - since the subclass approach feels more compatible.
Thank you for participating. See you over on #428.

@kevin-bates kevin-bates deleted the async-startup branch April 2, 2019 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants