[WIP] Changes necessary to get async kernel startup working #425
Conversation
Restarts aren't behaving properly - so I'm looking into that.
Inspired by jupyter#402, but needing support for Python 2.7, these changes essentially apply the same model used in Notebook for supporting coroutines, with appropriately placed yield statements, in order to start (and restart) multiple kernels simultaneously within the same server instance.
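To illustrate the model (this is not part of the PR's diff): once `start_kernel` behaves as a coroutine, several startups can be yielded in parallel on a single IOLoop. The sketch below assumes a `MultiKernelManager` whose `start_kernel` returns a Future, which is the effect these changes aim for; the function name is made up.

```python
# Illustrative sketch only - assumes start_kernel behaves as a coroutine
# (returns a Future), which is the effect these changes aim for.
from tornado import gen


@gen.coroutine
def start_many(multi_kernel_manager, count=3):
    # Kick off all starts first; each call returns a Future immediately.
    futures = [multi_kernel_manager.start_kernel(kernel_name='python3')
               for _ in range(count)]
    # Yielding the list waits for all of them while the slow parts
    # (process launch, connection polling) interleave on the IOLoop.
    kernel_ids = yield futures
    raise gen.Return(kernel_ids)

# e.g.: ids = tornado.ioloop.IOLoop.current().run_sync(lambda: start_many(mkm, 5))
```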
(force-pushed from 633f245 to 5103959)
Similar "plumbing" was required for Only projects that require additional coroutine support (like Enterprise Gateway) will require BOTH sets of changes. |
That is awesome. I'm guessing @minrk and @yuvipanda may want to have a look.
Removed the shutdown change. It works great for clean shutdowns but was adversely affecting restarts: following the restart, the kernel would crash. I suspect it was due to the immediate-shutdown logic not having a yield, or something like that. Need to revisit shutdowns and ensure restarts aren't affected.
I may be seeing some issues relative to the shutdown changes during restarts. Need to take a closer look.
(force-pushed from 3cabdc4 to 5103959)
Removed WIP - need to revisit shutdowns later.
Converted `MappingKernelManager.restart_kernel` to a coroutine so that projects that take advantage of async kernel startup can also realize appropriate behavior relative to restarts. To take full advantage of async kernel startup (with restarts), this PR along with its "[sibling PR](jupyter/jupyter_client#425)" will be required. Please note, however, that each of these PRs can be independently installed without affecting today's notebook applications (as far as I can determine via testing the possible combinations). That said, BOTH of these PRs will be required for use by Enterprise Gateway - which incurs very long kernel startup times - or other projects that require concurrent kernel starts. It would be ideal to have this PR back-ported to a Python 2.7-compatible release since EG has an immediate need.

Signed-off-by: Min RK <benjaminrk@gmail.com>
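For context, the conversion described above amounts to something like the sketch below, expressed at the KernelManager level. It is a rough approximation mirroring jupyter_client's existing restart logic (and assumes tornado < 6 for `gen.maybe_future`), not the actual diff in either PR.

```python
# Rough sketch of a coroutine-based restart, mirroring KernelManager's
# existing restart logic; not the actual change in either PR.
from tornado import gen


class CoroutineRestartMixin(object):
    @gen.coroutine
    def restart_kernel(self, now=False, **kw):
        """Restart the kernel, yielding to the IOLoop so other kernels can
        be (re)started concurrently while this one comes back up."""
        if self._launch_args is None:
            raise RuntimeError(
                "Cannot restart the kernel. No previous call to 'start_kernel'.")
        self._launch_args.update(kw)
        self.shutdown_kernel(now=now, restart=True)
        # start_kernel may be synchronous or a coroutine depending on the
        # manager; maybe_future (tornado < 6) handles both cases.
        yield gen.maybe_future(self.start_kernel(**self._launch_args))
```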
This can't go into a patch release since it's a major backward-incompatible API change. This change alone forces a major revision due to the change from synchronous to asynchronous methods. This is why #402 defines a new method with a new API, so that it could go in a minor release instead of a major one. We can start trying to release 6.0 soon, though.
The alternative that would let us release this in 5.3 is to take the approach in #402, and define a suite of `*_async` methods along the lines of:

```python
def start_kernel_async(self, *args, **kwargs):
    f = Future()
    f.set_result(self.start_kernel(*args, **kwargs))
    return f
```

so that any subclass would work with the async API, and truly async subclasses could implement the async methods directly.
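Building on that idea, a truly async subclass could then override the `_async` variant directly with a coroutine. A hypothetical sketch follows; the subclass name and its polling logic are made up for illustration.

```python
# Hypothetical subclass for illustration only; not part of jupyter_client.
from tornado import gen
from jupyter_client.manager import KernelManager


class SlowStartingKernelManager(KernelManager):
    @gen.coroutine
    def start_kernel_async(self, **kw):
        # Launch as usual, then poll for readiness without blocking the
        # IOLoop, so other kernel starts can proceed in the meantime.
        self.start_kernel(**kw)
        for _ in range(60):
            if self.is_alive():
                break
            yield gen.sleep(0.5)
```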
OK - thanks for the explanation. I suppose the same approach goes for jupyter/notebook#4412. Should I post the PRs against their respective back-port branches once I get these in place?
@minrk - Before I go down this road of exposing async kernel management to 5.x, I need to have my understanding confirmed. You're saying that for each method that should be made async, a parallel async method would need to be introduced.

Enterprise Gateway has direct dependencies on Kernel Gateway, Notebook and Jupyter Client via inheritance, so before EG can think about having this functionality, a set of parallel methods would be required in Notebook and a Notebook 5.x cut (with the proper jupyter_client dependency) as well.

Is my understanding correct? If yes, then I think a) we (JEG) may not want to make this investment and b) we should get whatever methods are needed in place in jupyter_client for 6.0 so that these parallel methods aren't required post 6.0.
I think it makes a lot of sense to develop patches purely in an asynchronous manner to get things to work, but before merging we need to provide backward compatibility with sync methods. In #402 I tried to do this (likely imperfectly).
If we want a smooth transition then yes, we could do that. The hard part is that it has to happen for each step of the stack. One direction is easy; there are some difficulties in how some packages call each other. This discussion is one of the topics for a meeting next week in DC. I'll see if I can work on some of this.
Yes, and something at some point needs to make the decision to start async. I forced it in my PR mostly for testing - sorry if that was confusing.
Ok - thanks for confirming my understanding. Was kinda hoping for a different response - but you're correct, there are extra steps.
Couldn't we provide async methods in the next 5.x builds of Notebook and JupyterClient? This would provide a vehicle for those Python 2 clients to get to async. Then 6.x would be the same, with the mantra being "you should move to async methods", and finally, 7.x pulls the plug on the sync methods. To restate the plan:

- 5.x: provide sync and async (where async is optional)
- 6.x: same, with the message that callers should move to the async methods
- 7.x: remove the sync methods

Does the switch to async methods raise any tornado/asyncio compatibility concerns?
In general, no, as long as you use tornado >= 5. I think there can be complications working with tornado 4 and asyncio, but it's okay with tornado 5 or 6. The main thing is dropping the deprecated

I think getting async to work in a smooth transition that works for both Python 2 and 3, tornado and asyncio, is going to be tricky. Part of the issue is that the notebook repo has already dropped Python 2 support, so any further Python 2 development there is going to have to happen in a forked manner on a backport branch. I think developing the async API is also the right time to do the same for this repo. I don't think we have the resources to devote to maintenance of two active development branches of the notebook repo (we aren't currently able to keep up with just one).

Personally, I don't believe users on Python 2, which is itself EOL in 9 months, have a reasonable expectation of new features from any software. I think Python 2 should only be getting security and major bugfix backports at this point. We already took this leap in IPython and JupyterHub, and have done so in notebook master as well. async is one of the very biggest benefits of Python 3, and I think it is extremely reasonable to require Python 3 for such features.

Here's what I would do:
This should be maximally backward-compatible (all synchronous implementations will continue to work, and all consumers of synchronous APIs continue to work), and require no two-stage release for a breaking change.
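One way to read the backward-compatibility goal (my paraphrase, not the exact proposal): callers always treat the result as something they can yield, and plain synchronous implementations get wrapped transparently. A sketch with a made-up helper name:

```python
# Sketch of the backward-compatible calling convention; as_future is a
# made-up helper name, not an existing jupyter_client API.
from tornado import gen
from tornado.concurrent import Future


def as_future(result):
    """Wrap a plain return value in a resolved Future; pass Futures through."""
    if isinstance(result, Future):
        return result
    f = Future()
    f.set_result(result)
    return f


@gen.coroutine
def start_kernel_for_request(kernel_manager, **kwargs):
    # Works whether start_kernel returns its result directly (today's
    # synchronous managers) or a Future (an async-aware subclass).
    result = yield as_future(kernel_manager.start_kernel(**kwargs))
    raise gen.Return(result)
```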
Great point regarding the impending EOL of Python 2 and expectations of new features. That keeps slipping my mind. Ok, here's how I'd like to proceed based on your responses (both @minrk and @Carreau).
These changes will go into the 6.0 release, after which we can do the same for Notebook and its 6.0 release. That change will apply use of the async methods to the handlers; those handlers, presumably, will be converted to coroutines.

I hesitate to comment regarding other services in Notebook outside of the kernels and whether they need similar changes. Does this sound right?
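For the handler side, "converted to coroutines" would look roughly like the sketch below. It is simplified and in the spirit of the notebook's kernel handlers (and assumes tornado < 6 for `gen.maybe_future`); it is not the actual handler code.

```python
# Simplified sketch of a kernel-start handler as a coroutine; not the
# actual notebook handler code.
import json
from tornado import gen, web


class KernelStartHandler(web.RequestHandler):
    @gen.coroutine
    def post(self):
        km = self.settings['kernel_manager']
        # Yielding here keeps the server free to accept other kernel-start
        # requests while this one is still provisioning.
        kernel_id = yield gen.maybe_future(km.start_kernel())
        self.set_status(201)
        self.finish(json.dumps({'id': kernel_id}))
```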
I forgot to mention that perhaps the best plan to move to async is @takluyver's new async-native replacement APIs for jupyter_client. Maybe it's better to try to finish moving notebook to those (jupyter/notebook#4170) and get EG onto those APIs as well. Then we don't have to try to figure out a safe way to transition these APIs.
Yes, I'm aware of @takluyver's kernel provider proposal and, in principle, agree with the approach since EG's process proxies are essentially providers themselves. However, there are items that require much further discussion. Once we move to this kind of model (where we essentially have pluggable kernel managers and the MultiKernelManager facilitates lifecycle actions), then most of EG will probably go away - although I think it will remain for things like multi-tenancy, HA, provider repo, etc.

At any rate, I think the kernel provider stuff warrants further design discussion and I don't know how that kind of thing gets facilitated across the projects. I'm all for working on this asap and would be perfectly happy to talk to you during your working hours - but that's the kind of thing that needs to happen verbally.

In the meantime, EG users require async kernel startup (due to its multi-tenant behavior, coupled with long startup times) and it seems the most expedient way to get there without breaking clients is with dual methods in jupyter_client and notebook. Do you mind if we attempt the dual-method approach? I'd rather not expend the effort if things aren't going to get merged (assuming the changes are within reason).
I definitely don't mind if you want to give it a go. I just wanted to make sure we aren't forgetting the main path forward in the new APIs, which has been underway for some time.
This reverts commit 5103959. The plan is to introduce parallel `start_kernel_async` methods on `KernelManager` and `MultiKernelManager` for now.
(force-pushed from f8fa9f9 to 55483d7)
@minrk - I've taken a step back and am only focusing on kernel startups. Restarts are a bit dicey when we're talking about parallel methods, and shutdown is more tricky. Startup is where the biggest bang for the buck is located, so I reverted the previous commit and added the update. If necessary, I'd be happy to create a different PR and/or squash these commits and clean up the description - but wanted to run this past you first.

I've introduced parallel methods on `KernelManager` and `MultiKernelManager`. I'd like to share how the `MultiKernelManager.start_kernel_async` method gets invoked, as this may seem strange, but I felt it the best way for these to co-exist without having to introduce version dependencies. Here are the proposed changes in Notebook (PR pending): kevin-bates/notebook@bb40684

I chose to make the use of the async method configurable because just checking if the method exists is not sufficient.

Btw, restarts will wind up using the synchronous start method for their start phase - which seems fine. I'm also (still) hoping these changes could be back-ported since they work fine in Python 2.7 envs.
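The configurable dispatch described here could look something like the sketch below; the trait name, class name, and helper method are hypothetical, not the pending notebook change.

```python
# Hypothetical sketch of opting into the async start path via configuration;
# the trait and class names are made up, not the pending notebook PR.
from tornado import gen
from traitlets import Bool
from traitlets.config import LoggingConfigurable


class KernelStarter(LoggingConfigurable):
    use_async_kernel_start = Bool(
        False,
        help="Use MultiKernelManager.start_kernel_async when available. "
             "An explicit switch is used because merely checking that the "
             "method exists is not sufficient.",
    ).tag(config=True)

    @gen.coroutine
    def start(self, multi_kernel_manager, **kwargs):
        if self.use_async_kernel_start and hasattr(multi_kernel_manager, 'start_kernel_async'):
            kernel_id = yield multi_kernel_manager.start_kernel_async(**kwargs)
        else:
            kernel_id = multi_kernel_manager.start_kernel(**kwargs)
        raise gen.Return(kernel_id)
```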
This is proving difficult. We definitely need restarts supported, but that would require deeper plumbing. I think I'm going to pause this PR and take a different approach via subclasses - @minrk had suggested this in a brief comment previously. I view the tricky parts of that to be the restarter code and the multiple layers of hierarchy involved, without blindly duplicating everything. I'm going to mark this as WIP, but will likely just close this PR if a sub-classed approach proves practical.
Inspired by @Carreau's work in #402, this pull request essentially applies the same model used in Notebook for supporting coroutines, with appropriately placed yield statements, in order to start multiple kernels simultaneously within the same server instance. The reason I needed to adopt this approach is that Enterprise Gateway has a long-standing issue with concurrent kernel startup requests and runs on both Python 2.7 and 3.x.
Since EG supports running remote kernels launched by pluggable resource managers (Hadoop YARN, Kubernetes, Docker Swarm, etc.), kernel startup requests can take anywhere from 5 to 15 seconds, depending on the platform. Moreover, because EG is a headless web server, essentially exposing "Kernel as a Service" behaviors, it is not unusual for simultaneous startup requests to occur - unlike typical single-user Notebook servers.
These changes have been tested with the applicable changes in Enterprise Gateway, and the previously "single-threaded" (serialized) kernel-start POSTs are now eliminated.
I have also run the Notebook nosetests using the applicable dev build of jupyter_client. In addition, I have used a previous version of EG (our recently released beta) running against jupyter_client. No issues were found using either of those "clients", so backwards compatibility appears to be intact.
Because local kernels are so fast to start, it is difficult to determine whether these changes are all that are necessary for concurrent starts outside of EG. EG uses poll loops for discovery and startup confirmation, and it was in this area where blocking was occurring. This, coupled with large start windows, makes it much easier to determine that the blocking has been addressed. That said, I believe there are similar issues on shutdown (see the timing loop in `finish_shutdown()`) where further progress can be made. However, kernel shutdowns are not nearly as urgent for end-users.

Given there don't appear to be compatibility issues, I'm hoping these changes can be merged to master and 5.x since EG has existing customers that would like this behavior.
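To make the shutdown point concrete, the timing loop in `finish_shutdown()` could in principle get the same treatment. Below is a rough sketch (not part of this PR) of a non-blocking wait; the function name is made up, with parameter names borrowed from that method.

```python
# Rough sketch only; not part of this PR. A non-blocking variant of the
# kind of wait loop found in finish_shutdown().
from tornado import gen


@gen.coroutine
def wait_for_kernel_exit(kernel_manager, waittime=5.0, pollinterval=0.1):
    """Wait for the kernel process to exit, yielding to the IOLoop between
    polls instead of blocking in time.sleep()."""
    waited = 0.0
    while kernel_manager.is_alive() and waited < waittime:
        yield gen.sleep(pollinterval)
        waited += pollinterval
    # The caller decides whether to escalate (e.g. kill) if it's still alive.
    raise gen.Return(not kernel_manager.is_alive())
```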