
Further improvements to pending kernels management #732

Merged
merged 15 commits into from
Jan 14, 2022

Conversation

Zsailer
Member

@Zsailer Zsailer commented Jan 5, 2022

Two changes in this PR, summarized below. This work is a follow-up to #712

Make shutdown_kernel a pending state

Makes shutdown_kernel also show the kernel in a pending state. Since the KernelManager is still managing a process while the kernel is shutting down, which might take a long time, this phase should be reported as pending too.
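The "pending" contract can be illustrated with a minimal, self-contained sketch (the class and timing below are illustrative stand-ins, not the actual jupyter_client implementation): the manager exposes a ready future that stays unresolved while a startup or shutdown is in flight.

```python
import asyncio


class MiniKernelManager:
    """Illustrative stand-in: reports pending state via a `ready` future."""

    def __init__(self):
        self._ready = None

    async def start_kernel(self):
        loop = asyncio.get_running_loop()
        self._ready = loop.create_future()
        # Simulate a slow startup (or shutdown) resolving in the background.
        loop.call_later(0.01, self._ready.set_result, None)

    @property
    def ready(self):
        """A future that resolves when the kernel has completed its startup or shutdown."""
        return self._ready


async def main():
    km = MiniKernelManager()
    await km.start_kernel()
    pending_before = not km.ready.done()  # still pending right after the call returns
    await km.ready                        # wait for the pending phase to finish
    pending_after = not km.ready.done()
    return pending_before, pending_after


before, after = asyncio.run(main())
print(before, after)  # True False
```

The same future doubles as the pending indicator for shutdown: the manager simply replaces it with a fresh unresolved future when shutdown begins.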

Make KernelManager only responsible for reporting kernel pending state

Also, after working with pending kernels a bit, I believe it makes the most sense to make the KernelManager responsible for reporting the kernel's pending state, while the layer that sits above the KernelManager, e.g. the MultiKernelManager, is responsible for reacting to that state.

This essentially means removing all self.ready checks in the individual KernelManager. For example,

# Shutdown is a no-op for a kernel that had a failed startup
if self._ready.exception():
    return

should be removed; rather, a MultiKernelManager whose use_pending_kernels attribute is True would determine if shutdown is a passthrough, etc.

Another example is

if not self._ready.done():
    raise RuntimeError("Cannot restart the kernel. Kernel has not fully started.")

The MultiKernelManager would be responsible for handling what to do when a kernel restart happens while a kernel is in a pending state.

@blink1073
Contributor

This seems reasonable to me. Zach and I discussed this today and agreed that we should release a version of jupyter_server that supports these changes, get the downstream tests here passing, and then publish a minor release with these refinements.

Member

@kevin-bates kevin-bates left a comment


Thanks @Zsailer - this looks good. I just had a question regarding the now method in the tests.

It might be good to also update the help-string of the ready property. Perhaps something like:

"""A future that resolves when the kernel has completed its startup or shutdown"""

async def now(awaitable):
    """Use this function to ensure that this awaitable
    happens before other awaitables defined after it.
    """
    (out,) = await asyncio.gather(awaitable)
    return out
Member


I guess I don't understand why this method is necessary - especially for a single-item gather. Isn't awaiting the awaitable sufficient to prevent the execution of follow-on awaitables in the same execution path?

Member Author


Great question, @kevin-bates.

That's what's supposed to happen; however, I'm seeing some weird behavior that I think is coming from the gen_test decorator. Basically, I'm not seeing the awaits happening in the order they are defined.

Weirdly, if I replace the async/await syntax with the old yield syntax, these tests work fine (no now needed). Again, this suggests some weird interference from Tornado's async testing. I'm not sure how to fix this.

Member Author


By wrapping each awaitable in a gather call, the awaits happen in the order they are written. Super strange!

Member


Thanks Zach. That is interesting.

Member Author


Looks like nested coroutines can be a bit difficult in unit tests. It appears that inner coroutines never get scheduled even when the outer coroutine is awaited. Adding ensure_future in multiple places in the tests ensures that these coroutines get scheduled and tests pass again.
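A small, standalone demonstration of that scheduling behavior (this is not the actual test code): a bare coroutine object does nothing until it is awaited, while wrapping it in asyncio.ensure_future schedules it on the event loop right away, so a single trip through the loop lets it run.

```python
import asyncio

order = []


async def inner():
    order.append("inner ran")


async def outer():
    # ensure_future schedules `inner` on the loop immediately...
    task = asyncio.ensure_future(inner())
    # ...so yielding control once is enough for it to execute.
    await asyncio.sleep(0)
    order.append("outer done")
    await task  # confirm it finished


asyncio.run(outer())
print(order)  # ['inner ran', 'outer done']
```

Without the ensure_future call, `inner()` would only run at the point it is explicitly awaited, which is the kind of ordering surprise described above.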

@Zsailer Zsailer changed the title Stopping kernel is a pending kernel Fuether improvements to pending kernels managment Jan 8, 2022
@Zsailer Zsailer changed the title Fuether improvements to pending kernels managment Further improvements to pending kernels managment Jan 8, 2022
@Zsailer
Member Author

Zsailer commented Jan 10, 2022

It looks like all tests are (finally) passing except the downstream tests. For that, jupyter-server/jupyter_server#654 is required.

I'll work on getting that merged and then ask for a final review here.

@Zsailer
Member Author

Zsailer commented Jan 11, 2022

I've also added some documentation for pending kernels to this PR

@Zsailer
Member Author

Zsailer commented Jan 11, 2022

This PR evolved a bit. I've added documentation defining more clearly what a "pending" kernel is.

A pending kernel is a kernel that is currently in the process of "starting" or "shutting down".

The MultiKernelManager will block any subsequent action that attempts to affect a pending kernel (e.g. "interrupt", "restart", "shutdown") by raising a RuntimeError.

Remember, pending kernels are opt-in, so this feature is not enabled by default.
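That blocking behavior can be sketched with an illustrative stand-in class (the names below are hypothetical, not the real MultiKernelManager API):

```python
class MiniMultiKernelManager:
    """Illustrative stand-in for the MultiKernelManager's pending-kernel guard."""

    def __init__(self, use_pending_kernels=False):
        # Pending kernels are opt-in, mirroring the behavior described above.
        self.use_pending_kernels = use_pending_kernels
        self._pending_kernels = {}  # kernel_id -> pending future (placeholder here)

    def restart_kernel(self, kernel_id):
        # With pending kernels enabled, actions against a kernel that is still
        # starting or shutting down are rejected with a RuntimeError.
        if self.use_pending_kernels and kernel_id in self._pending_kernels:
            raise RuntimeError("Kernel is in a pending state. Cannot restart.")
        return "restarted"


mkm = MiniMultiKernelManager(use_pending_kernels=True)
mkm._pending_kernels["abc123"] = object()  # placeholder for a pending future
try:
    mkm.restart_kernel("abc123")
except RuntimeError as e:
    print(e)  # Kernel is in a pending state. Cannot restart.
```

With use_pending_kernels left at its default of False, the same call would proceed normally, which is why the feature is safe to ship disabled by default.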

@Zsailer
Member Author

Zsailer commented Jan 13, 2022

I'm not sure why the nbclient downstream tests are failing. When I run these locally, they run successfully.

@Zsailer
Member Author

Zsailer commented Jan 13, 2022

I kicked the downstream tests in #731 (sorry @blink1073 😅 ) and see the exact same errors, so they are unrelated to this PR.

This should be ready to merge. 🚀

@davidbrochart
Member

The nbclient failures should be fixed by jupyter/nbclient#190.
I could release nbclient if you want to be sure.

@davidbrochart
Member

nbclient v0.5.10 is on PyPI with the fix.

@Zsailer
Member Author

Zsailer commented Jan 13, 2022

Thanks, @davidbrochart! I really appreciate it!

@Zsailer
Member Author

Zsailer commented Jan 14, 2022

Waiting on jupyter-server/jupyter_server#662 to be merged and released to fix the downstream tests.

@Zsailer
Member Author

Zsailer commented Jan 14, 2022

All green!

@Zsailer
Member Author

Zsailer commented Jan 14, 2022

This was discussed extensively at the last Jupyter Server meeting: jupyter-server/team-compass#15 (comment)

Merging away!

@Zsailer Zsailer merged commit 4428715 into jupyter:main Jan 14, 2022
Member

@kevin-bates kevin-bates left a comment


Just had a couple of suggestions and comments, but this looks good.


*Added in 7.1.0*

In scenarios where an kernel takes a long time to start (e.g. kernels running remotely), it can be advantageous to immediately return the kernel's model and ID from key methods like ``.start_kernel()`` and ``.shutdown_kernel()``. The kernel will continue its task without blocking other managerial actions.
Member


Suggested change
In scenarios where an kernel takes a long time to start (e.g. kernels running remotely), it can be advantageous to immediately return the kernel's model and ID from key methods like ``.start_kernel()`` and ``.shutdown_kernel()``. The kernel will continue its task without blocking other managerial actions.
In scenarios where a kernel takes a long time to start (e.g. kernels running remotely), it can be advantageous to immediately return the kernel's model and ID from key methods like ``.start_kernel()`` and ``.shutdown_kernel()``. The kernel will continue its task without blocking other managerial actions.

Comment on lines +102 to +105
@property
def _starting_kernels(self):
    """A shim for backwards compatibility."""
    return self._pending_kernels
Member


Should this be marked as deprecated?

if self._using_pending_kernels() and kernel_id in self._pending_kernels:
    raise RuntimeError("Kernel is in a pending state. Cannot shutdown.")
# If the kernel is still starting, wait for it to be ready.
elif kernel_id in self._starting_kernels:
Member


Is the use of _starting_kernels intentional? Technically, this could also contain kernels that are shutting down now, but only when pending kernels are enabled - so I suspect this was for hinting that, in this case, the kernel can only be starting.

Perhaps we could be more explicit (and save an existence check [nit]) via:

        if kernel_id in self._pending_kernels:
            if self._using_pending_kernels():
                 raise RuntimeError("Kernel is in a pending state. Cannot shutdown.")
            else:  # kernel is still starting, wait for its startup
                kernel = self._pending_kernels[kernel_id]
                try:
                    await kernel
                except Exception:
                    self.remove_kernel(kernel_id)

try:
    await self._starting_kernels[kernel_id]
    await kernel
except Exception:
    self.remove_kernel(kernel_id)
Member


Do we want to update the _pending_kernels list as well here?

@kevin-bates
Member

Hmm - sorry, it looks like this was merged while I was reviewing. The comments aren't earth-shattering so I'll leave it to you to determine if they have any merit.

Thanks for all the work on this @Zsailer - good stuff!

@jtpio
Member

jtpio commented Jan 17, 2022

Looks like the 7.1.1 release of jupyter_client affected the JupyterLab tests: jupyterlab/jupyterlab#11886

The check is defined here:

https://github.com/jupyterlab/jupyterlab/blob/50fa0047d63287005a2c3b1d8f5bfedfe56cde7e/packages/apputils/test/sessioncontext.spec.ts#L453-L469

Pinning to 7.1.0 fixes it in jupyterlab/jupyterlab#11887.

Reporting here for reference.
