Spawning many kernels may result in ZMQError #487
Hi @JohanMabille - this is exactly what we do in Jupyter Enterprise Gateway: let the kernel (actually a kernel wrapper) determine its ports and return the connection info back to the server, which is listening on a port it passed to the wrapper (our term is kernel launcher) at launch time. This was primarily done to deal with kernels launched on remote hosts, but we've also had users, faced with this classic Jupyter race condition, use it for local kernels as well. We use kernel launchers because we didn't want to force kernel updates for anyone using EG and remote kernels. I haven't had a chance to look into @takluyver's kernel nanny, but I think the launchers are a similar concept. The EG launchers do a bit more than just manage the ZMQ connections. They also create a Spark context (if requested) and are used to listen for interrupt and shutdown requests (since signals don't span node or user-id boundaries, and graceful shutdowns don't always happen in the real world). I think the connection management and interrupt/shutdown portions could be generally useful.
Thanks for the details @kevin-bates. @martinRenou is implementing a backward-compatible workaround for jupyter_server's mapping kernel manager, but I think this is a critical-enough issue that it should trigger a conversation about changing what kernelspecs should look like (i.e. not having a connection file set by the client, but a single port specified for the handshake socket). Another situation in which this occurs is when JupyterLab attempts to recreate all the kernels corresponding to the open notebooks in a layout. All the kernels are opened at once and step on each other's toes with respect to ports because of that flawed logic.
Note that I don't see the behaviour that @JohanMabille describes on OSX with voila, which could be due to OSX not returning the same 'free' port. However, I do see the issue @SylvainCorlay describes. These may be different issues, but I'm not 100% sure.
@SylvainCorlay - yes, this is identical to what we already do in EG, so I'd like to make sure things are done in a compatible way, and I'd love to see this happen. At any rate, should the kernel provider proposal not get "ratified", I'd like to see this abstraction layer get introduced at the Subprocess level, so folks can bring their own kinds of launchers and lifecycle management. This could be done in jupyter_client and retain the existing architecture. Fundamental to all of this, however, is async kernel management. This needs to get done in the new Jupyter server!
Interesting. I am not sure about different kernels having different launch mechanisms. As a xeus co-author, I like the idea of a kernel simply being an executable implementing one handshake mechanism. Ideally, people should also be able to come up with alternate implementations of jupyter_client (e.g. in C++) that could launch and talk to any kernel.
This is how IPython parallel works as well (although all kernel sockets connect, not just one connecting and then binding the rest). I agree that letting kernels pick ports is best; we just need to figure out a backward-compatible mechanism to make the transition in jupyter_client. Having a listening socket in the client can make things complicated for remote kernels, since we are introducing a new relationship between a connectable process (the kernel manager) and a process which needs to know where that is (the kernel), but the local case should not be complicated. #490 is a great workaround for the short term, which should help in very close to all cases.
Agreed on the need to find a smooth transition between the two approaches. I think that the case of remote kernels is somewhat orthogonal, since kernels (as processes exposing zmq sockets) probably still have a local client. The remote handshaking mechanism can be completely different, I presume.
I don't see any reason to close socket A. Just keep it open for the next kernel that will start. Hopefully the client that starts the kernel knows whether that kernel supports the handshake; I'm going to assume it does. If so, we can just use a convention of a placeholder value in the connection file, or an extra field. I don't think there would be any problem with backward compatibility.
I guess we can add a field to the metadata section of the kernelspec (just like what we did for the debugger).
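As a rough sketch of what such a check could look like on the client side (the `supports_handshake` key is an invented name here, not an agreed-upon field):

```python
from jupyter_client.kernelspec import KernelSpecManager

spec = KernelSpecManager().get_kernel_spec("python3")
if spec.metadata.get("supports_handshake", False):  # hypothetical key
    # Launch with the handshake mechanism: the kernel picks its own ports.
    ...
else:
    # Fall back to writing a classic connection file with preassigned ports.
    ...
```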
The problem is that you can't bind to a port that's already bound. So the kernel can't take the port until you release it, and once you release it, another process (like a parallel notebook execution process) might bind it and hold it. You can release and reconnect while it's in the TIME_WAIT state.
I know, but that's not what A is for. The kernel won't bind to A. The description is to bind A in the client, and then the kernel(s) will connect to (HOST, port); a listening socket can accept many connections. The point, if I understand correctly, is this:

A is never the port that is used in the end; it's just a backchannel for a handshake, so the kernel can tell the client which port to use. Hence I don't see any reason to close A. We could of course pick a random port and listen on a new one for each kernel, but why do it? The only reasonable motivation would be to use the port number as an indicator of which kernel you are waiting on to start.
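A minimal pyzmq sketch of this backchannel idea (socket types and the message shape are assumptions for illustration, not the actual protocol):

```python
import zmq

ctx = zmq.Context.instance()

# Socket "A": the client binds it once and simply keeps it open; every
# kernel that starts connects to it, so it never has to be closed and
# reopened between kernel launches.
handshake = ctx.socket(zmq.PULL)
handshake_port = handshake.bind_to_random_port("tcp://127.0.0.1")

# ... launch each kernel, passing it (HOST, handshake_port) ...

# A starting kernel would bind its own channel ports, then connect to A
# and report what it chose:
kernel_side = ctx.socket(zmq.PUSH)
kernel_side.connect(f"tcp://127.0.0.1:{handshake_port}")
kernel_side.send_json({"shell_port": 51234, "iopub_port": 51235})

# The client learns the real ports from the handshake, not from a file:
connection_info = handshake.recv_json()
```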
Ahh, I see. Yes, that could work. It would be a breaking change for existing kernels, though. Do you see a backwards-compatible way to support this behavior in jupyter_client? Maybe we could try the new behavior and fall back to the old if the kernel dies while we hold the port?
You can probably add a key in the kernelspec to say "yes, I support that", and a key in the connection file for how to pass the value back.
I recommend using a single port for all kernels. In EG we currently use the other approach (an ephemeral port for each kernel), as it served our needs nicely at the time. However, if you want the server to be in, say, Kubernetes, yet the kernels on another network (say an on-prem Hadoop YARN cluster), you want to be able to expose that single response address from a k8s service, and ephemeral ports aren't conducive to that. Yes, I think conveying this intention via the kernelspec's metadata stanza is the way to go. EG conveys the response address as a (templated) parameter to the kernel launcher, but embedding it in the connection file would be fine for local kernels.
That all makes sense to me. Should we start with the local kernel implementation and review so it's in line with the EG approach (perhaps for a more unified interface on the kernel managers)?
I would be in favor of doing anything that makes EG's path forward easy, as they seem to be the ones with the most experience in this domain. One of the questions we did not answer was: should there be one handshake port per kernel, or a single port shared by all kernels?
I am not opinionated on the choice between having a different ephemeral handshake port per kernel vs a single one. The difference may actually only impact the client implementation, and not the kernel: instead of only supporting one of these binding schemes, the client could also support the other, while the kernel simply connects to whatever (host, port) it is given.
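A sketch of the two client-side options with pyzmq (the fixed port number below is an arbitrary example, not a reserved Jupyter port):

```python
import zmq

ctx = zmq.Context.instance()

# Variant 1: a fresh ephemeral handshake port for every kernel launch.
per_kernel = ctx.socket(zmq.PULL)
per_kernel_port = per_kernel.bind_to_random_port("tcp://127.0.0.1")

# Variant 2: one well-known handshake port shared by all kernels.
shared = ctx.socket(zmq.PULL)
shared.bind("tcp://127.0.0.1:6053")
```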
The problem will probably be more about dealing with old clients of the kernel protocol than old kernels. How can a client that only supports the old mechanism deal with new kernels implementing that handshake? Another way to put it: how can kernels advertise two ways of launching them, one being the new handshake, and the other for clients that do not support the new approach?
Wouldn't kernels need to tolerate both approaches and, if passed a handshake port, use the newer behavior? Or, I want to say Matthias mentioned this above, embed the handshake port in the connection file and have the kernel recognize the difference to trigger the newer behavior. But yes, your use case is an older jupyter_client that doesn't know to look at the kernelspec to determine this capability. Hmm, but if the embed-in-connection-file approach is used, then that older jupyter_client would simply launch the kernel using "classic mode", and since kernels know how to deal with each, it would just work as it does today.
I think so long as we pass what port this particular client is using, it's fine. IMO, one port per kernel for the handshake seems the simplest, with the least amount of contract required (we're really unlikely to run out of ports in any reasonable use case).

Passing just the handshake port would be insufficient, I believe, since we still need to securely pass a cryptographic key to the kernel to avoid port hijacks. Since it's a risk to pass this in plain text on a process command line, we'll likely still need the connection file, just with a new field. I think this would work just fine. It would mean we'll have a long overlap period between the two handshake behaviors, but that should be manageable.
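For instance, a connection file along these lines would keep the key off the command line (the `handshake_port` field name and its placement are assumptions, not a settled spec):

```python
import json

connection_info = {
    "transport": "tcp",
    "ip": "127.0.0.1",
    "signature_scheme": "hmac-sha256",
    "key": "a0436f6c-1916-498b-8eb9-e81ab9368e84",  # delivered via file, not argv
    "handshake_port": 6053,  # hypothetical new field
    # The classic *_port fields could be absent (or zero) when the kernel
    # is expected to pick its own ports and report them over the handshake.
}

with open("kernel-handshake.json", "w") as f:
    json.dump(connection_info, f, indent=2)
```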
But will all your ports be open? I'm mostly worried about locked-down systems where you would need to tweak firewalld or kubernetes. Isn't random port binding after startup an issue? As for how a client would detect whether a kernel supports it: add a key in the spec file saying "I support handshake". And if the kernel sees ports in the connection file, then it's likely the client didn't support the handshake. So I think this is not an issue.
Yes, it could be an issue, though usually there's a port range left available to bind against, or dynamic network configuration options. But to help make the behavior clear and easy to implement, I think we need a clear contract that allows for specific port binding and random port binding, independent of the communication port.

A) current behavior: if provided specific ports (e.g. shell_port), a kernel tries to use that port and fails on bind failure.

This way a kubernetes cluster that has dedicated ports can assign them if needed as per C), but it can also wait to try to connect until it gets a message on the handshake rather than polling until available. Local processes can let the kernel choose a port rather than have them preassigned, so you don't get port collisions. Thus, for backwards compatibility:
If this seems like a reasonable contract, I can document this up more formally in a PR to our docs. Any concerns uncovered by this?
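A sketch of what the kernel-side half of such a contract could look like (the names here are mine, not a proposed API):

```python
import zmq


def bind_channel(ctx: zmq.Context, requested_port: int) -> tuple[zmq.Socket, int]:
    """Bind one kernel channel following the contract sketched above."""
    sock = ctx.socket(zmq.ROUTER)
    if requested_port:
        # A) current behavior: a specific port was requested, so a bind
        # conflict surfaces immediately as zmq.ZMQError instead of the
        # kernel silently stealing another kernel's port.
        sock.bind(f"tcp://127.0.0.1:{requested_port}")
        return sock, requested_port
    # No port preassigned: let the OS choose a free one, to be reported
    # back over the handshake channel so the client learns the real port.
    port = sock.bind_to_random_port("tcp://127.0.0.1")
    return sock, port
```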
One thing to consider for kubernetes / distributed kernel launches is that the handshake port assumes there's a node holding open the port connections for the kernels. @kevin-bates, how do you handle it if the server hosting the handshake port rotates?
I would imagine you only need the handshake for initialization, and then you use ports defined in the ConfigMap for the pod, so anyone can connect after that point?
@kevin-bates @Carreau friendly ping on the thread. I'd be happy to help improve things here, but want some consensus on changes before any PRs are made.
I'm sorry @MSeal, I didn't see this question from 9 days ago until now...
We don't. The handshake port is only needed during the kernel's startup, after which communication with the kernel is strictly through ZMQ. We (EG) do have a sixth port that is listened on by the kernel "launcher" for handling interrupt requests, since signals don't span these kinds of boundaries and not all kernels we support also support message-based interrupts. (We also send a separate shutdown request to let the launcher know it should exit.) Since EG already has a means for all this, I guess my only concern is that changes be made in a way that they can be optional, both in terms of kernelspec configuration and method overrides. So my hope is that new (and optional) functionality is fine-grained enough that existing subclasses don't break. We essentially use B, where the content sent over the handshake port becomes the connection info. We don't use a connection file on the server when using remote kernels; it's all in memory. Should there be an issue with the kernel, it's unlikely it will be able to respond, and we have a separate discovery and timeout mechanism. We use the discovery mechanism to determine when the kernel should have started (since that can take a while depending on the resource manager) while monitoring the handshake port. If we have not received the handshake/connection info within a specified timeout window, we fail the startup. I think this would be a good addition to the ecosystem, and it seems like you could address lots of frustrations just by extending ipykernel with this functionality.
I think both B and C are fine. I don't have the cycles currently to do an implementation, and will trust you on what parts need implementing and what the implementation would be. If there is a pull request and it's merged on master, I'm happy to try it and push for a release, even with this as a "preview", in order to get the ecosystem to settle.
Great, thanks for the responses. The EG behavior matches what I'd expect; good to know more of those details. I'll get started on some PRs for next week then, and we can try out some of the behavior to see how well it works.
Sorry I haven't gotten a change up for this yet. It's still in my queue of things to tackle (I've been trying to help nbconvert get 6.0 done).
For anyone that is using nbclient, here is a workaround that serializes kernel startup with a file lock:

```python
import asyncio
from pathlib import Path

import nbformat
import portalocker
from nbclient import NotebookClient


def run_nb(filename: Path = Path("some_notebook.ipynb")):
    nb = nbformat.read(filename, as_version=4)
    executor = NotebookClient(
        nb,
        kernel_name="python3",
        resources={"metadata": {"path": filename.parent}},
    )
    # Prepare the kernel manager ourselves so that kernel startup (the
    # window in which ports are chosen and bound) happens under the lock.
    executor.km = executor.create_kernel_manager()
    with portalocker.Lock("jupyter_kernel.lock", timeout=300):
        asyncio.run(executor.async_start_new_kernel())
        asyncio.run(executor.async_start_new_kernel_client())
    return executor.execute()
```
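Note that this only serializes the kernel startup window across processes that share the lock file; an unrelated process could still grab a port in that window, but it avoids the kernel-vs-kernel collisions described in this issue, and execution itself still runs concurrently once each kernel is up.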
Spawning many kernels in a short span of time may result in a ZMQError, because one of the kernels tries to use a port already in use by another kernel.

This is due to the current implementation of jupyter_client: after free ports have been found, they are dumped into a connection file that is passed to the kernel that the client will start. The problem is that we might search for free ports for a new kernel after creating the first connection file but before the first kernel has started (when restoring a session in JupyterLab, or spawning multiple kernels quickly in Voilà, for instance). Since the first kernel has not started yet, its ports are still free, and jupyter_client might write a connection file for the next kernel with the same ports as in the first connection file. Therefore two kernels will attempt to use the same ports.

Even if we can fix this issue in JupyterLab and Voilà (by searching for free ports for all kernels first, and then writing all the connection files at once), this does not prevent other applications (unrelated to the Jupyter project) from starting and using the ports written in a connection file before the kernel has started.
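A minimal standard-library sketch of the race (illustrative only; this mirrors what happens conceptually, not jupyter_client's exact code):

```python
import socket


def find_free_port() -> int:
    s = socket.socket()
    s.bind(("127.0.0.1", 0))  # ask the OS for any free port
    port = s.getsockname()[1]
    s.close()                 # released again: nothing holds it now
    return port


# Neither kernel has started (and bound its ports) yet, so two
# back-to-back searches can hand out the very same port, and any
# unrelated process may also claim it before the kernel does.
port_for_kernel_1 = find_free_port()
port_for_kernel_2 = find_free_port()
```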
A solution would be to always let the kernel find free ports and communicate them back to the client (a kind of handshake pattern):

- the client binds a socket on a known port and passes that port to the kernel at launch time, instead of a full set of preassigned ports;
- the kernel finds free ports for its channels, binds them, then connects back to the client's socket and sends the connection information.
I am aware that this requires significant changes in the kernel protocol and the implementation of a lot of kernels, but I do not see a better solution to this issue.
cc @vidartf and @martinRenou who have been discussing this issue in Voila