zmq.error.ZMQError: Address already in use, when running multiprocessing with multiple notebooks using papermill #511
Thanks for reporting @DanLeePy. There's been a known issue with rare race conditions causing port binding errors, but I wasn't able to get a nice stack trace before, so this is helpful. That being said, check your pyzmq version and try papermill 2.1, as the dependency chain has changed a little since 1.2.1 and some concurrency improvements were made along the way.

@minrk @Carreau @davidbrochart @kevin-bates, pinging you as we've each hit this issue and been involved in efforts to fix race conditions in the inter-process actions. We may want to move this issue to jupyter_client, but I dug through the stack traces and reread the port binding code. Since we're reaching https://github.com/ipython/ipykernel/blob/master/ipykernel/kernelapp.py#L194, the port is being passed down explicitly by jupyter_client. Specifically, I think https://github.com/jupyter/jupyter_client/blob/6.1.3/jupyter_client/connect.py#L95-L104 binds the port and then releases it, leaving port bindings done in parallel free to grab the same port in conflict with one another.

I think we can fix this by having https://github.com/jupyter/jupyter_client/blob/6.1.3/jupyter_client/manager.py#L289-L306 retry on port bind failure with a fresh set of ports; given that the race is rarely lost, an optimistic binding approach seems easiest. The other way to approach this would be to leave the ports unassigned in the connection file, let ipykernel pick them (https://github.com/ipython/ipykernel/blob/master/ipykernel/kernelapp.py#L192), and make that port bind decision visible to jupyter_client, as a pessimistic lock pattern. That's a little less robust across kernels, but also less noisy with retry attempts if port binding conflicts become more frequent. I'd be for the first approach unless you all have additional feedback on this? |
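As a quick illustration of the race (a minimal sketch, not jupyter_client's actual code): picking a free port by binding it and immediately releasing it leaves a window in which another process can claim the same port before the kernel ever binds it.

```python
# Illustrative sketch only (not jupyter_client's implementation): the
# "bind, read the port, close" pattern that creates the race described above.
import socket

def pick_free_port():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))   # ask the OS for any free port
    port = s.getsockname()[1]
    s.close()                  # the port is released here...
    return port                # ...before the kernel process ever binds it

port = pick_free_port()
# Between the close() above and the kernel's later zmq bind, any other
# process doing the same dance can grab `port`, and the loser fails with
# "zmq.error.ZMQError: Address already in use".
```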
I'm wondering if it's not related to jupyter/jupyter_client#487 again. |
It is, I think. But you can see from the link here that the bound port is released before launching the kernel, and then it's assumed the kernel subprocess will still have that port available. This is a race condition, as another process may pick the same port between its release and reacquisition. |
I agree this is jupyter/jupyter_client#487 again. As pointed out there, if we want to solve this cleanly we need some way to swap connect/bind; there are many reasons to want that. The original design was that kernels would be long-lived and persistent, and that many clients might connect/disconnect/reconnect, so kernels act like "servers"; but it looks like in most cases it's the opposite: the clients are the things that manage multiple kernels and are persistent. By swapping connect and bind, the kernels would have to connect to the client, which will often be way easier to set up, since you already know where the client is, and you can grab the port and then create the kernel.json file. Typically HPC systems don't yet know where a kernel will start, for example, so it would be easier for the kernel to connect back to the login node. So my take would be an optional, opt-in field in kernel.json saying that connect/bind are reversed. I'm not too sure how that affects ZMQ Pub/Sub and similar, plus it's also not robust to the connection dropping (we could have the kernel retry the connection regularly). On the other hand, it would also make kernel discovery way easier, as the kernels would be pinging you, the client, back. It would be relatively easy to provide a wrapper launcher that acts as a connect-bind proxy. |
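To make the direction swap concrete, here is a rough pyzmq sketch (purely hypothetical; socket types and addresses are placeholders, not a proposal for the actual protocol):

```python
# Hypothetical sketch of the reversed roles described above: the client
# binds first (so it owns the port), and the kernel connects back to it.
import zmq

ctx = zmq.Context.instance()

# Client side: bind before the kernel exists; the chosen port is known up
# front and could be written into the connection info handed to the kernel.
client_sock = ctx.socket(zmq.ROUTER)
port = client_sock.bind_to_random_port("tcp://127.0.0.1")

# Kernel side (normally a separate process): connect instead of bind.
# A connect can be retried if the connection drops, unlike a failed bind.
kernel_sock = ctx.socket(zmq.DEALER)
kernel_sock.connect(f"tcp://127.0.0.1:{port}")
```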
Hi Matthew thanks for your prompt response! The original pyzmq version was 17.1.2. I tried to update pyzmq to the latest version ,19.0.1, and papermill to 2.1.1. However I still do get the same errors. |
This is exactly what we do to support remote kernels in Enterprise Gateway. Rather than specify the kernel to launch, the kernel.json specifies a kernel launcher and a response address template (filled in when the kernel is started). The kernel launcher knows how to start/embed the kernel it's launching and is responsible for determining the ZMQ ports and reporting them back on the response address, where EG is listening. This way, we don't touch kernel implementations (which we consider a requirement). |
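For illustration, a bare-bones sketch of that launcher pattern (not Enterprise Gateway's actual code; the function name and wire format here are made up): the launcher binds the ZMQ ports itself and then phones the chosen ports home to whatever is listening on the response address.

```python
# Hypothetical sketch of a kernel launcher reporting its connection info
# back to a response address where the gateway/client is listening.
import json
import socket

import zmq

def launch_and_report(response_host: str, response_port: int) -> None:
    ctx = zmq.Context.instance()
    shell = ctx.socket(zmq.ROUTER)
    shell_port = shell.bind_to_random_port("tcp://0.0.0.0")  # launcher owns the bind

    connection_info = {"transport": "tcp", "ip": "0.0.0.0", "shell_port": shell_port}

    # Phone home: report the ports actually bound, so there is no guessing
    # (and no race) on the client side.
    with socket.create_connection((response_host, response_port)) as sock:
        sock.sendall(json.dumps(connection_info).encode("utf-8"))
```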
Then I'm for moving some of that into the core as optional to implement for now. |
Yup, #487 is a fundamental limitation of the parent selecting random kernel-allocated TCP ports in certain circumstances, relying on the kernel's TIME_WAIT implementation to keep the socket available, which it doesn't strictly need to do. We should have a standard way to eliminate this; it was one of the several things the kernel nanny addition was meant to solve, I think. The parent selecting from its own non-kernel-owned port range pool is another way to do it.

For the simplest reliable workaround today, if you are running on localhost and not Windows, you can select the ipc transport instead of tcp:

```python
c.KernelManager.transport = "ipc"
# or `--KernelManager.transport=ipc` on the command line
```

The IPC transport doesn't have any race conditions on port allocation, since we can use kernel UUIDs in paths to eliminate the possibility of collision. |
Speaking of kernel nanny, is it still planned to have https://github.com/takluyver/jupyter_kernel_mgmt as the successor of `jupyter_client`? |
Hi @davidbrochart - Formally speaking, that proposal is still in the JEP process - although it appears to have a majority of approvals. To the best of my knowledge, Jupyter Server still plans on adopting this approach because it believes multiple (and simultaneous) kernel manager implementations should be supported. I think we should view the current jupyter_client implementation as essentially becoming the basis for the default kernel provider implementation. However, given the focus and work put into the jupyter_client package recently, I think we're looking at a longer transition period - which is probably fine. This discussion is probably better suited elsewhere, though. |
I stumbled across this with the same issue in jupyter/jupyter_client#154 (comment). What's the best way to pass the `ipc` transport setting when using papermill? It did not work for me (neither did putting it in a config file). |
Hi @mlucool. I think that's because papermill doesn't derive its configuration from the traitlets config system, so those settings don't get picked up. One approach you could take, which sounds long and arduous but really isn't that bad, is to create your own papermill engine (derived from one of the built-in engines) that supplies a kernel manager configured for the ipc transport. |
Thanks for the great tip! This is very close (and even easier than you said), but I'm still seeing some issues. Given the following files:
Running
Feels like we are one path away from this working! Any further tips @kevin-bates? |
I'm not familiar with the IPC transport, but looking into this for a bit, it appears there needs to be better "randomization" happening when checking for the ipc "port" existence. In the case of `transport == ipc`, the "ports" are just small incrementing integers used to name the socket files, so parallel launches can easily collide. As a workaround, you could simply bring your own kernel-id and connection-file values:

```python
import os
import uuid

from jupyter_client.manager import KernelManager
from jupyter_core.paths import jupyter_runtime_dir


class IPCKernelManager(KernelManager):
    def __init__(self, *args, **kwargs):
        kernel_id = str(uuid.uuid4())
        connection_file = os.path.join(jupyter_runtime_dir(), f"kernel-{kernel_id}.json")
        super().__init__(*args, transport="ipc", kernel_id=kernel_id, connection_file=connection_file, **kwargs)
```

Note that the current code creates the IPC files in the current directory rather than colocated with the connection file, whereas this will colocate the two sets (which is better, IMHO). |
That worked great - thanks!! This seems to fully solve the issue if you can use IPC. |
@kevin-bates I thought you may be interested in this from a protocol POV: the runtime dir did not work great on NFS for the above test; I saw race conditions resulting in failures there. |
@mlucool - thanks for the update.
Each of the jupyter-based directories (runtime, config, data) can be "redirected" via environment variables, namely `JUPYTER_RUNTIME_DIR`, `JUPYTER_CONFIG_DIR`, and `JUPYTER_DATA_DIR`, respectively. |
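For example (a minimal sketch; `/tmp/jupyter-runtime` is just a stand-in for whatever local, non-NFS scratch space is available):

```python
# Point the Jupyter runtime dir at local disk before any connection files
# or IPC socket files are created, so they stay off NFS.
import os

os.environ["JUPYTER_RUNTIME_DIR"] = "/tmp/jupyter-runtime"

from jupyter_core.paths import jupyter_runtime_dir

print(jupyter_runtime_dir())  # -> /tmp/jupyter-runtime
```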
A small improvement on @kevin-bates's version, which also makes sure the runtime directory exists:

```python
import os
import uuid

from jupyter_client.manager import KernelManager
from jupyter_core.paths import jupyter_runtime_dir


class IPCKernelManager(KernelManager):
    def __init__(self, *args, **kwargs):
        kernel_id = str(uuid.uuid4())
        os.makedirs(jupyter_runtime_dir(), exist_ok=True)
        connection_file = os.path.join(jupyter_runtime_dir(), f"kernel-{kernel_id}.json")
        super().__init__(*args, transport="ipc", kernel_id=kernel_id, connection_file=connection_file, **kwargs)
```
|
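For anyone landing here later, a minimal usage sketch of the class above, driving the kernel directly with jupyter_client (the papermill wiring is left out; this assumes the `IPCKernelManager` definition above is in scope and ipykernel is installed):

```python
# Start a kernel over IPC, run one statement, then shut everything down.
km = IPCKernelManager()
km.start_kernel()

kc = km.client()
kc.start_channels()
try:
    kc.wait_for_ready(timeout=60)
    kc.execute("print('hello from an IPC-connected kernel')")
finally:
    kc.stop_channels()
    km.shutdown_kernel()
```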
I am using the papermill library to run multiple notebooks using multiprocessing simultaneously.
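A minimal sketch of this kind of setup (notebook names and pool size are made up for illustration):

```python
# Run several notebooks in parallel with papermill + multiprocessing.
from multiprocessing import Pool

import papermill as pm


def run(notebook):
    pm.execute_notebook(notebook, f"output-{notebook}")


if __name__ == "__main__":
    notebooks = [f"analysis-{i}.ipynb" for i in range(20)]
    with Pool(processes=8) as pool:
        pool.map(run, notebooks)
```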
This is occurring on Python 3.6.6, Red Hat 4.8.2-15 within a Docker container.
However, when I run the Python script, about 5% of my notebooks fail immediately (no Jupyter Notebook cells run) because I receive this error: `zmq.error.ZMQError: Address already in use`
along with:
Please help me with this problem; I have scoured the web trying different solutions, none of which have worked for my case so far.
This error rate of roughly 5% occurs regardless of the number of notebooks I run simultaneously or the number of cores on my machine, which makes it extra curious.
I have tried changing the start method and updating the libraries but to no avail.
The versions of my libraries are:
Thank you!