
zmq.error.ZMQError: Address already in use, when running multiprocessing with multiple notebooks using papermill #511

Open · DanLeePy opened this issue Jun 6, 2020 · 18 comments
Labels: upstream (projects that are not maintained by nteract)

@DanLeePy commented Jun 6, 2020

I am using the papermill library to run multiple notebooks simultaneously via multiprocessing.

This is occurring on Python 3.6.6 and Red Hat 4.8.2-15, within a Docker container.

However, when I run the Python script, about 5% of my notebooks fail immediately (no Jupyter Notebook cells run) because I receive this error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-124>", line 2, in initialize
  File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 469, in initialize
    self.init_sockets()
  File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 238, in init_sockets
    self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
  File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 180, in _bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use

along with:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "main.py", line 77, in run_papermill
    pm.execute_notebook(notebook, output_path, parameters=config)
  File "/opt/conda/lib/python3.6/site-packages/papermill/execute.py", line 104, in execute_notebook
    **engine_kwargs
  File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 49, in execute_notebook_with_engine
    return self.get_engine(engine_name).execute_notebook(nb, kernel_name, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 304, in execute_notebook
    nb = cls.execute_managed_notebook(nb_man, kernel_name, log_output=log_output, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 372, in execute_managed_notebook
    preprocessor.preprocess(nb_man, safe_kwargs)
  File "/opt/conda/lib/python3.6/site-packages/papermill/preprocess.py", line 20, in preprocess
    with self.setup_preprocessor(nb_man.nb, resources, km=km):
  File "/opt/conda/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/execute.py", line 345, in setup_preprocessor
    self.km, self.kc = self.start_new_kernel(**kwargs)
  File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/execute.py", line 296, in start_new_kernel
    kc.wait_for_ready(timeout=self.startup_timeout)
  File "/opt/conda/lib/python3.6/site-packages/jupyter_client/blocking/client.py", line 104, in wait_for_ready
    raise RuntimeError('Kernel died before replying to kernel_info')
RuntimeError: Kernel died before replying to kernel_info

Please help me with this problem; I have scoured the web trying different solutions, but none have worked for my case so far.

This 5% error rate occurs regardless of the number of notebooks I run simultaneously or the number of cores on my machine, which makes it extra curious.

I have tried changing the multiprocessing start method and updating the libraries, but to no avail.

The versions of my libraries are:

papermill==1.2.1
ipython==7.14.0
jupyter-client==6.1.3
pyzmq==17.1.2

Thank you!

@MSeal (Member) commented Jun 6, 2020

Thanks for reporting, @DanLeePy. There's been a known issue with rare race conditions causing port-binding errors, but I wasn't able to get a nice stack trace before, so this is helpful. That said, check your pyzmq version and try papermill 2.1, as the dependency chain changed a little after 1.2.1 and some concurrency improvements were made therein.

@minrk @Carreau @davidbrochart @kevin-bates, tagging you in as we've each hit this issue and been involved in efforts to fix race conditions in the inter-process actions. We may want to move this issue to jupyter_client, but I dug through the stack traces and reread the port-binding code. Since we're reaching https://github.com/ipython/ipykernel/blob/master/ipykernel/kernelapp.py#L194, the port is being passed down explicitly by jupyter_client. Specifically, I think https://github.com/jupyter/jupyter_client/blob/6.1.3/jupyter_client/connect.py#L95-L104 is binding each port and then releasing it, leaving port bindings done in parallel free to grab the same port, in conflict with one another.

I think we can fix this by having https://github.com/jupyter/jupyter_client/blob/6.1.3/jupyter_client/manager.py#L289-L306 retry with a fresh set of ports on port-bind failure: the race is rarely lost, so an optimistic binding approach seems easiest. The alternative would be to leave the ports unassigned in the connection file, let ipykernel pick them (https://github.com/ipython/ipykernel/blob/master/ipykernel/kernelapp.py#L192), and make that choice visible to jupyter_client, as a pessimistic lock pattern. That is a little less robust across kernels, but also less noisy if port-bind conflicts become more frequent. I'd be for the first approach unless you all have additional feedback?
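
To make the race concrete, here is a minimal sketch (not jupyter_client's actual code) of the reserve-then-release pattern described above, plus a hypothetical bind_with_retry wrapper in the spirit of the optimistic-binding fix; pick_free_port and bind_with_retry are illustrative names:

import socket

import zmq


def pick_free_port():
    # The racy pattern: bind to port 0, record the OS-assigned port,
    # then close the socket. Between this close and the kernel's later
    # bind, another process can grab the same port.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    port = s.getsockname()[1]
    s.close()
    return port


def bind_with_retry(sock, ip, max_tries=5):
    # Optimistic binding: if we lose the race, pick a fresh port and retry.
    for _ in range(max_tries):
        port = pick_free_port()
        try:
            sock.bind("tcp://%s:%i" % (ip, port))
            return port
        except zmq.ZMQError:
            continue
    raise RuntimeError("could not bind after %d attempts" % max_tries)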

@davidbrochart

I'm wondering if this is jupyter/jupyter_client#487 again.

@MSeal (Member) commented Jun 6, 2020

It is, I think. But you can see from the link here that the bound port is released before the kernel is launched, and the code then assumes the kernel subprocess will find that port still available. This is a race condition: another process may pick the same port between the release and the reacquisition.

@Carreau (Member) commented Jun 6, 2020

I agree this is jupyter/jupyter_client#487 again.

As pointed out there, if we want to solve this cleanly, we need some way to swap connect/bind; there are many reasons to want that.

The original design was that kernels would be long-lived and persistent and that many clients might connect/disconnect/reconnect, so kernels act like "servers"; but it looks like in most cases it's the opposite: the clients are the persistent things that manage multiple kernels.

By swapping connect and bind, the kernels would connect to the client, which will often be way easier to set up, since you already know where the client is: you can grab the port and then create the kernel.json file.

Typically, HPC systems don't know yet where a kernel will start, for example, so it would be easier for the kernel to connect back to the login node.

So my take would be an optional, opt-in field in kernel.json saying that the connect/bind roles are reversed.

I'm not too sure how that affects ZMQ Pub/Sub and similar, plus it's also not robust to connections dropping (we could have the kernel retry its connection regularly).

On the other hand, it would also be way easier for kernel discovery, as the kernels are pinging you, the client, back.

It would be relatively easy to provide a wrapper launcher that acts as a connect-bind proxy.
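
A minimal pyzmq sketch of the proposed reversal, assuming a ROUTER/DEALER pair stands in for one kernel channel: the client binds first (owning the port with no release window), then the kernel connects back:

import zmq

ctx = zmq.Context.instance()

# Client side binds first; bind_to_random_port is atomic, so there is
# no free-then-rebind window for another process to steal the port.
client = ctx.socket(zmq.ROUTER)
port = client.bind_to_random_port("tcp://127.0.0.1")
# ... write `port` into the connection file, then launch the kernel ...

# Kernel side connects back to the client it was told about.
kernel = ctx.socket(zmq.DEALER)
kernel.connect("tcp://127.0.0.1:%i" % port)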

@DanLeePy (Author) commented Jun 7, 2020

@MSeal

Hi Matthew, thanks for your prompt response!

The original pyzmq version was 17.1.2.

I tried updating pyzmq to the latest version, 19.0.1, and papermill to 2.1.1. However, I still get the same errors.

@kevin-bates

Typically, HPC systems don't know yet where a kernel will start, for example, so it would be easier for the kernel to connect back to the login node.
...
It would be relatively easy to provide a wrapper launcher that acts as a connect-bind proxy.

This is exactly what we do to support remote kernels in Enterprise Gateway. Rather than specifying the kernel to launch, the kernel.json specifies a kernel launcher and a response-address template (filled in when the kernel is started). The kernel launcher knows how to start/embed the kernel it's launching and is responsible for determining the ZMQ ports and reporting them back on the response address, where EG is listening. This way, we don't touch kernel implementations (which we consider a requirement).

@Carreau (Member) commented Jun 7, 2020 via email

@minrk (Contributor) commented Jun 8, 2020

Yup, #487 is a fundamental limitation of the parent selecting random kernel-allocated TCP ports: in certain circumstances it relies on the OS's TIME_WAIT handling to keep the socket available, which it doesn't strictly need to do. We should have a standard way to eliminate this; it was one of the several things the kernel-nanny addition was meant to solve, I think. Having the parent select from its own non-kernel-owned port-range pool is another way to do it.

For the simplest reliable workaround today, if you are running on localhost and not Windows, you can select ipc transport instead of tcp:

c.KernelManager.transport = "ipc"
# or `--KernelManager.transport=ipc` on the command-line

The IPC transport doesn't have any race conditions on port allocation, since we can use kernel UUIDs in paths to eliminate the possibility of collision.
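
For scripts that drive jupyter_client directly rather than through a traitlets Application, a minimal programmatic equivalent (assuming a local, non-Windows host, per the caveat above) would be:

from jupyter_client.manager import KernelManager

km = KernelManager(transport="ipc")
km.start_kernel()
kc = km.client()
kc.start_channels()
# ... execute code through kc, then shut everything down:
kc.stop_channels()
km.shutdown_kernel()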

@davidbrochart

Speaking of the kernel nanny, is it still planned to have https://github.com/takluyver/jupyter_kernel_mgmt as the successor of jupyter_client?

@kevin-bates

Hi @davidbrochart. Formally speaking, that proposal is still in the JEP process, although it appears to have a majority of approvals. To the best of my knowledge, Jupyter Server still plans to adopt this approach, because it believes multiple (and simultaneous) kernel manager implementations should be supported. I think we should view the current jupyter_client implementation as essentially becoming the basis for the default kernel-provider implementation. However, given the focus and work put into the jupyter_client package recently, I think we're looking at a longer transition period, which is probably fine. This discussion is probably better suited to jupyter_client or the JEP itself.

@mlucool (Contributor) commented May 3, 2022

I stumbled across this with the same issue in jupyter/jupyter_client#154 (comment). What's the best way to pass KernelManager.transport to papermill today? Passing

--KernelManager.transport=ipc on the command-line

did not work for me (neither did putting it in jupyter_server_config.py or jupyter_notebook_config.py).

@kevin-bates

Hi @mlucool. I think because papermill doesn't derive from the traitlets Application class, there isn't a driver to exercise the configuration machinery. (I would love to hear otherwise and feel I'm missing something here, but there must be a way to configure configurables independent of a configurable Application.) Papermill also doesn't expose the ability to bring your own KernelManager (like nbclient does via NotebookClient.kernel_manager_class).

One approach you could take, which sounds long and arduous but really isn't that bad, is to create your own papermill engine (derived from NBClientEngine), which would allow you to specify the KernelManager class you wish to use. You could then implement an "IPCKernelManager" that just sets the transport, reference that class name in your papermill engine, and then reference the engine on the papermill CLI, as sketched below. FWIW, we use this approach in Elyra to run local notebook pipelines against an EG via papermill, since we need the ability to pass other args to the kernel launch.
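
A minimal sketch of that engine approach, assuming papermill 2.x (where NBClientEngine and the papermill_engines registry exist); the names IPCKernelManager and "ipc_engine" are illustrative, not part of papermill:

from jupyter_client.manager import KernelManager
from papermill.engines import NBClientEngine, papermill_engines


class IPCKernelManager(KernelManager):
    """A KernelManager that always uses the ipc transport."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, transport="ipc", **kwargs)


class IPCEngine(NBClientEngine):
    @classmethod
    def execute_managed_notebook(cls, nb_man, kernel_name, **kwargs):
        # Forward our KernelManager class to the underlying NotebookClient.
        kwargs["kernel_manager_class"] = IPCKernelManager
        return super().execute_managed_notebook(nb_man, kernel_name, **kwargs)


papermill_engines.register("ipc_engine", IPCEngine)
# Then: pm.execute_notebook(..., engine_name="ipc_engine")

This registers the engine in-process; for the papermill CLI, the engine would instead need to be exposed through papermill's entry points.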

@mlucool (Contributor) commented Jun 1, 2022

Thanks for the great tip! This is very close (and even easier than you said), but I'm still seeing some issues:

Given the following files:

./__init__.py

./test.py
#!/usr/local/bin/python
import papermill as pm
import tempfile

with tempfile.TemporaryDirectory() as tmpdirname:
    pm.execute_notebook(
        "/path/to/any/notebook.ipynb",
        tmpdirname + "/ignored-output.ipynb",
        progress_bar=False,
        timeout=1800,
        kernel_manager_class="IPCKernelManager.IPCKernelManager",
    )

./test_runner.sh
#!/bin/bash

for i in {1..2}
do
    ./test.py &
done

./IPCKernelManager.py
from jupyter_client.manager import KernelManager


class IPCKernelManager(KernelManager):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, transport="ipc", **kwargs)

Running test.py on its own seems to work, but test_runner.sh does not. I think this is because the ipc channels all overlap, and I get a ton of messages like:

Traceback (most recent call last):
  File "/usr/local/python/python-3.9/std/lib64/python3.9/site-packages/ipykernel/kernelbase.py", line 359, in dispatch_shell
    msg = self.session.deserialize(msg, content=True, copy=False)
  File "/usr/local/python/python-3.9/std/lib64/python3.9/site-packages/jupyter_client/session.py", line 1054, in deserialize
    raise ValueError("Invalid Signature: %r" % signature)
ValueError: Invalid Signature: b'4ccdd36edf8bc494aba12bae8f5d8de9f21887bde0bd05a36ba387a43780f7d6'

Feels like we are one path away from this working! Any further tips, @kevin-bates?

@kevin-bates

I'm not familiar with the IPC transport, but after looking into this for a bit, it appears there needs to be better "randomization" when checking for the existence of the ipc "port".

In the "transport == ipc" case, the ip value is either "kernel-ipc" or "kernel-<kernel_id>-ipc", where the latter is really the value of self.connection_file sans the .json suffix (and would provide sufficient randomization). However, because self.connection_file is not set (by default) at the time the ports are "checked", simultaneous kernel starts all check against "kernel-ipc-N" (where N is 1..5), and each can appear to succeed depending on the race, hence the collision.

As a workaround, you could simply bring your own kernel-id and connection-file values:

import os
import uuid

from jupyter_client.manager import KernelManager
from jupyter_core.paths import jupyter_runtime_dir


class IPCKernelManager(KernelManager):
    def __init__(self, *args, **kwargs):
        kernel_id = str(uuid.uuid4())
        connection_file = os.path.join(jupyter_runtime_dir(), f"kernel-{kernel_id}.json")
        super().__init__(*args, transport="ipc", kernel_id=kernel_id, connection_file=connection_file, **kwargs)

Note that the current code creates the IPC files in the current directory, not colocated with the connection file, whereas this approach colocates the two sets (which is better, IMHO).

@mlucool (Contributor) commented Jun 1, 2022

That worked great - thanks!! This seems to fully solve the issue if you can use IPC.

@mlucool (Contributor) commented Jun 1, 2022

@kevin-bates I thought you may be interested in this from a protocol POV: the runtime dir did not work well on NFS for the above test. I saw race conditions resulting in RuntimeError: Kernel didn't respond in 60 seconds. Moving this to local disk solved the issues.

@kevin-bates

@mlucool - thanks for the update.

the runtime dir did not work well on NFS for the above test. I saw race conditions resulting in RuntimeError: Kernel didn't respond in 60 seconds. Moving this to local disk solved the issues.

Each of the Jupyter-based directories (runtime, config, data) can be "redirected" via environment variables: JUPYTER_RUNTIME_DIR, JUPYTER_CONFIG_DIR, and JUPYTER_DATA_DIR, respectively. In situations where the files require a high degree of access (like the IPC files), you're probably better off pointing the corresponding env to a local directory; that would also benefit other Jupyter-based applications in that particular configuration and allow the same code to run irrespective of the underlying env values.
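
For example (a hypothetical setup step, to run before anything Jupyter-based starts in the process):

import os
import tempfile

# Point Jupyter's runtime dir at fast local disk so the IPC socket
# files avoid NFS; tempfile.mkdtemp defaults to the local system temp dir.
os.environ["JUPYTER_RUNTIME_DIR"] = tempfile.mkdtemp(prefix="jupyter-runtime-")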

@miltondp

A small improvement on @kevin-bates' IPCKernelManager class, to make sure the runtime directory exists:

import os
import uuid

from jupyter_client.manager import KernelManager
from jupyter_core.paths import jupyter_runtime_dir


class IPCKernelManager(KernelManager):
    def __init__(self, *args, **kwargs):
        kernel_id = str(uuid.uuid4())
        os.makedirs(jupyter_runtime_dir(), exist_ok=True)
        connection_file = os.path.join(jupyter_runtime_dir(), f"kernel-{kernel_id}.json")
        super().__init__(*args, transport="ipc", kernel_id=kernel_id, connection_file=connection_file, **kwargs)
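
Usage mirrors the earlier test.py, assuming the class above lives in IPCKernelManager.py on the Python path:

import papermill as pm

pm.execute_notebook(
    "/path/to/any/notebook.ipynb",
    "/path/to/output.ipynb",
    kernel_manager_class="IPCKernelManager.IPCKernelManager",
)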
