Tests which fail in multiprocessing contexts #1018

alexrudy · 2019-05-08T15:27:49Z

@MSeal and I got a ways toward supporting multiprocessing at PyCon sprints. The two tests added here, one of which still fails, demonstrate the problem from nbconvert.

I am working on a minimum reproduction for jupyter_client as well, but I wanted to put this branch up somewhere so that we can discuss & share the reproducing examples. I've marked both tests as XFail, but since they are both slow, I don't even know that this should get merged.

MSeal · 2019-05-12T22:49:35Z

Thanks for posting. Let's leave this up while the jupyter_client change gets reviewed and then re-evaluate if the tests need to be made quicker to merge or not.

MSeal

Apologies for being disconnected for a while. This was the first weekend I was around to go through the past week or two's open source PRs (been a busy month).

Do you have any suggestions for debugging test_parallel_fork_notebooks further? I could probably take a deeper dive on Monday into what's happening here.

MSeal · 2019-05-25T19:09:24Z

nbconvert/preprocessors/tests/test_execute.py

+    assert captured.err == ""
+
+@pytest.mark.xfail
+def test_parallel_fork_notebooks(capfd):


I can't get this test to pass on my ubuntu 18.04 machine. It will always hang if the thread is running when multi-processing launches.

I tried with and without the zeromq cleanup, and with / without the jupyter_client pending release changes. Haven't dug in deeper as to why this still hangs.

MSeal · 2019-05-25T19:10:54Z

nbconvert/preprocessors/tests/test_execute.py

+                    input_file,
+                    opts,
+                    res,
+                    functools.partial(label_parallel_notebook, label=label),


@mpacer Wanted to keep run_notebook helper from getting more complex. Could we check in Parallel Execute A.ipynb and Parallel Execute B.ipynb into the PR to avoid adding this?

Sorry for not reaching out about this directly. What I would recommend doing would be the following
Instead of

threads = [ threading.Thread( target=run_notebook, args=( input_file, opts, res, functools.partial(label_parallel_notebook, label=label), ), ) for label in ("A", "B") ]

go with

input_files = [label_parallel_notebook(input_file, label=label) for label in ("A", "B")] threads = [ threading.Thread( target=run_notebook, args=( this_input_file, opts, res ), ) for this_input_file in input_files ]

That way you don't need to change the signature of run_notebook. This should work because preprocessing the notebook to add a cell doesn't need to occur concurrently, but the notebooks themselves will still be running concurrently.

MSeal · 2019-05-25T19:11:59Z

nbconvert/preprocessors/tests/test_execute.py

+    # Destroy the context - if you don't do this, the context
+    # will survive across the fork, and then fail to start properly.
+    import zmq
+    zmq.Context.instance().destroy()


FYI I tested this with the new jupyter_client. Could we add a TODO: delete when jupyter_client>=5.2.4 releases?

MSeal · 2019-05-30T21:30:17Z

I haven't gotten an actual stack trace here, but in tests I have seen

___________________________ test_parallel_notebooks ____________________________
capfd = <_pytest.capture.CaptureFixture object at 0x7fbbe7dc2790>
tmpdir = local('/tmp/pytest-of-travis/pytest-0/test_parallel_notebooks0')
    def test_parallel_notebooks(capfd, tmpdir):
        """Two notebooks should be able to be run simultaneously without problems.
    
        The two notebooks spawned here use the filesystem to check that the other notebook
        wrote to the filesystem."""
    
        opts = dict(kernel_name="python")
        input_name = "Parallel Execute.ipynb"
        input_file = os.path.join(current_dir, "files", input_name)
        res = notebook_resources()
    
        with modified_env({"NBEXECUTE_TEST_PARALLEL_TMPDIR": str(tmpdir)}):
            threads = [
                threading.Thread(
                    target=run_notebook,
                    args=(
                        input_file,
                        opts,
                        res,
                        functools.partial(label_parallel_notebook, label=label),
                    ),
                )
                for label in ("A", "B")
            ]
            [t.start() for t in threads]
            [t.join(timeout=2) for t in threads]
    
        captured = capfd.readouterr()
>       assert captured.err == ""
E       assert 'Traceback (m...kernel_info\n\n' == ''
E         - Traceback (most recent call last):
E         -   File "/opt/python/2.7.15/lib/python2.7/runpy.py", line 174, in _run_module_as_main
E         -     "__main__", fname, loader, pkg_name)
E         -   File "/opt/python/2.7.15/lib/python2.7/runpy.py", line 72, in _run_code
E         -     exec code in run_globals
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/ipykernel_launcher.py", line 16, in <module>
E         -     app.launch_new_instance()
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/traitlets/config/application.py", line 657, in launch_instance
E         -     app.initialize(argv)
E         -   File "</home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/decorator.pyc:decorator-gen-121>", line 2, in initialize
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/traitlets/config/application.py", line 87, in catch_config_error
E         -     return method(app, *args, **kwargs)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 469, in initialize
E         -     self.init_sockets()
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 239, in init_sockets
E         -     self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 181, in _bind_socket
E         -     s.bind("tcp://%s:%i" % (self.ip, port))
E         -   File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
E         -   File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
E         -     raise ZMQError(errno)
E         - ZMQError: Address already in use
E         - Exception in thread Thread-16:
E         - Traceback (most recent call last):
E         -   File "/opt/python/2.7.15/lib/python2.7/threading.py", line 801, in __bootstrap_inner
E         -     self.run()
E         -   File "/opt/python/2.7.15/lib/python2.7/threading.py", line 754, in run
E         -     self.__target(*self.__args, **self.__kwargs)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/tests/test_execute.py", line 97, in run_notebook
E         -     output_nb, _ = preprocessor(cleaned_input_nb, resources)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/base.py", line 47, in __call__
E         -     return self.preprocess(nb, resources)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/execute.py", line 400, in preprocess
E         -     with self.setup_preprocessor(nb, resources, km=km):
E         -   File "/opt/python/2.7.15/lib/python2.7/contextlib.py", line 17, in __enter__
E         -     return self.gen.next()
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/execute.py", line 345, in setup_preprocessor
E         -     self.km, self.kc = self.start_new_kernel(**kwargs)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/execute.py", line 296, in start_new_kernel
E         -     kc.wait_for_ready(timeout=self.startup_timeout)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/jupyter_client/blocking/client.py", line 120, in wait_for_ready
E         -     raise RuntimeError('Kernel died before replying to kernel_info')
E         - RuntimeError: Kernel died before replying to kernel_info
E         -
/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/tests/test_execute.py:305: AssertionError

now which indicates there might be another issue in on the kernel side with port management?

mpacer · 2019-05-30T23:46:50Z

nbconvert/preprocessors/tests/test_execute.py

@@ -69,14 +72,17 @@ def build_preprocessor(opts):
    return preprocessor


-def run_notebook(filename, opts, resources):
+def run_notebook(filename, opts, resources, preprocess_notebook=None):


As described below, you can avoid adding this complication that also introduces an additional meaning to what it means to preprocess a notebook.

mpacer · 2019-05-30T23:48:19Z

nbconvert/preprocessors/tests/test_execute.py

+        }
+    )
+
+    nb.cells.insert(1, label_cell)


I would make a deepcopy rather than mutating the copy you sent in.

Suggested change

nb.cells.insert(1, label_cell)

nb = deepcopy(nb)

nb.cells.insert(1, label_cell)

mpacer · 2019-05-30T23:50:40Z

nbconvert/preprocessors/tests/test_execute.py

+
+    Used for parallel testing to label two notebooks which are run simultaneously.
+    """
+    label_cell = nbformat.NotebookNode(


Rather than creating a NotebookNode directly I would use the new_code_cell function from nbformat.v4.
https://github.com/jupyter/nbformat/blob/11903688167d21af96c92f7f5bf0634ab51819f1/nbformat/v4/nbbase.py#L98-L110

The reason to do that is it ensures that the code cell itself will be valid according to the v4 spec.

mpacer · 2019-05-30T23:53:32Z

nbconvert/preprocessors/tests/test_execute.py

+    return nb
+
+
+def test_parallel_notebooks(capfd, tmpdir):


Are capfd and tmpdir pytest fixtures? Otherwise, where are these being populated when it comes time to test them?

Yes they are provided by pytest: https://docs.pytest.org/en/latest/tmpdir.html#the-tmpdir-fixture https://docs.pytest.org/en/latest/capture.html#accessing-captured-output-from-a-test-function

mpacer · 2019-05-30T23:55:21Z

nbconvert/preprocessors/execute.py

@@ -220,6 +220,20 @@ class ExecutePreprocessor(Preprocessor):
            )
    ).tag(config=True)

+    ipython_hist_file = Unicode(


I think you might want to remove this change and put it in its own PR as it stands alone as a huge benefit.

mpacer · 2019-05-31T00:04:48Z

E - raise ZMQError(errno)
E - ZMQError: Address already in use

That definitely suggests that whatever is assigning ports is unaware of the other instances that are assigning ports. This might have been introduced by getting rid of the singleton ZmqContext or it might have been always present but less likely to occur because there were fewer ports being assigned.

I wonder if there are lessons to be learned from how the MultiKernelManager handles this case. The notebook server builds on that in its MappingKernelManager.

But the fact that we're running into this issue might mean that we might need to provide a primitive on the nbconvert side to enable this rather than relying on library users to solve it in their own parallel code.

MSeal · 2019-05-31T01:06:34Z

The subprocess wouldn't have the zeromq change or parent process manipulation, as that's unreleased isolated by process. So I believe it's unrelated to the singleton change. I think I found the bug: https://github.com/ipython/ipykernel/blob/master/ipykernel/kernelapp.py#L185-L187 is a race condition where between when it finds a free port (it's written to handle port conflicts) and when it actually uses it the port could have been taken by another process. It's likely not safe regardless of nbconvert parallelism if you have a notebook server up and are running nbconvert or another notebook server at the same time.

I'll open an issue / fix for ipykernel.

On the later comment you had I don't know if an nbconvert primitive would help, because I might spawn multiple nbconvert processes at the same time outside of a parent python process (e.g. spawning nbconvert processes in a bash loop). This would probably defeat the approach MultiKernelManager takes. Maybe we could use a file lock to enable cross process nbconvert instances to wait for each kernel to launch before starting a new one. It would only make nbconvert parallelism safe (and slow) but not safe across multiple types of clients connecting to zeromq. Correct me if your thinking of something else here.

Basically I think the kernels wrappers should solve it and we should help improve the tools for making kernels to not have to think about this if we can.

MSeal · 2019-06-02T19:49:11Z

I now have a local fix to the the ipykernel socket issue. But that failure mode is pretty rare and wasn't what we were hitting here.

With some modifications to tests and printing a few lines wrapping execute starts and stops I was able to narrow down that any processes that start while the threaded execution is running fail. Looking deeper by adding a log message whenever a zmq context instance is generated revealed that the forked processes before that the parent process finishes executing would not generate new context instances. I believe the traitlets usage or some other wrapper is leaving the context object on the class and reusing it in the forked processes instead of recomputing.

/usr/lib/python3.6/concurrent/futures/_base.py:384: TimeoutError
----------------------------------------------------------------------------------------------- Captured stdout call ------------------------------------------------------------------------------------------------
ABOUT TO START 0! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/Sleep One.ipynb
ABOUT TO START 1! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
WAITING ON 8
ABOUT TO START 2! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
ABOUT TO START 4! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
ABOUT TO START 3! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
CREATING CONTEXT
DONE 0! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/Sleep One.ipynb
ABOUT TO START 5! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
ABOUT TO START 6! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
ABOUT TO START 7! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
ABOUT TO START 8! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
CREATING CONTEXT
CREATING CONTEXT
CREATING CONTEXT
CREATING CONTEXT
DONE 8! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
WAITING ON 7
DONE 7! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
WAITING ON 6
DONE 6! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
WAITING ON 5
DONE 5! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
WAITING ON 4
# Times out on 4 waiting for a failed processes

MSeal · 2019-06-02T20:34:23Z

Seems like the processes 1-4 aren't even making it to the kernel client initialization. I'll do some more debugging tonight and find where they fail to get past more definitively.

MSeal · 2019-06-12T05:30:35Z

Found the parallel processing issue after I tracked it down to the GIL freezing up before entering the __new__ class method of the KernelManager.
https://bugs.python.org/issue27422
... forgot the C rule of never thread then fork, always fork and then thread

Basically the processes hangs if the GIL is still active when the multiprocessing forks try to acquire it and deadlock.
If the multiprocessing forks happen to get started before the threads do (which we saw sometimes) then we had a fork first situation and it would pass.

Going to modify the PR to remove the thread then fork test and get the rest merged so we can move forward. There is a race condition in ipykernel that's difficult to trigger that I have PR partially written to solve the only remaining race condition I've seen crop up.

MSeal · 2019-06-13T05:57:58Z

Known race condition in ipykernel now has a PR to address: ipython/ipykernel#412

meeseeksmachine · 2019-08-09T04:11:35Z

This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/nbconvert-5-6-0-release/1867/1

alexrudy added 2 commits May 8, 2019 10:34

Parallel Demonstrator Tests

84108d8

Mark tests as XFail

be4ee65

alexrudy mentioned this pull request May 8, 2019

Avoid kernel failures with multiple processes jupyter/jupyter_client#437

Merged

alexrudy added 2 commits May 25, 2019 12:13

Parallel Demonstrator Tests

b13601f

Mark tests as XFail

f4752c0

MSeal requested changes May 25, 2019

View reviewed changes

mpacer reviewed May 30, 2019

View reviewed changes

MSeal added 4 commits June 11, 2019 22:54

Applied PR feedback

a08d359

Merge remote-tracking branch 'rudy/fix-zmq-context-global' into fix-zmq

6a0f8c8

Removed duplicate tests from merge

55bb0db

Merge branch 'master' of github.com:jupyter/nbconvert into fix-zmq

cc0ac44

MSeal approved these changes Jun 12, 2019

View reviewed changes

MSeal added 7 commits June 11, 2019 23:44

Added additional timeout delay to test

5c0848f

Set additional timeout on the correct test field

b88f3e9

Added ability to turn off slow tests

a7a92b1

Attempt to get travis passing slow tests

ae011e7

Fixing pytest conf issues with temp directories

a443d03

Moved conftest into nbconvert path

d1a3a6d

Removed unecessary travis command

4d3e992

MSeal added 8 commits June 12, 2019 09:33

Another attempt to get Travis

48beaa4

Simplified the test execution

baea777

Yet another travis fix attempt

f81e1a2

Adding much higher timeouts to failing test

1434a2a

Attempt #billion to fix travis

a818745

Travis only-failure debug

bcd5157

Added latest jupyter_client to travis for test run

f74f2e1

Removed extra debug lines

c3ef4b7

MSeal added 2 commits July 8, 2019 09:44

Resolve conflict with master

9bbcecb

Removed jupyter_client master install from travis

70d47b7

MSeal merged commit 76061f8 into jupyter:master Jul 8, 2019

MSeal added this to the 5.6 milestone Jul 30, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tests which fail in multiprocessing contexts #1018

Tests which fail in multiprocessing contexts #1018

alexrudy commented May 8, 2019

MSeal commented May 12, 2019

MSeal left a comment

MSeal May 25, 2019

MSeal May 25, 2019

MSeal May 25, 2019

mpacer May 30, 2019

MSeal May 25, 2019

MSeal commented May 30, 2019

mpacer May 30, 2019

mpacer May 30, 2019

mpacer May 30, 2019

mpacer May 30, 2019

mpacer May 30, 2019

MSeal Jun 12, 2019

mpacer May 30, 2019

mpacer commented May 31, 2019

MSeal commented May 31, 2019

MSeal commented Jun 2, 2019 •

edited

Loading

MSeal commented Jun 2, 2019

MSeal commented Jun 12, 2019

MSeal commented Jun 13, 2019

meeseeksmachine commented Aug 9, 2019

	nb.cells.insert(1, label_cell)
	nb = deepcopy(nb)
	nb.cells.insert(1, label_cell)

Tests which fail in multiprocessing contexts #1018

Tests which fail in multiprocessing contexts #1018

Conversation

alexrudy commented May 8, 2019

MSeal commented May 12, 2019

MSeal left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MSeal commented May 30, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mpacer commented May 31, 2019

MSeal commented May 31, 2019

MSeal commented Jun 2, 2019 • edited Loading

MSeal commented Jun 2, 2019

MSeal commented Jun 12, 2019

MSeal commented Jun 13, 2019

meeseeksmachine commented Aug 9, 2019

MSeal commented Jun 2, 2019 •

edited

Loading