Infinite retries if segv on import #236

timj · 2017-09-28T16:48:56Z

In our project we are developing many Python modules that include C++. Sometimes this means that a test file crashes on import. Without xdist you get a bad exit status in the shell, so you know something went wrong:

$ pytest 
======================================================== test session starts ========================================================
platform darwin -- Python 3.6.1, pytest-3.2.0, py-1.4.34, pluggy-0.4.0
rootdir: /Users/timj/work/lsst/xdist-crash, inifile:
plugins: session2file-0.1.9, forked-0.3.dev0+g1dd93f6.d20170913, xdist-14.1.dev1+g5772c03.d20170928, flake8-0.8.1
collecting 0 itemsSegmentation fault: 11
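For reference, the bad exit status can also be detected from Python before handing the suite to xdist. This is only a sketch under assumptions: `exit_status_of` and `killed_by_signal` are hypothetical helper names, and the negative-return-code convention applies to POSIX systems.

```python
import subprocess
import sys


def exit_status_of(code):
    """Run `code` in a fresh interpreter and return the child's exit status."""
    return subprocess.run([sys.executable, "-c", code]).returncode


def killed_by_signal(status):
    # On POSIX, subprocess reports "terminated by signal N" as return code -N.
    return status < 0
```

For example, `killed_by_signal(exit_status_of("import mymodule"))` (with `mymodule` standing in for a real extension module) would be true if importing it segfaults.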

With xdist enabled things start to go horribly wrong:

$ pytest -n 2 
======================================================== test session starts ========================================================
platform darwin -- Python 3.6.1, pytest-3.2.0, py-1.4.34, pluggy-0.4.0
rootdir: /Users/timj/work/lsst/xdist-crash, inifile:
plugins: session2file-0.1.9, forked-0.3.dev0+g1dd93f6.d20170913, xdist-14.1.dev1+g5772c03.d20170928, flake8-0.8.1
gw0 ok / gw1 C[gw0] node down: Not properly terminated
Replacing crashed slave gw0
gw2 C / gw1 ok[gw1] node down: Not properly terminated
Replacing crashed slave gw1
gw2 ok / gw3 C[gw2] node down: Not properly terminated
Replacing crashed slave gw2
gw4 C / gw3 ok[gw3] node down: Not properly terminated
Replacing crashed slave gw3
gw4 ok / gw5 C[gw4] node down: Not properly terminated
Replacing crashed slave gw4
gw6 C / gw5 ok[gw5] node down: Not properly terminated
Replacing crashed slave gw5
gw6 ok / gw7 C[gw6] node down: Not properly terminated
Replacing crashed slave gw6
gw8 C / gw7 ok[gw7] node down: Not properly terminated
Replacing crashed slave gw7
gw8 C / gw9 C ^C
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
to show a full traceback on KeyboardInterrupt use --fulltrace
/Users/timj/work/lsstsw3/miniconda/lib/python3.6/threading.py:299: KeyboardInterrupt
=================================================== no tests ran in 6.86 seconds ====================================================

Here it continually tries to restart workers until the system runs out of resources (we have had cases where a Jenkins node became completely unresponsive and required a reboot).
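As a stopgap on CI, one option is to run the test command under a hard deadline so a runaway session cannot take the node down. A minimal sketch, assuming a hypothetical wrapper named `run_with_deadline`; note that `subprocess.run` only kills the direct child on timeout, so any surviving worker processes would need separate handling.

```python
import subprocess


def run_with_deadline(cmd, seconds):
    """Run cmd; return its exit status, or None if it exceeded the deadline."""
    try:
        # On timeout, subprocess.run kills the child and raises TimeoutExpired.
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        return None
```

A CI script might call something like `run_with_deadline(["pytest", "-n", "2"], 600)` and treat `None` as a failure.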

I thought that limiting worker restarts would be the solution but that doesn't work either:

$ pytest -n 2 --max-slave-restart=0
======================================================== test session starts ========================================================
platform darwin -- Python 3.6.1, pytest-3.2.0, py-1.4.34, pluggy-0.4.0
rootdir: /Users/timj/work/lsst/xdist-crash, inifile:
plugins: session2file-0.1.9, forked-0.3.dev0+g1dd93f6.d20170913, xdist-14.1.dev1+g5772c03.d20170928, flake8-0.8.1
gw0 ok / gw1 C[gw0] node down: Not properly terminated
Slave restarting disabled
gw0 ok / gw1 ok[gw1] node down: Not properly terminated
Slave restarting disabled
^C
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
to show a full traceback on KeyboardInterrupt use --fulltrace
/Users/timj/work/lsstsw3/miniconda/lib/python3.6/threading.py:299: KeyboardInterrupt
=================================================== no tests ran in 5.12 seconds ====================================================

It doesn't keep restarting subprocesses but it does hang forever.

My test file is:

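import ctypes


def crash():
    """Crash the Python interpreter by writing past a 1-byte buffer."""
    i = ctypes.c_char(b'a')
    j = ctypes.pointer(i)
    c = 0
    # Keep writing to ever-higher offsets until the process segfaults.
    while True:
        j[c] = b'a'
        c += 1


# Called at module level so the crash happens on import, during collection.
crash()


def test_crash():
    assert 1 == 1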

Here is the --fulltrace output for the case where it hangs, in case that helps:

$ pytest -n 2 --max-slave-restart=0 --fulltrace
======================================================== test session starts ========================================================
platform darwin -- Python 3.6.1, pytest-3.2.0, py-1.4.34, pluggy-0.4.0
rootdir: /Users/timj/work/lsst/xdist-crash, inifile:
plugins: session2file-0.1.9, forked-0.3.dev0+g1dd93f6.d20170913, xdist-14.1.dev1+g5772c03.d20170928, flake8-0.8.1
gw0 ok / gw1 C[gw0] node down: Not properly terminated
Slave restarting disabled
gw0 ok / gw1 ok[gw1] node down: Not properly terminated
Slave restarting disabled
^C
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

config = <_pytest.config.Config object at 0x10e04f048>, doit = <function _main at 0x10e0211e0>

    def wrap_session(config, doit):
        """Skeleton command line program"""
        session = Session(config)
        session.exitstatus = EXIT_OK
        initstate = 0
        try:
            try:
                config._do_configure()
                initstate = 1
                config.hook.pytest_sessionstart(session=session)
                initstate = 2
>               session.exitstatus = doit(config, session) or 0

../../lsstsw3/stack/DarwinX86/pytest/3.2.0/lib/python/pytest-3.2.0-py3.6.egg/_pytest/main.py:110: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

config = <_pytest.config.Config object at 0x10e04f048>, session = <Session 'xdist-crash'>

    def _main(config, session):
        """ default command line protocol for initialization, session,
        running tests and reporting. """
        config.hook.pytest_collection(session=session)
>       config.hook.pytest_runtestloop(session=session)

../../lsstsw3/stack/DarwinX86/pytest/3.2.0/lib/python/pytest-3.2.0-py3.6.egg/_pytest/main.py:146: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_HookCaller 'pytest_runtestloop'>
kwargs = {'__multicall__': <_MultiCall 0 results, 1 meths, kwargs={'session': <Session 'xdist-crash'>, '__multicall__': <_MultiCall 0 results, 1 meths, kwargs={...}>}>, 'session': <Session 'xdist-crash'>}

    def __call__(self, **kwargs):
        assert not self.is_historic()
>       return self._hookexec(self, self._nonwrappers + self._wrappers, kwargs)

../../lsstsw3/stack/DarwinX86/pytest/3.2.0/lib/python/pytest-3.2.0-py3.6.egg/_pytest/vendored_packages/pluggy.py:745: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_pytest.config.PytestPluginManager object at 0x10d7320f0>, hook = <_HookCaller 'pytest_runtestloop'>
methods = [<_pytest.vendored_packages.pluggy.HookImpl object at 0x10e04fc18>]
kwargs = {'__multicall__': <_MultiCall 0 results, 1 meths, kwargs={'session': <Session 'xdist-crash'>, '__multicall__': <_MultiCall 0 results, 1 meths, kwargs={...}>}>, 'session': <Session 'xdist-crash'>}

    def _hookexec(self, hook, methods, kwargs):
        # called from all hookcaller instances.
        # enable_tracing will set its own wrapping function at self._inner_hookexec
>       return self._inner_hookexec(hook, methods, kwargs)

../../lsstsw3/stack/DarwinX86/pytest/3.2.0/lib/python/pytest-3.2.0-py3.6.egg/_pytest/vendored_packages/pluggy.py:339: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

hook = <_HookCaller 'pytest_runtestloop'>, methods = [<_pytest.vendored_packages.pluggy.HookImpl object at 0x10e04fc18>]
kwargs = {'__multicall__': <_MultiCall 0 results, 1 meths, kwargs={'session': <Session 'xdist-crash'>, '__multicall__': <_MultiCall 0 results, 1 meths, kwargs={...}>}>, 'session': <Session 'xdist-crash'>}

    self._inner_hookexec = lambda hook, methods, kwargs: \
>       _MultiCall(methods, kwargs, hook.spec_opts).execute()

../../lsstsw3/stack/DarwinX86/pytest/3.2.0/lib/python/pytest-3.2.0-py3.6.egg/_pytest/vendored_packages/pluggy.py:334: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_MultiCall 0 results, 1 meths, kwargs={'session': <Session 'xdist-crash'>, '__multicall__': <_MultiCall 0 results, 1 meths, kwargs={...}>}>

    def execute(self):
        all_kwargs = self.kwargs
        self.results = results = []
        firstresult = self.specopts.get("firstresult")
    
        while self.hook_impls:
            hook_impl = self.hook_impls.pop()
            try:
                args = [all_kwargs[argname] for argname in hook_impl.argnames]
            except KeyError:
                for argname in hook_impl.argnames:
                    if argname not in all_kwargs:
                        raise HookCallError(
                            "hook call must provide argument %r" % (argname,))
            if hook_impl.hookwrapper:
                return _wrapped_call(hook_impl.function(*args), self.execute)
>           res = hook_impl.function(*args)

../../lsstsw3/stack/DarwinX86/pytest/3.2.0/lib/python/pytest-3.2.0-py3.6.egg/_pytest/vendored_packages/pluggy.py:614: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <xdist.dsession.DSession object at 0x10d386898>

    def pytest_runtestloop(self):
        self.sched = self.config.hook.pytest_xdist_make_scheduler(
            config=self.config,
            log=self.log
        )
        assert self.sched is not None
    
        self.shouldstop = False
        while not self.session_finished:
>           self.loop_once()

../../lsstsw3/stack/DarwinX86/pytest_xdist/tickets.DM-12021-g5772c03d22/lib/python/pytest_xdist-14.1.dev1+g5772c03.d20170928-py3.6.egg/xdist/dsession.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <xdist.dsession.DSession object at 0x10d386898>

    def loop_once(self):
        """Process one callback from one of the slaves."""
        while 1:
            try:
>               eventcall = self.queue.get(timeout=2.0)

../../lsstsw3/stack/DarwinX86/pytest_xdist/tickets.DM-12021-g5772c03d22/lib/python/pytest_xdist-14.1.dev1+g5772c03.d20170928-py3.6.egg/xdist/dsession.py:124: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <queue.Queue object at 0x10d386908>, block = True, timeout = 2.0

    def get(self, block=True, timeout=None):
        '''Remove and return an item from the queue.
    
            If optional args 'block' is true and 'timeout' is None (the default),
            block if necessary until an item is available. If 'timeout' is
            a non-negative number, it blocks at most 'timeout' seconds and raises
            the Empty exception if no item was available within that time.
            Otherwise ('block' is false), return an item if one is immediately
            available, else raise the Empty exception ('timeout' is ignored
            in that case).
            '''
        with self.not_empty:
            if not block:
                if not self._qsize():
                    raise Empty
            elif timeout is None:
                while not self._qsize():
                    self.not_empty.wait()
            elif timeout < 0:
                raise ValueError("'timeout' must be a non-negative number")
            else:
                endtime = time() + timeout
                while not self._qsize():
                    remaining = endtime - time()
                    if remaining <= 0.0:
                        raise Empty
>                   self.not_empty.wait(remaining)

../../lsstsw3/miniconda/lib/python3.6/queue.py:173: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Condition(<unlocked _thread.lock object at 0x10d7acd50>, 0)>, timeout = 1.9999960879940772

    def wait(self, timeout=None):
        """Wait until notified or until a timeout occurs.
    
            If the calling thread has not acquired the lock when this method is
            called, a RuntimeError is raised.
    
            This method releases the underlying lock, and then blocks until it is
            awakened by a notify() or notify_all() call for the same condition
            variable in another thread, or until the optional timeout occurs. Once
            awakened or timed out, it re-acquires the lock and returns.
    
            When the timeout argument is present and not None, it should be a
            floating point number specifying a timeout for the operation in seconds
            (or fractions thereof).
    
            When the underlying lock is an RLock, it is not released using its
            release() method, since this may not actually unlock the lock when it
            was acquired multiple times recursively. Instead, an internal interface
            of the RLock class is used, which really unlocks it even when it has
            been recursively acquired several times. Another internal interface is
            then used to restore the recursion level when the lock is reacquired.
    
            """
        if not self._is_owned():
            raise RuntimeError("cannot wait on un-acquired lock")
        waiter = _allocate_lock()
        waiter.acquire()
        self._waiters.append(waiter)
        saved_state = self._release_save()
        gotit = False
        try:    # restore state no matter what (e.g., KeyboardInterrupt)
            if timeout is None:
                waiter.acquire()
                gotit = True
            else:
                if timeout > 0:
>                   gotit = waiter.acquire(True, timeout)
E                   KeyboardInterrupt

../../lsstsw3/miniconda/lib/python3.6/threading.py:299: KeyboardInterrupt
=================================================== no tests ran in 6.21 seconds ====================================================
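The traceback ends in a loop that polls a queue with a 2-second timeout. One way such a loop could avoid waiting forever is to give up once no worker is left to send events. The following is only a sketch of that idea, not xdist's code: `wait_for_event`, `events`, and `active_workers` are hypothetical names, and it assumes the hang comes from polling for events that no live worker can produce.

```python
import queue


def wait_for_event(events, active_workers, poll_interval=2.0):
    """Return the next event, or raise if no worker is left to produce one."""
    while True:
        try:
            return events.get(timeout=poll_interval)
        except queue.Empty:
            # Nothing arrived in this interval; stop if every worker is gone.
            if not active_workers():
                raise RuntimeError("all workers are down; no more events will arrive")
```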


RonnyPfannschmidt · 2017-09-28T16:52:02Z

This one duplicates the existing issue about the max-restart limit.

timj · 2017-09-28T19:35:38Z

Ah. #45 -- sorry for the noise.

nicoddemus closed this as completed Sep 28, 2017
