control-C sometimes getting missed on Windows #119

njsmith · 2017-04-11T06:58:17Z

We're getting occasional failures on appveyor, that look like:

\trio\_core\tests\test_ki.py::test_ki_is_good_neighbor PASSED
..\trio\_core\tests\test_ki.py::test_ki_wakes_us_up 
[... coverage info ...]
!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
to show a full traceback on KeyboardInterrupt use --fulltrace
C:\projects\trio\trio\_core\_run.py:1136: KeyboardInterrupt
==================== 14 passed, 1 skipped in 23.20 seconds ====================

(source)

This is a test that delivers a synthetic SIGINT while the trio thread is sleeping, to make sure that the KeyboardInterrupt gets delivered promptly.

_run.py:1136 is the raise KeyboardInterrupt at the end of run() – the one that checks ki.pending one last time as run is exiting.

Given the time report at the end, it looks like the await sleep(20) in test_ki_wakes_us_up may be expiring? And that's the only checkpoint in the test, so if we somehow aren't being woken up by the signal arriving, but the signal is in fact arriving, then this would make sense.

...how can this be happening though? I'm really not sure :-(. May need to push some instrumented builds in a PR and keep rebuilding them on appveyor until I get failures...

The text was updated successfully, but these errors were encountered:

njsmith · 2017-04-11T07:21:37Z

Here's another one from a few days ago, which looks identical to the one linked above: https://ci.appveyor.com/project/njsmith/trio/build/1.0.265/job/9gyy7y6jp4vlu472
[Edit: and again: https://ci.appveyor.com/project/njsmith/trio/build/1.0.265/job/9gyy7y6jp4vlu472 ]

Instrumentation is in #120, but it looks like this was a bit rare to start with so tracking it down may be difficult :-(

njsmith · 2017-04-11T08:55:21Z

Okay, managed to reproduce locally. Some tricky things required to get a trace: make the sigint trace print to stderr, b/c if it runs re-entrant with one of the stdout prints then you have two calls to the same BufferedWriter.write at the same time and it blows up; run without pytest's -s option b/c console printing seems to mess up timing, but then hack the raise KeyboardInterrupt to be raise KeyError so that pytest will actually print the captured stdout/stderr. Anyway:

---------------------------- Captured stdout call -----------------------------
sleeping @ 1491898063.0975676
woke up @ 1491898063.0975676
draining wakeup socket @ 1491898063.0975676
done draining
sleeping @ 1491898063.0975676
woke up @ 1491898063.0975676
draining wakeup socket @ 1491898063.0975676
done draining
sleeping @ 1491898063.0975676
sleeping @ 1491898063.0975676
woke up @ 1491898064.112795
wakeup sock is readable
PyErr_CheckSignals returned 0
draining wakeup socket @ 1491898064.112795
drained b'\x02'
done draining
sleeping @ 1491898064.112795
woke up @ 1491898083.113062
draining wakeup socket @ 1491898083.113062
done draining
finally @ 1491898083.113062
sleeping @ 1491898083.113062
woke up @ 1491898083.113062
draining wakeup socket @ 1491898083.113062
done draining
draining wakeup socket @ 1491898083.113062
drained b'\x00'
done draining
sleeping @ 1491898083.113062
woke up @ 1491898083.113062
draining wakeup socket @ 1491898083.113062
done draining
sleeping @ 1491898083.113062
woke up @ 1491898083.113062
draining wakeup socket @ 1491898083.113062
done draining
---------------------------- Captured stderr call -----------------------------
sigint handler @ 1491898083.113062

So the smoking gun here is:

wake up at 1491898064.112795
wakeup socket is readable, and we'll see contains \x02, so the C-level signal handler's wakeup fd code ran
but PyErr_CheckSignals (called via ctypes.pythonapi.PyErr_CheckSignals) says that there are no signals pending and does not run the Python-level signal handler! The bytecode eval loop doesn't run the signal handler either!
so we go back to sleep at 1491898064.112795
then ~20 seconds later at 1491898083.113062, we wake up b/c the select timer expired
and now the Python handler runs immediately (1491898083.113062)

The problem seems to be that in CPython's implementation of trip_signal, the logic goes:

set a flag saying that the signal arrived
then write to the wakeup fd (good, we dodged a race condition! ...but wait!)
then set a global flag saying "some signal arrived" and call Py_AddPendingCall (the latter of which is what sets the magic flag that tells the bytecode eval loop to immediately pause and run signal handlers)

Normatively, it should never be possible to get a write to the wakeup fd but not have a pending signal. So the fact that we observed this in the trace above seems to me to be a smoking gun: the problem is that the IO thread got scheduled to run after the write to the wakeup fd, but before Py_AddPendingCall.

So it's difficult to validate this because I can only reproduce the bug on Windows and I don't have a way to build CPython on Windows, but I'm ~99% confident that this is happening because trip_signal does things in the wrong order. It should write to the wakeup fd last, after it's set all the global variables.

njsmith · 2017-04-11T09:29:03Z

Here's a run I managed to get showing the problem on appveyor: https://ci.appveyor.com/project/njsmith/trio/build/1.0.286/job/mcc0rdjx5fpoqvo6

njsmith · 2017-04-11T10:02:21Z

CPython bug report: https://bugs.python.org/issue30038
Proposed fix: python/cpython#1082

PyPy doesn't seem to be affected.

njsmith · 2017-04-11T10:08:34Z

Maybe this will be a useful workaround for the test suite until fixed versions of CPython are released? https://github.com/box/flaky

This is being flaky on Windows, almost certainly due to a bug in CPython 3.5.{0,1,2,3} and 3.6.{0,1}: https://bugs.python.org/issue30038 More details in python-triogh-119. I've submitted (what I think is) a fix to CPython, but in the mean time the random appveyor failures are pretty annoying and I don't see any way to fix them except wait for CPython to make a new release, so as a temporary measure this commit adds some retry logic to this test. This is an ugly hack and should be removed as soon as possible. (Unfortunately that probably won't be until after we can drop 3.6 support...)

njsmith · 2017-04-11T10:38:31Z

Committed a temporary workaround in #121

njsmith · 2017-04-11T10:49:58Z

Hmm, another more-or-less tolerable workaround would be to have the test simply raise the signal twice. This is what a real person would do if they hit this (i.e., hit control-C again if the first try didn't work), and should work (since if we do write-to-fd, mark-signal-for-delivery, write-to-fd, mark-signal-for-delivery, then that sequence does contain the correct [mark-signal-for-delivery, write-to-fd] sequence as a subsequence).

This makes the workaround more realistic, more precisely targeted, and less messy code-wise.

Different and better workaround for gh-119

njsmith · 2017-04-12T09:01:05Z

#122 implemented the try-control-C-twice workaround, and more precisely targeted it to CPython/Windows/current-versions.

njsmith · 2017-04-19T21:43:28Z

I guess there is one way to work around this without fixing CPython, though it's pretty nasty...

Use SetConsoleCtrlHandler to register our own handler. In our handler, use cffi to call raise(SIGINT), and run the regular signal handler synchronously. Then write to our wakeup fd. Then return the magic value that tells Windows not to run any other handlers. So basically this would be monkeypatching and reimplementing the C runtime's signal delivery code... but I guess it would work?

Before, it was possible to get the following sequence of events (especially on Windows, where the C-level signal handler for SIGINT is run in a separate thread): - SIGINT arrives - trip_signal is called - trip_signal writes to the wakeup fd - the main thread wakes up from select()-or-equivalent - the main thread checks for pending signals, but doesn't see any - the main thread drains the wakeup fd - the main thread goes back to sleep - trip_signal sets is_tripped=1 and calls Py_AddPendingCall to notify the main thread the it should run the Python-level signal handler - the main thread doesn't notice because it's asleep This has been causing repeated failures in the Trio test suite: python-trio/trio#119

njsmith · 2017-06-08T09:06:20Z

it's still broken: https://ci.appveyor.com/project/njsmith/trio/build/1.0.430/job/gwcj27g058r2mqaw

☹

njsmith · 2017-06-08T09:09:21Z

I just saw a different and also very weird failure mode in this test: trying to reproduce in my local VM, I ran with --count=10, and on this particular run all 10 tries the test finished in like 0.85-0.95 seconds, so the assert 1 <= duration check failed. I've never seen this before or since. Super weird.

…2075) Before, it was possible to get the following sequence of events (especially on Windows, where the C-level signal handler for SIGINT is run in a separate thread): - SIGINT arrives - trip_signal is called - trip_signal writes to the wakeup fd - the main thread wakes up from select()-or-equivalent - the main thread checks for pending signals, but doesn't see any - the main thread drains the wakeup fd - the main thread goes back to sleep - trip_signal sets is_tripped=1 and calls Py_AddPendingCall to notify the main thread the it should run the Python-level signal handler - the main thread doesn't notice because it's asleep This has been causing repeated failures in the Trio test suite: python-trio/trio#119 (cherry picked from commit 4ae0149)

…triogh-119

njsmith · 2017-08-04T18:15:36Z

Update:

I figured out one possible reason this could still be failing, that's almost certainly wrong: bpo-31119. AFAICT this is a real bug, but x86/x86-64 provide such strong memory ordering guarantees that it probably doesn't happen in practice on those platforms.

I also figured out why it's actually failing (I think)! Here's a log with --fulltrace enabled: https://ci.appveyor.com/project/njsmith/trio/build/1.0.602/job/9lx65iv7nw877jjf
(which I got by hacking the appveyor build to do pip install pytest-repeat and then pytest [stuff] --fulltrace trio/_core/tests/test_ki.py -k test_ki_wakes_us_up --count=1000)

So the problem is that sometimes, the second ki arrives at a moment when we are in a KI-protected section, so we need to add a checkpoint to check for it. Otherwise it gets detected when our main function exits, which is what we've been seeing.

njsmith · 2017-08-04T20:33:56Z

ran 4 builds with --count=1000, 2 of them failed: https://ci.appveyor.com/project/njsmith/trio/build/1.0.602

added a checkpoint then did another 4 builds with --count=1000, they all passed: https://ci.appveyor.com/project/njsmith/trio/build/1.0.603

So tentatively, I think that this must be it.

This commit passed 4x1000 runs on appveyor: python-trio#119 (comment) https://ci.appveyor.com/project/njsmith/trio/build/1.0.603

njsmith added the Windows label Apr 11, 2017

njsmith mentioned this issue Apr 11, 2017

bpo-30038: fix race condition in signal delivery + wakeup fd python/cpython#1082

Merged

njsmith mentioned this issue Apr 11, 2017

Put together a Python 3.7 wishlist #103

Closed

njsmith mentioned this issue Apr 11, 2017

Work around flakiness in test_ki_wakes_us_up #121

Merged

njsmith added a commit to njsmith/trio that referenced this issue Apr 12, 2017

Different and better workaround for python-triogh-119

9f23f11

This makes the workaround more realistic, more precisely targeted, and less messy code-wise.

njsmith added a commit that referenced this issue Apr 12, 2017

Merge pull request #122 from njsmith/119-workaround-improved

c2d8851

Different and better workaround for gh-119

njsmith mentioned this issue Apr 19, 2017

IO waits + control-C on Windows #42

Closed

njsmith changed the title ~~nondeterministic failure on appveyor~~ control-C sometimes getting missed on Windows Apr 20, 2017

vstinner mentioned this issue Jun 10, 2017

[3.6] bpo-30038: fix race condition in signal delivery + wakeup fd (#1082) python/cpython#2075

Merged

njsmith mentioned this issue Jun 17, 2017

Tracking issue: intermittent test failures #200

Open

njsmith added a commit to njsmith/trio that referenced this issue Aug 4, 2017

[Don't merge] try to get a proper traceback from appveyor for python-…

6fddb39

…triogh-119

njsmith added a commit to njsmith/trio that referenced this issue Aug 4, 2017

[Don't merge] try to get a proper traceback from appveyor for python-…

f6046ef

…triogh-119

njsmith added a commit to njsmith/trio that referenced this issue Aug 4, 2017

What if we add a checkpoint?

ae87d96

This commit passed 4x1000 runs on appveyor: python-trio#119 (comment) https://ci.appveyor.com/project/njsmith/trio/build/1.0.603

njsmith mentioned this issue Aug 4, 2017

Add a checkpoint to fix #119 #263

Merged

njsmith closed this as completed in 50edf8a Aug 4, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

control-C sometimes getting missed on Windows #119

control-C sometimes getting missed on Windows #119

njsmith commented Apr 11, 2017 •

edited

Loading

njsmith commented Apr 11, 2017 •

edited

Loading

njsmith commented Apr 11, 2017

njsmith commented Apr 11, 2017

njsmith commented Apr 11, 2017 •

edited

Loading

njsmith commented Apr 11, 2017

njsmith commented Apr 11, 2017

njsmith commented Apr 11, 2017

njsmith commented Apr 12, 2017

njsmith commented Apr 19, 2017

njsmith commented Jun 8, 2017

njsmith commented Jun 8, 2017

njsmith commented Aug 4, 2017

njsmith commented Aug 4, 2017

control-C sometimes getting missed on Windows #119

control-C sometimes getting missed on Windows #119

Comments

njsmith commented Apr 11, 2017 • edited Loading

njsmith commented Apr 11, 2017 • edited Loading

njsmith commented Apr 11, 2017

njsmith commented Apr 11, 2017

njsmith commented Apr 11, 2017 • edited Loading

njsmith commented Apr 11, 2017

njsmith commented Apr 11, 2017

njsmith commented Apr 11, 2017

njsmith commented Apr 12, 2017

njsmith commented Apr 19, 2017

njsmith commented Jun 8, 2017

njsmith commented Jun 8, 2017

njsmith commented Aug 4, 2017

njsmith commented Aug 4, 2017

njsmith commented Apr 11, 2017 •

edited

Loading

njsmith commented Apr 11, 2017 •

edited

Loading

njsmith commented Apr 11, 2017 •

edited

Loading