control-C sometimes getting missed on Windows #119
Comments
Here's another one from a few days ago, which looks identical to the one linked above: https://ci.appveyor.com/project/njsmith/trio/build/1.0.265/job/9gyy7y6jp4vlu472 Instrumentation is in #120, but it looks like this was a bit rare to start with so tracking it down may be difficult :-(
Okay, managed to reproduce locally. Some tricky things required to get a trace: make the sigint trace print to stderr, b/c if it runs re-entrant with one of the stdout prints then you have two calls to the same
So the smoking gun here is:
The problem seems to be that in CPython's implementation of
Normatively, it should never be possible to get a write to the wakeup fd but not have a pending signal. So the fact that we observed this in the trace above seems to me to be a smoking gun: the problem is that the IO thread got scheduled to run after the write to the wakeup fd, but before

So it's difficult to validate this because I can only reproduce the bug on Windows and I don't have a way to build CPython on Windows, but I'm ~99% confident that this is happening because
Here's a run I managed to get showing the problem on appveyor: https://ci.appveyor.com/project/njsmith/trio/build/1.0.286/job/mcc0rdjx5fpoqvo6
CPython bug report: https://bugs.python.org/issue30038 PyPy doesn't seem to be affected.
Maybe this will be a useful workaround for the test suite until fixed versions of CPython are released? https://github.com/box/flaky
This is being flaky on Windows, almost certainly due to a bug in CPython 3.5.{0,1,2,3} and 3.6.{0,1}: https://bugs.python.org/issue30038 More details in gh-119. I've submitted (what I think is) a fix to CPython, but in the meantime the random appveyor failures are pretty annoying and I don't see any way to fix them except to wait for CPython to make a new release, so as a temporary measure this commit adds some retry logic to this test. This is an ugly hack and should be removed as soon as possible. (Unfortunately that probably won't be until after we can drop 3.6 support...)
Committed a temporary workaround in #121
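For illustration only, retry logic of the kind the commit message describes might look roughly like the sketch below. The `retry` decorator, its `attempts` parameter, and the stand-in `flaky_test` are hypothetical names made up here; this is not the code that went into #121.

```python
import functools


def retry(attempts=3):
    """Hypothetical helper: rerun a flaky test until it passes or attempts run out."""

    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(attempts):
                try:
                    return test_fn(*args, **kwargs)
                except AssertionError as exc:
                    # Remember the failure and try again.
                    last_error = exc
            # Every attempt failed: surface the last failure.
            raise last_error

        return wrapper

    return decorator


# Stand-in for a test that misses the first control-C but sees a later one.
calls = {"count": 0}


@retry(attempts=3)
def flaky_test():
    calls["count"] += 1
    assert calls["count"] >= 2, "signal was missed"
    return "passed"


result = flaky_test()
```

A wrapper like this hides every kind of intermittent failure, not just the CPython one, which is presumably part of why the commit message calls the approach an ugly hack.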
Hmm, another more-or-less tolerable workaround would be to have the test simply raise the signal twice. This is what a real person would do if they hit this (i.e., hit control-C again if the first try didn't work), and should work (since if we do write-to-fd, mark-signal-for-delivery, write-to-fd, mark-signal-for-delivery, then that sequence does contain the correct [mark-signal-for-delivery, write-to-fd] sequence as a subsequence).
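The subsequence argument can be checked mechanically. This is only a sketch of the reasoning: `contains_in_order` and the event labels are illustrative names, not anything from CPython or Trio.

```python
def contains_in_order(events, wanted):
    """Return True if `wanted` occurs in `events` as an ordered subsequence."""
    remaining = iter(events)
    # `step in remaining` consumes the iterator up to the first match,
    # so later steps can only match later events.
    return all(step in remaining for step in wanted)


WRITE = "write-to-fd"
MARK = "mark-signal-for-delivery"

one_signal = [WRITE, MARK]    # the buggy order for a single control-C
two_signals = one_signal * 2  # the same order, with the signal raised twice
good_order = [MARK, WRITE]    # the order that wakes the main thread correctly

single_ok = contains_in_order(one_signal, good_order)
double_ok = contains_in_order(two_signals, good_order)
```

Here `single_ok` is false and `double_ok` is true: the mark from the first signal followed by the write from the second is what finally wakes the main thread with a signal already pending.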
This makes the workaround more realistic, more precisely targeted, and less messy code-wise.
Different and better workaround for gh-119
#122 implemented the try-control-C-twice workaround, and more precisely targeted it to CPython/Windows/current-versions.
I guess there is one way to work around this without fixing CPython, though it's pretty nasty... Use |
Before, it was possible to get the following sequence of events (especially on Windows, where the C-level signal handler for SIGINT is run in a separate thread):

- SIGINT arrives
- trip_signal is called
- trip_signal writes to the wakeup fd
- the main thread wakes up from select()-or-equivalent
- the main thread checks for pending signals, but doesn't see any
- the main thread drains the wakeup fd
- the main thread goes back to sleep
- trip_signal sets is_tripped=1 and calls Py_AddPendingCall to notify the main thread that it should run the Python-level signal handler
- the main thread doesn't notice because it's asleep

This has been causing repeated failures in the Trio test suite: python-trio/trio#119
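The sequence of events in this commit message can be replayed with a small deterministic toy model. This is a sketch under loud assumptions: `ToyInterpreter`, its methods, and `run` are hypothetical names, and nothing here is CPython code. It models only the wakeup fd, the is_tripped flag, and a main thread that gets a chance to run after each half of trip_signal.

```python
class ToyInterpreter:
    """Toy model of the wakeup fd and the pending-signal flag (not CPython code)."""

    def __init__(self):
        self.wakeup_fd = []      # bytes waiting in the wakeup fd
        self.is_tripped = False  # "a signal is pending" flag
        self.handled = 0         # times the Python-level handler ran

    # The two halves of trip_signal, split so they can be interleaved.
    def write_wakeup_fd(self):
        self.wakeup_fd.append(b"\x02")  # arbitrary placeholder byte

    def mark_pending(self):
        self.is_tripped = True

    def main_thread_wakes(self):
        """One pass of the main thread: it only runs if the wakeup fd is readable."""
        if not self.wakeup_fd:
            return  # still asleep in select()-or-equivalent
        if self.is_tripped:
            self.is_tripped = False
            self.handled += 1
        self.wakeup_fd.clear()  # drain the fd, then go back to sleep


def run(order):
    """Run trip_signal's two steps in `order`, scheduling the main thread after each."""
    interp = ToyInterpreter()
    for step in order:
        getattr(interp, step)()
        interp.main_thread_wakes()
    return interp.handled


buggy = run(["write_wakeup_fd", "mark_pending"])
fixed = run(["mark_pending", "write_wakeup_fd"])
```

In this model the write-then-mark order handles no signals, while the mark-then-write order handles one, which matches the [mark-signal-for-delivery, write-to-fd] order the thread calls correct.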
it's still broken: https://ci.appveyor.com/project/njsmith/trio/build/1.0.430/job/gwcj27g058r2mqaw ☹
I just saw a different and also very weird failure mode in this test: trying to reproduce in my local VM, I ran with |
…2075)

Before, it was possible to get the following sequence of events (especially on Windows, where the C-level signal handler for SIGINT is run in a separate thread):

- SIGINT arrives
- trip_signal is called
- trip_signal writes to the wakeup fd
- the main thread wakes up from select()-or-equivalent
- the main thread checks for pending signals, but doesn't see any
- the main thread drains the wakeup fd
- the main thread goes back to sleep
- trip_signal sets is_tripped=1 and calls Py_AddPendingCall to notify the main thread that it should run the Python-level signal handler
- the main thread doesn't notice because it's asleep

This has been causing repeated failures in the Trio test suite: python-trio/trio#119

(cherry picked from commit 4ae0149)
Update: I figured out one possible reason this could still be failing, though it's almost certainly not the real cause: bpo-31119. AFAICT this is a real bug, but x86/x86-64 provide such strong memory ordering guarantees that it probably doesn't happen in practice on those platforms.

I also figured out why it's actually failing (I think)! Here's a log with

So the problem is that sometimes, the second KI arrives at a moment when we are in a KI-protected section, so we need to add a checkpoint to check for it. Otherwise it gets detected when our main function exits, which is what we've been seeing.
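To make the explanation concrete, here is a toy model of a KeyboardInterrupt that arrives inside a KI-protected section. Everything in it is an assumption for illustration: `ToyRunner`, `signal_arrives`, `checkpoint`, and `run_test` are hypothetical names and do not reflect how Trio is actually implemented.

```python
class ToyRunner:
    """Toy model of deferred KeyboardInterrupt delivery (not Trio's implementation)."""

    def __init__(self):
        self.ki_protected = False
        self.ki_pending = False

    def signal_arrives(self):
        if self.ki_protected:
            # Cannot raise here; remember it for the next checkpoint.
            self.ki_pending = True
        else:
            raise KeyboardInterrupt

    def checkpoint(self):
        if self.ki_pending:
            self.ki_pending = False
            raise KeyboardInterrupt


def run_test(add_checkpoint):
    runner = ToyRunner()
    runner.ki_protected = True
    runner.signal_arrives()  # the second control-C lands in a protected section
    runner.ki_protected = False
    try:
        if add_checkpoint:
            runner.checkpoint()
    except KeyboardInterrupt:
        return "raised inside the test"
    # Without a checkpoint, the pending KI is only noticed as the run exits.
    return "raised at exit" if runner.ki_pending else "lost"


without_checkpoint = run_test(add_checkpoint=False)
with_checkpoint = run_test(add_checkpoint=True)
```

In the model, the extra checkpoint is what moves the KeyboardInterrupt from the exit path back into the body of the test, where the test expects to catch it.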
ran 4 builds with added a checkpoint then did another 4 builds with So tentatively, I think that this must be it. |
This commit passed 4x1000 runs on appveyor: python-trio#119 (comment) https://ci.appveyor.com/project/njsmith/trio/build/1.0.603
We're getting occasional failures on appveyor, that look like:
(source)
This is a test that delivers a synthetic SIGINT while the trio thread is sleeping, to make sure that the KeyboardInterrupt gets delivered promptly.

_run.py:1136 is the raise KeyboardInterrupt at the end of run() – the one that checks ki.pending one last time as run is exiting.

Given the time report at the end, it looks like the await sleep(20) in test_ki_wakes_us_up may be expiring? And that's the only checkpoint in the test, so if we somehow aren't being woken up by the signal arriving, but the signal is in fact arriving, then this would make sense.

...how can this be happening though? I'm really not sure :-(. May need to push some instrumented builds in a PR and keep rebuilding them on appveyor until I get failures...