Gracefully shutdown child processes #393

cjolowicz · 2021-03-02T11:24:00Z

Fixes #139

This PR changes Nox's behavior when receiving a keyboard interrupt (CTRL-C, SIGINT) while executing a command with session.run or session.install.

The current behavior is to terminate the child immediately (SIGTERM on Posix, TerminateProcess on Windows) and to reraise the KeyboardInterrupt, ultimately exiting with failure.

The new behavior is to shut down the child gracefully and only raise if the child did not exit with success. The shutdown sequence consists of these four steps:

Check if the child has terminated, with a 300ms timeout. If it did, skip to the last step. Note that the OS delivers the signal to the child as well.
Send SIGTERM and check again, with a 200ms timeout. If it did, skip to the last step.
Send SIGKILL and wait without timeout.
If the exit status of the child process is non-zero, reraise the KeyboardInterrupt. Otherwise, we consider the keyboard interrupt handled and return normally.

A few points of note:

We should probably move logger.error("Interrupted...") from command.run to popen.popen, so the user gets immediate feedback before the shutdown sequence takes course. ~~We'd need to pass logger to popen.popen, though.~~ What do you think?
The timeouts correspond to tox's defaults for its interrupt_timeout and terminate_timeout settings. For commands that need longer to shut down, it may be a good idea to make this configurable in Nox as well. These could be keyword arguments for session.run. I don't think this needs to be addressed by this PR though, right? (Interestingly, tox4 no longer seems to have these configurable.)
tox does the shutdown slightly differently. It first checks if the child has terminated (with a timeout of zero), and sends a duplicate SIGINT if it hasn't. I believe the reason is that the shutdown protocol is also used in cases where there was no initial SIGINT. The rest of the shutdown sequence is the same. tox always reraises the interrupt, even if the child exited with success. (tox4 essentially behaves like tox3, but busy-waits and skips step 3 on Windows where Popen.kill is an alias for Popen.terminate.)
The use case with pytest --pdb is not covered here; Nox still insists on terminating the child. When I tried this, I did get to see pytest's traceback before Nox killed it, which is an improvement. One way to support this use case would be a command-line option to disable sending additional signals to the child. I don't think this needs to be addressed by this PR either. tox also does not support this currently (configuration option disabling sucicide on pytest subprocess tox-dev/tox#1791).
When output is captured and the child handles the interrupt, we silently discard anything Popen.communicate buffered for us before it was interrupted. However, the only documented way to capture output from session.run is the silent flag, and in that case it doesn't matter.

The u"" literal triggers an internal error with black 19.3b0. While strictly unrelated to this feature branch, we fix it here because the internal error only reproduces on this branch. ❯ nox -s lint nox > Running session lint nox > Creating virtual environment (virtualenv) using python3.8 in .nox/lint nox > python -m pip install flake8==3.7.8 black==19.3b0 isort==4.3.21 mypy==0.720 nox > mypy --disallow-untyped-defs --warn-unused-ignores --ignore-missing-imports nox nox > black --check nox tests noxfile.py setup.py error: cannot format …/nox/tests/test_command.py: INTERNAL ERROR: Black produced code that is not equivalent to the source. Please report a bug on https://github.com/ambv/black/issues. This diff might be helpful: /tmp/blk_d4e49p90.log All done! 💥 💔 💥 38 files would be left unchanged, 1 file would fail to reformat. nox > Command black --check nox tests noxfile.py setup.py failed with exit code 123 nox > Session lint failed. ❯ cat /tmp/blk_d4e49p90.log --- src +++ dst @@ -2092,18 +2092,18 @@ value= Dict( keys= Constant( kind= - 'u', # str + None, # NoneType value= 'SIGIL', # str ) # /Constant values= Constant( kind= - 'u', # str + None, # NoneType value= '123', # str ) # /Constant ) # /Dict ) # /keyword

cjolowicz · 2021-03-02T16:29:27Z

The Windows CI fails due to an unexpected KeyboardInterrupt in the test test_interrupt_handled, where the interrupt should have been handled by the child process. The exception was raised when invoking WaitForSingleObject, in the context of Popen.communicate.

I have been able to reproduce this on a Windows system. In fact, both of the test_interrupt_* tests fail some of the time due to KeyboardInterrupt being raised in the parent process at seemingly random places: apart from subprocess.py (when invoking WaitForSingleObject), also at various locations in typing.py and pytest.

A possible explanation for this is an SO answer by @eryksun:

Python's os.kill wraps two unrelated APIs on Windows. It calls GenerateConsoleCtrlEvent when the sig parameter is CTRL_C_EVENT or CTRL_BREAK_EVENT. In this case the pid parameter is a process group ID. [...]

The test calls os.kill indirectly via Popen.send_signal from the our monkeypatched Popen.communicate, passing signal.CTRL_C_EVENT.

So this means that we send the ^C to ourselves as well, because the parent process is in the same process group as the child. At first, this sounds like we might be able to fix this by passing subprocess.CREATE_NEW_PROCESS_GROUP when creating the child process on Windows. However, reading on in the SO answer, this would also require calling SetConsoleCtrlHandler(NULL, FALSE) in the child, which we can't. My conclusion so far is that we should skip these tests on Windows.

My next step was to paste the Python programs from the tests into a Noxfile, and trigger an actual ^C from the console. There was no spurious KeyboardInterrupt in any of these runs. The results were quite interesting, though:

The session without signal handlers behaves as expected: The process immediately terminates, and the session fails.
The session with a signal handler exiting with zero also behaves as expected: The process immediately terminates, and the session succeeds.
The session ignoring SIGINT, on the other hand, ignores not only the ^C, but also the TerminateProcess sent from Nox. The process completes its sleep, and the session succeeds.
The session ignoring both SIGINT and SIGTERM behaves the same.

This another loose end that I haven't had time yet to investigate.

cjolowicz · 2021-03-03T00:01:28Z

833d44f fixes the spurious interrupts in the test on Windows, leveraging the SO post quoted above. Unfortunately, even though all Nox sessions succeed in CI now, Nox mysteriously exits with 1. Haven't been able to reproduce this locally, on my test machine Nox exits with success.

theacodes · 2021-03-03T03:04:01Z

This looks good to me so far, let me know if you get stuck on anything or need some help deciding on something.

cjolowicz · 2021-03-03T12:06:38Z

72f7cd2 fixes the spurious interrupts in the Windows CI. While Nox itself exited with success, its parent process (the one running the CI step) exited with failure. This happened because as a console process, the parent also received the keyboard interrupt triggered by the test. The fix was to simply not send a real interrupt when we're on Windows CI. This means we have to skip the test where the child handles the interrupt (test_interrupt_handled). For the other test (test_interrupt_raises), it does not matter. Outside of CI, both tests pass on Windows.

So this finally ready for review.

Sidenote:

The session ignoring SIGINT, on the other hand, ignores not only the ^C, but
also the TerminateProcess sent from Nox. The process completes its sleep, and
the session succeeds.

This turned out to be not entirely true; what happened was that Nox never called
TerminateProcess. It looks like the KeyboardInterrupt only materialized
after the primary Popen.communicate returned, and the child process had
completed. (CPython sets a Windows event in its signal handler to wake up the
main thread on SIGINT. Somehow that did not happen here.)

Taking a step back, there is no reason Nox has to guarantee termination of a
process on ^C if that process ignores ^C. It seems just fine to provide a best
effort approach that works on POSIX, escalating to SIGTERM and SIGKILL.

eryksun · 2021-03-03T13:02:58Z

I'm not familiar with your specific needs, but here's some general advice.

If you need to test the effect of sending CTRL_C_EVENT (i.e. cancel) to a target process that doesn't manually enable Ctrl+C, then you can't spawn it in a new process group as the target of the console control event. You need to send the event to group 0, i.e. all processes that are attached to the current console session. However, doing that is unreliable for an arbitrary console session, since you don't know what other processes are attached and what will be the effect of sending them Ctrl+C.

You can spawn the test in a new console session, allocated with the process creation flag CREATE_NEW_CONSOLE (or with CREATE_NO_WINDOW if the process works correctly in a windowless console session). Report the result back to the main test process over stdout or some other IPC channel. The test process in the new session can implement the test as follows:

enable Ctrl+C to ensure it's inherited as such
spawn the target process to be tested, without any of the following creation flags:
- CREATE_NEW_PROCESS_GROUP
- CREATE_NEW_CONSOLE
- CREATE_NO_WINDOW
- DETACHED_PROCESS
wait for the target process to be ready
disable Ctrl+C
call os.kill(0, CTRL_C_EVENT) or GenerateConsoleCtrlEvent(CTRL_C_EVENT, 0)
wait for the target process to exit

ctypes wrapper functions:

import signal
import ctypes

kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

def enable_ctrl_c():
    if not kernel32.SetConsoleCtrlHandler(None, False):
        raise ctypes.WinError(ctypes.get_last_error())

def disable_ctrl_c():
    if not kernel32.SetConsoleCtrlHandler(None, True):
        raise ctypes.WinError(ctypes.get_last_error())

def send_ctrl_event(pgid=0, ctrl_event=signal.CTRL_C_EVENT):
    if not kernel32.GenerateConsoleCtrlEvent(ctrl_event, pgid):
        raise ctypes.WinError(ctypes.get_last_error())

cjolowicz · 2021-03-03T23:05:32Z

Thank you @eryksun for taking the time to comment here. Your comment, as well as your posts on SO and bpo, were tremendously helpful!

dd9f3c2 implements the solution proposed above: On Windows, the keyboard interrupt tests are run in a subprocess that is not attached to the current console window. This means that all tests now run on Windows as well, both locally and in CI.

theacodes

This LGTM!

eryksun · 2021-03-04T01:33:20Z

tests/test_command.py

+skip_on_windows_console = pytest.mark.skipif(
+    platform.system() == "Windows" and "DETACHED_FROM_CONSOLE" not in os.environ,
+    reason="Do not run this test attached to a Windows console.",
+)


The wording here is possibly misleading. CREATE_NO_WINDOW and CREATE_NEW_CONSOLE run the process attached to a new console session. The creation flag DETACHED_PROCESS, on the other hand, tells the system to spawn a process detached from any console session. A truly detached process is definitely not wanted here since the test needs to generate and receive a console control event.

FYI, in Windows 8.1+, a console session is a set of processes that are connected to an instance of conhost.exe by way of the ConDrv kernel device. The session window is optional and may be excluded because of CREATE_NO_WINDOW or, in Windows 10, because the console host is running in headless conpty mode (pseudoconsole). For example, each tab in Windows Terminal uses a headless conpty session, with an open-source build of "conhost.exe" that's named "openconsole.exe".

Thank you for pointing that out. d6b0e59 changes the naming, hopefully for the better 😉

cjolowicz · 2021-03-04T11:38:57Z

@theacodes Thanks for the approval! I pushed two more commits:

d6b0e59 improves naming in the tests, following @eryksun 's comment above
3df8312 ensures that the subprocess always uses the same pytest as the main process (even when it's not on PATH)

theacodes · 2021-03-04T14:55:00Z

Cool, still looks good to me. Feel free to merge when you and @eryksun are happy. :)

theacodes · 2021-03-04T15:11:56Z

Oh, I noticed you don't have merge permissions. Feel free to ping me when you're ready to merge (and if you want merge rights, let me know!)

cjolowicz · 2021-03-04T17:13:17Z

Feel free to ping me when you're ready to merge

This is good to go from my side.

(and if you want merge rights, let me know!)

Thanks for offering this, I'd be honored :)

theacodes · 2021-03-04T17:37:23Z

Thanks for offering this, I'd be honored :)

Done. I'll let you merge this :) 🎆

This PR is eligible for $50 from our open collective. Feel free to submit an expense for that if you'd like.

theacodes · 2021-03-04T19:08:56Z

Huzzah!

…

On Thu, Mar 4, 2021 at 2:04 PM Claudio Jolowicz ***@***.***> wrote: Merged #393 <#393> into main. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#393 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAB5I45ZD4VIUUMFWW4Y2MDTB7KUTANCNFSM4YOYUTVA> .

cjolowicz added 3 commits March 2, 2021 11:58

Allow child processes to handle keyboard interrupts

fad13e8

Add tests for keyboard interrupts during commands

a9402d8

Fix spurious interrupts in Windows tests

833d44f

Do not send keyboard interrupts in Windows CI

72f7cd2

Run keyboard interrupt tests on Windows detached from console

dd9f3c2

theacodes approved these changes Mar 4, 2021

View reviewed changes

eryksun reviewed Mar 4, 2021

View reviewed changes

cjolowicz added 2 commits March 4, 2021 11:55

Avoid misleading naming in interrupt tests

d6b0e59

Do not assume that the correct pytest is on PATH

3df8312

cjolowicz merged commit 7ad30a0 into wntrblm:main Mar 4, 2021

cjolowicz deleted the graceful-shutdown branch March 4, 2021 19:04

cjolowicz mentioned this pull request Jun 6, 2021

Release 2021.6.6 #440

Merged

mlech-reef mentioned this pull request Oct 16, 2021

conda python310 maciej-lech/nox#1

Closed

raddessi mentioned this pull request Feb 1, 2022

feat: allow configurable child shutdown timeouts #565

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gracefully shutdown child processes #393

Gracefully shutdown child processes #393

cjolowicz commented Mar 2, 2021 •

edited

Loading

cjolowicz commented Mar 2, 2021

cjolowicz commented Mar 3, 2021

theacodes commented Mar 3, 2021

cjolowicz commented Mar 3, 2021 •

edited

Loading

eryksun commented Mar 3, 2021 •

edited

Loading

cjolowicz commented Mar 3, 2021

theacodes left a comment

eryksun Mar 4, 2021

cjolowicz Mar 4, 2021

cjolowicz commented Mar 4, 2021

theacodes commented Mar 4, 2021

theacodes commented Mar 4, 2021

cjolowicz commented Mar 4, 2021

theacodes commented Mar 4, 2021

theacodes commented Mar 4, 2021 via email

Gracefully shutdown child processes #393

Gracefully shutdown child processes #393

Conversation

cjolowicz commented Mar 2, 2021 • edited Loading

cjolowicz commented Mar 2, 2021

cjolowicz commented Mar 3, 2021

theacodes commented Mar 3, 2021

cjolowicz commented Mar 3, 2021 • edited Loading

eryksun commented Mar 3, 2021 • edited Loading

cjolowicz commented Mar 3, 2021

theacodes left a comment

Choose a reason for hiding this comment

eryksun Mar 4, 2021

Choose a reason for hiding this comment

cjolowicz Mar 4, 2021

Choose a reason for hiding this comment

cjolowicz commented Mar 4, 2021

theacodes commented Mar 4, 2021

theacodes commented Mar 4, 2021

cjolowicz commented Mar 4, 2021

theacodes commented Mar 4, 2021

theacodes commented Mar 4, 2021 via email

cjolowicz commented Mar 2, 2021 •

edited

Loading

cjolowicz commented Mar 3, 2021 •

edited

Loading

eryksun commented Mar 3, 2021 •

edited

Loading