Use windows console groups for process management #879

lox · 2018-12-20T02:09:41Z

It turns out Windows has no concept of parent/child processes, beyond storing the PID of the process that created it. This means that trying to walk a process tree to cancel a job can leave sub-processes orphaned if a subprocess in the hierarchy dies, as you can't walk that tree past the dead process. For instance in:

bash.exe (pid 1) -> bash.exe (pid 2) -> sleep.exe (pid 3)

If pid 2 dies, pid 3 remains, but killing pid 1 won't be able to traverse down to pid 3.

The best we can do is create processes inside a "console group" and then send break / ctrl-c events to that group.

Context:
#795
golang/dep#862
golang/go#17608
golang/go#6720
https://docs.microsoft.com/en-us/windows/console/generateconsolectrlevent
nodejs/node#3617

lox · 2018-12-20T02:25:57Z

@amitsaha interested if you have any thoughts on this one!

amitsaha · 2018-12-20T02:37:53Z

Hi @lox - thanks for looping me in. This definitely sounds like an improvement. I can do some testing on this as well, however that will have to wait till early January. I am keen to see how it plays with docker containers on Windows server on our buildkite agents.

lox · 2018-12-20T02:38:31Z

Those are very good questions @amitsaha! I hadn't thought about docker interactions.

amitsaha · 2018-12-20T02:58:47Z

Right now, the behavior that we see is that a job cancellation leaves the docker containers running and then we use a buildkite agent hook to kill any running containers before it starts a build (which spawns new containers).

I wonder if docker containers spawned from a process belong to the same console group - 'cause if that's the case, then we could improve the current situation by a lot!

lox · 2018-12-20T23:24:04Z

@amitsaha I'd guess that is because under windows the bootstrap exits immediately on signal, so it doesn't have time to clean up. Whilst this PR doesn't do it, it would open the door of our bootstrap to catch the ctrl_c or break and run the pre-exit hooks. Exciting possibility if there aren't any major compatibility issues here.

lox · 2018-12-20T23:36:25Z

Really good answer on how signal handling works in windows: https://stackoverflow.com/questions/35772001/how-to-handle-the-signal-in-python-on-windows-machine

lox · 2019-01-03T20:39:23Z

So in Docker this results in an error for the Interrupt, but it will fall back to using TASKKILL.EXE. I think this is still a good improvement for folks that don't run the agent in windows docker.

docker/for-win#3173

amitsaha · 2019-01-06T23:34:46Z

So in Docker this results in an error for the Interrupt, but it will fall back to using TASKKILL.EXE. I think this is still a good improvement for folks that don't run the agent in windows docker.

docker/for-win#3173

that is interesting. My current use-case is slightly different:

BK agent begins executing new job via PS script
PS Script spawns docker containers
..

Job is manually cancelled. The spawned docker containers stay running. To clean up, we use a agent hook before the start of the next job to cleanup any running containers.

So, now after this change, I was hoping that since the new containers that are spawned by my PS script may be part of the same console group and hence the CTRL_BREAK_EVENT would also kill the containers. That however doesn't seem to be the case. I have a demo here. Any thoughts?

lox · 2019-01-06T23:59:02Z

@amitsaha unfortunately docker containers don't inherit a console group. That would be too sensible a thing for windows to do. This does pave the way for windows things to run pre-exit scripts on cancellation though, which would mean you could use pre-exit hooks to cleanup like we do in the docker-compose plugin.

amitsaha · 2019-01-07T00:12:37Z

That would be too sensible a thing for windows to do.

Heh.

This does pave the way for windows things to run pre-exit scripts on cancellation though, which would mean you could use pre-exit hooks

Can you please explain this a bit more? Is the process group rooted at the bootstrap process? And if that's the case, then we can set up a signal handler in the bootstrap process to run the pre-exit handlers. If that's the case, is the change already in or would need to be done?

lox · 2019-01-07T00:28:28Z

Can you please explain this a bit more?

Sure, I'm still a little bit blurry on the mechanics of windows terminal process groups, but basically we are telling windows to create a new process group for the agent bootstrap (which runs everything, and subprocesses will inherit that identifier). Cancellation happens by interrupting the bootstrap and then 10 seconds later killing it. Previously we didn't have any possible way of interrupting on windows in a way that could be caught and allow the bootstrap to cleanup (e.g upload artifacts and run pre-exit hooks). This process group change and the introduction of sending CTRL_BREAK_EVENT means that we are pretty close to having the bootstrap be able to catch that even and do it's cleanup on windows too. It might even just work right now, I'll have a look.

lox requested a review from sj26 December 20, 2018 02:30

This was referenced Dec 29, 2018

Repeated Warning: Checkout failed errors while preparing working directory #866

Open

Move graceful kill behaviour out of process.Process #882

Merged

lox added the wip label Jan 3, 2019

lox removed the request for review from sj26 January 3, 2019 00:07

Use windows console groups and ctrl-c for signal handling

d5b9f47

lox force-pushed the use-windows-console-groups-for-processes branch from 2325800 to d5b9f47 Compare January 3, 2019 00:41

Skip test that fails in docker

a6eb9db

lox force-pushed the use-windows-console-groups-for-processes branch from a2021a2 to a6eb9db Compare January 3, 2019 04:47

lox removed the wip label Jan 3, 2019

lox merged commit f3203e4 into master Jan 3, 2019

This was referenced Jan 9, 2019

Cancelling a job doesn't run post hooks or upload artifacts #119

Closed

Cancellation behavior on Windows and how to improve it? #794

Closed

Windows: Cancellation and Logs Streaming Behavior #797

Closed

keithduncan deleted the use-windows-console-groups-for-processes branch September 22, 2021 03:51

jesseduffield mentioned this pull request Oct 11, 2023

Ensure child creation time is after parent creation time jesseduffield/kill#1

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use windows console groups for process management #879

Use windows console groups for process management #879

lox commented Dec 20, 2018

lox commented Dec 20, 2018

amitsaha commented Dec 20, 2018

lox commented Dec 20, 2018

amitsaha commented Dec 20, 2018

lox commented Dec 20, 2018

lox commented Dec 20, 2018

lox commented Jan 3, 2019

amitsaha commented Jan 6, 2019

lox commented Jan 6, 2019

amitsaha commented Jan 7, 2019

lox commented Jan 7, 2019

Use windows console groups for process management #879

Use windows console groups for process management #879

Conversation

lox commented Dec 20, 2018

lox commented Dec 20, 2018

amitsaha commented Dec 20, 2018

lox commented Dec 20, 2018

amitsaha commented Dec 20, 2018

lox commented Dec 20, 2018

lox commented Dec 20, 2018

lox commented Jan 3, 2019

amitsaha commented Jan 6, 2019

lox commented Jan 6, 2019

amitsaha commented Jan 7, 2019

lox commented Jan 7, 2019