-
Notifications
You must be signed in to change notification settings - Fork 303
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use windows console groups for process management #879
Conversation
@amitsaha interested if you have any thoughts on this one! |
Hi @lox - thanks for looping me in. This definitely sounds like an improvement. I can do some testing on this as well, however that will have to wait till early January. I am keen to see how it plays with docker containers on Windows server on our buildkite agents. |
Those are very good questions @amitsaha! I hadn't thought about docker interactions. |
Right now, the behavior that we see is that a job cancellation leaves the docker containers running and then we use a buildkite agent hook to kill any running containers before it starts a build (which spawns new containers). I wonder if docker containers spawned from a process belong to the same console group - 'cause if that's the case, then we could improve the current situation by a lot! |
@amitsaha I'd guess that is because under windows the bootstrap exits immediately on signal, so it doesn't have time to clean up. Whilst this PR doesn't do it, it would open the door of our bootstrap to catch the |
Really good answer on how signal handling works in windows: https://stackoverflow.com/questions/35772001/how-to-handle-the-signal-in-python-on-windows-machine |
2325800
to
d5b9f47
Compare
a2021a2
to
a6eb9db
Compare
So in Docker this results in an error for the Interrupt, but it will fall back to using |
that is interesting. My current use-case is slightly different:
Job is manually cancelled. The spawned docker containers stay running. To clean up, we use a agent hook before the start of the next job to cleanup any running containers. So, now after this change, I was hoping that since the new containers that are spawned by my PS script may be part of the same console group and hence the CTRL_BREAK_EVENT would also kill the containers. That however doesn't seem to be the case. I have a demo here. Any thoughts? |
@amitsaha unfortunately docker containers don't inherit a console group. That would be too sensible a thing for windows to do. This does pave the way for windows things to run pre-exit scripts on cancellation though, which would mean you could use pre-exit hooks to cleanup like we do in the docker-compose plugin. |
Heh.
Can you please explain this a bit more? Is the process group rooted at the bootstrap process? And if that's the case, then we can set up a signal handler in the bootstrap process to run the pre-exit handlers. If that's the case, is the change already in or would need to be done? |
Sure, I'm still a little bit blurry on the mechanics of windows terminal process groups, but basically we are telling windows to create a new process group for the agent bootstrap (which runs everything, and subprocesses will inherit that identifier). Cancellation happens by interrupting the bootstrap and then 10 seconds later killing it. Previously we didn't have any possible way of interrupting on windows in a way that could be caught and allow the bootstrap to cleanup (e.g upload artifacts and run pre-exit hooks). This process group change and the introduction of sending CTRL_BREAK_EVENT means that we are pretty close to having the bootstrap be able to catch that even and do it's cleanup on windows too. It might even just work right now, I'll have a look. |
It turns out Windows has no concept of parent/child processes, beyond storing the PID of the process that created it. This means that trying to walk a process tree to cancel a job can leave sub-processes orphaned if a subprocess in the hierarchy dies, as you can't walk that tree past the dead process. For instance in:
bash.exe (pid 1)
->bash.exe (pid 2)
->sleep.exe (pid 3)
If pid 2 dies, pid 3 remains, but killing pid 1 won't be able to traverse down to pid 3.
The best we can do is create processes inside a "console group" and then send break / ctrl-c events to that group.
Context:
#795
golang/dep#862
golang/go#17608
golang/go#6720
https://docs.microsoft.com/en-us/windows/console/generateconsolectrlevent
nodejs/node#3617