Fix double-fork hang on Windows/ARM64 #73
base: main
On Wed, 8 May 2024, Jeremy Drake wrote:

> (this is the same issue discussed in
> https://cygwin.com/pipermail/cygwin-patches/2024q1/012621.html)
>
> On MSYS2, running on Windows on ARM64 only, we've been plagued by issues
> with processes hanging up. Usually pacman, when it is trying to validate
> signatures with gpgme. When a process is hung in this way, no debugger
> seems to be able to attach properly.
>
> > anecdotally, the hang occurs when _exit() calls
> > proc_terminate() which is then blocked by a call to TerminateThread()
> > with an invalid thread handle (for more details, see
> > msys2/msys2-autobuild#62 (comment)).

As a follow-up to this, that was from a proposed workaround of just commenting out the double-fork behavior in gpgme. After reading a comment in the code and doing some research online, it seems the double-fork is an accepted idiom on posix to avoid having to wait for the (grand)child, without creating zombie processes. I was unable to see zombie processes in ps or /proc/<pid>, but I did see extra cygpid.* entries in /proc/sys/BaseNamedObjects/cygwin* which seem to be much the same thing.

Today, I was attempting to look at the TerminateThread situation. The call in question comes from the attempt to terminate the wait_thread of a chld_procs entry. I noticed elsewhere in cygwin code (flock.cc) that CancelSynchronousIo was being called, and that stood out to me because chances are that the wait thread (if running) is going to be blocked in ReadFile.

I am testing with the following hack, and so far have not seen a hang:

Applied-from: https://inbox.sourceware.org/cygwin-developers/23f23b0a-e60e-e3ff-4c1e-295599fdc813@jdrake.com/
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
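For readers unfamiliar with the double-fork idiom mentioned in the commit message, here is a minimal POSIX sketch of it. It is only an illustration of the general pattern, not the gpgme or Cygwin code; `spawn_detached` and `example_task` are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `task` in a grandchild process that the caller
 * never has to wait for. Returns 0 once the short-lived intermediate child
 * has been reaped successfully, -1 on error. */
int spawn_detached(int (*task)(void))
{
    assert(task != NULL);

    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Intermediate child: fork once more, then exit right away. */
        pid_t grandchild = fork();
        if (grandchild < 0)
            _exit(1);
        if (grandchild == 0) {
            /* Grandchild: its parent exits immediately, so it is orphaned
             * and reaped by init (or a subreaper), not by the caller. */
            _exit(task());
        }
        _exit(0);
    }

    /* Original parent: reap only the intermediate child, which exits
     * almost immediately, so no zombie is left behind. */
    int status;
    pid_t r;
    do {
        r = waitpid(child, &status, 0);
    } while (r < 0 && errno == EINTR);
    if (r < 0)
        return -1;

    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

/* Hypothetical task used for illustration only. */
int example_task(void)
{
    static const char msg[] = "grandchild running\n";
    return write(STDOUT_FILENO, msg, sizeof msg - 1) < 0 ? 1 : 0;
}
```

Calling `spawn_detached(example_task)` returns as soon as the intermediate child has exited; the grandchild's exit status is never seen by the caller, which is the point of the idiom.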
I tried the detach (the patch in the comment above) first, and that blew up pretty spectacularly. I think I tried something like
I should learn how to read: @jeremyd2019 already reported in https://inbox.sourceware.org/cygwin-developers/23f23b0a-e60e-e3ff-4c1e-295599fdc813@jdrake.com/ that:
@jeremyd2019 FWIW I just tried to reproduce the hang in a VS Code terminal, running
I think if you tried the double-fork reproducer I had (https://gist.github.com/jeremyd2019/3156721497096d0bba00ef19a507f619) over and over and over, eventually you would see a crash. For me, the QC710 seemed to have an issue sooner than the other devices I have (2023 Dev Kit & Raspberry Pi 4 running Windows 10, necessarily i686 cygwin 3.3 on that one). In particular, the crash showed up in the exit status check https://gist.github.com/jeremyd2019/3156721497096d0bba00ef19a507f619#file-testfork-c-L56
@jeremyd2019 right you are; the
This is a new attempt to address the hangs we observed e.g. in msys2/msys2-autobuild#62, based on the new insight that code-cache locks cause the hangs because the associated threads are already gone.
The idea here is to give the threads a tad more time so that the code-cache lock can be lifted before the thread terminates. And who knows, maybe `CancelSynchronousIo()` by itself causes the code-cache lock to be lifted?

I will mark this as a draft PR for now because this needs extensive testing. In particular, I want to test whether we should imitate the pattern in `flock.cc` even more faithfully, via something like this: