Fix double-fork hang on Windows/ARM64 #73

Draft
wants to merge 1 commit into
base: main

Conversation

dscho
Member

@dscho dscho commented Sep 26, 2024

This is a new attempt to address the hangs we observed e.g. in msys2/msys2-autobuild#62, based on the new insight that code-cache locks cause the hangs because the associated threads are already gone.

The idea here is to give the threads a tad more time so that the code-cache lock can be lifted before the thread terminates. And who knows, maybe CancelSynchronousIo() by itself causes the code-cache lock to be lifted?

I will mark this as a draft PR for now because this needs extensive testing. In particular, I want to test whether we should imitate the pattern in flock.cc even more faithfully, via something like this:

diff --git a/winsup/cygwin/sigproc.cc b/winsup/cygwin/sigproc.cc
index 18c7bd2648..a36f4bbd2f 100644
--- a/winsup/cygwin/sigproc.cc
+++ b/winsup/cygwin/sigproc.cc
@@ -409,8 +409,15 @@ proc_terminate ()
 	     to 1 iff it is a Cygwin process.  */
 	  if (!have_execed || !have_execed_cygwin)
 	    chld_procs[i]->ppid = 1;
-	  if (chld_procs[i].wait_thread)
-	    CancelSynchronousIo (chld_procs[i].wait_thread->thread_handle ());
+	  cygthread *thr = chld_procs[i].wait_thread;
+	  if (thr)
+	    {
+	      /* If CancelSynchronousIo works we wait for the thread to exit. */
+	      if (CancelSynchronousIo (thr->thread_handle ()))
+		thr->detach ();
+	      else
+		thr->terminate_thread ();
+	    }
 	  /* Release memory associated with this process unless it is 'myself'.
 	     'myself' is only in the chld_procs table when we've execed.  We
 	     reach here when the next process has finished initializing but we

On Wed, 8 May 2024, Jeremy Drake wrote:

> (this is the same issue discussed in
> https://cygwin.com/pipermail/cygwin-patches/2024q1/012621.html)
>
> On MSYS2, running on Windows on ARM64 only, we've been plagued by issues
> with processes hanging up.  Usually pacman, when it is trying to validate
> signatures with gpgme.  When a process is hung in this way, no debugger
> seems to be able to attach properly.
>
> > anecdotally, the hang occurs when _exit() calls
> > proc_terminate() which is then blocked by a call to TerminateThread()
> > with an invalid thread handle (for more details, see
> > msys2/msys2-autobuild#62 (comment)).

As a follow-up to this, that was from a proposed workaround of just
commenting out the double-fork behavior in gpgme.  After reading a comment
in the code and doing some research online, it seems the double-fork is an
accepted idiom on posix to avoid having to wait for the (grand)child,
without creating zombie processes.  I was unable to see zombie processes
in ps or /proc/<pid>, but I did see extra cygpid.* entries in
/proc/sys/BaseNamedObjects/cygwin* which seem to be much the same thing.
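
For readers unfamiliar with the idiom, here is a minimal sketch of a POSIX double-fork. It is not the gpgme code; `run_detached` and `noop_work` are hypothetical names chosen for illustration. The parent only reaps the short-lived intermediate child, and the grandchild is orphaned so that it gets re-parented and reaped by the system instead.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from gpgme or Cygwin): run `work` in a
   grandchild process that the caller never has to wait for.
   Returns 0 if the intermediate child was created and reaped cleanly,
   -1 otherwise. */
static int run_detached(void (*work)(void))
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Intermediate child: fork once more, then exit right away so
           that the grandchild is orphaned. */
        pid_t grandchild = fork();
        if (grandchild < 0)
            _exit(1);
        if (grandchild == 0) {
            work();
            _exit(0);
        }
        _exit(0);
    }

    /* Parent: wait only for the intermediate child, which exits almost
       immediately, so no zombie is left behind. */
    int status;
    if (waitpid(child, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

/* Placeholder workload for the grandchild. */
static void noop_work(void)
{
}
```

Calling `run_detached(noop_work)` returns as soon as the intermediate child has exited; the caller learns nothing about the grandchild's own exit status, which is the trade-off the idiom accepts.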

Today, I was attempting to look at the TerminateThread situation.  The
call in question comes from the attempt to terminate the wait_thread of a
chld_procs entry.  I noticed elsewhere in cygwin code (flock.cc) that
CancelSynchronousIo was being called, and that stood out to me because
chances are that the wait thread (if running) is going to be blocked in
ReadFile.  I am testing with the following hack, and so far have not seen
a hang:

Applied-from: https://inbox.sourceware.org/cygwin-developers/23f23b0a-e60e-e3ff-4c1e-295599fdc813@jdrake.com/
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
@jeremyd2019

jeremyd2019 commented Sep 26, 2024

I tried the detach (the patch in the comment above) first, and that blew up pretty spectacularly. I think I tried something like if (!Cancel...) thr->terminate_thread (); and that would eventually hang. I suspect the change proposed here (in 588f46a) would eventually crash (based on my debugging/testing/hacking on this issue). TLDR: CancelSynchronousIo seemed to help, but the rarer case where it returns FALSE is still a problem that I couldn't figure out.

@dscho
Member Author

dscho commented Sep 26, 2024

> In particular, I want to test whether we should imitate the pattern in flock.cc even more faithfully, via something like this: [...]

I should learn how to read: @jeremyd2019 already reported in https://inbox.sourceware.org/cygwin-developers/23f23b0a-e60e-e3ff-4c1e-295599fdc813@jdrake.com/ that:

[...] I first tried

+	      if (CancelSynchronousIo (chld_procs[i].wait_thread->thread_handle ()))
+		chld_procs[i].wait_thread->detach ();
+	      else
+		chld_procs[i].wait_thread->terminate_thread ();

but that resulted in a (debuggable) hang in detach, because the
cygthread::stub was waiting for thread_sync, while cygthread::detach was
waiting for *this. That appears to be because this is an auto-releasing
cygthread. It kind of bothers me that there is no synchronization to be
sure the wait_thread is done shutting down before moving on in
proc_terminate, but I don't see an obvious way in the current structure.

@dscho
Member Author

dscho commented Sep 26, 2024

@jeremyd2019 FWIW I just tried to reproduce the hang in a VS Code terminal, running update-via-pacman.ps1 in a Git for Windows SDK (which was pretty reliable for me), and I could not get it to hang... Maybe I am doing something subtly different than I used to. Or maybe it is the update to MSYS2 runtime v3.5.4? Here's hoping...

@jeremyd2019

jeremyd2019 commented Sep 26, 2024

I think if you tried the double-fork reproducer I had (https://gist.github.com/jeremyd2019/3156721497096d0bba00ef19a507f619) over and over and over, eventually you would see a crash. For me, the QC710 seemed to have an issue sooner than the other devices I have (2023 Dev Kit & Raspberry Pi 4 running Windows 10 - necessarily i686 cygwin 3.3 on that one).

Particularly, the crash showed up in the exit status check https://gist.github.com/jeremyd2019/3156721497096d0bba00ef19a507f619#file-testfork-c-L56
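
To illustrate what an exit status check of this kind typically distinguishes, here is a rough sketch. It does not reproduce the gist; it assumes the check is the usual `waitpid` status decode, and `classify_status`, `run_child` and the `child_result` enum are hypothetical names.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical classification of a child's fate (not from the gist). */
enum child_result { CHILD_OK, CHILD_FAILED, CHILD_SIGNALED, CHILD_UNKNOWN };

/* Decode a raw wait status as returned by waitpid(). */
static enum child_result classify_status(int status)
{
    if (WIFEXITED(status))
        return WEXITSTATUS(status) == 0 ? CHILD_OK : CHILD_FAILED;
    if (WIFSIGNALED(status))
        return CHILD_SIGNALED;   /* e.g. the child crashed or was killed */
    return CHILD_UNKNOWN;
}

/* Fork a child that either exits with `exit_code` or, if `sig` is
   non-zero, kills itself with that signal; then wait for it and
   classify the outcome. Returns CHILD_UNKNOWN if fork/waitpid fails. */
static enum child_result run_child(int exit_code, int sig)
{
    pid_t pid = fork();
    if (pid < 0)
        return CHILD_UNKNOWN;

    if (pid == 0) {
        if (sig != 0)
            raise(sig);
        _exit(exit_code);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return CHILD_UNKNOWN;
    return classify_status(status);
}
```

Run in a loop, a check like this makes an abnormal child termination show up as `CHILD_SIGNALED` or `CHILD_FAILED` rather than going unnoticed; a hang, by contrast, shows up as `waitpid` never returning.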

@dscho
Member Author

dscho commented Sep 26, 2024

@jeremyd2019 right you are; the testfork.exe example hung in something like 20-30 of the first 100 forks.
