Fix double-fork hang on Windows/ARM64 #73

dscho · 2024-09-26T08:22:57Z

This is a new attempt to address the hangs we observed e.g. in msys2/msys2-autobuild#62, based on the new insight that code-cache locks cause the hangs because the associated threads are already gone.

The idea here is to give the threads a tad more time so that the code-cache lock can be lifted before the thread terminates. And who knows, maybe CancelSynchronousIo() by itself causes the code-cache lock to be lifted?

I will mark this as a draft PR for now because this needs extensive testing. In particular, I want to test whether we should imitate the pattern in flock.cc even more faithfully, via something like this:

diff --git a/winsup/cygwin/sigproc.cc b/winsup/cygwin/sigproc.cc
index 18c7bd2648..a36f4bbd2f 100644
--- a/winsup/cygwin/sigproc.cc
+++ b/winsup/cygwin/sigproc.cc
@@ -409,8 +409,15 @@ proc_terminate ()
 	     to 1 iff it is a Cygwin process.  */
 	  if (!have_execed || !have_execed_cygwin)
 	    chld_procs[i]->ppid = 1;
-	  if (chld_procs[i].wait_thread)
-	    CancelSynchronousIo (chld_procs[i].wait_thread->thread_handle ());
+	  cygthread *thr = chld_procs[i].wait_thread;
+	  if (thr)
+	    {
+	      /* If CancelSynchronousIo works we wait for the thread to exit. */
+	      if (CancelSynchronousIo (thr->thread_handle ()))
+		thr->detach ();
+	      else
+		thr->terminate_thread ();
+	    }
 	  /* Release memory associated with this process unless it is 'myself'.
 	     'myself' is only in the chld_procs table when we've execed.  We
 	     reach here when the next process has finished initializing but we

On Wed, 8 May 2024, Jeremy Drake wrote:

> (this is the same issue discussed in
> https://cygwin.com/pipermail/cygwin-patches/2024q1/012621.html)
>
> On MSYS2, running on Windows on ARM64 only, we've been plagued by issues
> with processes hanging up. Usually pacman, when it is trying to validate
> signatures with gpgme. When a process is hung in this way, no debugger
> seems to be able to attach properly.
>
> > anecdotally, the hang occurs when _exit() calls
> > proc_terminate() which is then blocked by a call to TerminateThread()
> > with an invalid thread handle (for more details, see
> > msys2/msys2-autobuild#62 (comment)).

As a follow-up to this, that was from a proposed workaround of just commenting out the double-fork behavior in gpgme. After reading a comment in the code and doing some research online, it seems the double-fork is an accepted idiom on posix to avoid having to wait for the (grand)child, without creating zombie processes. I was unable to see zombie processes in ps or /proc/<pid>, but I did see extra cygpid.* entries in /proc/sys/BaseNamedObjects/cygwin* which seem to be much the same thing.

Today, I was attempting to look at the TerminateThread situation. The call in question comes from the attempt to terminate the wait_thread of a chld_procs entry. I noticed elsewhere in cygwin code (flock.cc) that CancelSynchronousIo was being called, and that stood out to me because chances are that the wait thread (if running) is going to be blocked in ReadFile.

I am testing with the following hack, and so far have not seen a hang:

Applied-from: https://inbox.sourceware.org/cygwin-developers/23f23b0a-e60e-e3ff-4c1e-295599fdc813@jdrake.com/
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
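
For readers unfamiliar with the double-fork idiom mentioned in the quoted message, here is a minimal POSIX sketch of it. It is not gpgme's code; the names spawn_detached and write_marker are hypothetical and exist only for this illustration. The parent waits only for the short-lived intermediate child, so the grandchild is never the parent's zombie.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical example job: write one marker byte to the fd ARG points to.  */
void
write_marker (void *arg)
{
  int fd = *(int *) arg;
  if (write (fd, "x", 1) != 1)
    _exit (1);
}

/* Hypothetical helper illustrating the double-fork idiom.
   Runs JOB (ARG) in a grandchild process that the caller never waits for.
   Returns 0 if the intermediate child exited cleanly, -1 otherwise.  */
int
spawn_detached (void (*job) (void *), void *arg)
{
  pid_t child = fork ();
  if (child < 0)
    return -1;

  if (child == 0)
    {
      /* Intermediate child: fork once more, then exit right away so
         that the grandchild is orphaned and re-parented.  */
      pid_t grandchild = fork ();
      if (grandchild < 0)
        _exit (1);
      if (grandchild == 0)
        {
          /* Grandchild: do the actual work.  */
          job (arg);
          _exit (0);
        }
      _exit (0);
    }

  /* Parent: reap only the intermediate child, which exits immediately.  */
  int status;
  if (waitpid (child, &status, 0) != child)
    return -1;
  return (WIFEXITED (status) && WEXITSTATUS (status) == 0) ? 0 : -1;
}
```

A caller could pass the write end of a pipe as ARG and read the marker byte from the read end to confirm that the grandchild ran. In the scenario discussed here, the relevant step is the intermediate child's immediate _exit() while its own child is still alive, which is when proc_terminate() runs.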

jeremyd2019 · 2024-09-26T18:06:52Z

I tried the detach (the patch in the comment above) first, and that blew up pretty spectacularly. I think I tried something like if (!Cancel...) thr->terminate_thread (); and that would eventually hang. I suspect the change proposed here (in 588f46a) would eventually crash (based on my debugging/testing/hacking on this issue). TLDR: CancelSynchronousIo seemed to help, but the (rarer) case where it returns FALSE is still a problem that I couldn't figure out.

dscho · 2024-09-26T18:08:26Z

In particular, I want to test whether we should imitate the pattern in flock.cc even more faithfully, via something like this: [...]

I should learn how to read: @jeremyd2019 already reported in https://inbox.sourceware.org/cygwin-developers/23f23b0a-e60e-e3ff-4c1e-295599fdc813@jdrake.com/ that:

[...] I first tried
+	      if (CancelSynchronousIo (chld_procs[i].wait_thread->thread_handle ()))
+		chld_procs[i].wait_thread->detach ();
+	      else
+		chld_procs[i].wait_thread->terminate_thread ();
but that resulted in a (debuggable) hang in detach, because the
cygthread::stub was waiting for thread_sync, while cygthread::detach was
waiting for *this. That appears to be because this is an auto-releasing
cygthread. It kind of bothers me that there is no synchronization to be
sure the wait_thread is done shutting down before moving on in
proc_terminate, but I don't see an obvious way in the current structure.

dscho · 2024-09-26T18:13:42Z

@jeremyd2019 FWIW I just tried to reproduce the hang in a VS Code terminal, running update-via-pacman.ps1 in a Git for Windows SDK (which was pretty reliable for me), and I could not get it to hang... Maybe I am doing something subtly different than I used to. Or maybe it is the update to MSYS2 runtime v3.5.4? Here's hoping...

jeremyd2019 · 2024-09-26T18:15:55Z

I think if you tried the double-fork reproducer I had (https://gist.github.com/jeremyd2019/3156721497096d0bba00ef19a507f619) over and over and over, eventually you would see a crash. For me, the QC710 seemed to have an issue sooner than the other devices I have (2023 Dev Kit & Raspberry Pi 4 running Windows 10 - necessarily i686 cygwin 3.3 on that one).

Particularly, the crash showed up in the exit status check https://gist.github.com/jeremyd2019/3156721497096d0bba00ef19a507f619#file-testfork-c-L56
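
As a rough illustration of what an exit status check of this kind can look like (this is not the gist's code; spawn_child and classify_child are hypothetical names), a parent can tell a clean exit apart from a nonzero exit or a signal death via the waitpid() status macros:

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that raises SIG (if nonzero) or
   otherwise exits with EXIT_CODE.  Returns the child's pid, or -1.  */
pid_t
spawn_child (int exit_code, int sig)
{
  pid_t pid = fork ();
  if (pid == 0)
    {
      if (sig)
        raise (sig);
      _exit (exit_code);
    }
  return pid;
}

/* Hypothetical helper: wait for PID and classify how it ended.
   Returns the exit code for a normal exit, 128 + signal number if the
   child was killed by a signal (the convention shells use), and -1 if
   waiting failed.  */
int
classify_child (pid_t pid)
{
  int status;
  pid_t r;

  if (pid < 0)
    return -1;

  /* Retry if the wait itself is interrupted by a signal.  */
  do
    r = waitpid (pid, &status, 0);
  while (r < 0 && errno == EINTR);

  if (r != pid)
    return -1;
  if (WIFEXITED (status))
    return WEXITSTATUS (status);
  if (WIFSIGNALED (status))
    return 128 + WTERMSIG (status);
  return -1;
}
```

Running something like classify_child (spawn_child (0, 0)) in a loop and stopping at the first nonzero result is one way to catch a rare failure in a long series of forks.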

dscho · 2024-09-26T19:58:16Z

@jeremyd2019 right you are; the testfork.exe example hung in something like 20-30 of the first 100 forks.
