runtime: TestGdbCoreCrashThreadBacktrace failures #64752

Closed
gopherbot opened this issue Dec 15, 2023 · 20 comments
Labels: compiler/runtime (Issues related to the Go compiler and/or runtime), NeedsFix (The path to resolution is known, but the work has not been done), release-blocker

@gopherbot (Contributor) commented Dec 15, 2023

#!watchflakes
post <- pkg == "runtime" && test == "TestGdbCoreCrashThreadBacktrace"

Issue created automatically to collect these failures.

Example (log):

--- FAIL: TestGdbCoreCrashThreadBacktrace (7.67s)
    runtime-gdb_test.go:81: gdb version 10.1
    runtime-gdb_unix_test.go:334: child output:
        Setting RLIMIT_CORE = syscall.Rlimit{Cur:0xffffffffffffffff, Max:0xffffffffffffffff}
        SIGSEGV: segmentation violation
        PC=0x48cc65 m=0 sigcode=1 addr=0x0
        signal arrived during cgo execution

        goroutine 1 gp=0xc0000081c0 m=0 mp=0x548740 [syscall]:
        runtime.cgocall(0x48cc6e, 0xc00006ff20)
...
        #19 0x000000000043b076 in runtime.mstart0 () at /workdir/go/src/runtime/proc.go:1684
        #20 0x0000000000462c65 in runtime.mstart () at /workdir/go/src/runtime/asm_amd64.s:394
        #21 0x000000000048d669 in crosscall1 () at gcc_amd64.S:42
        #22 0x0000000000802000 in ?? ()
        #23 0x00007f8b74b75fc0 in ?? ()
        #24 0x00007ffd0d4c302f in ?? ()
        #25 0x000000c000008a80 in ?? ()
        #26 0x0000000000462c60 in ?? ()
        #27 0x0000000000000000 in ?? ()
    runtime-gdb_unix_test.go:375: could not find trigger_crash in backtrace


@gopherbot (Contributor Author)

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "runtime" && test == "TestGdbCoreCrashThreadBacktrace"
2023-12-14 22:24 linux-amd64-bullseye go@d73b4322 runtime.TestGdbCoreCrashThreadBacktrace (log)

@gopherbot gopherbot added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. compiler/runtime Issues related to the Go compiler and/or runtime. labels Dec 15, 2023
@prattmic (Member)

This test is new in https://go.dev/cl/536895. The most obvious cause for this failure would be the crash thread failing to crash for >5s. However, we don't have any timing information in the test log.

cc @zzkcode @cherrymui

@prattmic (Member)

Actually, a better theory: usleep doesn't loop on early return of the nanosleep syscall (e.g., if another signal is delivered), so it could return in <5s. We should really fix that.
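To illustrate the retry shape (a minimal user-space sketch, assuming Linux, where syscall.Nanosleep is available; the runtime's own usleep lives in assembly, so its actual fix would look different): a signal-robust sleep must re-issue nanosleep with the remaining duration until the deadline passes.

package main

import (
	"fmt"
	"syscall"
	"time"
)

// sleepFull keeps re-issuing nanosleep until the requested duration has
// fully elapsed, even if a signal interrupts the syscall early (EINTR).
func sleepFull(d time.Duration) {
	deadline := time.Now().Add(d)
	for {
		remaining := time.Until(deadline)
		if remaining <= 0 {
			return
		}
		ts := syscall.NsecToTimespec(remaining.Nanoseconds())
		if err := syscall.Nanosleep(&ts, nil); err != syscall.EINTR {
			return // slept fully, or hit an unexpected error
		}
		// EINTR: a signal arrived; loop and sleep the remainder.
	}
}

func main() {
	start := time.Now()
	sleepFull(100 * time.Millisecond)
	fmt.Println("slept", time.Since(start))
}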

@zzkcode (Contributor) commented Dec 16, 2023

Hi @prattmic , thanks! I would like to send a CL to fix this.

> This test is new in https://go.dev/cl/536895. The most obvious cause for this failure would be the crash thread failing to crash for >5s. However, we don't have any timing information in the test log.
>
> cc @zzkcode @cherrymui

Thanks. Are you saying we should add more timing logs to signal_unix.go in each loop (perhaps also how many ms we are still waiting for SIGQUIT)? It seems hard to print timing details from the test code itself, other than the overall time.

> Actually, a better theory: usleep doesn't loop on early return of the nanosleep syscall (e.g., if another signal is delivered), so it could return in <5s. We should really fix that.

Sorry, I'm not sure I follow. Are you saying that usleep may sleep for less than we want if a signal is delivered, so one of the threads (not the faulting thread that caused the crash) may crash the process instead of the faulting thread? Should we loop on nanosleep instead?

My guess is that in this case it really did take more than 5s to deliver the SIGQUIT signals, since the test ran for 7.67s.

> --- FAIL: TestGdbCoreCrashThreadBacktrace (7.67s)

However, it still seems strange to me: the Ms that are in the usleep state should be either the faulting thread or threads that have already received SIGQUIT. In the former case, this test should still pass; in the latter, won't they just have the signal blocked?

// All m's that have already received SIGQUIT have signal masks blocking
// receipt of any signals, so the SIGQUIT will go to an m that hasn't seen it yet.

@bcmills (Contributor) commented Jan 2, 2024

@prattmic, since this test is new as of Go 1.22, should it be marked as a release blocker?

If an end user runs go test all, that will also (necessarily) run the tests for runtime, so spurious failures in that package can add a lot of noise depending on the user's environment (compare #39204).

@prattmic (Member) commented Jan 2, 2024

This has only failed once, so failure rate doesn't currently seem too bad, but ultimately this is testing a best-effort feature that intentionally "breaks" after 5s, so it won't ever be 100% reliable (given a slow enough machine). That makes it a good candidate for #39204.

@zzkcode thanks for pointing out the total test time, I missed that. That does imply that 5s really did pass, rather than something else happening, like an early usleep return. It would be nice to get the core file to see what was happening. LUCI is theoretically capable of collecting extra files, but that isn't hooked up yet.

For now, perhaps we should modify the test to run a few times (3?) and at least verify that the functionality isn't failing 3/3 times?
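A rough sketch of that retry shape; runCrashBacktraceOnce is a hypothetical stand-in for the existing single-run test body, not a real helper in runtime-gdb_unix_test.go:

package runtime_test

import "testing"

// runCrashBacktraceOnce stands in for the existing test body (build the
// child, crash it, run gdb on the core), returning an error instead of
// calling t.Fatal so the caller can retry.
func runCrashBacktraceOnce(t *testing.T) error {
	// ... existing test logic ...
	return nil
}

func TestGdbCoreCrashThreadBacktrace(t *testing.T) {
	const attempts = 3
	var lastErr error
	for i := 1; i <= attempts; i++ {
		if lastErr = runCrashBacktraceOnce(t); lastErr == nil {
			return // the feature worked at least once; good enough
		}
		t.Logf("attempt %d/%d failed: %v", i, attempts, lastErr)
	}
	t.Fatalf("trigger_crash missing from backtrace in all %d attempts: %v", attempts, lastErr)
}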

@zzkcode (Contributor) commented Jan 3, 2024

Hi @prattmic, thanks for the confirmation.

> For now, perhaps we should modify the test to run a few times (3?) and at least verify that the functionality isn't failing 3/3 times?

IMHO, since this test case is a small Go+cgo program with only a few (5 or 6?) threads to quit, it should usually take just a few seconds to exit. So we would expect this case to pass (all threads receive SIGQUIT and quit within 5 seconds) in the vast majority of cases, though not all of them (it will eventually quit anyway after 5 seconds). As you said, however, it may be running on a slow enough machine (bullseye here, a kind of Raspberry Pi OS?), which is indeed slower compared to PCs.

I'm not sure whether looping the test more times is the best choice, or whether we should just skip it on some platforms. I'd prefer the former for now, since it has only been reported once as you mentioned, though it wouldn't surprise me if it someday fails all 3 times (a rare event). So let's loop up to 3 times: if a run passes, break out and skip the remaining iterations; otherwise, give it another two chances. It may take longer (15s+) to run this single case in extreme scenarios. I'd like to submit a CL to fix this if that's OK.

One more tiny thing (unrelated to this test, but related to the code change); I don't know whether it's worth a change (maybe it should be a separate issue): since 1.22rc1 was released, I have tried it several times. A crashing process now usually takes at least 500ms (per loop) to quit, somewhat slower than 1.21 or earlier releases (as if it briefly sticks somewhere while waiting for the first thread to crash). Maybe we could use exponential backoff, or just double the sleep time, if this turns out to be an issue. Not a big deal, just an observation.

FYI @cherrymui @bcmills

@bcmills (Contributor) commented Jan 3, 2024

In general it is not a good idea for tests to assume that something happens within any particular arbitrary timeframe. Is there some way to eliminate the arbitrary 5-second interval?

For example, could the test wait indefinitely instead of just for 5 seconds, or poll indefinitely to check whether the backtrace has been recorded yet? Then the timeout behavior of the test would follow the go test -timeout flag rather than an arbitrary machine-specific constant.

@prattmic (Member) commented Jan 3, 2024

> Is there some way to eliminate the arbitrary 5-second interval?

Not easily; the timeout is hard-coded into a sleep in the fatal signal handler: https://cs.opensource.google/go/go/+/master:src/runtime/signal_unix.go;l=784;drc=de5b418bea70aaf27de1f47e9b5813940d1e15a4. But I suppose we could use a global variable and have the test use //go:linkname to change the global to a very large number. Maybe that is the best path, however ugly.
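A minimal sketch of that //go:linkname pattern, assuming the hard-coded timeout becomes a runtime variable named totalSleepTimeUs (the name used in the patch further down this thread). Note, as @zzkcode discovers below, this only affects the test process itself, not a child binary built separately:

package runtime_test

import (
	_ "unsafe" // required for go:linkname
)

// Assumes the runtime's 5s watchdog sleep is replaced by a package-level
// variable that the test can overwrite.
//
//go:linkname totalSleepTimeUs runtime.totalSleepTimeUs
var totalSleepTimeUs int

func init() {
	// Effectively disable the watchdog so a slow builder cannot time
	// out before the crashing thread finishes its SIGQUIT rounds.
	totalSleepTimeUs = 60 * 1000 * 1000 // 60s
}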

@prattmic (Member) commented Jan 3, 2024

> A crashing process now usually takes at least 500ms (per loop) to quit, somewhat slower than 1.21 or earlier releases (as if it briefly sticks somewhere while waiting for the first thread to crash).

This is a good point. This is an effect of https://cs.opensource.google/go/go/+/master:src/runtime/signal_unix.go;l=781;drc=de5b418bea70aaf27de1f47e9b5813940d1e15a4. The crashing thread will almost certainly sleep for 500ms at least once while waiting for SIGQUIT to make its rounds. IMO, we should just reduce this to something small, like 1ms, as it shouldn't take long for other threads to exit.

@bcmills (Contributor) commented Jan 3, 2024

> the timeout is hard-coded into a sleep in the fatal signal handler

Oof, that's unfortunate. Is there some reason the runtime's watchdog needs to be that short?
AFAICT the SIGQUIT sequence is O(M) with the number of runtime Ms, so an O(1) timeout on that sequence seems out of place.

Perhaps we could extend the sleep until crashing.Load() stops increasing? That would allow an arbitrarily long O(M) crash sequence while still providing a watchdog in case the crash sequence wedges.
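A sketch of that progress-based watchdog, written against the identifiers that appear in the patch below (crashing, mcount, extraMLength, usleep); the structure is illustrative, not actual runtime code:

// Illustrative only: keep waiting as long as crashing.Load() is still
// increasing; give up only after it has stalled for a full window.
const stallWindowUs = 5 * 1000 * 1000 // bail after 5s of no progress

lastSeen := crashing.Load()
stalledUs := 0
for crashing.Load() < mcount()-int32(extraMLength.Load()) {
	usleep(1000) // short poll, per the 1ms suggestion above
	if now := crashing.Load(); now != lastSeen {
		lastSeen, stalledUs = now, 0 // progress: reset the stall timer
		continue
	}
	stalledUs += 1000
	if stalledUs >= stallWindowUs {
		break // watchdog: the crash sequence appears wedged
	}
}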

@zzkcode (Contributor) commented Jan 4, 2024

> The crashing thread will almost certainly sleep for 500ms at least once while waiting for SIGQUIT to make its rounds. IMO, we should just reduce this to something small, like 1ms, as it shouldn't take long for other threads to exit.

@prattmic Maybe it will need more time to crash in a large enough application, like one with thousands of threads? In that case, it may have to loop many times through check -> sleep -> wake up -> check...? How about a slightly larger value, like 10ms?

> Oof, that's unfortunate. Is there some reason the runtime's watchdog needs to be that short?

@bcmills I don't know the original background. However, this behavior is intended as part of the fix in https://go.dev/cl/536895. Before that change, the first thread to see that all threads had received SIGQUIT would start the crash (usually the last thread to receive the signal), and all the other threads would sleep for 5 seconds. If the thread that should crash somehow missed crashing in some edge case, the others would still be able to crash the process, so the process would always take no more than 5 seconds to crash. The change just follows that pattern. Maybe it is simply for crashing ASAP? FYI @ianlancetaylor

@prattmic @bcmills I made a tiny change; is this acceptable?

  • Increase the total sleep time to 10s to fix the test flake (it may need up to 10s to exit in the worst case);
  • Sleep 1ms per loop to crash as soon as possible (it may loop thousands of times in the worst case).
diff --git a/src/runtime/signal_unix.go b/src/runtime/signal_unix.go
index 84391d58ed..dc48067b45 100644
--- a/src/runtime/signal_unix.go
+++ b/src/runtime/signal_unix.go
@@ -774,14 +774,16 @@ func sighandler(sig uint32, info *siginfo, ctxt unsafe.Pointer, gp *g) {
                        print("\n-----\n\n")
                        raiseproc(_SIGQUIT)
                }
+               totalSleepTimeUs := 10 * 1000 * 1000
                if isCrashThread {
-                       i := 0
-                       for (crashing.Load() < mcount()-int32(extraMLength.Load())) && i < 10 {
-                               i++
-                               usleep(500 * 1000)
+                       perLoopSleepTimeUs := 1000
+                       i := totalSleepTimeUs / perLoopSleepTimeUs
+                       for (crashing.Load() < mcount()-int32(extraMLength.Load())) && i > 0 {
+                               i--
+                               usleep(uint32(perLoopSleepTimeUs))
                        }
                } else {
-                       usleep(5 * 1000 * 1000)
+                       usleep(uint32(totalSleepTimeUs))
                }
                printDebugLog()
                crash()

The test passes on Linux.

@prattmic (Member) commented Jan 5, 2024

> @prattmic Maybe it will need more time to crash in a large enough application, like one with thousands of threads? In that case, it may have to loop many times through check -> sleep -> wake up -> check...? How about a slightly larger value, like 10ms?

I was unsure how long making these crash rounds takes, so I added some quick and dirty logging to the runtime and got:

8 threads: crash time us 5430
16 threads: crash time us 10572
32 threads: crash time us 16223
64 threads: crash time us 40343
128 threads: crash time us 100810
256 threads: crash time us 203888
512 threads: crash time us 406889

So approximately 1ms per thread (this feels rather high to me, but now isn't the time to worry about optimizing crash too much).

Perhaps 5ms is a good sleep value given these numbers. We could get fancy and have this depend on the M count, but I'm really not very concerned about the inefficiency of one thread waking a bit too frequently when we are crashing.

Your patch above looks good, feel free to send a CL. Note that you'll need to make totalSleepTimeUs a global variable for the test to be able to modify it with linkname.

@gopherbot (Contributor Author)

Change https://go.dev/cl/554615 mentions this issue: runtime: crash asap and extend total sleep time for slow machine in test

@zzkcode (Contributor) commented Jan 8, 2024

Hi @prattmic, I have submitted a CL. However, I don't know whether I understood you correctly or just did it the wrong way: the totalSleepTimeUs in the test successfully links to runtime.totalSleepTimeUs, but changing it during the test seems to have no effect. TestGdbCoreCrashThreadBacktrace uses the go tool (from testenv.GoToolPath) to build a separate binary from coreCrashThreadSource, so the assignment totalSleepTimeUs = 60 * 1000 * 1000 doesn't work: the child binary is built separately from the test and never sees the change?

Sorry if I misunderstood you; please advise me on the right way to do this. Thanks.

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. FixPending Issues that have a fix which has not yet been reviewed or submitted. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jan 8, 2024
@bcmills (Contributor) commented Jan 10, 2024

Some analysis of the test log reported by watchflakes.

The program ends up running six goroutines in total. When run on my workstation, at the time of the crash the goroutines are parked in the following locations:

In a successful run on my local workstation, I see SIGQUIT: quit logs for five background threads:

In the builder log, we see SIGQUIT: quit logs for only four background threads:

  • one in notetsleep via sysmon (as expected)
  • one in notesleep via templateThread (as expected)
  • one in allocm via schedule, resetspinning, and newm
  • one in stopm via schedule and startlockedm

The thread in allocm looks suspiciously like a race in the runtime. In particular:

  • startm calls mReserveID (incrementing mnext) before it calls newm
  • The newm thread was interrupted by SIGQUIT during allocm, which is before the new thread is actually started for the M (in newm1).

So this looks like a genuine watchdog failure: the kernel defers delivery of the final SIGQUIT signal because all of the threads that could receive it are already handling a SIGQUIT (with that signal presumably masked), and the main thread is spinning in the docrash loop waiting for acknowledgement from a fifth thread that had an M ID reserved but was never actually started.

Perhaps startm needs to block signals from some point before calling mReserveID until newm has returned?
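A sketch of the shape of that fix; sigblockAll and sigrestore are hypothetical stand-ins for the runtime's signal-mask plumbing, and the real startm has a different signature:

// Illustrative pseudocode only, not the actual fix that landed.
func startm(pp *p, spinning bool) {
	sigblockAll()      // window opens: this thread cannot take SIGQUIT
	id := mReserveID() // mnext++; the crash loop now expects this M...
	newm(fn, pp, id)   // ...so keep signals blocked until the thread exists
	sigrestore()       // window closes: SIGQUIT is deliverable again
}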

@bcmills bcmills removed the FixPending Issues that have a fix which has not yet been reviewed or submitted. label Jan 10, 2024
@prattmic (Member)

Thanks for the analysis @bcmills! I agree with you, this looks like a genuine race condition. It is also not new in https://go.dev/cl/536895; it appears to have ~always existed (https://go.dev/cl/245018 extended the race window, but it was present before that too).

Regarding a fix, I'll need to think some more. I'm not thrilled about masking signals, but I don't have a better idea at the moment.

I could imagine this causing some of those other crashes, but not all. e.g., #60316: I don't see how this could manifest as a SIGCHLD death.

Since this bug isn't new, fixing it need not block the release, but we should still do something about the test. Should we mark it flaky until this bug is fixed?

@bcmills (Contributor) commented Jan 11, 2024

Yep, I agree that a testenv.SkipFlaky in the new test would be fine to unblock the Go 1.22 release.

@gopherbot (Contributor Author)

Change https://go.dev/cl/556356 mentions this issue: runtime: mark TestGdbCoreCrashThreadBacktrace as flaky

ezz-no pushed a commit to ezz-no/go-ezzno that referenced this issue Feb 18, 2024
This test exercises the SIGQUIT crash loop and managed to trigger the
race from golang#65138 at least once.

For golang#65138.
Fixes golang#64752.

Change-Id: I11091510aa7ae4f58b1d748e53df2e3e3dbfb323
Reviewed-on: https://go-review.googlesource.com/c/go/+/556356
Auto-Submit: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
gopherbot pushed a commit that referenced this issue May 10, 2024
Running with few threads usually does not need 500ms to crash, so let
it crash as soon as possible. Since the test may take more time on a
slow machine, extend the total sleep time in the test.

Updates #64752

Change-Id: I635fab846bd5e1735808d4d47bb9032d5a04cc2b
GitHub-Last-Rev: 84f3844
GitHub-Pull-Request: #65018
Reviewed-on: https://go-review.googlesource.com/c/go/+/554615
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
Auto-Submit: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>