Several tests fail on mono with mono_threads_pthread_kill: pthread_kill failed with error 11 #32377
This is happening during a GC STW when Mono is trying to stop a thread. In mono/mono@e301847 we added the ability to abort a preemptive suspend request (at which point GC STW will loop back around and try again), so we could try to recover from this. What do you think @vargaz @lateralusX? Additional reference: here's how Boehm picks the suspend signal: https://github.com/ivmai/bdwgc/blob/b44088c81ce1e864b4c1d05a523c0150beeb7b8b/include/private/gc_priv.h#L2635
On Windows we follow these semantics (I believe Mac does similar): if suspend fails (or GetThreadContext fails), the thread is left running, mono_threads_suspend_begin_async_suspend returns FALSE, and the state machine is reset to a state reflecting that the thread runs as normal (fixed by mono/mono@e301847). On Linux, as you said, you currently get an error that takes down the process before any of that could happen (resetting the thread to the running state and returning FALSE to the caller of mono_threads_suspend_begin_async_suspend).

When STW is driving the suspend, returning FALSE marks that thread to be skipped. So even if the thread state machine correctly reflects the current state of the thread (preventing future asserts), I believe our STW implementation does not fully support such a scenario: STW mainly views those threads as currently shutting down or detaching from the runtime, and does not expect them to continue running managed code. The fix resetting the state machine on failure had a different use case: threads calling suspend/resume on other threads could trigger this failure, and then the next GC would trigger an assert in the state machine, even if the thread was currently running and could successfully be suspended by STW.

So if this is hit during normal STW, I believe the current machinery won't trigger a retry in STW when mono_threads_suspend_begin_async_suspend returns FALSE. If we are going to rely on that, it needs to be fixed in STW. I know we had some sporadic failures calling SuspendThread that on some downstream Mono implementations were resolved with a retry loop (with a limited number of retries); if all retries were exhausted, it would still return FALSE and let the caller handle it. Maybe we could do something similar on Linux. There you also have a specific error code on which you should retry, but it still makes sense to have a limited number of retries.
I also believe we should look into fixing STW to better distinguish failures due to suspending a thread from threads shutting down, detaching from the runtime, etc., or we will always risk exiting a STW with running threads due to suspend failures.
Fails again in #33716. A different set of assemblies fails in each of three runs:
Could the default action be to disable the failing tests, followed by a draft PR re-enabling them that someone can be assigned to push fix commits to? This way only one PR would be red instead of all of them.
@lambdageek could you please take care of this?
Try to address dotnet#32377 (signal queue overflow) by sleeping and retrying a few times.
Ok, we're trying to fix this with #33966. The issue is:
How we're addressing the issue:
Update: …
Update 2: …
Try to address dotnet/runtime#32377 (signal queue overflow) by sleeping and retrying a few times. This is kind of a hack, but it would be good to see if it will address the issue
…19291) Try to address dotnet/runtime#32377 (signal queue overflow) by sleeping and retrying a few times. This is kind of a hack, but it would be good to see if it will address the issue Co-authored-by: lambdageek <lambdageek@users.noreply.github.com>
Closing as the issue looks like it's been addressed. Haven't seen this in recent runs.
@lambdageek it seems like this is happening again: #35040 |
Ok, on a hunch I checked what the realtime signal queue limit is, which is reasonable. But nonetheless somehow we're flooding the realtime signal queue. I don't really have any good ideas for how it might be happening, though. If it's a mono problem, it doesn't seem like just bubbling up the pthread_kill error to the GC STW loop is a good plan (lots of effort and unclear that it'll fix the problem). Going to try running some libraries test suites, like …
Looking at the stack trace from Viktor's comment, I see a lot of threads in ….
Unfortunately, unless I block the signal with ….
The only uses of … are ….
There are a few other threads in other system calls that might be suspicious:
Hm. Will focus on ….
Update:
Current plan: despite not having a reliable reproduction, the right thing to do is to extend the thread suspend machinery to deal with transient thread suspension failures. This comes up on non-POSIX-signals backends too; for example, the Win32 … can also fail transiently. Currently, however, …. So the right solution is to return a more refined return code and give STW a chance to try again if multiple threads are in a transient failure state.
I can't find any cases of this happening in the last 2 weeks. Of course, it's possible that's just my tooling misbehaving, but let's close for now given it's quite plausibly no longer an issue.
From #32364
on:
netcoreapp5.0-Linux-Debug-x64-Mono_release-Ubuntu.1804.Amd64.Open
See https://dev.azure.com/dnceng/public/_build/results?buildId=522989&view=logs&j=b4344b0d-0f92-5d69-ccaf-e0b24fbf14a2&t=555ea487-e6f6-5e9e-ac96-b1fa34f2540d&l=98
Test areas that fail:
https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.IO.FileSystem.Tests/console
https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Linq.Queryable.Tests/console
https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Net.Primitives.Functional.Tests/console
https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Net.ServicePoint.Tests/console
https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Threading.Timer.Tests/console
https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Xml.Linq.xNodeReader.Tests/console
https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Xml.RW.XmlWriterApi.Tests/console
For example:
https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Composition.AttributeModel.Tests/console