
Several tests fail on mono with mono_threads_pthread_kill: pthread_kill failed with error 11 #32377

Closed
ahsonkhan opened this issue Feb 15, 2020 · 16 comments
Labels
area-VM-threading-mono blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' runtime-mono specific to the Mono runtime

@ahsonkhan (Member)

From #32364, on:

  • netcoreapp5.0-Linux-Debug-x64-Mono_release-Ubuntu.1804.Amd64.Open

See https://dev.azure.com/dnceng/public/_build/results?buildId=522989&view=logs&j=b4344b0d-0f92-5d69-ccaf-e0b24fbf14a2&t=555ea487-e6f6-5e9e-ac96-b1fa34f2540d&l=98

Test areas that fail:
https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.IO.FileSystem.Tests/console
https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Linq.Queryable.Tests/console
https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Net.Primitives.Functional.Tests/console
https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Net.ServicePoint.Tests/console
https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Threading.Timer.Tests/console
https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Xml.Linq.xNodeReader.Tests/console
https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Xml.RW.XmlWriterApi.Tests/console

For example:
https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Composition.AttributeModel.Tests/console

===========================================================================================================
~/work/A8F20901/w/8E960823/e ~/work/A8F20901/w/8E960823/e
  Discovering: System.Composition.AttributeModel.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Composition.AttributeModel.Tests (found 19 test cases)
  Starting:    System.Composition.AttributeModel.Tests (parallel test collections = on, max threads = 2)
mono_threads_pthread_kill: pthread_kill failed with error 11 - potential kernel OOM or signal queue overflow

=================================================================
	Native Crash Reporting
=================================================================
Got a SIGABRT while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================

=================================================================
	Native stacktrace:
=================================================================
	0x7f5d815d7cbd - /home/helixbot/work/A8F20901/p/shared/Microsoft.NETCore.App/5.0.0/libcoreclr.so : 
	0x7f5d8157d7b5 - /home/helixbot/work/A8F20901/p/shared/Microsoft.NETCore.App/5.0.0/libcoreclr.so : 
	0x7f5d815d7355 - /home/helixbot/work/A8F20901/p/shared/Microsoft.NETCore.App/5.0.0/libcoreclr.so : 
	0x7f5d82e87890 - /lib/x86_64-linux-gnu/libpthread.so.0 : 
	0x7f5d81f7fe97 - /lib/x86_64-linux-gnu/libc.so.6 : gsignal
	0x7f5d81f81801 - /lib/x86_64-linux-gnu/libc.so.6 : abort
	0x7f5d81786b35 - /home/helixbot/work/A8F20901/p/shared/Microsoft.NETCore.App/5.0.0/libcoreclr.so : 
	0x7f5d8176acd0 - /home/helixbot/work/A8F20901/p/shared/Microsoft.NETCore.App/5.0.0/libcoreclr.so : 
	0x7f5d81786f25 - /home/helixbot/work/A8F20901/p/shared/Microsoft.NETCore.App/5.0.0/libcoreclr.so : 
	0x7f5d81786fbe - /home/helixbot/work/A8F20901/p/shared/Microsoft.NETCore.App/5.0.0/libcoreclr.so : monoeg_g_log
	0x7f5d8177e313 - /home/helixbot/work/A8F20901/p/shared/Microsoft.NETCore.App/5.0.0/libcoreclr.so : 
	0x7f5d8177ac82 - /home/helixbot/work/A8F20901/p/shared/Microsoft.NETCore.App/5.0.0/libcoreclr.so : 
	0x7f5d8171b88e - /home/helixbot/work/A8F20901/p/shared/Microsoft.NETCore.App/5.0.0/libcoreclr.so : 
	0x7f5d8172e92d - /home/helixbot/work/A8F20901/p/shared/Microsoft.NETCore.App/5.0.0/libcoreclr.so : 
	0x7f5d8172b66f - /home/helixbot/work/A8F20901/p/shared/Microsoft.NETCore.App/5.0.0/libcoreclr.so : 
	0x7f5d8172b5d3 - /home/helixbot/work/A8F20901/p/shared/Microsoft.NETCore.App/5.0.0/libcoreclr.so : 
	0x7f5d81720f6e - /home/helixbot/work/A8F20901/p/shared/Microsoft.NETCore.App/5.0.0/libcoreclr.so : 
	0x7f5d817213e9 - /home/helixbot/work/A8F20901/p/shared/Microsoft.NETCore.App/5.0.0/libcoreclr.so : 
	0x7f5d8171d3c6 - /home/helixbot/work/A8F20901/p/shared/Microsoft.NETCore.App/5.0.0/libcoreclr.so : 
	0x414f0f87 - Unknown

=================================================================
	Telemetry Dumper:
=================================================================
Pkilling 0x140039405434624x from 0x140039413958400x
Pkilling 0x140039540684544x from 0x140039413958400x
Pkilling 0x140039411816192x from 0x140039413958400x
Pkilling 0x140039609288512x from 0x140039413958400x
Pkilling 0x140039416076032x from 0x140039413958400x
Pkilling 0x140039409698560x from 0x140039413958400x
Pkilling 0x140039407552256x from 0x140039413958400x
Entering thread summarizer pause from 0x140039413958400x
Finished thread summarizer pause from 0x140039413958400x.
Could not exec mono-hang-watchdog, expected on path '/usr/local/etc/../bin/mono-hang-watchdog' (errno 2)

Waiting for dumping threads to resume

=================================================================
	External Debugger Dump:
=================================================================
[New LWP 27863]
[New LWP 27864]
[New LWP 27865]
[New LWP 27868]
[New LWP 27869]
[New LWP 27870]
[New LWP 27871]
[New LWP 27872]
[New LWP 27873]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f5d82e829f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555946394848) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88	../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
  Id   Target Id         Frame 
* 1    Thread 0x7f5d832a8740 (LWP 27862) "dotnet" 0x00007f5d82e829f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555946394848) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  2    Thread 0x7f5d80bff700 (LWP 27863) "SGen worker" 0x00007f5d82e829f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f5d81a7c798 <work_cond+40>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  3    Thread 0x7f5d7f13b700 (LWP 27864) "Finalizer" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  4    Thread 0x7f5d77b0c700 (LWP 27865) "dotnet" 0x00007f5d82e86384 in __libc_read (fd=5, buf=0x7f5d77b0bee7, nbytes=1) at ../sysdeps/unix/sysv/linux/read.c:27
  5    Thread 0x7f5d77860700 (LWP 27868) "dotnet" 0x00007f5d82e8723a in __waitpid (pid=27875, stat_loc=0x7f5d7785d2bc, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
  6    Thread 0x7f5d77a65700 (LWP 27869) "dotnet" 0x00007f5d82e82ed9 in futex_reltimed_wait_cancelable (private=<optimized out>, reltime=0x7f5d77a64060, expected=0, futex_word=0x555946394998) at ../sysdeps/unix/sysv/linux/futex-internal.h:142
  7    Thread 0x7f5d77655700 (LWP 27870) "dotnet" 0x00007f5d82e856d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f5d60000b80) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  8    Thread 0x7f5d77450700 (LWP 27871) "dotnet" 0x00007f5d82e82ed9 in futex_reltimed_wait_cancelable (private=<optimized out>, reltime=0x7f5d7744f6a0, expected=0, futex_word=0x7f5d7744f7c8) at ../sysdeps/unix/sysv/linux/futex-internal.h:142
  9    Thread 0x7f5d77244700 (LWP 27872) "dotnet" 0x00007f5d82e856d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f5d58000b80) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  10   Thread 0x7f5d7703f700 (LWP 27873) "dotnet" 0x00007f5d82e856d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f5d5c000b80) at ../sysdeps/unix/sysv/linux/futex-internal.h:205

Thread 10 (Thread 0x7f5d7703f700 (LWP 27873)):
#0  0x00007f5d82e856d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f5d5c000b80) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x7f5d5c000b80, abstime=0x0) at sem_waitcommon.c:111
#2  0x00007f5d82e857c8 in __new_sem_wait_slow (sem=0x7f5d5c000b80, abstime=0x0) at sem_waitcommon.c:181
#3  0x00007f5d81779148 in mono_os_sem_wait (sem=0x7f5d5c000b80, flags=MONO_SEM_FLAGS_NONE) at /__w/1/s/src/mono/mono/utils/mono-os-semaphore.h:203
#4  mono_thread_info_wait_for_resume (info=<optimized out>) at /__w/1/s/src/mono/mono/utils/mono-threads.c:232
#5  0x00007f5d8177ec00 in mono_threads_state_poll_with_info (info=0x7f5d5c000b20) at /__w/1/s/src/mono/mono/utils/mono-threads-coop.c:149
#6  0x00007f5d8177edce in mono_threads_enter_gc_safe_region_unbalanced_with_info (info=0x7f5d5c000b20, stackdata=<optimized out>) at /__w/1/s/src/mono/mono/utils/mono-threads-coop.c:328
#7  0x00007f5d8163a5e5 in mono_coop_mutex_lock (mutex=0x5559463b8570) at /__w/1/s/src/mono/mono/utils/mono-coop-mutex.h:55
#8  mono_domain_lock (domain=0x5559463b8570) at /__w/1/s/src/mono/mono/metadata/domain.c:2038
#9  0x00007f5d816afd0f in mono_method_add_generic_virtual_invocation (domain=0x5559463b8570, vtable=0x7f5d5c03f868, vtable_slot=0x7f5d5c03f850, method=0x7f5d680cbf00, code=0x41626d10) at /__w/1/s/src/mono/mono/metadata/object.c:1827
#10 0x00007f5d815801aa in common_call_trampoline (regs=<optimized out>, code=<optimized out>, m=0x7f5d580097e0, vt=<optimized out>, vtable_slot=0x7f5d5c03f850, error=<optimized out>) at /__w/1/s/src/mono/mono/mini/mini-trampolines.c:655
#11 0x00007f5d81580e39 in mono_vcall_trampoline (regs=0x7f5d7703d998, code=0x414f4ddc "H\211E\320\351S", slot=-3, tramp=<optimized out>) at /__w/1/s/src/mono/mono/mini/mini-trampolines.c:856
#12 0x00000000413b9293 in ?? ()
#13 0x00007f5d5c002585 in ?? ()
#14 0x00007f5d80f9bb88 in ?? ()
#15 0x00007f5d7703db60 in ?? ()
#16 0x00007f5d7703d998 in ?? ()
#17 0x00007f5d5c001d20 in ?? ()
#18 0x00007f5d5c03f868 in ?? ()
#19 0x00007f5d80ed4498 in ?? ()
#20 0x00007f5d80006f77 in ?? ()
#21 0x00007f5d80f7a838 in ?? ()
#22 0x00007f5d7703db60 in ?? ()
#23 0x00007f5d7703dbc0 in ?? ()
#24 0x0000000000000002 in ?? ()
#25 0x00007f5d80febb48 in ?? ()
#26 0x00000000000007b4 in ?? ()
#27 0x0000000000000001 in ?? ()
#28 0x00007f5d680cbf00 in ?? ()
#29 0x00007f5d8164723f in mono_class_is_assignable_from_internal (klass=<optimized out>, oklass=<optimized out>) at /__w/1/s/src/mono/mono/metadata/class.c:3631
#30 0x0000000000000000 in ?? ()

Thread 9 (Thread 0x7f5d77244700 (LWP 27872)):
#0  0x00007f5d82e856d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f5d58000b80) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x7f5d58000b80, abstime=0x0) at sem_waitcommon.c:111
#2  0x00007f5d82e857c8 in __new_sem_wait_slow (sem=0x7f5d58000b80, abstime=0x0) at sem_waitcommon.c:181
#3  0x00007f5d81779148 in mono_os_sem_wait (sem=0x7f5d58000b80, flags=MONO_SEM_FLAGS_NONE) at /__w/1/s/src/mono/mono/utils/mono-os-semaphore.h:203
#4  mono_thread_info_wait_for_resume (info=<optimized out>) at /__w/1/s/src/mono/mono/utils/mono-threads.c:232
#5  0x00007f5d8177eefe in mono_threads_exit_gc_safe_region_unbalanced_internal (cookie=0x7f5d58000b20, stackdata=<optimized out>) at /__w/1/s/src/mono/mono/utils/mono-threads-coop.c:391
#6  0x00007f5d8163a5ff in mono_coop_mutex_lock (mutex=0x5559463b8570) at /__w/1/s/src/mono/mono/utils/mono-coop-mutex.h:59
#7  mono_domain_lock (domain=0x5559463b8570) at /__w/1/s/src/mono/mono/metadata/domain.c:2038
#8  0x00007f5d8163a7dd in mono_domain_code_reserve (domain=0x5559463b8570, size=1792) at /__w/1/s/src/mono/mono/metadata/domain.c:1434
#9  0x00007f5d814e610d in mono_codegen (cfg=0x7f5d5819ee20) at /__w/1/s/src/mono/mono/mini/mini.c:2244
#10 0x00007f5d814e84e7 in mini_method_compile (method=<optimized out>, opts=<optimized out>, domain=<optimized out>, flags=<optimized out>, parts=<optimized out>, aot_method_index=<optimized out>) at /__w/1/s/src/mono/mono/mini/mini.c:3866
#11 0x00007f5d814e96eb in mono_jit_compile_method_inner (method=0x7f5d58083a90, target_domain=0x5559463b8570, opt=393, error=0x7f5d7723fca8) at /__w/1/s/src/mono/mono/mini/mini.c:4060
#12 0x00007f5d814ed046 in mono_jit_compile_method_with_opt (method=0x7f5d58083a90, opt=<optimized out>, jit_only=0, error=0x7f5d7723fca8) at /__w/1/s/src/mono/mono/mini/mini-runtime.c:2448
#13 0x00007f5d815800fa in common_call_trampoline (regs=<optimized out>, code=<optimized out>, m=0x7f5d58083a90, vt=<optimized out>, vtable_slot=<optimized out>, error=<optimized out>) at /__w/1/s/src/mono/mono/mini/mini-trampolines.c:635
#14 0x00007f5d81580e39 in mono_vcall_trampoline (regs=0x7f5d7723fd58, code=0x41656d4c "L\213\360H\205\300\017\205)", slot=8, tramp=<optimized out>) at /__w/1/s/src/mono/mono/mini/mini-trampolines.c:856
#15 0x00000000413b9293 in ?? ()
#16 0x00007f5d77240a1d in ?? ()
#17 0x00007f5d80006e85 in ?? ()
#18 0x00007f5d7723ff20 in ?? ()
#19 0x00007f5d7723fd58 in ?? ()
#20 0x00007f5d58001d20 in ?? ()
#21 0x00007f5d5c0deca0 in ?? ()
#22 0x0000000000000001 in ?? ()
#23 0x0000000000000001 in ?? ()
#24 0x00007f5d80c01b40 in ?? ()
#25 0x00007f5d7723ff20 in ?? ()
#26 0x00007f5d7723fff0 in ?? ()
#27 0x00007f5d80c84d20 in ?? ()
#28 0x00007f5d80fd0790 in ?? ()
#29 0x0000000000000000 in ?? ()

Thread 8 (Thread 0x7f5d77450700 (LWP 27871)):
#0  0x00007f5d82e82ed9 in futex_reltimed_wait_cancelable (private=<optimized out>, reltime=0x7f5d7744f6a0, expected=0, futex_word=0x7f5d7744f7c8) at ../sysdeps/unix/sysv/linux/futex-internal.h:142
#1  __pthread_cond_wait_common (abstime=0x7f5d7744f758, mutex=0x7f5d680209c0, cond=0x7f5d7744f7a0) at pthread_cond_wait.c:533
#2  __pthread_cond_timedwait (cond=0x7f5d7744f7a0, mutex=0x7f5d680209c0, abstime=0x7f5d7744f758) at pthread_cond_wait.c:667
#3  0x00007f5d81770d64 in mono_os_cond_timedwait (cond=0x7f5d7744f7a0, mutex=0x7f5d680209c0, timeout_ms=20000) at /__w/1/s/src/mono/mono/utils/mono-os-mutex.c:75
#4  0x00007f5d817772b4 in mono_coop_cond_timedwait (cond=0x7f5d7744f7a0, mutex=<optimized out>, timeout_ms=20000) at /__w/1/s/src/mono/mono/utils/mono-coop-mutex.h:103
#5  mono_lifo_semaphore_timed_wait (semaphore=0x7f5d680209c0, timeout_ms=20000) at /__w/1/s/src/mono/mono/utils/lifo-semaphore.c:48
#6  0x00000000415f9e16 in ?? ()
#7  0x00007f5d80f86518 in ?? ()
#8  0x0000000000000046 in ?? ()
#9  0x0000000000000002 in ?? ()
#10 0x00007f5d80f82d58 in ?? ()
#11 0x0000000000004e20 in ?? ()
#12 0x00007f5d64002580 in ?? ()
#13 0x00007f5d7744fbe0 in ?? ()
#14 0x00007f5d7744f820 in ?? ()
#15 0x00007f5d7744fbe0 in ?? ()
#16 0x00000000415f9d78 in ?? ()
#17 0x00007f5d80f82d58 in ?? ()
#18 0x0000000000004e20 in ?? ()
#19 0x00007f5d7744fbe0 in ?? ()
#20 0x00000000415f9a1c in ?? ()
#21 0x00007f5d80f82d58 in ?? ()
#22 0x0000000000004e20 in ?? ()
#23 0x00007f5d64000b20 in ?? ()
#24 0x0000000000000002 in ?? ()
#25 0x00007f5d80f86518 in ?? ()
#26 0x0000000000000044 in ?? ()
#27 0x0000000000000000 in ?? ()

Thread 7 (Thread 0x7f5d77655700 (LWP 27870)):
#0  0x00007f5d82e856d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f5d60000b80) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x7f5d60000b80, abstime=0x0) at sem_waitcommon.c:111
#2  0x00007f5d82e857c8 in __new_sem_wait_slow (sem=0x7f5d60000b80, abstime=0x0) at sem_waitcommon.c:181
#3  0x00007f5d81779148 in mono_os_sem_wait (sem=0x7f5d60000b80, flags=MONO_SEM_FLAGS_NONE) at /__w/1/s/src/mono/mono/utils/mono-os-semaphore.h:203
#4  mono_thread_info_wait_for_resume (info=<optimized out>) at /__w/1/s/src/mono/mono/utils/mono-threads.c:232
#5  0x00007f5d8177eefe in mono_threads_exit_gc_safe_region_unbalanced_internal (cookie=0x7f5d60000b20, stackdata=<optimized out>) at /__w/1/s/src/mono/mono/utils/mono-threads-coop.c:391
#6  0x00007f5d8177bb3d in mono_coop_cond_timedwait (cond=<optimized out>, mutex=<optimized out>, timeout_ms=500) at /__w/1/s/src/mono/mono/utils/mono-coop-mutex.h:105
#7  sleep_interruptable (ms=500, alerted=<optimized out>) at /__w/1/s/src/mono/mono/utils/mono-threads.c:1640
#8  mono_thread_info_sleep (ms=500, alerted=<optimized out>) at /__w/1/s/src/mono/mono/utils/mono-threads.c:1672
#9  0x00007f5d816c172e in mono_sleep_internal (ms=<optimized out>, error=<optimized out>, allow_interruption=<optimized out>) at /__w/1/s/src/mono/mono/metadata/threads.c:1806
#10 ves_icall_System_Threading_Thread_Sleep_internal (ms=500, allow_interruption=1 '\001', error=<optimized out>) at /__w/1/s/src/mono/mono/metadata/threads.c:1839
#11 0x00007f5d81673cf8 in ves_icall_System_Threading_Thread_Sleep_internal_raw (a0=500, a1=1 '\001') at /__w/1/s/src/mono/mono/metadata/icall-def-netcore.h:510
#12 0x00000000415f0746 in ?? ()
#13 0x00007f5d80f83030 in ?? ()
#14 0x00007f5d77654e68 in ?? ()
#15 0x0000555946cd8928 in ?? ()
#16 0x00007f5d80f83030 in ?? ()
#17 0x00000000000001f4 in ?? ()
#18 0x00007f5d60002580 in ?? ()
#19 0x00007f5d77654be0 in ?? ()
#20 0x00007f5d77654ae0 in ?? ()
#21 0x00007f5d77654be0 in ?? ()
#22 0x00000000415f0628 in ?? ()
#23 0x00007f5d80f830b0 in ?? ()
#24 0x00005559463f8468 in ?? ()
#25 0x00007f5d77654be0 in ?? ()
#26 0x00000000415f05f0 in ?? ()
#27 0x00007f5d77654be0 in ?? ()
#28 0x00000000415ef774 in ?? ()
#29 0x0000000000000000 in ?? ()

Thread 6 (Thread 0x7f5d77a65700 (LWP 27869)):
#0  0x00007f5d82e82ed9 in futex_reltimed_wait_cancelable (private=<optimized out>, reltime=0x7f5d77a64060, expected=0, futex_word=0x555946394998) at ../sysdeps/unix/sysv/linux/futex-internal.h:142
#1  __pthread_cond_wait_common (abstime=0x7f5d77a64118, mutex=0x555946394948, cond=0x555946394970) at pthread_cond_wait.c:533
#2  __pthread_cond_timedwait (cond=0x555946394970, mutex=0x555946394948, abstime=0x7f5d77a64118) at pthread_cond_wait.c:667
#3  0x00007f5d81770d64 in mono_os_cond_timedwait (cond=0x555946394970, mutex=0x555946394948, timeout_ms=12000) at /__w/1/s/src/mono/mono/utils/mono-os-mutex.c:75
#4  0x00007f5d816dfd04 in mono_coop_cond_timedwait (cond=<optimized out>, mutex=<optimized out>, timeout_ms=12000) at /__w/1/s/src/mono/mono/utils/mono-coop-mutex.h:103
#5  mono_w32handle_timedwait_signal_naked (cond=<optimized out>, mutex=<optimized out>, timeout=12000, poll=0, alerted=0x7f5d77a641bc) at /__w/1/s/src/mono/mono/metadata/w32handle.c:652
#6  mono_w32handle_timedwait_signal_handle (handle_data=<optimized out>, timeout=12000, poll=0, alerted=0x7f5d77a641bc) at /__w/1/s/src/mono/mono/metadata/w32handle.c:767
#7  0x00007f5d816dfbc7 in mono_w32handle_wait_one (handle=<optimized out>, timeout=12000, alertable=<optimized out>) at /__w/1/s/src/mono/mono/metadata/w32handle.c:882
#8  0x00007f5d816dfe24 in mono_w32handle_wait_multiple (handles=0x7f5d77a648e0, nhandles=1, waitall=<optimized out>, timeout=12000, alertable=1, error=0x2) at /__w/1/s/src/mono/mono/metadata/w32handle.c:1002
#9  0x00007f5d816c269f in ves_icall_System_Threading_WaitHandle_Wait_internal (handles=0x7f5d77a648e0, numhandles=1, waitall=0 '\000', timeout=12000, error=0x7f5d77a64808) at /__w/1/s/src/mono/mono/metadata/threads.c:2314
#10 0x00007f5d81673ebc in ves_icall_System_Threading_WaitHandle_Wait_internal_raw (a0=0x7f5d77a648e0, a1=1, a2=0 '\000', a3=12000) at /__w/1/s/src/mono/mono/metadata/icall-def-netcore.h:516
#11 0x00000000415a4bd2 in ?? ()
#12 0x00007f5d80f82e08 in ?? ()
#13 0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7f5d77860700 (LWP 27868)):
#0  0x00007f5d82e8723a in __waitpid (pid=27875, stat_loc=0x7f5d7785d2bc, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
#1  0x00007f5d815d7f01 in dump_native_stacktrace (signal=<optimized out>, mctx=<optimized out>) at /__w/1/s/src/mono/mono/mini/mini-posix.c:1113
#2  mono_dump_native_crash_info (signal=<optimized out>, mctx=0x7f5d7785ddd0, info=<optimized out>) at /__w/1/s/src/mono/mono/mini/mini-posix.c:1157
#3  0x00007f5d8157d7b5 in mono_handle_native_crash (signal=0x7f5d817f5b83 "SIGABRT", mctx=0x7f5d7785ddd0, info=0x7f5d7785e0b0) at /__w/1/s/src/mono/mono/mini/mini-exceptions.c:3427
#4  0x00007f5d815d7355 in sigabrt_signal_handler (_dummy=6, _info=0x7f5d7785e0b0, context=0x7f5d7785df80) at /__w/1/s/src/mono/mono/mini/mini-posix.c:234
#5  <signal handler called>
#6  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#7  0x00007f5d81f81801 in __GI_abort () at abort.c:79
#8  0x00007f5d81786b35 in monoeg_assert_abort () at /__w/1/s/src/mono/mono/eglib/goutput.c:57
#9  0x00007f5d8176acd0 in mono_log_write_logfile (log_domain=<optimized out>, level=<optimized out>, hdr=<optimized out>, message=0x7f5d68036990 "mono_threads_pthread_kill: pthread_kill failed with error 11 - potential kernel OOM or signal queue overflow") at /__w/1/s/src/mono/mono/utils/mono-log-common.c:136
#10 0x00007f5d81786f25 in monoeg_g_logstr (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, msg=0x0) at /__w/1/s/src/mono/mono/eglib/goutput.c:151
#11 monoeg_g_logv_nofree (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, format=<optimized out>, args=<optimized out>) at /__w/1/s/src/mono/mono/eglib/goutput.c:166
#12 0x00007f5d81786fbe in monoeg_g_logv (log_domain=0x2 <error: Cannot access memory at address 0x2>, log_level=2005265504, format=0x0, args=0x7f5d7785f3a0) at /__w/1/s/src/mono/mono/eglib/goutput.c:173
#13 monoeg_g_log (log_domain=0x2 <error: Cannot access memory at address 0x2>, log_level=2005265504, format=0x0) at /__w/1/s/src/mono/mono/eglib/goutput.c:182
#14 0x00007f5d8177e313 in mono_threads_pthread_kill (info=0x7f5d64000b20, signum=<optimized out>) at /__w/1/s/src/mono/mono/utils/mono-threads-posix.c:209
#15 mono_threads_suspend_begin_async_suspend (info=0x7f5d64000b20, interrupt_kernel=<optimized out>) at /__w/1/s/src/mono/mono/utils/mono-threads-posix.c:358
#16 0x00007f5d8177ac82 in mono_thread_info_begin_suspend (info=0x7f5d64000b20, phase=<optimized out>) at /__w/1/s/src/mono/mono/utils/atomic.h:261
#17 0x00007f5d8171b88e in sgen_unified_suspend_stop_world () at /__w/1/s/src/mono/mono/metadata/sgen-stw.c:288
#18 sgen_client_stop_world (generation=0, serial_collection=0) at /__w/1/s/src/mono/mono/metadata/sgen-stw.c:124
#19 0x00007f5d8172e92d in sgen_stop_world (generation=0, serial_collection=0) at /__w/1/s/src/mono/mono/sgen/sgen-gc.c:3869
#20 0x00007f5d8172b66f in sgen_perform_collection_inner (requested_size=<optimized out>, generation_to_collect=<optimized out>, reason=<optimized out>, forced_serial=<optimized out>, stw=<optimized out>) at /__w/1/s/src/mono/mono/sgen/sgen-gc.c:2566
#21 sgen_perform_collection (requested_size=4096, generation_to_collect=0, reason=0x7f5d817eda16 "Nursery full", forced_serial=0, stw=1) at /__w/1/s/src/mono/mono/sgen/sgen-gc.c:2679
#22 0x00007f5d8172b5d3 in sgen_ensure_free_space (size=4096, generation=<optimized out>) at /__w/1/s/src/mono/mono/sgen/sgen-gc.c:2545
#23 0x00007f5d81720f6e in sgen_alloc_obj_nolock (vtable=0x555946ae0c70, size=128) at /__w/1/s/src/mono/mono/sgen/sgen-alloc.c:279
#24 0x00007f5d817213e9 in sgen_alloc_obj (vtable=0x555946ae0c70, size=128) at /__w/1/s/src/mono/mono/sgen/sgen-alloc.c:454
#25 0x00007f5d8171d3c6 in mono_gc_alloc_obj (vtable=0x2, size=140039413951584) at /__w/1/s/src/mono/mono/metadata/sgen-mono.c:921
#26 0x00000000414f0f87 in ?? ()
#27 0x00007f5d80ffffc0 in ?? ()
#28 0x00007f5d68018d50 in ?? ()
#29 0x00007f5d680186e0 in ?? ()
#30 0x0000555946ae0c70 in ?? ()
#31 0x0000000000000080 in ?? ()
#32 0x00007f5d6803d800 in ?? ()
#33 0x00007f5d81000040 in ?? ()
#34 0x00007f5d7785f730 in ?? ()
#35 0x00007f5d80fd2308 in ?? ()
#36 0x00000000414c4d24 in ?? ()
#37 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7f5d77b0c700 (LWP 27865)):
#0  0x00007f5d82e86384 in __libc_read (fd=5, buf=0x7f5d77b0bee7, nbytes=1) at ../sysdeps/unix/sysv/linux/read.c:27
#1  0x00007f5d7eb5fd5e in SignalHandlerLoop (arg=0x555946ac6e20) at /__w/1/s/src/libraries/Native/Unix/System.Native/pal_signal.c:90
#2  0x00007f5d82e7c6db in start_thread (arg=0x7f5d77b0c700) at pthread_create.c:463
#3  0x00007f5d8206288f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f5d7f13b700 (LWP 27864)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f5d82e7f023 in __GI___pthread_mutex_lock (mutex=0x7f5d81a82490 <sgen_gc_mutex>) at ../nptl/pthread_mutex_lock.c:78
#2  0x00007f5d8172bfb9 in mono_os_mutex_lock (mutex=<optimized out>) at /__w/1/s/src/mono/mono/utils/mono-os-mutex.h:105
#3  mono_coop_mutex_lock (mutex=<optimized out>) at /__w/1/s/src/mono/mono/utils/mono-coop-mutex.h:57
#4  sgen_gc_lock () at /__w/1/s/src/mono/mono/sgen/sgen-gc.c:3811
#5  0x00007f5d8171f5a8 in mono_gc_skip_thread_changing (skip=0) at /__w/1/s/src/mono/mono/metadata/sgen-mono.c:2201
#6  0x00007f5d8177a308 in mono_thread_info_set_flags (flags=MONO_THREAD_INFO_FLAGS_NONE) at /__w/1/s/src/mono/mono/utils/mono-threads.c:814
#7  0x00007f5d8170de0e in finalizer_thread (unused=<optimized out>) at /__w/1/s/src/mono/mono/metadata/gc.c:969
#8  0x00007f5d816c7b0a in start_wrapper_internal (start_info=0x0, stack_ptr=<optimized out>) at /__w/1/s/src/mono/mono/metadata/threads.c:1233
#9  0x00007f5d816c7989 in start_wrapper (data=0x555946406350) at /__w/1/s/src/mono/mono/metadata/threads.c:1308
#10 0x00007f5d82e7c6db in start_thread (arg=0x7f5d7f13b700) at pthread_create.c:463
#11 0x00007f5d8206288f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f5d80bff700 (LWP 27863)):
#0  0x00007f5d82e829f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f5d81a7c798 <work_cond+40>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7f5d81a7c748 <lock>, cond=0x7f5d81a7c770 <work_cond>) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7f5d81a7c770 <work_cond>, mutex=0x7f5d81a7c748 <lock>) at pthread_cond_wait.c:655
#3  0x00007f5d8175fa53 in mono_os_cond_wait (cond=<optimized out>, mutex=<optimized out>) at /__w/1/s/src/mono/mono/utils/mono-os-mutex.h:219
#4  get_work (worker_index=<optimized out>, work_context=<optimized out>, do_idle=<optimized out>, job=<optimized out>) at /__w/1/s/src/mono/mono/sgen/sgen-thread-pool.c:165
#5  thread_func (data=0x0) at /__w/1/s/src/mono/mono/sgen/sgen-thread-pool.c:196
#6  0x00007f5d82e7c6db in start_thread (arg=0x7f5d80bff700) at pthread_create.c:463
#7  0x00007f5d8206288f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f5d832a8740 (LWP 27862)):
#0  0x00007f5d82e829f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555946394848) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5559463947f8, cond=0x555946394820) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x555946394820, mutex=0x5559463947f8) at pthread_cond_wait.c:655
#3  0x00007f5d81770d80 in mono_os_cond_wait (cond=0x555946394820, mutex=0x5559463947f8) at /__w/1/s/src/mono/mono/utils/mono-os-mutex.h:219
#4  mono_os_cond_timedwait (cond=0x555946394820, mutex=0x5559463947f8, timeout_ms=0) at /__w/1/s/src/mono/mono/utils/mono-os-mutex.c:32
#5  0x00007f5d816dfd04 in mono_coop_cond_timedwait (cond=<optimized out>, mutex=<optimized out>, timeout_ms=4294967295) at /__w/1/s/src/mono/mono/utils/mono-coop-mutex.h:103
#6  mono_w32handle_timedwait_signal_naked (cond=<optimized out>, mutex=<optimized out>, timeout=4294967295, poll=0, alerted=0x7ffeb6cbe9bc) at /__w/1/s/src/mono/mono/metadata/w32handle.c:652
#7  mono_w32handle_timedwait_signal_handle (handle_data=<optimized out>, timeout=4294967295, poll=0, alerted=0x7ffeb6cbe9bc) at /__w/1/s/src/mono/mono/metadata/w32handle.c:767
#8  0x00007f5d816dfbc7 in mono_w32handle_wait_one (handle=<optimized out>, timeout=4294967295, alertable=<optimized out>) at /__w/1/s/src/mono/mono/metadata/w32handle.c:882
#9  0x00007f5d816dfe24 in mono_w32handle_wait_multiple (handles=0x7ffeb6cbf0e0, nhandles=1, waitall=<optimized out>, timeout=4294967295, alertable=1, error=0x0) at /__w/1/s/src/mono/mono/metadata/w32handle.c:1002
#10 0x00007f5d816c269f in ves_icall_System_Threading_WaitHandle_Wait_internal (handles=0x7ffeb6cbf0e0, numhandles=1, waitall=0 '\000', timeout=-1, error=0x7ffeb6cbf008) at /__w/1/s/src/mono/mono/metadata/threads.c:2314
#11 0x00007f5d81673ebc in ves_icall_System_Threading_WaitHandle_Wait_internal_raw (a0=0x7ffeb6cbf0e0, a1=1, a2=0 '\000', a3=-1) at /__w/1/s/src/mono/mono/metadata/icall-def-netcore.h:516
#12 0x00000000415a4bd2 in ?? ()
#13 0x00007f5d80ee7b10 in ?? ()
#14 0x0000000000000000 in ?? ()

=================================================================
	Basic Fault Address Reporting
=================================================================
Memory around native instruction pointer (0x7f5d81f7fe97):0x7f5d81f7fe87  d2 4c 89 ce bf 02 00 00 00 b8 0e 00 00 00 0f 05  .L..............
0x7f5d81f7fe97  48 8b 8c 24 08 01 00 00 64 48 33 0c 25 28 00 00  H..$....dH3.%(..
0x7f5d81f7fea7  00 44 89 c0 75 1f 48 81 c4 18 01 00 00 c3 0f 1f  .D..u.H.........
0x7f5d81f7feb7  00 48 8b 15 a9 bf 3a 00 f7 d8 41 b8 ff ff ff ff  .H....:...A.....

=================================================================
	Managed Stacktrace:
=================================================================
	  at <unknown> <0xffffffff>
	  at System.Object:__icall_wrapper_mono_gc_alloc_obj <0x00076>
	  at System.Object:AllocSmall <0x00103>
	  at Xunit.DelegatingExecutionSummarySink:OnMessageWithTypes <0x0032b>
	  at Xunit.DelegatingXmlCreationSink:OnMessageWithTypes <0x00054>
	  at Xunit.DelegatingLongRunningTestDetectionSink:OnMessageWithTypes <0x00071>
	  at Xunit.MessageSinkAdapter:OnMessageWithTypes <0x0004c>
	  at Xunit.OptimizedRemoteMessageSink:OnMessage <0x0005d>
	  at Xunit.Sdk.MessageBus:DispatchMessages <0x00094>
	  at Xunit.Sdk.MessageBus:ReporterWorker <0x0005f>
	  at <>c:<QueueUserWorkItem>b__5_0 <0x0006b>
	  at System.Threading.Tasks.Task:InnerInvoke <0x000a2>
	  at <>c:<.cctor>b__274_0 <0x0002f>
	  at System.Threading.ExecutionContext:RunInternal <0x00101>
	  at System.Threading.Tasks.Task:ExecuteWithThreadLocal <0x00283>
	  at System.Threading.Tasks.Task:ExecuteEntryUnsafe <0x00097>
	  at <>c:<.cctor>b__10_0 <0x0005b>
	  at System.Threading.Thread:StartCallback <0x00109>
	  at System.Object:runtime_invoke_void__this__ <0x00091>
=================================================================
./RunTests.sh: line 161: 27862 Aborted                 "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Composition.AttributeModel.Tests.runtimeconfig.json --depsfile System.Composition.AttributeModel.Tests.deps.json xunit.console.dll System.Composition.AttributeModel.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing -notrait category=nonnetcoreapptests -notrait category=nonlinuxtests $RSP_FILE
~/work/A8F20901/w/8E960823/e
----- end Sat Feb 15 08:36:14 UTC 2020 ----- exit code 134 ----------------------------------------------------------
exit code 134 means SIGABRT Abort. Managed or native assert, or runtime check such as heap corruption, caused call to abort(). Core dumped.
Looking around for any Linux dump...
... found no dump in /home/helixbot/work/A8F20901/w/8E960823/e
@lambdageek (Member)

This is happening during a GC STW (stop-the-world) when Mono is trying to suspend a thread. Evidently pthread_kill is returning EAGAIN (11), which is an indication that the kernel's global realtime signal queue has overflowed. It's not clear whether this is a transient condition that we could retry.

In mono/mono@e301847 we added the ability to abort a preemptive suspend request (at which point the GC STW loop comes back around and tries again), so we could try to recover from this EAGAIN condition on Linux.

What do you think @vargaz @lateralusX ?

Additional reference: here's how Boehm picks the suspend signal https://github.com/ivmai/bdwgc/blob/b44088c81ce1e864b4c1d05a523c0150beeb7b8b/include/private/gc_priv.h#L2635

@lambdageek removed the untriaged (New issue has not been triaged by the area owner) label Feb 27, 2020
@lateralusX (Member)

On Windows we follow these semantics (I believe macOS does something similar): if suspend (or GetThreadContext) fails, the thread is left running, mono_threads_suspend_begin_async_suspend returns FALSE, and the state machine is reset to a state reflecting a normally running thread (fixed by mono/mono@e301847). On Linux, as you said, you currently get an error that takes down the process before any of that can happen, i.e. before the thread is reset to the running state and FALSE is returned to the caller of mono_threads_suspend_begin_async_suspend.

When STW is driving the suspend, returning FALSE marks that thread as skipped. So even if the thread state machine correctly reflects the thread's current state (preventing future asserts), I believe our STW implementation does not fully support that scenario: STW mainly views skipped threads as threads that are shutting down or detaching from the runtime, and does not expect them to continue running managed code. The fix that resets the state machine on failure had a different use case: there, threads calling suspend/resume on other threads could trigger this failure, and the next GC would then hit an assert in the state machine, even though the thread was running and could have been suspended successfully by STW.

So if this is hit during a normal STW, I believe the current machinery won't trigger a retry when mono_threads_suspend_begin_async_suspend returns FALSE. If we are going to rely on a retry, that needs to be fixed in STW.

I know we had sporadic failures calling SuspendThread that, on some downstream Mono implementations, were resolved with a retry loop (with a limited number of retries); if all retries were exhausted, it would still return FALSE and let the caller handle it. Maybe we could do something similar on Linux; there you also have a specific error code that indicates a retry is worthwhile, but it still makes sense to cap the number of retries. A bounded retry might look like the sketch below.
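A bounded retry on Linux might look something like the following sketch. This is only an illustration of the idea, not the actual Mono change; the helper name, retry count, and sleep interval are placeholders.

#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical sketch: retry pthread_kill only on EAGAIN (queued-signal limit
   reached) and give up after a few attempts so the caller can still handle
   the failure. Not the actual Mono implementation. */
static int
pthread_kill_with_retry (pthread_t tid, int suspend_signal)
{
    const int max_retries = 10;
    for (int attempt = 0; attempt < max_retries; attempt++) {
        int err = pthread_kill (tid, suspend_signal);
        if (err != EAGAIN)
            return err;      /* 0 on success, or a non-transient error */
        usleep (10 * 1000);  /* back off briefly so the queue can drain */
    }
    return EAGAIN;           /* retries exhausted; let the caller decide */
}

The caller still has to treat a final EAGAIN as a suspend failure, which is why the STW machinery also needs to understand the result rather than just receiving FALSE.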

I also believe we should look into fixing STW to better distinguish failures to suspend a thread from threads that are shutting down or detaching from the runtime; otherwise we will always run the risk of exiting STW with threads still running because of suspend failures.

@ericstj (Member)

ericstj commented Mar 21, 2020

Failing in #33816

Saw this in different tests upon rerun:
Microsoft.Extensions.FileProviders.Physical.Tests
System.Xml.XPath.Tests
Microsoft.Extensions.DependencyInjection.Tests
System.Xml.RW.ReaderSettings.Tests

#33915
Microsoft.CSharp.Tests
System.Net.Ping.Functional.Tests
System.Xml.RW.XmlReader.Tests

#33917
System.Xml.RW.XmlReader.Tests

#33852
System.Text.Encoding.Tests

Rolling build https://dev.azure.com/dnceng/public/_build/results?buildId=568429
System.Net.Primitives.Functional.Tests
System.Resources.ResourceManager.Tests
System.Security.Cryptography.Algorithms.Tests
System.Xml.XmlSchema.XmlSchemaValidatorApi.Tests

I just sampled the top failing builds for dotnet/runtime and all of them had the mono leg failing with this error. It seems like we need to have someone look at this soon or disable the mono test leg. @marek-safar

@jaredpar, can you run your tool (or remind me how) to get the complete data on this?

All failing tests have console output with mono_threads_pthread_kill: pthread_kill failed with error 11 - potential kernel OOM or signal queue overflow

@jaredpar (Member)

jaredpar commented Mar 21, 2020

Here are the failures over the last 100 builds

Build Kind Console Log
568318 PR #33911 console.log
568429 Rolling console.log
568083 PR #33816 console.log
568583 PR #33915 console.log
568564 Rolling console.log
568452 PR #33852 console.log
568546 PR #33917 console.log
568335 PR #33674 console.log
568356 PR #33883 console.log
568322 Rolling console.log
568285 PR #33910 console.log
568331 PR #33589 console.log
567761 PR #33892 console.log
568102 PR #33899 console.log
568081 Rolling console.log
567048 PR #33716 console.log
568160 PR #33899 console.log
568064 Rolling console.log
567899 PR #33816 console.log
568200 PR #33704 console.log
566817 PR #33850 console.log
568158 PR #33905 console.log
567968 PR #33848 console.log
567720 PR #33890 console.log
567750 PR #33834 console.log
566811 PR #33849 console.log
567962 PR #33886 console.log
568000 PR #33889 console.log
567723 PR #33701 console.log
567777 PR #33868 console.log
567842 PR #33896 console.log
567923 Rolling console.log
567658 PR #33589 console.log
567857 PR #33869 console.log
567074 PR #33857 console.log
567199 PR #33444 console.log
567026 PR #33855 console.log
567452 PR #33841 console.log
567702 PR #33889 console.log
567349 PR #33827 console.log
567642 PR #33459 console.log
567233 Rolling console.log
567449 PR #33874 console.log
567236 PR #33869 console.log
567326 PR #33871 console.log
567175 PR #33834 console.log
567171 PR #33827 console.log
567150 PR #33819 console.log
567106 PR #33860 console.log
567099 Rolling console.log
567117 PR #33849 console.log
566861 PR #33851 console.log
566950 PR #33852 console.log

Command

runfo search-helix -d runtime -c 100 -pr -v "mono_threads_pthread_kill: pthread_kill failed with error 11"

@jaredpar added the blocking-clean-ci (Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms') label Mar 21, 2020
@am11 (Member)

am11 commented Mar 23, 2020

Seems like we need to have someone looking at this soon or disable the mono test leg.

Could the default action be to disable the failing tests and open a draft PR re-enabling them, to which someone can be assigned to push fix commits? That way only one PR would be red instead of all of them.

@marek-safar (Contributor)

@lambdageek could you please take care of this?

@lambdageek (Member)

lambdageek commented Mar 23, 2020

OK, we're trying to fix this with #33966.

The issue is:

  1. Each thread on Linux has a queue of "realtime" (signal number >= SIGRTMIN) signals.
  2. Something is overflowing the queue (the sketch after this list shows how the limit can be hit).
  3. The failure happens when we use one of those signals to suspend threads when the GC needs to stop the world.
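As a standalone illustration of item 2 (this is not Mono code, just a way to see the limit in action): block a realtime signal so deliveries stay pending, then queue it until the kernel refuses with EAGAIN.

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Illustration only: queue a blocked realtime signal to ourselves until the
   kernel reports EAGAIN, i.e. the pending-signal queue limit is reached. */
int main (void)
{
    sigset_t set;
    sigemptyset (&set);
    sigaddset (&set, SIGRTMIN);
    sigprocmask (SIG_BLOCK, &set, NULL);   /* keep the signals pending */

    long queued = 0;
    union sigval val = { .sival_int = 0 };
    while (sigqueue (getpid (), SIGRTMIN, val) == 0)
        queued++;

    if (errno == EAGAIN)
        printf ("queue full after %ld signals (compare with `ulimit -i`)\n", queued);
    else
        perror ("sigqueue");
    return 0;
}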

How we're addressing the issue:

  1. #33966 ([mono] retry mono_threads_pthread_kill if result == EAGAIN on Linux) will look for the EAGAIN error and sleep and retry instead of failing.
  2. Longer term, we will need a fix that propagates the cause of the suspend failure to the suspend initiator (the GC) and lets it handle it, instead of just failing. The suspend initiator can then be adjusted to try again (this is better than #33966 because the suspend initiator has more context).
  3. Longest term, fully cooperative suspend will address this by not using signals for suspend at all. Right now we're using a hybrid scheme on Linux where threads that are running native P/Invoked code, or that are in a blocking operation, are suspended using pthread_kill. That will not be necessary with full coop. (But full coop requires an embedding API break, which is not something we're doing right now.)

Update:
Another possibility is that the stop-the-world procedure is going haywire and sending out a ton of pthread_kills for some reason. I'm going to look over the console logs from Jared's comment to see if there's evidence of that.

Update 2:
I think I was reading the signal queue details wrong: it's a per-thread queue. So it's more likely that we're flooding it somehow.

monojenkins pushed a commit to monojenkins/mono that referenced this issue Mar 23, 2020
Try to address dotnet/runtime#32377
(signal queue overflow) by sleeping and retrying a few times.

This is kind of a hack, but it would be good to see if it will address the issue
lambdageek added a commit that referenced this issue Mar 24, 2020
…33966)

* [mono] retry mono_threads_pthread_kill if result == EAGAIN on Linux

Try to address #32377
(signal queue overflow) by sleeping and retrying a few times.
lambdageek added a commit to mono/mono that referenced this issue Mar 24, 2020
…19291)

Try to address dotnet/runtime#32377
(signal queue overflow) by sleeping and retrying a few times.

This is kind of a hack, but it would be good to see if it will address the issue

Co-authored-by: lambdageek <lambdageek@users.noreply.github.com>
@jaredpar (Member)

Closing, as the issue looks like it's been addressed. Haven't seen this in recent runs.

@lambdageek (Member)

OK, on a hunch I checked what RLIMIT_SIGPENDING is on the Helix bots, and it's:

ulimit -i is 31805

which is reasonable.

But nonetheless, somehow we're flooding the realtime signal queue. I don't really have any good ideas for how that might be happening. If it's a Mono problem, it doesn't seem like just bubbling up the pthread_kill error to the GC STW loop is a good plan (lots of effort, and it's unclear that it would fix the problem).

I'm going to try running a libraries test suite like System.Net.Sockets.Tests on a local Ubuntu VM to see if I can reproduce this locally.
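For reference, the same limit can be read programmatically; `ulimit -i` reports the soft limit of RLIMIT_SIGPENDING. A minimal sketch:

#include <stdio.h>
#include <sys/resource.h>

/* Minimal sketch: print the pending-signal limit that `ulimit -i` reports. */
int main (void)
{
    struct rlimit rl;
    if (getrlimit (RLIMIT_SIGPENDING, &rl) == 0)
        printf ("RLIMIT_SIGPENDING: soft=%llu hard=%llu\n",
                (unsigned long long) rl.rlim_cur,
                (unsigned long long) rl.rlim_max);
    return 0;
}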

@lambdageek (Member)

lambdageek commented Apr 30, 2020

Looking at the stack trace from Viktor's comment, I see a lot of threads in mono_coop_cond_timedwait, so I did an experiment where I sent a realtime signal to a thread waiting on a condition variable.

Unfortunately, unless I block the signal with pthread_sigmask or sigprocmask, it gets delivered to the thread without overflowing the queue, even while the thread is in the call.

The only uses of pthread_sigmask or sigprocmask in dotnet/runtime/src/mono and dotnet/runtime/src/libraries don't appear to block our GC suspend signal, so I don't think this is the problem.
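A rough reconstruction of that condition-variable experiment (my sketch, not the original test program): a worker parks in pthread_cond_wait, the main thread pthread_kill()s it with SIGRTMIN, and the handler runs on the worker rather than the signal piling up in the queue.

#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int done;
static volatile sig_atomic_t handled;

static void on_rt_signal (int sig) { handled++; }

/* Worker: park in pthread_cond_wait until told to finish. */
static void *waiter (void *arg)
{
    pthread_mutex_lock (&lock);
    while (!done)
        pthread_cond_wait (&cond, &lock);
    pthread_mutex_unlock (&lock);
    return NULL;
}

int main (void)
{
    struct sigaction sa = { 0 };
    sa.sa_handler = on_rt_signal;
    sigemptyset (&sa.sa_mask);
    sigaction (SIGRTMIN, &sa, NULL);

    pthread_t t;
    pthread_create (&t, NULL, waiter, NULL);
    sleep (1);                      /* let the worker reach the wait */

    pthread_kill (t, SIGRTMIN);     /* delivered immediately; handler runs on the worker */
    sleep (1);
    printf ("handler ran %d time(s)\n", (int) handled);

    pthread_mutex_lock (&lock);
    done = 1;
    pthread_cond_signal (&cond);    /* release the worker and exit */
    pthread_mutex_unlock (&lock);
    pthread_join (t, NULL);
    return 0;
}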

There are a few other threads in other system calls that might be suspicious:

  • There's Thread 25, the System.Native SignalHandlerLoop function, which is in a read call. This thread sometimes calls managed callbacks, which means Mono can see it and would try to preemptively suspend it if it were in GC Safe mode.

    So next I'm going to see whether read on a pipe is uninterruptible by realtime signals. If that's the case, we might need to deal with this thread specially (or maybe refactor it in some way).

  • Threads 24 and 23 are in epoll_wait, but that call is explicitly documented to return EINTR, so I would be surprised if the signal handler didn't run there.

  • Thread 3 is in pthread_mutex_lock; I expect this is interruptible, since the condvar example was, and it would be surprising if different pthread primitives behaved differently.

@lambdageek (Member)

Hm: read from a pipe and pthread_mutex_lock are both interruptible. I don't think this line of investigation is going to pan out.

I'll focus on pthread_sigmask: there's a place in pal_process.c where all signals are blocked while the process forks. I wonder if, under load, we bombard this thread with signals while it doesn't get a chance to run.

@lambdageek (Member)

Update:

  1. It doesn't look like the pthread_sigmask in pal_process.c was relevant.
  2. I can still only reproduce this failure by artificially setting ulimit -i 2 (a realtime signal queue limit of 2 events) before running a test suite.

Current plan:

Despite not having a reliable reproduction, the right thing to do is to extend the thread suspend machinery to deal with transient thread suspension failures. This comes up on non-POSIX-signals backends too: for example, the Win32 SuspendThread function can sometimes return -1, in which case we call mono_threads_transition_abort_async_suspend to try to back out of suspending the thread. If that succeeds, the thread is still in a running state and we can try again.

Currently, however, mono_threads_suspend_begin_async_suspend returns a boolean on all platforms, so if suspending doesn't work we don't know whether to try again. (Currently we mark the thread as skipped and exclude it from the remainder of STW; this is a bad idea when the thread is interacting with the managed heap.)

So the right solution is to return a more refined result code and give STW a chance to try again if threads are in a transient failure state; a sketch of what that might look like is below.
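A minimal sketch of what a more refined result code might look like (the names here are hypothetical, not the actual Mono API):

/* Hypothetical sketch: distinguish transient suspend failures (worth retrying)
   from permanent ones, instead of returning a plain boolean. */
typedef enum {
    BEGIN_SUSPEND_OK,                /* suspend request accepted */
    BEGIN_SUSPEND_FAILED_TRANSIENT,  /* e.g. pthread_kill returned EAGAIN */
    BEGIN_SUSPEND_FAILED_PERMANENT   /* thread is shutting down / detaching */
} BeginSuspendResult;

/* The STW loop could then retry threads that only failed transiently instead
   of skipping them:

       switch (begin_async_suspend (thread)) {
       case BEGIN_SUSPEND_OK:               mark_suspended (thread); break;
       case BEGIN_SUSPEND_FAILED_TRANSIENT: queue_for_another_pass (thread); break;
       case BEGIN_SUSPEND_FAILED_PERMANENT: skip (thread); break;
       }
*/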

@CoffeeFlux added this to the 6.0.0 milestone Jul 6, 2020
@directhex (Contributor)

I can't find any cases of this happening in the last 2 weeks. Of course, it's possible that's just my tooling misbehaving, but let's close this for now, given that it's quite plausibly no longer an issue.

@ghost locked this issue as resolved and limited conversation to collaborators Dec 10, 2020