Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rare freeze on exit #50038

Closed
kpamnany opened this issue Jun 2, 2023 · 12 comments · May be fixed by #51466
Closed

Rare freeze on exit #50038

kpamnany opened this issue Jun 2, 2023 · 12 comments · May be fixed by #51466
Labels
bug Indicates an unexpected problem or unintended behavior

Comments

@kpamnany
Copy link
Contributor

kpamnany commented Jun 2, 2023

julia> versioninfo()
Julia Version 1.8.2+RAI
Commit 36034abf26 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-unknown-linux-gnu)
  CPU: 12 × Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 14 on 12 virtual cores
Environment:
  JULIA_PROJECT = /home/kpamnany/raicode.master
  JULIA_NUM_THREADS = 12,2

This may be another atexit-related problem/race. We caught this freeze on one of our CI tests (a script runs gdb to get thread backtraces after 30 minutes of silence). Some of the thread backtraces are informative.

Thread 1 backtrace.
Thread 1 (Thread 0x7ffff7da7040 (LWP 3646561) "julia"):
#0  0x00007ffff7837700 in _mm_pause () at /nix/store/gwdq0r345gjpnsbllqjgrb8c8z23aiw9-gcc-10.3.0/lib/gcc/x86_64-unknown-linux-gnu/10.3.0/include/xmmintrin.h:1332
#1  jl_gc_wait_for_the_world () at /build/source/src/gc.c:209
#2  ijl_gc_collect (collection=collection@entry=JL_GC_AUTO) at /build/source/src/gc.c:3395
#3  0x00007ffff78385dd in maybe_collect (ptls=0x464b60) at /build/source/src/gc.c:910
#4  jl_gc_pool_alloc_inner (osize=32, pool_offset=<optimized out>, ptls=0x464b60) at /build/source/src/gc.c:1254
#5  ijl_gc_pool_alloc (ptls=0x464b60, pool_offset=<optimized out>, osize=32) at /build/source/src/gc.c:1303
#6  0x00007fffe39b9471 in julia_inline_apply!_14014 () at compiler/ssair/inlining.jl:1031
#7  0x00007fffe3bfa064 in julia_process_simple!_12824 () at compiler/ssair/inlining.jl:1129
#8  0x00007fffe39b25d6 in julia_assemble_inline_todo!_18174 () at compiler/ssair/inlining.jl:1364
#9  0x00007fffe32ed2b6 in julia_ssa_inlining_pass!_13386 () at compiler/ssair/inlining.jl:82
#10 0x00007fffe32ed48d in jfptr_ssa_inlining_passNOT._13387.clone_1 () from /nix/store/6yrrk7h6d2amqrgzk80h30pkylbm6nvm-julia-1.8.2-patched/lib/julia/sys.so
#11 0x00007fffe3bf543a in julia_run_passes_13834 () at compiler/optimize.jl:539
#12 0x00007fffe39cd180 in optimize () at compiler/optimize.jl:504
#13 julia__typeinf_15708 () at compiler/typeinfer.jl:261
#14 0x00007fffe3bf0861 in julia_typeinf_17317 () at compiler/typeinfer.jl:217
#15 0x00007fffe3bf27ea in julia_typeinf_edge_14764 () at compiler/typeinfer.jl:881
#16 0x00007fffe3ae6789 in julia_abstract_call_method_14701 () at compiler/abstractinterpretation.jl:647
#17 0x00007fffe39a4de0 in julia_abstract_call_gf_by_type_13619 () at compiler/abstractinterpretation.jl:139
#18 0x00007fffe398aa91 in julia_abstract_call_known_15092 () at compiler/abstractinterpretation.jl:1716
#19 0x00007fffe398be84 in julia_abstract_call_15027 () at compiler/abstractinterpretation.jl:1786
#20 0x00007fffe3befe79 in julia_abstract_call_15039 () at compiler/abstractinterpretation.jl:1753
#21 0x00007fffe39c1b81 in julia_abstract_eval_statement_16908 () at compiler/abstractinterpretation.jl:1910
#22 0x00007fffe39c8d0b in julia_typeinf_local_17838 () at compiler/abstractinterpretation.jl:2386
#23 0x00007fffe39cbd2e in julia_typeinf_nocycle_14583 () at compiler/abstractinterpretation.jl:2482
#24 0x00007fffe39ccbd9 in julia__typeinf_15708 () at compiler/typeinfer.jl:234
#25 0x00007fffe3bf0861 in julia_typeinf_17317 () at compiler/typeinfer.jl:217
#26 0x00007fffe3b2b4f7 in julia_typeinf_ext_12957 () at compiler/typeinfer.jl:971
#27 0x00007fffe330e45f in julia_typeinf_ext_toplevel_14792 () at compiler/typeinfer.jl:1004
#28 0x00007fffe330e813 in julia_typeinf_ext_toplevel_14788 () at compiler/typeinfer.jl:1000
#29 0x00007fffe330e853 in jfptr_typeinf_ext_toplevel_14789.clone_1 () from /nix/store/6yrrk7h6d2amqrgzk80h30pkylbm6nvm-julia-1.8.2-patched/lib/julia/sys.so
#30 0x00007ffff77db6ac in jl_apply (nargs=3, args=0x7fffffffac00) at /build/source/src/julia.h:1842
#31 jl_type_infer (mi=mi@entry=0x7fffe479c660 <jl_system_image_data+8554016>, world=world@entry=32254, force=force@entry=0) at /build/source/src/gf.c:319
#32 0x00007ffff736dd15 in jl_generate_fptr_impl (mi=0x7fffe479c660 <jl_system_image_data+8554016>, world=32254) at /build/source/src/jitlayers.cpp:332
#33 0x00007ffff77d9fc7 in jl_compile_method_internal (world=32254, mi=0x7fffe479c660 <jl_system_image_data+8554016>) at /build/source/src/gf.c:2081
#34 jl_compile_method_internal (mi=0x7fffe479c660 <jl_system_image_data+8554016>, world=32254) at /build/source/src/gf.c:2025
#35 0x00007ffff77daeed in _jl_invoke (world=32254, mfunc=<optimized out>, nargs=1, args=0x7fffffffaf68, F=0x7fffe479c3f0 <jl_system_image_data+8553392>) at /build/source/src/gf.c:2359
#36 ijl_apply_generic (F=<optimized out>, args=0x7fffffffaf68, nargs=<optimized out>) at /build/source/src/gf.c:2549
#37 0x00007ffff782cfb7 in jl_apply (nargs=2, args=0x7fffffffaf60) at /build/source/src/julia.h:1842
#38 run_finalizer (ct=ct@entry=0x7fffedf08010, o=<optimized out>, ff=<optimized out>) at /build/source/src/gc.c:283
#39 0x00007ffff782de35 in jl_gc_run_finalizers_in_list (ct=ct@entry=0x7fffedf08010, list=list@entry=0x7fffffffb0f0) at /build/source/src/gc.c:370
#40 0x00007ffff782dfef in run_finalizers (ct=0x7fffedf08010) at /build/source/src/gc.c:413
#41 0x00007ffff77fba8b in ijl_atexit_hook (exitcode=exitcode@entry=0) at /build/source/src/init.c:236
#42 0x00007ffff7841618 in jl_repl_entrypoint (argc=<optimized out>, argv=0x7fffffffb608) at /build/source/src/jlapi.c:720
#43 0x0000000000401059 in main (argc=<optimized out>, argv=<optimized out>) at /build/source/cli/loader_exe.c:59
Thread 6 backtrace.
Thread 6 (Thread 0x7fffe0c9b640 (LWP 3646566) "julia"):
#0  0x00007ffff7843b86 in _mm_pause () at /nix/store/gwdq0r345gjpnsbllqjgrb8c8z23aiw9-gcc-10.3.0/lib/gcc/x86_64-unknown-linux-gnu/10.3.0/include/xmmintrin.h:1332
#1  jl_safepoint_wait_gc () at /build/source/src/safepoint.c:163
#2  0x00007ffff7843778 in jl_set_gc_and_wait () at /build/source/src/julia_internal.h:839
#3  segv_handler (context=0x7fff74ffe480, info=<optimized out>, sig=11) at /build/source/src/signals-unix.c:326
#4  segv_handler (sig=11, info=<optimized out>, context=0x7fff74ffe480) at /build/source/src/signals-unix.c:313
#5  <signal handler called>
#6  0x00007ffff783925c in jl_gc_state_set (state=<optimized out>, old_state=<optimized out>, ptls=0x7fff50000b60) at /build/source/src/julia_threads.h:341
#7  maybe_collect (ptls=0x7fff50000b60) at /build/source/src/julia_threads.h:334
#8  jl_gc_pool_alloc_inner (osize=<optimized out>, pool_offset=<optimized out>, ptls=0x7fff50000b60) at /build/source/src/gc.c:1254
#9  jl_gc_pool_alloc_noinline (osize=<optimized out>, pool_offset=<optimized out>, ptls=0x7fff50000b60) at /build/source/src/gc.c:1313
#10 jl_gc_alloc_ (ty=0x4eadc000, sz=64, ptls=0x7fff50000b60) at /build/source/src/julia_internal.h:369
#11 jl_gc_alloc (ptls=0x7fff50000b60, sz=sz@entry=64, ty=ty@entry=0x4eadc000) at /build/source/src/gc.c:3455
#12 0x00007ffff77c398f in jl_gc_alloc_buf (sz=64, ptls=<optimized out>) at /build/source/src/julia_internal.h:404
#13 jl_reserve_excstack (reserved_size=<optimized out>, stack=<optimized out>) at /build/source/src/rtutils.c:335
#14 jl_push_excstack (stack=stack@entry=0x7fffedf08d10, exception=0x7fffe4b322a0 <jl_system_image_data+12313184>, bt_data=0x7fffd0188010, bt_size=4) at /build/source/src/rtutils.c:347
#15 0x00007ffff77fd392 in throw_internal (ct=ct@entry=0x7fffedf08c90, exception=<optimized out>, exception@entry=0x7fffe4b322a0 <jl_system_image_data+12313184>) at /build/source/src/task.c:621
#16 0x00007ffff77fd6a8 in ijl_throw (e=0x7fffe4b322a0 <jl_system_image_data+12313184>) at /build/source/src/task.c:655
#17 0x00007fffe3832538 in wait () at asyncevent.jl:155
#18 julia_profile_printing_listener_63537 () at /build/source/usr/share/julia/stdlib/v1.8/Profile/src/Profile.jl:39
#19 0x00007fffe30a4bca in julia_#3_63526 () at threadingconstructs.jl:321
#20 0x00007fffe30a4bd9 in jfptr_YY.3_63527.clone_1 () from /nix/store/6yrrk7h6d2amqrgzk80h30pkylbm6nvm-julia-1.8.2-patched/lib/julia/sys.so
#21 0x00007ffff77fd26f in jl_apply (nargs=1, args=<unavailable>) at /build/source/src/julia.h:1842
#22 start_task () at /build/source/src/task.c:932
Thread 3 backtrace.
Thread 3 (Thread 0x7fffe1c9d640 (LWP 3646563) "julia"):
#0  0x00007ffff7fb3cf5 in __futex_abstimed_wait_common64 () from /nix/store/6yrrk7h6d2amqrgzk80h30pkylbm6nvm-julia-1.8.2-patched/bin/../lib/julia/libpthread.so.0
#1  0x00007ffff7fadc22 in pthread_cond_wait@@GLIBC_2.3.2 () from /nix/store/6yrrk7h6d2amqrgzk80h30pkylbm6nvm-julia-1.8.2-patched/bin/../lib/julia/libpthread.so.0
#2  0x00007ffff78a7c39 in uv_cond_wait (cond=<optimized out>, mutex=<optimized out>) at /build/source/deps/srccache/libuv-e6f0e4900e195c8352f821abe2b3cffc3089547b/src/unix/thread.c:883
#3  0x00007ffff782afb7 in ijl_task_get_next (trypoptask=<optimized out>, q=<optimized out>, checkempty=0x7fffe4796d10 <jl_system_image_data+8531152>) at /build/source/src/partr.c:423
#4  0x00007fffe33ca5a6 in julia_poptask_45439 () at task.jl:929
#5  0x00007fffe336436c in julia_wait_52004 () at task.jl:938
#6  0x00007b056bbbcc33 in ?? ()
#7  0x0000000000000000 in ?? ()

In thread 1, the atexit hook runs finalizers; a finalizer has to be compiled; the compiler allocates and triggers GC; the thread stops the world, but it never proceeds from there.

Thread 6 had an exception thrown and while handling that, gets the stop-the-world and waits along with thread 1.

All the other threads are like thread 3, which appears to be asleep and hasn't been woken up for GC.

It's not clear why the freeze happened but we've been seeing and fixing some atexit-related races recently and this may be an instance of another such race.

Cc: @d-netto

@vchuravy
Copy link
Member

vchuravy commented Jun 2, 2023

Thread 6 had an exception thrown and while handling that, gets the stop-the-world and waits along with thread 1.

Slight correction, the exception is the stop the world signal.

If you have a process dump at that point you could look at how many threads we expect to reach the STW. It looks like we are waiting for threads that are not coming.

@kpamnany
Copy link
Contributor Author

kpamnany commented Jun 2, 2023

@NHDaly points out that threads 34 and 35 (the interactive threads) also have different back traces than the default threads but are also odd/confusing -- why would both threads be stuck in sigprocmask at the same time? That's not a blocking call?

Thread 34 backtrace.
Thread 34 (Thread 0x7fff5effe640 (LWP 3646594) "julia"):
#0  0x00007ffff7e4d130 in pthread_sigmask@GLIBC_2.2.5 () from /nix/store/0xxjx37fcy2nl3yz6igmv4mag2a7giq6-glibc-2.33-123/lib/libc.so.6
#1  0x00007ffff7e05e29 in sigprocmask () from /nix/store/0xxjx37fcy2nl3yz6igmv4mag2a7giq6-glibc-2.33-123/lib/libc.so.6
#2  0x00007ffff777d9f6 in _ULx86_64_dwarf_step () from /nix/store/6yrrk7h6d2amqrgzk80h30pkylbm6nvm-julia-1.8.2-patched/bin/../lib/julia/libunwind.so.8
#3  0x00007ffff7779b40 in _ULx86_64_step () from /nix/store/6yrrk7h6d2amqrgzk80h30pkylbm6nvm-julia-1.8.2-patched/bin/../lib/julia/libunwind.so.8
#4  0x00007ffff782b583 in jl_unw_step (from_signal_handler=<optimized out>, sp=<synthetic pointer>, ip=<synthetic pointer>, cursor=0x7b056d633460) at /build/source/src/stackwalk.c:553
#5  jl_unw_stepn (cursor=cursor@entry=0x7b056d633460, bt_data=bt_data@entry=0x7fffb035e010, bt_size=bt_size@entry=0x7b056d633088, sp=sp@entry=0x0, maxsize=maxsize@entry=80000, skip=1, skip@entry=3, ppgcstack=0x7b056d633080, from_signal_handler=<optimized out>) at /build/source/src/stackwalk.c:99
#6  0x00007ffff782baa0 in rec_backtrace (bt_data=0x7fffb035e010, maxsize=maxsize@entry=80000, skip=skip@entry=2) at /build/source/src/stackwalk.c:222
#7  0x00007ffff77fcdd0 in record_backtrace (ptls=0x7fff30000b60, skip=skip@entry=1) at /build/source/src/task.c:345
#8  0x00007ffff77fd69d in ijl_throw (e=0x7fffe4b322a0 <jl_system_image_data+12313184>) at /build/source/src/task.c:654
#9  0x00007fff5c138293 in ?? ()
#10 0x0101000000000000 in ?? ()
#11 0x00007fff09a7c078 in ?? ()
#12 0x00007fff1ad10b00 in ?? ()
#13 0x00007fff04730eb0 in ?? ()
#14 0x00007fff04410010 in ?? ()
#15 0x00007fff1deae710 in ?? ()
#16 0x00007fff2a1e63b8 in ?? ()
#17 0x00007fff2a28f4b8 in ?? ()
#18 0x00007fffef05acb0 in ?? ()
#19 0x00007ffff1f13ca0 in ?? ()
#20 0x00007b056d6339a0 in ?? ()
#21 0x0000000000000000 in ?? ()
Thread 35 backtrace.
Thread 35 (Thread 0x7fff5dbfd640 (LWP 3646595) "julia"):
#0  0x00007ffff7e4d130 in pthread_sigmask@GLIBC_2.2.5 () from /nix/store/0xxjx37fcy2nl3yz6igmv4mag2a7giq6-glibc-2.33-123/lib/libc.so.6
#1  0x00007ffff7e05e29 in sigprocmask () from /nix/store/0xxjx37fcy2nl3yz6igmv4mag2a7giq6-glibc-2.33-123/lib/libc.so.6
#2  0x00007ffff777d9f6 in _ULx86_64_dwarf_step () from /nix/store/6yrrk7h6d2amqrgzk80h30pkylbm6nvm-julia-1.8.2-patched/bin/../lib/julia/libunwind.so.8
#3  0x00007ffff7779b40 in _ULx86_64_step () from /nix/store/6yrrk7h6d2amqrgzk80h30pkylbm6nvm-julia-1.8.2-patched/bin/../lib/julia/libunwind.so.8
#4  0x00007ffff782b583 in jl_unw_step (from_signal_handler=<optimized out>, sp=<synthetic pointer>, ip=<synthetic pointer>, cursor=0x7fff03b8f460) at /build/source/src/stackwalk.c:553
#5  jl_unw_stepn (cursor=cursor@entry=0x7fff03b8f460, bt_data=bt_data@entry=0x7fffb02c1010, bt_size=bt_size@entry=0x7fff03b8f088, sp=sp@entry=0x0, maxsize=maxsize@entry=80000, skip=skip@entry=3, ppgcstack=0x7fff03b8f080, from_signal_handler=<optimized out>) at /build/source/src/stackwalk.c:99
#6  0x00007ffff782baa0 in rec_backtrace (bt_data=0x7fffb02c1010, maxsize=maxsize@entry=80000, skip=skip@entry=2) at /build/source/src/stackwalk.c:222
#7  0x00007ffff77fcdd0 in record_backtrace (ptls=0x7fff34000b60, skip=skip@entry=1) at /build/source/src/task.c:345
#8  0x00007ffff77fd69d in ijl_throw (e=0x7fffe4b322a0 <jl_system_image_data+12313184>) at /build/source/src/task.c:654
#9  0x00007fff5c135923 in ?? ()
#10 0x0101000000000000 in ?? ()
#11 0x00007fff11038398 in ?? ()
#12 0x00007fff1ad10b00 in ?? ()
#13 0x00007fff0440c8b0 in ?? ()
#14 0x00007fff0ab8cc50 in ?? ()
#15 0x00007fff1deae710 in ?? ()
#16 0x00007fff2a1e63b8 in ?? ()
#17 0x00007fff2a292e90 in ?? ()
#18 0x00007fffef05b068 in ?? ()
#19 0x00007ffff1f13ca0 in ?? ()
#20 0x00007fff03b8f9e0 in ?? ()
#21 0x000000008436e62b in ?? ()
#22 0x00000000000000d0 in ?? ()
#23 0x0000000000000000 in ?? ()

@kpamnany
Copy link
Contributor Author

kpamnany commented Jun 2, 2023

This run was with 32,2 threads.

@vchuravy
Copy link
Member

vchuravy commented Jun 2, 2023

So it's interesting that we go through sigprocmask.

The use of sigprocmask() is unspecified in a multithreaded
process; see pthread_sigmask(3).

@vchuravy
Copy link
Member

vchuravy commented Jun 2, 2023

https://www.gnu.org/software/libc/manual/html_node/Process-Signal-Mask.html

So that seems like a bug in libunwind?

@kpamnany
Copy link
Contributor Author

kpamnany commented Jun 2, 2023

The stack shows pthread_sigmask on top of sigprocmask, so I suspect that even though POSIX.1 doesn't specify the behavior of sigprocmask in a multi-threaded program, the implementation handles it correctly.

In a different context, @vtjnash had pointed at a libunwind commit which masks signals during dl_iterate_phdr. If signals are masked while a backtrace is produced, is the GC's stop-the-world SIGSEGV being missed?

@vchuravy
Copy link
Member

vchuravy commented Jun 2, 2023

The sigsegv is only caused by the thread doing a load in Julia/Runtime code.

So it shouldn't be missed. But maybe we need to transition those thread to GC unsafe?

@kpamnany
Copy link
Contributor Author

kpamnany commented Jun 2, 2023

Ah right. We aren't sending a signal for stop-the-world, we're causing a signal. Okay.

So make rec_backtrace GC unsafe? That will reduce stop-the-world time, right?

@vchuravy
Copy link
Member

vchuravy commented Jun 3, 2023

Yes, but the question still is: Are thread 34&35 in your example hanging?

So making it "GC unsafe" would stop the other threads from hanging in STW, but we shouldn't deadlock as long as Thread 34 and Thread 35 eventually return to managed or proper runtime code. I don't think there is a safepoint anywhere in jl_throw.

@kpamnany
Copy link
Contributor Author

kpamnany commented Jun 5, 2023

It's hard to say if they're hanging. The stack traces are from an automated gdb attach and thread apply all bt which is triggered by 30 minutes of silence. I don't know if it's more unlikely that both threads have the same stack trace and are in the same place or that sigprocmask seems to freeze, but both seem to have happened here.

@brenhinkeller brenhinkeller added the bug Indicates an unexpected problem or unintended behavior label Aug 4, 2023
@vtjnash
Copy link
Member

vtjnash commented Dec 21, 2023

From the stacktrace, it looks like this should have been fixed by #41616 (only in v1.10, since it was reverted in v1.9 branch)

@kpamnany
Copy link
Contributor Author

I think we haven't seen this since backporting #41616 and #47393 to our build, so will close.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants