Rare freeze on exit #50038
Comments
Slight correction: the exception is the stop-the-world signal. If you have a process dump at that point, you could look at how many threads we expect to reach the STW. It looks like we are waiting for threads that are not coming.
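To make the "waiting for threads that are not coming" idea concrete, here is a toy model in Python. It is not Julia's runtime code: the class `StopTheWorld` and the thread names are hypothetical, and the timeout exists only so the demo terminates. It shows how comparing the expected set of threads with the set that actually checked in identifies the missing ones.

```python
import threading


class StopTheWorld:
    """Toy coordinator: the requester waits until every expected thread checks in."""

    def __init__(self, expected):
        self.expected = set(expected)  # threads we expect to reach the safepoint
        self.arrived = set()           # threads that actually reached it
        self.cond = threading.Condition()

    def reach_safepoint(self, name):
        # Called by a worker when it notices the stop request.
        with self.cond:
            self.arrived.add(name)
            self.cond.notify_all()

    def wait_for_world(self, timeout):
        # Returns the set of threads still missing when the wait ends
        # (an empty set means the world stopped successfully).
        with self.cond:
            self.cond.wait_for(lambda: self.arrived >= self.expected, timeout=timeout)
            return self.expected - self.arrived


def demo():
    stw = StopTheWorld(["t1", "t2", "t3"])
    # Only t1 and t2 ever reach the safepoint; t3 models a thread that never arrives.
    workers = [threading.Thread(target=stw.reach_safepoint, args=(n,)) for n in ("t1", "t2")]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return stw.wait_for_world(timeout=0.2)
```

Without the timeout, the requester in this model would wait forever, which is the shape of the hang described here.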
@NHDaly points out that threads 34 and 35 (the interactive threads) also have different backtraces than the default threads, but they are also odd/confusing -- why would both threads be stuck in
Thread 34 backtrace.
Thread 34 (Thread 0x7fff5effe640 (LWP 3646594) "julia"):
#0 0x00007ffff7e4d130 in pthread_sigmask@GLIBC_2.2.5 () from /nix/store/0xxjx37fcy2nl3yz6igmv4mag2a7giq6-glibc-2.33-123/lib/libc.so.6
#1 0x00007ffff7e05e29 in sigprocmask () from /nix/store/0xxjx37fcy2nl3yz6igmv4mag2a7giq6-glibc-2.33-123/lib/libc.so.6
#2 0x00007ffff777d9f6 in _ULx86_64_dwarf_step () from /nix/store/6yrrk7h6d2amqrgzk80h30pkylbm6nvm-julia-1.8.2-patched/bin/../lib/julia/libunwind.so.8
#3 0x00007ffff7779b40 in _ULx86_64_step () from /nix/store/6yrrk7h6d2amqrgzk80h30pkylbm6nvm-julia-1.8.2-patched/bin/../lib/julia/libunwind.so.8
#4 0x00007ffff782b583 in jl_unw_step (from_signal_handler=<optimized out>, sp=<synthetic pointer>, ip=<synthetic pointer>, cursor=0x7b056d633460) at /build/source/src/stackwalk.c:553
#5 jl_unw_stepn (cursor=cursor@entry=0x7b056d633460, bt_data=bt_data@entry=0x7fffb035e010, bt_size=bt_size@entry=0x7b056d633088, sp=sp@entry=0x0, maxsize=maxsize@entry=80000, skip=1, skip@entry=3, ppgcstack=0x7b056d633080, from_signal_handler=<optimized out>) at /build/source/src/stackwalk.c:99
#6 0x00007ffff782baa0 in rec_backtrace (bt_data=0x7fffb035e010, maxsize=maxsize@entry=80000, skip=skip@entry=2) at /build/source/src/stackwalk.c:222
#7 0x00007ffff77fcdd0 in record_backtrace (ptls=0x7fff30000b60, skip=skip@entry=1) at /build/source/src/task.c:345
#8 0x00007ffff77fd69d in ijl_throw (e=0x7fffe4b322a0 <jl_system_image_data+12313184>) at /build/source/src/task.c:654
#9 0x00007fff5c138293 in ?? ()
#10 0x0101000000000000 in ?? ()
#11 0x00007fff09a7c078 in ?? ()
#12 0x00007fff1ad10b00 in ?? ()
#13 0x00007fff04730eb0 in ?? ()
#14 0x00007fff04410010 in ?? ()
#15 0x00007fff1deae710 in ?? ()
#16 0x00007fff2a1e63b8 in ?? ()
#17 0x00007fff2a28f4b8 in ?? ()
#18 0x00007fffef05acb0 in ?? ()
#19 0x00007ffff1f13ca0 in ?? ()
#20 0x00007b056d6339a0 in ?? ()
#21 0x0000000000000000 in ?? ()
Thread 35 backtrace.
Thread 35 (Thread 0x7fff5dbfd640 (LWP 3646595) "julia"):
#0 0x00007ffff7e4d130 in pthread_sigmask@GLIBC_2.2.5 () from /nix/store/0xxjx37fcy2nl3yz6igmv4mag2a7giq6-glibc-2.33-123/lib/libc.so.6
#1 0x00007ffff7e05e29 in sigprocmask () from /nix/store/0xxjx37fcy2nl3yz6igmv4mag2a7giq6-glibc-2.33-123/lib/libc.so.6
#2 0x00007ffff777d9f6 in _ULx86_64_dwarf_step () from /nix/store/6yrrk7h6d2amqrgzk80h30pkylbm6nvm-julia-1.8.2-patched/bin/../lib/julia/libunwind.so.8
#3 0x00007ffff7779b40 in _ULx86_64_step () from /nix/store/6yrrk7h6d2amqrgzk80h30pkylbm6nvm-julia-1.8.2-patched/bin/../lib/julia/libunwind.so.8
#4 0x00007ffff782b583 in jl_unw_step (from_signal_handler=<optimized out>, sp=<synthetic pointer>, ip=<synthetic pointer>, cursor=0x7fff03b8f460) at /build/source/src/stackwalk.c:553
#5 jl_unw_stepn (cursor=cursor@entry=0x7fff03b8f460, bt_data=bt_data@entry=0x7fffb02c1010, bt_size=bt_size@entry=0x7fff03b8f088, sp=sp@entry=0x0, maxsize=maxsize@entry=80000, skip=skip@entry=3, ppgcstack=0x7fff03b8f080, from_signal_handler=<optimized out>) at /build/source/src/stackwalk.c:99
#6 0x00007ffff782baa0 in rec_backtrace (bt_data=0x7fffb02c1010, maxsize=maxsize@entry=80000, skip=skip@entry=2) at /build/source/src/stackwalk.c:222
#7 0x00007ffff77fcdd0 in record_backtrace (ptls=0x7fff34000b60, skip=skip@entry=1) at /build/source/src/task.c:345
#8 0x00007ffff77fd69d in ijl_throw (e=0x7fffe4b322a0 <jl_system_image_data+12313184>) at /build/source/src/task.c:654
#9 0x00007fff5c135923 in ?? ()
#10 0x0101000000000000 in ?? ()
#11 0x00007fff11038398 in ?? ()
#12 0x00007fff1ad10b00 in ?? ()
#13 0x00007fff0440c8b0 in ?? ()
#14 0x00007fff0ab8cc50 in ?? ()
#15 0x00007fff1deae710 in ?? ()
#16 0x00007fff2a1e63b8 in ?? ()
#17 0x00007fff2a292e90 in ?? ()
#18 0x00007fffef05b068 in ?? ()
#19 0x00007ffff1f13ca0 in ?? ()
#20 0x00007fff03b8f9e0 in ?? ()
#21 0x000000008436e62b in ?? ()
#22 0x00000000000000d0 in ?? ()
#23 0x0000000000000000 in ?? ()
This run was with
So it's interesting that we go through
https://www.gnu.org/software/libc/manual/html_node/Process-Signal-Mask.html
So that seems like a bug in libunwind?
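For context, the linked glibc manual page notes that the behavior of `sigprocmask` in a multi-threaded process is unspecified and recommends `pthread_sigmask`, because each thread has its own signal mask. A minimal Python sketch of that per-thread behavior follows (Unix only). The function names `worker` and `demo` are made up for illustration, and this says nothing about what libunwind actually does internally.

```python
import signal
import threading


def current_mask():
    # SIG_BLOCK with an empty set changes nothing and returns the calling thread's mask.
    return signal.pthread_sigmask(signal.SIG_BLOCK, set())


def worker(results):
    # Block SIGUSR1 in this thread only.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
    results["worker"] = current_mask()


def demo():
    results = {"main_before": current_mask()}
    t = threading.Thread(target=worker, args=(results,))
    t.start()
    t.join()
    # The worker's change does not leak into the main thread's mask.
    results["main_after"] = current_mask()
    return results
```

The worker's mask contains `SIGUSR1` afterwards, while the main thread's mask is unchanged.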
The stack shows
In a different context, @vtjnash had pointed at a
The SIGSEGV is only caused by the thread doing a load in Julia/runtime code, so it shouldn't be missed. But maybe we need to transition those threads to GC unsafe?
Ah right. We aren't sending a signal for stop-the-world, we're causing a signal. Okay. So make
Yes, but the question still is: are threads 34 and 35 in your example hanging? Making it "GC unsafe" would stop the other threads from hanging in STW, but we shouldn't deadlock as long as threads 34 and 35 eventually return to managed or proper runtime code. I don't think there is a safepoint anywhere in jl_throw.
It's hard to say if they're hanging. The stack traces are from an automated |
From the stacktrace, it looks like this should have been fixed by #41616 (only in v1.10, since it was reverted in v1.9 branch) |
This may be another atexit-related problem/race. We caught this freeze on one of our CI tests (a script runs gdb to get thread backtraces after 30 minutes of silence). Some of the thread backtraces are informative.
Thread 1 backtrace.
Thread 6 backtrace.
Thread 3 backtrace.
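The CI script itself is not shown here. As a rough sketch of how such a watchdog might work, assuming a wrapper that streams the child's output and treats a long silence as a hang, something like the following could be used. The names `run_with_watchdog`, `gdb_backtrace_cmd`, and `on_silence` are hypothetical; the gdb invocation uses the standard `-p`, `-batch`, and `-ex` options.

```python
import subprocess
import sys
import threading
import time


def gdb_backtrace_cmd(pid):
    # Attach to the process, print every thread's backtrace, then detach and exit.
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def run_with_watchdog(cmd, silence_timeout, on_silence):
    """Run cmd; if it prints nothing for silence_timeout seconds, call on_silence(pid) and kill it."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    last_output = [time.monotonic()]

    def pump():
        # Forward the child's output and remember when it last said something.
        for line in proc.stdout:
            last_output[0] = time.monotonic()
            sys.stdout.write(line)

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()
    while proc.poll() is None:
        if time.monotonic() - last_output[0] > silence_timeout:
            on_silence(proc.pid)  # e.g. lambda pid: subprocess.run(gdb_backtrace_cmd(pid))
            proc.kill()
            proc.wait()
            reader.join(timeout=1)
            return "hung"
        time.sleep(0.05)
    reader.join()
    return "exited"
```

In a setup like the one described, `silence_timeout` would be 30 minutes and `on_silence` would run the gdb command so the backtraces land in the CI log before the job is killed.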
In thread 1, the atexit hook runs finalizers; a finalizer has to be compiled; the compiler allocates and triggers GC; the thread stops the world, but it never proceeds from there.
Thread 6 had an exception thrown and, while handling that, gets the stop-the-world and waits along with thread 1.
All the other threads are like thread 3, which appears to be asleep and hasn't been woken up for GC.
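The general hazard, an exit hook that depends on cooperation from threads that are still in some other state, can be illustrated outside Julia. The Python sketch below is only an analogy under that assumption, not a reproduction of this bug: a child interpreter registers an `atexit` hook that needs a lock owned by a still-running daemon thread. The names `CHILD` and `run_child` are hypothetical, and the timeout is there so the demo reports the stall instead of hanging.

```python
import subprocess
import sys

# Program run in a child interpreter so that its exit hooks actually fire.
CHILD = r'''
import atexit, threading

lock = threading.Lock()
held = threading.Event()

def hold_forever():
    lock.acquire()             # this thread owns the resource...
    held.set()
    threading.Event().wait()   # ...and then sleeps forever

def exit_hook():
    # The hook needs something only the sleeping thread can release.
    ok = lock.acquire(timeout=0.2)
    print("hook acquired lock" if ok else "hook timed out waiting for lock")

atexit.register(exit_hook)
threading.Thread(target=hold_forever, daemon=True).start()
held.wait()                    # make sure the lock is held before exiting
'''


def run_child():
    out = subprocess.run([sys.executable, "-c", CHILD],
                         capture_output=True, text=True, timeout=30)
    return out.stdout.strip()
```

Without the timeout on `lock.acquire`, the child would freeze on exit, much as thread 1 here waits for a world that never stops.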
It's not clear why the freeze happened, but we've been seeing and fixing some atexit-related races recently, and this may be an instance of another such race.
Cc: @d-netto