CI failing on nightly #2476

Closed
lgoettgens opened this issue Jun 16, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@lgoettgens
Member

Since around June 14th, the nightly tests have been hanging after running for a few minutes, and the job is then killed after 2h30min.

It started some time between d5412e9 and 2bca8fc.

I assume this is due to a change in Julia itself, since only nightly fails. Using https://github.com/oscar-system/Oscar.jl/actions/runs/5258937597/jobs/9503831992 and https://github.com/oscar-system/Oscar.jl/actions/runs/5263887152/jobs/9514506365, I could narrow it down to the range JuliaLang/julia@320e00d...8a1b642.
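For anyone who wants to narrow that range further, here is a minimal sketch of the bisection logic in Python. The commit names and the `is_bad` predicate are placeholders, not real Julia commits or existing tooling. In practice `is_bad` would have to build Julia at the given commit and run the tests under a timeout, and `git bisect` already implements the same search.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, that the last one is bad,
    and that once a commit is bad every later commit is bad as well.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the first bad commit is at mid or earlier
        else:
            lo = mid + 1  # the first bad commit is after mid
    return commits[lo]


if __name__ == "__main__":
    # Placeholder history: pretend the hang was introduced in "c6".
    history = [f"c{i}" for i in range(10)]
    print(first_bad(history, lambda c: int(c[1:]) >= 6))
```

Since a hang may not show up on every run, a real predicate would probably need to repeat the test a few times before calling a commit good.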

@lgoettgens lgoettgens added the bug Something isn't working label Jun 16, 2023
@benlorenz
Member

My guess is that JuliaLang/julia@03c4bc1#diff-b6ee767647e20ffec70782ae28c0f9d50dc5eb5d2e5285f9d7071064434fe3d9 requires a rebuild of libjulia_jll and some GAP jlls, since it changes some structs in gc.h. cc @fingolfin

The process seems to run into a deadlock. I can reproduce this locally with just GAP.jl, which also gets stuck at random (not every time like Oscar.jl, but that may just be because its tests are a lot shorter):

  * frame #0: 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183
    frame #1: 0x000014fdceda86f1 libpthread.so.0`__pthread_cond_wait at pthread_cond_wait.c:508
    frame #2: 0x000014fdceda8630 libpthread.so.0`__pthread_cond_wait(cond=0x000014fdce1fc920, mutex=0x000014fdce1fc960) at pthread_cond_wait.c:638
    frame #3: 0x000014fdcdd2dd6a libjulia-internal.so.1`uv_cond_wait(cond=0x000014fdce1fc920, mutex=0x000014fdce1fc960) at thread.c:883
    frame #4: 0x000014fdcdcb2175 libjulia-internal.so.1`jl_safepoint_wait_gc at safepoint.c:173:13
    frame #5: 0x000014fdcdcb1f9b libjulia-internal.so.1`segv_handler [inlined] jl_set_gc_and_wait at julia_internal.h:945:5
    frame #6: 0x000014fdcdcb1f7c libjulia-internal.so.1`segv_handler at signals-unix.c:351:9
    frame #7: 0x000014fdcdcb1f12 libjulia-internal.so.1`segv_handler(sig=<unavailable>, info=<unavailable>, context=0x000014fdc1ff93c0) at signals-unix.c:338:24
    frame #8: 0x000014fdcedad8c0 libpthread.so.0`__restore_rt
    frame #9: 0x000014fdcdc68de4 libjulia-internal.so.1`jl_gc_state_save_and_set at julia_threads.h:348:9
    frame #10: 0x000014fdcdc68de0 libjulia-internal.so.1`jl_gc_state_save_and_set [inlined] jl_gc_state_set(old_state='\x01', state='\0', ptls=0x0000000000b90b60) at julia_threads.h:341:22
    frame #11: 0x000014fdcdc68de0 libjulia-internal.so.1`jl_gc_state_save_and_set(ptls=0x0000000000b90b60, state='\0') at julia_threads.h:354:12
    frame #12: 0x000014fdcdc6945f libjulia-internal.so.1`ijl_sig_throw at task.c:756:5
    frame #13: 0x000014fdcdc69447 libjulia-internal.so.1`ijl_sig_throw at task.c:801:5

Thread list:

(lldb) thread info all
thread #1: tid = 6955, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #2: tid = 6957, 0x000014fdcebedc7c libc.so.6`__GI___sigtimedwait(set=0x000014fdc69fec70, info=0x000014fdc69fecf0, timeout=0x0000000000000000) at sigtimedwait.c:29, name = 'julia', stop reason = signal SIGSTOP

thread #3: tid = 6958, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #4: tid = 6959, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #5: tid = 6960, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #6: tid = 6961, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #7: tid = 6962, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #8: tid = 6963, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #9: tid = 6964, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #10: tid = 6965, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #11: tid = 6966, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #12: tid = 6967, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #13: tid = 6968, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

Everything is waiting, with 0% CPU load. Could this be memory corruption caused by the changed structs?
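To check a longer thread list quickly, a small script can count threads by their innermost frame. This is only a sketch: the regular expression is an assumption based on the `thread info all` lines shown above, and other lldb versions may format them differently.

```python
import re
from collections import Counter

# Assumed line shape, taken from the output above:
#   thread #N: tid = T, 0xADDR library`function ...
THREAD_RE = re.compile(r"^thread #(\d+): tid = (\d+), 0x[0-9a-f]+ ([^`\s]+)`(\w+)")


def summarize(thread_info: str) -> Counter:
    """Count threads by the (library, function) of their innermost frame."""
    counts = Counter()
    for line in thread_info.splitlines():
        m = THREAD_RE.match(line.strip())
        if m:
            counts[(m.group(3), m.group(4))] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the thread list above.
    sample = """
thread #1: tid = 6955, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP
thread #2: tid = 6957, 0x000014fdcebedc7c libc.so.6`__GI___sigtimedwait(set=0x000014fdc69fec70, info=0x000014fdc69fecf0, timeout=0x0000000000000000) at sigtimedwait.c:29, name = 'julia', stop reason = signal SIGSTOP
thread #3: tid = 6958, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP
"""
    for (lib, func), n in summarize(sample).most_common():
        print(f"{n:3d}  {lib}  {func}")
```

Applied to the full list above, this comes out as 12 threads in `__pthread_cond_wait` and one in `__GI___sigtimedwait`, so no thread is doing any work.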

@fingolfin
Member

Thank you for filing the issue. I hope we can figure it out soon...

@benlorenz
Member

benlorenz commented Jun 23, 2023

This is now fixed on Julia master, thanks Max. Tests with nightly work on Ubuntu again. Unfortunately, the upload job for the Julia macOS nightlies is currently broken; once that is fixed, macOS should work again as well.

Edit: The macOS nightly tests have now succeeded as well.
