Assertion failure in Scheduler code #55235

Closed
d-netto opened this issue Jul 24, 2024 · 5 comments · Fixed by #55440
Labels
bug Indicates an unexpected problem or unintended behavior ci Continuous integration

Comments

@d-netto
Member

d-netto commented Jul 24, 2024

See https://buildkite.com/julialang/julia-master/builds/38431#0190e57f-77e8-461e-afd1-be9abc0297f8:

[556] signal 6 (-6): Aborted
in expression starting at none:1
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7f3c58f4f40e)
__assert_fail at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
ijl_task_get_next at /cache/build/builder-amdci5-0/julialang/julia-master/src/scheduler.c:452
poptask at ./task.jl:1168
wait at ./task.jl:1177
uv_write at ./stream.jl:1073
unsafe_write at ./stream.jl:1146
write at ./strings/io.jl:248 [inlined]
print at ./strings/io.jl:250
unknown function (ip: 0x7f3bbbc89126)
_jl_invoke at /cache/build/builder-amdci5-0/julialang/julia-master/src/gf.c:3177 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-0/julialang/julia-master/src/gf.c:3354
showerror at ./errorshow.jl:152
unknown function (ip: 0x7f3bbbc890b6)
_jl_invoke at /cache/build/builder-amdci5-0/julialang/julia-master/src/gf.c:3177 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-0/julialang/julia-master/src/gf.c:3354
_atexit at ./initdefs.jl:467
jfptr__atexit_67251.1 at /cache/build/tester-amdci4-10/julialang/julia-master/julia-d00e19822c/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci5-0/julialang/julia-master/src/gf.c:3177 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-0/julialang/julia-master/src/gf.c:3354
jl_apply at /cache/build/builder-amdci5-0/julialang/julia-master/src/julia.h:2183 [inlined]
ijl_atexit_hook at /cache/build/builder-amdci5-0/julialang/julia-master/src/init.c:267
jl_exit_thread0_cb at /cache/build/builder-amdci5-0/julialang/julia-master/src/signals-unix.c:508
Allocations: 1 (Pool: 1; Big: 0); GC: 0

This happened in #55233, which is basically an NFC (no functional change) and doesn't touch the scheduler, so I think it's unlikely to be related to that PR.

@d-netto d-netto added bug Indicates an unexpected problem or unintended behavior ci Continuous integration labels Jul 24, 2024
@d-netto
Member Author

d-netto commented Jul 24, 2024

This test runs inside rr, so there might be a trace uploaded somewhere?

CC: @DilumAluthge who might know.

@vtjnash
Sponsor Member

vtjnash commented Jul 24, 2024

You missed the assertion text in your copy. It is this:

julia: /cache/build/builder-amdci5-0/julialang/julia-master/src/scheduler.c:452: ijl_task_get_next: Assertion `__extension__ ({ __auto_type __atomic_load_ptr = (&ptls->sleep_check_state); __typeof__ (*__atomic_load_ptr) __atomic_load_tmp; __atomic_load (__atomic_load_ptr, &__atomic_load_tmp, (memory_order_relaxed)); __atomic_load_tmp; }) == not_sleeping' failed.

When a signal causes a thread to resume, we also need to force it back into the not_sleeping state and increment nrunning. This is similar to #54721, but it also needs to happen when the signal response is to terminate the process directly (such as in jl_task_frame_noreturn), not just when it throws an InterruptException. I am not entirely certain that we can keep the nrunning counter accurate in this case, but it probably shouldn't matter, as we should be attempting to tear down the process fairly aggressively rather than waiting for nrunning to go to zero (though someone could trick it by calling wait() from their atexit hook such that it cannot exit).
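To make the state repair described above concrete, here is a minimal sketch in C11 atomics. It is a toy model, not the Julia runtime's code: only the names sleep_check_state, not_sleeping, and nrunning come from this thread, while `model_ptls_t`, the `sleeping` value, and every `model_*` function are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state values; the real runtime's encoding may differ. */
enum { not_sleeping = 0, sleeping = 1 };

/* Hypothetical stand-in for the per-thread state referenced in the assertion. */
typedef struct {
    _Atomic int sleep_check_state;
} model_ptls_t;

/* Model of the global count of threads that are currently running. */
static _Atomic int nrunning = 1;

/* The condition the failing assertion checks on entry to the scheduler. */
static bool model_scheduler_precondition(model_ptls_t *ptls)
{
    return atomic_load_explicit(&ptls->sleep_check_state,
                                memory_order_relaxed) == not_sleeping;
}

/* A thread parks itself: mark it sleeping and drop it from the running count. */
static void model_go_to_sleep(model_ptls_t *ptls)
{
    assert(model_scheduler_precondition(ptls)); /* only an awake thread may park */
    atomic_store_explicit(&ptls->sleep_check_state, sleeping,
                          memory_order_relaxed);
    atomic_fetch_sub_explicit(&nrunning, 1, memory_order_relaxed);
}

/* A signal resumed the thread: if it was still marked sleeping, force it back
 * to not_sleeping and restore the running count. Returns true if a repair
 * was needed, false if the thread was already marked awake. */
static bool model_resume_after_signal(model_ptls_t *ptls)
{
    int expected = sleeping;
    if (atomic_compare_exchange_strong_explicit(&ptls->sleep_check_state,
                                                &expected, not_sleeping,
                                                memory_order_relaxed,
                                                memory_order_relaxed)) {
        atomic_fetch_add_explicit(&nrunning, 1, memory_order_relaxed);
        return true;
    }
    return false;
}
```

In this model, a thread that is woken by a signal and goes on to run an atexit hook without calling `model_resume_after_signal` first would re-enter the scheduler with `model_scheduler_precondition` false, which mirrors the assertion failure in the backtrace. The compare-and-swap keeps the repair idempotent, so calling it on every signal path cannot increment the counter twice.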

@d-netto
Member Author

d-netto commented Jul 24, 2024

Ah, OK. Thanks for the clarification.

Suspect it's fine to close then?

@DilumAluthge
Member

> This test runs inside rr, so there might be a trace uploaded somewhere?

Yeah, if you follow the link to Buildkite, you can click on the "Artifacts" tab, and then you can download the rr trace.

It might be split across multiple parts that you need to combine back together.

@giordano
Contributor

This error has been happening at a high rate lately.

vtjnash added a commit that referenced this issue Aug 9, 2024
vtjnash added a commit that referenced this issue Aug 14, 2024
vtjnash added a commit that referenced this issue Aug 14, 2024
vtjnash added a commit that referenced this issue Aug 15, 2024
Fixes #55235

Disables the assertion failure in the scheduler, so that we are more
likely to be able to report the underlying failure and run atexit
handlers successfully. This should clean up some of the error messages
that occur on timeout.
```
julia> sleep(5)
^\
[89829] signal 3: Quit: 3
in expression starting at REPL[1]:1
kevent at /usr/lib/system/libsystem_kernel.dylib (unknown line)
unknown function (ip: 0x0)
Allocations: 830502 (Pool: 830353; Big: 149); GC: 1
Quit: 3
```
lazarusA pushed a commit to lazarusA/julia that referenced this issue Aug 17, 2024