Crashes or hangs running doctests in CI (possibly related to GAP.jl and/or Singular.jl) #2336
I did see this once with 1.9 as well for Oscar.jl: https://github.com/oscar-system/Oscar.jl/actions/runs/4844111264/jobs/8632169406. For the linux-nightly crash I recently ran this in a bunch of loops until it died and collected core files, which gives this backtrace:
I looked at several core files and the backtraces seem similar (NewBag or ResizeBag from GAP triggering some error in the GC). Another one:
And another one, this time with ResizeBag:
Thanks! So first off, this means the Singular.jl crashes are not directly related. I wonder if you could run a simple loop for Singular.jl master? In case that reveals something, it should probably go into a new issue on the Singular.jl repo.

So the crashes here are in the GAP kernel code which accesses the Julia GC while trying to allocate a GAP object. That's really strange, but it gives me a start; I'll try to see if and what changed in this regard on the Julia side recently... Hmm, e.g. I noticed they introduced a new multithreaded GC mark-loop in https://github.com/JuliaLang/julia/pull/48600, so perhaps we are not taking some lock we should be taking? Then again, your reports all seem to be single-threaded.

Hmm, I wonder if we could bisect the crash on the Julia side... It is a bit difficult, though, because we probably can't say how often the loop has to pass before we can be reasonably sure the crash cause is not present. (To be clear: I am not trying to imply Julia is at fault; it may well be the GAP code; but clearly the crash is revealed by a change on the Julia side. That change may be buggy, or it may just turn an existing bug in GAP from non-lethal into lethal. Dunno.)
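To stress exactly that path — GAP allocations going through the Julia GC — one could run a small loop like the following. This is only a hedged sketch using GAP.jl's public `GAP.evalstr`/`GAP.Globals` entry points, not a reproducer anyone in this thread has confirmed:

```julia
# Hedged sketch: allocate many GAP objects (each allocation goes through the
# GAP kernel's NewBag, which on this build asks the Julia GC for memory)
# while forcing frequent collections on the Julia side.
using GAP

for i in 1:10_000
    # Fresh GAP bags on every iteration.
    GAP.Globals.SymmetricGroup(5)
    GAP.evalstr("List([1..1000], i -> [i, i^2])")
    # Trigger the Julia GC regularly to exercise the interaction.
    i % 100 == 0 && GC.gc()
end
```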
I note that JuliaLang/julia#48600 was merged 4 days ago and the oldest occurrence of the crash I could find is from 3 days ago. Could still be a coincidence, of course.
... and of course I totally missed that @benlorenz reported seeing this once before with Julia 1.9, oops. So, can this be reproduced with older Julia versions then?
@benlorenz I am trying to make sense of the line numbers in those backtraces; could you tell me from which Julia commit that crashing binary was made?
My Julia nightly was a few days old; it seems this was from just before the multithreaded GC mark-loop:
I will have a look to see if I can bisect something and/or run that on Singular as well. I had about 10 doctest loops open; the fastest one crashed after 12 tries, the slowest one took 276 tries. PS: Sorry about the delay, I was busy running around the building too much without power/network.
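For illustration, such a doctest loop might look roughly like the sketch below; the `docs/make.jl` entry point and enabled core dumps are assumptions, not details taken from this thread:

```julia
# Hedged sketch of a doctest crash loop: rerun the doctest build until it
# exits abnormally. Assumes docs/make.jl drives the doctests and that core
# dumps are enabled in the environment so a crash leaves a core file behind.
function doctest_loop()
    for attempt in Iterators.countfrom(1)
        proc = run(ignorestatus(`julia --project=docs docs/make.jl`))
        if !success(proc)
            println("doctest run failed on attempt $attempt ",
                    "(exit code $(proc.exitcode), signal $(proc.termsignal))")
            return attempt
        end
    end
end

doctest_loop()
```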
And here is an example from the CI on April 26: https://github.com/oscar-system/Oscar.jl/actions/runs/4807011160/jobs/8555232756
Edit: Singular crash backtraces moved to oscar-system/Singular.jl#641
Another night, another crash, this time with 1.9.0-rc3 (the official binary for Linux x64). Out of the 10 doctest loops I ran, only one crashed with a core file, but the backtrace looks similar to the one from nightly:
And it looks like I might also have one or two stuck doctest processes; I will attach a debugger later today. Edit:
With the debug output I also used for #2341 it turns out that this is triggered in the
Apart from that, I am still bisecting the Julia commits.
Preliminary result of the somewhat tedious bisect:
I.e. it started with the same change (JuliaLang/julia#49315) that fixed the test timeouts in #2187. But this change is not on 1.9, so maybe it just exposed some other issue; it did increase the GC counts quite a bit, so maybe this simply happens way more often now. I will keep the doctests running a bit longer on the commit before that. Edit:
Is this still relevant, or is it maybe fixed? Any idea, @benlorenz?
I haven't looked at the doctest crashes recently, but there are still crashes from time to time (as usual for the doctests, without any helpful output...):
OK. I wonder if we could arrange things so that we do get more helpful crash logs... Normally Julia prints backtraces even when there is a segfault, e.g.:
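For example (a minimal illustration, not output from this issue), even a plain null-pointer dereference is normally reported with a `signal (11)` line followed by a backtrace before the process dies:

```julia
# Minimal illustration (not taken from the CI logs): dereferencing a null
# pointer segfaults, and Julia normally still prints "signal (11):
# Segmentation fault" plus a backtrace on the way down.
unsafe_load(Ptr{Int}(0))
```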
Why not here? Is the crash so "hard"? Or is Documenter redirecting output somehow?
Documenter is using IOCapture to redirect the output, like this:
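Roughly along these lines — a simplified sketch based on IOCapture.jl's documented `capture` API, not Documenter's exact code:

```julia
using IOCapture

# Simplified sketch of output capturing during a doctest block: everything the
# block writes to stdout/stderr ends up in `c.output` instead of the CI log,
# and ordinary exceptions are captured rather than rethrown.
c = IOCapture.capture(; rethrow=InterruptException) do
    include_string(Main, """println("hello from a doctest block")""")
end

c.output  # the captured text
c.error   # true if the block threw and the exception was captured
```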
Maybe we can switch to a custom Documenter fork that at least prints which blocks are executed, e.g. like I did here:

Or another option would be to record the run with BugReport and, if it fails with a segmentation fault, archive the corresponding files, with
I tried this; a patched Documenter version seems to be working. Crashes so far:
I will add more failures to this post once I see them, and keep retrying that job if the (macOS) runners are not busy otherwise.
I haven't seen any 1.9 crashes in a while (other than the one on macOS, caused by timeouts). The others all seem to be memory crashes. I am not sure if any still have backtraces like we saw here, involving GAP? And even if so, I wonder whether this is now essentially the same as issue #2441?!?
Just after you wrote this, the doctests here crashed for Julia 1.9 (on Ubuntu) with a segfault (and, as usual, no other information).
This still seems to be happening with 1.10, and also during normal tests. The following backtrace looks similar to some of the other backtraces: a crash during free within a sweep that was triggered from a GAP memory allocation. This was in the Polymake.jl downstream tests with Oscar master and Julia 1.10:
Has anyone seen this recently? |
Nobody has reported a recent occurrence. This can be reopened once it appears again.
Recently I've observed, here and also in the Singular.jl repository, some crashes running doctest CI jobs; in other cases these just hang "forever". For Oscar so far I've seen these only with Julia nightly; for Singular.jl also with other Julia versions. Unfortunately there is almost no information in the relevant CI logs. (The crashes in Singular.jl may or may not be related; I just thought I should point out the coincidence, in case there is a connection after all.)
Examples:
Actually there are also cases where the regular tests just "hang" without making any progress; not sure if this might be related or not:
To work on this, it would of course be good to be able to reproduce these; and/or to get the CI to produce more information, like a backtrace from the crash. If anyone has an idea how we might achieve that...
CC @hannes14 @benlorenz