CI failures for ubuntu on 1.10 and nightly, probably out-of-memory issues #2441
Comments
After "wall of chambers" comes
(copied from a different run), so I am not too surprised that this might get killed. Same for the second log you posted, where the next test would be
|
See #2570 for some more examples of this. This happens not only with julia 1.8 but also with 1.6 and 1.10. So far we have seen this happen:
Feel free to add more examples. |
Is it possible to get more information? Like did the process OOM or did it take too long? |
I can try to reduce the runtime of the test. But I think that more than 50% of it is already compilation time. |
I think the process was killed due to an OOM situation. According to the timestamps in the log (can be turned on via the settings knob in the top right of that page) this was about 1.5 minutes after the
Maybe you can adapt the code to use a custom seeded rng, e.g.
For the default julia rng knowing the seed would probably not really help as there might be an unknown + non-deterministic number of other uses which change the rng. |
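A custom seeded rng, as suggested in the previous comment, could look roughly like the following minimal sketch. It assumes the test can pass an explicit rng object to every random call. The seed value and the helper name `random_test_input` are hypothetical and not taken from the Oscar test suite.

```julia
using Random

# Hypothetical helper: draws test data from an explicitly passed rng,
# so other users of the global rng cannot change the result.
function random_test_input(rng::AbstractRNG, n::Int)
    return rand(rng, -10:10, n)
end

seed = 1234                     # hypothetical seed; log it so a failing run can be replayed
rng = MersenneTwister(seed)
a = random_test_input(rng, 5)

# Re-creating the rng from the same seed reproduces the same data.
b = random_test_input(MersenneTwister(seed), 5)
println(a == b)
```

Printing the seed in the CI log would then be enough to replay a failing input locally, independent of how often the global rng is used elsewhere.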
I cannot say much towards the detailed cause or a possible resolution. However, I can share a few insights onto
As said, I am not sure about the cause. But maybe this information helps @benlorenz to guess/track down the cause for this? |
The original report in this ticket was from the first of June; the last change in the line_bundle_cohomologies test file was just a week earlier: #2396. Please note that this is not a segfault; the process seems to run out of memory. |
Another new case:
from the OscarCI tests in GAP: https://github.com/oscar-system/GAP.jl/actions/runs/5608123338/jobs/10260113884#step:7:1471
The next testgroup after CoherentSheaves.jl would be elliptic_surface.jl, which also needs quite a lot of memory allocations:
|
@benlorenz I stand corrected. Thank you for reminding me of this. I could narrow this down to a single line. This line would compute all monomials of a certain degree in a certain graded ring (a.k.a. a
So a fix could be to just remove this line from the tests. A corresponding PR is here: #2579. Alternatively, we could of course dive deeper and see if the homogeneous component computation could be improved. If the latter is interesting, it can be reproduced as follows:
julia> ray_generators = [[1, 0, 0,-2,-3], [0, 1, 0,-2,-3], [0, 0, 1,-2,-3], [-1,-1,-1,-2,-3], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1], [0, 0, 0,-2,-3]];
julia> max_cones = [[1, 2, 3, 5, 6], [1, 2, 3, 5, 7], [1, 2, 3, 6, 7], [2, 3, 4, 5, 6], [2, 3, 4, 5, 7], [2, 3, 4, 6, 7], [1, 3, 4, 5, 6], [1, 3, 4, 5, 7], [1, 3, 4, 6, 7], [1, 2, 4, 5, 6], [1, 2, 4, 5, 7], [1, 2, 4, 6, 7]];
julia> weierstrass_over_p3 = normal_toric_variety(ray_generators, max_cones; non_redundant = true);
julia> R = cox_ring(weierstrass_over_p3)
Multivariate polynomial ring in 7 variables over QQ graded by
x1 -> [1 0]
x2 -> [1 0]
x3 -> [1 0]
x4 -> [1 0]
x5 -> [0 2]
x6 -> [0 3]
x7 -> [-4 1]
Of this ring
julia> l4 = anticanonical_bundle(weierstrass_over_p3);
julia> b = basis_of_global_sections(l4);
julia> length(b) == 4551
true
Let me cc @wdecker regarding the computation of homogeneous components, just in case. |
The elliptic surface stuff is a really complex data structure and computation. We worked hard to get this test running in 2 minutes. I am afraid that the test cannot be simplified much further. Plus it takes more time for compilation than for computation right now. Not sure what can be done about the allocations though. There is probably room for improvement all over the place. |
This looks like degree To me this feels like integer points in a polyhedron given by some non-negative constraints and two equations:
Note that here we have If you really wanted (0,6) you could easily change this and get the original number back:
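For degree (0, 6) and the grading printed for the Cox ring earlier in this thread, the lattice point count can be checked by brute force in plain Julia, without Oscar. This is only a sketch of the counting argument, not how `basis_of_global_sections` computes its result. The function name `count_monomials` is made up for illustration.

```julia
# Count exponent vectors (a1, ..., a7) >= 0 satisfying the two degree equations
#   a1 + a2 + a3 + a4 - 4*a7 == d1
#   2*a5 + 3*a6 + a7         == d2
# which come from the grading x1..x4 -> [1 0], x5 -> [0 2], x6 -> [0 3], x7 -> [-4 1].
function count_monomials(d1::Int, d2::Int)
    total = 0
    for a6 in 0:div(d2, 3), a5 in 0:div(d2 - 3a6, 2)
        a7 = d2 - 2a5 - 3a6          # forced by the second equation, always >= 0 here
        s = d1 + 4a7                 # required value of a1 + a2 + a3 + a4
        s < 0 && continue
        total += binomial(s + 3, 3)  # number of (a1, a2, a3, a4) >= 0 with sum s
    end
    return total
end

println(count_monomials(0, 6))   # 4551, matching length(b) above
```

The seven admissible triples (a5, a6, a7) contribute 2925 + 969 + 165 + 1 + 455 + 35 + 1 = 4551 monomials, so the size of the basis is genuine and not an artifact of the implementation.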
|
|
Yes it could be an OOM. I did 200 test runs on a server and each run took between 37-43.7 seconds. But why do we OOM? Does the bot have so little memory? Or do we leak and it piles up? |
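One way to tell a small machine apart from memory that piles up over the run is to print a memory summary after each test group. The following is a minimal sketch using only functions from Julia's `Base` and `Sys`. The helper name `report_memory` and the allocation used for the demonstration are hypothetical.

```julia
# Print a one-line memory summary; intended to be called after each test group.
function report_memory(label::AbstractString)
    mib(x) = round(x / 2^20; digits = 1)
    println(label,
            ": live_bytes = ", mib(Base.gc_live_bytes()), " MiB",
            ", maxrss = ", mib(Sys.maxrss()), " MiB",
            ", free = ", mib(Sys.free_memory()), " MiB",
            ", total = ", mib(Sys.total_memory()), " MiB")
end

report_memory("before")
x = [zeros(10^6) for _ in 1:10]   # allocate roughly 80 MB of Float64 data
report_memory("after allocation")
x = nothing
GC.gc()
report_memory("after GC")
```

If `maxrss` keeps growing across test groups while `live_bytes` after a full collection stays flat, that would point towards the collector not running often enough rather than towards a leak in the tests.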
Concerning
the tests that take most of the reported 5.8 seconds are |
There is really something bad going on with 1.10 and nightly. Previously this happened infrequently, but for the past few days (at most one week) it has been happening all the time. I started a matrix job for 1.10 and nightly with GC logging and got this:
Log: https://github.com/oscar-system/Oscar.jl/actions/runs/5689868672/job/15422159399#step:6:3171 There are plenty more failed runs in this job. |
I noticed one further thing in the GC output. In my first run (e.g this one) of these 16 jobs for every one of them the GC
In some cases, like above, there is also the weird huge But disabling these tests did not seem to help, the new job also failed for all 16 runs: https://github.com/oscar-system/Oscar.jl/actions/runs/5690757887 |
The GC uses some allocation heuristics and GC heuristics to decide the next target and they are usually pretty good, but it turns out sometimes something happens that makes them go crazy. |
I've now submitted JuliaLang/julia#50705 to ensure upstream is aware and can add this to the 1.10 milestone |
Supposedly JuliaLang/julia#50682 may help with this. We'll see! |
I tried with JuliaLang/julia#50682 and Hecke (which has the same problem). We still have the same problem, but only later. Right before the OOM killer did its job, I saw:
This is on a machine with 32 GB of memory (of which 21 were available). Not sure about those numbers, but how can 92 GB live bytes be possible? |
For the record, this happens on julia 1.6 as well. See https://github.com/oscar-system/Oscar.jl/actions/runs/5714074722/job/15480670050?pr=2609 |
That test is disabled for now (#2579); it also happened on 1.8. But that was rare, in contrast to the constant failures on nightly and 1.10 much earlier in our testsuite. |
@benlorenz Can you reproduce it locally (w/o JuliaLang/julia#50682)? |
I haven't really tested without that PR. I have run a bunch of tests with that PR merged and it reliably crashes when running it in a memory-constrained cgroup. A problem with that is that malloc behaves slightly differently in such a cgroup than when the memory is really full (it doesn't return
I have run various tests in the GitHub CI (7GB RAM on Ubuntu) to find a value for the heap size hint that works, maybe around 3.5GB (this is with plain nightly / 1.10.0-beta1). For comparison: julia 1.9.2 runs through the whole testsuite (about 80 minutes) with an 8GB heap-size-hint in an 8.5GB cgroup, while julia nightly does not succeed in a 10.5GB cgroup with a hint of 8GB. I think there are several things contributing here:
[1]: julia> using Test
julia> using Oscar
----- ----- ----- - -----
| | | | | | | | | |
| | | | | | | |
| | ----- | | | |-----
| | | | |-----| | |
| | | | | | | | | |
----- ----- ----- - - - -
...combining (and extending) ANTIC, GAP, Polymake and Singular
Version 0.13.0-DEV ...
... which comes with absolutely no warranty whatsoever
Type: '?Oscar' for more information
(c) 2019-2023 by The OSCAR Development Team
julia> GC.enable_logging(true)
julia> GC.gc(false); GC.gc(true)
GC: pause 187.85ms. collected 281.680916MB. incr
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.88 MB, heap_size 490.38 MB, heap_target 752.88 MB, live_bytes 341.70 MB
, Fragmentation 0.697GC: pause 86.89ms. collected 0.282654MB. full recollect
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.75 MB, heap_size 490.25 MB, heap_target 752.75 MB, live_bytes 382.11 MB
, Fragmentation 0.779GC: pause 386.24ms. collected 1.028664MB. incr
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.62 MB, heap_size 489.22 MB, heap_target 751.72 MB, live_bytes 381.95 MB
, Fragmentation 0.781
julia> @test GAP.Globals.TestDirectory(GAP.Globals.DirectoriesPackageLibrary(GAP.Obj("OscarInterface"), GAP.Obj("tst")))
Architecture: x86_64-pc-linux-gnu-julia1.11-64-kv8
testing: /home/lorenz/software/polymake/julia/Oscar.jl/src/../gap/OscarInterface//tst/alnuth/ALNUTH.tst
# line 1 of 86 (1%)GC: pause 85.86ms. collected 0.753052MB. full recollect
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.62 MB, heap_size 489.23 MB, heap_target 751.73 MB, live_bytes 380.92 MB
, Fragmentation 0.779GC: pause 368.91ms. collected 0.018028MB. incr
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.61 MB, heap_size 489.21 MB, heap_target 751.71 MB, live_bytes 380.97 MB
# line 18 of 86 (20%)GC: pause 99.35ms. collected 333.437424MB. incr
Heap stats: bytes_mapped 512.12 MB, bytes_resident 500.62 MB, heap_size 676.11 MB, heap_target 982.36 MB, live_bytes 380.95 MB
, Fragmentation 0.563GC: pause 126.04ms. collected 522.456375MB. incr
Heap stats: bytes_mapped 704.17 MB, bytes_resident 670.53 MB, heap_size 875.80 MB, heap_target 1313.30 MB, live_bytes 413.54 MB
, Fragmentation 0.472GC: pause 216.44ms. collected 754.326622MB. incr
Heap stats: bytes_mapped 1024.25 MB, bytes_resident 964.05 MB, heap_size 1164.24 MB, heap_target 1689.24 MB, live_bytes 454.10 MB
# line 81 of 86 (94%)GC: pause 230.63ms. collected 1098.841599MB. incr
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 1126.45 MB, heap_size 1333.83 MB, heap_target 1683.83 MB, live_bytes 551.80 MB
44871 ms (1092 ms GC) and 11.2MB allocated for alnuth/ALNUTH.tst
testing: /home/lorenz/software/polymake/julia/Oscar.jl/src/../gap/OscarInterface//tst/alnuth/examples.t\
st
# line 1 of 271 (0%)GC: pause 146.92ms. collected 81.539536MB. full recollect
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 1102.11 MB, heap_size 1307.03 MB, heap_target 1744.53 MB, live_bytes 585.39 MB
, Fragmentation 0.448GC: pause 390.70ms. collected 155.567131MB. incr
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 803.80 MB, heap_size 994.14 MB, heap_target 1387.89 MB, live_bytes 574.64 MB
6042 ms (561 ms GC) and 15.3MB allocated for alnuth/examples.tst
testing: /home/lorenz/software/polymake/julia/Oscar.jl/src/../gap/OscarInterface//tst/alnuth/manual.tst
# line 1 of 64 (1%)GC: pause 160.84ms. collected 188.842827MB. full recollect
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 787.28 MB, heap_size 1011.43 MB, heap_target 1405.18 MB, live_bytes 419.07 MB
, Fragmentation 0.414GC: pause 356.13ms. collected 0.587646MB. incr
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 787.28 MB, heap_size 1010.73 MB, heap_target 1404.48 MB, live_bytes 479.72 MB
818 ms (592 ms GC) and 3.63MB allocated for alnuth/manual.tst
testing: /home/lorenz/software/polymake/julia/Oscar.jl/src/../gap/OscarInterface//tst/alnuth/polynome.t\
st
# line 2 of 47 (4%)GC: pause 122.30ms. collected 23.319710MB. full recollect
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 787.28 MB, heap_size 1008.29 MB, heap_target 1402.04 MB, live_bytes 479.13 MB
, Fragmentation 0.475GC: pause 334.29ms. collected 32.628418MB. incr
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 787.28 MB, heap_size 1002.85 MB, heap_target 1396.60 MB, live_bytes 499.14 MB
# line 16 of 47 (34%)GC: pause 193.51ms. collected 294.139137MB. incr
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 791.02 MB, heap_size 1299.46 MB, heap_target 1736.96 MB, live_bytes 466.51 MB
40842 ms (925 ms GC) and 392MB allocated for alnuth/polynome.tst
-----------------------------------
total 92573 ms (3170 ms GC) and 423MB allocated
0 failures in 4 files
#I No errors detected while testing
GC: pause 109.34ms. collected 542.099785MB. incr
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 804.59 MB, heap_size 1525.13 MB, heap_target 1787.63 MB, live_bytes 1611.86 MB
, Fragmentation 1.057Test Passed
julia> GC.gc(false); GC.gc(true)
GC: pause 25.73ms. collected 29.697510MB. incr
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 785.30 MB, heap_size 1092.44 MB, heap_target 1486.19 MB, live_bytes 8130.76 MB
, Fragmentation 7.443GC: pause 71.46ms. collected 76.772232MB. full recollect
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 785.30 MB, heap_size 1092.44 MB, heap_target 1486.19 MB, live_bytes 8120.73 MB
, Fragmentation 7.434GC: pause 282.94ms. collected 7.942421MB. incr
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 784.48 MB, heap_size 1091.55 MB, heap_target 1485.30 MB, live_bytes 8043.96 MB
, Fragmentation 7.369
Note that during the GC call after the tests were done the |
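In the log above, the reported Fragmentation value agrees with live_bytes divided by heap_size (for example 8130.76 / 1092.44 ≈ 7.443), so a value above 1 means that the live_bytes counter exceeds the whole heap. This reading is inferred from the numbers in the log, not from the Julia documentation. The following sketch recomputes the ratio from a "Heap stats" line. The helper name `heap_ratio` is made up for illustration.

```julia
# Extract heap_size and live_bytes (both in MB) from a "Heap stats" log line
# and return live_bytes / heap_size, or nothing if the line does not match.
function heap_ratio(line::AbstractString)
    m = match(r"heap_size ([\d.]+) MB.*live_bytes ([\d.]+) MB", line)
    m === nothing && return nothing
    heap_size  = parse(Float64, m.captures[1])
    live_bytes = parse(Float64, m.captures[2])
    return live_bytes / heap_size
end

# Line copied from the log above (first GC call after the tests finished).
line = "Heap stats: bytes_mapped 1344.33 MB, bytes_resident 785.30 MB, heap_size 1092.44 MB, heap_target 1486.19 MB, live_bytes 8130.76 MB"
r = heap_ratio(line)
println(round(r; digits = 3))
r > 1 && println("suspicious: live_bytes exceeds heap_size")
```

Running such a check over a CI log would make it easy to spot the point in the test suite where the live_bytes accounting starts to drift.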
I can't reproduce that, which is really odd. |
The
This happens reliably on two (very) different Linux machines with an official nightly binary (and also with a completely fresh julia depot). I will try some debugging. |
So it seems there are a couple of issues. I fixed a bug on the PR but it still seems that our |
Unfortunately, JuliaLang/julia#50682 being present in julia nightly didn't fix this issue. Any idea on how to proceed? |
We should mention this over at JuliaLang/julia#50705 |
I'll play with it a bit more, see if I can figure something out. |
Did you run on GitHub Actions or with a local cgroup? I ask because it does seem that the testset is very close to needing 7GB, at least when running locally. And from looking at it, it does seem like the GC is running, at least locally. It's just not able to free stuff. |
What's the status of this? Have the changes in Julia fixed this? Or did we end up disabling some tests (which?) to work around it? In other words: does this need to stay open? If so, it would be good to have an explicit pointer to whatever disabled tests can be used to repro the issue. |
There are no disabled tests.
I don't think this needs to stay open. |
Describe the bug
In the past few days, I noticed the test-ubuntu-1.8 job terminating unexpectedly mid-way. A restart of the test always succeeded.
See e.g. https://github.com/oscar-system/Oscar.jl/actions/runs/5145822885/jobs/9263999579 and https://github.com/oscar-system/Oscar.jl/actions/runs/5133045620/jobs/9235131573