Use RPTU local runners instead of GitHub-hosted Linux runners #3986

Open
wants to merge 10 commits into master
Conversation

@aaruni96
Member

No description provided.

@lgoettgens added the CI label Jul 31, 2024
@aaruni96
Member Author

This PR replaces the GitHub-hosted runners with the RPTU runners. All tests are expected to run much faster, but if the RPTU runners ever go offline, testing will block or fail until they are back online.

Is it worth trying to figure out how to use GitHub-hosted runners as an automatic fallback?

    if: runner.os == 'macOS' && runner.environment == 'self-hosted'
    # runner.environment is supposed to be a valid property: https://github.com/orgs/community/discussions/48359#discussioncomment-9059557
    run: echo "NUMPROCS=5" >> $GITHUB_ENV
  - name: "Use multiple processes for self-hosted Linux runners"
    if: runner.os == 'Linux' && runner.environment == 'self-hosted'
    run: echo "NUMPROCS=6" >> $GITHUB_ENV
Member

Can't this be made part of the runner config instead?

Member Author

There's no unified way of doing this (that I know of); one would have to set the environment variable on each runner, and then if/when we update it, that will have to be changed manually for each runner.

Member

Can't we use https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/running-scripts-before-or-after-a-job to set the NUMPROCS environment variable on the runners? IMHO that's really where this setting belongs (and then we can decide on the runners how many cores we want to assign).
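
For reference, a minimal sketch of that mechanism, assuming a hypothetical script path: the runner reads the hook variable from a .env file in its installation directory, and the linked docs state hook scripts may append to the job's environment files such as $GITHUB_ENV.

    # .env in the runner's installation directory
    ACTIONS_RUNNER_HOOK_JOB_STARTED=/opt/actions-runner/hooks/job-started.sh

    # /opt/actions-runner/hooks/job-started.sh (hypothetical path)
    #!/usr/bin/env bash
    # Runs on this machine before every job; the value can differ per runner.
    echo "NUMPROCS=6" >> "$GITHUB_ENV"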

Member Author

71cc52c resolves this, right, @fingolfin?


codecov bot commented Jul 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.57%. Comparing base (c146706) to head (71cc52c).
Report is 39 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3986      +/-   ##
==========================================
- Coverage   84.58%   84.57%   -0.02%     
==========================================
  Files         631      631              
  Lines       84831    85054     +223     
==========================================
+ Hits        71757    71935     +178     
- Misses      13074    13119      +45     

see 48 files with indirect coverage changes

@micjoswig
Member

Maybe a fully automatic setup is unnecessary? But being able to switch manually (with little effort) would be great, I suppose.

@aaruni96
Member Author

Reverted the unnecessary cosmetic changes and let the CI run this time. However, the "required" tests will no longer run (as they are required on ubuntu-latest, which we don't test anymore). We will have to change this once you think it's okay, @fingolfin.

@aaruni96
Member Author

The unit file for the runner now controls NUMPROCS, and also adds system limits of 600% CPU (corresponding to NUMPROCS=6) and 8 GB of RAM (MemoryMax is set to 8G; systemd should trigger its OOM handling if the runner reaches 8 GB of active RAM usage). We can tweak these settings to whatever is required. In principle, we have ~750 GB of RAM available over 128 cores, so a "fair" breakdown allows us to give up to ~35 GB of memory per 6 cores.
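
As a sketch (the unit name and exact layout here are assumptions, not the actual file), the resource-control part of such a systemd service could look like this:

    # oscar-runner.service (hypothetical name), [Service] section excerpt
    [Service]
    # Exported to every job this runner executes
    Environment="NUMPROCS=6"
    # 600% = six full cores on this machine
    CPUQuota=600%
    # systemd's OOM handling kicks in once the cgroup reaches this limit
    MemoryMax=8G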

@aaruni96
Member Author

Looks like 8 GB is far too low a limit. Runners need some amount of memory per core: watching htop during testing suggests a little more than 4 GB per core, so ~24 GB per runner for 6 cores. However, monitoring the memory usage of the runner plus all subprocesses as a whole, I recorded a peak usage of 40.8 GB.

@fingolfin
Member

Let's assign 30 GB per group and see where it leads us.

@aaruni96
Member Author

aaruni96 commented Oct 25, 2024

Capping memory at 30 GB, it seems to run without problems, with peak memory recorded at 29.94 GB. (I am running the 1.10 / long test suite.)

@fingolfin
Member

Everything looks fine now, but I won't merge this on a Friday. Will get back to it next week.

@benlorenz
Member

Can we also do at least one full run (of all jobs), please? Right now some jobs still show, e.g., "Successful in 119m" because they ran before the resource changes. Especially 1.11 and nightly are rather important regarding memory, as they tend to be a bit more demanding, and they haven't been re-run as far as I can see.

@benlorenz
Member

I think we need a bit more memory; the short testgroup is by now somewhat more demanding than the long one.

Workers

There is output like this (in 1.11 short):

      From worker 3:	GC: pause 482846.73ms. collected 1.737617MB. incr 
      From worker 3:	Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1271.83 MB,
      From worker 3:	heap_size 1892.75 MB, heap_target 2641.51 MB, Fragmentation 0.585

The longest pauses on this branch (RPTU runner) for 1.11 and nightly:
(Only from the workers, excluding the pauses during teardown, see below)

2024-10-26T10:48:13.3884710Z       From worker 5:	GC: pause 100716.38ms. collected 35.932709MB. incr 
2024-10-26T10:48:16.6525456Z       From worker 7:	GC: pause 101767.34ms. collected 620.466560MB. incr 
2024-10-26T11:07:43.0083398Z       From worker 3:	GC: pause 482846.73ms. collected 1.737617MB. incr 

2024-10-26T11:07:50.2651862Z       From worker 2:	GC: pause 2298.29ms. collected 882.389534MB. incr 
2024-10-26T11:15:55.3986915Z       From worker 5:	GC: pause 47293.04ms. collected 514.686832MB. incr 
2024-10-26T11:15:04.6189583Z       From worker 5:	GC: pause 113512.48ms. collected 1164.965448MB. full 

On the current master (GitHub runner):

2024-10-26T08:54:31.7158720Z GC: pause 1986.41ms. collected 0.003098MB. incr 
2024-10-26T09:13:04.8278705Z GC: pause 2592.45ms. collected 294.604805MB. incr 
2024-10-26T08:45:19.9139584Z GC: pause 2980.01ms. collected 1435.183800MB. incr 

2024-10-26T09:15:04.8216055Z GC: pause 3385.56ms. collected 1.666733MB. incr 
2024-10-26T09:36:29.7037000Z GC: pause 4112.78ms. collected 311.601254MB. incr 
2024-10-26T08:57:39.2085587Z GC: pause 4847.97ms. collected 1272.886726MB. incr 

Master process

There are also some very long pauses at the end during teardown of the workers, and the workers are not terminating properly; maybe they are also in a GC pause and get killed before it completes:

GC: pause 48326.45ms. collected 256.090321MB. incr 
Heap stats: bytes_mapped 448.11 MB, bytes_resident 413.20 MB,
heap_size 632.07 MB, heap_target 864.12 MB, Fragmentation 0.665
┌ Warning: Forcibly interrupting busy workers
│   exception = rmprocs: pids [2, 3, 4, 5, 6, 7] not terminated after 5.0 seconds.
└ @ Distributed ~/oscar-runners/runner-08/_work/_tool/julia/nightly/x64/share/julia/stdlib/v1.12/Distributed/src/cluster.jl:1253
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed ~/oscar-runners/runner-08/_work/_tool/julia/nightly/x64/share/julia/stdlib/v1.12/Distributed/src/cluster.jl:1049

GC: pause 74448.68ms. collected 72.181580MB. full recollect
Heap stats: bytes_mapped 448.11 MB, bytes_resident 413.20 MB,
heap_size 628.77 MB, heap_target 864.12 MB, Fragmentation 0.673

GC: pause 34306.09ms. collected 5.858597MB. incr 
Heap stats: bytes_mapped 448.11 MB, bytes_resident 413.20 MB,
heap_size 624.77 MB, heap_target 959.36 MB, Fragmentation 0.676
     Testing Oscar tests passed 

These pauses are quite weird, as they come from the master process, which probably did not do any testing. Around 500 MB is very little memory, and still there are several minutes of GC pauses.

How does this MemoryMax behave regarding swap space? Are the processes allowed to be swapped out to stay below this limit (this could explain the long pauses at the end...)?

In comparison, the nightly job on the GitHub runner for the current master branch:

GC: pause 1320.36ms. collected 2878.824236MB. full recollect
Heap stats: bytes_mapped 5505.34 MB, bytes_resident 5205.77 MB,
heap_size 7844.25 MB, heap_target 8270.59 MB, Fragmentation 0.860

GC: pause 4112.78ms. collected 311.601254MB. incr 
Heap stats: bytes_mapped 5505.34 MB, bytes_resident 4973.03 MB,
heap_size 7543.47 MB, heap_target 8216.91 MB, Fragmentation 0.630
     Testing Oscar tests passed 

There is also a bit of a pause at the end, but only a few seconds versus a few minutes.

@aaruni96
Member Author

aaruni96 commented Oct 26, 2024

Done. The "1.11-nightly, short" and "nightly, short" jobs took much longer than the others and were constantly running up against the memory limit, but completed.

Memory: 29.9G (high: 30.0G available: 288.0K)

Edit: looks like only "1.11-nightly, short" took significantly longer (60+ minutes); the short tests otherwise all take around 40 minutes on Linux. This looks suspicious, given that short tests take only 13 minutes on macOS, and took only 20 minutes before we added memory limits.

@aaruni96
Member Author

How does this MemoryMax behave regarding swap space? Are the processes allowed to be swapped out to stay below this limit (this could explain the long pauses at the end...)?

Yep. We would also need to set MemorySwapMax if we want to limit the amount of swapping that is allowed, but then if a runner reaches the swap limit, things might start dying because of OOM?
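
If we go that way, a sketch of the relevant directives (values are placeholders, not what is deployed): MemorySwapMax=0 forbids swap for the unit's cgroup entirely, so exceeding MemoryMax triggers OOM handling rather than thrashing in swap.

    [Service]
    MemoryMax=30G
    # No swap for this cgroup: fail fast via OOM rather than stalling
    # in swap (the likely cause of the long GC pauses above).
    MemorySwapMax=0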

@fieker removed the triage label Nov 6, 2024
@fieker
Contributor

fieker commented Nov 6, 2024

with @aaruni96

@@ -37,17 +37,20 @@ jobs:
- 'nightly'
group: [ 'short', 'long' ]
os:
- ubuntu-latest
- [Linux, RPTU]
Member

Can we do that? (I am guessing not, but perhaps worth a try.)

Suggested change
- [Linux, RPTU]
- [Linux, RPTU, ${{ matrix.group }}]

Member

While this may not work, note that we use the os field for just one thing (I believe), namely in line 28: runs-on: ${{ matrix.os }}.

So in theory it should be possible to use a more complex expression there which takes both matrix.os and matrix.group into account. You may have to rethink what kind of string is put into matrix.os, though.

Member

Perhaps change the matrix.os values to just plain Linux or macOS and then change runs-on: ${{ matrix.os }} to

runs-on: [ ${{ matrix.os }}, RPTU, ${{ matrix.group }} ]

Member

What I really would want to do is:

  • test whether matrix.os is a string or a "list" (??)
  • if it is a string, pass it on
  • if it is a list, append "group" to the list and pass it on

However, the GitHub documentation on expressions is... "sparse", to put it kindly. So I have no idea if any of these are even possible. The docs never mention what happens if os: [a, b] is interpolated into an expression like runs-on: ${{ matrix.os }} -- is it treated as a string, or an object? If it is an object, you'd expect some functionality to interact with it, but if that exists, it is either undocumented or I just can't find the documentation :-(

Member

(OK, to be fair, the docs do suggest that it may be possible to write ${{ matrix.os[42] }} to access the object at position 41 (0-indexed? 1-indexed?) of matrix.os if it is a list. But still no idea how to find out whether it is a list (which we could avoid if we just make sure it is always one), nor how to append to it.)

Member Author

In the case of os: [one, dimensional, list], matrix.os is one of those items (one, or dimensional, or list) at any given time.

In the case of

os:
  - [first, part, of, 2d, list]
  - [second, part, of, 2d, list]
  ...

matrix.os is an entire list item ([first, part, of, 2d, list]), as a YAML array.

Member

OK, I had to search for "array", and the docs do mention a bit more: we can also use contains to check whether an array contains some element, and join to turn an array into a string. Then there are also fromJSON and format... So one could do this evil thing: take the list, turn it into a string via join using commas as separators; use format to "append" to this flattened string, and put [ and ] around it. This now is valid JSON, and we can use fromJSON to turn it back into the desired list with one element added.

Of course, we will also be at the top of the list of people to be exterminated when the machines rise.
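
Spelled out, that trick could look roughly like the following (untested sketch; quoting each element via the join separator is an assumption to make the flattened string valid JSON):

    # e.g. matrix.os = [Linux, RPTU], matrix.group = short
    # join(...)   yields: Linux", "RPTU
    # format(...) yields: ["Linux", "RPTU", "short"]
    runs-on: ${{ fromJSON(format('["{0}", "{1}"]', join(matrix.os, '", "'), matrix.group)) }}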

Member Author

runs-on: [ ${{ matrix.os }}, RPTU, ${{ matrix.group }} ]

This idea doesn't work, for reasons I don't understand.

Please look at the CI run for fd433b6: runs-on: [ '${{ matrix.os }}', 'high-memory' ] makes it just ignore the config and run on Linux, RPTU anyway? Here, matrix.os should just be Linux, and matrix.hosted is RPTU, but it's just defined and not used anywhere...

https://github.com/oscar-system/Oscar.jl/actions/runs/11705888151

@aaruni96 force-pushed the ak96/linux-local-runner branch 2 times, most recently from 3d393d4 to fd4285b on November 6, 2024 at 13:46
Attempt 3: Use Max's suggestion of building runs-on "manually"