Use RPTU local runners instead of GitHub-hosted Linux runners #3986

Open
wants to merge 10 commits into master
Conversation

@aaruni96
Member

No description provided.

@lgoettgens added the CI label Jul 31, 2024
@aaruni96
Member Author

This PR replaces the GitHub-hosted runners with the RPTU runners. All tests are expected to run much faster, but if the RPTU runners ever go offline, testing will block or fail until they are back online.

Is it worth trying to figure out how to use GitHub-hosted runners as an automatic fallback?

    if: runner.os == 'macOS' && runner.environment == 'self-hosted'
    # runner.environment is supposed to be a valid property: https://github.com/orgs/community/discussions/48359#discussioncomment-9059557
    run: echo "NUMPROCS=5" >> $GITHUB_ENV
  - name: "Use multiple processes for self-hosted Linux runners"
    if: runner.os == 'Linux' && runner.environment == 'self-hosted'
    run: echo "NUMPROCS=6" >> $GITHUB_ENV
Member

Can't this be made part of the runner config instead?

Member Author

There's no unified way of doing this (that I know of); one would have to set the environment variable on each runner, and then if/when we update it, that will have to be changed manually for each runner.

Member

Can't we use https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/running-scripts-before-or-after-a-job to set the NUMPROCS environment variable on the runners? IMHO that's really where this setting belongs (and then we can decide on the runners how many cores we want to assign).
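
For reference, a minimal sketch of that mechanism, assuming a hypothetical script path: the runner reads the hook variable from a .env file in its installation directory, and the linked docs state hook scripts may append to the job's environment files such as $GITHUB_ENV.

    # .env in the runner's installation directory
    ACTIONS_RUNNER_HOOK_JOB_STARTED=/opt/actions-runner/hooks/job-started.sh

    # /opt/actions-runner/hooks/job-started.sh (hypothetical path)
    #!/usr/bin/env bash
    # Runs on this machine before every job; the value can differ per runner.
    echo "NUMPROCS=6" >> "$GITHUB_ENV"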

Member Author

71cc52c resolves this, right, @fingolfin?


codecov bot commented Jul 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.57%. Comparing base (c146706) to head (71cc52c).
Report is 39 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3986      +/-   ##
==========================================
- Coverage   84.58%   84.57%   -0.02%     
==========================================
  Files         631      631              
  Lines       84831    85054     +223     
==========================================
+ Hits        71757    71935     +178     
- Misses      13074    13119      +45     

see 48 files with indirect coverage changes

@micjoswig
Member

Maybe a fully automatic setup is unnecessary? But being able to switch manually (with little effort) would be great, I suppose.

@aaruni96
Member Author

Reverted the unnecessary cosmetic changes and let the CI run this time. However, the "required" tests will no longer run (as they are required on ubuntu-latest, which we don't test anymore). We will have to change this once you think it's okay, @fingolfin.

@aaruni96
Member Author

The unit file for the runner now controls NUMPROCS, and also adds system limits of 600% CPU (corresponding to NUMPROCS=6) and 8 GB of RAM (MemoryMax is set to 8G; systemd should trigger its OOM handling if the runner reaches 8 GB of active RAM usage). We can tweak these settings to whatever is required. In principle, we have ~750 GB of RAM available over 128 cores, so a "fair" breakdown allows us to give up to ~35 GB of memory per 6 cores.
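
As a sketch (the unit name and exact layout here are assumptions, not the actual file), the resource-control part of such a systemd service could look like this:

    # oscar-runner.service (hypothetical name), [Service] section excerpt
    [Service]
    # Exported to every job this runner executes
    Environment="NUMPROCS=6"
    # 600% = six full cores on this machine
    CPUQuota=600%
    # systemd's OOM handling kicks in once the cgroup reaches this limit
    MemoryMax=8G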

@aaruni96
Member Author

Looks like 8 GB is far too low a limit. Runners need some amount of memory per core: watching htop during testing suggests a little more than 4 GB per core, so ~24 GB per runner for 6 cores. However, monitoring the memory usage of the runner plus all subprocesses as a whole, I recorded a peak usage of 40.8 GB.

@fingolfin
Member

Let's assign 30 GB per group and see where it leads us.

@aaruni96
Member Author

aaruni96 commented Oct 25, 2024

Capping memory at 30 GB, it seems to run without problems, with peak memory recorded at 29.94 GB. (I am running the 1.10 / long test suite.)

@fingolfin
Member

Everything looks fine now, but I won't merge this on a Friday. Will get back to it next week.

@benlorenz
Member

Can we also do at least one full run (of all jobs), please? Right now some jobs still show, e.g., "Successful in 119m" because they ran before the resource changes. Especially 1.11 and nightly are rather important regarding memory, as they tend to be a bit more demanding, and they haven't been re-run as far as I can see.

@benlorenz
Member

I think we need a bit more memory; the short testgroup is by now somewhat more demanding than the long one.

Workers

There is output like this (in 1.11 short):

      From worker 3:	GC: pause 482846.73ms. collected 1.737617MB. incr 
      From worker 3:	Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1271.83 MB,
      From worker 3:	heap_size 1892.75 MB, heap_target 2641.51 MB, Fragmentation 0.585

The longest pauses on this branch (RPTU runner) for 1.11 and nightly:
(Only from the workers, excluding the pauses during teardown, see below)

2024-10-26T10:48:13.3884710Z       From worker 5:	GC: pause 100716.38ms. collected 35.932709MB. incr 
2024-10-26T10:48:16.6525456Z       From worker 7:	GC: pause 101767.34ms. collected 620.466560MB. incr 
2024-10-26T11:07:43.0083398Z       From worker 3:	GC: pause 482846.73ms. collected 1.737617MB. incr 

2024-10-26T11:07:50.2651862Z       From worker 2:	GC: pause 2298.29ms. collected 882.389534MB. incr 
2024-10-26T11:15:55.3986915Z       From worker 5:	GC: pause 47293.04ms. collected 514.686832MB. incr 
2024-10-26T11:15:04.6189583Z       From worker 5:	GC: pause 113512.48ms. collected 1164.965448MB. full 

On the current master (GitHub runner):

2024-10-26T08:54:31.7158720Z GC: pause 1986.41ms. collected 0.003098MB. incr 
2024-10-26T09:13:04.8278705Z GC: pause 2592.45ms. collected 294.604805MB. incr 
2024-10-26T08:45:19.9139584Z GC: pause 2980.01ms. collected 1435.183800MB. incr 

2024-10-26T09:15:04.8216055Z GC: pause 3385.56ms. collected 1.666733MB. incr 
2024-10-26T09:36:29.7037000Z GC: pause 4112.78ms. collected 311.601254MB. incr 
2024-10-26T08:57:39.2085587Z GC: pause 4847.97ms. collected 1272.886726MB. incr 

Master process

There are also some very long pauses at the end during teardown of the workers, and the workers are not terminating properly; maybe they are also in a GC pause and get killed before it completes:

GC: pause 48326.45ms. collected 256.090321MB. incr 
Heap stats: bytes_mapped 448.11 MB, bytes_resident 413.20 MB,
heap_size 632.07 MB, heap_target 864.12 MB, Fragmentation 0.665
┌ Warning: Forcibly interrupting busy workers
│   exception = rmprocs: pids [2, 3, 4, 5, 6, 7] not terminated after 5.0 seconds.
└ @ Distributed ~/oscar-runners/runner-08/_work/_tool/julia/nightly/x64/share/julia/stdlib/v1.12/Distributed/src/cluster.jl:1253
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed ~/oscar-runners/runner-08/_work/_tool/julia/nightly/x64/share/julia/stdlib/v1.12/Distributed/src/cluster.jl:1049

GC: pause 74448.68ms. collected 72.181580MB. full recollect
Heap stats: bytes_mapped 448.11 MB, bytes_resident 413.20 MB,
heap_size 628.77 MB, heap_target 864.12 MB, Fragmentation 0.673

GC: pause 34306.09ms. collected 5.858597MB. incr 
Heap stats: bytes_mapped 448.11 MB, bytes_resident 413.20 MB,
heap_size 624.77 MB, heap_target 959.36 MB, Fragmentation 0.676
     Testing Oscar tests passed 

These pauses are quite weird, as they come from the master process, which probably did not do any testing. Around 500 MB is very little memory, and still there are several minutes of GC pauses.

How does this MemoryMax behave regarding swap space? Are the processes allowed to be swapped out to stay below this limit (this could explain the long pauses at the end...)?

In comparison, the nightly job on the GitHub runner for the current master branch:

GC: pause 1320.36ms. collected 2878.824236MB. full recollect
Heap stats: bytes_mapped 5505.34 MB, bytes_resident 5205.77 MB,
heap_size 7844.25 MB, heap_target 8270.59 MB, Fragmentation 0.860

GC: pause 4112.78ms. collected 311.601254MB. incr 
Heap stats: bytes_mapped 5505.34 MB, bytes_resident 4973.03 MB,
heap_size 7543.47 MB, heap_target 8216.91 MB, Fragmentation 0.630
     Testing Oscar tests passed 

There is also a bit of a pause at the end, but only a few seconds versus a few minutes.

@aaruni96
Member Author

aaruni96 commented Oct 26, 2024

Done. The "1.11-nightly, short" and "nightly, short" jobs took much longer than the others and were constantly running up against the memory limit, but completed.

Memory: 29.9G (high: 30.0G available: 288.0K)

Edit: looks like only "1.11-nightly, short" took significantly longer (60+ minutes); the short tests otherwise all take around 40 minutes on Linux. This looks suspicious, given that short tests take only 13 minutes on macOS, and took only 20 minutes before we added memory limits.

@aaruni96
Member Author

How does this MemoryMax behave regarding swap space? Are the processes allowed to be swapped out to stay below this limit (this could explain the long pauses at the end...)?

Yep. We would also need to set MemorySwapMax if we want to limit the amount of swapping that is allowed, but then if a runner reaches the swap limit, things might start dying because of OOM?
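
If we go that way, a sketch of the relevant directives (values are placeholders, not what is deployed): MemorySwapMax=0 forbids swap for the unit's cgroup entirely, so exceeding MemoryMax triggers OOM handling rather than thrashing in swap.

    [Service]
    MemoryMax=30G
    # No swap for this cgroup: fail fast via OOM rather than stalling
    # in swap (the likely cause of the long GC pauses above).
    MemorySwapMax=0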

@fieker removed the triage label Nov 6, 2024
@fieker
Contributor

fieker commented Nov 6, 2024

with @aaruni96

@@ -37,17 +37,20 @@ jobs:
- 'nightly'
group: [ 'short', 'long' ]
os:
- ubuntu-latest
- [Linux, RPTU]
Member

Can we do that? (I am guessing not, but perhaps worth a try.)

Suggested change
- [Linux, RPTU]
- [Linux, RPTU, ${{ matrix.group }}]

Member

While this may not work, note that we use the os field for just one thing (I believe), namely in line 28: runs-on: ${{ matrix.os }}.

So in theory it should be possible to use a more complex expression there which takes both matrix.os and matrix.group into account. You may have to rethink what kind of string is put into matrix.os, though.

Member

Perhaps change the matrix.os values to just plain Linux or macOS and then change runs-on: ${{ matrix.os }} to

runs-on: [ ${{ matrix.os }}, RPTU, ${{ matrix.group }} ]

Member

What I really would want to do is:

  • test whether matrix.os is a string or a "list" (??)
  • if it is a string, pass it on
  • if it is a list, append "group" to the list and pass it on

However, the GitHub documentation on expressions is... "sparse", to put it kindly. So I have no idea if any of these are even possible. The docs never mention what happens if os: [a, b] is interpolated into an expression like runs-on: ${{ matrix.os }} -- is it treated as a string, or an object? If it is an object, you'd expect some functionality to interact with it, but if that exists, it is either undocumented or I just can't find the documentation :-(

Member

(OK, to be fair, the docs do suggest that it may be possible to write ${{ matrix.os[42] }} to access the object at position 41 (0-indexed? 1-indexed?) of matrix.os if it is a list. But still no idea how to find out whether it is a list (which we could avoid if we just make sure it is always one), nor how to append to it.)

Member Author

In the case of os: [one, dimensional, list], matrix.os is one of those items (one, or dimensional, or list) at any given time.

In the case of

os:
  - [first, part, of, 2d, list]
  - [second, part, of, 2d, list]
  ...

matrix.os is an entire list item ([first, part, of, 2d, list]), as a YAML array.

Member

OK, I had to search for "array", and the docs do mention a bit more: we can also use contains to check whether an array contains some element, and join to turn an array into a string. Then there are also fromJSON and format... So one could do this evil thing: take the list, turn it into a string via join using commas as separators; use format to "append" to this flattened string, and put [ and ] around it. This now is valid JSON, and we can use fromJSON to turn it back into the desired list with one element added.

Of course, we will also be at the top of the list of people to be exterminated when the machines rise.
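
Spelled out, that trick could look roughly like the following (untested sketch; quoting each element via the join separator is an assumption to make the flattened string valid JSON):

    # e.g. matrix.os = [Linux, RPTU], matrix.group = short
    # join(...)   yields: Linux", "RPTU
    # format(...) yields: ["Linux", "RPTU", "short"]
    runs-on: ${{ fromJSON(format('["{0}", "{1}"]', join(matrix.os, '", "'), matrix.group)) }}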

Member Author

runs-on: [ ${{ matrix.os }}, RPTU, ${{ matrix.group }} ]

This idea doesn't work, for reasons I don't understand.

Please look at the CI run for fd433b6: runs-on: [ '${{ matrix.os }}', 'high-memory' ] makes it just ignore the config and run on Linux, RPTU anyway? Here, matrix.os should just be Linux, and matrix.hosted is RPTU, but it's just defined and not used anywhere...

https://github.com/oscar-system/Oscar.jl/actions/runs/11705888151

@aaruni96 force-pushed the ak96/linux-local-runner branch 2 times, most recently from 3d393d4 to fd4285b on November 6, 2024 at 13:46
Attempt 3: Use Max's suggestion of building runs-on "manually"