-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use RPTU local runners instead of github hosted linux runners #3986
base: master
Are you sure you want to change the base?
Conversation
This PR replaces the github hosted runners by the RPTU runners. This means all tests are expected to be very fast, but if the RPTU runners ever go offline, testing will block / fail until they are back online. Is it worth to try to figure out how to have github runners as an automatic fallback? |
.github/workflows/CI.yml
Outdated
if: runner.os == 'macOS' && runner.environment == 'self-hosted' | ||
# runner.environment is supposed to be a valid property: https://github.com/orgs/community/discussions/48359#discussioncomment-9059557 | ||
run: echo "NUMPROCS=5" >> $GITHUB_ENV | ||
- name: "Use multiple processes for self-hosted Linux runners" | ||
if: runner.os == 'Linux' && runner.environment == 'self-hosted' | ||
run: echo "NUMPROCS=6" >> $GITHUB_ENV |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't this be made part of the runner config instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no unified way of doing this (that I know of), one would have to set the environment variable for each runner (and then if/when we update this, that will have to be changed manually for each runner).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we use https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/running-scripts-before-or-after-a-job to set the NUMPROC
environment variable on the runners? IMHO that's really were this setting belongs (and then we can decide on the runners how many cores we want to assign)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
71cc52c resolves this, right, @fingolfin ?
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #3986 +/- ##
==========================================
- Coverage 84.58% 84.57% -0.02%
==========================================
Files 631 631
Lines 84831 85054 +223
==========================================
+ Hits 71757 71935 +178
- Misses 13074 13119 +45 |
Maybe a fully automatic setup is unnecessary? But being able to switch manually (with little effort) would be great, I suppose. |
7169678
to
c2eee44
Compare
613aadf
to
552550d
Compare
Reverted unnecessary cosmetic changes, and let the CI run this time. However, the "required" tests will no longer run (as they are required on |
The unit file for the runner now controls |
Looks like 8 GB is far too low a limit. Runners need some amount of memory per core: trying to watch htop during testing, a little more than 4 GB per core. For 6 cores, that should be ~24 GB per runner. However also monitoring just the memory usage of the runner + all sub processes as a whole, I recorded a peak usage of 40.8 GB. |
Let's assign 30gb per group and see where it leads us |
Capping memory at 30 GB, it seems to run without problems, with peak memory recorded at 29.94 GB. ( I am running the 1.10 / long test suite) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks fine now, but I won't merge this on a Friday. Will get back to it next week
Can we also do at least one full run (of all jobs) please? Right now some jobs still show e.g. |
I think we need a bit more memory, the WorkersThere is output like this (in 1.11 short):
The longest pauses on this branch (RPTU runner) for 1.11 and nightly:
On the current master (github runner):
Master processThere are also some very long pauses at the end during teardown of the workers, and the workers are not terminating properly, maybe they are also in a GC pause and get killed before it is complete:
These pauses are quite weird as they come from the master process which probably did not do any testing? Around 500MB is very little memory and still several minutes of GC pauses. How does this In comparison the nightly job on the github runner for the current master branch:
There is also a bit of a pause at the end but only a few seconds vs a few minutes. |
Done.
Edit: looks like only |
Yeps. We also need to set SwapMax if we want to limit the amount of swapping that is allowed, but then if it reaches the swap limit, things might start dying because of OOM ? |
with @aaruni96 |
.github/workflows/CI.yml
Outdated
@@ -37,17 +37,20 @@ jobs: | |||
- 'nightly' | |||
group: [ 'short', 'long' ] | |||
os: | |||
- ubuntu-latest | |||
- [Linux, RPTU] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do that (I am guessing not, but perhaps worth a try?)
- [Linux, RPTU] | |
- [Linux, RPTU, ${{ matrix.group }}] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this may not work, note that we use the os
field for just one thing (I believe), namely in line 28: runs-on: ${{ matrix.os }}
.
So in theory it should be possible to use a more complex expression there which takes both matrix.os
and matrix.group
into account. You may have to rethink what kind of string is put into matrix.os
, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps change the matrix.os
values to just plain Linux
or macOS
and then change runs-on: ${{ matrix.os }}
to
runs-on: [ ${{ matrix.os }}, RPTU, ${{ matrix.group }}$ ]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I really would want to do is:
- test if
matrix.os
is a string or a "list" (??) - if it is a string, pass it on
- if it is a list, append "group" to the list and pass it on
However the GitHub documentation on expression is... "sparse", to put it kindly. So I have no idea if any of these are even possible. They don't ever mention what happens if os: [a, b]
is interpolated into an expression like runs-on: ${{ matrix.os }}
-- is it treated as a string, or an object? If it is an object, you'd expect some functionality to interact with it, but if it's there, then it is either undocumented or I just can't find the documentation :-(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(OK, to be fair, the docs do suggest that it may be possible to write ${{ matrix.os[42] }}
to access the object at position 41 (0-indexed? 1-indexed?) of matrix.os
if it is a list. But still no idea how to find out whether it is a list (which we could avoid if we just make sure it is always one) nor how to append to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in case of os: [one, dimensional, list]
, matrix.os
is one of those items (one
, or dimensional
, or list
) at any given time
in case of
os:
- [first, part, of, 2d, list]
- [second, part, of 2d, list]
...
matrix.os
is an entire list item (first, part, of, 2d, list
), as a yaml array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I had to search for "array" and it does mention a bit more: we can also use contains
to check if an array contains some element; and join
to turn an array into a list. Then there is also fromJSON
and format
.... So one could do this evil thing: take the list, turn it into a string via join
using commas as seperators; use format to "append" to this flattened list, and put [
and ]
around it. This now is valid JSON and we can use fromJSON
to turn it back into the desired list with one element added.
Of course we will also be top on the list of people to be exterminated when the machines rise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runs-on: [ ${{ matrix.os }}, RPTU, ${{ matrix.group }}$ ]
This idea doesn't work, for reasons I don't understand.
Please look at the CI run for fd433b6 : runs-on: [ '${{ matrix.os }}', 'high-memory']
makes it just ignore the config and run on Linux, RPTU
anyway? Here, matrix.os
should just be Linux
, and matrix.hosted
is RPTU
, but its just defined, and not used anywhere....
https://github.com/oscar-system/Oscar.jl/actions/runs/11705888151
575c9cb
to
448eda9
Compare
3d393d4
to
fd4285b
Compare
Attempt 3 : Use Max's suggestion of building runs-on "manually"
fd4285b
to
082178d
Compare
No description provided.