Limit max number of parallel ci runs for docker-based runs #36610

Open · wants to merge 6 commits into `develop`

Conversation

@tobiasdiez (Contributor) commented Oct 31, 2023

Currently, the ci-linux workflow starts a large number of parallel runs, which prevents other workflows (from PRs) from being executed. This PR adds a few more max_parallel statements to spread the load on the CI over a longer time.
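For context, `max_parallel` is an input of the reusable `docker.yml` workflow that caps how many matrix jobs of one block run concurrently. A minimal sketch of the kind of statement this PR adds, with illustrative values (not the exact Sage configuration):

```yaml
# ci-linux.yml (caller side) -- illustrative sketch, not the exact file
jobs:
  standard:
    uses: ./.github/workflows/docker.yml
    with:
      tox_packages_factors: >-
        ["standard"]
      max_parallel: 12  # cap this block at 12 concurrent runners
```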

📝 Checklist

  • The title is concise, informative, and self-explanatory.
  • The description explains in detail what this PR is about.
  • I have linked a relevant issue or discussion.
  • I have created tests covering the changes.
  • I have updated the documentation accordingly.

⌛ Dependencies

```diff
@@ -101,6 +102,7 @@ jobs:
       tox_packages_factors: >-
         ["standard"]
       docker_push_repository: ghcr.io/${{ github.repository }}/
+      max_parallel: 12
```
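Inside the reusable workflow, such an input would typically feed `strategy.max-parallel`. A hypothetical sketch of the receiving side (not the actual Sage docker.yml; input and matrix names are assumptions):

```yaml
# docker.yml (reusable workflow) -- hypothetical sketch
on:
  workflow_call:
    inputs:
      max_parallel:
        type: number
        default: 30                # the thread cites 30 as docker.yml's default

jobs:
  linux:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      max-parallel: ${{ inputs.max_parallel }}   # caps concurrent matrix jobs
      matrix:
        tox_system_factor: ["ubuntu-focal", "debian-bullseye", "fedora-38"]
    steps:
      - run: echo "testing on ${{ matrix.tox_system_factor }}"
```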

Collaborator:

I guess you picked these numbers heuristically. But why is 12 for "standard" less than 15 for "standard-pre"?

Contributor (author):

The heuristic was that "standard-pre" is rather important: you want to be notified early that there are fundamental compilation issues on a certain system. "standard" is a bit less important (and results seem to be more uniform across the tested systems); also, a few systems drop out because their pre step failed.

Member:

Note that the workflows do not fail when there are doctest failures.
It's crucial for "standard" to be completed as quickly as possible so that developers can inspect the results.

Member:

Also note that there is an overall walltime limit of 3 days (?) for the whole workflow to pass. So whatever job restrictions we put in, we need to make sure that they do not cause the later workflows to never run.

Contributor (author):

There is nothing crucial about these workflow runs. If they fail, then it's just an ordinary bug that needs to be fixed. There is no need to have them finish as soon as possible. You can always remove some older/similar systems from the CI if things take too long.

In my opinion it's not acceptable that other workflows are queued for 14+ hours because of it.

Collaborator:

Ah, 60 runners. Then increasing "max-parallel" to 50 is a good solution!

Also, we may decrease "max-parallel" for "minimal", "maximal", and "experimental" to leave room for other PR workflow runs.
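Rough headroom arithmetic behind this suggestion, assuming the 60-runner figure mentioned here:

```text
60 runners total
one block at max-parallel 50      -> 10 runners left for PR workflows
one block at the default 30       -> 30 runners left
two blocks at 30 running at once  ->  0 runners left (the queueing described in this thread)
```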

Member:

Sure. When I added the "standard-sitepackages" run (with a limit of 10, running in parallel with "standard"), I didn't decrease the max-parallel of "standard". We can reduce it a bit, from 30 to perhaps 25; I wouldn't go much lower than that.

Collaborator:

How about basing your PR on Tobias's?

Member:

I've instead pushed a small change to #36616 (comment)

@tobiasdiez (Contributor, author) commented Nov 2, 2023

Thanks for checking; then the time to the release (~7 days) is the closer deadline.

According to https://github.com/sagemath/sage/actions/runs/6598197147/usage, the overall run time is 28 days, so an average of 5 runners should be more than enough to make that deadline. With the proposed 10-15 runners, you get it done in 3 to 4 days.
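Back-of-the-envelope version of that estimate (the 28 runner-days figure is from the linked usage page; the 3-4 days quoted above presumably allows for queueing and uneven job lengths):

```text
total compute ≈ 28 runner-days
 5 runners: 28 / 5  ≈ 5.6 days  (within the ~7-day release window)
10 runners: 28 / 10 = 2.8 days
15 runners: 28 / 15 ≈ 1.9 days
```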

@kwankyu (Collaborator) commented Oct 31, 2023

Anyway, I like the idea.

@mkoeppe (Member) commented Oct 31, 2023

From the moment that a release has been pushed and the merged PRs are closed:

  • PR workflows still run with the previous release's Docker image.
  • For PRs that have already been rebased on the new release, this is slow, as the build has to redo all the changes from the new release.
  • For PRs that have not been rebased, the blocker PRs just closed are no longer applied, so failures already fixed there will show up again.

This is the situation until the "default-pre" and "default" jobs have completed and pushed the new Docker image.

So what we should do is ensure that this new Docker image is available as soon as possible. #36430, by making "default-pre" and "default" jobs separate from "standard-pre" and "standard", already did part of this, but we can do better: Currently, because "default" is queued after the "standard-pre" jobs, it has to wait until a runner becomes available.

So I would suggest the following:

  • standard-pre: increase the max-parallel from 30 (the default set in docker.yml) to 50 so that all standard-pre jobs can start at the same time (see the sketch below)

Edit: That's now #36616
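A sketch of that adjustment in ci-linux.yml, mirroring the diff shown earlier (surrounding keys omitted; illustrative, not the exact change in #36616):

```yaml
standard-pre:
  uses: ./.github/workflows/docker.yml
  with:
    max_parallel: 50  # raised from docker.yml's default of 30 so all standard-pre jobs start together
```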

@mkoeppe (Member) commented Nov 1, 2023

Let's close this one here in favor of #36616.

@kwankyu (Collaborator) commented Nov 1, 2023

#36616 adjusted "standard" reasonably.

Decreasing max-parallel for "minimal" and the others will make more room for other PRs. But as we have 60 runners, the current default of max-parallel 30 already leaves enough room for other PRs. On the other hand, decreasing max-parallel for "minimal" and the others means they take longer to finish.

So let's see how things work after #36616. If more adjustment is needed, let's use this PR; otherwise, close this PR as Matthias suggested.

@tobiasdiez (Contributor, author) commented:

> #36616 adjusted "standard" reasonably.

I don't agree that this is enough of a reduction.

> Decreasing max-parallel for "minimal" and the others will make more room for other PRs. But as we have 60 runners, the current default of max-parallel 30 already leaves enough room for other PRs. On the other hand, decreasing max-parallel for "minimal" and the others means they take longer to finish.

Sadly, that's not true. Already during the first hours after a new release there is a long queue. Normally the conda workflow takes 2 hours, but as you can see from https://github.com/sagemath/sage/actions/workflows/ci-conda.yml?page=4 and page 3, there are a lot of workflow runs that take 5 hours, i.e. they were queued for about 3 hours. Some are even in the queue for 12 hours, usually when one of the "blocks" in the ci-linux workflow finishes and the next block grabs almost all runners.

vbraun pushed a commit to vbraun/sage that referenced this pull request on Nov 2, 2023:

… decrease for `standard`, `minimal-pre`

As proposed in sagemath#36610 (comment)

URL: sagemath#36616
Reported by: Matthias Köppe
Reviewer(s): Kwankyu Lee
@kwankyu (Collaborator) commented Nov 2, 2023

OK. Then we would need to decrease max-parallel for "minimal" and the others in addition to #36616.

If you rebase your PR on #36616 and make adjustments only to "minimal" and the others, I am willing to get this PR in for the next beta.

vbraun pushed a commit to vbraun/sage that referenced this pull request on Nov 5, 2023 (the same "… decrease for `standard`, `minimal-pre`" commit for sagemath#36616 as above).
@mkoeppe (Member) commented Nov 11, 2023

I don't think this PR is a serious proposal. If Tobias wants to demonstrate that this radical cut of the number of allotted runners by a factor of 3 is still suitable for running the portability tests, he can do so by running it in his own repo so we can have a look at the run.

@tobiasdiez (Contributor, author) commented:

> I don't think this PR is a serious proposal.

Please stop these comments. This PR is clearly not an April Fools' joke.

> If Tobias wants to demonstrate that this radical cut of the number of allotted runners by a factor of 3 is still suitable for running the portability tests, he can do so by running it in his own repo so we can have a look at the run.

I don't want to demonstrate that this is possible; I want to reduce the completion time of PR-related workflows after a release.
Anyway, here is the run: https://github.com/tobiasdiez/sage/actions/runs/6831315056 (standard will finish about 16h after the run is started; not sure what other information you expect to get out of the test run).

@mkoeppe (Member) commented Nov 11, 2023

> (standard will finish about 16h after the run is started; not sure what other information you expect to get out of the test run).

The other tests that are part of the CI Linux.

@mkoeppe (Member) commented Nov 15, 2023

Proposing to close this PR. As I said in #36610 (comment), this PR is not a serious proposal.
From the beginning, it has lacked a justification for the radical changes that it makes. Instead, the author demands that reviewers justify the status quo.

@tobiasdiez (Contributor, author) commented:

There is nothing radical about this PR. It just gives priority to the PR workflows over ci-linux, by reducing the priority of the latter by roughly a factor of 2 (relative to the number of runners assigned at the time this PR was created).

@tobiasdiez (Contributor, author) commented:

Maybe start with answering the following question instead of insulting me:

> Can you expand on why this is too slow? [...] What would be an acceptable time for you?

@mkoeppe (Member) commented Nov 16, 2023

For reference: #36572 (comment)

@tobiasdiez (Contributor, author) commented:

> For reference: #36572 (comment)

I fail to see how this answers my questions.

@mkoeppe (Member) commented Dec 7, 2023

As I explained elsewhere: labels "invalid" + "needs review" = motion to close.

@mkoeppe (Member) commented Dec 21, 2023

As discussed in #36726 (comment)

@tobiasdiez (Contributor, author) commented:

That doesn't address my question at all.

@tobiasdiez added the disputed label (PR is waiting for community vote, see https://groups.google.com/g/sage-devel/c/IgBYUJl33SQ) and removed the r: invalid label on Dec 23, 2023

Documentation preview for this PR (built with commit 31f8802; changes) is ready! 🎉

@mkoeppe requested a review from kcrisman on March 17, 2024, 04:09
@mkoeppe (Member) commented Mar 17, 2024

@kcrisman responding to your suggestion in https://groups.google.com/g/sage-devel/c/IgBYUJl33SQ/m/ciVrZ7x0AQAJ:

Confirming my vote of -1 on this PR.

@kcrisman (Member) commented:

> That doesn't address my question at all.

Perhaps surprisingly, here I agree more that it's reasonable to ask for a concrete number and have a discussion about that before closing the ticket; see the end of my comment at #36725 (comment). Am I correct in inferring that some of the CI limit tickets are, at their core, about a disagreement in philosophy on whether PRs or "standard" builds should get priority? I'm a little in the dark here (hence my initial reluctance to get drawn in ...).

But anyway, I feel like that is all a sideshow. It seems like this should be a number that people can come to some consensus on. Can we at least agree on a hard upper and lower bound imposed by hardware/price/something? (And then maybe make a random choice based on some interesting distribution ... I'm totally not joking, randomized elections of this type are definitely proposed from time to time, and maybe it will help end the bad feelings here a tiny bit.)

@jhpalmieri (Member) commented:

-1 from me

Labels: c: scripts · disputed (PR is waiting for community vote, see https://groups.google.com/g/sage-devel/c/IgBYUJl33SQ) · s: needs review