Limit max number of parallel ci runs for docker-based runs #36610

Open · wants to merge 6 commits into `develop`

Conversation

@tobiasdiez (Contributor) commented Oct 31, 2023

Currently, the ci-linux workflow starts a large number of parallel runs, which prevents other workflows (from PRs) from being executed. This PR adds a few more max_parallel statements to spread the load on the CI over a longer time.
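For context, `max_parallel` is an input of the reusable `docker.yml` workflow that caps how many matrix jobs of one block run concurrently. A minimal sketch of the kind of statement this PR adds, with illustrative values (not the exact Sage configuration):

```yaml
# ci-linux.yml (caller side) -- illustrative sketch, not the exact file
jobs:
  standard:
    uses: ./.github/workflows/docker.yml
    with:
      tox_packages_factors: >-
        ["standard"]
      max_parallel: 12  # cap this block at 12 concurrent runners
```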

📝 Checklist

  • The title is concise, informative, and self-explanatory.
  • The description explains in detail what this PR is about.
  • I have linked a relevant issue or discussion.
  • I have created tests covering the changes.
  • I have updated the documentation accordingly.

⌛ Dependencies

```diff
@@ -101,6 +102,7 @@ jobs:
       tox_packages_factors: >-
         ["standard"]
       docker_push_repository: ghcr.io/${{ github.repository }}/
+      max_parallel: 12
```
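Inside the reusable workflow, such an input would typically feed `strategy.max-parallel`. A hypothetical sketch of the receiving side (not the actual Sage docker.yml; input and matrix names are assumptions):

```yaml
# docker.yml (reusable workflow) -- hypothetical sketch
on:
  workflow_call:
    inputs:
      max_parallel:
        type: number
        default: 30                # the thread cites 30 as docker.yml's default

jobs:
  linux:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      max-parallel: ${{ inputs.max_parallel }}   # caps concurrent matrix jobs
      matrix:
        tox_system_factor: ["ubuntu-focal", "debian-bullseye", "fedora-38"]
    steps:
      - run: echo "testing on ${{ matrix.tox_system_factor }}"
```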

Collaborator:

I guess you picked these numbers heuristically. But why is 12 for "standard" less than 15 for "standard-pre"?

Contributor (author):

The heuristic was that "standard-pre" is rather important: you want to be notified early that there are fundamental compilation issues on a certain system. "standard" is a bit less important (and results seem to be more uniform across the tested systems); also, a few systems drop out because their pre step failed.

Member:

Note that the workflows do not fail when there are doctest failures.
It's crucial for "standard" to be completed as quickly as possible so that developers can inspect the results.

Member:

Also note that there is an overall walltime limit of 3 days (?) for the whole workflow to pass. So whatever job restrictions we put in, we need to make sure that they do not cause the later workflows to never run.

Contributor (author):

There is nothing crucial about these workflow runs. If they fail, then it's just an ordinary bug that needs to be fixed. There is no need to have them finish as soon as possible. You can always remove some older/similar systems from the CI if things take too long.

In my opinion it's not acceptable that other workflows are queued for 14+ hours because of it.

Collaborator:

Ah, 60 runners. Then increasing "max-parallel" to 50 is a good solution!

Also, we may decrease "max-parallel" for "minimal", "maximal", and "experimental" to leave room for other PR workflow runs.
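Rough headroom arithmetic behind this suggestion, assuming the 60-runner figure mentioned here:

```text
60 runners total
one block at max-parallel 50      -> 10 runners left for PR workflows
one block at the default 30       -> 30 runners left
two blocks at 30 running at once  ->  0 runners left (the queueing described in this thread)
```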

Member:

Sure. When I added the "standard-sitepackages" run (with a limit of 10, running in parallel with "standard"), I didn't decrease the max-parallel of "standard". We can reduce it a bit, from 30 to perhaps 25; I wouldn't go much lower than that.

Collaborator:

How about basing your PR on Tobias's?

Member:

I've instead pushed a small change to #36616 (comment)

@tobiasdiez (Contributor, author) commented Nov 2, 2023

Thanks for checking; then the time to the release (~7 days) is the closer deadline.

According to https://github.com/sagemath/sage/actions/runs/6598197147/usage, the overall run time is 28 days, so an average of 5 runners should be more than enough to make that deadline. With the proposed 10-15 runners, you get it done in 3 to 4 days.
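Back-of-the-envelope version of that estimate (the 28 runner-days figure is from the linked usage page; the 3-4 days quoted above presumably allows for queueing and uneven job lengths):

```text
total compute ≈ 28 runner-days
 5 runners: 28 / 5  ≈ 5.6 days  (within the ~7-day release window)
10 runners: 28 / 10 = 2.8 days
15 runners: 28 / 15 ≈ 1.9 days
```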

@kwankyu (Collaborator) commented Oct 31, 2023

Anyway, I like the idea.

@mkoeppe (Member) commented Oct 31, 2023

From the moment that a release has been pushed and the merged PRs are closed:

  • PR workflows still run with the previous release's Docker image.
  • For PRs that have already been rebased on the new release, this is slow, as the build has to redo all the changes from the new release.
  • For PRs that have not been rebased, the blocker PRs just closed are no longer applied, so failures already fixed there will show up again.

This is the situation until the "default-pre" and "default" jobs have completed and pushed the new Docker image.

So what we should do is ensure that this new Docker image is available as soon as possible. #36430, by making "default-pre" and "default" jobs separate from "standard-pre" and "standard", already did part of this, but we can do better: Currently, because "default" is queued after the "standard-pre" jobs, it has to wait until a runner becomes available.

So I would suggest the following:

  • standard-pre: increase the max-parallel from 30 (the default set in docker.yml) to 50 so that all standard-pre jobs can start at the same time (see the sketch below)

Edit: That's now #36616
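A sketch of that adjustment in ci-linux.yml, mirroring the diff shown earlier (surrounding keys omitted; illustrative, not the exact change in #36616):

```yaml
standard-pre:
  uses: ./.github/workflows/docker.yml
  with:
    max_parallel: 50  # raised from docker.yml's default of 30 so all standard-pre jobs start together
```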

@mkoeppe (Member) commented Nov 1, 2023

Let's close this one here in favor of #36616.

@kwankyu (Collaborator) commented Nov 1, 2023

#36616 adjusted "standard" reasonably.

Decreasing max-parallel for "minimal" and the others will make more room for other PRs. But as we have 60 runners, the current default of max-parallel 30 already leaves enough room for other PRs. On the other hand, decreasing max-parallel for "minimal" and the others means they take longer to finish.

So let's see how things work after #36616. If more adjustment is needed, let's use this PR; otherwise, close this PR as Matthias suggested.

@tobiasdiez (Contributor, author) commented:

> #36616 adjusted "standard" reasonably.

I don't agree that this is enough of a reduction.

> Decreasing max-parallel for "minimal" and the others will make more room for other PRs. But as we have 60 runners, the current default of max-parallel 30 already leaves enough room for other PRs. On the other hand, decreasing max-parallel for "minimal" and the others means they take longer to finish.

Sadly, that's not true. Already during the first hours after a new release there is a long queue. Normally the conda workflow takes 2 hours, but as you can see from https://github.com/sagemath/sage/actions/workflows/ci-conda.yml?page=4 and page 3, there are a lot of workflow runs that take 5 hours, i.e. they were queued for about 3 hours. Some are even in the queue for 12 hours, usually when one of the "blocks" in the ci-linux workflow finishes and the next block grabs almost all runners.

vbraun pushed a commit to vbraun/sage that referenced this pull request on Nov 2, 2023:

… decrease for `standard`, `minimal-pre`

As proposed in sagemath#36610 (comment)

URL: sagemath#36616
Reported by: Matthias Köppe
Reviewer(s): Kwankyu Lee
@kwankyu (Collaborator) commented Nov 2, 2023

OK. Then we would need to decrease max-parallel for "minimal" and the others in addition to #36616.

If you rebase your PR on #36616 and make adjustments only to "minimal" and the others, I am willing to get this PR in for the next beta.

vbraun pushed a commit to vbraun/sage that referenced this pull request on Nov 5, 2023 (the same "… decrease for `standard`, `minimal-pre`" commit for sagemath#36616 as above).
@mkoeppe (Member) commented Nov 11, 2023

I don't think this PR is a serious proposal. If Tobias wants to demonstrate that this radical cut of the number of allotted runners by a factor of 3 is still suitable for running the portability tests, he can do so by running it in his own repo so we can have a look at the run.

@tobiasdiez (Contributor, author) commented:

> I don't think this PR is a serious proposal.

Please stop these comments. This PR is clearly not an April Fools' joke.

> If Tobias wants to demonstrate that this radical cut of the number of allotted runners by a factor of 3 is still suitable for running the portability tests, he can do so by running it in his own repo so we can have a look at the run.

I don't want to demonstrate that this is possible; I want to reduce the completion time of PR-related workflows after a release.
Anyway, here is the run: https://github.com/tobiasdiez/sage/actions/runs/6831315056 (standard will finish about 16h after the run is started; not sure what other information you expect to get out of the test run).

@mkoeppe (Member) commented Nov 11, 2023

> (standard will finish about 16h after the run is started; not sure what other information you expect to get out of the test run).

The other tests that are part of the CI Linux.

@mkoeppe (Member) commented Nov 15, 2023

Proposing to close this PR. As I said in #36610 (comment), this PR is not a serious proposal.
From the beginning, it has lacked a justification for the radical changes that it makes. Instead, the author demands that reviewers justify the status quo.

@tobiasdiez (Contributor, author) commented:

There is nothing radical about this PR. It just gives priority to the PR workflows over ci-linux, by reducing the priority of the latter by roughly a factor of 2 (relative to the number of runners assigned at the time this PR was created).

@tobiasdiez (Contributor, author) commented:

Maybe start with answering the following question instead of insulting me:

> Can you expand on why this is too slow? [...] What would be an acceptable time for you?

@mkoeppe (Member) commented Nov 16, 2023

For reference: #36572 (comment)

@tobiasdiez (Contributor, author) commented:

> For reference: #36572 (comment)

I fail to see how this answers my questions.

@mkoeppe (Member) commented Dec 7, 2023

As I explained elsewhere: labels "invalid" + "needs review" = motion to close.

@mkoeppe (Member) commented Dec 21, 2023

As discussed in #36726 (comment)

@tobiasdiez (Contributor, author) commented:

That doesn't address my question at all.

@tobiasdiez added the disputed label (PR is waiting for community vote, see https://groups.google.com/g/sage-devel/c/IgBYUJl33SQ) and removed the r: invalid label on Dec 23, 2023

Documentation preview for this PR (built with commit 31f8802; changes) is ready! 🎉

@mkoeppe requested a review from kcrisman on March 17, 2024, 04:09
@mkoeppe (Member) commented Mar 17, 2024

@kcrisman responding to your suggestion in https://groups.google.com/g/sage-devel/c/IgBYUJl33SQ/m/ciVrZ7x0AQAJ:

Confirming my vote of -1 on this PR.

@kcrisman (Member) commented:

> That doesn't address my question at all.

Perhaps surprisingly, here I agree more that it's reasonable to ask for a concrete number and have a discussion about that before closing the ticket; see the end of my comment at #36725 (comment). Am I correct in inferring that some of the CI limit tickets are, at their core, about a disagreement in philosophy on whether PRs or "standard" builds should get priority? I'm a little in the dark here (hence my initial reluctance to get drawn in ...).

But anyway, I feel like that is all a sideshow. It seems like this should be a number that people can come to some consensus on. Can we at least agree on a hard upper and lower bound imposed by hardware/price/something? (And then maybe make a random choice based on some interesting distribution ... I'm totally not joking, randomized elections of this type are definitely proposed from time to time, and maybe it will help end the bad feelings here a tiny bit.)

@jhpalmieri (Member) commented:

-1 from me

Labels: c: scripts · disputed (PR is waiting for community vote, see https://groups.google.com/g/sage-devel/c/IgBYUJl33SQ) · s: needs review