-
-
Notifications
You must be signed in to change notification settings - Fork 480
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI Linux: Increase max_parallel for standard-pre
, decrease for standard
, minimal-pre
#36616
CI Linux: Increase max_parallel for standard-pre
, decrease for standard
, minimal-pre
#36616
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Thanks! |
The problem is not that the default image takes so long (that adds maybe 30 mins to the build & test workflow, see https://github.com/sagemath/sage/actions/runs/6703096803/job/18216119546). The problem is that the ci linux workflow spawns already to many runs. Just extract the "default" steps to a new workflow. |
No, Tobias, I explained it correctly in #36610 (comment), and, no, extracting the "default" step to another workflow will not change anything. |
As I don't know how github schedules runners, I cannot be sure. But decreasing "max-parallel" for "minimal", "maximal", "experimental" may allow rooms for other PR workflow runs. We have 7 days before "release". That would be enough time for runs of "minimal" and others will finish. |
That is part of what Tobias' PR does. |
Yes, I agree that the problem is that default is queued after the standard-pre jobs. This is not changed in this PR, but would be fixed by moving the default job to its own independent workflow. Together with a reasonable limit on the max jobs of the ci-linux workflow, you ensure that "default" is run more or less after "default-pre" (it might get delayed by other workflows, but the same problems happens also with the current setup of default after standard-pre). |
As I understand it, since "default-pre" is a job separate from "standard-pre", it is already queued separately from "standard-pre" even though they are in the same workflow. I guess that "same file or not" doesn't matter in how jobs get runners.
"default" is already run after "default-pre" by |
You might be right, maybe you don't need to move it to a separate file (I couldn't find good documentation about how github queues jobs). I still maintain that decreasing the max-parallel is the way to go because of how other jobs are added to the queue in the meantime. With max-parallel = 1 for standard-pre (1 for illustration, not as actual proposal): default-pre runs, and 1 standard-pre. Once default-pre is finished, default is added to the queue. At that point the queue contains the other 44 standard-pre and maybe a couple of other jobs (from say PRs). Now every available runner first picks up the other jobs and finally With max-parallel=45 for standard-pre (again for illustration as the other extreme): default-pre runs, and all standard-pre jobs. Since there are only 60 - 46 = 14 runners not busy with the ci-linux, there will be a huge queue of jobs that pill up during the default-pre run. So once that job finishes and default is added to the queue, it takes some time until the queue is minimized and default is picked up. Note that some other jobs might get the priority above ci-linux, so that not even all 46 standard-pre jobs are already executed at the beginning. In that case, an available runner first picks the remaining standard-pre job before default. So in this situation, at least every standard-pre job has to be started before default is executed. And based on looking at the logs, we are already in the later situation: |
That's correct. |
I've already explained in #36610 (comment) that this is a feature, not a bug. These jobs should really be waiting until the "default" job has pushed the new image. |
I would strongly suggest to generally include links to what you actually looked at so that we don't have to guess.
No, that's incorrect, as is easily verified. From "default", https://github.com/sagemath/sage/actions/runs/6700323366/job/18235514573 ("View Raw Logs"):
From "opensuse-tumbleweed", standard-pre:
|
standard-pre
, decrease for standard
, minimal-pre
I have now pushed a small reduction to correct for standard-sitepackages, as discussed in #36610 (comment) |
Documentation preview for this PR (built with commit 35910e2; changes) is ready! 🎉 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks reasonable to me.
In addition to that, I still think "minimal" and others should be put to the queue by smaller chunks, max-parallel 10 or something (instead of current default 30), to allow room for PRs. |
--> #36660 |
I don't know what what you take as your evaluation, but here are the average total execution times (wait + actual run) of the lint workflow runs 48h after the release: As you see, this week its quite bad. So can we please merge #36610 next to see how it compares. |
How should I read this
Is this about 14 hours? Does this mean that lint workflow took 14 hours to finish after it started waiting in the queue? Is it started waiting in the queue after 48 hours of the release? |
The first colum is the datetime of the release /tag, the second one is the average execution time of all runs in 48h after that date. |
Ah, thanks. So 11-05 record is pretty well compared with the previous 10-30 record. But as 11-05 record is in 10th place among 16 records, you may say it is still pretty bad. To improve the situation further, we may experiment with the idea of decreasing max-parallel of "minimal" and others. Would you rebase your PR and only do this, as Matthias is fine-tuning "standard-" things in #36660? |
Anyway, as we are in release-candidate state, I am not sure if we can compare the next record with others fairly... |
I agree that the time to completion of the short running jobs like the linter is a meaningful metric. The 0:27:38 time for the latest release seems consistent/plausible based on my observations. Perhaps Tobias could share the script with which he obtained this metric so that we can continue to monitor it. |
I'll also note that the specter raised in #36627 (comment) of about "20 hours" wait time didn't show |
The comparison with 10-30 is misleading since back then the ci-linux workflow has been restarted which completely overwhelmed the github runners. But you do see from that day that the more time you give to ci-linux early on the worse the other work flows behave. Better compare it to 14th or 21th.
The problem is the standard runs not minimal. You see this by looking at this week's runs, the first ones after a release are delayed for 30min (even after default is finished). |
Where can I see the start time (of the first one)? I see the duration time, but cannot find the time when it started... |
In the raw logs for example. |
OK. So this is when "default" started
and this
seems the first linter workflow started after it. So it waited for about 2 hours before starting. So this one was delayed by 2 hours. And do you claim that, including the above one, the first ones (for 48 hours after the release) were delayed by 30 minutes in average? |
Matthias thinks that PRs can wait for (be delayed) for ~4 hours after release since they should be checked by B&T after "default" is built. So we are allowing ~4 hours delay for PRs after release. During the ~4 hours, we aim at processing most of "standard-" jobs. I think this strategy is reasonable. |
Yes, they are delayed by about 25 mins (30 mins of overall execution time - about 5mins actual runtime). It's not possible to get the time of delay from the github api, only the start and end time.
And I'm not agreeing that this delay is helpful to reduce the average execution time of the PR workflows for the reasons explained in #36627 (comment). Again, compare the following options:
In both cases, PRs submitted after 4h are using the fresh docker image - so they have the same run time. So the most unfortunate situation is a PR submitted directly at 3.59. With option a) it would be delayed by 1min, and then finishes after say 4h run time, i.e. exactly 8h after the new release. With option b) it is started immediately, and then finishes after 4+2=6h run time, i.e. 10h after the release. A PR submitted at 2h after a release finishes in both cases at 8h, while a PR submitted directly at the release finishes after 8 and 6h, respectively. So with option a) one would only get a quicker drain of the queue between 8h to 10h after an release. But from the observation at #36616 (comment)
there was no shortage of runners after 4h, so the quicker drain at the 8h mark was without advantage. I go even so far to conjecture that with 10 (instead of 50) standard ci-runs, not a single workflow run would have been delayed this week. But as you said in #36616 (comment), there is no harm in simply trying it out with my alternative solution (after the next release, to still test out #36660) and then choose the one that worked best. |
It seems that you and Matthias have different priorities. After release, Matthias uses the delay (~4 hours long) to finish as much of the "standard-" jobs as possible so that the results can be used to fix things with enough time before the next release. But you are concerned of PR jobs getting run without much interruption after release. I am inclined to Matthias' strategy but want the "standard-" jobs not slip over the ~4 hours slot so as to hinder PR jobs started after the delay. It is also important that the ~4 hours delay is usually in the weekend, when few people work for sage. |
It's not like I'm proposing to disable the ci-linux work flow. With 10 parallel runs, the standard runs still complete in 16h. That should give plenty of time to discover issues. |
Sharing my observations (similar to #36616 (comment)) for the current run of 10.2.rc1 (started at T = Fri, 10 Nov 2023 16:14:50 GMT; note that the fine-tuning #36660 has not been merged yet):
|
... for constructive clogging, sagemath#36616 (comment) <!-- ^^^^^ Please provide a concise, informative and self-explanatory title. Don't put issue numbers in there, do this in the PR body below. For example, instead of "Fixes sagemath#1234" use "Introduce new method to calculate 1+1" --> <!-- Describe your changes here in detail --> <!-- Why is this change required? What problem does it solve? --> <!-- If this PR resolves an open issue, please link to it here. For example "Fixes sagemath#12345". --> <!-- If your change requires a documentation PR, please link it appropriately. --> ### 📝 Checklist <!-- Put an `x` in all the boxes that apply. --> <!-- If your change requires a documentation PR, please link it appropriately --> <!-- If you're unsure about any of these, don't hesitate to ask. We're here to help! --> <!-- Feel free to remove irrelevant items. --> - [x] The title is concise, informative, and self-explanatory. - [ ] The description explains in detail what this PR is about. - [x] I have linked a relevant issue or discussion. - [ ] I have created tests covering the changes. - [ ] I have updated the documentation accordingly. ### ⌛ Dependencies <!-- List all open PRs that this PR logically depends on - sagemath#12345: short description why this is a dependency - sagemath#34567: ... --> <!-- If you're unsure about any of these, don't hesitate to ask. We're here to help! --> URL: sagemath#36660 Reported by: Matthias Köppe Reviewer(s): Kwankyu Lee
The run for 10.2.rc2 (https://github.com/sagemath/sage/actions/runs/6842945144) was still busy with the last block ("experimental-p-z") of jobs when 10.2.rc3 was pushed. I've canceled it now. |
Run for 10.2.rc3 (https://github.com/sagemath/sage/actions/runs/6870325951) started at T = Tue, 14 Nov 2023 22:33:32 GMT; empty queue
|
So which B&T workflows actually profited from the clogging (i.e. got delayed long enough to run after the default workflow finished)? |
Here are the last recent runs:
The last column shows the number of runs (as an indication of how busy it was). Although every of these periods has seen less runs than at 10-21, the execution runs are a factor of 3 to 7 of the execution time before this PR was merged. |
<!-- ^^^^^ Please provide a concise, informative and self-explanatory title. Don't put issue numbers in there, do this in the PR body below. For example, instead of "Fixes sagemath#1234" use "Introduce new method to calculate 1+1" --> <!-- Describe your changes here in detail --> Alternative solution for sagemath#36726. Since the purpose of the "minimal" ci runs is to check the documentation of the minimal build requirements, it is overkill to run it for all systems. Thus, we only run it for "ubuntu- focal", which is also the default system for PRs. This also hugely reduces the load of onto the ci, somewhat leviating the issues introduced by sagemath#36616 (see eg sagemath#36616 (comment)). <!-- Why is this change required? What problem does it solve? --> <!-- If this PR resolves an open issue, please link to it here. For example "Fixes sagemath#12345". --> <!-- If your change requires a documentation PR, please link it appropriately. --> ### 📝 Checklist <!-- Put an `x` in all the boxes that apply. --> <!-- If your change requires a documentation PR, please link it appropriately --> <!-- If you're unsure about any of these, don't hesitate to ask. We're here to help! --> <!-- Feel free to remove irrelevant items. --> - [x] The title is concise, informative, and self-explanatory. - [x] The description explains in detail what this PR is about. - [x] I have linked a relevant issue or discussion. - [ ] I have created tests covering the changes. - [ ] I have updated the documentation accordingly. ### ⌛ Dependencies <!-- List all open PRs that this PR logically depends on - sagemath#12345: short description why this is a dependency - sagemath#34567: ... --> <!-- If you're unsure about any of these, don't hesitate to ask. We're here to help! --> URL: sagemath#36941 Reported by: Tobias Diez Reviewer(s): Dima Pasechnik
<!-- ^^^^^ Please provide a concise, informative and self-explanatory title. Don't put issue numbers in there, do this in the PR body below. For example, instead of "Fixes sagemath#1234" use "Introduce new method to calculate 1+1" --> <!-- Describe your changes here in detail --> Alternative solution for sagemath#36726. Since the purpose of the "minimal" ci runs is to check the documentation of the minimal build requirements, it is overkill to run it for all systems. Thus, we only run it for "ubuntu- focal", which is also the default system for PRs. This also hugely reduces the load of onto the ci, somewhat leviating the issues introduced by sagemath#36616 (see eg sagemath#36616 (comment)). <!-- Why is this change required? What problem does it solve? --> <!-- If this PR resolves an open issue, please link to it here. For example "Fixes sagemath#12345". --> <!-- If your change requires a documentation PR, please link it appropriately. --> ### 📝 Checklist <!-- Put an `x` in all the boxes that apply. --> <!-- If your change requires a documentation PR, please link it appropriately --> <!-- If you're unsure about any of these, don't hesitate to ask. We're here to help! --> <!-- Feel free to remove irrelevant items. --> - [x] The title is concise, informative, and self-explanatory. - [x] The description explains in detail what this PR is about. - [x] I have linked a relevant issue or discussion. - [ ] I have created tests covering the changes. - [ ] I have updated the documentation accordingly. ### ⌛ Dependencies <!-- List all open PRs that this PR logically depends on - sagemath#12345: short description why this is a dependency - sagemath#34567: ... --> <!-- If you're unsure about any of these, don't hesitate to ask. We're here to help! --> URL: sagemath#36941 Reported by: Tobias Diez Reviewer(s): Dima Pasechnik
…nd 1 Linux job <!-- ^^^^^ Please provide a concise, informative and self-explanatory title. Don't put issue numbers in there, do this in the PR body below. For example, instead of "Fixes sagemath#1234" use "Introduce new method to calculate 1+1" --> <!-- Describe your changes here in detail --> <!-- Why is this change required? What problem does it solve? --> as discussed in sagemath#36678 (comment), sagemath#36616 (comment) On pushes to a tag, all 3 Linux and 3 macOS jobs are still run, as demonstrated in https://github.com/mkoeppe/sage/actions/runs/6829619271 <!-- If this PR resolves an open issue, please link to it here. For example "Fixes sagemath#12345". --> <!-- If your change requires a documentation PR, please link it appropriately. --> After merging, the branch protection rule https://github.com/sagemath/sa ge/settings/branch_protection_rules/33965567 will need to be updated. It controls what workflows show as "Required" in the PR Checks. ### 📝 Checklist <!-- Put an `x` in all the boxes that apply. --> <!-- If your change requires a documentation PR, please link it appropriately --> <!-- If you're unsure about any of these, don't hesitate to ask. We're here to help! --> <!-- Feel free to remove irrelevant items. --> - [x] The title is concise, informative, and self-explanatory. - [ ] The description explains in detail what this PR is about. - [x] I have linked a relevant issue or discussion. - [ ] I have created tests covering the changes. - [ ] I have updated the documentation accordingly. ### ⌛ Dependencies <!-- List all open PRs that this PR logically depends on - sagemath#12345: short description why this is a dependency - sagemath#34567: ... --> <!-- If you're unsure about any of these, don't hesitate to ask. We're here to help! --> URL: sagemath#36694 Reported by: Matthias Köppe Reviewer(s): Kwankyu Lee, Tobias Diez
…nd 1 Linux job <!-- ^^^^^ Please provide a concise, informative and self-explanatory title. Don't put issue numbers in there, do this in the PR body below. For example, instead of "Fixes sagemath#1234" use "Introduce new method to calculate 1+1" --> <!-- Describe your changes here in detail --> <!-- Why is this change required? What problem does it solve? --> as discussed in sagemath#36678 (comment), sagemath#36616 (comment) On pushes to a tag, all 3 Linux and 3 macOS jobs are still run, as demonstrated in https://github.com/mkoeppe/sage/actions/runs/6829619271 <!-- If this PR resolves an open issue, please link to it here. For example "Fixes sagemath#12345". --> <!-- If your change requires a documentation PR, please link it appropriately. --> After merging, the branch protection rule https://github.com/sagemath/sa ge/settings/branch_protection_rules/33965567 will need to be updated. It controls what workflows show as "Required" in the PR Checks. ### 📝 Checklist <!-- Put an `x` in all the boxes that apply. --> <!-- If your change requires a documentation PR, please link it appropriately --> <!-- If you're unsure about any of these, don't hesitate to ask. We're here to help! --> <!-- Feel free to remove irrelevant items. --> - [x] The title is concise, informative, and self-explanatory. - [ ] The description explains in detail what this PR is about. - [x] I have linked a relevant issue or discussion. - [ ] I have created tests covering the changes. - [ ] I have updated the documentation accordingly. ### ⌛ Dependencies <!-- List all open PRs that this PR logically depends on - sagemath#12345: short description why this is a dependency - sagemath#34567: ... --> <!-- If you're unsure about any of these, don't hesitate to ask. We're here to help! --> URL: sagemath#36694 Reported by: Matthias Köppe Reviewer(s): Kwankyu Lee, Tobias Diez
As proposed in #36610 (comment)
📝 Checklist
⌛ Dependencies