Checkout suddenly much slower on Windows environment #1186
Comments
I have observed this too. Example job: https://github.com/Tyrrrz/CliWrap/actions/runs/4271628026
Note that this behavior is pretty inconsistent, and usually …
I've encountered this and debugged it a bit, by heavily instrumenting the transpiled code in … all on this PowerShell call: `(Get-CimInstance -ClassName Win32_OperatingSystem).caption`, totaling a whopping 1 minute and 48 seconds just to determine a Windows release that the …

To make things even worse, the same PowerShell invocation also happens multiple times during the post-action. I do not really understand what changed between Feb 17th and 21st that would explain this slowdown. There has not been a new …
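For anyone who wants to reproduce the measurement, here is a minimal, hedged sketch of a workflow step that times the same CIM query on a runner; the step name and surrounding workflow are assumptions, not part of the original report.

```yaml
# Hypothetical diagnostic step: times the same CIM query that 'windows-release' issues.
- name: Time Win32_OperatingSystem query
  if: runner.os == 'Windows'
  shell: pwsh
  run: |
    $elapsed = Measure-Command {
      (Get-CimInstance -ClassName Win32_OperatingSystem).Caption | Out-Null
    }
    Write-Host "Get-CimInstance took $($elapsed.TotalSeconds) seconds"
```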
I have briefly seen success trying to hard-code the fix of sindresorhus/windows-release#18 into … At this point, I have to tend to other things, but I thought I'd leave my current findings here in case somebody else can take over (and try things like overriding …).
I made an attempt to avoid calling into … If you want to try, you can use `- uses: BrettDong/checkout@octokit` and see if there is any improvement in the time stalled between starting the checkout action and performing actual git operations.
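A minimal sketch of what trying the fork could look like in a workflow; the `submodules` input is only a placeholder for whatever inputs you normally pass to actions/checkout.

```yaml
steps:
  # Temporary drop-in replacement for actions/checkout while comparing stall times.
  - uses: BrettDong/checkout@octokit
    with:
      submodules: recursive  # placeholder: keep whatever inputs you already use
```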
@BrettDong excellent! I tested this and the times are back to decent levels: the entire …
The fix in #1246 reduces the stalled time down to 3 seconds. During the 3 seconds the workflow is stalled on loading the …
I have had a similar issue with large runners, with slow checkout and cleanup, that I reported to GitHub Support. They concluded that it is related to this issue, even though I am not completely convinced. The screenshot from @Tyrrrz earlier in this issue also shows a slow post-checkout (cleanup).

Workflow: To recreate as a minimal reproduction, I created a new repository with only a single workflow file, spinning up 3 jobs on default and large runners (a minimal sketch of such a workflow is shown after this comment). In addition to checkout, the job has only one other step, and that is to sleep for 15 seconds.

Results: The jobs were executed 10 times, and the runs show that:

Findings:
Finding 1: Every single checkout on the large runners is at least twice as slow as on a regular runner, and all of the extra time goes by before the actual checkout starts.
Finding 2: The post-checkout (cleanup) is on average 15 times slower than on a regular runner, and all of the time also goes by before any cleanup is started.
Finding 3: The simple sleep task on the regular runner uses twice the time of the sleep interval. How is it even possible that a sleep for 15 seconds takes almost double the time? This was done with a simple run: …
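A minimal sketch of the kind of workflow described above; the large-runner label `windows-latest-8-cores` is an assumption, so substitute whatever label your organization uses.

```yaml
jobs:
  timing:
    strategy:
      matrix:
        # 'windows-latest-8-cores' is a hypothetical large-runner label; substitute your own.
        runner: [windows-latest, windows-latest-8-cores]
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v3
      - name: Sleep 15 seconds
        shell: pwsh
        run: Start-Sleep -Seconds 15
```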
You can build something like https://github.com/BrettDong/Cataclysm-DDA/blob/etw/.github/workflows/etw.yml to collect ETL traces in the runner, to diagnose what is happening and where time is being spent during checkout.
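A rough sketch of what such a tracing wrapper might look like; this is not the linked etw.yml, and the WPR profile, trace path, and artifact upload step are assumptions.

```yaml
jobs:
  trace-checkout:
    runs-on: windows-latest  # or a large Windows runner label
    steps:
      - name: Start ETW tracing
        shell: pwsh
        run: wpr -start CPU -filemode
      - uses: actions/checkout@v3
      - name: Stop ETW tracing and save the trace
        shell: pwsh
        run: wpr -stop "$env:RUNNER_TEMP\checkout-trace.etl"
      - name: Upload trace
        uses: actions/upload-artifact@v3
        with:
          name: checkout-etw-trace
          path: ${{ runner.temp }}\checkout-trace.etl
```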
This seems like an interesting point in and of itself, because in actions/runner-images#7320, whilst we report … So if I understand your conclusion, narrowing this down to a disk-access issue (or something that limits the effective disk-access rate) is consistent with what we are seeing.
Same here - we've noticed that large GitHub-managed Windows runners are substantially slower during checkout. This is not a recent regression for us though - they've been (nearly unusably) slow for months.
We also have a ticket with GitHub Support, and I've been running experiments for our repo / workflows at iree-org/iree#12051.
…/github dependency (#1246)

* Improve checkout performance on Windows runners by upgrading @actions/github dependency

Re: #1186

@dscho discovered that the checkout action could stall for a considerable amount of time on Windows runners, waiting for PowerShell invocations made from the 'windows-release' npm package to complete.

I then studied the dependency chain to figure out where 'windows-release' was imported:

'@actions/checkout'@main <- '@actions/github'@2.2.0 <- '@octokit/endpoint'@6.0.1 <- '@octokit/graphql'@4.3.1 <- '@octokit/request'@5.4.2 <- '@octokit/rest'@16.43.1 <- 'universal-user-agent'@4.0.1 <- 'os-name'@3.1.0 <- 'windows-release'@3.1.0

The 'universal-user-agent' package dropped its dependency on 'os-name' in https://github.com/gr2m/universal-user-agent/releases/tag/v6.0.0. '@actions/github' v3 removed the dependency on '@octokit/rest'@16.43.1 and allows users to move away from the old 'universal-user-agent' v4 (actions/toolkit#453).

This pull request updates the version of '@actions/github' used in the checkout action to avoid importing 'windows-release'. Based on testing in my own repositories, I can see an improvement: the wait time between entering the checkout action and git actually starting to do useful work is reduced.

* Update .licenses

* Rebuild index.js
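As a quick, hedged verification that the offending package is gone from the dependency tree after this upgrade, something like the following could be run in a clone of the action; this step is not part of the PR itself.

```yaml
# Hypothetical verification step (can equally be run locally in a clone of the action).
- name: Check whether windows-release is still in the dependency tree
  shell: bash
  run: |
    npm ci
    # Non-empty output here means the slow PowerShell code path is still reachable.
    npm ls windows-release || echo "windows-release is no longer in the dependency tree"
```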
Great news - build times on the Windows environment are back to normal! Thanks @BrettDong and @fhammerl
FYI the issue that @ScottTodd and I are seeing on the large Windows managed runners was not fixed by this update. We tested it prior to release at the direction of GitHub support:
Seems like it may be a separate issue, but just wanted to call it out since these issues seem like they were maybe merged. Seems like this is also something that @Gakk is hitting. My understanding from support is that they're still investigating this other problem. It may be worth opening a separate issue for this or leaving this open.
@GMNGeoffrey, I have done extensive testing and confirmed that my issues were resolved by actions/checkout version 3.5.1.
Yep, I'm still seeing substantially slower checkouts on large runners (could break that out into a different issue, and we have a support ticket for it). Latest experiments on iree-org/iree#12051, logs at https://github.com/openxla/iree/actions/runs/4748455722/jobs/8434667258. Our repo depends on https://github.com/llvm/llvm-project/ (very large) and a few other submodules, and just the checkout takes ~1 minute on small Windows runners but 7+ minutes on large Windows runners. We've tried all sorts of ways to change the git commands used (sparse checkouts, shallow clones, caching of git files, etc.) but can't get past whatever the differences are between the runners themselves.
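For context, the experiments above are roughly variations on checkout inputs like the following; the exact inputs used in the linked runs may differ.

```yaml
- uses: actions/checkout@v3
  with:
    fetch-depth: 1          # shallow clone of the main repository
    submodules: recursive   # also fetch llvm-project and the other submodules
```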
As mentioned in a comment above, can you try this? https://github.com/BrettDong/Cataclysm-DDA/blob/etw/.github/workflows/etw.yml I don't have access to larger runners currently to test myself.
Just a wild guess: could it be that large runners have slower D: drives than smaller runners? IIRC the hosted runners specifically have very fast D: drives.
I've just verified that the issue is not with the runners themselves, but rather with actions/checkout. Using just the normal bash commands, I did a full checkout with submodules in 1m30s, compared to almost 10m previously. By dropping …
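A hedged sketch of what such a plain-git checkout might look like as a workflow step; the OWNER/REPO placeholder and the exact flags are assumptions, since the original commands were not posted.

```yaml
- name: Manual checkout with submodules
  shell: bash
  run: |
    # Fetch only the commit being built, then initialize submodules shallowly.
    # OWNER/REPO is a placeholder; a private repository would also need token-based auth.
    git init .
    git remote add origin "https://github.com/OWNER/REPO"
    git fetch --depth 1 origin "$GITHUB_SHA"
    git checkout FETCH_HEAD
    git submodule update --init --recursive --depth 1 --jobs 4
```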
You can collect ETW traces to help diagnose what's happening and what is taking time during the checkout action.
We have found the GitHub actions built-in caching mechanism to be extremely limiting: slow, small, and buggy. Switch instead to using our own remote ccache hosted on GCS. This matches our Linux builds on our self-hosted runners except that we have to do GCS auth through service account keys, unfortunately, which means that access is restricted to postsubmit runs. Luckily, for these builds we're generally doing everything in one job and just want caching (which we only write on postsubmit anyway) and don't need artifact storage (which we'd need on presubmit too).

Tested: Ran on this PR (hacked the workflow a bit). An [initial run](https://github.com/openxla/iree/actions/runs/4750257226/jobs/8438272681) with an empty cache took 28m total, 15.5m of which was in the build step. This includes writing the remote cache (minor overhead). A [rerun](https://github.com/openxla/iree/actions/runs/4750257226/jobs/8438619413) with a now populated cache took 14m total, 6.5m of which was in the build step. 79% of compiler calls were cacheable and of those 99% were remote cache hits.

Contrast with a [recent post-submit run](https://github.com/openxla/iree/actions/runs/4748717136/jobs/8435229260) that ran on a docs-only change (so should've had a maximally populated cache), which took 20m, 7m of which was the build step, 2m of which was fetching the cache, and 1m of which was saving the cache. That's setting aside [runs like this one](https://github.com/openxla/iree/actions/runs/4741863995/jobs/8419465087) where fetching the cache just times out entirely (with no alerting other than if you happen to look at the UI). Tragically, most of the time in all of these jobs is spent just checking out the repository and submodules (see actions/checkout#1186).

Overall this seems like a marked improvement. The main wins are in avoiding tons of complexity futzing with cache compression levels and restoring and saving the cache (actual cached build time is ~unchanged). Part of #13028

skip-ci: Windows builds don't run on presubmit
Yeah, but presumably so can the GitHub engineers who, support says, are working to fix this. Like, IDK, it kind of seems to me that the people who wrote this code, control these VMs, and whom we are paying for this service could maybe take a look at the issues with it.
@GMNGeoffrey I would like to encourage you to consider the current macro-economic climate, and also just how large the roadmap is. And please also note how @BrettDong's effort was rewarded: I am sure that you also will get what you want much quicker if you dig a little deeper with those ETW traces. I would even consider helping, but I do not have access to those large runners; you do, though.
Ok, that first update was all kinds of wrong, let me try again! Sorry we haven't commented on this ticket still; we are tracking this internally. We have made some changes to the Windows VM image, but only recently, and they don't appear to have helped. With everything else going on we have had to put this one aside as well for the last couple of weeks, but we are committed to fixing this. I will re-open this ticket as it is linked in the issue we are tracking internally :)
Thanks Ben. Further runs suggest that my switch to use git commands directly instead of actions/checkout was just lucky the first few times (or the computer learned what I was trying to do and put a stop to it 😛). Subsequent runs have had similar latency to before the switch, I think (I started hacking together a script to collect statistics for jobs over time, but got side-tracked, so pure anecdata right now). So I'm back to thinking it's the VM+git itself and not the action. I am sort of considering getting tarballs for all of the submodules instead of using git... I'll update if that seems to be faster somehow (which would suggest to me something git-specific and not just IO or network issues)
Thanks @GMNGeoffrey for having a go! (and sorry that the computers are learning 😆 ) Let me know if it's faster and we will hopefully have our focus back on this in the next week or so as things settle (also turns out I am not a maintainer on this repo, I will find someone to re-open this for me :D)
Just a reminder, given the evidence in actions/runner-images#7320, there is almost certainly an underlying issue that is not specific to …
Re-opening at @nebuk89's request, so we can track our efforts externally as we investigate further. Some valuable context has built up in this thread 😄.
I'm not using these runners, but if the OS is showing high CPU on disk access, perhaps it's due to a Host Disk Caching setting set by Azure on the VM disk (see https://learn.microsoft.com/en-us/azure/virtual-machines/disks-performance): while host disk caching benefits some access modes, it can also add a penalty.
Not sure if it helps, since we are running self-hosted GitLab, but I started looking for a solution because our Windows runners are incredibly slow. Simple build jobs (for instance, just running MSBuild on a solution) that finish in less than 1 minute when run manually on the same machine take over an hour when run as a gitlab-runner job. The very same script is executed, with no manual deviation between the two procedures. Further potentially helpful details: …
It almost sounds like the runner is slow at, or blocking on, reading stderr/stdout from the console windows of the processes it has launched, which in turn blocks those processes from advancing.
We are discontinuing our use of GH-managed Windows runners. The costs were already beyond premium/sustainable, and the performance is so poor that the issue compounds out of control. I don't consider this a viable way to run CI for any business. I can tolerate a lot, but not at massively inflated prices.
Description
For the past few days, the duration of the `actions/checkout@v3` steps on `windows-2019` has dramatically increased. This behavior is seen on all my repos (all private). Below is a table showing an example of before/after.

| Before | After |
| ------ | ----- |
| 13s    | 1m35s |
| 8s     | 47s   |
The result is a huge increase in build (and billable) time.
The GitHub status page does show some issues around this time frame, but these were all resolved:
Platforms affected
Runner images affected
Image version and build link
Private repo
Is it regression?
Yes, sorry private repos
Expected behavior
The build times should be fairly constant.
Actual behavior
Build times explode, burning through our build minutes too fast.
Repro steps
Compare build times on any Windows environment from before Feb 18th with today.
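A minimal workflow for observing the step duration might look like the sketch below; simply compare the reported checkout-step time across runs from before and after Feb 18th.

```yaml
jobs:
  checkout-timing:
    runs-on: windows-2019
    steps:
      - uses: actions/checkout@v3
```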