Checkout suddenly much slower on windows environment #1186

Open · yeu-degroot opened this issue Feb 23, 2023 · 30 comments
@yeu-degroot

Moved from actions/runner-images#7180

Description

For the past few days, the duration of the actions/checkout@v3 steps on windows-2019 has dramatically increased.
This behavior is seen on all my repos (all private). Below is a table showing an example of before/after.

| Checkout step | Feb 17th | Feb 21st |
| --- | --- | --- |
| 1 | 13s | 1m35s |
| 2 | 8s | 47s |

The result is a huge increase in build (and billable) time.

[screenshots of the affected workflow runs]

The GitHub status page does show some issues around this time frame, but these were all resolved:

[screenshot of the GitHub status page]

Platforms affected

  • Azure DevOps
  • GitHub Actions - Standard Runners
  • GitHub Actions - Larger Runners

Runner images affected

  • Ubuntu 18.04
  • Ubuntu 20.04
  • Ubuntu 22.04
  • macOS 11
  • macOS 12
  • Windows Server 2019
  • Windows Server 2022

Image version and build link

Current runner version: '2.301.1'
Operating System
  Microsoft Windows Server 2019
  10.0.17763

Private repo

Is it a regression?

Yes (sorry, the repos are private).

Expected behavior

The build times should be fairly constant.

Actual behavior

Build times explode, burning through our build minutes too fast.

Repro steps

Compare build times in any Windows environment from before Feb 18th with those from today.

@Tyrrrz

Tyrrrz commented Mar 1, 2023

I have observed this too. Example job: https://github.com/Tyrrrz/CliWrap/actions/runs/4271628026

  • Windows – checkout took 4m9s:

image

  • Ubuntu – checkout took <1s:

image

  • macOS – checkout took 3s:

image

Note that this behavior is pretty inconsistent, and usually checkout completes quickly on Windows. But when it is slow, it is only ever slow on Windows.

@dscho
Contributor

dscho commented Mar 3, 2023

I've encountered this and debugged it a bit by heavily instrumenting the transpiled code in dist/index.js. It turns out that there are multiple PowerShell calls that make a mess by essentially waiting for dozens of seconds. Here is the example run I analyzed:

  • here, 1 minute and 14 seconds are spent,
  • here, 11 seconds are spent,
  • here, 12 seconds are spent,
  • here, 11 seconds are spent,

all on this PowerShell call:

(Get-CimInstance -ClassName Win32_OperatingSystem).caption

totaling a whopping 1 minute and 48 seconds just to determine a Windows release that the actions/checkout Action has no need to know.

To make things even worse, the same PowerShell invocation also happens multiple times during the post-action.

I do not really understand what changed between Feb 17th and 21st that would explain this slowdown. There has not been a new actions/checkout release since Jan 5th...
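
For anyone who wants to reproduce the measurement on a runner, a minimal diagnostic step along these lines (a sketch; the step itself is not part of actions/checkout, while Measure-Command and Get-CimInstance are standard PowerShell) times the same query a few times:

```yaml
# Hypothetical diagnostic step: time the same OS-version query that the action triggers.
# Repeating it shows whether only the first CIM query on a fresh runner is slow.
- name: Time the Win32_OperatingSystem query
  if: runner.os == 'Windows'
  shell: pwsh
  run: |
    foreach ($i in 1..3) {
      $elapsed = Measure-Command {
        (Get-CimInstance -ClassName Win32_OperatingSystem).Caption
      }
      Write-Host "Attempt ${i}: $($elapsed.TotalSeconds) seconds"
    }
```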

@dscho
Contributor

dscho commented Mar 3, 2023

I have briefly seen success trying to hard-code the fix from sindresorhus/windows-release#18 into dist/index.js, but it does not seem to work once I undelete all the core logic. Sigh.

At this point, I have to tend to other things, but I thought I'd leave my current findings here in case somebody else can take over (and try things like overriding os.release() to always return 2022 or something like that, before letting @octokit/endpoint initialize its User-Agent string).

@BrettDong
Contributor

BrettDong commented Mar 26, 2023

I made an attempt to avoid calling into windows-release from the checkout action in #1246.

If you want to try it, you can use

- uses: BrettDong/checkout@octokit

and see if there is any improvement in time stalled between starting the checkout action and performing actual git operations.

@dscho
Contributor

dscho commented Mar 26, 2023

@BrettDong excellent! I tested this and the times are back to decent levels: the entire checkout step takes 11 seconds with your PR. Thank you so much!

@BrettDong
Contributor

The fix in #1246 reduces the stalled time to 3 seconds.

During those 3 seconds, the workflow is stalled on loading node.exe from disk into memory for execution. I don't think there is anything I can do to get rid of that.

@Gakk

Gakk commented Mar 27, 2023

I have had a similar issue on large runners, with slow checkout and cleanup, which I reported to GitHub Support. They concluded that it is related to this issue, even though I am not completely convinced.

The screenshot from @Tyrrrz earlier in this issue also shows a slow post checkout (cleanup).

Workflow:

To recreate this as a minimal example, I created a new repository with only a single workflow file, spinning up 3 jobs on default and large runners:

name: Demo Slow Checkout

on:
  workflow_dispatch:

permissions:
  contents: read

jobs:

  doSomething:
  
    name: My Job

    strategy:
      matrix:
        os: [windows-latest, windows-latest-8core, windows-latest-16core]

    runs-on: ${{ matrix.os }}

    steps:
    - name: Checkout
      uses: actions/checkout@8f4b7f84864484a7bf31766abe9204da3cbe65b3 # v3.5.0
      # https://github.com/actions/checkout/releases
      with:
        # Fetch full depth
        fetch-depth: 0

    - name: Work for 15 seconds
      run: Start-Sleep -Seconds 15

In addition to checkout, the job has only one other step: sleeping for 15 seconds.

Results:

The jobs were executed 10 times, and the results show that:

  • Checkout
    • Normal time is 19 seconds
    • Large runners take more than twice as long (between 47 and 51 seconds)
  • Post Checkout
    • Normal time is 3 seconds
    • Large runners take 15 times longer (between 46 and 49 seconds)

[screenshot of the timing results]
ℹ️ The last row is the median, not the average, so that no single slow run skews the result.

Findings:

Finding 1:

Every single checkout on the large runners is at least twice as slow as on the regular runner, and all of the extra time goes before the actual checkout starts:
[screenshot]

Finding 2:

The post checkout (cleanup) is on average 15 times slower than on the regular runner, and here too all of the extra time goes before any cleanup is started:
[screenshot]

Finding 3:

The simple sleep task on the regular runner uses twice the time of the sleep interval. How is it even possible that a 15-second sleep takes almost double the time? This was done with a simple run: Start-Sleep -Seconds 15.
[screenshot]

@BrettDong
Contributor

You can build something like https://github.com/BrettDong/Cataclysm-DDA/blob/etw/.github/workflows/etw.yml to collect ETL traces on the runner and diagnose what is happening and where the time is being spent during checkout.
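
For reference, a trimmed-down sketch of that idea looks roughly like the following; the profile name, file paths, and step layout here are illustrative and not taken from the linked workflow:

```yaml
# Illustrative sketch: record an ETW trace around checkout and keep it as an artifact.
- name: Start ETW recording
  shell: pwsh
  run: wpr -start GeneralProfile -filemode

- name: Checkout
  uses: actions/checkout@v3
  with:
    fetch-depth: 0

- name: Stop ETW recording
  if: always()
  shell: pwsh
  run: wpr -stop "$env:RUNNER_TEMP\checkout.etl"

- name: Upload trace
  if: always()
  uses: actions/upload-artifact@v3
  with:
    name: checkout-etw-trace
    path: ${{ runner.temp }}/checkout.etl
```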

@myitcv

myitcv commented Mar 27, 2023

@BrettDong

> During the 3 seconds the workflow is stalled on loading the node.exe from disk to memory for execution. I don't think there is anything I can do to get rid of it.

This seems like an interesting point in and of itself, because in actions/runner-images#7320, whilst we report actions/checkout issues, we see much bigger problems during workflows that rely heavily on disk access (i.e. cache hits during test runs).

So if I understand your conclusion correctly, namely that this narrows down to a disk access issue (or something that limits the effective disk access rate), it is consistent with what we are seeing.

@ScottTodd

> I have had a similar issue with large runners with slow checkout and cleanup that I reported to GitHub Support. They concluded that it is related to this issue, even though I am not completely convinced.

Same here - we've noticed that large GitHub-managed Windows runners are substantially slower during checkout. This is not a recent regression for us, though - they've been (nearly unusably) slow for months.

| Runner | Checkout time |
| --- | --- |
| Self-hosted Linux | 40s |
| Small GitHub-managed Windows (windows-2022) | 2m30s |
| Large GitHub-managed Windows | 12m |

We also have a ticket with GitHub Support, and I've been running experiments for our repo / workflows at iree-org/iree#12051.

fhammerl pushed a commit that referenced this issue Apr 12, 2023: Improve checkout performance on Windows runners by upgrading @actions/github dependency (#1246)

* Improve checkout performance on Windows runners by upgrading @actions/github dependency

Re: #1186

@dscho discovered that the checkout action could stall for a
considerable amount of time on Windows runners waiting for PowerShell
invocations made from 'windows-release' npm package to complete.

Then I studied the dependency chain to figure out where
'windows-release' was imported:

'@actions/checkout'@main
  <- '@actions/github'@2.2.0
    <- '@octokit/endpoint'@6.0.1
    <- '@octokit/graphql'@4.3.1
    <- '@octokit/request'@5.4.2
    <- '@octokit/rest'@16.43.1
      <- 'universal-user-agent'@4.0.1
        <- 'os-name'@3.1.0
          <- 'windows-release'@3.1.0

'universal-user-agent' package dropped its dependency on 'os-name' in
https://github.com/gr2m/universal-user-agent/releases/tag/v6.0.0 .

'@actions/github' v3 removed dependency on '@octokit/rest'@16.43.1 and
allows users to move away from the old 'universal-user-agent' v4.
(actions/toolkit#453)

This pull request attempts to update the version of '@actions/github'
used in the checkout action to avoid importing 'windows-release'.

Based on testing in my own repositories, I can see an improvement: reduced wait time between entering the checkout action and git actually starting to do useful work.

* Update .licenses

* Rebuild index.js
@yeu-degroot
Author

Great news - build times in the Windows environment are back to normal!
This was fixed in actions/checkout@3.5.1.

Thanks @BrettDong and @fhammerl

[screenshot]

@GMNGeoffrey

FYI the issue that @ScottTodd and I are seeing on the large Windows managed runners was not fixed by this update. We tested it prior to release at the direction of GitHub Support.

It seems like it may be a separate issue, but I just wanted to call it out since these issues seem like they were maybe merged. It seems like this is also something that @Gakk is hitting. My understanding from support is that they're still investigating this other problem. It may be worth opening a separate issue for this or leaving this one open.

@ofek

ofek commented Apr 19, 2023

#1246 (comment)

try: #1186 (comment)

@Gakk

Gakk commented Apr 19, 2023

> Seems like this is also something that @Gakk is hitting.

@GMNGeoffrey, I have done extensive testing and confirmed that my issues were resolved by actions/checkout version 3.5.1.

@ScottTodd

Yep, I'm still seeing substantially slower checkouts on large runners (could break that out into a different issue, and we have a support ticket for it). Latest experiments on iree-org/iree#12051, logs at https://github.com/openxla/iree/actions/runs/4748455722/jobs/8434667258. Our repo depends on https://github.com/llvm/llvm-project/ (very large) and a few other submodules, and just the checkout takes ~1 minute on small Windows runners but 7+ minutes on large Windows runners. We've tried all sorts of ways to change the git commands used (sparse checkouts, shallow clones, caching of git files, etc.) but can't get past whatever the differences are between the runners themselves.
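
For readers trying similar mitigations, the shallow-clone variant mentioned above can be sketched like this (illustrative only; the depth values are arbitrary and this does not address the underlying runner slowness):

```yaml
# Sketch: shallow clone of the main repository, then a shallow submodule update,
# to minimize the amount of git data transferred and written to disk.
- name: Checkout (shallow, no submodules)
  uses: actions/checkout@v3
  with:
    fetch-depth: 1

- name: Shallow-update submodules
  shell: bash
  run: git submodule update --init --recursive --depth 1
```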

@ofek

ofek commented Apr 19, 2023

As mentioned in a comment above, can you try this? https://github.com/BrettDong/Cataclysm-DDA/blob/etw/.github/workflows/etw.yml

I don't have access to larger runners currently to test myself.

@dscho
Contributor

dscho commented Apr 19, 2023

> I'm still seeing substantially slower checkouts on large runners

Just a wild guess: could it be that large runners have slower D: drives than smaller runners? IIRC the hosted runners specifically have very fast D: drives.
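
One crude way to check that guess without full ETW traces would be a throughput comparison step run on both runner sizes; this is only a sketch, the 256 MiB size and root-level paths are arbitrary, and the read timing is likely flattered by the OS cache:

```yaml
# Crude disk check (sketch): time writing and re-reading a 256 MiB file on C: and D:.
- name: Compare drive throughput
  shell: pwsh
  run: |
    $data = New-Object byte[] (256MB)
    foreach ($drive in 'C:\', 'D:\') {
      if (-not (Test-Path $drive)) { continue }   # skip drives that do not exist
      $file = Join-Path $drive 'disk-speed-test.bin'
      $write = Measure-Command { [System.IO.File]::WriteAllBytes($file, $data) }
      $read  = Measure-Command { $null = [System.IO.File]::ReadAllBytes($file) }
      Remove-Item $file
      Write-Host "${drive} write: $($write.TotalSeconds)s, read: $($read.TotalSeconds)s"
    }
```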

@GMNGeoffrey

I've just verified that the issue is not with the runners themselves, but rather with actions/checkout. Using just the normal bash commands, I did a full checkout with submodules in 1m30s, compared to almost 10m previously. By dropping actions/checkout and actions/cache and rolling our own pretty unsophisticated solutions, I've been able to drop a maximally cached build from 20m (and that's when fetching the cache doesn't just fail entirely) to 9m. That is just really pretty sad.
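
A rough sketch of that kind of bash-only replacement is below; the token wiring uses the standard GITHUB_TOKEN and default environment variables, but the exact commands used here are not in this thread, so treat it purely as an illustration:

```yaml
# Sketch of a checkout without actions/checkout: plain git commands in a bash step.
# GITHUB_REPOSITORY and GITHUB_SHA are default runner environment variables.
- name: Manual checkout with submodules
  shell: bash
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    git init .
    git remote add origin "https://x-access-token:${GH_TOKEN}@github.com/${GITHUB_REPOSITORY}.git"
    git fetch --depth 1 origin "${GITHUB_SHA}"
    git checkout FETCH_HEAD
    git submodule update --init --recursive --depth 1
```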

@BrettDong
Contributor

You can collect ETW traces to help diagnose what is happening and taking time during the checkout action.

GMNGeoffrey added a commit to iree-org/iree that referenced this issue Apr 20, 2023
We have found the GitHub actions built-in caching mechanism to be
extremely limiting: slow, small, and buggy. Switch instead to using our
own remote ccache hosted on GCS. This matches our Linux builds on our
self-hosted runners except that we have to do GCS auth through service
account keys, unfortunately, which means that access is restricted to
postsubmit runs. Luckily, for these builds we're generally doing
everything in one job and just want caching (which we only write on
postsubmit anyway) and don't need artifact storage (which we'd need on
presubmit too).

Tested:
Ran on this PR (hacked the workflow a bit). An
[initial
run](https://github.com/openxla/iree/actions/runs/4750257226/jobs/8438272681)
with an empty cache took 28m total, 15.5m of which was in the build
step. This includes writing the remote cache (minor overhead). A

[rerun](https://github.com/openxla/iree/actions/runs/4750257226/jobs/8438619413)
with a now populated cache took 14m total, 6.5m of which was in the
build step. 79% of compiler calls were cacheable and of those 99%
were remote cache hits. Contrast with a
[recent post-submit
run](https://github.com/openxla/iree/actions/runs/4748717136/jobs/8435229260)
that ran on a docs-only change (so should've had a maximally populated
cache), which took 20m, 7m of which was the build step, 2m of which was
fetching the cache, and 1m of which was saving the cache. That's
setting aside
[runs like this
one](https://github.com/openxla/iree/actions/runs/4741863995/jobs/8419465087)
where fetching the cache just times out entirely (with no alerting
other than if you happen to look at the UI).

Tragically, most of the time in all of these jobs is spent just
checking out the repository and submodules (see
actions/checkout#1186).

Overall this seems like a marked improvement. The main wins are in
avoiding tons of complexity futzing with cache compression levels and
restoring and saving the cache (actual cached build time is
~unchanged).

Part of #13028

skip-ci: Windows builds don't run on presubmit
@GMNGeoffrey

> You can collect ETW traces to help diagnose what's happening and taking time during checkout action.

> As mentioned in a comment above, can you try this? https://github.com/BrettDong/Cataclysm-DDA/blob/etw/.github/workflows/etw.yml
>
> I don't have access to larger runners currently to test myself.

Yeah, but presumably so can the GitHub engineers who support says are working to fix this. Like, IDK, it kind of seems to me that the people who wrote this code, control these VMs, and whom we are paying for this service, could maybe take a look at the issues with it.

@dscho
Contributor

dscho commented Apr 21, 2023

@GMNGeoffrey I would like to encourage you to consider the current macro-economic climate, and also just how large the roadmap is. And please also note how @BrettDong's effort was rewarded: I am sure that you, too, will get what you want much quicker if you dig a little deeper with those ETW traces. I would even consider helping, but I do not have access to those large runners; you do, though.

DymysClaflin3 pushed a commit to DymysClaflin3/checkout that referenced this issue Apr 29, 2023: Improve checkout performance on Windows runners by upgrading @actions/github dependency (#1246)
jpienaar pushed a commit to iree-org/iree that referenced this issue May 1, 2023
@nebuk89

nebuk89 commented May 22, 2023

👋 Hey, sorry for not seeing this one. We are tracking/providing commentary on our investigation into the Windows larger-runner checkout here: oops, that's the internal issue and no good to anyone.

Ok, that first update was all kinds of wrong, let me try again!

Sorry we still haven't commented on this ticket; we are tracking this internally. We have made some changes to the Windows VM image, but only recently, and they don't appear to have helped. With everything else going on we have had to put this one aside for the last couple of weeks, but we are committed to fixing this. I will re-open this ticket, as it is linked in the issue we are tracking internally :)

@GMNGeoffrey

Thanks Ben. Further runs suggest that my switch to use git commands directly instead of actions/checkout was just lucky the first few times (or the computer learned what I was trying to do and put a stop to it 😛). Subsequent runs have had similar latency to before the switch, I think (I started hacking together a script to collect statistics for jobs over time, but got side-tracked, so pure anecdata right now). So I'm back to thinking it's the VM+git itself and not the action. I am sort of considering getting tarballs for all of the submodules instead of using git... I'll update if that seems to be faster somehow (which would suggest to me something git-specific and not just IO or network issues)
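
For a single submodule, the tarball idea could be sketched as below; the path, repository, and commit SHA are placeholders, and note that the result is a plain source tree rather than a git checkout, so anything expecting a .git directory inside the submodule will not work:

```yaml
# Sketch: fetch one submodule as a tarball instead of cloning it with git.
# The submodule path, repository, and commit SHA below are placeholders.
- name: Fetch submodule tarball
  shell: bash
  run: |
    SUBMODULE_PATH="third_party/llvm-project"
    SUBMODULE_REPO="llvm/llvm-project"
    SUBMODULE_SHA="0123456789abcdef0123456789abcdef01234567"
    mkdir -p "${SUBMODULE_PATH}"
    curl -sSL "https://github.com/${SUBMODULE_REPO}/archive/${SUBMODULE_SHA}.tar.gz" \
      | tar -xz --strip-components=1 -C "${SUBMODULE_PATH}"
```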

@nebuk89

nebuk89 commented May 23, 2023

Thanks @GMNGeoffrey for having a go! (And sorry that the computers are learning 😆.) Let me know if it's faster, and we will hopefully have our focus back on this in the next week or so as things settle. (Also, it turns out I am not a maintainer on this repo, so I will find someone to re-open this for me :D)

@myitcv

myitcv commented May 23, 2023

Just a reminder: given the evidence in actions/runner-images#7320, there is almost certainly an underlying issue that is not specific to git at all, but rather affects all operations that place high demands on disk (and CPU).

@benwells

Re-opening at @nebuk89's request, so we can track our efforts externally as we investigate further. Some valuable context has built up in this thread 😄.

@benwells benwells reopened this May 23, 2023
elopez added a commit to crytic/medusa that referenced this issue Jun 12, 2023
elopez added a commit to crytic/medusa that referenced this issue Jun 12, 2023
NatashaKnk pushed a commit to NatashaKnk/iree that referenced this issue Jul 6, 2023
@Plasma

Plasma commented Aug 30, 2023

I'm not using these runners, but if the OS is showing high CPU on disk access, perhaps it's due to a host disk caching setting applied by Azure to the VM disk (see https://learn.microsoft.com/en-us/azure/virtual-machines/disks-performance): while host disk caching benefits some modes, it can also add a penalty.

@StephanKirsch

Not sure if this helps, since we are running self-hosted GitLab, but I started looking for a solution because our Windows runners are incredibly slow. Simple build jobs (for instance, just running MSBuild on a solution) that finish in less than 1 minute when run manually on the same machine take over an hour when run as a gitlab-runner job. The very same script is executed, with no manual deviation between the two procedures. Further potentially helpful details:

  • We have runners on virtual machines and on physical ones - it makes no difference, both are affected
  • Runners use Windows 10
  • There is no load to be seen on CPU, memory, or disk; everything is silent and idle
  • So far I have failed to pinpoint the delays to any particular step. E.g. even the output of MSBuild is tremendously slowed down: lines that normally appear within fractions of a second trickle in only every other minute

@Plasma

Plasma commented Mar 27, 2024

It almost sounds like the runner is slow at (or blocking on) reading stderr/stdout from the console windows of the processes it has launched, which blocks those processes from advancing.

@stellaraccident

We are discontinuing our use of GitHub-managed Windows runners. The costs were already beyond premium/sustainable, and the performance is so poor that the issue compounds out of control. I don't consider this a viable way to run CI for any business.

I can tolerate a lot but not at massively inflated prices.
