Use GCS for Windows ccache #13183

Merged: GMNGeoffrey merged 8 commits into main from gcmn-gcs-cache on Apr 20, 2023

@GMNGeoffrey GMNGeoffrey commented Apr 20, 2023

We have found the GitHub Actions built-in caching mechanism to be
extremely limiting: slow, small, and buggy. Switch instead to using our
own remote ccache hosted on GCS. This matches our Linux builds on our
self-hosted runners, except that we unfortunately have to do GCS auth
through service account keys, which means that access is restricted to
postsubmit runs. Luckily, for these builds we're generally doing
everything in one job and just want caching (which we only write on
postsubmit anyway) and don't need artifact storage (which we'd need on
presubmit too).
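
For illustration only (this is not the workflow change in this PR), the postsubmit-only gating could be sketched as below. It assumes ccache's remote storage setting (`CCACHE_REMOTE_STORAGE`, with `|`-separated attributes such as `bearer-token` for the HTTP backend) and that the bucket is reachable through the `storage.googleapis.com` HTTP endpoint. The function name, bucket name, and token handling are all hypothetical.

```python
from typing import Dict, Optional


def ccache_remote_env(bucket: str, is_postsubmit: bool,
                      bearer_token: Optional[str] = None) -> Dict[str, str]:
    """Return extra environment variables for a GCS-backed remote ccache.

    Hypothetical helper: the bucket layout and the use of a bearer token
    derived from a service account key are assumptions, not the PR's code.
    """
    # Without credentials (e.g. on presubmit) we cannot reach the bucket,
    # so fall back to whatever local ccache configuration already exists.
    if not (is_postsubmit and bearer_token):
        return {}

    # ccache's HTTP backend takes a URL followed by `|key=value` attributes.
    url = f"http://storage.googleapis.com/{bucket}/ccache"
    return {"CCACHE_REMOTE_STORAGE": f"{url}|bearer-token={bearer_token}"}


if __name__ == "__main__":
    # "example-bucket" and "TOKEN" are placeholders.
    print(ccache_remote_env("example-bucket", True, "TOKEN"))
```

The returned mapping would be merged into the build step's environment; an empty mapping leaves the build uncached remotely rather than failing.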

Tested:
Ran on this PR (hacked the workflow a bit). An
[initial run](https://github.com/openxla/iree/actions/runs/4750257226/jobs/8438272681)
with an empty cache took 28m total, 15.5m of which was in the build
step. This includes writing the remote cache (minor overhead). A
[rerun](https://github.com/openxla/iree/actions/runs/4750257226/jobs/8438619413)
with a now populated cache took 14m total, 6.5m of which was in the
build step. 79% of compiler calls were cacheable and of those 99%
were remote cache hits. Contrast with a
[recent post-submit run](https://github.com/openxla/iree/actions/runs/4748717136/jobs/8435229260)
that ran on a docs-only change (so should've had a maximally populated
cache), which took 20m, 7m of which was the build step, 2m of which was
fetching the cache, and 1m of which was saving the cache. That's
setting aside
[runs like this one](https://github.com/openxla/iree/actions/runs/4741863995/jobs/8419465087)
where fetching the cache just times out entirely (with no alerting
other than if you happen to look at the UI).
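
To make the comparison easier to eyeball, here is a small sketch that just redoes the arithmetic on the numbers quoted above. The run labels and helper name are made up for this example, and the GCS runs have no separate cache fetch/save time because that cost is folded into their build step.

```python
def non_build_minutes(total: float, build: float, cache_io: float = 0.0) -> float:
    """Time spent outside the build step and outside explicit cache steps."""
    return total - build - cache_io


# (total, build, explicit cache fetch + save), all in minutes, from the text.
runs = {
    "gcs_cold": (28.0, 15.5, 0.0),
    "gcs_warm": (14.0, 6.5, 0.0),
    "actions_cache_warm": (20.0, 7.0, 2.0 + 1.0),
}

# Remaining time per run, which is mostly checkout and setup.
remainder = {name: non_build_minutes(*t) for name, t in runs.items()}

# Share of *all* compiler calls served by the remote cache on the warm run:
# 79% were cacheable, and 99% of those were remote hits.
remote_hit_share = 0.79 * 0.99

# Wall-clock saving of the warm GCS run over the warm Actions-cache run.
saving = runs["actions_cache_warm"][0] - runs["gcs_warm"][0]

if __name__ == "__main__":
    print(remainder, round(remote_hit_share, 4), saving)
```

So roughly 78% of all compiler invocations were served from the remote cache, and the warm GCS run was 6 minutes shorter end to end than the warm Actions-cache run.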

Tragically, most of the time in all of these jobs is spent just
checking out the repository and submodules (see
actions/checkout#1186).

Overall this seems like a marked improvement. The main wins are in
avoiding tons of complexity futzing with cache compression levels and
restoring and saving the cache (actual cached build time is
~unchanged).

Part of #13028

skip-ci: Windows builds don't run on presubmit

@GMNGeoffrey GMNGeoffrey added the infrastructure Relating to build systems, CI, or testing label Apr 20, 2023
@GMNGeoffrey GMNGeoffrey requested a review from ScottTodd as a code owner April 20, 2023 04:46
@GMNGeoffrey GMNGeoffrey added the platform/windows 🚪 Windows-specific build, execution, benchmarking, and deployment label Apr 20, 2023
@GMNGeoffrey GMNGeoffrey merged commit 0ab01b6 into main Apr 20, 2023
@GMNGeoffrey GMNGeoffrey deleted the gcmn-gcs-cache branch April 20, 2023 16:26
GMNGeoffrey added a commit that referenced this pull request Apr 20, 2023
The GitHub-provided `actions/checkout` action is for some reason
unusably slow on the large managed Windows runners. We assumed this was
because of some tricky IO issue or something, but I decided to just try
using `git` commands directly to see, and lo, the checkout time goes
from 10 minutes to 1.5 🚀 
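
The commit message doesn't spell out which `git` commands were used, so the following is only a guess at what a direct checkout could look like: a shallow fetch of a single commit followed by a parallel, shallow submodule update. The function names, the default job count, and the choice of a shallow fetch are all assumptions made for this sketch.

```python
import subprocess
from typing import List


def checkout_commands(repo_url: str, sha: str, jobs: int = 8) -> List[List[str]]:
    """Build the git invocations for a shallow checkout of one commit."""
    return [
        ["git", "init", "."],
        ["git", "remote", "add", "origin", repo_url],
        # Fetch only the commit under test, without history.
        ["git", "fetch", "--depth", "1", "origin", sha],
        ["git", "checkout", "--detach", "FETCH_HEAD"],
        # Initialize submodules shallowly and in parallel.
        ["git", "submodule", "update", "--init", "--depth", "1",
         "--jobs", str(jobs)],
    ]


def run_checkout(repo_url: str, sha: str, workdir: str) -> None:
    """Run the commands in order inside workdir, stopping on first failure."""
    for cmd in checkout_commands(repo_url, sha):
        subprocess.run(cmd, cwd=workdir, check=True)
```

Keeping command construction separate from execution makes it easy to log or inspect exactly what the job would run before pointing it at a real runner.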

With the caching improvements from
#13183, this gets the Windows build
down under 10 minutes, which means we can run it on presubmit (left for
a future PR).

Part of #11009

Tested:
Enabled this workflow on push to my branch:
https://github.com/openxla/iree/actions/runs/4750681034/jobs/8439091687

skip-ci: this only affects the Windows job, which isn't run on presubmit
jpienaar pushed a commit that referenced this pull request May 1, 2023
jpienaar pushed a commit that referenced this pull request May 1, 2023
NatashaKnk pushed a commit to NatashaKnk/iree that referenced this pull request Jul 6, 2023
NatashaKnk pushed a commit to NatashaKnk/iree that referenced this pull request Jul 6, 2023
…3186)
