Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI keeps outdated caches until it runs out of space, so we should delete them manually #79919

Closed
YuriSizov opened this issue Jul 26, 2023 · 8 comments

Comments

@YuriSizov
Copy link
Contributor

Godot version

n/a

System information

n/a

Issue description

Our GitHub Actions CI has a caching mechanism that reuses results of previous builds to speed up following builds. The caching itself works pretty well, however we do have an issue with the space allocated to our runners.

Take #76252 for example. It has been pushed several times in the last day, which created multiple cache records which all take spaces. Even though only the latest one is useful.

windows-editor-3.x-refs/pull/76252/merge-06d24b1a3376f6465fd0eaacad3d1abe5c5c4fd2       115.40 MB  refs/pull/76252/merge  3 hours ago
windows-editor-3.x-refs/pull/76252/merge-03f5e7b4093cb4aa8a1562e30677e7a1d44cb319       115.27 MB  refs/pull/76252/merge  3 hours ago
windows-editor-3.x-refs/pull/76252/merge-0fbd955076278d0112419b34b67731952992faf6       113.74 MB  refs/pull/76252/merge  3 hours ago
windows-editor-3.x-refs/pull/76252/merge-cff55472ae5752a1f9b3026a95c21485357697ab       113.68 MB  refs/pull/76252/merge  5 hours ago

In theory, there is an automatic mechanism for flushing outdated cache, but it's only triggered when the cache storage is about to run out of space. So until then 3 of these 4 cached packages lay idle and take up space. Space which we desperately need, as our builds require a lot of it.

Steps to reproduce

What we could do to fix this is to make sure we always clean up outdated caches.

First of all, we can safely remove any cache associated with a recently merged PR. In theory, these caches could be used by the master branch (and all other main branches) too, but in practice that would almost never happen. The lookup for a matching cache would first check for caches starting with windows-template-master-refs/heads/master. Only if those are missing it would take the latest cache starting with windows-template-master. Which isn't even guaranteed to be the one we need and can come from any recently built PR.

So all in all, PR caches are useless to us the moment we merge the PR.

Second, we should definitely remove previous caches associated with any PR once they are no longer useful. So in the situation demonstrated above we would start the first build without any caches and generate a cache labeled as windows-editor-3.x-refs/pull/76252/merge-cff55472ae5752a1f9b3026a95c21485357697ab. The second build would want to use that cache, so we start the build as normal, but once it's done and a new cache is available (windows-editor-3.x-refs/pull/76252/merge-0fbd955076278d0112419b34b67731952992faf6) we need to remove windows-editor-3.x-refs/pull/76252/merge-cff55472ae5752a1f9b3026a95c21485357697ab.

This would make sure we don't store anything that is no longer usable. The implementation should be pretty straightforward (store the old cache key, do the build, delete the old cache key, finish the action).

Minimal reproduction project

n/a

@YuriSizov
Copy link
Contributor Author

As a tangential note, we can also split up some builds and their tests. This would allow us to restart tests without restarting builds, if there are issues outside of our control (and perhaps that would clean up some space on runners too, between the build and the test).

We can do this by uploading the artifacts as early as we can (immediately after the build is done). This is something that have been requested for other reasons too, so a good improvement regardless. Then some of the tests can be moved to a dedicated job that depends on the build job. The test job would download the artifact and use it, since most of the time it doesn't need any source code access. We can also move building our test godot-cpp extension to a separate task to do it in parallel with everything else.

@RedworkDE
Copy link
Member

Are the caches created by PRs actually all that useful or should we maybe just disable writing them from PRs?
If the PR only contains minor changes using the cache created from it is really no different from using the cache from its base commit.
Only a PR that contains major core changes would benefit from using its own cache after a force-push (unless that push also contains a rebase), but the problem with that is that a PR doesn't seem to reliably use its own cache, instead the current strategy seems to be to pick an arbitrary cache from any build or PR against the target branch.
Even if we fixed that to use the cache from the previous commits, then that PRs cache wouldn't be considered because the commit of the previous build is no longer a parent of the current build.

TL/DR: I think just not caching things from PRs would make for a better cache selection and also better cache retention as not so many pretty much identical things end up in the cache filling it pretty much immediately.

@YuriSizov
Copy link
Contributor Author

YuriSizov commented Jul 26, 2023

Are the caches created by PRs actually all that useful or should we maybe just disable writing them from PRs?

That's what I've been thinking as well. Most PRs don't really benefit from having a cache. Sure, it would speed up consecutive builds of the same PR, but not a lot of PRs are that different from master (or other target branches), so it doesn't seem extremely beneficial to allocate disk space to such caches. We can survive those rare ones which may benefit from their own cache. And we can also add some kind of flag that can be passed via commit messages to force using cache in such cases.

Plus, we now have a multi-stage setup and the static checks action guards the CI from running many invalid PRs. Those would previously benefit from caches during reruns, but now they don't even build and cache anything. So having the cache enabled doesn't benefit them at all.

And finally, even in a hypothetical situation where some PR doesn't change anything about the existing cache it took, it still produces a new cache entry, wasting space needlessly.

So I agree, we could likely disable caches for everything other than main development branches.


One major takeaway from the whole thing is that it's not about how much space all caches take. It's about how much space individual caches take. The problem with the CI comes from the fact that one particular cache gets too big and then it gets loaded onto the runner and that makes it run out of allocated space. So keeping individual caches small is important.

The issue seems to be that one "bad" cache can poison the well. If one cache gets too big for some reason, or if caches bloat over time organically, at some point the cache is going to be too large and cause this kind of outage.

the problem with that is that a PR doesn't seem to reliably use its own cache, instead the current strategy seems to be to pick an arbitrary cache from any build or PR against the target branch.

That's not entirely true. Current strategy is to find the closest match using the action name, the target (base) branch, the PR branch, and the PR SHA hash.

        restore-keys: |
          ${{inputs.cache-name}}-${{env.GODOT_BASE_BRANCH}}-${{github.ref}}-${{github.sha}}
          ${{inputs.cache-name}}-${{env.GODOT_BASE_BRANCH}}-${{github.ref}}
          ${{inputs.cache-name}}-${{env.GODOT_BASE_BRANCH}}

This basically means that it will check for the exact match (same hash and all), which it is unlikely to find. Then it will check a match that has the same PR branch. This should fetch the old cache from the same PR, if one exists. If it does not, it will pick the most recent cache based on the target/base branch only.

This is where things get tricky. Technically all PRs would match this. So an arbitrary cache could be used here. One way to solve this would be to add a direct look up for the base branch, e.g.:

          ${{inputs.cache-name}}-${{env.GODOT_BASE_BRANCH}}-${{github.ref}}-${{github.sha}}
          ${{inputs.cache-name}}-${{env.GODOT_BASE_BRANCH}}-${{github.ref}}
          ${{inputs.cache-name}}-${{env.GODOT_BASE_BRANCH}}-refs/heads/master
          ${{inputs.cache-name}}-${{env.GODOT_BASE_BRANCH}}

We'd need to keep those in sync with the name of the branch the CI itself is stored in (so we'd need to change it for the 4.1 branch, 4.0 branch, 3.x branch, 3.6 branch, etc).

Another option is, once again, to only keep the master/3.x/etc build result cached.

@YuriSizov
Copy link
Contributor Author

It's a shame we don't have any historic data on this, since it's hard to identify a pattern. We cache the entire .scons-cache folder, and I'm not entirely sure what goes into it. But observing fresh caches that appear today after I have purged all of them yesterday I notice that caches tend to get very big if changes in their commit touch some of the more core/shared code. This feels like a normal behavior, but for some reason we are getting huge caches.

But as I've mentioned, we don't have any historic data. We might've been getting huge caches for some time already. And just reached an unfortunate mix of increased runner requirements (from SCU builds perhaps) and having a big cache already there.

Whatever it is, master just doubled in size for the clang sanitizers build:

image

Run actions/cache@v3
Received 113246208 of 614024093 (18.4%), 107.8 MBs/sec
Received 19[29](https://github.com/godotengine/godot/actions/runs/5681190556/job/15396934416#step:4:31)37984 of 614024093 ([31](https://github.com/godotengine/godot/actions/runs/5681190556/job/15396934416#step:4:33).4%), 91.5 MBs/sec
Received 394264576 of 614024093 (64.2%), 124.9 MBs/sec
Received 5[32](https://github.com/godotengine/godot/actions/runs/5681190556/job/15396934416#step:4:34)676608 of 614024093 (86.8%), 126.7 MBs/sec
Received 614024093 of 614024093 (100.0%), 123.4 MBs/sec
Cache Size: ~586 MB (614024093 B)
/usr/bin/tar -xf /home/runner/work/_temp/5a041cf5-9932-47f7-9a55-0be8cbc8f5f5/cache.tzst -P -C /home/runner/work/godot/godot --use-compress-program unzstd
Cache restored successfully
Cache restored from key: linux-editor-llvm-sanitizers-master-refs/heads/master-41a7f6b3804777[33](https://github.com/godotengine/godot/actions/runs/5681190556/job/15396934416#step:4:35)86710d5e49b64b173a3198de

@RedworkDE
Copy link
Member

That's not entirely true.

Right, I missed to github.ref bit.


To keep the size of the caches down we should probably investigate trimming the cache entries more aggressively based on atime so that is only items actually used in the current build get uploaded again. This should prevent the cache from growing any larger than absolutely necessary to effectively have incremental builds in CI.


Alternatively (or for later) we could also ditch githubs caching entirely and use the scons cache as intended by mounting a network share with the cache into the runner, which would mean that a) the cache itself doesn't take any disk space on the runner and also only artifacts that are actually used/produced by that run are transferred over the network. Of course this approach comes with a whole lot of other complications, so I don't think this is a good short term goal.


re: cache size, that is kinda to be expected when requires rebuilding the entire engine as the cache will then contain both all the previous object files (in case the ever become useful again; not really ever the case for the current setup) and the newly build versions of the same files (very useful for the next build) doubling its size.


As a very quick workaround the space issues, we could probably also just disable the cache for the GCC SAN run entirely, which will mean that it always takes an hour but at least the runner would have 1 GB more free disk space and not run out all the time.

akien-mga added a commit to akien-mga/godot that referenced this issue Aug 1, 2023
Removing the Android toolchain saves 14 GiB, which gives us more room
for growth and to avoid running into out-of-space errors in the Linux
sanitizers + debug symbols builds.

Related to godotengine#79919, though the caches were just one part of the problem,
the real issue is that our Linux sanitizers builds take 12 GiB, and
adding godot-cpp on top with 2 GiB leaves only a few GiB left for the
cache itself.
@YuriSizov
Copy link
Contributor Author

YuriSizov commented Aug 1, 2023

We've discussed this a bit with @akien-mga and measured the impact of each build step.

It seems that the Ubuntu runner comes with 84 GB of space total, with around 64 GB taken by the OS and other prerequisites. While our own dependencies, depending on the build configuration, can take a gig or two, for the most part this is the amount of space that is deterministically taken every time we spin up an Ubuntu image.

This leaves us with around 20 GB left for the build. With the san builds (whether clang or GCC) we are using around 12-13 GB, without any cache. The following godot-cpp build takes another 2.1 GB. Checking out and running the test project and other tests also contributes some to the used up disk space. If the limit was 14 GB, as we assumed before, we would be well over the budget with this. Then you add the cache that casually takes 500-1000 MB zipped for Linux builds. Unzipped it easily throws us over board. Thankfully, it seems that the actual limit is way higher, it's just that the OS takes a lot of space.

There is no way to use a different distribution for GitHub-hosted runners, it's only Ubuntu (latest, 22.04, or 20.04). However, there are certainly ways to debloat it. After all, 60 gigs for Linux looks like a lot! We are not the first to come up with such an idea, as there are a couple of actions publicly available that do just that. They remove unnecessary pre-installed tools and dependencies.

So that's something we could do too, as the results of these tools look very promising. (Edit: Here's a PR with more details: #80115)

For the reference, here are some of the test runs, you can see the disk usage stats from their logs:

YuriSizov pushed a commit to YuriSizov/godot that referenced this issue Aug 2, 2023
Removing the Android toolchain saves 14 GiB, which gives us more room
for growth and to avoid running into out-of-space errors in the Linux
sanitizers + debug symbols builds.

Related to godotengine#79919, though the caches were just one part of the problem,
the real issue is that our Linux sanitizers builds take 12 GiB, and
adding godot-cpp on top with 2 GiB leaves only a few GiB left for the
cache itself.

(cherry picked from commit 611123f)
YuriSizov pushed a commit to YuriSizov/godot that referenced this issue Aug 2, 2023
Removing the Android toolchain saves 14 GiB, which gives us more room
for growth and to avoid running into out-of-space errors in the Linux
sanitizers + debug symbols builds.

Related to godotengine#79919, though the caches were just one part of the problem,
the real issue is that our Linux sanitizers builds take 12 GiB, and
adding godot-cpp on top with 2 GiB leaves only a few GiB left for the
cache itself.

(cherry picked from commit 611123f)
IntangibleMatter pushed a commit to IntangibleMatter/godot that referenced this issue Aug 13, 2023
Removing the Android toolchain saves 14 GiB, which gives us more room
for growth and to avoid running into out-of-space errors in the Linux
sanitizers + debug symbols builds.

Related to godotengine#79919, though the caches were just one part of the problem,
the real issue is that our Linux sanitizers builds take 12 GiB, and
adding godot-cpp on top with 2 GiB leaves only a few GiB left for the
cache itself.
akien-mga added a commit to akien-mga/godot that referenced this issue Aug 29, 2023
Removing the Android toolchain saves 14 GiB, which gives us more room
for growth and to avoid running into out-of-space errors in the Linux
sanitizers + debug symbols builds.

Related to godotengine#79919, though the caches were just one part of the problem,
the real issue is that our Linux sanitizers builds take 12 GiB, and
adding godot-cpp on top with 2 GiB leaves only a few GiB left for the
cache itself.

(cherry picked from commit 611123f)
mandryskowski pushed a commit to mandryskowski/godot that referenced this issue Oct 11, 2023
Removing the Android toolchain saves 14 GiB, which gives us more room
for growth and to avoid running into out-of-space errors in the Linux
sanitizers + debug symbols builds.

Related to godotengine#79919, though the caches were just one part of the problem,
the real issue is that our Linux sanitizers builds take 12 GiB, and
adding godot-cpp on top with 2 GiB leaves only a few GiB left for the
cache itself.
@brno32
Copy link
Contributor

brno32 commented Nov 27, 2023

I'd like to work on this. Would calling a python script that implements some of the logic at the end of the workflows be an acceptable solution?

@YuriSizov
Copy link
Contributor Author

@brno32 I don't think there is anything that needs to be worked on here right now. #80115 resolved most of the pressing issues and we haven't had CI troubles with space ever since.

We are kind of wasting space with old caches, but that doesn't seem to bother GitHub much. With changes in #80091 we also use more relevant caches than when this issue was submitted. And we don't compile Godot and godot-cpp on the same runner anymore. So while all these findings are still interesting and educational, I think the issue can be closed as resolved.

@akien-mga akien-mga added this to the 4.2 milestone Nov 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants