
Investigate caching .git data on Travis/AppVeyor #40772

Closed
alexcrichton opened this issue Mar 23, 2017 · 23 comments
Labels
C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC E-help-wanted Call for participation: Help is requested to fix this issue. T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue.

Comments

@alexcrichton
Member

The rust-lang/rust repo itself takes a while to clone, but we somewhat mitigate that with --depth=1 clones. Our submodules, however, are much larger and unfortunately cannot be cloned with a --depth argument due to how the branches work. This typically means that cloning the LLVM repo takes quite a long time! Unfortunately this also increases our exposure to network problems by requiring a lot of data to move over the network.

When playing around with CircleCI recently I found that they automatically cached git repository data which greatly sped up cloning the repository and checking out submodules. Overall it felt quite nifty! We should investigate to see if a similar strategy can apply to Travis and/or AppVeyor. I'm not personally familiar with how CircleCI's git caching works, so some investigation there would be needed (and comments if you're familiar with it would be most welcome!)

Overall I would expect this change to:

  • Reduce timeouts when cloning LLVM because very little data would transfer over the network
  • Speed up builds (slightly). It'd save ~10 minutes on AppVeyor and ~2 minutes on Travis from the look of today's timings.

Any help to implement this would be very much appreciated!

@alexcrichton alexcrichton added A-infrastructure E-help-wanted Call for participation: Help is requested to fix this issue. labels Mar 23, 2017
@sfackler
Member

On the Travis side, it should just be a matter of enumerating all of the directories you want: https://docs.travis-ci.com/user/caching/#Arbitrary-directories

@alexcrichton
Member Author

Yeah unfortunately I don't actually know what directories are cached here. Is it just .git? (or maybe more?)

If it's just .git then this would make a great PR!

@aidanhs
Member

aidanhs commented Mar 24, 2017

I've made a stab at this in #40780. It's not as simple as caching the .git directory - by the time the script runs, you already have a branch checked out and it's not really clear how you'd move the objects around. It's also not as simple as just copying .git/modules - I'd be pretty paranoid about that breaking if someone updates the branch of a gitmodule, as I doubt git is robust at handling people messing around in there.

Instead, I've tried to go the route of using the --reference flag, which explicitly uses another repo as a cache. Unfortunately this isn't particularly well integrated with submodules, so I've had to do some workarounds.
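For illustration, the `--reference` approach boils down to something like the following sketch (the cache location and the src/llvm path are only examples, not the exact commands used in #40780):

# assume a pristine clone of rust-lang/rust, with submodules initialised, is kept in $HOME/rustsrc
git clone --reference "$HOME/rustsrc" https://github.com/rust-lang/rust.git rust
cd rust
# borrow objects for a submodule from the matching pristine submodule repo,
# so only new objects have to come over the network
git submodule update --init --reference "$HOME/rustsrc/src/llvm" src/llvm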

As an aside, I'm a bit suspicious that depth: 1 (added in #35085) doesn't do anything useful. Here's the initial clone log from my travis build:

$ git clone --depth=1 https://github.com/rust-lang/rust.git rust-lang/rust
Cloning into 'rust-lang/rust'...
remote: Counting objects: 9861, done.
remote: Compressing objects: 100% (8740/8740), done.
remote: Total 9861 (delta 5307), reused 2301 (delta 1034), pack-reused 0
Receiving objects: 100% (9861/9861), 8.01 MiB | 10.89 MiB/s, done.
Resolving deltas: 100% (5307/5307), done.
$ cd rust-lang/rust
$ git fetch origin +refs/pull/40780/merge:
remote: Counting objects: 627478, done.
remote: Compressing objects: 100% (117810/117810), done.
remote: Total 627478 (delta 510690), reused 622389 (delta 505626), pack-reused 0
Receiving objects: 100% (627478/627478), 285.82 MiB | 24.97 MiB/s, done.
Resolving deltas: 100% (510690/510690), completed with 3765 local objects.
From https://github.com/rust-lang/rust
 * branch                  refs/pull/40780/merge -> FETCH_HEAD
$ git checkout -qf FETCH_HEAD

Seems like it has to fetch the whole thing anyway in order to get my branch? Maybe worth raising an issue on the travis issue tracker to ask if this is expected?

@aidanhs
Member

aidanhs commented Mar 24, 2017

FWIW, CircleCI is a bit vague about how their git caching works:

https://circleci.com/docs/1.0/how-cache-works/#git-cache

There’s another type of caching that CircleCI does. This is to cache the Git repository of your project. This feature allows for speedier git clones of your project. Control of this cache isn’t user accessible and shouldn’t need to be. In rare scenarios, if something seems to be wrong with your source cache (for example, certain files are in your repo that shouldn’t be), please contact support.

This opacity combined with complaints about the caching ([1], [2]) is not reassuring. I could hazard some guesses at how they're doing it, but it's likely going to be more complicated and therefore less reliable than an approach tailored to the rust repo.

@nagisa
Member

nagisa commented Mar 24, 2017

We used to do this on buildbot, didn't we? IIRC a corrupted .git was a pretty common occurrence.

@aidanhs
Member

aidanhs commented Mar 24, 2017

I assume you're referring to #34595, which looks like it happens because the cache may contain a partially cloned/corrupt git submodule. This was a problem even without caching (network failure then retry is unhappy because of this same corrupted state) and was seemingly resolved by deinit before update (#39055) - the same fix may have fixed the buildbots (had they still been around).

That said, off the top of my head I can think of two approaches for validating before caching (the second problem with the buildbots, aside from not doing deinit, was caching continuously rather than only when the cache was valid). If we see issues, or you want a more paranoid approach to begin with, I'll be able to put something in place fairly quickly.
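For illustration, one shape such a validation gate could take (a sketch only, assuming a plain git fsck integrity check - not necessarily either of the approaches alluded to above):

# hypothetical guard: only allow the cached repo at $HOME/rustsrc to be saved if git considers it intact
if git -C "$HOME/rustsrc" fsck && git -C "$HOME/rustsrc" submodule foreach --quiet 'git fsck'; then
  echo "cache looks healthy, keeping it"
else
  echo "cache looks corrupt, discarding it"
  rm -rf "$HOME/rustsrc"
fi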

@alexcrichton
Member Author

Thanks for the investigation @aidanhs! I'll take a look at #40780 soon.

Also yeah @nagisa I'd want to be careful about a solution here. Bad caching can cause unending problems, so I'd want to make sure we're always in a situation where the tool we're caching is relatively robust to odd cache entries. For example, I trust Docker to do the right thing with caching (and to naturally rebuild when necessary), and the same goes for sccache. I was hoping that this would be solved by just persisting some git database that's resilient to naturally being updated and having old cache entries, but we'll see soon!

@notriddle
Contributor

Call me stupid (since I probably am missing something obvious), but why do Travis CI and Circle CI workers even download the commit history? GitHub has the "download a tarball from https://github.com/rust-lang/rust/archive/master.tar.gz" option, which is probably way faster.

@aidanhs
Member

aidanhs commented Mar 28, 2017

@notriddle if you already have a repo of N commits, downloading the (N+1)th commit with git pull will only transfer a few KB, whereas downloading the entire repo every time is a constant ~6MB. Regardless, we need some kind of git clone for submodules to work (the commitish they're pinned to lives in the .git directory, and downloaded .tar.gz files contain neither .git nor checked-out submodules).

That said, Travis definitely has something sub-optimal with cloning - it should be doing a shallow clone of the branch, but instead it does a shallow clone of master and then a full clone of PR branches (defeating the point of the shallow clone).
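For context, the pinned revision of a submodule is recorded as a "gitlink" entry in the superproject's tree, which is exactly the metadata a plain source tarball doesn't carry. It can be inspected with something like this (src/llvm is just an example path):

# the "160000 commit" entry is the gitlink recording the exact submodule revision
git ls-tree HEAD src/llvm
# prints: 160000 commit <pinned-submodule-sha>	src/llvm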

frewsxcv added a commit to frewsxcv/rust that referenced this issue Mar 29, 2017
…lexcrichton

Attempt to cache git modules

Partial resolution of rust-lang#40772; appveyor remains to be done once travis looks like it's working ok.

The approach in this PR is based on the `--reference` flag to `git clone`/`git submodule update` and is a compromise based on the current limitations of the tools we're using.

The ideal would be:
1. have a cached pristine copy of rust-lang/rust master in `$HOME/rustsrc` with all submodules initialised
2. clone the PR branch with `git clone --recurse-submodules --reference $HOME/rustsrc git@github.com:rust-lang/rust.git`

This would (in the nonexistent ideal world) use the pristine copy as an object cache for the top level repo and all submodules, transferring over the network only the changes on the branch. Unfortunately, a) there is no way to manually control the initial clone with travis and b) even if there was, cloned submodules don't use the submodules of the reference as an object cache. So the steps we end up with are:

1. have a cached pristine copy of rust-lang/rust master in `$HOME/rustsrc` with all submodules initialised
2. have a cloned PR branch
3. extract the path of each submodule, and explicitly `git submodule update --init --reference $HOME/rustsrc/$module $module` (i.e. point directly to the location of the pristine submodule repo) for each one

I've also taken some care to make this forward compatible, both for adding and removing submodules.
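For illustration, step 3 amounts to enumerating the submodule paths from `.gitmodules` and updating each one against the matching pristine repo, roughly like this sketch (the exact script may differ):

# hypothetical per-submodule update loop
modules=$(git config -f .gitmodules --get-regexp '^submodule\..*\.path$' | awk '{print $2}')
for module in $modules; do
  if [ -d "$HOME/rustsrc/$module" ]; then
    # use the pristine submodule repo as an object cache
    git submodule update --init --reference "$HOME/rustsrc/$module" "$module"
  else
    # submodule not present in the cache yet (e.g. newly added): plain update
    git submodule update --init "$module"
  fi
done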

r? @alexcrichton
@aidanhs
Member

aidanhs commented Mar 30, 2017

After a bit of a shaky start (the merge completed successfully, but subsequent merges failed because appveyor caching was broken), just the appveyor part was rolled back, so only travis builds are currently using the new repo caching, and it seems to be working ok.

In the middle of trying to fix appveyor last night, I realised that there is a way for me to test appveyor - fork rust, then comment out all CI that actually does any rust compilation etc. That way I'll be able to just test the caching. I'll be back soon with a tested PR for appveyor.

@alexcrichton
Member Author

Thanks for the continuing investigation @aidanhs!

@alexcrichton
Member Author

(oops didn't mean to close)

@aidanhs
Member

aidanhs commented Apr 4, 2017

Unfortunately, I was completely unable to reproduce the cache restore failure on appveyor, despite trying a number of times. However, after reviewing the logs again, I don't actually think the cache restore failure was the issue; I think it was a different buggy part of the appveyor.yml. There's a new PR to re-enable appveyor at #41075.


Cache aside, as part of this issue I do think it's worth looking into this sequence of commands on travis:

$ git clone --depth=1 https://github.com/rust-lang/rust.git rust-lang/rust
Cloning into 'rust-lang/rust'...
remote: Counting objects: 10010, done.
remote: Compressing objects: 100% (8825/8825), done.
remote: Total 10010 (delta 5369), reused 2433 (delta 1096), pack-reused 0
Receiving objects: 100% (10010/10010), 8.04 MiB | 10.93 MiB/s, done.
Resolving deltas: 100% (5369/5369), done.
$ cd rust-lang/rust
$ git fetch origin +refs/pull/41074/merge:
remote: Counting objects: 630073, done.
remote: Compressing objects: 100% (118309/118309), done.
remote: Total 630073 (delta 512723), reused 625020 (delta 507693), pack-reused 0
Receiving objects: 100% (630073/630073), 286.78 MiB | 25.92 MiB/s, done.
Resolving deltas: 100% (512723/512723), completed with 3750 local objects.
From https://github.com/rust-lang/rust
 * branch                  refs/pull/41074/merge -> FETCH_HEAD

I described this above:

That said, Travis definitely has something sub-optimal with cloning - it should be doing a shallow clone of the branch, but instead it does a shallow clone of master and then a full clone of PR branches (defeating the point of the shallow clone).

The extra time spent here makes the cache a little less effective (40s wasted), and consumes about 60-75% of the time on the no-op PR builds - that's ~ 35x45s across Linux no-op builds and ~ 5x190s across OSX no-op builds, totalling 40-45mins of dead time per PR push! Seems massively wasteful, even if it is in parallel.

I've stumbled across something that looks like the ideal solution - appveyor lets you implement a custom clone_script; we just need to convince travis to implement the same thing, or find a workaround that does something equivalent.
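For illustration, an equivalent hand-rolled clone would shallow-fetch only the ref being built, rather than shallow-cloning master and then fully fetching the PR ref (the PR number below is a placeholder):

# sketch of a minimal checkout of a PR merge commit
git init rust-lang/rust
cd rust-lang/rust
git remote add origin https://github.com/rust-lang/rust.git
git fetch --depth=1 origin "+refs/pull/<PR>/merge"
git checkout -qf FETCH_HEAD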

@aidanhs
Member

aidanhs commented Apr 4, 2017

Ok, the inefficiency in travis has been spotted before - travis-ci/travis-ci#6183, travis-ci/travis-build#747. I guess it just needs resurrecting and fixing. Appveyor seems to do it like so:

git clone -q --depth=1 --branch=auto https://github.com/rust-lang/rust.git C:\projects\rust
git checkout -qf 2564711e803f62e04bebf10408cc1c11297c0caf

@alexcrichton
Member Author

@aidanhs thanks for the investigation! Should we file an upstream travis bug for that?

@aidanhs
Member

aidanhs commented Apr 5, 2017

@alexcrichton nah I'll just make a PR (resurrecting the one I linked above) at some point in the next week or so. Well, you can raise an issue if you want one for tracking purposes :)

@alexcrichton
Member Author

Sounds good to me, thanks!

@aidanhs
Member

aidanhs commented Apr 10, 2017

Some updates:

  1. Appveyor caching is merged, but the caching is currently consistently not working (the paranoia I've put in place means it's not doing any damage, don't worry!) and I'm trying to investigate using #41157 (Add a comment for disabling errexit, try to debug appveyor cache). Currently it looks like the created cache zip file is corrupt, so I'm thinking I may need to get in touch with appveyor.
  2. I raised an issue on the travis-ci repo about the fetch inefficiency, just to provide some context to the travis people, since it turns out there are a few different facets to this problem. Still working on the PR for them.
  3. New item: I noticed that the appveyor cache isn't uniquely failing on the git stuff; it can apparently also fail on llvm. This concerns me a bit, since my experience with the git stuff leads me to believe that the failure can result in a partially restored set of files. Since LLVM builds are file-based this may not actually be an issue, but it's a bit concerning nonetheless. Currently I'm wondering if it's something to do with synchronisation - it seems that builds for open source projects happen in series, whereas rust's run in parallel, so perhaps saving multiple items to the cache at a time causes corruption?

@aidanhs
Member

aidanhs commented Apr 22, 2017

Some updates since I've paused work on this (partially to rethink, see 1b):

  1. Both the Appveyor and travis caching have been removed in #41398 (Disable git caches again).
    a. Appveyor caching wasn't working anyway, due to the failure described in a comment on #41157 (Add a comment for disabling errexit, try to debug appveyor cache), which I've not yet got round to asking the appveyor people about. As noted above, I'm currently vaguely suspicious of concurrent cache saves causing corruption, but have no evidence.
    b. Travis was timing out and killing cache restores, leaving the git repo in an inconsistent state which wasn't being picked up by the paranoia checks I added. More paranoia is possible, but we probably need to consider a better solution if cache restores are timing out.
  2. Following the issue I raised on the travis-ci repo (mentioned in previous comment), I've made a PR on the travis-build repo to correctly do shallow clones on branches, which would make our no-op builds faster.
  3. The init_repo.sh script is still in place so intermittent git failures should be correctly handled - the caching I was attempting was just to speed it up, especially on appveyor.

@aidanhs
Member

aidanhs commented May 25, 2017

Raised http://help.appveyor.com/discussions/problems/6735-corrupt-caches about the corrupt caches.

We've had to implement @notriddle's suggestion to 'temporarily' use .tar.gz files for the llvm submodule in #42211, since it was causing builds to time out when combined with the current appveyor network issues. I'm not delighted, but while caches don't work there aren't really any other great options I'm aware of.
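For context, the tarball approach boils down to something like the following sketch (the repository URL and paths are illustrative; the actual script in #42211 may differ):

# read the pinned llvm revision from the superproject, then unpack GitHub's archive of that
# exact revision into the submodule's working directory instead of cloning it
sha=$(git ls-tree HEAD src/llvm | awk '{print $3}')
mkdir -p src/llvm
curl -fL "https://github.com/rust-lang/llvm/archive/$sha.tar.gz" | tar -xz -C src/llvm --strip-components=1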

@Mark-Simulacrum Mark-Simulacrum added T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. and removed A-infrastructure labels Jun 25, 2017
@aidanhs
Member

aidanhs commented Jun 28, 2017

Appveyor have replied to the issue effectively acknowledging the problem and saying it'll be fixed with cache "v2".

@Mark-Simulacrum Mark-Simulacrum added the C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC label Jul 27, 2017
@steveklabnik
Member

Triage: not aware of any movement here

@bjorn3
Member

bjorn3 commented May 15, 2022

We now use GitHub Actions instead of Travis and AppVeyor. For git submodules we do some magic whereby we download an archive containing just the right version we need and then trick git into thinking it is a regular submodule, I believe. Can this issue be closed?
