Allow use of sparse checkout when vendoring from git #400

Open
reegnz opened this issue Oct 1, 2024 · 7 comments · May be fixed by #402
Labels: carvel-triage (This issue has not yet been reviewed for validity), enhancement (This issue is a feature request)

Comments

@reegnz
Contributor

reegnz commented Oct 1, 2024

Describe the problem/challenge you have
Vendoring from big repositories currently takes a long time. It would be great if this could be sped up when includePaths and excludePaths are used.

Describe the solution you'd like

When the user declares includePaths or excludePaths and explicitly asks for a sparse checkout, perform a git sparse checkout for them based on those paths. Because the paths can contain globs, this would use non-cone mode.

Example new config:

---
apiVersion: vendir.k14s.io/v1alpha1
kind: Config
directories:
- path: vendor
  contents:
  - path: kubernetes
    git:
      url: https://github.com/yannh/kubernetes-json-schema
      ref: master
      depth: 1
      sparseCheckout: true
    includePaths:
    - v1.30.5-standalone/**
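
As a rough illustration of what this config could translate to, the sketch below prints one possible sequence of git commands. The helper name `sparse_sync_plan` and the exact command order are assumptions made for illustration; this is not how vendir currently invokes git. It only prints the commands and does not run them.

```shell
#!/bin/sh
# Hypothetical helper: print the git commands a sparse, shallow sync might run.
# Arguments: URL, ref, depth, then one or more sparse-checkout patterns.
sparse_sync_plan() {
  url=$1
  ref=$2
  depth=$3
  shift 3
  echo "git init"
  echo "git remote add origin $url"
  # Non-cone mode accepts gitignore-style patterns, including globs.
  printf 'git sparse-checkout set --no-cone'
  for pattern in "$@"; do
    printf " '%s'" "$pattern"
  done
  printf '\n'
  echo "git fetch origin $ref --depth=$depth"
  echo "git checkout $ref"
}

# Values taken from the example config above.
sparse_sync_plan https://github.com/yannh/kubernetes-json-schema master 1 'v1.30.5-standalone/**'
```

Printing the plan rather than executing it keeps the sketch free of network access and makes the mapping from config keys to git flags easy to inspect.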

Anything else you would like to add:


Vote on this request

This is an invitation to the community to vote on issues, to help us prioritize our backlog. Use the "smiley face" up to the right of this comment to vote.

👍 "I would like to see this addressed as soon as possible"
👎 "There are other more important things to focus on right now"

We are also happy to receive and review Pull Requests if you want to help working on this issue.

@reegnz added the carvel-triage and enhancement labels Oct 1, 2024
@joaopapereira
Member

Thanks for the issue.
From my understanding, sparse checkout only allows the inclusion of folders, so how would excludePaths work?
As for behavior, what do you think should happen if you provide the sparseCheckout key but no includePaths?
Are you suggesting that we should use non-cone mode if includePaths contains a glob, and cone mode if it does not?

@reegnz
Contributor Author

reegnz commented Oct 4, 2024

Yes, I'm suggesting using non-cone mode.

Maybe there could be a flag to switch between cone mode and non-cone mode, so that we don't try to infer it. Inferring could be too error-prone and unnecessarily complex. If git doesn't infer it, we shouldn't either.

There are some gotchas with non-cone mode, but I don't think they apply here.
https://git-scm.com/docs/git-sparse-checkout/2.37.0#_internalsnon_cone_problems

I think sparse checkout without includes should not be accepted (e.g. schema-wise, only allow declaring sparse checkout when include or exclude paths are also set).
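
To illustrate how excludePaths could fit into non-cone mode: non-cone patterns use gitignore syntax, where a leading `!` negates a pattern and the last matching pattern wins. A minimal sketch of one possible mapping follows. The helper name `sparse_patterns` and the `--` separator are hypothetical, and whether vendir's glob semantics line up exactly with gitignore semantics is an open assumption here.

```shell
#!/bin/sh
# Hypothetical mapping from include/exclude path lists to non-cone patterns.
# Usage: sparse_patterns INCLUDE... -- EXCLUDE...
sparse_patterns() {
  negate=''
  for p in "$@"; do
    if [ "$p" = "--" ]; then
      # Everything after "--" is treated as an exclude path.
      negate='!'
      continue
    fi
    # Excludes are emitted after includes so that the negation wins.
    printf '%s%s\n' "$negate" "$p"
  done
}

sparse_patterns 'docs/**' 'ci/**' -- 'docs/drafts/**'
```

The printed lines could then be passed to `git sparse-checkout set --no-cone`, for example through its `--stdin` option.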

@100mik
Contributor

100mik commented Oct 25, 2024

> I think sparse checkout without includes should not be accepted (e.g. schema-wise, only allow declaring sparse checkout when include or exclude paths are also set).

I think we could make this a flag-enabled behaviour and consider making it the default eventually if that makes sense.
@reegnz would you be interested in taking a shot at this? Perhaps a PoC of sorts, with a summary calling out the gotchas? You could also hop on to one of our community meetings if you'd like to discuss it with us!

@joaopapereira what do you think?

reegnz added a commit to reegnz/vendir that referenced this issue Oct 25, 2024
Uses includePaths and excludePaths to configure non-cone sparse checkout
for a git repo.

Fixes carvel-dev#400

Signed-off-by: Zoltán Reegn <zoltan.reegn@gmail.com>
@reegnz reegnz linked a pull request Oct 25, 2024 that will close this issue
@reegnz
Contributor Author

reegnz commented Oct 25, 2024

I gave it a shot, see PR.

@joaopapereira
Member

Sorry, I dropped the ball on this; I have been busy lately.
I like the idea of moving this behind a flag. I will look at the PR in the next couple of days.
@Zebradil let me know if you have any concerns here.

@Zebradil
Member

Zebradil commented Nov 3, 2024

I'm not sure why we would hide this functionality behind a flag if it is only enabled by a configuration key anyway. Adding the key won't change anything for existing configurations, and users can opt in to the new behavior by adding the key to their configurations.

As for the feature itself, I like the idea, and it will definitely help in some cases. But we need to make sure it doesn't affect the caching of git repos that we already have in place.

@Zebradil
Member

Zebradil commented Nov 4, 2024

I've checked the PR and run some tests. They showed that although sparse checkout does speed up the checkout command, there are more effective ways to speed up syncing of git sources.

Vendir configuration:

  • Using ref with the origin/ prefix, e.g. origin/v1.2.5. This way, git only fetches objects related to the specified reference, skipping everything else.
  • Specifying depth: 1. This reduces the number of objects further.

Git configuration:

  • Using --filter=blob:none during fetching.
    • Using sparse checkout is noticeably beneficial in this case.

The last two options (--filter=blob:none and sparse checkout) aren't available in vendir at the moment, but could potentially be implemented.

Test details

Here are some combinations of flags and configuration options I tested.

Base

$ git init
$ git remote add origin git@github.com:something/something.git
$ time git fetch origin
remote: Enumerating objects: 833783, done.
remote: Counting objects: 100% (17084/17084), done.
remote: Compressing objects: 100% (6853/6853), done.
remote: Total 833783 (delta 10856), reused 15068 (delta 9320), pack-reused 816699 (from 1)
Receiving objects: 100% (833783/833783), 1.66 GiB | 10.56 MiB/s, done.
Resolving deltas: 100% (608199/608199), done.
From github.com:something/something
... skip many many branches and tags ...
git fetch origin   160.33s  user 16.39s system 93% cpu 3:09.13 total

$ time git checkout main
branch 'main' set up to track 'origin/main'.
Already on 'main'
git checkout main   0.29s  user 0.20s system 99% cpu 0.495 total

$ time git checkout some-branch
... skip ...
git checkout some-branch   0.48s  user 0.21s system 100% cpu 0.698 total

Fetching is slow, checkout is very fast.

Sparse checkout

The same as the base case, but with sparse checkout configuration set.

$ git init
$ git remote add origin git@github.com:something/something.git
$ git sparse-checkout set --no-cone 'ci/**'
$ time git fetch origin
remote: Enumerating objects: 833783, done.
remote: Counting objects: 100% (17154/17154), done.
remote: Compressing objects: 100% (6896/6896), done.
remote: Total 833783 (delta 10922), reused 15089 (delta 9347), pack-reused 816629 (from 1)
Receiving objects: 100% (833783/833783), 1.66 GiB | 11.03 MiB/s, done.
Resolving deltas: 100% (608222/608222), done.
From github.com:something/something
... skip many many branches and tags ...
git fetch origin   165.51s  user 16.36s system 99% cpu 3:03.42 total

$ time git checkout main
branch 'main' set up to track 'origin/main'.
Already on 'main'
git checkout main   0.04s  user 0.00s system 98% cpu 0.041 total

$ time git checkout some-branch
... skip ...
git checkout some-branch   0.03s  user 0.03s system 96% cpu 0.060 total

Fetching is still slow. Checkout is roughly 10× faster than in the base case,
but that gain is negligible next to the fetch time.

Fetch specific branch

The same as the base case, but fetching a particular branch.

$ git init
$ git remote add origin git@github.com:something/something.git
$ time git fetch origin main
remote: Enumerating objects: 692145, done.
remote: Counting objects: 100% (67579/67579), done.
remote: Compressing objects: 100% (9598/9598), done.
remote: Total 692145 (delta 64168), reused 58331 (delta 57953), pack-reused 624566 (from 1)
Receiving objects: 100% (692145/692145), 1.63 GiB | 11.04 MiB/s, done.
Resolving deltas: 100% (526154/526154), done.
From github.com:something/something
 * branch                  main       -> FETCH_HEAD
 * [new branch]            main       -> origin/main
git fetch origin main   162.29s  user 15.93s system 100% cpu 2:57.28 total

$ time git checkout main
branch 'main' set up to track 'origin/main'.
Already on 'main'
git checkout main   0.30s  user 0.19s system 98% cpu 0.505 total

$ time git checkout some-branch
... error, as the branch is not known ...

Fetching is almost as slow as in the base case, checkout is very fast.

Use --depth=1

The same as the base case, but limiting the depth.

$ git init
$ git remote add origin git@github.com:something/something.git
$ time git fetch origin --depth=1
remote: Enumerating objects: 144986, done.
remote: Counting objects: 100% (144986/144986), done.
remote: Compressing objects: 100% (73804/73804), done.
remote: Total 144986 (delta 105240), reused 99615 (delta 67498), pack-reused 0 (from 0)
Receiving objects: 100% (144986/144986), 501.38 MiB | 11.00 MiB/s, done.
Resolving deltas: 100% (105240/105240), done.
From github.com:something/something
... skip many many branches and tags ...
git fetch origin --depth=1   43.44s  user 4.30s system 68% cpu 1:09.79 total

$ time git checkout main
branch 'main' set up to track 'origin/main'.
Already on 'main'
git checkout main   0.35s  user 0.26s system 98% cpu 0.611 total

$ time git checkout some-branch
... skip ...
git checkout some-branch   0.84s  user 0.31s system 100% cpu 1.136 total

Fetching is roughly 3× faster, checkout is fast.

Use --depth=1 and fetch a particular branch

Combination of the two previous cases.

$ git init
$ git remote add origin git@github.com:something/something.git
$ time git fetch origin main --depth=1
remote: Enumerating objects: 12727, done.
remote: Counting objects: 100% (12727/12727), done.
remote: Compressing objects: 100% (11019/11019), done.
remote: Total 12727 (delta 1808), reused 7559 (delta 1120), pack-reused 0 (from 0)
Receiving objects: 100% (12727/12727), 27.36 MiB | 10.19 MiB/s, done.
Resolving deltas: 100% (1808/1808), done.
From github.com:something/something
 * branch            main       -> FETCH_HEAD
 * [new branch]      main       -> origin/main
git fetch origin main --depth=1   1.06s  user 0.31s system 25% cpu 5.291 total

$ time git checkout main
branch 'main' set up to track 'origin/main'.
Already on 'main'
git checkout main   0.25s  user 0.18s system 99% cpu 0.432 total

$ time git checkout some-branch
... error, as the branch is not known ...

Fetching is fast (about 36× faster by wall-clock time, transferring about 60× less data), checkout is fast.

Use --filter=blob:none, --depth=1 and fetch a particular branch

$ git init
$ git remote add origin git@github.com:something/something.git
$ time git fetch origin main --depth=1 --filter=blob:none
remote: Enumerating objects: 2609, done.
remote: Counting objects: 100% (2609/2609), done.
remote: Compressing objects: 100% (2396/2396), done.
remote: Total 2609 (delta 2), reused 1840 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (2609/2609), 484.53 KiB | 2.47 MiB/s, done.
Resolving deltas: 100% (2/2), done.
From github.com:something/something
 * branch            main       -> FETCH_HEAD
 * [new branch]      main       -> origin/main
git fetch origin main --depth=1 --filter=blob:none   0.06s  user 0.02s system 4% cpu 1.680 total

$ time git checkout main
remote: Enumerating objects: 10118, done.
remote: Counting objects: 100% (10118/10118), done.
remote: Compressing objects: 100% (8631/8631), done.
remote: Total 10118 (delta 1270), reused 5727 (delta 1111), pack-reused 0 (from 0)
Receiving objects: 100% (10118/10118), 26.81 MiB | 9.85 MiB/s, done.
Resolving deltas: 100% (1270/1270), done.
Updating files: 100% (10217/10217), done.
branch 'main' set up to track 'origin/main'.
Already on 'main'
git checkout main   1.41s  user 0.46s system 24% cpu 7.691 total

$ time git checkout some-branch
... error, as the branch is not known ...

Use sparse checkout, --filter=blob:none, --depth=1 and fetch a particular branch

$ git init
$ git remote add origin git@github.com:something/something.git
$ git sparse-checkout set --no-cone 'ci/**'
$ time git fetch origin main --depth=1 --filter=blob:none
remote: Enumerating objects: 2609, done.
remote: Counting objects: 100% (2609/2609), done.
remote: Compressing objects: 100% (2395/2395), done.
remote: Total 2609 (delta 2), reused 1845 (delta 2), pack-reused 0 (from 0)
Receiving objects: 100% (2609/2609), 484.53 KiB | 2.33 MiB/s, done.
Resolving deltas: 100% (2/2), done.
From github.com:something/something
 * branch            main       -> FETCH_HEAD
 * [new branch]      main       -> origin/main
git fetch origin main --depth=1 --filter=blob:none   0.06s  user 0.02s system 4% cpu 1.767 total

$ time git checkout main
remote: Enumerating objects: 102, done.
remote: Counting objects: 100% (102/102), done.
remote: Compressing objects: 100% (77/77), done.
remote: Total 102 (delta 22), reused 68 (delta 18), pack-reused 0 (from 0)
Receiving objects: 100% (102/102), 47.88 KiB | 544.00 KiB/s, done.
Resolving deltas: 100% (22/22), done.
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 1 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (1/1), 114 bytes | 114.00 KiB/s, done.
Updating files: 100% (107/107), done.
branch 'main' set up to track 'origin/main'.
Already on 'main'
git checkout main   0.09s  user 0.04s system 5% cpu 2.497 total

$ time git checkout some-branch
... error, as the branch is not known ...

Fetching is fast, checkout is roughly 3× faster than in the previous case without sparse checkout.
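
For readers who want to check the ratios, the wall-clock totals from the transcripts above can be compared with a small script. This is only a sketch: the helper names `to_seconds` and `speedup` are made up for illustration, and the inputs are the "total" values printed by `time` in the test runs.

```shell
#!/bin/sh
# Convert a "M:SS.ss" or plain "S.sss" total into seconds.
to_seconds() {
  echo "$1" | awk -F: '{ if (NF == 2) print $1 * 60 + $2; else print $1 }'
}

# Print how many times faster the second timing is than the first,
# rounded to one decimal place.
speedup() {
  awk -v old="$(to_seconds "$1")" -v new="$(to_seconds "$2")" \
    'BEGIN { printf "%.1f\n", old / new }'
}

speedup 3:09.13 1:09.79   # full fetch vs. fetch with --depth=1
speedup 3:09.13 5.291     # full fetch vs. single branch with --depth=1
speedup 7.691 2.497       # blob-filtered checkout without vs. with sparse checkout
```

These are single measurements, so the ratios are indicative rather than precise; network conditions alone can shift them noticeably.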

Projects
Status: To Triage

4 participants