How to Fetch a Commit
An early design decision we made for Buckaroo was that package installation should only require the lock-file. This guarantees that everyone (and your CI server) is always working from the same exact versions of your dependencies.
But it also leads to a challenge: how do we fetch a package given only a Git URL and a commit hash?
The simplest solution is the following:
git clone $GIT_URL $PACKAGE_DIR
cd $PACKAGE_DIR
git checkout $GIT_COMMIT
But this is inefficient. Cloning an entire repository can take anywhere from a few seconds to hours, depending on the number of commits. We only need the code at one commit, not the entire history.
Many of you are probably already thinking: why not do a shallow clone?
git clone --depth=1 -n $GIT_URL $PACKAGE_DIR
cd $PACKAGE_DIR
git checkout $GIT_COMMIT
But this only works if $GIT_COMMIT is the latest commit on the default branch.
What about fetch?
git fetch origin $GIT_COMMIT
The problem is that to fetch an arbitrary (unadvertised) commit, the Git server must enable this feature on its side (the uploadpack.allowAnySHA1InWant setting). GitHub, our most common package host, does not.
error: Server does not allow request for unadvertised object...
An advertised object is one at the tip of a branch or a tag. We can query these using ls-remote:
git ls-remote $GIT_URL
What's really nice about ls-remote is that we don't even need to clone first!
This gives us a more efficient procedure:
1. Use ls-remote to determine if the commit is advertised
   1.1. If so, do a shallow clone and fetch
   1.2. If not, do a full clone
2. Checkout the commit
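The procedure above can be sketched in shell. This is an illustration, not Buckaroo's actual implementation; the function name and arguments are made up, and error handling is omitted:

```shell
#!/bin/sh
# Sketch: check whether the commit is advertised, then pick the
# cheapest clone that can reach it.
fetch_commit() {
  GIT_URL="$1"; GIT_COMMIT="$2"; PACKAGE_DIR="$3"

  if git ls-remote "$GIT_URL" | cut -f1 | grep -q "^$GIT_COMMIT"; then
    # Advertised: a shallow clone plus a depth-1 fetch of the commit is enough
    git clone -q --depth=1 -n "$GIT_URL" "$PACKAGE_DIR"
    git -C "$PACKAGE_DIR" fetch -q --depth=1 origin "$GIT_COMMIT"
  else
    # Not advertised: fall back to a full clone
    git clone -q -n "$GIT_URL" "$PACKAGE_DIR"
  fi

  git -C "$PACKAGE_DIR" checkout -q "$GIT_COMMIT"
}
```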
This solution is pretty good, but can we do better?
It turns out that git fetch has an option called --deepen. This will fetch additional commits beyond those from previous fetches. With this, we can reach commits older than the tip in a shallow clone.
1. Do a shallow clone
2. Does the commit exist?
   2.1. Yes, do a checkout
   2.2. No, git fetch --deepen n, then go to 2.
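The loop above can be sketched as follows. The helper name and the step size of 32 are arbitrary choices, and the --is-shallow-repository guard stops us from looping forever once the whole history has been fetched and the commit still isn't there:

```shell
#!/bin/sh
# Sketch: deepen a shallow clone until the wanted commit appears.
deepen_until_found() {
  GIT_COMMIT="$1"; PACKAGE_DIR="$2"
  until git -C "$PACKAGE_DIR" cat-file -e "$GIT_COMMIT^{commit}" 2>/dev/null; do
    # Once the history is complete, deepening further is pointless: give up
    if [ "$(git -C "$PACKAGE_DIR" rev-parse --is-shallow-repository)" = "false" ]; then
      echo "commit $GIT_COMMIT not found" >&2
      return 1
    fi
    git -C "$PACKAGE_DIR" fetch -q --deepen=32 origin
  done
  git -C "$PACKAGE_DIR" checkout -q "$GIT_COMMIT"
}
```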
In this way, we can walk backwards through the commit history from the branch tips to the commit that we need. In a very large repository, this can save us considerable fetching time.
Of course, --deepen assumes we know which branch we want to fetch the commit from. Where can we get this information? Well, during the resolution process (generating the lock-file), Buckaroo will have explored the branches and tags that contain the commit we locked down to. We store this information in the lock-file, allowing us to use it as a hint for later installs.
[lock."github.com/buckaroo-pm/boost-config"]
versions = [ "branch=master" ]
revision = "4392ed19b232ed2dde7623843d7e30ef669d860e"
Here we know that 4392ed will likely be found on master.
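With the hint, the deepening can target just the expected branch. A sketch, with an illustrative function name and an arbitrary step size of 32; the branch and revision are assumed to come from the lock-file entry above:

```shell
#!/bin/sh
# Sketch: fetch the hinted branch shallowly, then deepen along it
# until the locked revision is present.
fetch_with_hint() {
  PACKAGE_DIR="$1"; BRANCH="$2"; REVISION="$3"
  git -C "$PACKAGE_DIR" fetch -q --depth=1 origin "$BRANCH"
  until git -C "$PACKAGE_DIR" cat-file -e "$REVISION^{commit}" 2>/dev/null; do
    # History is complete but the commit is missing: the hint was wrong,
    # so fail here and let the caller fall back to a more expensive strategy
    if [ "$(git -C "$PACKAGE_DIR" rev-parse --is-shallow-repository)" = "false" ]; then
      return 1
    fi
    git -C "$PACKAGE_DIR" fetch -q --deepen=32 origin "$BRANCH"
  done
}
```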
With this in place, we now have a series of increasingly expensive strategies to try:
- ls-remote
- Shallow clone
- Deepen the expected branch
- Full clone
- Fail
We have one last trick for speeding things up. Each package in the packages folder (/buckaroo) is actually a Git repository. This allows us to upgrade a package very cheaply, since we can fetch just the difference between the current version and the next.
Even better, the remote we fetch from can be a local Git repository! Buckaroo will try to fetch from its global Git cache before making requests to the remote.
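One way to sketch this; the cache path and the remote name "cache" are illustrative stand-ins for Buckaroo's global cache, not its actual layout:

```shell
#!/bin/sh
# Sketch: try a local cache repository first, and only hit the
# network when the cache doesn't have the commit.
fetch_via_cache() {
  PACKAGE_DIR="$1"; CACHE_DIR="$2"; GIT_COMMIT="$3"
  git -C "$PACKAGE_DIR" remote add cache "$CACHE_DIR" 2>/dev/null || true
  # Fetching from a local path costs no network traffic at all
  if git -C "$PACKAGE_DIR" fetch -q cache &&
     git -C "$PACKAGE_DIR" cat-file -e "$GIT_COMMIT^{commit}" 2>/dev/null
  then
    return 0                               # cache hit: nothing to download
  fi
  git -C "$PACKAGE_DIR" fetch -q origin    # cache miss: go to the network
}
```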
In summary:
- Clones are expensive for large repositories
- Shallow clones only work for commits near the branch tips
- GitHub does not allow you to fetch arbitrary (unadvertised) commits
- Use ls-remote to discover the latest commits
- Use --deepen to expand a shallow clone
- Git history is singly-linked (commits know their parents, not their children), so save information when you have it