Simulating sparse checkout in a partial clone bare repository

Background

Partial clone is a git feature allowing a local repository to contain only a subset of a remote repository's trees and blobs, fetching missing objects lazily. Sparse checkout is a separate git feature allowing a working tree to contain only a subset of the files tracked by the repository. Used together, partial clone and sparse checkout allow working with large multi-project repositories ("monorepos") and repositories containing large binary files without having to download and store a full copy of the data in the repository. (Shallow clone is a distinct git feature that limits the commits stored in the local repository.)

For example, consider jbosboom/test-partial-clone-sparse-checkout, which tracks a few text files and some large images. If we want to work on the text files, but don't need the images, we can avoid downloading and storing them in our local repository using partial clone and sparse checkout, as shown in the following commands:

# EXAMPLE 1
[jbosboom@docks tmp]$ git clone --no-checkout --filter=blob:none https://github.com/jbosboom/test-partial-clone-sparse-checkout.git normal-repo && cd normal-repo
Cloning into 'normal-repo'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 12 (delta 0), reused 12 (delta 0), pack-reused 0
Receiving objects: 100% (12/12), done.
[jbosboom@docks normal-repo]$ git sparse-checkout init
[jbosboom@docks normal-repo]$ echo -e '/*\n!/data/images/' > .git/info/sparse-checkout
[jbosboom@docks normal-repo]$ git checkout
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 6 (delta 0), reused 6 (delta 0), pack-reused 0
Receiving objects: 100% (6/6), 853 bytes | 853.00 KiB/s, done.
Your branch is up to date with 'origin/trunk'.
[jbosboom@docks normal-repo]$ git status
On branch trunk
Your branch is up to date with 'origin/trunk'.

You are in a sparse checkout with 75% of tracked files present.

nothing to commit, working tree clean
[jbosboom@docks normal-repo]$ ls data/
code  text
[jbosboom@docks normal-repo]$ git checkout HEAD^
remote: Enumerating objects: 2, done.
remote: Counting objects: 100% (2/2), done.
remote: Total 2 (delta 0), reused 1 (delta 0), pack-reused 0
Receiving objects: 100% (2/2), 295 bytes | 295.00 KiB/s, done.
HEAD is now at e52df10 Initial commit

Our local repository contains only the blobs necessary to populate our sparse checkout. When we checked out the parent commit on the current branch, we had to fetch the blobs containing the previous contents of the files modified by the branch tip commit. This is roughly what we want if we're working in an environment with good connectivity.

Bare repositories

But what if instead of working on a repository, we just want to maintain a partial backup/archival clone? Then we don't need a working copy, so we want our local repository to be a bare repository. Unfortunately, sparse checkout doesn't work in a bare repository. (Also, we probably want to immediately fetch blobs for past versions of the files tracked in the portion of the repository we're mirroring. We'll come back to this.)

# EXAMPLE 2
[jbosboom@docks tmp]$ git clone --bare --filter=blob:none https://github.com/jbosboom/test-partial-clone-sparse-checkout.git bare-repo.git && cd bare-repo.git
Cloning into bare repository 'bare-repo.git'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 12 (delta 0), reused 12 (delta 0), pack-reused 0
Receiving objects: 100% (12/12), done.
[jbosboom@docks bare-repo.git]$ git sparse-checkout init
fatal: this operation must be run in a work tree

Instead, we need to use the sparse:oid filter described in the man page for git rev-list. This filter reads a sparse checkout specification from a blob in the repository.

Fetching with sparse:oid (probably not supported)

The man page for git clone implies we can pass any filter to --filter, but sparse:oid references an object in the local repository, and that object can't exist before the clone does. Instead, we need to create an empty repository, write the sparse checkout specification into it as a blob, and add our remote manually.

# EXAMPLE 3
[jbosboom@docks tmp]$ git init --bare bare-repo.git && cd bare-repo.git
Initialized empty Git repository in /tmp/bare-repo.git/
[jbosboom@docks bare-repo.git]$ git update-ref refs/sparse-spec `echo -e '/*\n!/data/images/' | git hash-object -w --stdin`
[jbosboom@docks bare-repo.git]$ git remote add origin https://github.com/jbosboom/test-partial-clone-sparse-checkout.git
[jbosboom@docks bare-repo.git]$ git config remote.origin.promisor true
[jbosboom@docks bare-repo.git]$ git config remote.origin.partialclonefilter sparse:oid=refs/sparse-spec
[jbosboom@docks bare-repo.git]$ git fetch
fatal: remote error: filter 'sparse:oid' not supported

Whether fetching with the sparse:oid filter works depends on server configuration (in addition to the configuration required for partial clone and sparse checkout). Evaluating the filter over large repositories can be expensive, so public or commercial repository hosting is likely to disable it.

Listing missing blobs with git rev-list

If the server won't apply our sparse filter server-side, we can use git rev-list to apply it client-side after fetching commits and trees, then explicitly fetch the missing blobs.

# EXAMPLE 4
[jbosboom@docks tmp]$ git clone --bare --filter=blob:none https://github.com/jbosboom/test-partial-clone-sparse-checkout.git bare-repo.git && cd bare-repo.git
Cloning into bare repository 'bare-repo.git'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 12 (delta 0), reused 12 (delta 0), pack-reused 0
Receiving objects: 100% (12/12), done.
[jbosboom@docks bare-repo.git]$ git update-ref refs/sparse-spec `echo -e '/*\n!/data/images/' | git hash-object -w --stdin`
[jbosboom@docks bare-repo.git]$ git rev-list --objects --filter=sparse:oid=refs/sparse-spec --missing=print --no-object-names --all | cut -d '?' -s -f2 | git -c fetch.negotiationAlgorithm=noop fetch origin --no-tags --no-write-fetch-head --recurse-submodules=no --filter=blob:none --stdin
remote: Enumerating objects: 8, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 8 (delta 1), reused 8 (delta 1), pack-reused 0
Receiving objects: 100% (8/8), 910 bytes | 910.00 KiB/s, done.
Resolving deltas: 100% (1/1), done.

In Example 1, we fetched 6 blobs when checking out the branch tip, then another 2 blobs when checking out the parent (previous) commit. In Example 4, we fetched all 8 blobs at once, so we have all of the history for the files matching our sparse checkout specification.

Staying up to date

To keep our local repository up to date with the remote repository as it changes, we can simply git fetch origin to fetch new commits and trees, then repeat the above rev-list | fetch operation to fetch any new blobs not excluded by our filter. For large repositories, computing the missing blobs may take significant time. We can speed the rev-list up by maintaining remote tracking branches, even though they are not normally used in bare repositories, and limiting the rev-list to blobs referenced by commits not already in the tracking branches. After fetching the missing blobs, we update our local branches to match the tracking branches.

# EXAMPLE 5
[jbosboom@docks tmp]$ git init --bare bare-repo.git && cd bare-repo.git
Initialized empty Git repository in /tmp/bare-repo.git/
[jbosboom@docks bare-repo.git]$ git update-ref refs/sparse-spec `echo -e '/*\n!/data/images/' | git hash-object -w --stdin`
[jbosboom@docks bare-repo.git]$ git remote add origin https://github.com/jbosboom/test-partial-clone-sparse-checkout.git
[jbosboom@docks bare-repo.git]$ git config remote.origin.promisor true
[jbosboom@docks bare-repo.git]$ git config remote.origin.partialclonefilter blob:none
[jbosboom@docks bare-repo.git]$ git config remote.origin.fetch '+refs/heads/*:refs/remotes/origin/*'
[jbosboom@docks bare-repo.git]$ git fetch
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 12 (delta 0), reused 12 (delta 0), pack-reused 0
Receiving objects: 100% (12/12), 1.25 KiB | 1.25 MiB/s, done.
From https://github.com/jbosboom/test-partial-clone-sparse-checkout
 * [new branch]      trunk      -> origin/trunk
[jbosboom@docks bare-repo.git]$ git rev-list --objects --filter=sparse:oid=refs/sparse-spec --missing=print --no-object-names --remotes --not --branches | cut -d '?' -s -f2 | git -c fetch.negotiationAlgorithm=noop fetch origin --no-tags --no-write-fetch-head --recurse-submodules=no --filter=blob:none --stdin
remote: Enumerating objects: 8, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 8 (delta 1), reused 8 (delta 1), pack-reused 0
Receiving objects: 100% (8/8), 910 bytes | 910.00 KiB/s, done.
Resolving deltas: 100% (1/1), done.
[jbosboom@docks bare-repo.git]$ git for-each-ref --shell --format 'git update-ref refs/heads/%(refname:lstrip=3) $(git show-ref --verify --hash %(refname))' refs/remotes/origin/ | sh -x
++ git show-ref --verify --hash refs/remotes/origin/trunk
+ git update-ref refs/heads/trunk aed61b3f0917e3809f92e0a96db88b423d0a49cc

In Example 5, we started with an empty bare repository and added our remote manually, but that isn't necessary. If we already have a clone set up like Example 4 and we've already fetched all the blobs referred to by the local branches, we can switch to maintaining tracking branches by starting from the git config remote.origin.fetch command in Example 5.

(Strictly speaking, these branches are not tracking branches in the git branch --track sense. We can make them so with git for-each-ref --shell --format 'git branch -u %(refname) %(refname:lstrip=3)' refs/remotes/origin | sh. This changes the output of git branch -vv, but not much else.)

Fetching fewer trees (not yet supported?)

In the above examples, we fetched only the blobs we care about, but we fetched all the tree objects. If we care about only a small part of a very large repository (e.g., a minor project in a monorepo), we might end up fetching lots of trees we don't need. We can fetch trees lazily by using the tree:0 filter in the same places we used the blob:none filter, but when adapting the examples above in this way, we end up fetching the trees for data/images/ in both normal repositories (using sparse checkout) and bare repositories (using rev-list with the sparse:oid filter). I find this surprising, as it seems to render the tree:0 filter useless. Maybe it only works with "cone" sparse checkouts? We could write our own sparse checkout specification evaluator and walk the trees manually if we really needed to limit tree fetching.
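
As a proof of concept for that last idea, here is a minimal sketch of a manual tree walk over a throwaway repository. The repository layout mimics this article's example repo, but the walker is hypothetical: it hard-codes a single excluded prefix (data/images/) rather than evaluating a real sparse checkout specification, and it doesn't handle file names containing whitespace.

```shell
#!/bin/sh
# Sketch of walking trees manually while skipping an excluded subtree,
# as a client-side alternative to the sparse:oid filter. The walker is
# hypothetical: it hard-codes one excluded prefix instead of evaluating
# a full sparse checkout specification.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir -p data/text data/images
echo 'hello' > data/text/a.txt
echo 'fake image' > data/images/big.png
git add -A
git -c user.email=x@example.com -c user.name=x commit -qm 'initial commit'

walk_tree() {
  # $1 = tree oid, $2 = path prefix; ls-tree prints
  # "<mode> <type> <oid>\t<name>", which read splits on whitespace
  git ls-tree "$1" | while read -r mode type oid name; do
    path="$2$name"
    case "$path" in data/images*) continue ;; esac  # excluded subtree
    echo "$type $path"
    if [ "$type" = tree ]; then
      walk_tree "$oid" "$path/"
    fi
  done
}
result=$(walk_tree "$(git rev-parse 'HEAD^{tree}')" '')
echo "$result"
```

In a real partial clone, visiting a tree's entries this way would fetch only the trees the walk actually descends into, leaving the excluded subtrees missing.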

Partial clone pack management

(This section isn't specific to bare repositories.)

When using partial clone (with or without sparse checkout), each nonempty fetch results in git writing a new pack under .git/objects/pack. Besides the .pack file itself and the .idx pack index file, git writes a .promisor file to record that the pack may reference objects not stored in the repository; otherwise, git might consider such a pack to be corrupt. As of this writing (git version 2.31.1), git only cares about the existence of this file, not its contents. For a pack containing commit objects, git sometimes writes each fetched ref name and its target (a commit hash) to the .promisor file as an aid to debugging git. These files don't consume much space and may be useful. For a pack containing only blobs, however, git occasionally gets confused and writes a .promisor file containing each blob's hash twice per line (mapping the hash to itself). These files can consume a lot of space and contain no useful information, so they can be truncated (not deleted).
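
A cleanup pass over such files might look like the following sketch. It operates on fabricated stand-in files in a temporary directory rather than a real repository's objects/pack directory, and it assumes, per the description above, that the useless files consist solely of lines repeating a hash twice.

```shell
#!/bin/sh
# Sketch: truncate .promisor files that contain only redundant
# "<hash> <hash>" self-mappings, keeping ones that record ref names.
# The file contents below are fabricated stand-ins; in a real
# repository you would glob objects/pack/*.promisor instead.
set -e
pack=$(mktemp -d)
# stand-in for a blob-only pack's promisor file: hashes mapped to themselves
printf '%s %s\n' aaaa1111 aaaa1111 bbbb2222 bbbb2222 > "$pack/blobs.promisor"
# stand-in for a commit pack's promisor file: hash mapped to a ref name
printf '%s %s\n' cccc3333 refs/heads/trunk > "$pack/refs.promisor"

for f in "$pack"/*.promisor; do
  # truncate only when every line is two equal fields
  if awk 'NF != 2 || $1 != $2 { exit 1 }' "$f"; then
    : > "$f"   # truncate, not delete: git checks only that the file exists
  fi
done
```

Truncating rather than deleting preserves the existence check git relies on while reclaiming the space.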

Splitting objects across many packs is inefficient in both space (because on-disk packs contain deltas only against objects in the same pack) and time (because multiple packs must be searched when retrieving an object). When the number of packs exceeds the value of the gc.autoPackLimit config variable, git gc --auto (which runs automatically after many git commands) will repack them into one large pack. We can also explicitly run git repack to reclaim space earlier. As of git version 2.31.1, git repack will explode promisor packs into loose objects but then immediately delete them, temporarily consuming lots of space (because loose objects are never deltas). A commit fixing this has been merged and should make it into git version 2.32.0.
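
To illustrate the consolidation, the following sketch simulates the many-small-packs situation in a throwaway repository (using local commits in place of fetches, purely for illustration) and then repacks everything into a single pack.

```shell
#!/bin/sh
# Sketch: small packs accumulate one per "fetch" (simulated here with
# local commits), then git repack -a -d rewrites them into one pack.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
for i in 1 2 3; do
  echo "version $i" > file.txt
  git add file.txt
  git -c user.email=x@example.com -c user.name=x commit -qm "commit $i"
  git repack -q -d   # pack the new loose objects: one pack per "fetch"
done
before=$(ls .git/objects/pack/*.pack | wc -l)
git repack -q -a -d  # consolidate everything into a single pack
after=$(ls .git/objects/pack/*.pack | wc -l)
echo "packs before: $before, after: $after"
```

In a real partial clone the packs would come from fetches and carry .promisor files, but the consolidation step is the same.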
