
@infotroph
Last active February 14, 2019 20:44
Find the largest objects in a Git repository

So your Git repository is getting unwieldy. What's causing it? Are the problem files still being updated, or are they the deleted ghosts of old binaries?

git cat-file --batch-check --batch-all-objects --unordered \
  | sort --numeric-sort --key=3 \
  | tail -n10 \
  | xargs -L1 sh -c \
    'pth=`git describe --always $0`; \
    printf "%s %s %s %s\n" $0 $1 $2 $pth'

This lists the 40-character hash, type, and size of every object in the repository, sorts the list by object size, and then, for each of the ten largest objects, uses git describe to look up the commit and file path, if any, where the object was first committed. Note that if your history contains identical files at different paths (same hash = Git treats them as a single object), this will probably report only the first of those paths it encounters.
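The size reported by the default --batch-check format is the uncompressed size of each object, which can differ a lot from what the object costs on disk. As a minimal sketch (not part of the pipeline above), the same listing can be sorted by on-disk size using a custom format string. The function name largest_on_disk is hypothetical; the %(objectsize:disk) placeholder comes from the git cat-file documentation, and for objects stored as deltas it reflects the size of the delta rather than the full content.

```shell
# Hypothetical helper: list the N largest objects by on-disk size.
# Output columns: on-disk bytes, uncompressed bytes, object type, hash.
largest_on_disk() {
  n="${1:-10}"   # number of objects to show; defaults to 10
  git cat-file --batch-all-objects \
      --batch-check='%(objectsize:disk) %(objectsize) %(objecttype) %(objectname)' \
    | sort -n -k1,1 \
    | tail -n "$n"
}
```

Run it inside a repository as `largest_on_disk 10`. Comparing the first two columns gives a rough idea of how well each large object compresses.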

Sample output from a fresh clone of the PEcAn project, showing the lingering effect of large files that have been git rm'd but never purged from history:

ef43914c34358313ba9000e0ca89d08d6637dd6e blob 10330616 v1.4.10-4312-g3c46c8b12:book_source/04_advanced_user_guide/02_adding_to_pecan/01_case_studies/images/data-ingest/D1Ingest-4.gif
4b2bb916ce93534d75fb17f1cea3db95ff95eba3 blob 10399744 ffcba4a4e:data/FIA/data/FIADB_version5_1.accdb
5dc65da1b8d07b2881caf80e73c682fed15a3ecd blob 10613926 v1.4.10-4312-g3c46c8b12:book_source/04_advanced_user_guide/02_adding_to_pecan/01_case_studies/images/data-ingest/D1Ingest-9.gif
cd8ec003b7ed3495b57c3069df104c88c3c42003 blob 14343300 v1.4.10-5339-g9a4121fde:modules/assim.sequential/inst/Dashboard/Utilities/eco-region2.json
ef8989594379f217e2a49cd671b93095b8dc8581 blob 16206088 4622f71c3:RHIN_forcing_h.nc
7686e4088efbd20af93daa2b1c1e6d6a4a247814 blob 21623287 v1.4.10-4312-g3c46c8b12:book_source/04_advanced_user_guide/02_adding_to_pecan/01_case_studies/images/data-ingest/D1Ingest-6.gif
f05884847f63de41c75d6a32cc59a70db657f016 blob 24763368 v1.4.10-4312-g3c46c8b12:book_source/04_advanced_user_guide/02_adding_to_pecan/01_case_studies/images/data-ingest/D1Ingest-5.gif
3d21fbcffb553fc66180d1eb33c1242316a0d712 blob 24882210 v1.4.10-4312-g3c46c8b12:book_source/04_advanced_user_guide/02_adding_to_pecan/01_case_studies/images/data-ingest/D1Ingest-8.gif
82d82811e1a037727e12c6725446ea321d5b3782 blob 29209012 v1.4.10-5339-g9a4121fde:modules/assim.sequential/inst/Dashboard/Utilities/eco-region.shp
2e727dd54362c00f62d531c54ae82f4b739d75f7 blob 42682371 0822af4d6:modules/data.atmosphere/inst/extdata/urbana_subdaily_narr_test.nc

This method is almost equivalent to another pattern that I've seen recommended in many places, but that pattern is very slow in repositories with any substantial history:

for sha in $(git rev-list --all); do
  git ls-tree -r --long "$sha";
done \
  | sort --key=3 \
  | uniq \
  | sort --key=4 --numeric-sort \
  | tail

The difference is that the first approach (git cat-file) works upward from objects, so it finds everything (including dangling blobs and loose objects that will be repacked in the next git gc cycle), and then looks up a corresponding path in the history if there ever was one. The second approach works downward from commits, so it only finds objects that were in the tree of some reachable commit. Those are the objects that make your collaborators ask why the repository is so big, but git cat-file ... will run in seconds where git rev-list ... might take minutes.
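If you want only reachable objects but without running git ls-tree once per commit, a commonly used variant feeds git rev-list --objects into git cat-file. This is a sketch under assumptions, not something timed for this write-up: the function name largest_reachable_blobs is hypothetical, and the %(rest) placeholder simply echoes whatever follows the hash on each input line, which here is the path that rev-list reports for the object.

```shell
# Hypothetical helper: the N largest blobs reachable from any ref.
# Output columns: type, hash, uncompressed bytes, one path the blob lived at.
largest_reachable_blobs() {
  n="${1:-10}"   # number of blobs to show; defaults to 10
  git rev-list --objects --all \
    | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
    | awk '$1 == "blob"' \
    | sort -n -k3,3 \
    | tail -n "$n"
}
```

Because rev-list prints each object once, a blob that existed at several paths is shown with only one of them, and dangling or otherwise unreachable objects do not appear at all.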

For an example of the difference, here's the output from the same git cat-file pipeline shown above, run in my working PEcAn tree rather than a fresh clone. The unlabeled loose objects are a bit annoying, but we can see that the large deleted-but-reachable files are the same here as in the clean clone above. By checking these paths, we can verify that most of these large objects are leftovers from files that were already removed or replaced with smaller versions, and that any further trimming would require rewriting history.

~/projects/pecan> time git cat-file --batch-check --batch-all-objects --unordered \
  | sort --numeric-sort --key=3 \
  | tail -n10 \
  | xargs -L1 sh -c \
    'pth=`git describe --always $0`; \
    printf "%s %s %s %s\n" $0 $1 $2 $pth'
ef8989594379f217e2a49cd671b93095b8dc8581 blob 16206088 4622f71c3:RHIN_forcing_h.nc
7686e4088efbd20af93daa2b1c1e6d6a4a247814 blob 21623287 v1.4.10-4312-g3c46c8b12:book_source/04_advanced_user_guide/02_adding_to_pecan/01_case_studies/images/data-ingest/D1Ingest-6.gif
152e8caac4b8f2d497d00ed6666845a8b11a191f blob 24029073 
40920cb1e0e445bbe7823a87215f5a2fd14d97ed blob 24086402 
f05884847f63de41c75d6a32cc59a70db657f016 blob 24763368 v1.4.10-4312-g3c46c8b12:book_source/04_advanced_user_guide/02_adding_to_pecan/01_case_studies/images/data-ingest/D1Ingest-5.gif
3d21fbcffb553fc66180d1eb33c1242316a0d712 blob 24882210 v1.4.10-4312-g3c46c8b12:book_source/04_advanced_user_guide/02_adding_to_pecan/01_case_studies/images/data-ingest/D1Ingest-8.gif
82d82811e1a037727e12c6725446ea321d5b3782 blob 29209012 v1.4.10-5339-g9a4121fde:modules/assim.sequential/inst/Dashboard/Utilities/eco-region.shp
3cbc286fbf45c7676137639c16cbbcc5fbd2197e blob 30855985 
6675f299b2892ac4cacc1f2614a66b148e0d01d8 blob 30933885 
2e727dd54362c00f62d531c54ae82f4b739d75f7 blob 42682371 0822af4d6:modules/data.atmosphere/inst/extdata/urbana_subdaily_narr_test.nc
git cat-file --batch-check --batch-all-objects --unordered  0.30s user 0.05s system 94% cpu 0.373 total
sort --numeric-sort --key=3  0.66s user 0.02s system 62% cpu 1.092 total
tail -n10  0.26s user 0.00s system 23% cpu 1.092 total
xargs -L1 sh -c   3.78s user 0.47s system 78% cpu 5.401 total

That's ~5.5 seconds in a repository containing 110152 objects from 21898 commits. I set out to time the git rev-list approach for comparison, but got bored and killed the process after 20 minutes.
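For the unlabeled objects in the listing above, it can help to confirm whether they really are unreachable. As far as I understand the git describe documentation, a blob is described by walking history from HEAD, so an empty label could also mean "reachable, but only from some other branch". The sketch below checks a hash against everything reachable from any ref; the function name object_reachability is hypothetical, and objects referenced only by the reflog count as unreachable here.

```shell
# Hypothetical helper: report whether an object is reachable from any ref.
# Expects a full object hash as $1; prints "reachable" or "unreachable".
object_reachability() {
  # rev-list --objects prints one "<hash> [<path>]" line per reachable object.
  if git rev-list --objects --all | grep "^$1" >/dev/null; then
    echo reachable
  else
    echo unreachable
  fi
}
```

For example, `object_reachability 152e8caac4b8f2d497d00ed6666845a8b11a191f` would classify the first unlabeled blob from the PEcAn listing. It rescans all of history on every call, so it suits a handful of hashes rather than thousands.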

Now let's look at a different repository. This one has 'merely' 7782 objects from 592 commits, but it distributes precompiled binaries in the repository and updates them with each release:

~/projects/OpenSimRoot> time git cat-file --batch-check --batch-all-objects --unordered \
  | sort --numeric-sort --key=3 \
  | tail -n10 \
  | xargs -L1 sh -c \
    'pth=`git describe --always $0`; \
    printf "%s %s %s %s\n" $0 $1 $2 $pth'
773ea5c271fc930e95f61d60be6dee86ec53f3b9 blob 5283863 5550193:public/executables/OpenSimRoot_Win_x64.exe
ec6ad788fdebd2672df3564056b9fd9af343040f blob 8375432 61c570b:public/executables/OpenSimRoot_Linux_x64
8aad2a934d5236c71cedd3ce014e40032f5be0b7 blob 8476888 00bea4a:public/executables/OpenSimRoot_Linux_x64
f35e04a86af4fef8d50ccf25195dd00d3d97f101 blob 8504504 5550193:public/executables/OpenSimRoot_Linux_x64
d5c22344e9bfe7538c72425266609322cf3f9f36 blob 10771496 
4246e6789a710f9455b9217feb1ef7a9c6681836 blob 10771696 
b1b84c7459d9d40c2465ef62f1c8f6ee9106909e blob 10771696 
469ff5f877d6bbb45e7aacbc27fdc8ae7bc13e1c blob 16159800 61c570b:public/executables/OpenSimRoot_Win_x64.exe
c1e5be7e82a437bbd70672b7884b627f40852eab blob 16607769 00bea4a:public/executables/OpenSimRoot_Win_x64.exe
4b64179768446af6d06f77c01a6ae2d982545a34 blob 35155538 
git cat-file --batch-check --batch-all-objects --unordered  0.03s user 0.01s system 91% cpu 0.047 total
sort --numeric-sort --key=3  0.03s user 0.00s system 44% cpu 0.079 total
tail -n10  0.02s user 0.00s system 25% cpu 0.080 total
xargs -L1 sh -c   0.71s user 0.13s system 82% cpu 1.015 total
~/projects/OpenSimRoot> time for sha in $(git rev-list --all); do git ls-tree -r --long "$sha"; done \
  | sort --key=3 \
  | uniq \
  | sort --key=4 --numeric-sort \
  | tail
100755 blob 44cb32a64947038f7c492370e8870b01e39d34e7 4738421    OpenSimRoot/StaticBuild_win64/OpenSimRoot_win64
100755 blob 61b1914840b1d06f88a53909b6b54b893df881a5 4738421    public/executables/OpenSimRoot_Win_x64.exe
100644 blob e94744147af0ce2b471ffb43e8bdaccb823877a2 4963762    OpenSimRoot/tests/engine/refTestResults/ResultSimulaStochastic.tab
100644 blob e94744147af0ce2b471ffb43e8bdaccb823877a2 4963762    OpenSimRoot/tests/engine/testResults/ResultSimulaStochastic.tab
100755 blob 773ea5c271fc930e95f61d60be6dee86ec53f3b9 5283863    public/executables/OpenSimRoot_Win_x64.exe
100755 blob ec6ad788fdebd2672df3564056b9fd9af343040f 8375432    public/executables/OpenSimRoot_Linux_x64
100755 blob 8aad2a934d5236c71cedd3ce014e40032f5be0b7 8476888    public/executables/OpenSimRoot_Linux_x64
100755 blob f35e04a86af4fef8d50ccf25195dd00d3d97f101 8504504    public/executables/OpenSimRoot_Linux_x64
100755 blob 469ff5f877d6bbb45e7aacbc27fdc8ae7bc13e1c 16159800   public/executables/OpenSimRoot_Win_x64.exe
100755 blob c1e5be7e82a437bbd70672b7884b627f40852eab 16607769   public/executables/OpenSimRoot_Win_x64.exe
for sha in $(git rev-list --all); do; git ls-tree -r --long "$sha"; done  3.27s user 2.27s system 80% cpu 6.858 total
sort --key=3  12.08s user 0.14s system 64% cpu 18.925 total
uniq  0.93s user 0.01s system 4% cpu 18.924 total
sort --key=4 --numeric-sort  0.01s user 0.00s system 0% cpu 18.938 total
tail  0.02s user 0.00s system 0% cpu 18.940 total

In contrast to the PEcAn repository above, this project has an ongoing object-bloat problem: revisions of binary files that still exist in the current tree already account for the majority of the repository size, and the repository will keep growing unless commit behavior changes.
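One way to put a number on this kind of ongoing bloat is to add up the sizes of every stored version of each path. The following is a rough sketch rather than a measurement made for this write-up: the function name bytes_per_path is hypothetical, sizes are uncompressed, and a blob that lived at several paths is counted only under the one path that git rev-list reports for it.

```shell
# Hypothetical helper: total uncompressed bytes of all reachable blob
# versions, summed per path and sorted so the heaviest path comes last.
bytes_per_path() {
  git rev-list --objects --all \
    | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
    | awk '$1 == "blob" {
             size = $2
             sub(/^[^ ]+ [^ ]+ /, "")   # strip type and size, leaving the path
             total[$0] += size
           }
           END { for (p in total) printf "%.0f\t%s\n", total[p], p }' \
    | sort -n -k1,1
}
```

Running `bytes_per_path | tail` in a repository like this one should put the repeatedly updated executables at the bottom of the list, each with the combined size of all its revisions.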
