List all Git repository objects by size

Summary

Bash script to:

  • Iterate over every commit in a Git repository.
  • List every object in the tree at each commit.
  • Print the unique objects in descending order of size.

Useful for finding large resources to remove from a Git repository, for instance before a migration to GitHub, where individual files are limited to a maximum of 100MB.
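
To check a listing against a size limit like the one mentioned above, the script's output can be piped through a small filter. This is only a sketch: it assumes the `git ls-tree -r --long` line format (size in the fourth whitespace-separated field), and the function name `over_limit` and its default threshold are made up for illustration.

```shell
#!/bin/bash
# over_limit: hypothetical helper, not part of the original script.
# Reads `git ls-tree -r --long` style lines on stdin and prints only
# those whose size (4th field) exceeds the given number of bytes.
over_limit() {
	# default of 100 MiB is an assumed reading of the "100MB" limit
	local limit=${1:-104857600}
	# entries without a numeric size (e.g. submodules) are skipped
	awk -v limit="$limit" '$4 ~ /^[0-9]+$/ && $4 + 0 > limit + 0'
}
```

Possible usage: `./gitlistobjectbysize.sh | over_limit 104857600`.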

Example

$ ./gitlistobjectbysize.sh

100644 blob de6bdeaefebec0bff53d4859833caddba635609c    123452290	something/really/large.iso
100644 blob 946488f3c2ab8abf5d36b88f9018af77dceda12d         2290	path/to/script.js
100644 blob 2e234e61460f2fa087f9aebbfee2f6b524bc38fe         1724	README.md
100644 blob 1807d789603ae1038985f76c54e6de3b093da761         1710	README.md
100644 blob 7b5071e880f1abed9191fb34425157901c0a51a7         1083	LICENSE
100755 blob ef377e40d54365c814b9324ab4001455f4b5d4d8          651	bashscript.sh
100644 blob 08ca429f5434247f12f503dd69df244399d4ef83           19	.gitignore
100644 blob 8a52f946a9aed2c242cbe8891b3510f750527bb2           18	.gitignore

If we now wish to remove something/really/large.iso we can rewrite history using git filter-branch:

$ git filter-branch \
	--tree-filter "rm -f something/really/large.iso" \
	-- --all

Ref 'refs/heads/master' was rewritten

#!/bin/bash -e

function main {
	local tempFile=$(mktemp)

	# work over each commit and append all files in tree to $tempFile
	local IFS=$'\n'
	local commitSHA1
	for commitSHA1 in $(git rev-list --all); do
		git ls-tree -r --long "$commitSHA1" >>"$tempFile"
	done

	# sort files by SHA1, de-dupe list and finally re-sort by filesize
	sort --key 3 "$tempFile" | \
		uniq | \
		sort --key 4 --numeric-sort --reverse

	# remove temp file
	rm "$tempFile"
}

main

@Murphydbuffalo Murphydbuffalo commented Dec 16, 2017

+1000000


@kosmodisk kosmodisk commented Feb 23, 2018

fantastic stuff


@johnwake johnwake commented May 4, 2018

Very nice! 💯 👍


@hrvoj3e hrvoj3e commented Sep 7, 2018

I get this message when I try to remove more than one object from repo.

Cannot create a new backup.
A previous backup already exists in refs/original/
Force overwriting the backup with -f

Is it safe to add -f to git filter-branch -f --tree-filter when I need to remove multiple files from the repo?

Owner Author

@magnetikonline magnetikonline commented Sep 11, 2018

Hey @hrvoj3e - yeah, safe to use -f in those instances. It's just Git being nice and warning you that a previous history rewrite was run and you're going to blow away its backup if you proceed.

As always - do the right thing and backup/tar the entire repo before you start to give yourself a quick escape hatch! 👍
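
One way to take that backup is sketched below. It assumes a plain tar archive of the whole working directory (including .git) is enough; the function name `backup_repo` and the archive naming scheme are invented for this example.

```shell
#!/bin/bash
# backup_repo: hypothetical helper illustrating the "tar the entire repo" advice.
# Archives the given repository directory (including .git) into a
# timestamped tarball next to it and prints the archive path.
backup_repo() {
	local repoDir=${1:?usage: backup_repo <repo-dir>}
	local abs parent name archive
	abs=$(cd "$repoDir" && pwd) || return 1
	parent=$(dirname "$abs")
	name=$(basename "$abs")
	# the archive lives beside the repo, so it never ends up inside itself
	archive="$parent/$name-backup-$(date +%Y%m%d%H%M%S).tar.gz"
	tar -czf "$archive" -C "$parent" "$name" || return 1
	echo "$archive"
}
```

Running `backup_repo /path/to/repo` before `git filter-branch -f` leaves an archive that can simply be unpacked if the rewrite goes wrong.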


@inkwisit inkwisit commented Oct 28, 2018

It worked for me! Thank you very much for the help.


@romain-dartigues romain-dartigues commented Dec 11, 2018

Just because, oneliner: git rev-list --all | xargs -rL1 git ls-tree -r --long | sort -uk3 | sort -rnk4.

Thanks for the script.


@bohan0 bohan0 commented Feb 25, 2019

@romain-dartigues I get an error that "-r" is not a supported option:

$ git rev-list --all | xargs -rL1 git ls-tree -r --long | sort -uk3 | sort -rnk4
xargs: illegal option -- r
usage: xargs [-0opt] [-E eofstr] [-I replstr [-R replacements]] [-J replstr]
[-L number] [-n number [-x]] [-P maxprocs] [-s size]
[utility [argument ...]]


@codeharrier codeharrier commented Mar 4, 2019

-r is "--no-run-if-empty", telling xargs not to run "git ls-tree" on an empty input, which should only happen if you have nothing in the rev-list.
So you can try the command without it or install a version of xargs that has it.
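
For an xargs without -r, another option is to drop xargs altogether. The sketch below is an illustrative rewrite of the one-liner, not the original author's code; the function name `list_objects_portable` is made up. A `while read` loop never runs its body on empty input, so no equivalent of -r is needed.

```shell
#!/bin/bash
# list_objects_portable: hypothetical variant of the one-liner that avoids
# `xargs -r`. Must be run from inside a Git repository.
list_objects_portable() {
	git rev-list --all |
		while IFS= read -r commitSHA1; do
			# one ls-tree call per commit, same as `xargs -L1`
			git ls-tree -r --long "$commitSHA1"
		done |
		sort -u -k3 |   # de-dupe on SHA1, size and path
		sort -rn -k4    # largest size first
}
```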


@n-mam n-mam commented May 8, 2019

@magnetikonline Great script. My repo on Windows (which I see you try to avoid) had some accidental commits of big VS Code ipch files, so I adapted it a bit:

for /f %i in ('git rev-list --all') do git ls-tree -r --long "%i" >> x.txt

Then I imported x.txt into Excel, delimited the import by spaces and sorted by the size column :)

Owner Author

@magnetikonline magnetikonline commented Aug 6, 2019

@GilesBathgate I'd probably look at something other than Bash at this point for an implementation (probably Python?).


@Julian Julian commented Nov 11, 2019

Similar to https://gist.github.com/magnetikonline/dd5837d597722c9c2d5dfa16d8efe5b9#gistcomment-2782586,

git rev-list --all | parallel git ls-tree -r --long "{}" | sort --key 3 | uniq | sort --key 4 --numeric-sort --reverse


@akostadinov akostadinov commented Jan 15, 2020

How do you find out which commits/branches have the objects? I did this:

#!/bin/bash -e
  
# work over each commit and append all files in tree to $tempFile
tempFile=$(mktemp)
IFS=$'\n'
for commitSHA1 in $(git rev-list --all); do
        git ls-tree -r --long "$commitSHA1" | \
        sed -e "s/^/$commitSHA1 /" >> "$tempFile"
done

# sort files by SHA1, de-dupe list and finally re-sort by filesize
sort --key 4 "$tempFile" | \
        uniq -f 1 | \
        sort --key 5 --numeric-sort --reverse

# remove temp file
rm "$tempFile"

But I think there should be a better way to check where these objects were introduced.

Owner Author

@magnetikonline magnetikonline commented Jan 15, 2020

Hey @akostadinov - that's actually quite a nice addition. You could also look up which branches contain a given commit after the fact:

git branch --contains SHA1
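
On the question of where an object was introduced, Git 2.16 and later also provide `git log --find-object`, which limits the log to commits that add or remove a given object. A minimal sketch, assuming that Git version is available; the function name `commits_touching_blob` is invented for illustration.

```shell
#!/bin/bash
# commits_touching_blob: hypothetical helper. Prints "<commit SHA1> <subject>"
# for every commit, on any ref, that adds or removes the given blob.
# Assumes Git 2.16+ (for --find-object) and a Git repository as the cwd.
commits_touching_blob() {
	local blobSHA1=${1:?usage: commits_touching_blob <blob-sha1>}
	git log --all --format='%H %s' --find-object="$blobSHA1"
}
```

Each commit SHA1 it prints can then be passed to `git branch --contains` as above.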

@Maxattax97 Maxattax97 commented Oct 13, 2020

Made some modifications: it de-duplicates the file paths, shows file sizes in human readable format, reduces to human-friendly columns, and sorts it so you'll quickly see the largest objects next to your prompt (without scrolling or less).

Sample output:

...
46 KiB		clib/docs/kbuild/makefiles.txt
58 KiB		clib/scripts/kconfig/zconf.lex.c_shipped
71 KiB		clib/lib/cmocka/cmocka.h
75 KiB		clib/scripts/kconfig/zconf.tab.c_shipped
110 KiB		clib/lib/cmocka/cmocka.c
$ # your shell prompt starts here

https://gist.github.com/Maxattax97/f566fdf67ac4ad2492ea1c732f5afdda
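
The linked gist is not reproduced here. As a rough, independent sketch of the human-readable size idea (not Maxattax97's actual code), the filter below converts `git ls-tree -r --long` lines into size and path columns with 1024-based units. The function name `humanize_sizes` and the rounding to whole units are assumptions of this example.

```shell
#!/bin/bash
# humanize_sizes: hypothetical filter. Reads `git ls-tree -r --long` lines on
# stdin and prints "<size> <unit><TAB><path>" using binary (1024-based) units.
humanize_sizes() {
	awk -F '\t' '
		{
			# $1 is "<mode> <type> <sha1> <size>", $2 is the path
			split($1, meta, " ")
			size = meta[4]
			# skip entries without a numeric size (e.g. submodules)
			if (size !~ /^[0-9]+$/) next
			split("B KiB MiB GiB TiB", unit, " ")
			u = 1
			while (size >= 1024 && u < 5) { size /= 1024; u++ }
			printf "%.0f %s\t%s\n", size, unit[u], $2
		}'
}
```

Feeding it a list sorted in ascending size order (for example `sort -nk4`) leaves the largest objects at the bottom, next to the prompt.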

Owner Author

@magnetikonline magnetikonline commented Oct 13, 2020

I like this @Maxattax97 - probably at the point I'd convert this to a Python script... 😄


@kenorb kenorb commented Nov 9, 2020

Add these aliases to your ~/.gitconfig file:

[alias]
  big-files    = !"git rev-list --objects --all \
                 | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
                 | sed -n 's/^blob //p' \
                 | sort -nk2 \
                 | cut -c 1-12,41- \
                 | $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest"
  big-objects = !"git rev-list --all \
                | parallel git ls-tree -r --long "{}" \
                | sort -uk3 \
                | sort -nk4"

Then run git big-files or git big-objects. It'll show the biggest files/objects at the bottom (tested on macOS).

Owner Author

@magnetikonline magnetikonline commented Nov 10, 2020

Nice additions @kenorb - thanks! 👍


@voiski voiski commented Dec 24, 2020

If I'm not wrong, this only affects your local repository. If you run a fresh clone it will also download the orphan commits, the ones you detached. The only way to make it fully work is to push your local repository to a new repo and replace the original with that new one. The alternative is to manually drop those orphan commits on the server, which requires admin access and so is not possible on GitHub.

You can push to the server and it will rewrite the branch, but the orphan commits will still be there.

Another tool for this job, with the same issue I pointed out: https://rtyley.github.io/bfg-repo-cleaner/

Owner Author

@magnetikonline magnetikonline commented Dec 28, 2020

@voiski that's not quite correct. To clarify:

  • Using git filter-branch, BFG, etc. rewrites the commit(s) that reference the objects, removing the objects from your local history.
  • Force pushing to a remote then creates a rewritten commit log which does not reference the offending commit(s), but the old commit(s) will still exist in the remote as unreachable/orphan commits.
  • If someone then does a fresh git clone they will not receive the orphan commit(s) - only reachable commits based on the commit log are fetched - so the fresh clone will not receive the removed objects.
  • If you have direct access to the remote repo you can run git gc --aggressive --prune=now there, which removes the orphan commit(s) in question from the remote entirely.
  • In the case of GitHub, that's a bit of a black box - but we can assume it somewhat follows the rules that are outlined against the gc.auto setting, which determines when a Git repository (both local and remote) will self-clean/prune/gc. It's fair to say at some point it will git gc based on number of commits/pushes/etc.
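
For a repository you control directly, the pruning step can be sketched as below. It follows the general recipe in the git filter-branch documentation for shrinking a repository (drop the refs/original/ backups, expire the reflogs, then gc); the function name `purge_rewritten_objects` is made up, and the steps are destructive, so they assume a backup already exists.

```shell
#!/bin/bash
# purge_rewritten_objects: hypothetical helper. Run inside a repository after
# a history rewrite to delete the now-unreachable objects. Destructive.
purge_rewritten_objects() {
	# delete the backup refs that git filter-branch leaves under refs/original/
	git for-each-ref --format='%(refname)' refs/original/ |
		while IFS= read -r ref; do
			git update-ref -d "$ref"
		done
	# drop reflog entries that still point at the old commits
	git reflog expire --expire=now --all
	# repack and delete objects that are no longer reachable
	git gc --quiet --prune=now
}
```

The reflog step is needed because the old commits stay reachable from reflog entries, so `git gc` alone would keep them.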

@voiski voiski commented Dec 29, 2020

@magnetikonline Thanks for the clarification. I got confused because I had tried it before, and it didn't work, but mine was to redact secrets in the source code. I guess if you don't fully delete the file, it will still keep the orphan commits. But, for the gist intention here, it looks fine.


@pansila pansila commented Feb 18, 2021

Python version

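import subprocess
from tqdm import tqdm  # third-party progress bar: pip install tqdm

files = set()

# walk every commit reachable from any ref and collect its tree entries;
# the set drops entries that repeat across commits
commits = subprocess.check_output(['git', 'rev-list', '--all'], text=True)
for c in tqdm(commits.splitlines()):
    files.update(subprocess.check_output(['git', 'ls-tree', '-r', '--long', c], text=True).splitlines())

# submodule entries report "-" instead of a size, so keep only numeric sizes
sized = [f for f in files if f.split()[3].isdigit()]
# largest first, top 100
sized.sort(key=lambda x: int(x.split()[3]), reverse=True)
print('\n'.join(sized[:100]))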

@tombohub tombohub commented Mar 19, 2021

I waited more than 5 minutes to see any result, and only when I pressed a key did the results appear.
If I hadn't pressed a key I would have waited for eternity.

WSL1


@Jan-Bruun-Andersen Jan-Bruun-Andersen commented Jul 26, 2021

I probably went a bit overboard, but here is my version:

https://github.com/Jan-Bruun-Andersen/git-ls-blobs
