On a recent Grails project, we were using a git repo that had originally been converted from an SVN repo with a ton of large binary objects in it (lots of jar files that really should have come from an Ivy/Maven repo). The .git
directory was over a gigabyte in size, which made the repo very cumbersome to clone and manipulate.
We decided to leverage git's history rewriting capabilities to make a much smaller repository (keeping our previous repo as a backup, just in case).
Here are a few questions I figured out how to answer with git and some shell commands:
Git has a unique SHA that it associates with each object (such as files, which it calls blobs) throughout its history.
This helps us find an object and decide whether it's worth deleting later on:
git rev-list --objects --all | sort -k 2 > allfileshas.txt
Take a look at the resulting allfileshas.txt
file for the full list.
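If you're curious about any one of those SHAs, you can ask git about the object directly (a quick sanity check; replace <sha> with an actual SHA from allfileshas.txt):

git cat-file -t <sha> # prints the object's type, e.g. blob
git cat-file -s <sha> # prints the object's size in bytes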
If you want to see the unique files throughout the history of your git repo (for instance, to grep for .jar files that you might have committed a while ago):
git rev-list --objects --all | sort -k 2 | cut -f 2 -d' ' | uniq
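For example, to pull just the jar files out of that list (the pattern is illustrative, adjust to taste):

git rev-list --objects --all | sort -k 2 | cut -f 2 -d' ' | uniq | grep '\.jar$'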
We can find the big files in our repo by running git gc,
which makes git compact the archive and write an index file that we can analyse.
Get the last object SHA for all committed files and sort them from biggest to smallest:
git gc && git verify-pack -v .git/objects/pack/pack-*.idx | egrep "^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$" | sort -k 3 -n -r > bigobjects.txt
(The third column of verify-pack's output is the object's uncompressed size in bytes, which is what we're sorting on.)
Take that result and iterate through each line of it to find the SHA, file size in bytes, and real file name (you also need the allfileshas.txt output file from above):
for SHA in `cut -f 1 -d' ' bigobjects.txt`; do
    echo $(grep $SHA bigobjects.txt) $(grep $SHA allfileshas.txt) | awk '{print $1,$3,$7}' >> bigtosmall.txt
done
(there's probably a more efficient way to do this, but this was fast enough for my purposes with ~50k files in our repo)
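If you want something faster, here's a sketch of a single-pass alternative that joins the two files on the SHA column instead of grepping once per file (like the awk above, it assumes filenames don't contain spaces):

join <(sort -k 1,1 bigobjects.txt) <(sort -k 1,1 allfileshas.txt) | awk '{print $1, $3, $6}' | sort -k 2 -n -r > bigtosmall.txt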
Then, just take a look at the bigtosmall.txt file to see your biggest file culprits.
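Once you've identified the culprits, a history rewrite along these lines can purge them for good (a sketch only; lib/huge-library.jar is a hypothetical path, and you should try this on a fresh clone first since it rewrites every commit):

git filter-branch --index-filter 'git rm --cached --ignore-unmatch lib/huge-library.jar' --prune-empty -- --all
rm -rf .git/refs/original # drop filter-branch's backup refs
git reflog expire --expire=now --all
git gc --aggressive --prune=now # repack without the purged blobs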