@magnetikonline
Last active June 19, 2024 00:00
List all Git repository objects by size

Summary

Bash script which will:

  • Iterate all commits made within a Git repository.
  • List every object at each commit.
  • List unique objects in descending order of size.

Useful for finding large resources to remove from a Git repository, for instance before a migration to GitHub, where individual files are limited to a maximum of 100MB.

Example

$ ./gitlistobjectbysize.sh

100644 blob de6bdeaefebec0bff53d4859833caddba635609c    123452290	something/really/large.iso
100644 blob 946488f3c2ab8abf5d36b88f9018af77dceda12d         2290	path/to/script.js
100644 blob 2e234e61460f2fa087f9aebbfee2f6b524bc38fe         1724	README.md
100644 blob 1807d789603ae1038985f76c54e6de3b093da761         1710	README.md
100644 blob 7b5071e880f1abed9191fb34425157901c0a51a7         1083	LICENSE
100755 blob ef377e40d54365c814b9324ab4001455f4b5d4d8          651	bashscript.sh
100644 blob 08ca429f5434247f12f503dd69df244399d4ef83           19	.gitignore
100644 blob 8a52f946a9aed2c242cbe8891b3510f750527bb2           18	.gitignore
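
When preparing a migration, only the entries above the size limit usually matter. A minimal sketch that filters the report down to those entries follows. It assumes the four-column layout shown above (size in the fourth field); the `filterBySize` name and the 100 MiB default threshold are illustrative assumptions, not part of the original script.

```shell
#!/bin/bash

# Hypothetical helper (not part of the gist): reads report lines on stdin and
# keeps those whose 4th field (object size in bytes) is at least $1.
# The default of 104857600 bytes (100 MiB) is an assumed threshold.
function filterBySize {
	local minBytes="${1:-104857600}"

	# entries without a numeric size (e.g. submodule entries, shown as "-") are skipped
	awk -v min="$minBytes" '$4 ~ /^[0-9]+$/ && ($4 + 0) >= (min + 0)'
}
```

Possible usage, after sourcing the function: `./gitlistobjectbysize.sh | filterBySize`, or `./gitlistobjectbysize.sh | filterBySize 1000000` for a lower threshold.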

Note

For Git version 2.38.0 and above, the gitlistobjectbysize-git2.38.0.sh script variant uses the git ls-tree --format argument to produce a more succinct report.
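
Choosing between the two variants can be automated. The sketch below compares the installed Git version against 2.38.0; the helper names `versionAtLeast` and `pickScript` are hypothetical, and it assumes `git --version` prints the version number as its third word.

```shell
#!/bin/bash

# Hypothetical helper: succeeds when dotted version $1 is >= dotted version $2.
# Only the first three numeric components are compared.
function versionAtLeast {
	local IFS=.
	local -a have=($1) want=($2)
	local i h w

	for i in 0 1 2; do
		h="${have[i]:-0}"
		w="${want[i]:-0}"
		# drop any non-numeric suffix, e.g. "0-rc1" becomes "0"
		h="${h%%[!0-9]*}"
		w="${w%%[!0-9]*}"
		if (( 10#${h:-0} > 10#${w:-0} )); then return 0; fi
		if (( 10#${h:-0} < 10#${w:-0} )); then return 1; fi
	done

	return 0
}

# Hypothetical helper: prints the script variant matching the installed Git.
function pickScript {
	local gitVersion
	gitVersion=$(git --version | awk '{print $3}')

	if versionAtLeast "$gitVersion" 2.38.0; then
		echo ./gitlistobjectbysize-git2.38.0.sh
	else
		echo ./gitlistobjectbysize.sh
	fi
}
```

With both scripts in the current directory, `"$(pickScript)"` would then run the appropriate one.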

If we now wish to remove something/really/large.iso we can rewrite history using git filter-branch:

$ git filter-branch \
  --tree-filter "rm -f something/really/large.iso" \
  -- --all

Ref 'refs/heads/main' was rewritten
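
The --tree-filter form checks out every commit to disk before running the command, which is slow on long histories. A commonly used alternative is --index-filter, which rewrites only the index. The sketch below wraps that form in a function; the `removePathFromHistory` name is hypothetical, and the simple quoting assumes the path contains no single quotes.

```shell
#!/bin/bash

# Hypothetical wrapper: remove one path from every commit on every ref,
# using --index-filter so no working tree checkout is needed per commit.
# Assumes $1 contains no single quote characters.
function removePathFromHistory {
	local targetPath="${1:?usage: removePathFromHistory <path>}"

	git filter-branch --force \
		--index-filter "git rm --cached --ignore-unmatch --quiet -- '$targetPath'" \
		--prune-empty \
		-- --all
}
```

For example: `removePathFromHistory something/really/large.iso`. Note that git filter-branch keeps backup refs under refs/original/, and the reflog still points at the old commits, so the repository will not shrink until those are removed and git gc has pruned the unreachable objects.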

gitlistobjectbysize-git2.38.0.sh

#!/bin/bash -e

function main {
	# declare and assign separately so a mktemp failure is not masked by `local`
	local tempFile
	tempFile=$(mktemp)

	# work over each commit and append all files in tree to $tempFile
	local IFS=$'\n'
	local commitSHA1
	for commitSHA1 in $(git rev-list --all); do
		git ls-tree \
			--format="%(objectname) %(objectsize:padded) %(path)" \
			-r \
			"$commitSHA1" >>"$tempFile"
	done

	# sort files by SHA-1, de-dupe list and finally re-sort by filesize
	sort --key 1 "$tempFile" | \
		uniq | \
		sort --key 2 --numeric-sort --reverse

	# remove temp file
	rm "$tempFile"
}


main

gitlistobjectbysize.sh

#!/bin/bash -e

function main {
	# declare and assign separately so a mktemp failure is not masked by `local`
	local tempFile
	tempFile=$(mktemp)

	# work over each commit and append all files in tree to $tempFile
	local IFS=$'\n'
	local commitSHA1
	for commitSHA1 in $(git rev-list --all); do
		git ls-tree -r --long "$commitSHA1" >>"$tempFile"
	done

	# sort files by SHA-1, de-dupe list and finally re-sort by filesize
	sort --key 3 "$tempFile" | \
		uniq | \
		sort --key 4 --numeric-sort --reverse

	# remove temp file
	rm "$tempFile"
}


main
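
Both scripts call git ls-tree once per commit, so the run time grows with the length of the history. A single-pass alternative is sketched below, assuming a Git recent enough to support `%(rest)` in `git cat-file --batch-check`. It is a different technique from the scripts above, the `listBlobsBySize` name is hypothetical, and the output columns differ (type, SHA-1, size, path). Each object is reported once, with only the first path under which it was found.

```shell
#!/bin/bash

# Hypothetical alternative: list every blob reachable from any ref, largest first.
# Output columns: "blob <sha1> <size in bytes> <path>".
function listBlobsBySize {
	# rev-list emits "<sha1> <path>" per object; cat-file looks up type and size
	git rev-list --objects --all |
		git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
		awk '$1 == "blob"' |
		sort -k3,3nr
}
```

Run from inside a repository, for example `listBlobsBySize | head -n 20` to see the twenty largest blobs.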
@tombohub

I waited more than 5 minutes to see any result, and the results only came when I pressed a key.
If I hadn't pressed a key I would have waited for eternity.

WSL1

@Jan-Bruun-Andersen

I probably went a bit overboard, but here is my version:

https://github.com/Jan-Bruun-Andersen/git-ls-blobs

@kraduk

kraduk commented Feb 22, 2022

Removing the temp file is so suboptimal: you have to wait ages for the script to run, it lists the biggest objects first, and then the output scrolls off the terminal without enough scrollback to see the complete list 8(

#!/bin/bash -e

function main {
	local tempFile=$(mktemp)

	# work over each commit and append all files in tree to $tempFile
	local IFS=$'\n'
	local commitSHA1
	for commitSHA1 in $(git rev-list --all); do
		git ls-tree -r --long "$commitSHA1" >>"$tempFile"
	done

	# sort files by SHA1, de-dupe list and finally re-sort by filesize
	sort --key 3 "$tempFile" | \
		uniq | \
		sort --key 4 --numeric-sort --reverse

	# remove temp file
	#rm "$tempFile"
}


main

@samzmann

Thanks for this!
gitlistobjectbysize.sh works well, though kinda slow.

Then, the git filter-branch ... command is extremely slow: 30+ minutes expected to remove one large file from the history of a relatively small repo (~1000 commits).

I found https://github.com/newren/git-filter-repo (which git filter-branch itself recommends when you run it).
It provides a super fast and in-depth way to analyze the repo:

git filter-repo --analyze

And then provides all kinds of ways to filter/clean a repo, all super fast.

I can recommend!
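
For reference, a sketch of how the large file from the example might be removed with git filter-repo, assuming it is installed and on the PATH. The `removeWithFilterRepo` wrapper name is hypothetical; `--invert-paths` together with `--path` asks filter-repo to keep everything except the named path. Extra arguments are passed through, since filter-repo refuses to run outside a fresh clone unless given `--force`.

```shell
#!/bin/bash

# Hypothetical wrapper around git filter-repo (must be installed separately).
# Removes the given path from all of history; any further arguments,
# such as --force, are passed straight through to git filter-repo.
function removeWithFilterRepo {
	local targetPath="${1:?usage: removeWithFilterRepo <path> [filter-repo options]}"

	git filter-repo --invert-paths --path "$targetPath" "${@:2}"
}
```

For example, in a fresh clone: `removeWithFilterRepo something/really/large.iso`.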

@pauljohn32

I have git lfs holding a lot of large files. I want to find the large files that are only in local ".git" folder or history. Can you discuss this question?

Example issues:

  1. A user added a file as an ordinary Git file. Later I changed it to LFS. Does the original commit stay somewhere in the ".git/" folder? I guess yes.
  2. Do you know of a way to scan for everything that is correctly in LFS on the current master branch, then check for those same named files in ".git" history and delete all versions of them?
  3. I delete a file from the master branch, but copies of it are still sitting about in history. Does each version show up as a separate file when we do this kind of search?

@magnetikonline
Author

magnetikonline commented Nov 30, 2022

@pauljohn32 can probably only answer the first with any confidence - haven't really played with LFS (yet).

User put in a file as ordinary git file. Then later I changed it to lfs. Does the original commit stay somewhere in the ".git/" folder. I guess yes.

yes it does 👍

I delete a file from the master branch, but copies of it are still sitting about in history. Does each version show up as a separate file when we do this kind of search?

each modified version of the same object/path will - yes.
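
Because every modified version of a path appears as its own line (README.md and .gitignore each appear twice in the example report), it can help to total the report per path. A minimal sketch, assuming the `git ls-tree -r --long` layout with a tab before the path; the `summarizeByPath` name and the output columns (total bytes, version count, path) are illustrative choices.

```shell
#!/bin/bash

# Hypothetical helper: reads `git ls-tree -r --long` style lines on stdin and
# prints "<total bytes> TAB <number of versions> TAB <path>", largest total first.
function summarizeByPath {
	awk -F '\t' '
		{
			# text before the tab is "<mode> <type> <sha1> <size>"
			split($1, meta, " ")
			if (meta[2] != "blob") next
			versions[$2]++
			bytes[$2] += meta[4]
		}
		END {
			for (path in versions) {
				printf "%.0f\t%d\t%s\n", bytes[path], versions[path], path
			}
		}
	' | sort -t $'\t' -k1,1nr
}
```

Possible usage: `./gitlistobjectbysize.sh | summarizeByPath`.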

@zhaopan

zhaopan commented Mar 1, 2023

+10086 👍

@Stonks3141

A nice one-line version that doesn't use a tempfile or GNU-specific flags and puts the largest files at the bottom:

git rev-list --all | xargs -n1 git ls-tree --long -r | sort -k3 | uniq | sort -nk4
