@jcfr
Last active March 28, 2018 03:41
Shell script listing the N largest files found in the history of a Git-versioned project
#!/bin/bash
set -eo pipefail
#
# This script will list the N largest files found in the history of a Git project.
#
# References:
# * https://docs.acquia.com/article/removing-large-files-git-without-losing-history
# * https://stackoverflow.com/questions/10622179/how-to-find-identify-large-files-commits-in-git-history
# * https://git-scm.com/book/en/v2/Git-Internals-Packfiles
#
PROG=$(basename "$0")
#-------------------------------------------------------------------------------
err() { echo -e >&2 "ERROR: $*\n"; }
die() { err "$@"; exit 1; }
help() {
cat >&2 <<ENDHELP
Usage: $PROG [options] N
List the N largest files found in the history of a Git project.
Options:
-h, --human-readable print human-readable sizes (e.g., 1.0KiB 234MiB 2.0GiB)
-t, --table pretty-print results as a table instead of comma-separated lines
ENDHELP
}
#-------------------------------------------------------------------------------
if [ ! -d .git ]; then
die "Execute the script at the root of a git-versioned project"
fi
table=0
human_readable=0
while [[ $# != 0 ]]; do
case $1 in
--human-readable|-h)
human_readable=1
shift 1
;;
--table|-t)
table=1
shift 1
;;
-*)
err Unknown option \"$1\"
help
exit 1
;;
*)
break
;;
esac
done
N=$1
if [ "$N" == "" ]; then
err Missing N option
help
exit 1
fi
format_size(){
if [ $human_readable == 1 ]; then
echo $1 | numfmt --to=iec-i --suffix=B --padding=12
else
echo $1
fi
}
display_line(){
if [ $table == 1 ]; then
printf "%12s %12s %-40s %s\n" "$@"
else
printf "%s,%s,%s,%s\n" "$@"
fi
}
all_objects=$(git rev-list --all --objects)
display_line "size" "pack_size" "sha" "location"
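# Keep only blobs (file contents), sort on the uncompressed size (field 3), and keep the first N lines.
# awk is used instead of head so that "sort" is never killed by SIGPIPE, which "set -o pipefail" would report as a failure.
git verify-pack -v .git/objects/pack/pack-*.idx | awk '$2 == "blob"' | sort -k3,3nr | awk -v n="$N" 'NR <= n' | while read -r line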
do
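# verify-pack -v fields: <sha> <type> <size> <size-in-pack> <offset> ...
sha=$(echo $line | cut -f1 -d" ")
size=$(format_size $(echo $line | cut -f3 -d" "))
compressed_size=$(format_size $(echo $line | cut -f4 -d" "))
# Find the first "<sha> <path>" entry for this object; "|| true" keeps "set -e" from aborting when nothing matches.
sha_and_location=$(echo "${all_objects}" | grep -m 1 "^$sha" || true)
# Everything after the first space is the path, which may itself contain spaces.
location=$(echo "${sha_and_location}" | cut -d" " -f2-)
display_line "$size" "$compressed_size" "$sha" "$location"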
done

jcfr commented Mar 27, 2018

Example of output:

$ git clone git://github.com/kitware/VTK
$ cd VTK
$ git_list_largest_file_from_history.sh -h -t 20
        size    pack_size sha                                      location
      6.9MiB       1.8MiB fe3023115a6155fbac7f7b216ba868e7874e01d6 ThirdParty/sqlite/vtksqlite/sqlite3.c
      6.6MiB       509KiB 033424897aa75ff484601bcfc3798d419ef11799 ThirdParty/mpi4py/vtkmpi4py/src/mpi4py.MPI.c
      6.2MiB       508KiB 72968114e1e66ac70fe37d77720b0ecf51ef562d ThirdParty/mpi4py/vtkmpi4py/src/mpi4py.MPI.c
      4.4MiB       342KiB c2820aac0375282a8e201b21ef88d99f34db6eb5 ThirdParty/mpi4py/vtkmpi4py/src/mpi4py.MPI.c
      3.8MiB       1.3MiB 7e6fcacfc123786e19250466368257d3810b4c93 ThirdParty/sqlite/vtksqlite/vtk_sqlite3.c
      3.8MiB       996KiB df739472a0225a11c3949776ef803bf5560d0e92 Utilities/vtksqlite/vtk_sqlite3.c
      2.5MiB       2.3MiB a7718104f7feca3386ff22f44342f4af8c1e6b3b Wrapping/Java/FastInfoset.jar
      1.9MiB        60KiB d851eae85c2e250783edb8ac4c598a3a13f6fad7 graphics/targets.make
      1.6MiB        90KiB d3a380f9d80e87331da53c24562a3ea5ccf07956 ThirdParty/netcdf/vtknetcdf/libdispatch/utf8proc_data.c
      1.4MiB        85KiB d5b7e2bd7352d4ee9820f1540ab14e5d5860ccab ThirdParty/netcdf/vtknetcdf/libdispatch/u8.c
      1.3MiB       1.3MiB af5891eb1dd2753af9fd2eb98402694cd9adcc98 Utilities/verdict/docs/VerdictUserManual2007/png/tri4qualVR-bq2.png
      1.3MiB        80KiB 4bc4fcedfa49268673ee9f3faa098b286a5d62df ThirdParty/xdmf2/vtkxdmf2/libsrc/XdmfPython.cxx
      1.2MiB        79KiB bf289531b23772bee901d9d4575b1615cdc5228d ThirdParty/xdmf2/vtkxdmf2/libsrc/XdmfPythonNoMpi.cxx
      1.2MiB        76KiB 285ce354d8f5ac3703b124ba81ee0d919af48c8a Utilities/vtknetcdf/libsrc/utf8proc_data.h
      1.2MiB        76KiB f3aa244beae4182b10e8519943707163e8b47491 Utilities/vtknetcdf/utf8proc_data.h
      1.1MiB        41KiB 37d8702d5b4e76bd0d8cd6a8299429f76d26535f examplesTcl/RTest.pdf
      1.1MiB        63KiB 8c7e79b034c71de56f8c98b7a03c23850b4402dc ThirdParty/eigen/vtkeigen/eigen/src/misc/lapacke.h
      967KiB       223KiB 4ac6b39f6e5794d52080ec10a4aaa1696958480b Utilities/vtklibxml2/xmlschemas.c
      967KiB       165KiB ae0d4063ff3ed7a278168a47998f90e315ed0502 Utilities/vtklibxml2/xmlschemas.c
      958KiB       168KiB 840d41ac4365a855179d84849939a8e824e6afce ThirdParty/glew/vtkglew/include/GL/glew.h
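
To make the two size columns concrete: `size` is the third field of a `git verify-pack -v` line (the uncompressed object size) and `pack_size` is the fourth (the space the object takes inside the packfile). Below is a minimal sketch of that mapping, assuming GNU `numfmt` is available. The helper names `human` and `columns_for` and the sample line are hypothetical and made up for illustration; they do not come from the script or from VTK.

```bash
#!/bin/bash
# Sketch: map one "git verify-pack -v" line to the size and pack_size columns.
# The sample line at the bottom is invented for illustration only.

# Convert a byte count to an IEC string such as 1.5MiB (requires GNU numfmt).
human() {
  numfmt --to=iec-i --suffix=B "$1"
}

# verify-pack -v prints: <sha> <type> <size> <size-in-pack> <offset> [<depth> <base-sha>]
columns_for() {
  local sha type size pack_size rest
  # read splits on runs of whitespace, so the padding in the input is harmless.
  read -r sha type size pack_size rest <<<"$1"
  printf '%s %s %s\n' "$(human "$size")" "$(human "$pack_size")" "$sha"
}

# Hypothetical sample line (not real repository data).
sample="0123456789abcdef0123456789abcdef01234567 blob   1572864 1024 12"
columns_for "$sample"
```

A large gap between the two columns, as for the generated `.c` files above, simply means the content compresses or deltifies well; already-compressed files such as `.jar` or `.png` show nearly equal values.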

The output below will most likely change once the Slicer project moves away from git-svn and uses only Git, since the history will then be trimmed and filtered.

$ git clone git://github.com/Slicer/Slicer
$ cd Slicer
$ git_list_largest_file_from_history.sh -h -t 20
        size    pack_size sha                                      location
       55MiB        55MiB 2cadcca7dd65c008c03ca0786ed3b02ad86cc900 Modules/Meshing/Testing/Data/CA05042124RFinal.img.gz
       22MiB       4.9MiB 56760af8680e9ff7e857a97b384fb9c0b974ea11 Modules/Meshing/Testing/Data/lumbar_smoothed_04075.stl
       15MiB       4.6MiB 0b64a9fbd9a0b02f65c726da71b7128013b94e3a Modules/CLI/RigidRegistration/Data/Baseline/RigidRegistrationTest02.nrrd
       13MiB        13MiB d8ffda6d56ebd27758e54d3550dd64ff362a9589 Applications/CLI/BRAINSTools/BRAINSCommonLib/TestData/OutDefField_orientedImage.nii.gz
       13MiB       7.0MiB f56677b93dfcf31975dd86e2b4cff954795467e5 Modules/CLI/ModelToLabelMap/Data/Input/OAS10001.img
       12MiB        11MiB 82101881013d74236903d672b5e85b138cd96acd Modules/TumorGrowth/Testing/Script-Results/TG_Deformable_Deformation_Inverse.mha
       12MiB        11MiB 3d8a4983f4df3b7c5f77f6d945ce415d0495c1c7 Modules/TumorGrowth/Testing/GUI-Results/TG_Deformable_Deformation_Inverse.mha
       11MiB        11MiB 972e9ff3645fe8c78c5f4e5559cda7f8ce0cd3e5 Modules/CLI/HessianRecursiveGaussianImage/Data/Baseline/HessianRecursiveGaussianImageTest.raw.gz
      8.8MiB       8.4MiB 3668a53e33be68022d7f5adfc94241ed948d62a8 Modules/CLI/Hessian3DToVesselnessMeasureImage/Data/Input/CTHeadAxialTensor.raw
      8.6MiB       8.6MiB f6d1d802bc05d3d8b0334c71d67931892bcead10 QTModules/EMSegment/Tasks/MRI-Human-Brain-Full-Parcellation/atlas_t1.nrrd
      8.4MiB       2.5MiB ed4e982d4d983f6bbaef6a6dfdd3d38fe99a03ee Modules/Meshing/Testing/Data/1P-SC05030303R-ANNFinal.stl
      7.5MiB       7.5MiB faf64d0af9b5913e86cd9a87af08f2b6faf3fa3b Applications/GUI/Testing/TestData/volone.nrrd
      6.6MiB       6.6MiB 7583599db5b930cc0c2969a2a0812024952782cc Modules/CLI/RobustStatisticsSegmenter/Data/Input/grayscale.nrrd
      6.4MiB       6.4MiB 7d2ff2f27a002558b1014637ceb0dd3bf1539cfe Applications/GUI/Testing/TestData/voltwo.nrrd
      6.2MiB        55KiB 0c5fcd649db216c2136c243be1424c0d4cd85f79 Testing/Data/Baseline/CLI/OAS10001.mha
      6.1MiB       4.3MiB 0c16e331ffdb05e946aff7dea782a8c603db42e7 Testing/Data/Baseline/CLI/SparseFieldLevelSetContourTest.vtp
      6.1MiB       6.1MiB 4adb89cf0e913ba35d4103f2b1b1d6d48c0dfc8e Modules/ChangeTracker/Testing/scan2.raw.gz
      5.9MiB       1.6MiB 77e219b1ffd96be9d26c7e2126a6bdf8002d37c5 Modules/Meshing/Testing/Data/1P-SC05030303R-Scanner.stl
      5.9MiB       5.9MiB 39b8d4f4dc1cb7a2d9ff38e3868feb2e28ce8b89 Testing/Data/Input/MRMeningioma1.nrrd
      5.6MiB       5.6MiB 40cab1f2b230a6d52d9db3bfa6e9bc82cf08ecca Modules/TumorGrowth/Testing/GUI-Results/TG_Deformable_Deformation.mha
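
The script reads only `.git/objects/pack/pack-*.idx`, so objects that are still stored loose are not reported, and each reported object triggers a `grep` over the full object list. A commonly used alternative is to ask `git cat-file --batch-check` for the size of every object listed by `git rev-list --objects --all`. The following is a rough sketch of that approach, not a drop-in replacement: the function name `largest_blobs` is hypothetical, and it reports only the uncompressed size in bytes (there is no `pack_size` column).

```bash
#!/bin/bash
# Sketch: list the N largest blobs reachable from any ref using cat-file.
# Output lines have the form: <size-in-bytes> <sha> <path>

largest_blobs() {
  local n=$1
  # rev-list prints "<sha> <path>"; %(rest) echoes back whatever follows the sha.
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    awk '$1 == "blob"' |
    sort -k2,2nr |
    awk -v n="$n" 'NR <= n { sub(/^blob /, ""); print }'
}
```

Running `largest_blobs 20` at the root of a clone should list the same files as the tables above when everything is packed, although this has not been checked against VTK or Slicer.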
