@amano-takahisa
Created December 28, 2022 16:09

How to keep repositories small

Avoid committing unnecessary files by mistake

Once inappropriate files have been committed and pushed to a Git server, it is difficult to remove them completely from the history. It is therefore important to first set up a development environment in which such files cannot be committed by accident.

Inappropriate files include the following.

  • Crazy big files
  • Secret files (Passwords, Credentials & other Private data etc.)
  • Auto generated files (cache, build images, log files etc.)
  • Backup copies

While Git is well suited to sharing documentation and code, it is not suited to managing generated or large binary files. The following should therefore not be committed either.

  • Generated files (compiled files, caches, logs, output files from scripts)
  • Large binary files (images, models, zip files, PDF files etc.)

Set up gitignore

A .gitignore file keeps inappropriate files from being added to commits by accident.

Files that every user of the project should ignore, e.g. __pycache__/, *.egg, build/ etc., are listed in the .gitignore file located in the root directory of the repository. Typical .gitignore files for each language are available at https://github.com/github/gitignore.

Files generated by a user's own development environment can be listed in $HOME/.config/git/ignore. These include files generated by the OS or by editors, such as .DS_Store, Thumbs.db, .idea/ etc.
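
To confirm that a pattern actually applies, git check-ignore -v reports which ignore file and which line match a given path. The following is a minimal sketch in a throwaway repository; the file names pkg.egg and main.py are made up for illustration.

```shell
# Throwaway repository for the demonstration (hypothetical file names)
repo=$(mktemp -d)
cd "$repo"
git init -q

# Project-wide ignore rules, shared through the repository
printf '%s\n' '__pycache__/' '*.egg' 'build/' > .gitignore
touch pkg.egg main.py

# -v prints "<ignore file>:<line>:<pattern><TAB><path>" for an ignored path
matched=$(git check-ignore -v pkg.egg)
echo "$matched"

# A path that no rule matches prints nothing and returns a non-zero status
if ! git check-ignore -q main.py; then
    echo "main.py is not ignored"
fi
```

The same command also reports matches that come from $HOME/.config/git/ignore, which helps when it is unclear whether a rule lives in the project or in the personal ignore file.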

Prefer git add -u over git add .

git add . adds every file under the current working directory, which makes it easy to include unintended files in your commits. On the other hand, git add -u stages only changes to files that Git already tracks. New files then have to be added explicitly with git add <path>.
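
The difference can be seen in a throwaway repository. This is only a sketch; the file names and the demo identity passed to git commit are placeholders.

```shell
# Throwaway repository with one tracked file (hypothetical names)
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "v1" > tracked.txt
git add tracked.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

# Modify the tracked file and create a new, untracked one
echo "v2" > tracked.txt
echo "debug output" > scratch.log

# Stage only changes to files Git already knows about
git add -u

# Names of the staged files: scratch.log is not among them
staged=$(git diff --cached --name-only)
echo "$staged"

# scratch.log still shows up as untracked (??) unless an ignore rule hides it
git status --short
```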

Set git config --global commit.verbose true and avoid git commit -m

git commit -m <commit message> is a quick way to make a commit from staged files, but I recommend not using it. Instead, always commit with git commit -v. This opens an editor for the commit message, followed by a diff showing exactly what will be committed.

This is good practice not only for keeping repositories small but also for writing meaningful commit messages.

The following command makes -v the default for git commit.

git config --global commit.verbose true

Alternatively, add the following lines to your $HOME/.config/git/config:

[commit]
    verbose = true

Install pre-commit

pre-commit is a tool that checks, and optionally modifies, staged files before you commit them. Install it on your system with pip install pre-commit, then enable it in your local repository by running pre-commit install inside the repository.

There is a pre-commit hook that checks file sizes before a commit. Add the following to .pre-commit-config.yaml.

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.3.0
    hooks:
      - id: check-added-large-files

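With the default settings, this hook blocks newly added files larger than 500 kB from being committed. The limit can be changed with the hook's --maxkb argument.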

https://github.com/pre-commit/pre-commit-hooks#check-added-large-files
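
For a quick manual check without pre-commit, something like the following can list candidate files in the working tree. This is a rough sketch and not how the hook itself is implemented: list_large_files is a hypothetical helper name, and find counts sizes in 1 KiB blocks, so the threshold only approximates the hook's limit.

```shell
# List regular files above a size threshold (in KiB), skipping Git's own data.
# Usage: list_large_files [directory] [threshold_kib]
# Example: list_large_files . 500
list_large_files() {
    find "${1:-.}" -type f -size +"${2:-500}"k ! -path '*/.git/*'
}
```

Unlike the hook, this looks at every file on disk, including ignored ones, so it is better suited to a one-off inspection than to enforcement.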

Use git mv when renaming files

In existing repositories I have seen file renames split into two commits: one that adds the new file (git add) and one that removes the old file (git rm). When both operations are in the same commit, Git tries to detect which file was renamed to which, but once they are split it can no longer do so. To avoid this, use git mv, which renames the file and stages both changes in a single step.
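
A minimal sketch of a rename done with git mv, using a throwaway repository and made-up file names:

```shell
# Throwaway repository with one committed file (hypothetical names)
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "some content" > old_name.txt
git add old_name.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m "add file"

# Rename the file and stage both the removal and the addition in one step
git mv old_name.txt new_name.txt

# -M asks the diff to pair the removed and added paths as a rename (status R)
rename_status=$(git diff --cached -M --name-status)
echo "$rename_status"
```

Because both halves of the rename are staged together, the next commit contains them together and the rename stays visible in the history.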

Reducing the size of bloated repositories

It is difficult, but not impossible, to reduce the size of a repository once it has become huge.

Finding files that are too big

There are some tools that can figure out what makes repositories large.
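
Before looking at individual files, it can help to know how much space the object database takes overall. A small sketch using git count-objects; the helper name repo_object_sizes is made up for illustration.

```shell
# Print the size of loose objects ("size") and packed objects ("size-pack").
# Usage: repo_object_sizes [path-to-repository]
# Example: repo_object_sizes .
repo_object_sizes() {
    git -C "${1:-.}" count-objects -vH | grep -E '^(size|size-pack):'
}
```

Comparing these numbers before and after a cleanup gives a rough idea of whether the cleanup had any effect.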

List the largest blobs with a one-liner

Stack Overflow user "raphinesse" shares a useful one-liner that lists the n largest blobs in the history: https://stackoverflow.com/a/46085465/9326457

$ git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --key=2 \
| tail -n 10 \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
4c92e037f578   52MiB Internal/2021/trimming/results_old/bhattacharyya_clusters.csv
06cb35b5444f   60MiB tags
de5af7a5b159   60MiB tags
8c1ec9964ba7   60MiB tags
a85b8db7207f   60MiB tags
2e008590be9b   60MiB tags
885c5765713d   66MiB Internal/2021/raster_correlation/max_albedo_b04.tif
d8eb49ec4ad0   67MiB Internal/2021/raster_correlation/max_albedo_b08.tif
551967325fc0   76MiB Internal/2021/optram/optram_model.zip
d4878b0325be   81MiB Internal/2021/optram/optram_model.zip

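The command above also prints the abbreviated hash of each object. Passing a hash to the following command shows the history of that object. On recent Git versions, where git whatchanged is deprecated, git log --raw --all --find-object=<hash> should give equivalent output. For example: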

$ git whatchanged --all --find-object=d8eb49ec4ad0
commit 6ff69d1fac57c084bc55c833c4318b7b0f66ce00
Author: Michael Hornacek <56019650+m-hornacek@users.noreply.github.com>
Date:   2021-06-08 09:22:34 +0200

    Delete max_albedo_b08.tif

:100644 000000 d8eb49ec 00000000 D      Internal/2021/raster_correlation/max_albedo_b08.tif

commit ba4463730b55713cacbda5317f668479b037b13f
Author: Michael Hornacek <michael.hornacek@mantle-labs.com>
Date:   2021-06-02 09:45:09 +0200

    example for computing correlation between a pair of rasters

:000000 100644 00000000 d8eb49ec A      Internal/2021/raster_correlation/max_albedo_b08.tif

The output above shows that the file Internal/2021/raster_correlation/max_albedo_b08.tif was added to the repository on 2021-06-02 by Michael Hornacek and deleted on 2021-06-08. This is a good example of how deleting a file from the work tree does not delete its data from the repository: because Git records every version of every file, the data stays accessible in the history and keeps bloating the repository.

git-sizer

git-sizer is a tool that analyzes what makes a repository large.

It checks not only files but also references, tags, gigantic trees, and other things that degrade Git performance.

git-sizer produces a report like the following.

$ ./git-sizer --verbose
Processing blobs: 12744
Processing trees: 21960
Processing commits: 4695
Matching commits to trees: 4695
Processing annotated tags: 0
Processing references: 11
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Commits                    |           |                                |
|   * Count                    |  4.70 k   |                                |
|   * Total size               |  2.77 MiB |                                |
| * Trees                      |           |                                |
|   * Count                    |  22.0 k   |                                |
|   * Total size               |  6.49 MiB |                                |
|   * Total tree entries       |   167 k   |                                |
| * Blobs                      |           |                                |
|   * Count                    |  12.7 k   |                                |
|   * Total size               |  3.06 GiB |                                |
| * Annotated tags             |           |                                |
|   * Count                    |     0     |                                |
| * References                 |           |                                |
|   * Count                    |    11     |                                |
|     * Branches               |     1     |                                |
|     * Remote-tracking refs   |    10     |                                |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size         [1] |   926 B   |                                |
|   * Maximum parents      [2] |     2     |                                |
| * Trees                      |           |                                |
|   * Maximum entries      [3] |  1.56 k   | *                              |
| * Blobs                      |           |                                |
|   * Maximum size         [4] |  80.6 MiB | ********                       |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum history depth      |  4.31 k   |                                |
| * Maximum tag depth          |     0     |                                |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [5] |   573     |                                |
| * Maximum path depth     [6] |     9     |                                |
| * Maximum path length    [6] |   242 B   | **                             |
| * Number of files        [5] |  7.94 k   |                                |
| * Total size of files    [5] |  1.02 GiB | *                              |
| * Number of symlinks     [7] |     1     |                                |
| * Number of submodules   [8] |     1     |                                |

[1]  f77633e3db5f74113935164fdd89f55d520a2f0e
[2]  0a637315079aa662575c3e69356b8598a4d2300f
[3]  a3c5636bf52ad5b373329c63dcf27b1b79b60336 (refs/heads/master:Internal/2020/soil_moisture/code/messages/s1_zonal)
[4]  d4878b0325bea0f20ee8f2f71b1c9f4903b2f873 (c50d74a1e9fa2dbf3edb607a79a4aa28048347bd:Internal/2021/optram/optram_model.zip)
[5]  aec5f7cc086a8f07c48fb472a6b2b9e437bc3841 (a13cfa5c4e54338b2a9d35a480252af905022a24^{tree})
[6]  dd5dfdcd1de17ad1035e79e59310d83ce3c726b6 (refs/heads/master^{tree})
[7]  0f665a4c4b15c846e66a01ba494db27e1f0d9c3a (4953e74df5997b305a90879b90043ccd63c48262:Internal/2021/hadi/requests_from_others/time_series_analysis/notebooks/example/data)
[8]  83498071b9ae8a090eec3824cbc8bb87e17a9520 (refs/remotes/origin/soc_modeling#35:Internal/2021/hadi/requests_from_others/earthengine_repos_cloned)

Besides usage instructions, the git-sizer website also describes many practices for keeping a repository healthy.

Removing files from a repository's history

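If you added an unexpected file in the most recent commit and have NOT pushed it yet, it is easy to fix. The following removes the file from the index (keeping it on disk) and amends the commit, reusing its message.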

git rm --cached GIANT_FILE
git commit --amend -CHEAD

Removing files from commits that have already been pushed means rewriting the commit history and publishing it with git push --force, which affects every user who has already cloned the repository. Therefore, before you proceed, notify all users in advance that you will be changing the history and ask them to clean up their local working trees. Otherwise, they may face conflicts or lose uncommitted files they are working on.

Follow the specific procedure for deleting files described here.

https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository#purging-a-file-from-your-repositorys-history
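
As far as I know, the guide linked above is built around the separate git-filter-repo tool. The core step looks roughly like the sketch below; purge_path_from_history is a hypothetical wrapper name, and the linked page remains the reference for the complete procedure.

```shell
# Sketch only: this rewrites ALL history, so run it in a fresh clone.
# Requires git-filter-repo to be installed separately.
# Usage: purge_path_from_history GIANT_FILE
purge_path_from_history() {
    # --path selects the file; --invert-paths keeps everything except it
    git filter-repo --invert-paths --path "$1"
}

# git filter-repo removes the "origin" remote as a safety measure, so the
# remote has to be added again before the rewritten history is force-pushed.
```

Nothing changes on the server until the force push, so the rewritten clone can be inspected first, for example with the one-liner or git-sizer from the previous sections.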

@amano-takahisa
Author

I uploaded this to my gist since we don't have proper locations to store such technical notes.

@EllenB

EllenB commented Jan 4, 2023

Hi Taka @amano-takahisa

Thanks again for this.

I was again reading through it this morning.

I was thinking that in the .gitignore of mantle-projects we could perhaps add something like:

*.zip
*.tif
*.csv
etc.

I have done this in a repo that I keep outside mantle-projects, because it is where I try out new things and can mess up without affecting others. Here is the .gitignore of that project: https://github.com/mantlelabs/Yield_Prediction/blob/main/.gitignore

I need to scrub it a bit and probably add some Python stuff using the "Python" gitignore of GH itself that has a long list of things.

Thanks.
