Why Git Media

Git does not deal well with large numbers of huge binary files, for a few reasons. Adding and committing them uses a lot of memory, transferring them is pretty inefficient, and every clone is basically forced to download all of them.

I thought sparse and narrow checkouts would be the answer, but it looks like it may be too difficult to rewrite other tools to cope with blobs that are missing until they are fetched, and transferring them would still take server resources.

Perhaps a better way would be to hack the Git client slightly so that binary files over a certain size are stored via a protocol better suited to them, like SFTP or HTTP to S3, and fetched only when necessary. Instead of the whole blob, Git would keep just a small blob that contains a pointer: the URL and SHA of the original content, or possibly only the SHA, since we could probably depend on the URL being guessable.
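A minimal sketch of what such a pointer could look like, assuming the pointer body is just the hex SHA-1 of the raw file content and the URL is the asset server's base URL plus that SHA. The function names and the URL layout are hypothetical, and the hash here covers the raw bytes only, not Git's own "blob" object header.

```python
import hashlib


def pointer_for(content: bytes) -> bytes:
    """Return a pointer blob body: the hex SHA-1 of the original content."""
    # Assumption: hash the raw bytes, without Git's object header.
    return hashlib.sha1(content).hexdigest().encode("ascii") + b"\n"


def guess_url(base_url: str, pointer: bytes) -> str:
    """Derive the download URL from the pointer alone (hypothetical layout)."""
    sha = pointer.decode("ascii").strip()
    return f"{base_url.rstrip('/')}/{sha}"
```

With a layout like this nothing but the SHA needs to live in the repository, so moving the asset server only means changing the base URL in the config.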

This is actually fairly similar to how Git submodules work: Git does not keep the whole subdirectory of submodule contents, but instead stores an entry with a different mode and 'type' plus the SHA of the commit to check out. Perhaps saving the pointer in the tree as a 'tag' or even a new type ('media') would do the trick. Perhaps the fifth Git object type should be 'media', for pointers to large media stored outside the repository?

Git Media Extension

For each binary file larger than 10 MB (configurable), do not store the file itself in Git.

Create a blob of a different mode where the content of the blob is the SHA of the binary.
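The size check and pointer creation above could be sketched roughly like this. The "media" kind stands in for the "different mode" (no concrete mode value is proposed here), and `blob_for` and `MAX_SIZE` are hypothetical names.

```python
import hashlib
import os

MAX_SIZE = 10 * 1024 * 1024  # 10 MB default; meant to be configurable


def blob_for(path: str, max_size: int = MAX_SIZE) -> tuple[str, bytes]:
    """Return (kind, body) for the blob that would be stored for this file."""
    if os.path.getsize(path) <= max_size:
        # Small enough: store the content in Git as usual.
        with open(path, "rb") as f:
            return "blob", f.read()
    # Too large: hash in chunks so the file is never fully loaded into
    # memory, and store only the SHA as the blob body.
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return "media", digest.hexdigest().encode("ascii")
```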

Post checkout, find any files of this mode in the source tree and GET them.
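A rough sketch of that post-checkout step. The working tree carries no custom mode, so this sketch assumes a pointer file can be recognised by its content (a bare 40-character hex SHA), which is only a stand-in for the mode check. The `fetch` callable is hypothetical and would wrap the real GET against the asset server.

```python
import os
import re
from typing import Callable

SHA_RE = re.compile(rb"^[0-9a-f]{40}\s*$")


def fetch_media(root: str, fetch: Callable[[str], bytes]) -> list[str]:
    """Replace every pointer file under root with the content it names."""
    replaced = []
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")  # never descend into the repository data
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) > 64:
                continue  # too big to be a bare SHA pointer
            with open(path, "rb") as f:
                body = f.read()
            if SHA_RE.match(body):
                sha = body.strip().decode("ascii")
                with open(path, "wb") as f:
                    f.write(fetch(sha))  # e.g. an HTTP GET keyed by SHA
                replaced.append(path)
    return sorted(replaced)
```

Passing the download function in keeps the scan independent of whether the asset server ends up being S3, SFTP, or something else.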

Pre-commit, scan the index for any media SHAs that have changed, and put the corresponding files in (.git/media).
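The staging half of that pre-commit step might look something like this, assuming files in .git/media are named by their SHA. `stash_media` is a hypothetical helper, and deciding which index entries have changed is left out.

```python
import hashlib
import os
import shutil


def stash_media(path: str, git_dir: str = ".git") -> str:
    """Copy a large file into <git_dir>/media/<sha> and return its SHA."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    sha = digest.hexdigest()

    media_dir = os.path.join(git_dir, "media")
    os.makedirs(media_dir, exist_ok=True)
    target = os.path.join(media_dir, sha)
    if not os.path.exists(target):  # identical content is staged only once
        shutil.copyfile(path, target)
    return sha
```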

Pre-push (or 'git media-sync'), upload all new media files to asset server (S3/SFTP/GH) (and update blobs with new URLs if we can't pre-calculate) and remove files in (.git/media).
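A sketch of the sync step under the same assumption that staged files are named by SHA. The `upload` callable is hypothetical and stands in for whichever S3/SFTP/GH client is used. It is expected to raise on failure, so a file is removed locally only after its upload has succeeded.

```python
import os
from typing import Callable


def sync_media(media_dir: str, upload: Callable[[str, str], None]) -> list[str]:
    """Upload each staged media file, then remove the local copy."""
    synced: list[str] = []
    if not os.path.isdir(media_dir):
        return synced  # nothing has been staged yet
    for sha in sorted(os.listdir(media_dir)):
        path = os.path.join(media_dir, sha)
        upload(sha, path)  # raises on failure, leaving the file in place
        os.remove(path)
        synced.append(sha)
    return synced
```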

  • git media config - set config values for media asset server

    • server
    • username
    • password
    • max-size
  • git media filter - go back in history and replace large blobs with media-pointer blobs

  • git media status - view which media is not yet synced upstream

  • git media sync - upload all new media files

  • Change index operations to compare to SHA in blob, not SHA of blob
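The last item is the subtle one, so here is a small sketch of the difference, assuming the pointer body is the hex SHA-1 of the raw content. Both function names are hypothetical. Hashing the pointer blob itself and comparing that to the working file would report every media file as modified.

```python
import hashlib


def is_modified_naive(pointer_body: bytes, working_content: bytes) -> bool:
    """Compare the SHA *of* the blob: wrong for media pointers."""
    return hashlib.sha1(pointer_body).digest() != hashlib.sha1(working_content).digest()


def is_modified(pointer_body: bytes, working_content: bytes) -> bool:
    """Compare the SHA stored *in* the blob against the working file."""
    expected = pointer_body.strip().decode("ascii")
    actual = hashlib.sha1(working_content).hexdigest()
    return actual != expected
```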
