Google Cloud Storage ☁ & git-annex

git-annex is a great way to manage large binary files without making your git repository sluggish, while keeping the initial and total download size to a minimum.

The large files are stored in 'special remotes', of which many kinds are supported. Many of these remotes, however, require authentication or are rather involved to set up. The nicest workflow (for users, not necessarily contributors) is when files can be downloaded from a simple URL with no plugins beyond git-annex itself. Google Cloud Storage (or Amazon S3) is a good option for this.

Setup

These instructions are complementary to the git-annex documentation page on the topic.

  1. Get credentials for your Google Cloud Storage project by going to the project settings page and creating an access key under the interoperability tab.
  2. Enter the key and secret into your environment (e.g. Git Bash on Windows), as shown below:
export AWS_ACCESS_KEY_ID="YOUR-KEY"
export AWS_SECRET_ACCESS_KEY="YOUR-SECRET"
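Since these are ordinary environment variables, one simple way to check that both are visible in the current shell (note that this prints the secret to your terminal) is:

env | grep AWS_

If nothing shows up, the initremote step below will likely fail for lack of credentials.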

Creating a new bucket for the annex and uploading files ↗

In our git repository, we run

git annex initremote <remote-name> type=S3 encryption=none host="storage.googleapis.com"  autoenable=true bucket=<a-unique-bucket-name> datacenter=<datacenter-location-name> publicurl="https://storage.googleapis.com/<a-unique-bucket-name>" public=yes 

Note that:

  • <remote-name> can be any name you like; it is how you will refer to the special remote in later git-annex commands.
  • <a-unique-bucket-name> is the name of the bucket as it will appear in your project. Bucket names must be globally unique, so I would lean towards longer names; git-annex will let you know if the name is unavailable.
  • <datacenter-location-name> is the location of the datacenter as listed in Google Cloud Storage (e.g. europe-west1). You can find the list of possible values when creating a bucket through the Google Cloud web interface.
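Putting this together with concrete (made-up) values, an invocation might look like:

git annex initremote gcloud type=S3 encryption=none host="storage.googleapis.com" autoenable=true bucket=my-project-annex-data datacenter=europe-west1 publicurl="https://storage.googleapis.com/my-project-annex-data" public=yes

Here gcloud and my-project-annex-data are just example names; substitute your own.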

If everything went well, you should now see the remote when calling git-annex list. You should also see that a new bucket has appeared in Google Cloud Storage.

At this point the bucket is still private, so we need to address that. In the web interface, go to the bucket's permissions and (1) switch the access type to "uniform", then (2) grant access to a new principal named allUsers with the role Storage Object Viewer. Anyone with the URL should now have read access to the bucket and its contents.
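If you prefer the command line and have the gsutil tool installed, the same two changes should be achievable with something along these lines (sketched from memory of the gsutil docs, so double-check before relying on them):

gsutil uniformbucketlevelaccess set on gs://<a-unique-bucket-name>
gsutil iam ch allUsers:objectViewer gs://<a-unique-bucket-name>

The first switches the bucket to uniform bucket-level access; the second grants everyone read access to the objects.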

Now we can actually upload the annexed files to Google Cloud Storage with

git annex copy --to <remote-name>
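To confirm the content really made it into the bucket, git annex whereis lists which remotes hold a copy of each file, for example:

git annex whereis <some-annexed-file>

The output should list both your local repository and <remote-name>.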

Finally, after adding files and setting up the special remote, don't forget to also commit and push the changes to your regular git remote; otherwise people pulling the project will not 'know' about the special remote yet.
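Assuming your regular remote is called origin and your branch is main (both placeholders), that could look like:

git commit -am "add annexed files and gcloud special remote"
git push origin main git-annex

Pushing the git-annex branch explicitly ensures the special remote configuration (stored in remote.log on that branch) reaches other users.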

Note

You can find information about remotes before cloning by looking at the remote.log file in the git-annex branch.
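For example, once you have fetched the repository you can print that file without checking out the branch:

git show git-annex:remote.log

(To inspect it before cloning, you would instead browse the git-annex branch on the hosting site.)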

Cloning the remote for users ↘

For users, the process for cloning and getting the annexed files is straightforward. You would clone the repository as normal:

git clone <url>

This pulls the repository, the symlinks, and the git-annex branch. To actually fetch the file contents, you simply run

git annex get .

And the files should be downloaded and ready for use. On Windows (without WSL), files will be automatically unlocked.
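Putting the whole user-side workflow together (the URL and directory name are placeholders):

git clone <url> my-project
cd my-project
git annex get .

Alternatively, git annex get <path/to/one-file> fetches only an individual file or subdirectory instead of everything at once.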
