git-annex is a great way to manage large binary files without making your git sluggish due to the large files, and at the same time minimising the initial and total download size.
The large files are stored in 'special remotes', of which many kinds are supported. Many of these remotes, however, require some authentication or are rather involved to set up. The nicest workflow (for users, not contributors per se) is one where files can be downloaded from a simple URL with no further plugins beyond git-annex itself. Google Cloud Storage (or Amazon S3) is a good option for this.
This is complementary to the instructions on the git-annex page on the topic.
- Get credentials for your Google Cloud Storage project by going to the project settings page and creating an access key under the "interoperability" tab.
- Enter the key and secret into your environment (e.g. git bash on Windows):
export AWS_ACCESS_KEY_ID="YOUR-KEY"
export AWS_SECRET_ACCESS_KEY="YOUR-SECRET"
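git-annex reads these credentials from the environment, so a quick sanity check that both variables are actually exported can save a confusing failure later. The values below are made-up placeholders, not real credentials:

```shell
# Hypothetical placeholder credentials -- substitute your own key and secret.
export AWS_ACCESS_KEY_ID="GOOG1EXAMPLEKEY"
export AWS_SECRET_ACCESS_KEY="examplesecret"

# Fail loudly if either variable is missing from the environment.
for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
  if [ -z "$(printenv "$var")" ]; then
    echo "$var is not set" >&2
    exit 1
  fi
done
echo "credentials exported"
```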
In our git repository, we run
git annex initremote <remote-name> type=S3 encryption=none host="storage.googleapis.com" autoenable=true bucket=<a-unique-bucket-name> datacenter=<datacenter-location-name> publicurl="https://storage.googleapis.com/<a-unique-bucket-name>" public=yes
Note that:
- <remote-name> can be chosen according to personal preference.
- <a-unique-bucket-name> is the name of the bucket as it will show up in your project. It must be globally unique, so I would lean towards longer names. git-annex will let you know if the name is unavailable.
- <datacenter-location-name> is the location of the datacenter as listed in Google Cloud Storage (e.g. europe-west1). You can find the list of possible values when creating a bucket through the web interface of Google Cloud.
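To make the placeholders concrete, here is a sketch that assembles the same initremote invocation from example values; the remote name, bucket name, and datacenter below are made up, not defaults. Echoing the command first lets you inspect it before running it:

```shell
# All three values are hypothetical examples -- substitute your own.
REMOTE="gcs"
BUCKET="myproject-annex-data"
DATACENTER="europe-west1"

CMD="git annex initremote $REMOTE type=S3 encryption=none \
host=storage.googleapis.com autoenable=true bucket=$BUCKET \
datacenter=$DATACENTER \
publicurl=https://storage.googleapis.com/$BUCKET public=yes"

# Inspect the assembled command, then run it with: eval "$CMD"
echo "$CMD"
```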
If everything went well, you should now see the remote when calling git annex list
. You should also see that a new bucket has appeared in Google Cloud Storage.
At this point the bucket is still private, so we need to address that. In the web interface, go to permissions
and (1) switch the access type to "uniform", then (2) go to grant access
and create a new principal named allUsers
, giving it the role Storage Object Viewer
. Now anyone with the URL should have read access to the bucket and its contents.
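If you prefer the command line over the web interface, the same two steps can be sketched with gsutil. The function name and bucket name are illustrative, and gsutil must be installed and authenticated for your project:

```shell
# Sketch: enable uniform bucket-level access, then grant public read.
# Usage: make_bucket_public <bucket-name>
make_bucket_public() {
  bucket="$1"
  gsutil ubla set on "gs://$bucket"
  gsutil iam ch allUsers:objectViewer "gs://$bucket"
}
```

Calling make_bucket_public my-bucket should then leave the bucket readable by anyone with the URL, matching the web-interface steps above.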
Now we can actually upload the annexed files to Google Cloud Storage with
git annex copy --to <remote-name>
Finally, after adding and setting up the special remote, don't forget to commit and push the changes as well; otherwise people pulling the project will not yet 'know' about the special remote.
Note
You can find information about remotes before cloning by looking at the remote.log
file in the git-annex
branch.
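For instance, you can print that file without checking out the git-annex branch. The helper below is a small sketch, meant to be run inside a repository that already has the branch:

```shell
# Print remote.log from the git-annex branch without switching branches.
# Usage: show_special_remotes (run inside the repository)
show_special_remotes() {
  git show git-annex:remote.log
}
```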
For users, the process for cloning and getting the annexed files is straightforward. You would clone the repository as normal:
git clone <url>
This pulls the repository, the symlinks, and the git-annex branch. To actually fetch the file contents, you simply run
git annex get .
and the files should be downloaded and ready for use. On Windows (without WSL), files will be automatically unlocked.