You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This gist explains how to commit large database dumps into a git repository without bloating.
We assume that your use case is as follows:
At a given point, there's only one "master" Drupal database (it could be the primary development site, or it could be the prod site)
Only the "master" site's DB needs to be checked into git, and needs to be accessible by the secondary sites
All Drupal sites need to be backed up regularly to local filesystem, but this doesn't need to be checked into git
There's no shortage of storage space in the upstream, to keep all revisions of the master DB
We only want git clone to pull in the latest revision of the master DB. Downloading the whole history takes too long.
Given this, we propose the following approach:
On any machine, regularly call drush sql-dump --gzip --result-file to automatically store timestamped SQL dumps in ~/drush-backups. These are not stored in git.
In your Drupal project root,
First initialize an empty upstream repository (assume it's accessible at localhost:~/upstream.git):
Then have the master site clone it, and make a few DB dumps:
# clone but only fetches the last 2 commits on master branch
git clone --single-branch --depth=1 localhost:~/db-upstream.git ~/db-master
# ensure that `git pull` or `git fetch` only work on master; this is unnecessary in git 1.8+
git config remote.origin.fetch '+refs/heads/master:refs/remotes/origin/master'# dump filename should always be the same (not timestamped)
drush sql-dump --gzip --result-file=~/db-master/database.sql.gz
cd~/db-master
git add database.sql.gz
git commit database.sql.gz -m "Database dump made on master"
git push -u origin master
Now a second developer (with his own, non-master Drupal site) can easily clone the repo, to pull the DB dump from it:
# clone but only fetches the last 2 commits on master branch
git clone --single-branch --depth=1 localhost:~/db-upstream.git ~/db-dev2
# ensure that `git pull` or `git fetch` only work on master; this is unnecessary in git 1.8+
git config remote.origin.fetch '+refs/heads/master:refs/remotes/origin/master'cd~/db-dev2
du -sh .# should be only 3M (1M for working copy, 2M for 2 revisions pulled in via --depth 1)
ls # contains database.sql.gz
This git clone should be relatively fast, because:
--single-branch ensures only the master branch commits are cloned
--depth 1 ensures that only the latest commit and it's parent are downloaded
Generally the second developer should not commit his dumps, since they're not in the master DB.
Instead, run drush sql-dump --result-file --gzip to create a timestamped file in ~/vagrant-backups.
If it's really essential to commit dev1's dumps for posterity, this needs to be done on a separate branch:
Now clone it, to have the first developer start using it:
git clone localhost:~/upstream ~/db # simple clone since the repo has no large files yetcd~/db
git checkout -b dev1 # create a branch for the first developer
du -sh .# a few kilobytes
Then generate 3 random 10 megabyte dummy files in dev1 branch, and commit each:
dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
du -sh .# 10M
git add fake-sql.bin
git commit fake-sql.bin -m "First db dump"
du -sh .# 20M
dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
git commit fake-sql.bin -m "Second db dump"
du -sh .# 30M
dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
git commit fake-sql.bin -m "Third db dump"
du -sh .# 40M = 10M working copy + 3 commits on dev1 branch
Now let's push our ~/db repo (with dev1 branch) to ~/upstream repo:
git push origin dev1
du -sh ~/upstream # 30M (3 commits on dev1 branch)
Now we'd like to introduce a 2nd developer to the project. She'll need to get the most recent DB dump from dev1 branch, and her own branch to commit backups to. This can be achieved as follows:
# "depth 0" isn't supported yet
git clone --single-branch --branch dev1 --depth 1 vagrant@localhost:~/upstream ~/db2
cd~/db2
du -sh .# 30 MB (10 MB for working copy, 20 MB for last 2 commits in history)
git log # shows the last 2 commits on dev1: Third db dump, Second db dump
git checkout -b dev2
Now let's make 3 more fake dumps on dev2 branch, and push to upstream:
# first note the "before" sizes
du -sh ~/upstream # 30M
du -sh .# 30M (20M for 2 dev commits, 10M working copy)
dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
git commit fake-sql.bin -m "First db dump on dev2"
dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
git commit fake-sql.bin -m "Second db dump on dev2"
dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
git commit fake-sql.bin -m "Third db dump on dev2"# push the new branch to upstream
git push origin dev2
# then compare the after sizes
du -sh .# 60M (10M for working copy, 30M for dev2 commits, 20M dev1 commits)
du -sh ~/upstream # 60M (3 commits on dev1, 3 commits on dev2)
At this point everyone's commits are in upstream. What happens if a third developer wants to join the team?
Naively, we would try cloning just 2 commits from branch dev1, but this has a problem:
git clone --single-branch --branch dev1 --depth 1 vagrant@localhost:~/upstream ~/db3.dev1
du -sh ~/db3.dev1 # 50M (seems to include 2 commits from dev and dev2 branches, plus 10M working copy)
It seems that because dev2 is a child of dev1 this gets pulled in too, and so we get 2 extra commits.
# to confirm that we're pulling in two extra dev2 commits:
cd ~/db3.dev1
git show origin/dev2 # 2 commits
# to confirm this doesn't happen when cloning dev2
git clone --single-branch --branch dev2 --depth 1 vagrant@localhost:~/upstream ~/db3.dev2
du -sh ~/db3.dev2 # 30M (seems to only include 2 commits from dev)
git show origin/dev1 # fatal error
# From http://stackoverflow.com/a/14930421/9621 and git 1.7.10 CHANGELOG:
# "git clone" learned "--single-branch" option to limit cloning to a single branch (surprise!);
# tags that do not point into the history of the branch are not fetched.
# now experiment with cloning dev2 (only 30M!!!):
git clone --single-branch --branch dev2 --depth 1 vagrant@localhost:~/upstream ~/db3.dev2
du -sh ~/db3.dev2 # 30M (seems to only include 2 commits from dev)
# now experiment with cloning master (right size!!)
git clone --single-branch --branch master --depth 1 vagrant@localhost:~/upstream ~/db3.master
du -sh ~/db3.master # a few kilobytes
# note that ~/upstream is not a bare repo
cd ~/upstream
git branch # master
# as an experiment, switch to dev1
git checkout dev1
# then try again
rm -Rf ~/db3.*
git clone --single-branch --branch dev1 --depth 1 vagrant@localhost:~/upstream ~/db3.dev1
git clone --single-branch --branch dev2 --depth 1 vagrant@localhost:~/upstream ~/db3.dev2
git clone --single-branch --branch master --depth 1 vagrant@localhost:~/upstream ~/db3.master
du -sh ~/db3.* # 49M, 49M, 20M respecitvely. WTF?
##
## Now try with a bare repo.... observe that no improvement is made
##
git clone --bare --branch master vagrant@localhost:~/upstream ~/upstream-bare.git
cd ~/upstream-bare.git
du -sh . # 60M from 6 commits
git branch -av # dev1 is checked out; could have used `git clone --branch master` to avoid
git clone --single-branch --branch dev1 --depth 1 vagrant@localhost:~/upstream-bare.git ~/db3.bare-dev1
git clone --single-branch --branch dev2 --depth 1 vagrant@localhost:~/upstream-bare.git ~/db3.bare-dev2
git clone --single-branch --branch master --depth 1 vagrant@localhost:~/upstream-bare.git ~/db3.bare-master
du -sh ~/db3.bare* # 49M, 49M, 20M respecitvely... no improvement
##
## Now observe how upstream HEAD affects things
##
rm -Rf ~/db3*
cd ~/upstream
git checkout dev1
git clone --single-branch --branch dev1 --depth 1 vagrant@localhost:~/upstream ~/db3.dev1-dev1
git clone --single-branch --branch dev2 --depth 1 vagrant@localhost:~/upstream ~/db3.dev1-dev2
git clone --single-branch --branch master --depth 1 vagrant@localhost:~/upstream ~/db3.dev1-master
git clone --single-branch --depth 1 vagrant@localhost:~/upstream ~/db3.dev1-nobranch
du -sh ~/db3.dev1* # 49M, 49M, 20M, 30M respectively
git checkout master
git clone --single-branch --branch dev1 --depth 1 vagrant@localhost:~/upstream ~/db3.master-dev1
git clone --single-branch --branch dev2 --depth 1 vagrant@localhost:~/upstream ~/db3.master-dev2
git clone --single-branch --branch master --depth 1 vagrant@localhost:~/upstream ~/db3.master-master
git clone --single-branch --depth 1 vagrant@localhost:~/upstream ~/db3.master-nobranch
du -sh ~/db3.master* # 49M, 30M, 160K, 160K respectively
git checkout dev2
git clone --single-branch --branch dev1 --depth 1 vagrant@localhost:~/upstream ~/db3.dev2-dev1
git clone --single-branch --branch dev2 --depth 1 vagrant@localhost:~/upstream ~/db3.dev2-dev2
git clone --single-branch --branch master --depth 1 vagrant@localhost:~/upstream ~/db3.dev2-master
git clone --single-branch --depth 1 vagrant@localhost:~/upstream ~/db3.dev2-nobranch
du -sh ~/db3.dev2* # 49M, 30M, 20M, 30M respectively
# note that the size of ~/db3.dev{1,2}-master is surprising... let's investigate
cd ~/db3.dev1-master
git branch -av # neither dev1 nor dev2 are present
git fsck --full # reports a dangling commit (which references 2 objects)
git gc --prune=now --aggressive # cleans up the 2 objects it references
du -sh . # now 160K as expected
##
## TODO: instead try cloning master (0M), then `git fetch origin +dev1:remote/dev1` (WORKS!!)
##
So we discovered there's a conflict between git clone arguments --depth and --branch.
Therefore the best work-around for our scenario is to avoid --branch when cloning, instead relying on upstream HEAD to be set correctly.
Therefore the following should work for developer 3:
git clone --single-branch --depth 1 vagrant@localhost:~/upstream ~/db3
cd~/db3
du -sh .# 30M, as expected
git checkout -b dev3
dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
git commit fake-sql.bin -m "1st db dump on dev3"
dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
git commit fake-sql.bin -m "2nd db dump on dev3"
dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
git commit fake-sql.bin -m "3rd db dump on dev3"
du -sh .# 60M, as expected
git push -u origin dev3
du -sh ~/upstream # 128MB ... too much!cd~/upstream
git gc --prune=now --aggressive
du -sh ~/upstream # 100M ... as expected!
So we're good, it seems.
Misc notes
Gotchas:
To avoid being prompted for a password whenever running ssh localhost in a VM, run this:
git remote set-head origin trunk will connect to upstream repo and set it's head to trunk
git checkout --orphan newbranch will make an orphan branch with current working copy
git pull $ARGS is the same as git fetch $ARGS ; git merge $REMOTE-BRANCH
$ARGS are passed verbatim to git fetch, including --depth and --single-branch
$REMOTE_BRANCH is determined via config value of branch.CURRENTBRANCH.origin and branch.CURRENTBRANCH.merge.
These config settings are written during git push -u origin CURRENTBRANCH or git checkout -b CURRENTBRANCH origin/CURRENTBRANCH, or can be manually set via git config branch.master.remote origin and git config branch.master.merge refs/heads/master
git merge (and so also git pull) only merge into the currently checked out branch
git fetch refspec syntax is different than git push refspec syntax. Both are src:dest which are clearly reversed in push/fetch.