Skip to content

Instantly share code, notes, and snippets.

@dergachev
Last active December 19, 2015 07:49
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save dergachev/f8d81909174463c1a3e1 to your computer and use it in GitHub Desktop.
Save dergachev/f8d81909174463c1a3e1 to your computer and use it in GitHub Desktop.

Git DB workflow

This gist explains how to commit large database dumps into a git repository without bloating.

We assume that your use case is as follows:

  • At a given point, there's only one "master" Drupal database (it could be the primary development site, or it could be the prod site)
  • Only the "master" site's DB needs to be checked into git, and needs to be accessible by the secondary sites
  • All Drupal sites need to be backed up regularly to local filesystem, but this doesn't need to be checked into git
  • There's no shortage of storage space in the upstream, to keep all revisions of the master DB
  • We only want git clone to pull in the latest revision of the master DB. Downloading the whole history takes too long.

Given this, we propose the following approach:

  • On any machine, regularly call drush sql-dump --gzip --result-file to automatically store timestamped SQL dumps in ~/drush-backups. These are not stored in git.
  • In your Drupal project root, First initialize an empty upstream repository (assume it's accessible at localhost:~/upstream.git):
mkdir ~/db-upstream.git
cd ~/db-upstream.git
git --bare init

Then have the master site clone it, and make a few DB dumps:

# clone but only fetches the last 2 commits on master branch
git clone --single-branch --depth=1 localhost:~/db-upstream.git ~/db-master

# ensure that `git pull` or `git fetch` only work on master; this is unnecessary in git 1.8+
git config remote.origin.fetch '+refs/heads/master:refs/remotes/origin/master'

# dump filename should always be the same (not timestamped)
drush sql-dump --gzip --result-file=~/db-master/database.sql.gz

cd ~/db-master
git add database.sql.gz
git commit database.sql.gz -m "Database dump made on master"

git push -u origin master

Now a second developer (with his own, non-master Drupal site) can easily clone the repo, to pull the DB dump from it:

# clone but only fetches the last 2 commits on master branch
git clone --single-branch --depth=1 localhost:~/db-upstream.git ~/db-dev2

# ensure that `git pull` or `git fetch` only work on master; this is unnecessary in git 1.8+
git config remote.origin.fetch '+refs/heads/master:refs/remotes/origin/master'

cd ~/db-dev2
du -sh . # should be only 3M (1M for working copy, 2M for 2 revisions pulled in via --depth 1)
ls # contains database.sql.gz

This git clone should be relatively fast, because:

  • --single-branch ensures only the master branch commits are cloned
  • --depth 1 ensures that only the latest commit and it's parent are downloaded

Generally the second developer should not commit his dumps, since they're not in the master DB. Instead, run drush sql-dump --result-file --gzip to create a timestamped file in ~/vagrant-backups.

If it's really essential to commit dev1's dumps for posterity, this needs to be done on a separate branch:

git checkout -b dev2
# dump, commit
git push -u origin dev2

TODO

Figure out workflow for relating DRUPAL_PROJECT_REPO and db-upstream.git. Resources:

Git DB workflow Proof of Concept

First initialize the upstream repository (assume it's accessible at localhost:~/upstream):

mkdir ~/upstream
cd ~/upstream
git init
touch .gitkeep
git add .gitkeep 
git commit -m "Initial commit on master"

Now clone it, to have the first developer start using it:

git clone localhost:~/upstream ~/db     # simple clone since the repo has no large files yet
cd ~/db
git checkout -b dev1       # create a branch for the first developer
du -sh . # a few kilobytes

Then generate 3 random 10 megabyte dummy files in dev1 branch, and commit each:

dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
du -sh . # 10M
git add fake-sql.bin
git commit fake-sql.bin -m "First db dump"
du -sh . # 20M

dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
git commit fake-sql.bin -m "Second db dump"
du -sh . # 30M

dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
git commit fake-sql.bin -m "Third db dump"
du -sh . # 40M = 10M working copy + 3 commits on dev1 branch

Now let's push our ~/db repo (with dev1 branch) to ~/upstream repo:

git push origin dev1
du -sh ~/upstream # 30M (3 commits on dev1 branch)

Now we'd like to introduce a 2nd developer to the project. She'll need to get the most recent DB dump from dev1 branch, and her own branch to commit backups to. This can be achieved as follows:

# "depth 0" isn't supported yet
git clone --single-branch --branch dev1 --depth 1 vagrant@localhost:~/upstream ~/db2
cd ~/db2
du -sh . # 30 MB (10 MB for working copy, 20 MB for last 2 commits in history)
git log  # shows the last 2 commits on dev1: Third db dump, Second db dump

git checkout -b dev2

Now let's make 3 more fake dumps on dev2 branch, and push to upstream:

# first note the "before" sizes
du -sh ~/upstream # 30M
du -sh . # 30M (20M for 2 dev commits, 10M working copy)

dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
git commit fake-sql.bin -m "First db dump on dev2"

dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
git commit fake-sql.bin -m "Second db dump on dev2"

dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
git commit fake-sql.bin -m "Third db dump on dev2"

# push the new branch to upstream
git push origin dev2

# then compare the after sizes
du -sh . # 60M (10M for working copy, 30M for dev2 commits, 20M dev1 commits)
du -sh ~/upstream # 60M (3 commits on dev1, 3 commits on dev2)

At this point everyone's commits are in upstream. What happens if a third developer wants to join the team?

Naively, we would try cloning just 2 commits from branch dev1, but this has a problem:

git clone --single-branch --branch dev1 --depth 1 vagrant@localhost:~/upstream ~/db3.dev1
du -sh ~/db3.dev1 # 50M (seems to include 2 commits from dev and dev2 branches, plus 10M working copy)

It seems that because dev2 is a child of dev1 this gets pulled in too, and so we get 2 extra commits.

# to confirm that we're pulling in two extra dev2 commits:
cd ~/db3.dev1
git show origin/dev2 # 2 commits

# to confirm this doesn't happen when cloning dev2
git clone --single-branch --branch dev2 --depth 1 vagrant@localhost:~/upstream ~/db3.dev2
du -sh ~/db3.dev2 # 30M (seems to only include 2 commits from dev)
git show origin/dev1 # fatal error

# From http://stackoverflow.com/a/14930421/9621 and git 1.7.10 CHANGELOG:
#    "git clone" learned "--single-branch" option to limit cloning to a single branch (surprise!); 
#    tags that do not point into the history of the branch are not fetched.

# now experiment with cloning dev2 (only 30M!!!):
git clone --single-branch --branch dev2 --depth 1 vagrant@localhost:~/upstream ~/db3.dev2
du -sh ~/db3.dev2 # 30M (seems to only include 2 commits from dev)

# now experiment with cloning master (right size!!)
git clone --single-branch --branch master --depth 1 vagrant@localhost:~/upstream ~/db3.master
du -sh ~/db3.master   # a few kilobytes

# note that ~/upstream is not a bare repo
cd ~/upstream
git branch # master

# as an experiment, switch to dev1
git checkout dev1

# then try again
rm -Rf ~/db3.*
git clone --single-branch --branch dev1 --depth 1 vagrant@localhost:~/upstream ~/db3.dev1
git clone --single-branch --branch dev2 --depth 1 vagrant@localhost:~/upstream ~/db3.dev2
git clone --single-branch --branch master --depth 1 vagrant@localhost:~/upstream ~/db3.master
du -sh ~/db3.*  # 49M, 49M, 20M respecitvely. WTF?

## 
##  Now try with a bare repo.... observe that no improvement is made
##

git clone --bare --branch master vagrant@localhost:~/upstream ~/upstream-bare.git
cd ~/upstream-bare.git
du -sh . # 60M from 6 commits
git branch -av # dev1 is checked out; could have used `git clone --branch master` to avoid
git clone --single-branch --branch dev1 --depth 1 vagrant@localhost:~/upstream-bare.git ~/db3.bare-dev1
git clone --single-branch --branch dev2 --depth 1 vagrant@localhost:~/upstream-bare.git ~/db3.bare-dev2
git clone --single-branch --branch master --depth 1 vagrant@localhost:~/upstream-bare.git ~/db3.bare-master
du -sh ~/db3.bare*  # 49M, 49M, 20M respecitvely... no improvement

##
## Now observe how upstream HEAD affects things
##

rm -Rf ~/db3*
cd ~/upstream

git checkout dev1
git clone --single-branch --branch dev1 --depth 1 vagrant@localhost:~/upstream ~/db3.dev1-dev1
git clone --single-branch --branch dev2 --depth 1 vagrant@localhost:~/upstream ~/db3.dev1-dev2
git clone --single-branch --branch master --depth 1 vagrant@localhost:~/upstream ~/db3.dev1-master
git clone --single-branch --depth 1 vagrant@localhost:~/upstream ~/db3.dev1-nobranch
du -sh ~/db3.dev1* # 49M, 49M, 20M, 30M respectively

git checkout master
git clone --single-branch --branch dev1 --depth 1 vagrant@localhost:~/upstream ~/db3.master-dev1
git clone --single-branch --branch dev2 --depth 1 vagrant@localhost:~/upstream ~/db3.master-dev2
git clone --single-branch --branch master --depth 1 vagrant@localhost:~/upstream ~/db3.master-master
git clone --single-branch --depth 1 vagrant@localhost:~/upstream ~/db3.master-nobranch

du -sh ~/db3.master* # 49M, 30M, 160K, 160K respectively

git checkout dev2
git clone --single-branch --branch dev1 --depth 1 vagrant@localhost:~/upstream ~/db3.dev2-dev1
git clone --single-branch --branch dev2 --depth 1 vagrant@localhost:~/upstream ~/db3.dev2-dev2
git clone --single-branch --branch master --depth 1 vagrant@localhost:~/upstream ~/db3.dev2-master
git clone --single-branch --depth 1 vagrant@localhost:~/upstream ~/db3.dev2-nobranch
du -sh ~/db3.dev2* # 49M, 30M, 20M, 30M respectively

# note that the size of ~/db3.dev{1,2}-master is surprising... let's investigate
cd ~/db3.dev1-master
git branch -av  # neither dev1 nor dev2 are present
git fsck --full # reports a dangling commit (which references 2 objects)
git gc --prune=now --aggressive   # cleans up the 2 objects it references
du -sh .   # now 160K as expected

## 
## TODO: instead try cloning master (0M), then `git fetch origin +dev1:remote/dev1` (WORKS!!)
##

So we discovered there's a conflict between git clone arguments --depth and --branch. Therefore the best work-around for our scenario is to avoid --branch when cloning, instead relying on upstream HEAD to be set correctly.

Therefore the following should work for developer 3:

git clone --single-branch --depth 1 vagrant@localhost:~/upstream ~/db3
cd ~/db3
du -sh . # 30M, as expected
git checkout -b dev3

dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
git commit fake-sql.bin -m "1st db dump on dev3"
dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
git commit fake-sql.bin -m "2nd db dump on dev3"
dd if=/dev/urandom of=fake-sql.bin bs=1024 count=10000
git commit fake-sql.bin -m "3rd db dump on dev3"
du -sh . # 60M, as expected

git push -u origin dev3
du -sh ~/upstream # 128MB ... too much!
cd ~/upstream
git gc --prune=now --aggressive
du -sh ~/upstream # 100M ... as expected!

So we're good, it seems.

Misc notes

Gotchas:

  • To avoid being prompted for a password whenever running ssh localhost in a VM, run this:
    • cat ~/.vagrant.d/insecure_private_key | vagrant ssh -- "cat - > /home/vagrant/.ssh/id_rsa"
  • git clone --single-branch requires git 1.7.10+.
  • git clone --depth 0 does nothing. The minimum depth is 1, which means "bring me the 2 last commits"
  • git clone --no-hardlinks is required if cloning via local filesystem. Otherwise git repo sizes are artificially reduced via links.
  • git clone --single-branch --b master shouldn't be affected by what the active HEAD in upstream is. But it seems to affect the file size!
  • git clone --single-branch before git 1.8+ leaves default refspec as +refs/heads/*:refs/remotes/origin/*, so git pull will fetch all branches.
  • git config remote.origin.fetch '+refs/heads/master:refs/remotes/origin/master' is an effective workaround
  • See git 1.8 release notes

Other things I learned:

  • git remote set-head origin trunk will connect to upstream repo and set it's head to trunk
  • git checkout --orphan newbranch will make an orphan branch with current working copy
  • git pull $ARGS is the same as git fetch $ARGS ; git merge $REMOTE-BRANCH
    • $ARGS are passed verbatim to git fetch, including --depth and --single-branch
    • $REMOTE_BRANCH is determined via config value of branch.CURRENTBRANCH.origin and branch.CURRENTBRANCH.merge.
      • These config settings are written during git push -u origin CURRENTBRANCH or git checkout -b CURRENTBRANCH origin/CURRENTBRANCH, or can be manually set via git config branch.master.remote origin and git config branch.master.merge refs/heads/master
    • git merge (and so also git pull) only merge into the currently checked out branch
  • git fetch refspec syntax is different than git push refspec syntax. Both are src:dest which are clearly reversed in push/fetch.

Resources

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment