UPDATE: I discovered the issue is with specifying a commit range to filter-branch. When that is omitted, everything works perfectly, including history truncation! Thanks to everyone who assisted with ideas and suggestions!
I am currently working on a project to split the various components of Zend Framework 2 into their own repositories.
Currently, our repository structure looks like this:
.coveralls.yml
.gitattributes
.gitignore
.php_cs
.travis.yml
bin/
CHANGELOG.md
composer.json
CONTRIBUTING.md
demos/
INSTALL.md
library/
    Zend/
        {component directories}
LICENSE.txt
README-GIT.md
README.md
resources/
tests/
    _autoload.php
    Bootstrap.php
    phpunit.xml.dist
    run-tests.php
    run-tests.sh
    TestConfiguration.php.dist
    TestConfiguration.php.travis
    ZendTest/
        {component directories}
The goal of the rewrite is to be able to maintain each component individually.
We also want to retain history. Commits often have information on the why behind a change; I was reminded of that just yesterday when a collaborator referenced a commit message as justification for an approach they were using. We have a rich set of issues and pull requests referenced in our commits as well, and we want to retain links to those.
Finally, we want to rewrite the structure of the resulting split directory, for a number of reasons:
- We would like to create a PSR-4 directory structure for each of the source and test code.
- The composer.json needs to provide the list of both production and development requirements, and define appropriate autoloaders for the component.
- We want to condense the README-GIT.md and CONTRIBUTING.md files into the latter.
- Components should have their own README.md, .travis.yml, .gitattributes, and .gitignore. Further, we can condense the instructions in the INSTALL.md into the README.md.
- The current TestConfiguration.php.* files define PHP constants; these can easily be moved to phpunit.xml.dist and phpunit.xml.travis files. Additionally, the phpunit.xml.* files, if moved up a level, will simplify running the unit tests (no need to descend a directory).
- _autoload.php and run-tests.* can be removed entirely, and the Bootstrap.php file can be vastly simplified.
Essentially, when done, we want the following directory structure:
.coveralls.yml
.gitattributes
.gitignore
.php_cs
.travis.yml
composer.json
CONTRIBUTING.md
src/
LICENSE.txt
phpunit.xml.dist
phpunit.xml.travis
README.md
test/
    bootstrap.php
    {component test cases}
I've tried a number of methods to accomplish this, including subtree, and filter-branch with each of subdirectory-filter and tree-filter.
subtree is a "contributed" command of git, maintained as part of the main source tree, but not installed by default. It provides a rich set of functionality around dealing with subtrees of repositories, allowing you to split off subtrees, add them, and even push commits back and forth between them.
filter-branch is like a Swiss Army knife for git, and provides mechanisms for rewriting commit messages, restructuring the repository filesystem, and more.
git subtree, at first blush, seems like the ideal, easiest solution.
Essentially, the process would be:
- split each of the library and test trees into their own branches.
- create a new repo, and add each of the above branches as subtrees.
As an example:
$ git clone zendframework/zf2
$ git init zend-http
$ cd zf2
$ git subtree split --prefix=library/Zend/Http -b src
$ git subtree split --prefix=tests/ZendTest/Http -b test
$ cd ../zend-http
$ # add in basic assets, and create initial commit
$ git remote add zf2 ../zf2
$ git subtree add --prefix=src/ zf2 src
$ git subtree add --prefix=test/ zf2 test
When done, the directory looks great!
However, the history is all wrong: if you check out a tag, you get the full contents of the ZF2 tree for that tag. This fails the criterion that the repo be in a usable state at any given commit.
For the subdirectory-filter approach, I based my work on what Ralph Schindler did for splitting out our "service" components when we were starting ZF2; you can read his gist for the full example.
The basic idea is similar to git subtree; the difference is that you have to start with separate checkouts for each of the source and tests, as you rewrite their history:
$ git clone zendframework/zf2 zend-http-src
$ git clone zendframework/zf2 zend-http-test
$ cd zend-http-src
$ git filter-branch --subdirectory-filter library/Zend/Http --tag-name-filter cat -- --all
$ cd ../zend-http-test
$ git filter-branch --subdirectory-filter tests/ZendTest/Http --tag-name-filter cat -- --all
$ cd ..
$ git init zend-http
$ cd zend-http
$ # add in basic assets, and create initial commit
$ git remote add -f src ../zend-http-src
$ git remote add -f test ../zend-http-test
$ git merge -s ours --no-commit src/master
$ git read-tree -u --prefix=src/ src/master
$ git commit -m 'Merging src tree'
$ git merge -s ours --no-commit test/master
$ git read-tree -u --prefix=test/ test/master
$ git commit -m 'Merging test tree'
Again, this looks great at first blush; all the contents for the given component
are rewritten perfectly. But when you start looking at previous tags and
commits, you see an interesting picture: based on the commit and which remote
you added first, you'll see a completely different directory structure.
Like subtree, this fails the criterion that the repo be in a usable state at any given commit.
tree-filter allows rewriting the tree contents themselves, which looks like a perfect fit for our goals; we should be able to retain history and still have each commit represent only the given component.
In playing with tree-filter, I also discovered several other filters that are of interest:
- msg-filter allows rewriting the commit messages
- commit-filter allows detecting and removing empty commits
- tag-name-filter ensures tags are rewritten when the parent commits change or are removed
I ended up with something that looks like this:
git filter-branch -f \
--tree-filter "php /path/to/tree-filter.php" \
--msg-filter "sed -re 's/(^|[^a-zA-Z])(\#[1-9][0-9]*)/\1zendframework\/zf2\2/g'" \
--commit-filter 'git_commit_non_empty_tree "$@"' \
--tag-name-filter cat \
${START_COMMIT}..HEAD
tree-filter.php is a script that rewrites the contents of the directory tree to match our expectations. It's not just moving files around, but also rewriting the contents of some files (notably, the composer.json). One cool aspect is that we can also introduce files in each commit, such as the various assets I was adding in the other approaches; this ensures they're present in any given commit.
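To make the shape of such a tree filter concrete, here is a minimal shell sketch. It is a hypothetical stand-in for tree-filter.php, not the actual script: the function name restructure_tree and the decision to discard every file outside the component are assumptions for illustration, and it neither rewrites composer.json nor adds any assets.

```shell
# Hypothetical stand-in for tree-filter.php (illustration only).
# Reshapes the current directory so that only one component remains,
# with its source under src/ and its tests under test/.
restructure_tree() {
    component="$1"            # e.g. Http (assumed component name)
    staging="$(mktemp -d)"

    # Move the component's source and tests aside, if this commit has them.
    if [ -d "library/Zend/$component" ]; then
        mv "library/Zend/$component" "$staging/src"
    fi
    if [ -d "tests/ZendTest/$component" ]; then
        mv "tests/ZendTest/$component" "$staging/test"
    fi

    # Remove every other top-level entry (never .git), then restore.
    find . -mindepth 1 -maxdepth 1 ! -name .git -exec rm -rf {} +
    for dir in src test; do
        if [ -d "$staging/$dir" ]; then
            mv "$staging/$dir" "./$dir"
        fi
    done
    rm -rf "$staging"
}
```

git filter-branch runs the tree filter with the working directory set to a checkout of the commit being rewritten, so a filter like this only has to rearrange files in place; whatever is left in the directory afterwards becomes the tree of the rewritten commit.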
The message filter ensures that references to issues and pull requests are rewritten to specify the original ZF2 repository. This is important, as it allows us to link to the original issue and/or pull request from the individual repositories.
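The effect of that substitution is easy to try outside of git. The sketch below wraps the same expression in a function (rewrite_refs is a hypothetical name) and uses sed -E, the portable spelling of GNU sed's -r, with an unescaped #:

```shell
# Prefix bare issue/PR references (#123) with the original repository name.
# Reads a commit message on stdin and writes the rewritten message to stdout,
# which is what git filter-branch expects from a --msg-filter command.
rewrite_refs() {
    sed -E 's/(^|[^a-zA-Z])(#[1-9][0-9]*)/\1zendframework\/zf2\2/g'
}

printf 'Fixes #123 and closes #45\n' | rewrite_refs
# prints: Fixes zendframework/zf2#123 and closes zendframework/zf2#45
```

One caveat of the pattern: the character class only excludes letters, so a reference that is already qualified, such as zendframework/zf2#123, is preceded by the digit 2 and would be prefixed a second time. That only matters if such references already exist in the messages or the filter is applied twice.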
The commit filter is intended to prune empty commits. We found that the --prune-empty option… didn't. I have no idea why. This worked.
The tag name filter ensures that tags are rewritten properly. Without it, many tags were referencing unreachable commits.
Finally, we specify the start commit, and tell it to rewrite through the most recent commit.
The above, when doing some limited tests, appeared to work well, with one exception: we still had empty merge commits. I found some recommendations for this, and have a secondary filter-branch operation that runs after the above:
git filter-branch -f \
--commit-filter '
if [ z$1 = z`git rev-parse $3^{tree}` ];then
skip_commit "$@";
else
git commit-tree "$@";
fi
' --tag-name-filter cat ${START_COMMIT}..HEAD ;
The tests I ran on this were somewhat inconclusive; I found that most empty merge commits were removed, but there were still some lingering.
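For reference, git passes the commit filter the same arguments as git commit-tree: the new tree ID first, then a -p flag and an ID for each parent. In the filter above, $1 is therefore the tree and $3 is the first parent, and a commit is skipped when its tree equals its first parent's tree. The same comparison can be run read-only to list the commits a pass left behind. This is only a sketch, and list_treesame_commits is a hypothetical helper name:

```shell
# List commits whose tree is identical to their first parent's tree,
# i.e. commits that change nothing relative to that parent.
list_treesame_commits() {
    range="${1:-HEAD}"        # revision or range to inspect
    git rev-list --parents "$range" | while read -r commit parent rest; do
        # Root commits have no parent to compare against.
        [ -n "$parent" ] || continue
        if [ "$(git rev-parse "$commit^{tree}")" = "$(git rev-parse "$parent^{tree}")" ]; then
            echo "$commit"
        fi
    done
}
```

Like the filter, this looks only at the first parent, so a merge whose tree matches its second parent but not its first is not reported.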
Now, I want you to note a phrase I used earlier: "limited tests".
I've tried a bunch of different iterations. In limited tests, these always seemed to work. By limited, I mean "a subrange of what we'll actually run". The reason for using a subrange? Time.
We have over 26k commits in our current ZF2 repository. Running over the entire range takes 5-6 hours on a machine with 4 cores and 16G of RAM (interestingly, more cores and more RAM do not affect speed much; I can run 5 such jobs in parallel in the same time period). If I specify a range from 2.0.0rc7 forward, I cut the number of commits down to around 13k, which takes a little over 3 hours. Due to the amount of time each run takes, I have to test on subranges.
So, what's the problem?
We've run into several.
A community member attempted to run all splits in parallel EC2 instances last week. Interestingly, when we started seeing them complete, none of the directory structures were rewritten. We don't know why; when we run any one of them individually, they appear to be rewritten fine. The problem may have been due to some last-minute tweaks of the scripts (though those changes did not affect the tree-filter itself), but the uncertainty is unsettling.
As a result of that failure, I did some more tweaking of our scripts, and used the parallel command to run 5 at a time over this past weekend. When I did a cursory examination, all looked fine. Then I started checking out individual tags, and discovered that not all tag commits were rewritten correctly. In fact, I started checking out the commits that led up to some of these tags, and those were not rewritten, either.
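Checking tags one by one is slow, and a check like this can be scripted without checking anything out. The sketch below assumes that a correctly rewritten commit has no top-level library or tests entry (the old layout); find_unrewritten_tags is a hypothetical helper name, not part of the scripts described here.

```shell
# Print every tag whose top-level tree still contains the old layout.
# Assumption: a rewritten commit has no top-level "library" or "tests" entry.
find_unrewritten_tags() {
    git tag --list | while read -r tag; do
        # ls-tree reads the tree object directly; no checkout is needed.
        if git ls-tree --name-only "$tag" | grep -qxE 'library|tests'; then
            echo "$tag"
        fi
    done
}
```

Run inside a rewritten repository, an empty result means every tag points at a tree in the new layout; any tag that is printed can then be inspected by hand.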
On top of all of this, I've also attempted to prune our history from prior to the specified start commit. To do this, I used a graft point:
echo "${START_COMMIT}" > .git/info/grafts
git filter-branch -f --prune-empty --tag-name-filter cat -- --all
git reflog expire --expire=now --all
git gc --prune=now --aggressive
This takes a fair bit of time (though not as long as the filter-branch operation), but I have yet to witness any observable effect. The repository retains the old commits from before the specified start commit, and the repository size does not change noticeably.
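One thing worth checking in a situation like this, offered as an assumption rather than a diagnosis of the run above: filter-branch keeps backups of the original refs under refs/original/, and those keep the old commits reachable, so git gc has nothing to discard. The sketch below swaps in git replace --graft (an alternative to the grafts file in more recent git releases) and deletes the backup and replacement refs before collecting garbage. The function name truncate_history is hypothetical.

```shell
# Sketch: drop all history before START_COMMIT, then reclaim the space.
truncate_history() {
    start_commit="$1"

    # Record a replacement that makes start_commit a parentless commit.
    git replace --graft "$start_commit"

    # filter-branch honors replacement refs and makes them permanent.
    # The environment variable only silences the warning newer gits print.
    FILTER_BRANCH_SQUELCH_WARNING=1 \
        git filter-branch -f --tag-name-filter cat -- --all

    # Backup and replacement refs still point at the old history; remove them.
    git for-each-ref --format='%(refname)' refs/original/ refs/replace/ |
        while read -r ref; do
            git update-ref -d "$ref"
        done

    git reflog expire --expire=now --all
    git gc --prune=now --quiet
}
```

After a call such as truncate_history "${START_COMMIT}", git rev-list --count HEAD should report only the commits from the start commit forward. Any ref that was not rewritten, for example a tag pointing at a commit older than the start commit, would still keep the old objects alive.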
So, I'm now at a loss: I have yet to get a rewrite that accomplishes all of our goals:
- Keeping history from a given commit forward only
- Rewriting the directory structure in all commits
- Rewriting commit messages when they reference issues/pull requests in all commits
At best, I've accomplished the third.
So, this is my plea for help: I'm unsure how to proceed. Every attempt I've tried looks like a success at first blush, but examining the repository in more detail (checking out old commits or tags, etc.) reveals that one or more goals are not met. I feel like I've exhausted the information I've been able to glean from the internet on filter-branch and subtree at this time, and I need a new set of eyes to assist.
Currently, I've put our scripts in a dedicated repository. Feel free to issue pull requests there, to comment on this gist, or to contact me directly if you have ideas. Ideally, if you can try something out and post a repository for me to verify the results, that would be fantastic.
It was an errant commit; not sure how that got in there. Surprisingly, it doesn't appear to affect the runs.
Yes; essentially, this is needed so that older tags will continue to work.
Googling now; would like to try this, as the amount of time it takes is ridiculous.