weierophinney/need-help-with-component-split.md

## need-help-with-component-split.md

      
UPDATE: I discovered the issue is with specifying a commit range to filter-branch. When that is omitted, everything works perfectly, including history truncation! Thanks to everyone who assisted with ideas and suggestions!

I am currently working on a project to split the various components of Zend
Framework 2 into their own repositories.
Currently, our repository structure looks like this:
```
.coveralls.yml
.gitattributes
.gitignore
.php_cs
.travis.yml
bin/
CHANGELOG.md
composer.json
CONTRIBUTING.md
demos/
INSTALL.md
library/
    Zend/
        {component directories}
LICENSE.txt
README-GIT.md
README.md
resources/
tests/
    _autoload.php
    Bootstrap.php
    phpunit.xml.dist
    run-tests.php
    run-tests.sh
    TestConfiguration.php.dist
    TestConfiguration.php.travis
    ZendTest/
        {component directories}
```

The goal of the rewrite is to be able to maintain each component individually.
We also want to retain history. Commits often have information on the why
behind a change; I was reminded of that just yesterday when a collaborator
referenced a commit message as justification for an approach they were using.
We have a rich set of issues and pull requests referenced in our commits as
well, and we want to retain links to those.
Finally, we want to rewrite the structure of the resulting split directory, for
a number of reasons:

- We would like to create a PSR-4 directory structure for both the source and
  test code.
- The composer.json needs to list both production and development
  requirements, and define appropriate autoloaders for the component.
- We want to condense the README-GIT.md and CONTRIBUTING.md files into the
  latter.
- Components should have their own README.md, .travis.yml, .gitattributes,
  and .gitignore. Further, we can condense the instructions in the INSTALL.md
  into the README.md.
- The current TestConfiguration.php.* files define PHP constants; these can
  easily be moved to phpunit.xml.dist and phpunit.xml.travis files.
  Additionally, the phpunit.xml.* files, if moved up a level, will simplify
  running the unit tests (no need to descend into a subdirectory).
- _autoload.php and run-tests.* can be removed entirely, and the
  Bootstrap.php file can be vastly simplified.

Essentially, when done, we want the following directory structure:
```
.coveralls.yml
.gitattributes
.gitignore
.php_cs
.travis.yml
composer.json
CONTRIBUTING.md
src/
LICENSE.txt
phpunit.xml.dist
phpunit.xml.travis
README.md
test/
    bootstrap.php
    {component test cases}
```

## Methods Tried

I've tried a number of methods to accomplish this, including subtree, and
filter-branch with each of subdirectory-filter and tree-filter.
subtree is a "contributed" command of git, maintained as part of the main
source tree, but not installed by default. It provides a rich set of
functionality around dealing with subtrees of repositories, allowing you to
split off subtrees, add them, and even push commits back and forth between them.
filter-branch is like a Swiss Army knife for git, and provides mechanisms for
rewriting commit messages, restructuring the repository filesystem, and more.
### Subtree

git subtree, at first blush, seems like the ideal, easiest solution.
Essentially, the process would be:

1. Split each of the library and test trees into their own branches.
2. Create a new repo, and add each of the above branches as subtrees.

As an example:
```console
$ git clone zendframework/zf2
$ git init zend-http
$ cd zf2
$ git subtree split --prefix=library/Zend/Http -b src
$ git subtree split --prefix=tests/ZendTest/Http -b test
$ cd ../zend-http
$ # add in basic assets, and create initial commit
$ git remote add zf2 ../zf2
$ git subtree add --prefix=src/ zf2 src
$ git subtree add --prefix=test/ zf2 test
```
When done, the directory looks great!
However, the history is all wrong: if you check out a tag, you get the full
contents of the ZF2 tree for that tag. This fails the criterion that the repo
be in a usable state at any given commit.
### subdirectory-filter

I based this on work Ralph Schindler did for splitting out our "service"
components when we were starting ZF2; you can read his
gist for the full example.
The basic idea is similar to git subtree; the difference is that you have to
start with separate checkouts for each of the source and tests, as you rewrite
their history:
```console
$ git clone zendframework/zf2 zend-http-src
$ git clone zendframework/zf2 zend-http-test
$ cd zend-http-src
$ git filter-branch --subdirectory-filter library/Zend/Http --tag-name-filter cat -- --all
$ cd ../zend-http-test
$ git filter-branch --subdirectory-filter tests/ZendTest/Http --tag-name-filter cat -- --all
$ cd ..
$ git init zend-http
$ cd zend-http
$ # add in basic assets, and create initial commit
$ git remote add -f src ../zend-http-src
$ git remote add -f test ../zend-http-test
$ git merge -s ours --no-commit src/master
$ git read-tree -u --prefix=src/ src/master
$ git commit -m 'Merging src tree'
$ git merge -s ours --no-commit test/master
$ git read-tree -u --prefix=test/ test/master
$ git commit -m 'Merging test tree'
```
Again, this looks great at first blush; all the contents for the given component
are rewritten perfectly. But when you start looking at previous tags and
commits, you see an interesting picture: depending on the commit and which
remote you added first, you'll see a completely different directory structure.
Like subtree, this fails the criterion that the repo be in a usable state at any
given commit.
### tree-filter

tree-filter allows rewriting the tree contents themselves, which looks like a
perfect fit for our goals; we should be able to retain history and still have
each commit represent only the given component.
In playing with tree-filter, I also discovered several other filters that are
of interest:

- msg-filter allows rewriting the commit messages
- commit-filter allows detecting and removing empty commits
- tag-name-filter ensures tags are rewritten when the parent commits change or are removed

I ended up with something that looks like this:
```sh
git filter-branch -f \
    --tree-filter "php /path/to/tree-filter.php" \
    --msg-filter "sed -re 's/(^|[^a-zA-Z])(\#[1-9][0-9]*)/\1zendframework\/zf2\2/g'" \
    --commit-filter 'git_commit_non_empty_tree "$@"' \
    --tag-name-filter cat \
    ${START_COMMIT}..HEAD
```
tree-filter.php is a script that rewrites the contents of the directory tree
to match our expectations. It's not just moving files around, but also rewriting
the contents of some files (notably, the composer.json). One cool aspect is
that we can also introduce files in each commit, such as the various assets I
was adding in the other approaches; this ensures they're present in any given
commit.
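
To make the idea concrete, here is a minimal shell sketch of what a tree filter of this kind does. It is not the tree-filter.php script itself: the function name `restructure_tree`, the staging directory, and the decision to keep only LICENSE.txt are assumptions made for illustration, and it does not rewrite composer.json or add new assets. filter-branch runs the tree filter inside a temporary checkout of each commit, which the demo at the bottom imitates with a throwaway directory.

```sh
#!/bin/sh
# Hypothetical stand-in for tree-filter.php: keep one component's source and
# tests, renamed to src/ and test/, and drop everything else in the tree.
restructure_tree() {
    component="$1"
    staging=".split-staging"   # assumed scratch directory name
    rm -rf "$staging"
    mkdir "$staging"

    # A component may not exist yet in early commits, so test before moving.
    if [ -d "library/Zend/$component" ]; then
        mv "library/Zend/$component" "$staging/src"
    fi
    if [ -d "tests/ZendTest/$component" ]; then
        mv "tests/ZendTest/$component" "$staging/test"
    fi
    if [ -f LICENSE.txt ]; then
        mv LICENSE.txt "$staging/LICENSE.txt"
    fi

    # Remove whatever is left of the monolith, including dotfiles.
    for entry in * .[!.]*; do
        case "$entry" in
            "$staging"|.git) continue ;;
        esac
        if [ -e "$entry" ]; then
            rm -rf "$entry"
        fi
    done

    # Move the kept entries back to the top level.
    for entry in src test LICENSE.txt; do
        if [ -e "$staging/$entry" ]; then
            mv "$staging/$entry" "$entry"
        fi
    done
    rmdir "$staging"
}

# Demo on a throwaway directory that mimics a checked-out commit.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir -p library/Zend/Http library/Zend/Mvc tests/ZendTest/Http bin
touch library/Zend/Http/Client.php library/Zend/Mvc/Application.php
touch tests/ZendTest/Http/ClientTest.php LICENSE.txt README.md .travis.yml
restructure_tree Http
ls -A
```

Because the filter runs once per commit, every step has to tolerate commits in which the component, its tests, or the license file do not exist yet; that is why each move is guarded.
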
The message filter ensures that references to issues and pull requests are
rewritten to specify the original ZF2 repository. This is important, as it
allows us to link to the original issue and/or pull request from the individual
repositories.
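
The substitution can be tried outside of filter-branch by piping sample messages through it. The sketch below uses the same expression as the msg-filter above, written with `-E` instead of `-re` so it also runs on BSD sed; the helper name `qualify_issue_refs` and the sample messages are made up for the example.

```sh
#!/bin/sh
# Same substitution as the --msg-filter, wrapped in a hypothetical helper.
qualify_issue_refs() {
    sed -E 's/(^|[^a-zA-Z])(#[1-9][0-9]*)/\1zendframework\/zf2\2/g'
}

# Bare references are prefixed with the original repository name.
rewritten=$(printf '%s\n' 'Fixes #1234 and closes #56' | qualify_issue_refs)
echo "$rewritten"

# A reference preceded by a letter (here, another repository) is left alone.
untouched=$(printf '%s\n' 'See foo/bar#7 for context' | qualify_issue_refs)
echo "$untouched"
```

One thing to watch for: the guard only excludes letters in front of the `#`, so a reference that is already written as zendframework/zf2#12 ends in a digit and would be prefixed a second time. It may be worth searching the existing messages for that form before running the filter.
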
The commit filter is intended to prune empty commits. We found that the
--prune-empty option… didn't. I have no idea why. This worked.
The tag name filter ensures that tags are rewritten properly. Without it, many
tags were referencing unreachable commits.
Finally, we specified the start commit, and tell it to rewrite through the most
recent.
The above, when doing some limited tests, appeared to work well, with one
exception: we still had empty merge commits. I found some recommendations for
this, and have a secondary filter-branch operation that runs after the above:
```sh
git filter-branch -f \
    --commit-filter '
        if [ z$1 = z`git rev-parse $3^{tree}` ]; then
            skip_commit "$@";
        else
            git commit-tree "$@";
        fi
    ' \
    --tag-name-filter cat \
    ${START_COMMIT}..HEAD
```
The tests I ran on this were somewhat inconclusive; I found that most empty
merge commits were removed, but there were still some lingering.
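
The commit filter receives the same arguments as git commit-tree (the tree first, then `-p <parent>` pairs), so the test above compares the new tree (`$1`) with the tree of the first parent (`$3`). To see which merges would satisfy that condition without rewriting anything, the same comparison can be run read-only. This is a sketch; the function name `list_empty_merges` and the throwaway repository are assumptions for the demo.

```sh
#!/bin/sh
# Print merge commits whose tree is identical to their first parent's tree,
# which is the condition the commit filter above uses to skip a commit.
list_empty_merges() {
    git rev-list --merges "$@" |
    while read -r commit; do
        merged_tree=$(git rev-parse "$commit^{tree}")
        parent_tree=$(git rev-parse "$commit^1^{tree}")
        if [ "$merged_tree" = "$parent_tree" ]; then
            echo "$commit"
        fi
    done
}

# Demo: build a tiny repository that contains one empty merge.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"
echo one > a.txt
git add a.txt
git commit -q -m "base"
git checkout -q -b topic
echo two > b.txt
git add b.txt
git commit -q -m "topic work"
git checkout -q -
# "-s ours" records a merge but keeps the current tree unchanged.
git merge -q --no-ff -s ours -m "empty merge" topic
empty_merges=$(list_empty_merges HEAD)
echo "$empty_merges"
```

A merge whose tree matches its second parent rather than its first does not meet this condition, which may account for some of the lingering merges, though I have not confirmed that against the ZF2 history.
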
Now, I want you to note a phrase I used earlier: "limited tests".
I've tried a bunch of different iterations. In limited tests, these always
seemed to work. By limited, I mean "a subrange of what we'll actually run". The
reason for using a subrange? Time.
We have over 26k commits in our current ZF2 repository. Running over the entire
range takes 5-6 hours on a machine with 4 cores and 16G of RAM (interestingly,
more cores and more RAM do not affect speed much; I can run 5 such jobs in
parallel in the same time period). If I specify a range from 2.0.0rc7 forward, I cut the
number of commits down to around 13k, which takes a little over 3 hours. Due to
the amount of time each run takes, I have to test on subranges.
So, what's the problem?
We've run into several.
A community member attempted to run all splits in parallel on EC2 instances last
week. Interestingly, when we started seeing them complete, none of the directory
structures were rewritten. We don't know why; when we run any one of them
individually, they appear to be rewritten fine. The problem may have been due to
some last-minute tweaks of the scripts (though those changes did not affect the
tree-filter itself), but the uncertainty is unsettling.
As a result of that failure, I did some more tweaking of our scripts, and used
the parallel command to run 5 at a time
over this past weekend. When I did a cursory examination, all looked fine. Then I
started checking out individual tags, and discovered that not all tag commits
were rewritten correctly. In fact, I started checking out the commits that led
up to some of these tags, and those were not rewritten, either.
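
Checking tags by hand does not scale to this many of them, so the spot check can be scripted. The following is a minimal sketch: it lists every tag whose top-level tree still contains a library or tests directory from the old layout. The function name `find_unrewritten_tags` and the demo repository are assumptions; the directory names come from the original structure shown earlier.

```sh
#!/bin/sh
# List tags whose root tree still has the monolith's top-level directories.
find_unrewritten_tags() {
    git for-each-ref --format='%(refname:short)' refs/tags |
    while read -r tag; do
        if git ls-tree --name-only "$tag" | grep -qxE 'library|tests'; then
            echo "$tag"
        fi
    done
}

# Demo: one tag with the old layout, one with the new layout.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"
mkdir -p library/Zend/Http
echo '<?php' > library/Zend/Http/Client.php
git add .
git commit -q -m "old layout"
git tag old-layout
git rm -r -q library
mkdir src
echo '<?php' > src/Client.php
git add .
git commit -q -m "new layout"
git tag new-layout
unrewritten=$(find_unrewritten_tags)
echo "$unrewritten"
```

Run against a finished split, an empty result would mean that no tag points at a commit with the old structure; it says nothing about untagged commits, which would need a similar loop over git rev-list.
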
## Pruning commits

On top of all of this, I've also attempted to prune our history from prior to
the specified start commit. To do this, I used a graft point:
```sh
echo "${START_COMMIT}" > .git/info/grafts
git filter-branch -f --prune-empty --tag-name-filter cat -- --all
git reflog expire --expire=now --all
git gc --prune=now --aggressive
```
This takes a fair bit of time (though not as long as the filter-branch
operation), but I have yet to observe any effect. The repository retains the
old commits from before the specified start commit, and the repository size
does not change noticeably.
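
One thing that may be worth ruling out, though nothing above confirms it as the cause: git filter-branch keeps the pre-rewrite refs under refs/original/, and for as long as those exist, git gc treats the old commits as reachable and keeps them. The sketch below removes those backup refs. The helper name `drop_backup_refs` is made up, and the demo fakes a backup ref with git update-ref rather than running a real rewrite.

```sh
#!/bin/sh
# Delete the backup refs that filter-branch stores under refs/original/.
drop_backup_refs() {
    git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do
        git update-ref -d "$ref"
    done
}

# Demo: simulate a leftover backup ref in a throwaway repository.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"
echo one > a.txt
git add a.txt
git commit -q -m "base"
git update-ref refs/original/refs/heads/demo HEAD   # stand-in for a real backup
before=$(git for-each-ref --format='%(refname)' refs/original/)
drop_backup_refs
remaining=$(git for-each-ref --format='%(refname)' refs/original/)
echo "before: $before"
echo "remaining: ${remaining:-none}"
```

In a real run this would go between the filter-branch call and the reflog expire and gc steps shown above. Only do it on a clone you can afford to lose, since these refs are the only in-repository backup of the old history.
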
## Conclusions

So, I'm now at a loss: I have yet to get a rewrite that accomplishes all of our
goals:

1. Keeping history from a given commit forward only
2. Rewriting the directory structure in all commits
3. Rewriting commit messages when they reference issues/pull requests in all
   commits

At best, I've accomplished the third.
So, this is my plea for help: I'm unsure how to proceed. Every attempt I've
tried looks like a success at first blush, but examining the repository in more
detail — checking out old commits or tags, etc. — reveals that one or more goals
are not met. I feel like I've exhausted the information I've been able to glean
from the internet on filter-branch and subtree at this time, and I need a
new set of eyes to assist.
Currently, I've put our scripts in a dedicated
repository. Feel free to
issue pull requests there, to comment on this gist, or to contact me directly if
you have ideas. Ideally, if you can try something out and post a repository for
me to verify the results, that would be fantastic.