@weierophinney
Last active March 31, 2016 07:32
Please help me with a tricky git component split problem

UPDATE: I discovered the issue is with specifying a commit range to filter-branch. When that is omitted, everything works perfectly, including history truncation! Thanks to everyone who assisted with ideas and suggestions!

I am currently working on a project to split the various components of Zend Framework 2 into their own repositories.

Currently, our repository structure looks like this:

.coveralls.yml
.gitattributes
.gitignore
.php_cs
.travis.yml
bin/
CHANGELOG.md
composer.json
CONTRIBUTING.md
demos/
INSTALL.md
library/
    Zend/
        {component directories}
LICENSE.txt
README-GIT.md
README.md
resources/
tests/
    _autoload.php
    Bootstrap.php
    phpunit.xml.dist
    run-tests.php
    run-tests.sh
    TestConfiguration.php.dist
    TestConfiguration.php.travis
    ZendTest/
        {component directories}

The goal of the rewrite is to be able to maintain each component individually.

We also want to retain history. Commits often have information on the why behind a change; I was reminded of that just yesterday when a collaborator referenced a commit message as justification for an approach they were using. We have a rich set of issues and pull requests referenced in our commits as well, and we want to retain links to those.

Finally, we want to rewrite the structure of the resulting split directory, for a number of reasons:

  • We would like to create a PSR-4 directory structure for each of the source and test code.
  • The composer.json needs to provide the list of both production and development requirements, and define appropriate autoloaders for the component.
  • We want to condense the README-GIT.md and CONTRIBUTING.md files into the latter.
  • Components should have their own README.md, .travis.yml, .gitattributes, and .gitignore. Further, we can condense the instructions in the INSTALL.md into the README.md.
  • The current TestConfiguration.php.* files define PHP constants; these can easily be moved to phpunit.xml.dist and phpunit.xml.travis files. Additionally, the phpunit.xml.* files, if moved up a level, will simplify running the unit tests (no need to descend into a subdirectory).
  • _autoload.php and run-tests.* can be removed entirely, and the Bootstrap.php file can be vastly simplified.

Essentially, when done, we want the following directory structure:

.coveralls.yml
.gitattributes
.gitignore
.php_cs
.travis.yml
composer.json
CONTRIBUTING.md
src/
LICENSE.txt
phpunit.xml.dist
phpunit.xml.travis
README.md
test/
    bootstrap.php
    {component test cases}

Methods Tried

I've tried a number of methods to accomplish this, including subtree, and filter-branch with each of subdirectory-filter and tree-filter.

subtree is a "contributed" command of git, maintained as part of the main source tree, but not installed by default. It provides a rich set of functionality around dealing with subtrees of repositories, allowing you to split off subtrees, add them, and even push commits back and forth between them.

filter-branch is like a Swiss Army knife for git, and provides mechanisms for rewriting commit messages, restructuring the repository filesystem, and more.

Subtree

git subtree, at first blush, seems like the ideal, easiest solution. Essentially, the process would be:

  • split each of the library and test trees into their own branches.
  • create a new repo, and add each of the above branches as subtrees.

As an example:

$ git clone zendframework/zf2
$ git init zend-http
$ cd zf2
$ git subtree split --prefix=library/Zend/Http -b src
$ git subtree split --prefix=tests/ZendTest/Http -b test
$ cd ../zend-http
$ # add in basic assets, and create initial commit
$ git remote add zf2 ../zf2
$ git subtree add --prefix=src/ zf2 src
$ git subtree add --prefix=test/ zf2 test

When done, the directory looks great!

However, the history is all wrong: if you check out a tag, you get the full contents of the ZF2 tree for that tag. This fails the criterion that the repo be in a usable state at any given commit.

subdirectory-filter

I based this on work Ralph Schindler did for splitting out our "service" components when we were starting ZF2; you can read his gist for the full example.

The basic idea is similar to git subtree; the difference is that you have to start with separate checkouts for each of the source and tests, as you rewrite their history:

$ git clone zendframework/zf2 zend-http-src
$ git clone zendframework/zf2 zend-http-test
$ cd zend-http-src
$ git filter-branch --subdirectory-filter library/Zend/Http --tag-name-filter cat -- --all
$ cd ../zend-http-test
$ git filter-branch --subdirectory-filter tests/ZendTest/Http --tag-name-filter cat -- --all
$ cd ..
$ git init zend-http
$ cd zend-http
$ # add in basic assets, and create initial commit
$ git remote add -f src ../zend-http-src
$ git remote add -f test ../zend-http-test
$ git merge -s ours --no-commit src/master
$ git read-tree -u --prefix=src/ src/master
$ git commit -m 'Merging src tree'
$ git merge -s ours --no-commit test/master
$ git read-tree -u --prefix=test/ test/master
$ git commit -m 'Merging test tree'

Again, this looks great at first blush; all the contents for the given component are rewritten perfectly. But when you start looking at previous tags and commits, you see an interesting picture: depending on the commit and which remote you added first, you'll see a completely different directory structure. Like subtree, this fails the criterion that the repo be in a usable state at any given commit.

tree-filter

tree-filter allows rewriting the tree contents themselves, which looks like a perfect fit for our goals; we should be able to retain history and still have each commit represent only the given component.

In playing with tree-filter, I also discovered several other filters that are of interest:

  • msg-filter allows rewriting the commit messages
  • commit-filter allows detecting and removing empty commits
  • tag-name-filter ensures tags are rewritten when the parent commits change or are removed

I ended up with something that looks like this:

git filter-branch -f \
    --tree-filter "php /path/to/tree-filter.php" \
    --msg-filter "sed -re 's/(^|[^a-zA-Z])(\#[1-9][0-9]*)/\1zendframework\/zf2\2/g'" \
    --commit-filter 'git_commit_non_empty_tree "$@"' \
    --tag-name-filter cat \
    ${START_COMMIT}..HEAD

tree-filter.php is a script that rewrites the contents of the directory tree to match our expectations. It's not just moving files around, but also rewriting the contents of some files (notably, the composer.json). One cool aspect is that we can also introduce files in each commit, such as the various assets I was adding in the other approaches; this ensures they're present in any given commit.
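
As a rough illustration of what such a tree filter does, here is a minimal shell sketch of the move-and-prune step only. It is not the actual tree-filter.php: the function name restructure_tree and the choice to keep only src/, test/, and LICENSE.txt are assumptions made for the example, and it does not rewrite composer.json or add any new assets.

```shell
#!/bin/sh
# Hypothetical stand-in for the restructuring part of tree-filter.php.
# Run from the root of the checked-out tree; $1 is the component name (e.g. Http).
restructure_tree() {
    component="$1"
    stage=$(mktemp -d) || return 1

    # Stage the pieces we want to keep under their new names.
    if [ -d "library/Zend/$component" ]; then
        mv "library/Zend/$component" "$stage/src"
    fi
    if [ -d "tests/ZendTest/$component" ]; then
        mv "tests/ZendTest/$component" "$stage/test"
    fi
    if [ -f LICENSE.txt ]; then
        mv LICENSE.txt "$stage/LICENSE.txt"
    fi

    # Drop everything else at the top level (but never a .git directory).
    find . -mindepth 1 -maxdepth 1 ! -name .git -exec rm -rf {} +

    # Move the staged entries back into the now-empty tree.
    for entry in src test LICENSE.txt; do
        if [ -e "$stage/$entry" ]; then
            mv "$stage/$entry" "./$entry"
        fi
    done
    rm -rf "$stage"
}
```

With the function saved to a file (the path here is hypothetical), it could be wired in with something like --tree-filter '. /path/to/restructure.sh && restructure_tree Http'.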

The message filter ensures that references to issues and pull requests are rewritten to specify the original ZF2 repository. This is important, as it allows us to link to the original issue and/or pull request from the individual repositories.
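
To see what the message filter does in isolation, the same substitution can be run on a sample message. This is only a sketch: the function name rewrite_refs and the sample text are made up for illustration, and it uses the -E flag (accepted by both GNU and BSD sed) in place of -r.

```shell
#!/bin/sh
# Same substitution as the --msg-filter above, written with -E instead of -r.
# rewrite_refs is a hypothetical helper name used only for this example.
rewrite_refs() {
    sed -E 's/(^|[^a-zA-Z])(#[1-9][0-9]*)/\1zendframework\/zf2\2/g'
}

printf 'Fixes #1234; see also #42\n' | rewrite_refs
# prints: Fixes zendframework/zf2#1234; see also zendframework/zf2#42
```

Note that a reference directly preceded by a letter (for example PR#5) is left untouched, because the pattern requires a non-letter or the start of the line before the # sign.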

The commit filter is intended to prune empty commits. We found that the --prune-empty option… didn't. I have no idea why. This worked.

The tag name filter ensures that tags are rewritten properly. Without it, many tags were referencing unreachable commits.

Finally, we specify the start commit, and tell filter-branch to rewrite from there through the most recent commit.

The above, when doing some limited tests, appeared to work well, with one exception: we still had empty merge commits. I found some recommendations for this, and have a secondary filter-branch operation that runs after the above:

git filter-branch -f \
    --commit-filter '
        # $1 is the new tree; $3 is the first parent (arguments are "<tree> -p <parent> ...").
        if [ "z$1" = "z$(git rev-parse "$3^{tree}")" ]; then
            skip_commit "$@"
        else
            git commit-tree "$@"
        fi
    ' \
    --tag-name-filter cat \
    ${START_COMMIT}..HEAD

The tests I ran on this were somewhat inconclusive; I found that most empty merge commits were removed, but there were still some lingering.
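
One way to make "still some lingering" measurable is to list the commits whose tree is identical to that of their first parent, which is the same condition the commit filter tests. The following is a minimal sketch, not part of our scripts; the function name list_empty_commits is an assumption for the example.

```shell
#!/bin/sh
# List commits whose tree is identical to their first parent's tree,
# i.e. the commits the commit-filter above is meant to skip.
# list_empty_commits is a hypothetical helper; run it inside the rewritten repo.
list_empty_commits() {
    git rev-list --all --parents | while read -r commit parent rest; do
        # Root commits have no parent to compare against.
        [ -n "$parent" ] || continue
        if [ "$(git rev-parse "$commit^{tree}")" = "$(git rev-parse "$parent^{tree}")" ]; then
            echo "$commit"
        fi
    done
}
```

An empty result after a run would suggest that no such commits remain on any ref; any hashes printed are candidates to inspect by hand.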

Now, I want you to note a phrase I used earlier: "limited tests".

I've tried a bunch of different iterations. In limited tests, these always seemed to work. By limited, I mean "a subrange of what we'll actually run". The reason for using a subrange? Time.

We have over 26k commits in our current ZF2 repository. Running over the entire range takes 5-6 hours on a machine with 4 cores and 16G of RAM (interestingly, more cores and more RAM do not affect speed much; I can run 5 such jobs in parallel in the same time period). If I specify a range from 2.0.0rc7 forward, I cut the number of commits down to around 13k, which takes a little over 3 hours. Due to the amount of time each run takes, I have to test on subranges.

So, what's the problem?

We've run into several.

A community member attempted to run all splits in parallel EC2 instances last week. Interestingly, when we started seeing them complete, none of the directory structures were rewritten. We don't know why; when we run any one of them individually, they appear to be rewritten fine. The problem may have been due to some last minute tweaks of the scripts (though those changes did not affect the tree-filter itself), but the uncertainty is unsettling.

As a result of that failure, I did some more tweaking of our scripts, and used the parallel command to run 5 at a time over this past weekend. When I did a cursory examination, all looked fine. Then I started checking out individual tags, and discovered that not all tag commits were rewritten correctly. In fact, I started checking out the commits that led up to some of these tags, and those were not rewritten, either.
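
Rather than checking out tags one at a time, the top-level tree of every commit can be inspected directly. The sketch below is an illustration only: the function name find_unrewritten is made up, and it assumes that a top-level library or tests directory means the commit was not rewritten.

```shell
#!/bin/sh
# Print every commit reachable from any ref (branches and tags) whose
# top-level tree still contains the old "library" or "tests" directory.
# find_unrewritten is a hypothetical helper; run it inside the split repo.
find_unrewritten() {
    git rev-list --all | while read -r commit; do
        if git ls-tree --name-only "$commit" | grep -q -x -e library -e tests; then
            echo "$commit"
        fi
    done
}
```

Piping the result through wc -l gives a quick count of how many commits a given run missed.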

Pruning commits

On top of all of this, I've also attempted to prune our history from prior to the specified start commit. To do this, I used a graft point:

echo "${START_COMMIT}" > .git/info/grafts
git filter-branch -f --prune-empty --tag-name-filter cat -- --all
git reflog expire --expire=now --all
git gc --prune=now --aggressive

This takes a fair bit of time (though not as long as the filter-branch operation), but I have yet to witness any observable effect. The repository retains the old commits from before the specified start commit, and the repository size does not change noticeably.
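
One thing worth ruling out: the git repository layout documentation describes each graft record as full hexadecimal object names. If START_COMMIT holds a tag name or an abbreviated hash, the graft line may not be honored. Whether that applies here is an assumption, not something established above. A minimal sketch that resolves the value first (the function name write_graft is hypothetical):

```shell
#!/bin/sh
# Resolve a tag, branch, or short hash to a full commit id before writing
# the graft record. write_graft is a hypothetical helper for this example.
write_graft() {
    start_sha=$(git rev-parse --verify "$1^{commit}") || return 1
    git_dir=$(git rev-parse --git-dir) || return 1
    mkdir -p "$git_dir/info"
    # A line with a single object name and no parents marks that commit as a root.
    echo "$start_sha" > "$git_dir/info/grafts"
}
```

It would be called as write_graft "${START_COMMIT}" in place of the echo line, followed by the same filter-branch, reflog, and gc steps.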

Conclusions

So, I'm now at a loss: I have yet to get a rewrite that accomplishes all of our goals:

  • Keeping history from a given commit forward only
  • Rewriting the directory structure in all commits
  • Rewriting commit messages when they reference issues/pull requests in all commits

At best, I've accomplished the third.

So, this is my plea for help: I'm unsure how to proceed. Every attempt I've tried looks like a success at first blush, but examining the repository in more detail — checking out old commits or tags, etc. — reveals that one or more goals are not met. I feel like I've exhausted the information I've been able to glean from the internet on filter-branch and subtree at this time, and I need a new set of eyes to assist.

Currently, I've put our scripts in a dedicated repository. Feel free to issue pull requests there, to comment on this gist, or to contact me directly if you have ideas. Ideally, if you can try something out and post a repository for me to verify the results, that would be fantastic.

@weierophinney
Author

One note: Essentially, I feel that the tree-filter approach is correct. What I cannot seem to figure out is why the filter is not applied to every commit in the range specified. Every other aspect of that run works fine (the message filter, commit filter, and tag-name filter are all applied correctly); it's the tree-filter that fails occasionally, without any indication in my logs as to which commits might have issues.

@renatomefi

Hello Matthew, I just started a test with Filter and Db components using the "component-split" utility.
Just a tip for performance, use your ram as filesystem (mount a tmpfs), since you have 16GB it'll not be a problem. The filter process seems to use a lot of IOPS.

Another thing:
What is this? https://github.com/zendframework/component-split/blob/37d1986caa76fb0079c0b42cfc25ba35babd0808/bin/split-component.sh#L171
You are rewriting the composer.json history to keep old versions usable, is it?

I'll wait for those first tests to run and then I'll take a closer look at how the repository is organized.

@weierophinney
Author

What is this?

It was an errant commit; not sure how that got in there. Surprisingly, it doesn't appear to affect the runs.

You are rewriting the composer.json history to keep old versions usable, is it?

Yes; essentially, this is needed so that older tags will continue to work.

use your ram as filesystem (mount a tmpfs)

Googling now; would like to try this, as the amount of time it takes is ridiculous.

@weierophinney
Author

I think the root cause may be specifying the commit range to filter-branch. If you consider the commit graph, there will be times when parents are not in that direct lineage, and I think that's where things are going awry. I'm trying a run now that uses -- --all instead of the range, and will report my findings.
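
A quick way to see how much falls outside the range is to list the tags whose commits are not part of START..HEAD, since filter-branch only visits the commits in the revision range it is given. This is a sketch under that reading; the function name tags_outside_range is made up for the example.

```shell
#!/bin/sh
# Print tags whose target commit is NOT in the range <start>..HEAD.
# tags_outside_range is a hypothetical helper; $1 is the start commit.
tags_outside_range() {
    start="$1"
    git for-each-ref --format='%(refname:short)' refs/tags | while read -r tag; do
        commit=$(git rev-parse "$tag^{commit}")
        # In range means: reachable from HEAD but not reachable from $start.
        if git merge-base --is-ancestor "$commit" HEAD &&
           ! git merge-base --is-ancestor "$commit" "$start"; then
            continue
        fi
        echo "$tag"
    done
}
```

Tags on release branches that were never merged back into HEAD would show up in this list, as would tags at or before the start commit.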

@weierophinney
Author

@renatomefi - thanks for the tip on the tmpfs! I started the run on my HDD, and calculated it was going to take ~10 hours to do the full run; on the tmpfs filesystem, it looks like < 3 hours! I'll know before I go to bed if it worked!
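
For anyone reproducing this: filter-branch has a -d option that sets the temporary directory used for rewriting, so only that directory needs to live in RAM. A minimal sketch, assuming /dev/shm is a tmpfs mount (common on Linux, but not guaranteed); the function name pick_rewrite_dir is hypothetical.

```shell
#!/bin/sh
# Pick a scratch directory for filter-branch, preferring RAM-backed storage.
# Assumes /dev/shm is tmpfs when it exists; falls back to the default temp dir.
pick_rewrite_dir() {
    if [ -d /dev/shm ] && [ -w /dev/shm ]; then
        mktemp -d /dev/shm/filter-branch.XXXXXX
    else
        mktemp -d
    fi
}
```

The result would be used as git filter-branch -d "$(pick_rewrite_dir)/rewrite" followed by the usual filters; the extra path component is there because filter-branch expects to create the directory itself.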

@renatomefi

@weierophinney you are welcome! tmpfs is a valuable tool! :)

I'm going for a simpler solution now: since you have a "from" and "to" structure, I plan to use bfg (https://rtyley.github.io/bfg-repo-cleaner/) to clean out everything that is not needed (including all components other than the current one) and then use your tree-filter script to apply your logic.
Maybe after the deletion step I can create a backup, so I can run filter-branch several times while testing!

Good luck!
