@tmacam
Created June 12, 2011 19:13
Short guide on how to export code to Git and tidy up its history

Exporting code to Git and tidying up its history

Author

Tiago Alves Macambira [tmacam burocarata org]

Licence

Creative Commons By-SA

Introduction

So you have that awesome (or perhaps not so awesome, but at least not shameful) project of yours that is just lingering around, collecting dust, and you thought: "What if I released this code to the world? Would I get famous? Would I become rich? Would I become the next Linus Torvalds?" Well, I would hate to disappoint you, but the answer to those questions is probably no.

Nevertheless, be it for self-promotion, for pure generosity or just for the sake of having a third-party-maintained backup of your code, releasing it to the world is a Good Thing (tm), and something that would earn you some karma points -- and we are all short on those, right?

"B-b-but", you say, "my code is in a <ancient, restraining, démodé or plain untrendy by the latest standards> Version Control System and I would like to do what all the other cool kids are doing and export it as a Git repository in, say, GitHub or... like... whatever..."

Fear no more, dear sheep, this guide is for you.

Objectives

This short guide's purpose is to show you how to export a project from another Version Control System -- or even from another Git repository -- such that its history is represented as cleanly and linearly as possible.

Perhaps this project is part of an old corporate project that you just got approval to release as open source and, although you would like it to retain as much history as possible when you release it, you still have a need (or an obligation) to strip any and all sensitive corporate information from it. Maybe that information takes the form of sensitive log files that found their way into the repository, or of personal information (e-mails, usernames) in the commit log messages.

So, it's not a matter of just removing, renaming, copying or moving files around and committing -- as those files would still show up in history, revealing the information you wanted to protect and taking unnecessary repository space -- but of doing some serious cleaning and re-structuring in the source code, its history and associated meta-data -- whatever that is. This short guide is also about that.

Starting things up

Importing from the previous version control system

So, the first thing you must do is import your project from the Version Control System where it currently resides into a Git repository.

If it is a subversion repository, git-svn will do just fine. If you are using something else, say, perforce or CVS, similar tools exist to convert your project and its history to Git. You may need to do an intermediate conversion, say, from CVS to subversion and then from subversion to Git.

For simplicity, let's assume you have a project in a subversion repository. Let's also assume that the URL for this repository root is svn+ssh://svn.example.tld/secure/repositories/meh_project/ and that your project (or the files you want to export) is located in aux/super_dupper_code. The following command would fetch this project and its history from subversion into a new Git repository:

git svn clone --no-metadata \
 svn+ssh://svn.example.tld/secure/repositories/meh_project/aux/super_dupper_code

First, notice that we are not converting the whole repository to Git: we are limiting as much as possible what we are importing from subversion by grabbing just the code inside the super_dupper_code directory. If, for some reason, you had to import the whole repository into Git, do not worry: we will explain how to "prune" it later.

The import may bring in some extra files that you may want to remove, say, because they are lame, for some legal reason or because they contain sensitive information that it is not OK to share with the whole world. We will completely remove them and their history from the Git repository later, hopefully leaving no trace of them whatsoever.

Since we have no interest in exporting any changes we make back to its original subversion repository, we are using the --no-metadata option here. It will also get rid of some extra git-svn-id: lines that git-svn adds at the end of every commit. Had we not used the --no-metadata option, we would need to edit the commit messages to remove them. We will also show how to modify commit meta-data (commit messages, commit authors etc) later.

Cloning a repository

"We did not even get started and we are already cloning my repository? What gives?", you may ask.

Most of the steps we will give you in the following sections will alter your Git repository in semi-destructive ways, making heavy use of git-filter-branch. I say semi-destructive because although git-filter-branch almost always makes a copy of your repository's previous state, getting back to this state may be complicated or, depending on the kind of modification performed by git-filter-branch, impossible.

Additionally, it is in our interest to get rid of any "previous state" left behind, and properly cloning a repository does the trick.

So, to avoid regrets and problems, let's first make a proper backup or clone of your git repository:

git clone --no-hardlinks /XYZ /ABC

Using --no-hardlinks makes Git create the clone by really copying the files instead of hard-linking them. This way the original repository won't share files and metadata with its clone. See the man page if you have no idea of what I am talking about.

Another way to get the same effect is by using a file:///path/to/your/git/repo URL (note the three slashes: two for the scheme, one for the root of the absolute path), as documented in the section "Checklist for shrinking a repository" from git filter-branch manpage:

git clone file:///full/path/to/XYZ /ABC

With your backup done, let's move on to destroying and reconstructing your Git history.
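
If you want some reassurance that the backup really is a complete copy before you start wrecking things, comparing both repositories is a cheap check. The sketch below is only an illustration: it builds a throwaway repository under a temporary directory (XYZ and ABC mirror the placeholders above, and the identity is made up) and confirms that the clone has the same HEAD and the same number of commits.

```shell
set -e
work=$(mktemp -d)

# Stand-in for the freshly imported repository (/XYZ in the text).
git init -q "$work/XYZ"
cd "$work/XYZ"
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.tld"
echo hello > README
git add README
git commit -q -m "Initial commit"
echo world >> README
git commit -q -am "Second commit"

# The working copy we are allowed to wreck (/ABC in the text).
git clone -q --no-hardlinks "$work/XYZ" "$work/ABC"

# Compare the tip commit and the number of commits on both sides.
src_head=$(git -C "$work/XYZ" rev-parse HEAD)
dst_head=$(git -C "$work/ABC" rev-parse HEAD)
src_count=$(git -C "$work/XYZ" rev-list --count --all)
dst_count=$(git -C "$work/ABC" rev-list --count --all)
echo "original: $src_head ($src_count commits)"
echo "clone:    $dst_head ($dst_count commits)"
```

On a real project you would only run the last few commands, pointing them at your own two paths.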

Pruning files from history

The import may have brought some extra files. Now it's time to remove them and prune the history we have in our Git repository.

Removing them from Git with a git rm will just remove them from the last commit, but it will still leave traces and previous versions of those files in our Git history -- not really what we wanted. We want to remove any trace of them from the Git repository.

Extract a single directory

Suppose you had to bring more files from your previous VCS than you originally wanted. Say, you imported a whole CVS repository into Git and all you wanted was a project that lives inside a particular subdirectory. In this case, instead of removing all the other files and directories, it would be simpler (and saner) to extract the target subdirectory from the whole mess.

Let's suppose your target subdirectory path is projects/parsing/htmlparser. The following commands would detach it from your repository, leaving nothing but it and its history:

git filter-branch --subdirectory-filter projects/parsing/htmlparser -- --all
git reset --hard
git gc --aggressive
git prune

Notice that the first command ends in -- --all. That's right: two dashes, a space, then dash-dash-all. The lone -- separates filter-branch's own options from the revision arguments, and --all makes Git rewrite the history of all the branches and tags you have.

Now your repository consists only of the contents of projects/parsing/htmlparser and its history. Nothing more, nothing less. Well, you may have mentioned other files in your commit messages but they will not be there.
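
To get a feeling for what the subdirectory filter does before running it on real code, you can try it on a disposable repository. The following is a minimal sketch under a few assumptions: the file names and the identity are invented, the `-- --all` form is the one shown in the git-filter-branch manpage, and FILTER_BRANCH_SQUELCH_WARNING is set only because newer Git releases otherwise print a long warning and pause before filter-branch starts.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1      # skip the warning/pause in newer Git
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.tld"

# A repository holding the project we want plus unrelated files.
mkdir -p projects/parsing/htmlparser other
echo "parser code" > projects/parsing/htmlparser/parser.c
echo "unrelated"   > other/notes.txt
git add .
git commit -q -m "Import everything"
echo "more code" >> projects/parsing/htmlparser/parser.c
git commit -q -am "Improve parser"

# Make the subdirectory the new root of the whole history.
git filter-branch --subdirectory-filter projects/parsing/htmlparser -- --all >/dev/null 2>&1

files=$(git ls-files)
commits=$(git rev-list --count HEAD)
echo "files now at the root: $files"
echo "commits kept: $commits"
```

Only parser.c should be left, sitting at the top level, with both commits that touched it still in the log.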

Remove files and directories from history for real

So, by now we limited our history to the enclosing subdirectory holding all the files we wanted. But there may still be some extra files that you may not want to export because they are lame, for some legal reason or because they contain sensitive information that it is not OK to share with the whole world. Let's erase them from our repository and from its history altogether.

To remove a file or a directory named path/to/SensitiveLogs from your repository, run:

git filter-branch --index-filter \
  "git rm -r -f --cached --ignore-unmatch path/to/SensitiveLogs" \
  --prune-empty -- --all

Remove all the files and directories you don't want exported using the command above, one path at a time.
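
After a removal it is worth checking that no branch or tag still carries the path. The sketch below is a hedged illustration on a throwaway repository (file names and identity are made up): it removes path/to/SensitiveLogs and then asks git log whether any commit reachable from a branch or tag still touches it. The check deliberately uses --branches --tags rather than --all, because filter-branch keeps a backup of the old history under refs/original until you make the final clone.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1      # skip the warning/pause in newer Git
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.tld"

mkdir -p path/to/SensitiveLogs
echo "secret" > path/to/SensitiveLogs/app.log
echo "code"   > main.c
git add .
git commit -q -m "Add code and logs"
echo "more" >> main.c
git commit -q -am "More code"

# Same removal as in the text, applied to every branch and tag.
git filter-branch --index-filter \
  "git rm -r -f --cached --ignore-unmatch path/to/SensitiveLogs" \
  --prune-empty -- --all >/dev/null 2>&1

# Commits on branches or tags that still touch the path (should be none).
leftovers=$(git log --branches --tags --oneline -- path/to/SensitiveLogs | wc -l | tr -d ' ')
tracked=$(git ls-files)
echo "commits still touching the path: $leftovers"
echo "tracked files: $tracked"
```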

Fixing and tidying meta-data

OK. As far as files and their history go, your repository is clean and neat. But during the process of converting your project and its history to Git, some commit information such as commit authors and commit messages may have been lost or altered. Perhaps your commit messages mention sensitive data or show your previous and now invalid e-mail. Time to fix that.

Fix committer information

Let's start this section with the committer information: its name and e-mail address.

Once again, we will use git-filter-branch to edit our commit history. This time, though, I will show two ways of accomplishing the same task.

The first is somewhat more elaborate, as it shows how one can programmatically alter the committer information. Say, for instance, that all committers' meta-data are OK except for one committer's, so you just want to alter the commits related to this guy. Let's say that this guy was you, using a now invalid e-mail address. All you have to do is alter only the commits made under that old identity (the example below matches on the committer name; testing $GIT_COMMITTER_EMAIL works the same way). Here is how:

git filter-branch --commit-filter '
        if [ "$GIT_COMMITTER_NAME" = "tmacam" ];
        then
                GIT_AUTHOR_NAME=`git config --get user.name`;
                # or ...="Your (full) Name";
                GIT_AUTHOR_EMAIL=`git config --get user.email`;
                # or ...="<your.email@example.tld>";
                GIT_COMMITTER_NAME=$GIT_AUTHOR_NAME;
                GIT_COMMITTER_EMAIL=$GIT_AUTHOR_EMAIL;
                git commit-tree "$@";
        else
                git commit-tree "$@";
        fi' HEAD

Notice that this command assumes that you have already configured your identification information in git. If that is not the case, just replace those git config --get xxxxxxx commands with "Your name" and "<your.email@example.tld>".

Anyway, as you can see, with some Bash programming kung-fu you can create pretty elaborate logic on how to replace or modify committers' meta-data.

If all you want is to replace all committer information with a single identity, the following one-liner would do the trick:

git filter-branch --env-filter '
    # Quote the values: the name contains spaces and parentheses.
    GIT_AUTHOR_NAME="Your (Full) Name"
    GIT_AUTHOR_EMAIL="your.email@example.tld"
    GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME"
    GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
    export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL
    export GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL'

And that's it. All commits will be attributed to "Your (Full) Name <your.email@example.tld>".
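
The same --env-filter can also be made conditional, which covers the "only fix the commits with my old e-mail" case described for the first approach. This is just a sketch, not the only way to do it: the old address old@invalid.tld, the second contributor and the file names are all invented for the demonstration, and it runs on a disposable repository so you can inspect the result safely.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1      # skip the warning/pause in newer Git
demo=$(mktemp -d)
cd "$demo"
git init -q .

# One commit under the outdated identity, one by somebody else.
echo one > one.txt
git add one.txt
git -c user.name="tmacam" -c user.email="old@invalid.tld" commit -q -m "Old identity"
echo two > two.txt
git add two.txt
git -c user.name="Other Dev" -c user.email="other@example.tld" commit -q -m "Someone else"

# Rewrite author and committer only where the old e-mail shows up.
git filter-branch --env-filter '
    if [ "$GIT_AUTHOR_EMAIL" = "old@invalid.tld" ]; then
        GIT_AUTHOR_NAME="Your (Full) Name"
        GIT_AUTHOR_EMAIL="your.email@example.tld"
        export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL
    fi
    if [ "$GIT_COMMITTER_EMAIL" = "old@invalid.tld" ]; then
        GIT_COMMITTER_NAME="Your (Full) Name"
        GIT_COMMITTER_EMAIL="your.email@example.tld"
        export GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL
    fi' -- --all >/dev/null 2>&1

# Audit: the old address should be gone, the other contributor untouched.
old_left=$(git log --branches --format='%ae %ce' | grep -cF 'old@invalid.tld' || true)
other_kept=$(git log --branches --format='%ae' | grep -cF 'other@example.tld' || true)
git --no-pager log --branches --format='%h %an <%ae>'
```

The closing git log line is also a handy audit after the unconditional variant: every commit should list the same name and address.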

Fix log messages

Now time to tidy up those commit log messages. Guess what we will use for this: git-filter-branch and its --msg-filter option. You can perform almost any kind of editing with this duo: add lines, remove lines, replace text. Just give it the name of a program that will alter the log messages and that's it. The sky is the limit. :)

So, here is a short example of a command that will remove all those nasty "git-svn-id:" lines that you got in your log messages just because you did not read what I wrote in the Importing from the previous version control system section:

git filter-branch --msg-filter ' sed -e "/^git-svn-id:/d" '
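
If you would like to see the message filter at work first, here is a small, hedged demonstration on a throwaway repository. The commit message, the SVN URL and the identity are made up; the filter itself is the sed command from above. It counts the git-svn-id: lines before and after the rewrite and checks that the subject line survives.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1      # skip the warning/pause in newer Git
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.tld"

echo code > main.c
git add main.c
# Simulate a commit imported without --no-metadata.
git commit -q -m "Fix parser" \
  -m "git-svn-id: svn+ssh://svn.example.tld/meh_project@42 made-up-uuid"

before=$(git log --branches --format=%B | grep -c '^git-svn-id:' || true)

git filter-branch --msg-filter 'sed -e "/^git-svn-id:/d"' -- --all >/dev/null 2>&1

after=$(git log --branches --format=%B | grep -c '^git-svn-id:' || true)
subject=$(git log -1 --format=%s)
echo "git-svn-id lines before: $before, after: $after"
echo "subject kept: $subject"
```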

Final steps

Shrink your repository.

Now that your repository, its history, commits and their log messages are all clean, tidy and free from shameful or sensitive information, it is time to do one last thing: shrink your repository.

See, as I said before in the Cloning a repository section, git-filter-branch stores copies of the previous state of the repository as it modifies it. Now that we got here, we don't need or want those copies. Time to get rid of them.

Go back to the Cloning a repository section and create another clone of your repository using the procedures explained there. This should give you a clean and neat clone to export/upload.
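
To see why the fresh clone matters, the following sketch (again on a disposable repository with invented file names and identity) rewrites history once, shows that the rewritten repository still holds a backup under refs/original, and then shows that a clone made through a file:// URL carries neither that backup nor the removed file's objects.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1      # skip the warning/pause in newer Git
work=$(mktemp -d)
git init -q "$work/ABC"
cd "$work/ABC"
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.tld"

echo secret > big.log
echo code   > main.c
git add .
git commit -q -m "Add code and a log"

git filter-branch --index-filter \
  "git rm -q --cached --ignore-unmatch big.log" -- --all >/dev/null 2>&1

# filter-branch left a backup of the old branch tip behind.
backups_before=$(git for-each-ref refs/original | wc -l | tr -d ' ')

# $work is an absolute path, so this expands to file:///...
git clone -q "file://$work/ABC" "$work/clean"

backups_after=$(git -C "$work/clean" for-each-ref refs/original | wc -l | tr -d ' ')
old_objects=$(git -C "$work/clean" rev-list --all --objects | grep -c 'big.log' || true)
echo "backup refs: $backups_before before cloning, $backups_after in the clone"
echo "objects named big.log in the clone: $old_objects"
```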

Final check

Use a tool like GitX or gitk to analyse your history and look for any missing or pending problem. Are there any empty branches you want to remove? Do the commit messages look good? Does your project have any tag or branch that should not be exported or that makes no sense to export? Remove them.
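
If you prefer the terminal to GitX or gitk, plain git can give a similar overview. The sketch below is one possible way of doing it, run on a throwaway repository: the branch name internal-experiment and the tag name customer-demo-build are hypothetical examples of things you would not want to publish.

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.tld"

echo code > main.c
git add main.c
git commit -q -m "Initial commit"
git branch internal-experiment              # hypothetical private branch
git tag customer-demo-build                 # hypothetical tag not worth exporting

# Text-mode overview of every branch and tag and where they point.
git --no-pager log --graph --oneline --decorate --all

# Drop what should not leave the building.
git branch -D internal-experiment >/dev/null
git tag -d customer-demo-build >/dev/null

left=$(git for-each-ref --format='%(refname)' refs/heads refs/tags | wc -l | tr -d ' ')
echo "branches and tags left: $left"
```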

Fix those issues and shrink your repository once again. Yeah, you heard me right: go clean your repo once again!

Good boy.

Export it.

Well, time to export :-) Hooray! But export to where?

Well, there are countless options -- you could set up your own git environment or use something like GitHub. I strongly recommend the latter. Just head to GitHub's page, set up an account and click on the "New repository" button. Fill in the form and follow the steps presented there. And that's it :-) Your code now lives in a public Git repository and is there for the whole world to see. Hope you are proud of it -- I really am. ;)
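
Whatever hosting you pick, the upload itself boils down to adding a remote and pushing your branches and tags. Here is a rough sketch of those steps; to keep it runnable anywhere, a local bare repository stands in for the empty repository you would create on GitHub, and the tag name and identity are made up. With a real host, the remote URL is the one shown on the new repository's page.

```shell
set -e
work=$(mktemp -d)

# Stand-in for the empty repository created on the hosting service.
git init -q --bare "$work/remote.git"

# Stand-in for your cleaned-up clone.
git init -q "$work/clean"
cd "$work/clean"
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.tld"
echo code > main.c
git add main.c
git commit -q -m "Initial commit"
git tag v0.1                                # hypothetical release tag

# Point "origin" at the new home and publish everything.
git remote add origin "$work/remote.git"
git push -q origin --all                    # every branch
git push -q origin --tags                   # every tag

pushed=$(git -C "$work/remote.git" for-each-ref --format='%(refname)' | wc -l | tr -d ' ')
echo "refs now on the remote: $pushed"
```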

Closing remarks

Some missing things and TODOs

I merely covered the steps I usually perform when I move code to GitHub from old subversion and CVS repositories of mine that used to hold stuff from my master's and PhD -- so, there you have it, lame code ;)

This means that there are tons of stuff I don't cover here. For instance:

  • How to add a copyright notice to all header files, from their first commit and make them persist across all changes?
  • How to do the opposite: remove comments or copyright notices from files and make this removal persist across changes to the files?
  • Edit the contents of some particular commit message.

And so many other issues I don't have to deal with since I own the code I am releasing. Or because I am too lazy to fix everything. :-)

Your mileage may vary ;)

This guide was published on my blog, at http://www.burocrata.org/blog/archives/2011/06/12/396/exporting-code-to-git-and-tiding-up-its-history/

tmacam commented Jun 12, 2011

Moving subdirectories

First, remove any empty commit

# from http://kerneltrap.org/mailarchive/git/2008/10/30/3860634
git rev-list HEAD | while read c; do [ -n "$(git diff-tree --root $c)" ] || echo $c; done > revs

git filter-branch --commit-filter '
  if grep -q "$GIT_COMMIT" '"$(pwd)/"revs';
  then
    skip_commit "$@";
  else
    git commit-tree "$@";
  fi' HEAD

Now, move the "current directory" to a new location

# this one comes from git-filter-branch manpage
# Found via http://stackoverflow.com/questions/277029/combining-multiple-git-repositories

git filter-branch -f --prune-empty --index-filter \
        'git ls-files -s | sed "s-\t\"*-&newsubdir/-" |
                GIT_INDEX_FILE=$GIT_INDEX_FILE.new \
                        git update-index --index-info &&
         mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"' HEAD


# http://stackoverflow.com/questions/1425892/how-do-you-merge-two-git-repositories

tmacam commented May 8, 2012

Adding files "in the past"

Sometimes you may need to add files to the beginning of the history (LICENSE and AUTHOR come to mind).

http://stackoverflow.com/questions/3895453/how-do-i-make-a-git-commit-in-the-past discusses how.

@RichardDally

Awesome post, thank you.
