amake/git-svn-jenkins.org

Automating an svn-to-git mirror with hosted CI

There are plenty of tutorials for migrating from Subversion to git via git-svn, but they all seem to be focused on a one-time migration after which further development will happen in git. Otherwise the obvious use case for git-svn is where a lone developer uses git to push and pull to a Subversion repository from his local machine.
Like me, however, you may find yourself involved in a Subversion-hosted FOSS project with conservative leadership that strongly prefers Subversion to git. This project may also have an official git mirror of the svn repo, synced for years from a core developer’s personal machine.
You may then find yourself spearheading an effort to move this mirroring infrastructure to your project’s newly licensed hosted CI service.
This article discusses two main topics:

  Pitfalls of reproducing git-svn clones
  Setting up svn-to-git mirroring on Jenkins

Reproducing git-svn clones

git-svn was clearly not intended to be used in the context of continuous integration. I found this out the hard way when I tried to reproduce my project’s official git tree from the official svn repository.
First, consider this environment: You have

  1. an svn repo (accessible to you),
  2. a git-svn clone of said repo (not accessible to you), and
  3. a git repo that is the result of pushing the master branch of #2 (accessible to you).

You might think that you could git clone repo #3 and then run git svn fetch on it to pull in any commits that have since been added to repo #1. But you’d be wrong. A git-svn clone contains a bunch of metadata that is not pushed (or even pushable?) to a remote. If #2 disappears and you want to continue to update #3, you will need to reproduce #2.
By “reproduce” I mean, essentially, “run git svn clone and produce as a result a git tree with a HEAD SHA-1 matching that of the git remote you want to push to”. If the SHA-1 hash of the HEAD commit doesn’t match, then we can’t push future fetched commits to the git remote without force-pushing, which in my case would inconvenience a lot of people and mess up the history.
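Whether a reproduction attempt succeeded can be checked by comparing the two hashes directly. The following is a minimal sketch; the function name `heads_match`, its arguments, and the default branch name `master` are assumptions made for illustration.

```shell
#!/bin/sh
# heads_match <local-clone-dir> <git-remote-url> [branch]
# Succeeds only if the local clone's HEAD equals the tip of the remote branch.
heads_match() {
    local_sha=$(git -C "$1" rev-parse HEAD) || return 2
    # ls-remote prints "<sha><TAB><ref>"; keep only the hash.
    remote_sha=$(git ls-remote "$2" "refs/heads/${3:-master}" | cut -f1)
    [ -n "$remote_sha" ] && [ "$local_sha" = "$remote_sha" ]
}
```

For example, `heads_match ./reproduced-clone "$GIT_MIRROR_URL" && echo "safe to push without forcing"`, where both the directory and `$GIT_MIRROR_URL` are placeholders.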
There are some pitfalls in terms of what goes into the hash.
Authors

The git hash is partially based on the author of the commit. Subversion only records a username, while git uses a name + email identifier such as Aaron <aaron@example.com>. git-svn lets you provide an authors file that maps Subversion usernames to git identifiers.
My project’s official git repo uses an authors file, so I would need to use an identical file as well. Luckily some people have come up with scripts to generate templates. My project has only a handful of contributors, so it wasn’t too hard to recreate this by looking at the official git repo.
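A template generator of that kind can be sketched as follows. It reads the output of `svn log -q` on standard input and prints one `username = Name <email>` line per distinct committer, which is the format git-svn expects in an authors file. The function name and the `example.com` placeholder addresses are assumptions; every line still has to be edited by hand so that it matches the identities already used in the existing git repo.

```shell
#!/bin/sh
# authors_template: turn `svn log -q` output (stdin) into an authors-file template.
# Revision lines look like: "r12 | alice | 2012-01-02 10:00:00 +0000 (...)".
authors_template() {
    awk -F'|' '/^r[0-9]+ \|/ {
        user = $2
        gsub(/^ +| +$/, "", user)   # trim the padding around the username
        # Placeholder identity; replace with the real name and email.
        print user " = " user " <" user "@example.com>"
    }' | sort -u
}
```

Typical use would be `svn log -q "$URL" | authors_template > authors.txt`.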
Repository URL

The git hash is also based on the URL of the source Subversion repository, because git-svn appends a git-svn-id line containing that URL to every commit message. Unfortunately for us, the URL changed partway through the history of the project, at the whim of our host.
So what if commits 1 through N were synced from a now-defunct repository URL? This makes reproducing the history difficult, but not impossible. What I did was:

  Determine the old URL from the git-svn-id metadata visible via git log. More bad news: the URL uses HTTPS.
  Create a VM and set up an apache+svn server with a self-signed SSL cert.
  Clone the official svn repo and serve it from the VM.
  Adjust /etc/hosts on my main environment so that the old URL’s hostname resolves to the VM.
  Do an initial git svn clone up to revision N. That basically looked like:
    git svn clone -s -Aauthors.txt -r 0:N $URL
    
  
  Hope for the best!
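The first step above, finding every svn URL recorded in the history, can be sketched as a small filter over `git log` output. The function name is an assumption; the `git-svn-id: <url>@<revision> <uuid>` line format is what git-svn writes into commit messages by default.

```shell
#!/bin/sh
# svn_urls_from_log: read `git log` output on stdin and print each distinct
# svn URL found in git-svn-id lines (the URL includes the branch path).
svn_urls_from_log() {
    sed -n 's/^ *git-svn-id: \(.*\)@[0-9][0-9]* .*$/\1/p' | sort -u
}
```

Running `git log | svn_urls_from_log` in the official git repo should list more than one URL if the repository moved at some point.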

Once this had successfully reproduced the history up to revision N, the plan was to change the svn URL in .git/config and fetch the remaining commits.
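Repointing the clone at the new URL might look like the sketch below. It assumes the svn remote carries git-svn's default name, `svn`, so that the setting lives under `[svn-remote "svn"]` in .git/config; the function name and `$NEW_URL` are hypothetical.

```shell
#!/bin/sh
# switch_svn_url <clone-dir> <new-svn-url>
# Rewrites the url key of the default git-svn remote in .git/config.
switch_svn_url() {
    git -C "$1" config svn-remote.svn.url "$2"
}
```

This would be followed by something like `switch_svn_url . "$NEW_URL" && git svn fetch`.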
Magic

Unfortunately this did not work as planned! There is a commit in the svn repo that is present in the official git repo but was skipped when I cloned locally, and I still can’t figure out why. The commit in svn is associated with a particular file path, but the content of the change is apparently empty. I have two theories about what happened:

  The svn commit was modified to be essentially a no-op after it was fetched by git-svn, or
  The commit dates back to around 8 years ago; it’s possible that the behavior of svn and/or git has changed since it was first fetched.

I did investigate using a version of git contemporary with that commit to clone the repo, but it still skipped the commit in question.
Conclusion, part 1

I ultimately failed to reproduce the official git tree from the official repository. The moral of the story is that git-svn clones are themselves artifacts that must be curated if they are to be anything more than throwaway means for pushing and pulling.
I ended up asking the core developer who had been running the mirror to provide me with a tarball of his git-svn clone, and from this I proceeded to build our Jenkins infrastructure.
svn-to-git mirroring on Jenkins

Continuous integration is a wonderful thing, and we are lucky enough to live in a world where FOSS projects have their choice of free hosted CI providers. My project never took advantage of this until I made a push to do so.
We settled on CloudBees, partly because I’m familiar with Jenkins, and partly because their FOSS program includes compute resources (unlike, e.g., Atlassian Bamboo, which we also evaluated and which makes you provide your own AWS account).
So, how does one go about automating svn-to-git mirroring in Jenkins? Again we have some problems.
git-svn is not installed!

First of all, git-svn is not installed by default! This is hosted Jenkins so we don’t have root access! In fact, we don’t even have writable persistent storage!
What we do have is a Fedora 17-based slave template (configuration for job executors) with standard compilers and dev tools pre-installed. So we build git-svn!
After a whole lot of trial and error, I managed to create a Jenkins job that builds git-svn and installs it to the local user, then packages the entire thing into a build artifact so it can be used in other jobs. See build-svn-core.sh for the main build script.
Once you have a successful run of this job saved, you can use the Copy Artifact Plugin to copy the artifacts into your actual svn-to-git mirroring job, set up the Perl environment variables, and run git-svn.
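Setting up the environment for the copied artifact could look roughly like this. It is only a sketch: the directory layout under the prefix (`bin`, `lib`, `lib/perl5`) is an assumption about how the build was installed, and the function name is hypothetical. The point is that the shell has to find the git-svn executable, Perl has to find the Subversion bindings, and the dynamic loader has to find the Subversion libraries those bindings use.

```shell
#!/bin/sh
# use_local_git_svn <prefix>
# Expose a user-local git-svn install unpacked under <prefix> (assumed layout).
use_local_git_svn() {
    prefix=$1
    PATH="$prefix/bin:$PATH"
    # Perl modules (e.g. the Subversion bindings) installed under the prefix.
    PERL5LIB="$prefix/lib/perl5${PERL5LIB:+:$PERL5LIB}"
    # Native libraries that the Perl bindings load at run time.
    LD_LIBRARY_PATH="$prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PATH PERL5LIB LD_LIBRARY_PATH
}
```

In a Jenkins shell step this might be called as `use_local_git_svn "$WORKSPACE/git-svn"` before running `git svn --version` as a smoke test; the `git-svn` subdirectory name is a placeholder.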
No persistent storage

If you run your own Jenkins installation you can decide how you want to handle storage, but in the case of CloudBees DEV@cloud you have to assume that your entire environment is wiped out between runs.
You actually do get some persistent storage, in the form of the DEV@cloud Private WebDAV Repository. This is mounted at /private/$ACCOUNT_NAME, but it is read-only.
So are we going to have to re-make our clone from scratch every time? git svn clone is slow as molasses; that would take forever!
In the first half of this article we discovered that you can’t necessarily reproduce a git-svn clone anyway, so we already know we have to “seed” our mirroring job with an existing clone. Why don’t we store that in our read-only persistent storage, and copy it into our job on each run? That would work at first, but git svn fetch would get slower and slower over time as the svn repo diverges from our seed.
My solution was to take the moral from the first half of the article to heart: git-svn clones are themselves artifacts. Thus we use the following strategy:

  1. Copy the tarballed git-svn clone archived as a build artifact from the previous run of our mirroring job.
  2. (Bootstrap: If the previous clone was not available, copy the “master seed” from /private/$ACCOUNT_NAME.)
  3. Run git svn fetch, push to official git mirror.
  4. Tar up the clone and archive as a build artifact to be copied in #1 on the next execution.
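The restore and archive steps of this strategy can be sketched as two small functions. This is an illustration rather than the actual job script: the function names, tarball names, and directory arguments are all assumptions, and the fetch-and-push step in between is left out.

```shell
#!/bin/sh
# restore_clone <previous-artifact.tar.gz> <seed.tar.gz> <dest-dir>
# Unpack the clone from the previous build if present, else from the seed.
restore_clone() {
    prev=$1 seed=$2 dest=$3
    mkdir -p "$dest" || return 1
    if [ -f "$prev" ]; then
        tar -xzf "$prev" -C "$dest"
    else
        # Bootstrap: no earlier build artifact, fall back to the master seed.
        tar -xzf "$seed" -C "$dest"
    fi
}

# archive_clone <clone-dir> <output.tar.gz>
# Pack the updated clone so the next build can restore it.
# The output file must live outside <clone-dir>.
archive_clone() {
    tar -czf "$2" -C "$1" .
}
```

A job would call `restore_clone prev/clone.tar.gz "/private/$ACCOUNT_NAME/seed.tar.gz" work`, run the fetch and push inside `work`, and finish with `archive_clone work clone.tar.gz` so that Jenkins can archive `clone.tar.gz`.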

(You may have guessed that despite my claim of “no persistent storage”, Jenkins does store “build artifacts” semi-persistently in some master location. These can be copied with the aforementioned Copy Artifact Plugin, and can be culled according to per-job settings such as number of days since build or number of builds retained.)
See sourceforge-svn-git-sync.sh for the main build script of this job.
Note that by necessity this job will permanently occupy part of your storage quota, roughly (size of the git-svn clone) × (number of retained builds). Note also that for garbage collection to work properly you must make sure it is run synchronously, e.g. by setting git config gc.autodetach false.
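The gc setting can be applied once to the seed clone, since it is stored in the clone's own .git/config and therefore travels with the tarball. A minimal sketch follows, with a hypothetical function name. My reading of why it matters is that a detached background gc could still be rewriting the repository while the job is tarring it up.

```shell
#!/bin/sh
# enable_sync_gc <clone-dir>
# Make automatic garbage collection run in the foreground for this clone.
enable_sync_gc() {
    git -C "$1" config gc.autoDetach false
}
```

Calling `enable_sync_gc work` before the fetch step would be enough; `work` is a placeholder for wherever the clone was unpacked.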
Conclusion, part 2

git-svn is really, really not suited for use in CI environments, but with some persistence it is possible to get the job done.