I had a need to grab a backup of a remote Maven repo and save it to AWS S3 for backup / archival purposes. In my case, it was an Artifactory Online instance that hosted jars and other artifacts that we own, but we did not have ssh access.
This is published with the intent of providing some inspiration for your own solution, as well as notes for myself. If you have something to add that might be useful for others, please open a PR.
This was tested against Artifactory, which documents several ways to authenticate.
In my case I used a header with an API token, which is Artifactory specific. However, for this Gist I will replace that with basic auth, which should work with (I think) a broader range of Maven repository implementations, such as Sonatype Nexus, and works with Artifactory as well.
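As an aside, if you'd rather keep credentials out of your shell history and process listings, wget will also read them from `~/.netrc` automatically, so you can drop `--user`/`--password` from the commands below. A sketch, using the example hostname and variables from this Gist:

```shell
# Optional alternative to --user/--password: a ~/.netrc entry, which wget
# picks up on its own. Hostname and variables are the examples used here.
cat >> ~/.netrc <<EOF
machine exampledotcom.jfrog.io
login $MVN_USERNAME
password $MVN_PASSWORD
EOF
chmod 600 ~/.netrc
```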
You need:
- wget - to clone the maven repo
- aws cli - if you want to archive to s3
- tar - tested with GNU tar
- local storage for the cloned artifacts to land
This is an example of setting up an EC2 Ubuntu 20 instance with an extra data volume.

```shell
sudo apt install wget curl awscli
sudo mkfs.xfs /dev/xvdb
sudo mount /dev/xvdb /mnt
sudo mkdir /mnt/maven-archives
sudo chown ubuntu:ubuntu /mnt/maven-archives
export LOCAL_ARCHIVE=/mnt/maven-archives
```
Wget is very clever at crawling (spidering) a remote site by parsing links out of HTML. This works well with a Maven repo because it is a self-listing, filesystem-like API, a bit like a WebDAV server.
```shell
cd $LOCAL_ARCHIVE
wget -m --no-parent -E -K -e robots=off --user=$MVN_USERNAME --password=$MVN_PASSWORD --reject ".html" https://$MAVEN_HOST/$REPO
```
In the case of Artifactory Online users, your `$MAVEN_HOST` might be set to something like `exampledotcom.jfrog.io/artifactory`. Please note the `/artifactory` bit.
The `--reject ".html"` option means that wget will follow and download HTML links in order to do its crawl, but it won't save them as files. You could alternatively save them and delete them afterwards with a `find` search.
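If you do let wget keep the listing pages, the after-the-fact cleanup could look something like this (a sketch; the `-print` shows each file as it is deleted, so dry-run it without `-delete` first):

```shell
# Sketch: remove saved directory-listing pages after the crawl. With -E the
# listings are named *.html, and -K may also leave behind *.orig backups.
find "${LOCAL_ARCHIVE:-.}" -type f \( -name '*.html' -o -name '*.orig' \) -print -delete
```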
One thing that is nice about the wget approach is that if you need to resume, or you need to update again later, you can just run it again. It will scan the whole remote recursively again, but it shouldn't re-download files it already has. More about this later.
Here is the full loop over all of the repos:

```shell
for REPO in libs-release-local libs-snapshot-local ext-release-local ext-snapshot-local \
    plugins-release-local plugins-snapshot-local example-snapshots example-releases \
    example-extras-production; do
  wget -m --no-parent -E -K -e robots=off --reject ".html" \
    --user=$MVN_USERNAME --password=$MVN_PASSWORD https://$MAVEN_HOST/$REPO
done
```
Since my target was to archive into Glacier, I didn't want to upload a bunch of small pom.xml files, since there is a per-object overhead. You can work out for yourself whether the juice is worth the squeeze for your use case.
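To get a feel for whether it matters, you could count the small files first. Each object stored in the S3 Glacier storage class carries roughly 40 KB of per-object metadata/index overhead, so thousands of tiny poms add up:

```shell
# Rough sizing: how many small files (poms and repo metadata) would each
# incur Glacier's ~40 KB per-object overhead if uploaded individually?
find "${LOCAL_ARCHIVE:-.}" -type f \( -name '*.pom' -o -name 'maven-metadata.xml' \) | wc -l
```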
To give it a little more fine-grained structure than just one giant tar, I identified all the groupId x artifactId combinations by locating the `maven-metadata.xml` files, and then combined all the versions of each into one tarball. I also construct a name for each archive by replacing slashes with underscores.
```shell
cd $LOCAL_ARCHIVE
mkdir tars
for TARGET in $(find ./ -type f -name 'maven-metadata.xml' | xargs dirname); do
  NAME="$(echo $TARGET | sed 's/\.\///g;s/\//_/g').tar.gz"
  echo tarring $TARGET ' -> ' $NAME
  tar czf "./tars/$NAME" $TARGET
done
```
Regarding the compression: I doubt it does much but burn CPU cycles, since the bulk of our content is already compressed in some way, such as .jar or .gz files, so the only thing left that is compressible is the XML and other metadata. You can test and decide for yourself; if you want it to run faster you could leave compression off or use a faster algorithm or setting.
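If you do decide to skip the compression, the loop above only needs a small tweak (a sketch, assuming GNU tar; you could instead swap `tar cf` for `tar -I 'gzip -1' -cf` to keep light, faster compression):

```shell
# Variant of the tar loop without gzip: plain .tar files, since the jars
# inside are already deflated. Run from $LOCAL_ARCHIVE as before.
mkdir -p tars
for TARGET in $(find ./ -type f -name 'maven-metadata.xml' | xargs dirname); do
  NAME="$(echo $TARGET | sed 's/\.\///g;s/\//_/g').tar"
  tar cf "./tars/$NAME" "$TARGET"
done
```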
I opted to use the S3 Glacier storage class, rather than the Glacier "direct" API.
```shell
export ARCHIVE_BUCKET=maven-archives.example.com
export PREFIX=2021-02-13
```
If needed, create the bucket:

```shell
aws s3api create-bucket --acl private --bucket $ARCHIVE_BUCKET
```
Upload using `cp`:

```shell
cd $LOCAL_ARCHIVE/tars
aws s3 cp --storage-class=GLACIER --recursive ./ s3://$ARCHIVE_BUCKET/$PREFIX/
```
What I did is really only suitable for an occasional, one-off job, I think. To make it work well as an automated, incremental archival or backup solution, I would make some adjustments.
For an "incremental archive" solution, I'd adjust the archives to be one tar per version of the artifact, and I'd maintain a local copy on disk, perhaps on an old, spare machine, and leave the $LOCAL_ARCHIVE path intact to avoid re-downloading the same files. You'd need to add some logic to only upload new archives.
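That "only upload new archives" logic could be as simple as a filename diff (a sketch; `remote-keys.txt` is an assumed file holding one already-uploaded filename per line, e.g. captured beforehand with `aws s3 ls s3://$ARCHIVE_BUCKET/$PREFIX/ | awk '{print $4}' | sort`):

```shell
# Sketch: print tarball names present locally but missing from the bucket
# listing, then feed each one to `aws s3 cp --storage-class=GLACIER`.
cd "${LOCAL_ARCHIVE:-.}/tars" || true
touch remote-keys.txt          # harmless if the listing already exists
find . -maxdepth 1 -name '*.tar.gz' | sed 's|^\./||' | sort > local-keys.txt
comm -23 local-keys.txt remote-keys.txt
```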
A much simpler approach, and probably not that much more costly, would be to use wget + aws s3 sync, and not worry about tar or compression. From there you can select appropriate storage classes using lifecycle rules on the S3 bucket. This would yield a sort of live backup that I think would work quite well.