@allenday
Last active September 3, 2018 10:07
Create URL list for Google Cloud Storage Transfer Service, see: https://cloud.google.com/storage/transfer/create-url-list
#!/usr/bin/perl
use strict;
use warnings;
my $http_base = shift;
my $path_base = shift;
my $prev_file = shift;
if ( ! $http_base || ! $path_base ) {
print STDERR <<"HERE";
USAGE:
$0 <HTTP base URL> <Filesystem base path> [<Previous output file>]
EXAMPLE:
$0 http://my.hostname.org/ ~/public_html/sync/ ~/TsvHttpData.pre
SYNOPSIS:
This script is used to generate a 'TsvHttpData' URL list for Google Cloud
Storage Transfer Service, see:
https://cloud.google.com/storage/transfer/create-url-list
The TsvHttpData file is tab-delimited and contains 3 columns:
* URL
* Object size, in bytes
* Base64 encoded MD5 checksum of object
MD5 checksumming is I/O intensive, so to improve efficiency for repeated runs
of this script, we use each file's modification time to determine whether the
previously calculated MD5 checksum can be reused or needs to be recalculated.
As such, this script's output contains 4 columns:
* URL
* Object size, in bytes
* Base64 encoded MD5 checksum of object
* Object modification time, in seconds since Unix epoch
The TsvHttpData format can be created from this script's output like so:
cat ~/TsvHttpData.pre | cut -f 1,2,3 > ~/public_html/TsvHttpData.tsv
HERE
exit(1);
}
if ( ! -d $path_base ) {
die "Not a directory: $path_base";
}
if ( ! $prev_file || ! -f $prev_file ) {
print STDERR "no previous file\n";
}
#recover previously cached object modification times
#and MD5 checksums
my %ent = ();
if ( $prev_file && open( P, '<', $prev_file ) ) {
<P>; #skip the TsvHttpData-1.0 header line
while ( my $line = <P> ) {
chomp $line;
my ( $url, $size, $md5, $mod ) = split /\t/, $line;
my $path = $url;
#map the cached URL back to its filesystem path
$path =~ s/\Q$http_base\E/$path_base/;
$ent{ $path } = [ $url, $size, $md5, $mod ];
}
close P;
}
print qq(TsvHttpData-1.0\n);
#iterate over files currently in path to sync.
#reuse cached data if it exists and isn't stale.
foreach my $path ( `find "$path_base" -type f` ) {
chomp $path;
#modification time
my $m = `stat -c %Y "$path"`;
chomp $m;
if ( $ent{ $path } && $m == $ent{ $path }[3] ) {
print join "\t", $ent{ $path }[0], $ent{ $path }[1], $ent{ $path }[2], $ent{ $path }[3];
print "\n";
next;
}
my $o = $path;
$o =~ s/\Q$path_base\E//;
$o = $http_base . $o;
#TODO make escaping more robust.
$o =~ s/%/%25/g;
#size in bytes
my $s = -s $path;
#base64 encoded md5 checksum
my $h = `openssl md5 -binary "$path" | openssl enc -base64`;
chomp $h;
print "$o\t$s\t$h\t$m\n";
}
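
For anyone looking for a worked example: the sketch below shows one way to run this as a periodic sync, using the hypothetical paths from the EXAMPLE above (web root ~/public_html/sync/ served at http://my.hostname.org/, previous output kept in ~/TsvHttpData.pre) and a made-up script filename, tsv_http_data.pl.

# regenerate the 4-column cache, reusing MD5 checksums where mtimes are unchanged
perl tsv_http_data.pl http://my.hostname.org/ ~/public_html/sync/ ~/TsvHttpData.pre > ~/TsvHttpData.pre.new
mv ~/TsvHttpData.pre.new ~/TsvHttpData.pre
# strip the 4th (modification time) column to get the 3-column TsvHttpData list
# that Storage Transfer Service fetches
cut -f 1,2,3 ~/TsvHttpData.pre > ~/public_html/TsvHttpData.tsv

Writing to a temporary file and moving it into place matters here: redirecting straight onto ~/TsvHttpData.pre would truncate the cache before the script had a chance to read it.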

rhamses commented Jun 9, 2018

Hey! Thanks for that! :)
Just a heads up: my Google Cloud transfer failed when I tried to upload the file generated directly from the script's output.
I had to erase the last field of each line, after the MD5 hash, to make it work. I don't know exactly what it stands for, though.
I'm transferring files from an Ubuntu 17.10 machine.
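
For what it's worth, that trailing field is the modification-time column the script keeps for its own cache; the Transfer Service expects only the 3-column TsvHttpData format (URL, size, MD5). Assuming the script's output was saved to ~/TsvHttpData.pre, the SYNOPSIS above strips it before upload like this:

cut -f 1,2,3 ~/TsvHttpData.pre > TsvHttpData.tsv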

@erandagan1000

Can you share an example of how you used this script?
Thanks
