@mikeatlas
Last active January 12, 2018 15:54
Syncing FTP to S3, one time, really fast.

Original idea from Transfer files from an FTP server to S3 by "Hack N Cheese".

I moved roughly a terabyte in under an hour. Granted, I couldn't take advantage of lftp's --parallel=30 switch because my FTP source limited me to one connection at a time, but --use-pget-n=N did seem to help.

  • Get a fast Ubuntu 14.04 EC2 box on Amazon for temporary use (I went with m1.xlarge) so data transfers aren't limited by your local bandwidth, at least. I also attached a fat 2TB EBS volume, symlinked it to /bigdisk, and made sure the EBS volume would be deleted after I terminated this EC2 box. I hope lftp 4.6.4 is available as a stable package by the next time I attempt this.
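The EBS attach-and-symlink step above looks roughly like this (a sketch; the device name /dev/xvdf and mount point are assumptions, so check `lsblk` for what your volume actually shows up as):

```shell
# Assumes the attached EBS volume appears as /dev/xvdf -- verify with lsblk first.
sudo mkfs.ext4 /dev/xvdf          # format the fresh volume
sudo mkdir -p /mnt/bigdisk
sudo mount /dev/xvdf /mnt/bigdisk # mount it
sudo ln -s /mnt/bigdisk /bigdisk  # symlink so paths below can use /bigdisk
```

Remember to tick "Delete on termination" for the volume (or set it with the AWS CLI) so you don't keep paying for 2TB after the transfer.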

  • Build lftp 4.6.4+ from source. (It's not easy to compile, so read the INSTALL file and plow through all your missing dependencies. You'll also need to run sudo ./configure && sudo make && sudo make install — in my case, without sudo they just wouldn't work.) Presently the Ubuntu apt package is lftp/trusty 4.4.13-1 amd64, so uninstall it if you had it previously, since the mirror options in that version of lftp are severely limited and some aren't available until at least 4.6.4.
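The build step above, as a sketch. The download URL, version, and dependency list are assumptions (lftp's dependencies vary by configure options; the INSTALL file is the authority):

```shell
# Remove the outdated apt package first.
sudo apt-get remove -y lftp

# Likely build dependencies (assumed; add whatever ./configure complains about).
sudo apt-get install -y build-essential libreadline-dev zlib1g-dev libgnutls28-dev

# Fetch, unpack, and build (version and mirror URL assumed).
wget http://lftp.yar.ru/ftp/lftp-4.6.4.tar.gz
tar xzf lftp-4.6.4.tar.gz
cd lftp-4.6.4
sudo ./configure && sudo make && sudo make install
```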

  • Run all of these in mosh and tmux window sessions, just in case your connection drops...

  • Run this on your ec2 box:

lftp -e " \
    debug -t 2; \
    set net:max-retries 3000; \
    set net:timeout 10m; \
    set ftp:charset iso-8859-1; \
    open ftp.yoursite.com; \
    mirror \
        --log log.txt \
        --use-pget-n=1000 \
        --use-cache \
        --continue \
        --loop \
        /your/ftp/remote/path /your/ec2/local/path; \
    exit; \
    "
  • Note: lftp's mirror command with --parallel=30 is only possible if your FTP server lets you open 30 simultaneous connections. In my case, I was limited to just 1 connection :(.

  • Then wget a copy of s3-parallel-put.py (my [fork](https://github.com/weftio/s3-parallel-put) supports regionalized buckets if you need that):
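Fetching the script might look like the following. The raw URL path and the dependency names are assumptions (s3-parallel-put is built on boto, and --content-type=guess needs python-magic; check the fork's README for specifics):

```shell
# Dependencies assumed: boto for S3, python-magic for --content-type=guess.
sudo pip install boto python-magic

# Raw URL assumed -- adjust to the actual file path in the fork.
wget https://raw.githubusercontent.com/weftio/s3-parallel-put/master/s3-parallel-put
chmod +x s3-parallel-put
```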

  • Do the parallel S3 put dance from inside /your/ec2/local/path:

/your/ec2/local/path$ python s3-parallel-put --bucket=weft-wind-data --secure --put=update --processes=50 --content-type=guess --verbose --log-filename=/tmp/s3pp.log /your/ec2/local/path

Wow, not so bad. Kinda. Except I had to hack together a pull request for s3-parallel-put to support my bucket, which lives outside the US Standard region; you may need this as well.

@sylvainemery h/t to your blog post.
