@mattparlane
Created March 13, 2023 02:08
Simple multi-thread R2/S3 backup
#!/usr/bin/env ruby
require 'aws-sdk-core'
require 'aws-sdk-s3'
require 'date'
require 'digest'
require 'fileutils'

date = Date.today.strftime('%Y-%m-%d')

client = Aws::S3::Client.new(
  access_key_id: 'XXXXXX',
  secret_access_key: 'XXXXXX',
  endpoint: 'https://XXXXXX.r2.cloudflarestorage.com/',
  region: 'auto',
)

client.list_buckets.buckets.each do |bucket|
  queue = Queue.new

  # list_objects returns a pageable response, so this walks every page of the
  # bucket listing. A failure restarts the listing for this bucket from the
  # first page; duplicate queue entries only cost a redundant etag check.
  begin
    client.list_objects(bucket: bucket.name).each do |response|
      response.contents.each do |object|
        queue << [bucket.name, object]
      end
    end
  rescue => e
    p e
    sleep 5
    retry
  end

  threads = []
  8.times do
    threads << Thread.new do
      loop do
        # Non-blocking pop: checking empty? and then doing a blocking pop can
        # deadlock when another thread takes the last item in between.
        begin
          bucket_name, object = queue.pop(true)
        rescue ThreadError
          break
        end

        begin
          file_path = "r2-backups/#{date}/#{bucket_name}/#{object.key}"
          FileUtils.mkdir_p(File.dirname(file_path)) # handles keys that contain "/"

          if File.exist?(file_path)
            md5 = Digest::MD5.file(file_path).hexdigest
            etag = object.etag.gsub(/"/, '') # For some reason the etags are double-quoted
            next if etag == md5
          end

          puts "#{bucket_name}/#{object.key}"
          real_object = client.get_object(bucket: bucket_name, key: object.key)
          File.binwrite(file_path, real_object.body.read) # binwrite: objects may be binary
        rescue => e
          p e
          sleep 5
          retry
        end
      end
    end
  end

  threads.each(&:join)
end
@mattparlane
Author

This is a simple multi-threaded backup script for R2/S3. I'm using R2, so that's all I've tested against, but it goes through the official AWS SDK gem (aws-sdk-s3), so it should be compatible with S3 as well.
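
For reference, here is a minimal sketch of the client setup against plain S3 rather than R2 (untested on my side; the region is just an example). The only change is dropping the custom endpoint and passing a real AWS region:

client = Aws::S3::Client.new(
  access_key_id: 'XXXXXX',
  secret_access_key: 'XXXXXX',
  region: 'us-east-1', # example region; no endpoint override needed for S3
)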

It uses the current date as the base path and then a separate directory for each bucket.
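
For example, with a hypothetical bucket named my-bucket and an object key photos/cat.jpg, a run on 2023-03-13 would save the object to r2-backups/2023-03-13/my-bucket/photos/cat.jpg.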

It processes one bucket at a time, and uses 8 threads per bucket.

It checks the MD5 of the local file against the etag of the remote object and only downloads when they don't match. Before each run I recursively copy the previous backup directory into place locally and then run this script over those files, so only new or changed objects are downloaded and the total time is reduced.
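
As a rough sketch of that pre-copy step (the r2-backups/<date> layout matches the script, but the "copy from yesterday" rule and this snippet are just an illustration of how I happen to run it):

require 'date'
require 'fileutils'

today    = Date.today.strftime('%Y-%m-%d')
previous = (Date.today - 1).strftime('%Y-%m-%d')

src = "r2-backups/#{previous}"
dst = "r2-backups/#{today}"

# Seed today's backup from the previous one so unchanged objects are
# skipped by the MD5/etag check instead of being re-downloaded.
FileUtils.cp_r(src, dst) if Dir.exist?(src) && !Dir.exist?(dst)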

Let me know if you have any questions or run into bugs; happy to iterate.
