@res0nat0r
Forked from emersonf/s3etag.sh
Last active July 16, 2021 17:17
A Bash script to compute ETag values for S3 multipart uploads on OS X.

Verifying Amazon S3 multi-part uploads with the ETag hash

Uploading big files to Amazon S3 can be a bit of a pain when you're on an unstable network connection. If an error occurs, your transfer is cancelled and you have to start the upload process all over again.

To check the integrity of a file that was uploaded in multiple parts, you can calculate the checksum of the local file and compare it with the checksum on S3. The problem is that Amazon doesn't use a regular md5 hash for multipart uploads. In this post we'll take a look at how you can compute the correct checksum on your computer so you can compare it to the checksum calculated by Amazon.
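For reference, one way to look up the ETag that S3 stores for an object is the AWS CLI's head-object call. This is just a sketch; the bucket and key names are placeholders:

# Fetch the object's metadata; the response includes the ETag (bucket/key are placeholders)
aws s3api head-object --bucket my-bucket --key myBigBackup.zip
# Look for a line like:  "ETag": "\"57f456164b0e5f365aaf9bb549731f32-95\""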

The solution

If you want to check whether your files were transferred correctly, you have to compute the ETag hash in the same way that Amazon does. Luckily, there is this bash script, which splits up your file (like the multipart upload does) and calculates the correct ETag hash.

To use it:

Download the script from GitHub and save it somewhere.

In terminal, make it executable:

chmod +x s3etag.sh

Now you can use the script. Say you uploaded a file myBigBackup.zip and set your multipart upload size to 16 megabytes. After transferring the file to S3 you want to check its integrity:

./s3etag.sh myBigBackup.zip 16

The script should return the same hash that Amazon calculated. If not, your file got corrupted somewhere and needs to be re-uploaded.

Background information: Multipart & ETag

Multipart uploading splits big files into smaller pieces and uploads them one by one. After receiving all the parts, Amazon stitches them back together. If one of the parts fails to upload, you just hit "retry" for that piece; you don't have to re-upload the entire file. Great for unstable connections!
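The part size matters later, because the script needs to be given the same value you uploaded with. If you upload with the AWS CLI, the part size is controlled by its multipart_chunksize setting; as a small example (16 MB here is just an illustration, not a recommendation):

# Example only: set the CLI's multipart chunk size to 16 MB, then upload
aws configure set default.s3.multipart_chunksize 16MB
aws s3 cp myBigBackup.zip s3://my-bucket/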

Each file on S3 gets an ETag, which is essentially the md5 checksum of that file. Comparing md5 hashes is really simple, but Amazon calculates the checksum differently if you've used the multipart upload feature. Instead of hashing the entire file, Amazon hashes each part and combines those part hashes into a single hash.

Manually compute the ETag

This is what an ETag looks like: 57f456164b0e5f365aaf9bb549731f32-95

It has two parts, separated by a dash. The first part is the actual checksum, and the second indicates how many parts the file was split into during transfer.

To calculate the first part, you make a list of the md5 hashes of all your parts, convert that list into binary format, and take the md5 hash of the result. Afterwards you append a dash and the number of parts the file was split into. That's a very brief summary; check out this StackOverflow answer if you want to know more.
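As a minimal illustration of those steps, here is how you could reproduce the combined hash by hand on OS X for a file split into 16 MB parts (the split prefix and part size are just examples; this mirrors what the script below automates):

# Example only: split into 16 MB parts, hash each part, then hash the concatenated binary digests
split -b 16m myBigBackup.zip part-
md5 -q part-* > part-hashes.txt
echo "$(xxd -r -p part-hashes.txt | md5 -q)-$(wc -l < part-hashes.txt | tr -d ' ')"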

On Amazon's side this makes a lot of sense: they calculate the hash of each part as they receive it. After all the pieces are transferred, they only have to combine those hashes; there's no need to read the big file again to calculate the checksum.

#!/bin/bash

# Print usage if the argument count is wrong.
if [ $# -ne 2 ]; then
    echo "Usage: $0 file partSizeInMb"
    exit 0
fi

file=$1
if [ ! -f "$file" ]; then
    echo "Error: $file not found."
    exit 1
fi

partSizeInMb=$2

# Work out how many parts the file was split into during the multipart upload.
fileSizeInMb=$(du -m "$file" | cut -f 1)
parts=$((fileSizeInMb / partSizeInMb))
if [[ $((fileSizeInMb % partSizeInMb)) -gt 0 ]]; then
    parts=$((parts + 1))
fi

# Hash each part and collect the hex digests in a temporary file.
checksumFile=$(mktemp -t s3md5)
for (( part=0; part<parts; part++ ))
do
    skip=$((partSizeInMb * part))
    dd bs=1m count=$partSizeInMb skip=$skip if="$file" 2>/dev/null | md5 >> "$checksumFile"
done

# The ETag is the md5 of the concatenated binary digests, plus "-<number of parts>".
echo "$(xxd -r -p "$checksumFile" | md5)-$parts"
rm "$checksumFile"