@emersonf
Last active August 14, 2024 13:56
A Bash script to compute ETag values for S3 multipart uploads on OS X.
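#!/bin/bash
# Compute the ETag that S3 reports for a multipart upload (OS X version).

if [ $# -ne 2 ]; then
    echo "Usage: $0 file partSizeInMb"
    exit 1
fi

file=$1
if [ ! -f "$file" ]; then
    echo "Error: $file not found."
    exit 1
fi

partSizeInMb=$2
# du -m reports disk usage rounded up to whole megabytes, not the exact byte count.
fileSizeInMb=$(du -m "$file" | cut -f 1)
parts=$((fileSizeInMb / partSizeInMb))
if [[ $((fileSizeInMb % partSizeInMb)) -gt 0 ]]; then
    parts=$((parts + 1))
fi

checksumFile=$(mktemp -t s3md5)
for (( part=0; part<parts; part++ ))
do
    skip=$((partSizeInMb * part))
    # Append the hex MD5 digest of this part, one digest per line.
    dd bs=1m count="$partSizeInMb" skip="$skip" if="$file" 2>/dev/null | md5 >> "$checksumFile"
done

# ETag = MD5 of the concatenated binary part digests, then "-" and the part count.
echo "$(xxd -r -p "$checksumFile" | md5)-$parts"
rm "$checksumFile"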
@bitwombat

Very cool. I have a patch to make it work for Linux, if that's of interest. I'll fork and PR if so.

One file I have doesn't match S3's MD5 sum, even after multiple downloads. Chunk size is rather big (512 MB).
Any ideas what this could be?

Not that a hash tells us much, but Amazon says it's:
29fd5af267ee59b66273451bc0f549e8-2

Whereas your script says:
f209c8604d57297b0e06ca84fafeac00-2

File size is 609865657 bytes.

Different algorithm for big files? Doesn't really make sense.

@RichardBronosky
RichardBronosky commented Jan 9, 2017

How do you know what part size was used/to use?
(Size: 9476171423 ETag: 44dab9123b49dab2c2b3b10c360ceda1-1130)

@komiyak
komiyak commented Aug 4, 2017

@RichardBronosky
I finally understand.
https://stackoverflow.com/questions/12186993/what-is-the-algorithm-to-compute-the-amazon-s3-etag-for-a-file-larger-than-5gb#answer-19896823

Note: If you uploaded with aws-cli via aws s3 cp then you most likely have an 8 MB chunk size. According to the docs, that is the default.

So for a file uploaded with aws-cli via aws s3 cp, we should pass 8 as the part size:

$ ./s3etag.sh something.zip 8
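
When the upload tool is unknown, the part count after the dash in the ETag narrows down the part size. The following is a minimal sketch, not part of the original gist: the function name `candidate_part_sizes` and the 1024 MB search limit are illustrative assumptions, and it only considers whole-megabyte part sizes.

```shell
# Hypothetical helper: list every whole-MB part size that would split a file
# of the given byte size into exactly the given number of parts.
candidate_part_sizes() {
    local size="$1" parts="$2" max_mb="${3:-1024}"
    local mb=1 part_bytes
    while [ "$mb" -le "$max_mb" ]; do
        part_bytes=$((mb * 1048576))
        # Ceiling division: number of parts at this part size.
        if [ $(( (size + part_bytes - 1) / part_bytes )) -eq "$parts" ]; then
            echo "$mb"
        fi
        mb=$((mb + 1))
    done
}

# Size and part count taken from the ETag quoted above.
candidate_part_sizes 9476171423 1130   # prints 8
```

For an object with only two parts, such as the 609865657-byte file mentioned earlier in the thread, the same call lists every size from 291 to 581 MB, so the part count alone cannot settle the question there.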

@jocot
jocot commented Apr 17, 2018

Thanks for this, it helped me validate a heap of files I had in S3.

Note that AWS S3 supports a maximum of 10,000 parts. I recently exceeded this on a project with a 54GB file (5MB part size). The AWS SDK adjusts the part size to fit within 10,000 parts. If you happen to exceed 10,000 parts, I used the expression below to get the part size needed to calculate the ETag correctly. I also specified the part size in bytes for better accuracy.

partsize = (filesize / 10000) + 1
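
The expression above can be wrapped in a small helper. This is only a sketch of the rule as described in this comment, not a copy of what the AWS SDK does internally; the function name `effective_part_size` is made up for illustration, and both arguments are in bytes.

```shell
# Hypothetical helper: return the part size (in bytes) to use for the ETag
# calculation, applying the adjustment above when the requested part size
# would need more than 10,000 parts.
effective_part_size() {
    local size="$1" part_bytes="$2" max_parts=10000
    local parts
    # Ceiling division: parts needed at the requested part size.
    parts=$(( (size + part_bytes - 1) / part_bytes ))
    if [ "$parts" -gt "$max_parts" ]; then
        part_bytes=$(( size / max_parts + 1 ))
    fi
    echo "$part_bytes"
}

# Assuming "54GB" means 54,000,000,000 bytes and a 5 MiB requested part size.
effective_part_size 54000000000 5242880   # prints 5400001
```

Because the result is in bytes, it cannot be passed directly to the scripts in this thread, which take the part size in megabytes.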

@veenits
veenits commented May 15, 2018

Thank you. This is helpful. Are there any alternatives for xxd on linux?
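
One possible stand-in where xxd is not installed is to do the hex-to-binary step with the shell's own printf. The sketch below is an assumption-laden illustration rather than a tested drop-in replacement: `hex_to_bin` is a made-up name, it expects a clean hex string with an even number of digits, and it is much slower than xxd on long inputs.

```shell
# Hypothetical replacement for `xxd -r -p`: write the bytes encoded by a hex
# string to standard output.
hex_to_bin() {
    local byte
    # fold splits the string into two-character lines, one byte per line.
    printf '%s\n' "$1" | fold -w 2 | while read -r byte; do
        # The inner printf turns the hex pair into a three-digit octal number;
        # the outer printf emits the byte through an octal escape.
        printf "\\$(printf '%03o' "0x$byte")"
    done
}

hex_to_bin 68656c6c6f   # prints hello (without a trailing newline)
```

To feed it the digests collected by the scripts here, the hex column has to be extracted first, for example with `hex_to_bin "$(cut -d' ' -f1 "$checksumFile" | tr -d '\n')"`.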

@cyb3rz3us

Awesome script - it doesn't work for SSE-KMS files so if you happen to uncover any intel on how AWS is generating the MD5 for that scenario, please share. Again, awesome job here.

@rfraimow
rfraimow commented Dec 9, 2019

Thanks for the script, this is incredibly helpful and we're incorporating it into our workflows!

@skchronicles

Linux users

Here is an equivalent script if you are not using OSX. I hope this helps!

#!/bin/bash
set -euo pipefail
if [ $# -ne 2 ]; then
    echo "Usage: $0 file partSizeInMb";
    exit 0;
fi
file=$1
if [ ! -f "$file" ]; then
    echo "Error: $file not found." 
    exit 1;
fi
partSizeInMb=$2
fileSizeInMb=$(du -m "$file" | cut -f 1)
parts=$((fileSizeInMb / partSizeInMb))
if [[ $((fileSizeInMb % partSizeInMb)) -gt 0 ]]; then
    parts=$((parts + 1));
fi
checksumFile=$(mktemp -t s3md5.XXXXXXXXXXXXX)
for (( part=0; part<$parts; part++ ))
do
    skip=$((partSizeInMb * part))
    $(dd bs=1M count=$partSizeInMb skip=$skip if="$file" 2> /dev/null | md5sum >> $checksumFile)
done
etag=$(echo $(xxd -r -p $checksumFile | md5sum)-$parts | sed 's/ --/-/')
echo -e "${1}\t${etag}"
rm $checksumFile

@shntnu
shntnu commented Jul 20, 2022

Thank you @skchronicles

I think there's an error in the parts calculation, which is now fixed in the version below.

https://gist.github.com/emersonf/7413337?permalink_comment_id=3244707#gistcomment-3244707

#!/bin/bash
set -euo pipefail
if [ $# -ne 2 ]; then
    echo "Usage: $0 file partSizeInMb";
    exit 0;
fi
file=$1
if [ ! -f "$file" ]; then
    echo "Error: $file not found." 
    exit 1;
fi
partSizeInMb=$2
partSizeInB=$((partSizeInMb * 1024 * 1024)) ### I added this
fileSizeInB=$(du -b "$file" | cut -f 1) ### I edited this
parts=$((fileSizeInB / partSizeInB)) ### I edited this and the next line
if [[ $((fileSizeInB % partSizeInB)) -gt 0 ]]; then 
    parts=$((parts + 1));
fi
checksumFile=$(mktemp -t s3md5.XXXXXXXXXXXXX)
for (( part=0; part<$parts; part++ ))
do
    skip=$((partSizeInMb * part))
    $(dd bs=1M count=$partSizeInMb skip=$skip if="$file" 2> /dev/null | md5sum >> $checksumFile)
done
etag=$(echo $(xxd -r -p $checksumFile | md5sum)-$parts | sed 's/ --/-/')
echo -e "${1}\t${etag}"
rm $checksumFile

@rajivnarayan
rajivnarayan commented Jul 23, 2022

Thanks, this is quite useful.

I modified the script to speed up the hash computation and avoid generating temporary files. Link to script
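
The linked script is not reproduced here. As a rough sketch of the same idea (not rajivnarayan's actual code), the part digests can be piped straight into the final hash instead of being collected in a temporary file. The function name `s3_etag` is an illustrative assumption, the part count is derived from the byte size as in the corrected script above, and the sketch assumes `md5sum` and `xxd` are available and that the object really was uploaded in multiple parts.

```shell
# Hypothetical temp-file-free variant: print the multipart ETag of a file for
# a given part size in MB.
s3_etag() {
    local file="$1" part_mb="$2"
    local part_bytes size parts part digest
    part_bytes=$((part_mb * 1048576))
    size=$(wc -c < "$file")
    # Ceiling division on the exact byte count.
    parts=$(( (size + part_bytes - 1) / part_bytes ))
    part=0
    digest=$(
        while [ "$part" -lt "$parts" ]; do
            # Hex MD5 of one part; bs is given in bytes so it works with GNU and BSD dd.
            dd if="$file" bs=1048576 count="$part_mb" skip=$((part * part_mb)) 2>/dev/null |
                md5sum | cut -d' ' -f1
            part=$((part + 1))
        done | xxd -r -p | md5sum | cut -d' ' -f1
    )
    printf '%s-%s\n' "$digest" "$parts"
}
```

Usage mirrors the scripts above, for example `s3_etag something.zip 8`.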
