@rajivnarayan
Forked from emersonf/s3etag.sh
Last active June 4, 2024 07:27
Calculate checksum corresponding to the entity-tag hash (ETag) of Amazon S3 objects
#!/bin/bash
#
# Calculate checksum corresponding to the entity-tag hash (ETag) of Amazon S3 objects
#
# Usage: compute_etag.sh <filename> <part_size_mb>
#
# filename: file to process
# part_size_mb: chunk size in MiB used for multipart uploads.
#               This is 8 MiB by default for the AWS CLI. See:
#               https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#multipart_chunksize
#
# The ETag for an S3 object can be obtained from the command line using:
#   aws s3api head-object --bucket <bucket-name> --key <key-name> --query ETag --output text
# Note that the ETag may or may not correspond to the MD5 digest; see here for details:
#   https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.html
# Adapted from: https://gist.github.com/emersonf/7413337
# Changes
#   7/23/2022
#     - Parallelized hash calculation
#     - Removed need for temporary files
# Script requires: dd, md5sum, xxd
set -euo pipefail
NUM_PARALLEL=$(nproc)
# Minimum filesize in bytes to switch to multipart uploads
# https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#multipart-threshold
MULTIPART_MINSIZE=$((8*1024*1024))
if [[ $# -ne 2 ]]; then
echo "Usage: $0 file partSizeInMb";
exit 0;
fi
file="$1"
partSizeInMb=$2
if [[ ! -f "$file" ]]; then
echo "Error: $file not found."
exit 1;
fi
# Calculate checksum for a specified file chunk
# inputs: file, partSizeInMb, chunk
# output: chunk md5sum
hash_chunk(){
file="$1"
partSizeInMb="$2"
chunk="$3"
skip=$((partSizeInMb * chunk))
# output chunk + md5 (to allow sorting later)
dd bs=1M count="$partSizeInMb" skip="$skip" if="$file" 2> /dev/null | echo -e "$chunk $(md5sum)"
}
# Integer quotient a/b after rounding up
div_round_up(){
    echo $(( ($1 + $2 - 1) / $2 ))
}
partSizeInB=$((partSizeInMb * 1024 * 1024))
fileSizeInB=$(du -b "$file" | cut -f1 )
parts=$(div_round_up "$fileSizeInB" "$partSizeInB")
if [[ $fileSizeInB -gt $MULTIPART_MINSIZE ]]; then
    export -f hash_chunk
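    # Multipart ETag: MD5 of the concatenated binary MD5 digests of all parts,
    # with "-<number of parts>" appended.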
    etag=$(seq 0 $((parts - 1)) | \
        xargs -P "$NUM_PARALLEL" -I{} bash -c 'hash_chunk "$@"' -- "$file" "$partSizeInMb" {} | \
        sort -n -k1,1 | tr -s ' ' | cut -f2,3 -d' ' | xxd -r -p | md5sum | cut -f1 -d' ')"-$parts"
else
    etag=$(md5sum "$file" | cut -f1 -d' ')
fi
echo -e "${file}\t${etag}"

shntnu commented Jul 25, 2022

Thanks, @rajivnarayan!

Our most common use case for this operation is to check upload consistency. I'm hoping aws/aws-cli#6750 gets addressed, because that will make life so much easier!

@rajivnarayan (Author)

@shntnu yup, same use case. The new checksum support looks useful, thanks! Added a note to the issue above using s3api calls.
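For reference, a rough sketch of such s3api-based checks (not necessarily the exact calls in that note; bucket and key names are placeholders, and the object must have been uploaded with an additional checksum enabled):

aws s3 cp myfile s3://my-bucket/myfile --checksum-algorithm SHA256
aws s3api get-object-attributes --bucket my-bucket --key myfile \
    --object-attributes Checksum --query Checksum.ChecksumSHA256 --output text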


tpwrules commented Mar 26, 2024

Line 72 needs to have quotes around $file, but other than that this worked great, thank you!

@rajivnarayan (Author)

Updated, thanks for reporting!


TrevorSayre commented May 16, 2024

Verified working with correct output on macOS with the following added steps:

brew install coreutils
PATH="/opt/homebrew/opt/coreutils/libexec/gnubin:$PATH"

This sets you up with the otherwise-missing md5sum and nproc (equivalent to sysctl -n hw.logicalcpu), and provides a compatible du (the macOS du does not support the -b flag).

The PATH modification is only needed because du ships with macOS; without it, coreutils installs the GNU version as gdu. You can also choose to forgo the PATH modification and change line 63 to use gdu instead of du.

@wcmatthysen

I modified the script slightly (my version is here). S3 doesn't support more than 10,000 parts. I've seen some other libraries (like Uppy) switch to a variable part size when they hit 10,000 parts (see this issue for an example of the calculation). The modification takes the 10,000-part limit into account and switches to a variable part-size calculation when needed. The part size is also printed at the end so you can see which part size was used.
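A rough sketch of that kind of calculation in bash (an approximation, not the exact logic from the linked issue; assumes the 8 MiB default part size and S3's 10,000-part cap, with all sizes in bytes):

fileSizeInB=$(du -b "$file" | cut -f1)   # file size in bytes, as in the script above
defaultPartSizeInB=$((8 * 1024 * 1024))
maxParts=10000
# smallest part size that keeps the upload at or under 10,000 parts (rounded up)
minPartSizeInB=$(( (fileSizeInB + maxParts - 1) / maxParts ))
partSizeInB=$(( minPartSizeInB > defaultPartSizeInB ? minPartSizeInB : defaultPartSizeInB ))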
