@rajivnarayan
Forked from emersonf/s3etag.sh
Last active March 26, 2024 21:28
Calculate checksum corresponding to the entity-tag hash (ETag) of Amazon S3 objects
#!/bin/bash
#
# Calculate checksum corresponding to the entity-tag hash (ETag) of Amazon S3 objects
#
# Usage: compute_etag.sh <filename> <part_size_mb>
#
# filename: file to process
# part_size_mb: chunk size in MiB used for multipart uploads.
# This is 8 MiB by default for the AWS CLI. See:
# https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#multipart_chunksize
#
# The ETag for an S3 object can be obtained from the command line using:
# aws s3api head-object --bucket <bucket-name> --key <key-name> --query ETag --output text
# Note that the ETag may or may not correspond to the MD5 digest; see here for details:
# https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.html
# Adapted from: https://gist.github.com/emersonf/7413337
# Changes
# 7/23/2022
# - Parallelized hash calculation
# - Removed need for temporary files
# Script requires: dd, md5sum, xxd, xargs, seq, du, nproc
set -euo pipefail
NUM_PARALLEL=$(nproc)
# Minimum filesize in bytes to switch to multipart uploads
# https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#multipart-threshold
MULTIPART_MINSIZE=$((8*1024*1024))
if [[ $# -ne 2 ]]; then
    echo "Usage: $0 file partSizeInMb" >&2
    exit 1
fi
file="$1"
partSizeInMb="$2"
if [[ ! -f "$file" ]]; then
    echo "Error: $file not found." >&2
    exit 1
fi
# Calculate the checksum for a specified file chunk
# inputs: file, partSizeInMb, chunk
# output: chunk number followed by the chunk's md5sum
hash_chunk(){
    local file="$1"
    local partSizeInMb="$2"
    local chunk="$3"
    local skip=$((partSizeInMb * chunk))
    # Prefix each hash with its chunk number so the output can be sorted later.
    # $(md5sum) reads the chunk from the pipeline's stdin, which echo inherits.
    dd bs=1M count="$partSizeInMb" skip="$skip" if="$file" 2> /dev/null | echo -e "$chunk $(md5sum)"
}
# Integer quotient a/b, rounded up
div_round_up(){
    echo $(( ($1 + $2 - 1) / $2 ))
}
partSizeInB=$((partSizeInMb * 1024 * 1024))
fileSizeInB=$(du -b "$file" | cut -f1 )
parts=$(div_round_up "$fileSizeInB" "$partSizeInB")
if [[ $fileSizeInB -gt $MULTIPART_MINSIZE ]]; then
    export -f hash_chunk
    etag=$(seq 0 $((parts - 1)) | \
        xargs -P "$NUM_PARALLEL" -I{} bash -c 'hash_chunk "$@"' -- "$file" "$partSizeInMb" {} | \
        sort -n -k1,1 | tr -s ' ' | cut -f2 -d' ' | xxd -r -p | md5sum | cut -f1 -d' ')"-$parts"
else
    etag=$(md5sum "$file" | cut -f1 -d' ')
fi
echo -e "${file}\t${etag}"
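As a quick sanity check, the multipart algorithm above can be exercised inline on a locally generated file. This is a minimal sketch; the temp file, sizes, and the 8 MiB part size are arbitrary choices for the demo:

```shell
# Generate a 20 MiB file of zeros and compute its multipart ETag with an
# 8 MiB part size, mirroring the script's pipeline (3 parts: 8 + 8 + 4 MiB).
demo=$(mktemp)
dd if=/dev/zero of="$demo" bs=1M count=20 2>/dev/null

parts=3  # ceil(20 / 8)
etag=$(for chunk in 0 1 2; do
        # Hash each 8 MiB part separately; dd stops early on the last part.
        dd if="$demo" bs=1M count=8 skip=$((8 * chunk)) 2>/dev/null | md5sum | cut -d' ' -f1
    done | xxd -r -p | md5sum | cut -d' ' -f1)-$parts
echo "$etag"   # 32 hex digits followed by "-3"
rm -f "$demo"
```

The final digest is the MD5 of the concatenated binary part digests, which is why `xxd -r -p` is needed to decode the per-part hex hashes before the outer `md5sum`.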

shntnu commented Jul 25, 2022

Thanks, @rajivnarayan!

Our most common use case for this operation is to check upload consistency. I'm hoping this issue is addressed aws/aws-cli#6750 because that will make life so much easier!
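For that upload-consistency use case, a hedged sketch of the comparison might look like the following. The bucket, key, and file names are placeholders, `compute_etag.sh` is assumed to be this gist's script saved locally, and the part size must match whatever the upload actually used:

```shell
# Compare the locally computed ETag against the one S3 reports.
# Placeholders: my-bucket / myfile.bin; 8 MiB matches the AWS CLI default.
local_etag=$(./compute_etag.sh myfile.bin 8 | cut -f2)
remote_etag=$(aws s3api head-object --bucket my-bucket --key myfile.bin \
    --query ETag --output text | tr -d '"')
if [ "$local_etag" = "$remote_etag" ]; then
    echo "upload consistent"
else
    echo "ETag mismatch" >&2
fi
```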

@rajivnarayan (Author)

@shntnu Yup, same use case. The new checksum support looks useful, thanks! I added a note to the issue above using s3api calls.
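As a sketch of the s3api checksum calls mentioned (bucket and key are placeholders, and this assumes a CLI recent enough to support additional checksums):

```shell
# Upload with an additional SHA-256 checksum, then read it back.
# For a single-part put-object, the returned checksum covers the full
# object and, unlike the ETag, does not depend on multipart chunking.
aws s3api put-object --bucket my-bucket --key myfile.bin \
    --body myfile.bin --checksum-algorithm SHA256
aws s3api head-object --bucket my-bucket --key myfile.bin \
    --checksum-mode ENABLED --query ChecksumSHA256 --output text
```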

tpwrules commented Mar 26, 2024

Line 72 needs to have quotes around $file, but other than that this worked great, thank you!

@rajivnarayan (Author)

Updated, thanks for reporting!
