Forked from rajivnarayan/compute_etag.sh
Calculate checksum corresponding to the entity-tag hash (ETag) of Amazon S3 objects
#!/bin/bash
#
# Calculate checksum corresponding to the entity-tag hash (ETag) of Amazon S3 objects
#
# Usage: compute_etag.sh <filename> <part_size_mb>
#
# filename: file to process
# part_size_mb: chunk size in MiB used for multipart uploads.
# This is 8 MiB by default for the AWS CLI. See:
# https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#multipart_chunksize
#
# The ETag for an S3 object can be obtained from the command line using:
# aws s3api head-object --bucket <bucket-name> --key <key-name> --query ETag --output text
# Note that the ETag may or may not be the MD5 digest of the object data; see:
# https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.html
# Adapted from: https://gist.github.com/emersonf/7413337
# Changes
# 7/23/2022
# - Parallelized hash calculation
# - Removed need for temporary files
# Script requires: dd, md5sum, xxd
set -euo pipefail
NUM_PARALLEL=$(nproc)
# Minimum file size in bytes above which multipart uploads are used
# https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#multipart-threshold
MULTIPART_MINSIZE=$((8*1024*1024))
if [[ $# -ne 2 ]]; then
    echo "Usage: $0 <filename> <part_size_mb>"
    exit 1
fi
file="$1"
partSizeInMb=$2
if [[ ! -f "$file" ]]; then
    echo "Error: $file not found."
    exit 1
fi
# Calculate checksum for a specified file chunk
# inputs: file, partSizeInMb, chunk
# output: chunk md5sum
hash_chunk(){
    file="$1"
    partSizeInMb="$2"
    chunk="$3"
    skip=$((partSizeInMb * chunk))
    # Prefix the digest with its chunk index so the results can be sorted later.
    # Note: $(md5sum) reads the chunk from the pipe, since the subshell running
    # echo inherits the pipe as its stdin and md5sum consumes it.
    dd bs=1M count="$partSizeInMb" skip="$skip" if="$file" 2> /dev/null | echo -e "$chunk $(md5sum)"
}
# Integer quotient a/b after rounding up
div_round_up(){
    echo $(( ($1 + $2 - 1) / $2 ))
}
partSizeInB=$((partSizeInMb * 1024 * 1024))
fileSizeInB=$(du -b "$file" | cut -f1 )
parts=$(div_round_up "$fileSizeInB" "$partSizeInB")
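# For multipart uploads, S3 computes the ETag as the MD5 of the concatenated
# binary MD5 digests of the parts, suffixed with "-" and the part count.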
if [[ $fileSizeInB -gt $MULTIPART_MINSIZE ]]; then
    export -f hash_chunk
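    # Enumerate chunk indices, hash the chunks in parallel, sort by index,
    # drop the index, convert the hex digests to binary, and MD5 the result.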
    etag=$(seq 0 $((parts-1)) | \
        xargs -P "${NUM_PARALLEL}" -I{} bash -c 'hash_chunk "$@"' -- "$file" "$partSizeInMb" {} | \
        sort -n -k1,1 | tr -s ' ' | cut -f2,3 -d' ' | xxd -r -p | md5sum | cut -f1 -d' ')"-$parts"
else
    etag=$(md5sum "$file" | cut -f1 -d' ')
fi
echo -e "${file}\t${etag}"
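
To check a local file against an object in S3, compare the script's output with the object's ETag. A minimal sketch, assuming a hypothetical bucket my-bucket holding a copy of data.bin that was uploaded with the default 8 MiB chunk size:

./compute_etag.sh data.bin 8
aws s3api head-object --bucket my-bucket --key data.bin --query ETag --output text

The two values should match, modulo the double quotes S3 wraps around the ETag; a trailing "-N" indicates the object was uploaded in N parts.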