@skchronicles
Created April 23, 2020 22:54
Calculate S3 ETag
#!/bin/bash
set -euo pipefail
help() { cat << EOF
Calculates S3 etag
USAGE:
s3etag [OPTIONS] input_file [chunk_size_in_MB]
Files uploaded to Amazon S3 that are smaller than the upload client's multipart threshold
(assumed here to be 1GB) have an etag that is the MD5 checksum of the uploaded file. Larger
files are uploaded in N parts (of 'chunk_size_in_MB' size) and an MD5 checksum is calculated
for each part. The binary checksums of the parts are concatenated, a final MD5 checksum is
calculated over the result, and the number of parts is appended as '-N'. This script splits
a file into N chunks of the user-defined chunk_size_in_MB and calculates this checksum of
checksums.
Positional Arguments:
[1] input_file Calculate S3 etag of this file
[2] chunk_size_in_MB Chunk size in MB for S3 etag calculation [Default: 5]
OPTIONS:
-h, --help Displays usage and help information
NOTE:
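chunk_size_in_MB must match the part size that was used when the file was uploaded.
S3 allows at most 10,000 parts per upload, so 5 MB parts only cover files up to about 50 GB.
If a file's size is less than 50 GB, then chunk_size_in_MB should be set to 5
If a file's size is greater than 50 GB, then chunk_size_in_MB should be set to 50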
Examples:
./s3etag /path/to/file.fastq.gz 5 # Calculates etag, breaking up file into 5MB chunks
./s3etag -h # Display usage and help information
EOF
}
s3etag(){
# Calculate S3 etag
file="$1"
if [ ! -f "$file" ]; then echo "Error: File ${file} not found!" >&2; help; exit 1; fi
partSizeInMb="$2"
checksumFile="$3"
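# Use the file's byte count; 'du -m' reports disk usage, which can differ from file size
fileSizeInBytes=$(wc -c < "$file")
partSizeInBytes=$((partSizeInMb * 1024 * 1024))
# Number of parts, rounded up to include a final partial chunk
parts=$(( (fileSizeInBytes + partSizeInBytes - 1) / partSizeInBytes ))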
# Break up file and calculate individual checksums of chunks
for (( part=0; part<parts; part++ )); do
skip=$((partSizeInMb * part))
# Append only the hex digest of each chunk (drop the trailing ' -' that md5sum prints)
dd bs=1M count="$partSizeInMb" skip="$skip" if="$file" 2> /dev/null | md5sum | cut -d ' ' -f 1 >> "$checksumFile"
done
# Calculate checksum of checksums
# Convert the hex digests back to binary, hash them, and append the part count
etag="$(xxd -r -p "$checksumFile" | md5sum | cut -d ' ' -f 1)-${parts}"
printf '%s\t%s\n' "$file" "$etag"
}
# Main: check usage
if [ $# -eq 0 ]; then help; exit 1; fi
# Check options
case "$1" in
-h | --help) help && exit 0;;
-*) help && exit 1;;
esac
# Parse Args
file="$1"
partSizeInMb="${2:-5}" # Default chunk size: 5MB
checksumFile=$(mktemp -t s3md5.XXXXXXXXXXXXX) # tmp file for appending checksums
trap 'rm -f "$checksumFile"' EXIT
# Calculate S3 etag
s3etag "$file" "$partSizeInMb" "$checksumFile"
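
The script above only prints a locally computed etag; comparing it with the value S3 reports is left to the user. The following is a minimal sketch of that comparison, not part of the original script. The helper names `part_count` and `etags_match`, the bucket `my-bucket`, and the object key are all hypothetical, and the commented usage assumes the AWS CLI is installed and configured. S3 usually returns the ETag wrapped in double quotes, so the sketch strips them before comparing.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: number of parts for a file of $1 bytes
# split into chunks of $2 MB (1 MB = 1024 * 1024 bytes), rounded up.
part_count() {
    local bytes="$1"
    local part_bytes=$(( $2 * 1024 * 1024 ))
    echo $(( (bytes + part_bytes - 1) / part_bytes ))
}

# Hypothetical helper: succeed if two ETags are equal once any
# surrounding double quotes have been removed.
etags_match() {
    local remote local_etag
    remote=$(printf '%s' "$1" | tr -d '"')
    local_etag=$(printf '%s' "$2" | tr -d '"')
    [ "$remote" = "$local_etag" ]
}

# Example usage (placeholders; requires AWS credentials, so left commented out):
#   remote=$(aws s3api head-object --bucket my-bucket \
#       --key path/to/file.fastq.gz --query ETag --output text)
#   computed=$(./s3etag /path/to/file.fastq.gz 5 | cut -f 2)
#   if etags_match "$remote" "$computed"; then echo "OK"; else echo "MISMATCH"; fi
#
# Part-count arithmetic for 5 MB chunks:
#   part_count 10485760 5   # → 2  (exactly 10 MB)
#   part_count 10485761 5   # → 3  (10 MB plus one byte)
```

If the two values differ, a mismatched chunk size is worth ruling out first: the `-N` suffix of the remote ETag gives the number of parts the uploader actually used, which can be compared against `part_count` for the chosen chunk size.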