@petrmvala
Created December 26, 2020 15:49
Deduplicate files
#!/bin/bash
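# Reports duplicate files under a directory, based on their MD5 hashes.
# Nothing is deleted; the script only prints the duplicates it finds.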
#
# Usage:
# ./deduplicate <directory_name>
#
#
# Output like:
# [DUPLICATE]: (./Documents/Olympus pics/Stockholm445.JPG) found in: (./Documents/P5310445.JPG)
#
# Suggestion:
# Redirect the output to a file (dups) and edit it so that each line holds one filename
# to delete (without the surrounding parentheses), leaving out the files you want to keep.
# Then use the file as a to-delete list. Reading it line by line copes with spaces in filenames:
#
# while IFS= read -r file; do echo "Removing duplicate: $file"; sudo rm -- "$file"; done < dups
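# Relies on the BSD/macOS `md5` command, which prints lines like: MD5 (<file>) = <hash>
directory=$1
# Hash every file, then reshape each line into "<hash>|(<file>)" so that sort groups equal hashes.
# The hash is taken from the last field, so "=" or "MD5" inside a filename does not break the split.
find "./$directory" -type f -exec md5 {} \; \
| awk '{ hash = $NF; name = substr($0, 5, length($0) - length(hash) - 7); print hash "|" name }' \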
| sort \
| awk 'BEGIN { FS = "|"; count = 1 }
  {
    curr = $1
    if (curr == prev) {
      # Same hash as the previous line: append this file to the current group.
      rec = rec " " $2
      count = count + 1
    } else {
      # New hash: print the finished group if it held more than one file.
      if (count > 1) {
        print rec " "
        count = 1
      }
      rec = "[DUPLICATE]: " $2 " found in:"
    }
    prev = curr
  }
  END { if (count > 1) print rec }'
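
The script above depends on the BSD/macOS `md5` command. As a rough sketch only, the same hash, sort, and group pipeline could be wrapped in functions that also work where `md5sum` is installed instead. The names `hash_file` and `find_duplicates` are hypothetical and not part of the original gist, and the sketch assumes filenames contain no newline characters.

```shell
#!/bin/bash
# Sketch of a more portable variant; hash_file and find_duplicates are
# hypothetical helper names, not part of the original script.

# Print "<hash>|<path>" for one file, using whichever MD5 tool is installed.
hash_file() {
  local sum
  if command -v md5sum >/dev/null 2>&1; then
    sum=$(md5sum < "$1")   # GNU coreutils prints "<hash>  -" for stdin
    sum=${sum%% *}         # keep only the hash
  else
    sum=$(md5 -q "$1")     # BSD/macOS: -q prints only the hash
  fi
  printf '%s|%s\n' "$sum" "$1"
}

# Print one "[DUPLICATE]: (a) found in: (b) ..." line per group of identical files.
find_duplicates() {
  local dir=${1:-.}
  find "$dir" -type f -print0 |
    while IFS= read -r -d '' file; do hash_file "$file"; done |
    sort |
    awk -F'|' '
      {
        # Everything after the first "|" is the path, even if it contains "|".
        path = substr($0, length($1) + 2)
        if (($1 "") == prev) {          # concatenation forces a string comparison
          rec = rec " (" path ")"
          count++
        } else {
          if (count > 1) print rec
          rec = "[DUPLICATE]: (" path ") found in:"
          count = 1
        }
        prev = $1 ""
      }
      END { if (count > 1) print rec }'
}
```

After sourcing the file, `find_duplicates Documents > dups` would produce output in the same shape as the original script. Passing paths through `find -print0` and `read -d ''` keeps spaces in filenames intact without touching `IFS` globally.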