@leonmax
Last active November 14, 2022 04:34
Find duplicated files (of the same md5) within a folder using awk.
#!/usr/bin/env -S awk -f
# Example usage:
#   fd -tf -x md5 -r | find_dup | jq -s
# (The example assumes you have `fd` and `jq`; neither is required by this script.)
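{
    md5 = $1                                 # checksum is expected in the 1st field
    name = $0                                # the rest of the line is the filename
    sub(/^[ \t]*[^ \t]+[ \t]*/, "", name)    # drop the checksum; unlike $1="", this keeps runs of spaces inside the name
    sub(/[ \t]+$/, "", name)                 # trim trailing whitespace
    gsub(/\\/, "&&", name)                   # JSON-escape backslashes (& is the matched text)
    gsub(/"/, "\\\"", name)                  # JSON-escape double quotes
    files[md5] = (md5 in files) ? files[md5] "\",\"" name : name
    count[md5]++
} END {
    for (md5 in files)
        if (count[md5] > 1)                  # emit each group of duplicates as one JSON object
            printf "{\"MD5\": \"%s\", \"files\":[\"%s\"]}\n", md5, files[md5]
}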

This gist finds all files that share the same MD5 within a folder. Save the awk script above somewhere on your PATH, for example ~/.local/bin/find_dup, and make it executable:

chmod +x ~/.local/bin/find_dup

Then run the following:

find . -type f -exec md5 -r {} \; | find_dup | jq -s

This assumes you are on macOS, which ships md5 rather than md5sum, and that you have jq installed.
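On Linux, md5sum prints the checksum first as well, so the same pipeline should work with it. The sketch below picks whichever tool is available; `checksum_lines` is a hypothetical helper name, not part of the gist, and the fallback order is an assumption.

```shell
# Print "checksum path" for every regular file under a directory.
# Uses md5sum where available (typical on Linux), else `md5 -r` (macOS).
# `checksum_lines` is a hypothetical helper, not part of the original gist.
checksum_lines() {
    dir=${1:-.}
    if command -v md5sum >/dev/null 2>&1; then
        find "$dir" -type f -exec md5sum {} +
    else
        find "$dir" -type f -exec md5 -r {} +
    fi
}

# Usage, assuming find_dup is on your PATH and jq is installed:
#   checksum_lines . | find_dup | jq -s
```

Passing `{} +` to find batches many files into each checksum call instead of starting one process per file.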

I also recommend fd, which is much faster and respects your .gitignore.

fd -tf -x md5 -r | find_dup | jq -s