@snarlysodboxer
Created January 30, 2016 01:23
Find (and optionally delete) duplicate files in directories using md5sum
#!/usr/bin/env ruby
# Usage:
# `./find_duplicates.rb 'directory to search*'` for a dry run, and
# `./find_duplicates.rb 'directory to search*' --delete` to actually delete the duplicates.
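require 'digest'
require 'find'

find_path = ARGV[0]
abort "Usage: #{$PROGRAM_NAME} 'directory to search*' [--delete]" if find_path.nil?
delete = ARGV[1] == "--delete"

# Expand the pattern and walk the tree in Ruby instead of shelling out to `find`,
# so file names containing spaces or shell metacharacters are handled safely.
files = Dir.glob(find_path).flat_map do |root|
  Find.find(root).select { |path| File.file?(path) && !File.symlink?(path) }
end

# Record each file's name alongside its MD5 checksum. Digest::MD5 yields the
# same hex digest that `md5sum` prints, without spawning a process per file.
array = files.map do |file|
  { 'name' => file, 'sum' => Digest::MD5.file(file).hexdigest }
end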
# Group file names by checksum: { sum => [name, ...] }.
new_hash = {}
array.each do |hash|
  new_hash[hash['sum']] ||= []
  new_hash[hash['sum']].push hash['name']
end
non_duplicates = 0
new_hash.each do |key, value|
  if value.length > 1
    puts "Found multiple files with the md5sum #{key}"
    puts "  #{value}"
    # Keep the first file in each group and treat the rest as duplicates.
    keep, *extras = value
    puts "  Leave #{keep}"
    puts "  Delete #{extras.join(" ")}"
    if delete
      puts "  deleting #{extras.join(" ")}"
      # File.delete bypasses the shell, so names with spaces are removed correctly.
      File.delete(*extras)
    end
  else
    # puts "Found only one file with the md5sum #{key}"
    non_duplicates += 1
  end
end
puts "#{non_duplicates} files found without duplicates"
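
The grouping step can also be sketched as a small reusable method, which makes it easier to check on a scratch directory before running anything with `--delete`. This is only a sketch under some assumptions: `duplicate_groups` is a hypothetical helper name that does not appear in the script, it uses Ruby's `Digest::MD5` instead of the `md5sum` command, and `transform_values` needs Ruby 2.4 or newer.

```ruby
require 'digest'
require 'find'

# Hypothetical helper (not part of the original script): maps each MD5 hex
# digest to the sorted list of regular files under `root` that share it,
# keeping only digests that occur more than once.
def duplicate_groups(root)
  # Skip directories and symlinks, mirroring `find -type f`.
  files = Find.find(root).select { |path| File.file?(path) && !File.symlink?(path) }

  files.group_by { |path| Digest::MD5.file(path).hexdigest }
       .transform_values(&:sort)
       .select { |_sum, paths| paths.length > 1 }
end
```

Calling `duplicate_groups('some/dir')` returns a hash of digest to file paths, so the first path in each list could be kept and the rest reviewed by hand. Because paths never pass through a shell, file names with spaces need no escaping.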