@samuelkadolph
Created June 23, 2011 05:25
Find duplicate files by MD5 digest. Works best on JRuby, whose threads run in parallel (true multithreading).
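require "digest/md5"
require "etc"
require "thread"

# Work list of candidate paths and the digest => paths map, each guarded by its own mutex.
files = Dir["**/*"]
files_mutex = Mutex.new
hashes = Hash.new { |h, k| h[k] = [] }
hashes_mutex = Mutex.new

# One worker per logical CPU (Etc.nprocessors needs Ruby 2.2+).
processors = Etc.nprocessors

processors.times.map do
  Thread.start do
    digest = Digest::MD5.new
    loop do
      # Take the next path off the shared list; stop when it is empty.
      file = files_mutex.synchronize { files.shift }
      break unless file
      next unless File.file?(file)

      # Hash the file in 4 KiB chunks; readpartial raises EOFError at end of file.
      File.open(file, "rb") do |f|
        begin
          loop { digest << f.readpartial(4096) }
        rescue EOFError
        end
      end

      hash = digest.hexdigest
      hashes_mutex.synchronize { hashes[hash] << file }
      digest.reset
    end
  end
end.each(&:join)

# Any digest shared by more than one path marks a set of duplicates.
hashes.each do |_hash, paths|
  puts "Duplicate: #{paths.join(", ")}" if paths.size > 1
end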
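
The same idea can be tried out on a scratch directory. The following is only a sketch under a few assumptions: `find_duplicates` and its `workers:` parameter are hypothetical names that do not appear in the gist, a thread-safe `Queue` with one `nil` stop marker per worker stands in for the mutex-guarded array, and `Digest::MD5.file` replaces the manual `readpartial` loop.

```ruby
require "digest/md5"
require "tmpdir"

# Hypothetical helper (not part of the gist): groups paths by MD5 digest
# using a fixed pool of worker threads fed from a thread-safe Queue.
def find_duplicates(paths, workers: 4)
  queue = Queue.new
  paths.each { |path| queue << path }
  workers.times { queue << nil } # one stop marker per worker

  groups = Hash.new { |h, k| h[k] = [] }
  groups_mutex = Mutex.new

  Array.new(workers) do
    Thread.new do
      # pop blocks until an item is available; nil ends this worker.
      while (path = queue.pop)
        next unless File.file?(path)
        digest = Digest::MD5.file(path).hexdigest
        groups_mutex.synchronize { groups[digest] << path }
      end
    end
  end.each(&:join)

  # Keep only digests seen more than once; sort for stable output.
  groups.values.select { |group| group.size > 1 }.map(&:sort)
end

# Usage: two identical files and one distinct file in a scratch directory.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "a.txt"), "same")
  File.write(File.join(dir, "b.txt"), "same")
  File.write(File.join(dir, "c.txt"), "different")
  find_duplicates(Dir[File.join(dir, "*")]).each do |group|
    puts "Duplicate: #{group.map { |f| File.basename(f) }.join(", ")}"
  end
end
# prints "Duplicate: a.txt, b.txt"
```

On MRI the global interpreter lock keeps the hashing threads from running Ruby code in parallel, so the speed-up from extra workers is likely to be most visible on JRuby, as the description suggests.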