@georgy7
Forked from ma11hew28/find-duplicate-files.rb
Last active March 8, 2017 18:09
#! /usr/bin/ruby
require 'rubygems'
require 'digest/md5'
require 'json'
# Usage:
# 1. In a console, change to the folder you want to search for duplicates.
# 2. Run the script without any arguments.
# 3. Watch the progress.
# 4. Get your dublicates.json file.
#
# https://gist.github.com/georgy7/a8ab4d5a2e90282b189c
# Forked from https://gist.github.com/mattdipasquale/571405
# Dot (Unix hidden) files and folders are ignored.
# Warning: This script is *very* I/O intensive. It can slow your PC to a crawl.
# It's provided 'as-is', without any express or implied warranty, etc.
hash = {}
output = 'dublicates.json'
fail "#{output} already exists" if File.exist?(output)
puts 'Exploring subdirectories. It may take a long time.'
counter = 0
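Dir.glob('**/*').each do |filename|
  next if File.directory?(filename)
  puts "Start!\n" if counter < 1
  begin
    # Group file names by the MD5 digest of their contents.
    key = Digest::MD5.file(filename).to_s
    if hash.key? key
      hash[key].push filename
    else
      hash[key] = [filename]
    end
  rescue
    # Files that cannot be read are reported and left out of the result.
    puts "Error processing #{filename}"
  end
  counter += 1
  # Short random pause, presumably to ease the I/O load a little.
  sleep(0.005 * rand)
  puts "#{counter} calculated (#{filename})." if 0 == counter % 1000
end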
puts "\nWriting #{output}"
counter = 0
File.open(output, 'w') do |f|
  f.puts '['
  hash.each do |key, filename_array|
    next if filename_array.length <= 1
    record = {}
    record['files'] = filename_array
    record['md5'] = key
    f.puts ',' if counter > 0
    f.write JSON.pretty_generate(record)
    counter += 1
  end
  f.puts "\n]"
end
puts "Done.\n"
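
The script writes a JSON array in which each record has a `files` list and an `md5` string. As a minimal sketch of how that report could be consumed afterwards, the snippet below parses such an array and, for each group, separates the first path from the remaining copies. The helper name `summarize_duplicates` and the sample data are hypothetical and not part of the original script; treating the first path as the one to keep is only an assumption made for illustration.

```ruby
require 'json'

# Hypothetical helper (not part of the original script): takes the text of a
# report in the format written above and returns one summary hash per group.
def summarize_duplicates(json_text)
  JSON.parse(json_text).map do |record|
    files = record['files']
    {
      'md5' => record['md5'],
      'keep' => files.first,   # assumption: treat the first path as the original
      'extra' => files.drop(1) # every other path with the same digest
    }
  end
end

# Made-up sample data in the same shape as the report.
sample = <<~JSON
  [
    {
      "files": ["photos/a.jpg", "backup/a.jpg"],
      "md5": "0123456789abcdef0123456789abcdef"
    }
  ]
JSON

summarize_duplicates(sample).each do |group|
  puts "#{group['md5']}: keeping #{group['keep']}"
  group['extra'].each { |path| puts "  duplicate: #{path}" }
end
```

To use it on a real report, pass `File.read('dublicates.json')` instead of the sample string. Matching MD5 digests strongly suggest, but do not strictly prove, identical contents, so it may be worth comparing the files before deleting anything.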