
Solution suggestion for "Help with ideas for finding dups on very large file"

find-duplicate-id-instance.rb
Ruby
#!/usr/bin/ruby -w
 
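# An (id, instance) pair. Struct instances compare and hash by their members,
# so they can be used directly as Hash keys.
Entry = Struct.new :id, :instance do
  def self.parse(line)
    if /ID=\s*'([^']*)'\s+INSTANCE=\s*'([^']*)'/ =~ line
      new $1, $2
    else
      raise "Cannot parse: %p" % line
    end
  end
end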
 
# Phase 1: count occurrences of all pairs
entries = Hash.new 0
 
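ARGV.each do |file|
  File.foreach file do |line|
    begin
      entries[Entry.parse(line)] += 1
    rescue RuntimeError
      # Ignore lines that do not parse. An explicit rescue is used because a
      # trailing `rescue nil` on an assignment may guard only the right-hand side.
    end
  end
end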
 
# Optional: drop pairs seen only once to save memory before the second pass
entries.delete_if { |_entry, count| count < 2 }
 
# Phase 2: print only dupes
ARGV.each do |file|
  File.foreach file do |line|
    puts line if entries[Entry.parse(line)] > 1 rescue nil # ignore unparseable lines
  end
end
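
For trying the approach without any input files, here is a minimal sketch of the same two-pass idea applied to an in-memory array of lines. The helper name `duplicate_lines`, the constant `PAIR`, and the sample lines are hypothetical and not part of the original script; only the regular expression is taken from it.

```ruby
# Hypothetical helper, not part of the gist: the same two-pass duplicate
# search, run over an array of lines instead of the files named in ARGV.
PAIR = /ID=\s*'([^']*)'\s+INSTANCE=\s*'([^']*)'/

def duplicate_lines(lines)
  counts = Hash.new(0)

  # Pass 1: count each (id, instance) pair; lines that do not match are skipped.
  lines.each do |line|
    m = PAIR.match(line)
    counts[[m[1], m[2]]] += 1 if m
  end

  # Pass 2: keep only the lines whose pair occurred more than once.
  lines.select do |line|
    m = PAIR.match(line)
    m && counts[[m[1], m[2]]] > 1
  end
end

# Made-up sample input, for illustration only.
sample = [
  "ID='a' INSTANCE='1'",
  "ID='b' INSTANCE='1'",
  "ID='a' INSTANCE='1' extra",
  "not a record",
]
puts duplicate_lines(sample)
```

With this sample, the first and third lines are printed, since they share the same ID and INSTANCE values. Returning no match instead of raising keeps the skip logic free of exception handling, at the cost of silently ignoring malformed lines.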
