find-duplicate-id-instance.rb
Ruby
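#!/usr/bin/ruby -w
require 'set'

# An (id, instance) pair. Struct compares and hashes by member values,
# so equal pairs count as the same element in a Set.
Entry = Struct.new :id, :instance do
  def self.parse(line)
    if /ID=\s*'([^']*)'\s+INSTANCE=\s*'([^']*)'/ =~ line
      new $1, $2
    else
      raise "Cannot parse: %p" % line
    end
  end
end

# All entries seen so far, across every input file
entries = Set.new

ARGV.each do |file|
  File.foreach file do |line|
    begin
      entry = Entry.parse(line)

      if entries.include? entry
        # Seen before: print the duplicate line
        puts line
      else
        entries << entry
      end
    rescue
      # Ignore lines that don't parse
    end
  end
end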
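
The duplicate check relies on `Struct` defining `eql?` and `hash` from its members, so two separately parsed entries with the same ID and INSTANCE land on the same `Set` element. Here is a minimal sketch of that behaviour; the sample lines are invented for illustration and are not taken from the real input:

```ruby
require 'set'

# Same Struct as in the script above.
Entry = Struct.new(:id, :instance) do
  def self.parse(line)
    if /ID=\s*'([^']*)'\s+INSTANCE=\s*'([^']*)'/ =~ line
      new($1, $2)
    else
      raise "Cannot parse: %p" % line
    end
  end
end

# Hypothetical input lines (assumed format, for illustration only).
first  = Entry.parse("ID='42' INSTANCE='alpha'")
again  = Entry.parse("prefix ID= '42'   INSTANCE= 'alpha' suffix")
other  = Entry.parse("ID='42' INSTANCE='beta'")

seen = Set.new
seen << first

puts seen.include?(again)  # same id and instance, different object
puts seen.include?(other)  # same id, different instance
```

The first check prints `true` and the second `false`: membership depends only on the two captured values, not on the surrounding text of the line or on object identity.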

With 1,000,000 entries and 452 duplicates:

$ /usr/bin/time -v ruby doit.rb input > /dev/null
    Command being timed: "ruby doit.rb input"
    User time (seconds): 9.78
    System time (seconds): 0.14
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:09.93
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 913760
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 57211
    Voluntary context switches: 3
    Involuntary context switches: 1013
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

And with Robert's original:

$ /usr/bin/time -v ruby doit_orig.rb input > /dev/null
    Command being timed: "ruby doit_orig.rb input"
    User time (seconds): 16.28
    System time (seconds): 0.19
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:16.50
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 913344
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 57191
    Voluntary context switches: 3
    Involuntary context switches: 1656
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

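So no memory savings (maximum resident set size is about 913,000 kbytes in both runs), but it is faster: user time drops from 16.28 s to 9.78 s, roughly 40% less.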
