@avsej

avsej/cleanup.rb Secret

Created October 11, 2012 05:05
Sample code for blog about duplication
#!/usr/bin/env ruby
# encoding: utf-8
require 'rubygems'
require 'couchbase'
require 'optparse'

# just output extra empty line on CTRL-C
trap("INT") do
  STDERR.puts
  exit
end

options = {
  :bucket => "default",
  :design_document => "users",
  :view => "all",
  :hostname => nil,
  :username => nil,
  :password => nil
}

OptionParser.new do |opts|
  opts.banner = "Usage: cleanup.rb [options]"
  opts.on("-h", "--hostname HOSTNAME", "Hostname to connect to (default: 127.0.0.1:8091)") do |v|
    host, port = v.split(':')
    options[:hostname] = host.empty? ? '127.0.0.1' : host
    options[:port] = port.to_i > 0 ? port.to_i : 8091
  end
  opts.on("-u", "--user USERNAME", "Username to log with (default: none)") do |v|
    options[:username] = v
  end
  opts.on("-p", "--password PASSWORD", "Password to log with (default: none)") do |v|
    options[:password] = v
  end
  opts.on("-b", "--bucket NAME", "Name of the bucket to connect to (default: #{options[:bucket]})") do |v|
    options[:bucket] = v
  end
  opts.on("-d", "--design-document DDOC", "Name of the design document containing the view (default: #{options[:design_document]})") do |v|
    options[:design_document] = v
  end
  opts.on("-v", "--view VIEW", "Name of the view (default: #{options[:view]})") do |v|
    options[:view] = v
  end
  opts.on_tail("-?", "--help", "Show this message") do
    puts opts
    exit
  end
end.parse!

connection = Couchbase.connect(options)
ddoc = connection.design_docs[options[:design_document]]
view = ddoc.send(options[:view])
connection.run do
  # with :group => true each row carries one key and the list of IDs sharing it
  view.each(:group => true) do |doc|
    dup_num = doc.value.size
    if dup_num > 1
      puts "left doc #{doc.value[0]}, "
      # delete documents from second to last
      connection.delete(doc.value[1..-1])
      # the first document is kept, so dup_num - 1 documents were removed
      puts "removed #{dup_num - 1} duplicate(s)"
    end
  end
end
#!/usr/bin/env ruby
# encoding: utf-8
require 'rubygems'
require 'couchbase'
# more info about faker: http://faker.rubyforge.org
require 'faker'
require 'optparse'

# just output extra empty line on CTRL-C
trap("INT") do
  STDERR.puts
  exit
end

options = {
  :total_records => 10_000,
  :duplicate_rate => 30,
  :bucket => "default",
  :hostname => nil,
  :username => nil,
  :password => nil
}

OptionParser.new do |opts|
  opts.banner = "Usage: generate.rb [options]"
  opts.on("-h", "--hostname HOSTNAME", "Hostname to connect to (default: 127.0.0.1:8091)") do |v|
    host, port = v.split(':')
    options[:hostname] = host.empty? ? '127.0.0.1' : host
    options[:port] = port.to_i > 0 ? port.to_i : 8091
  end
  opts.on("-u", "--user USERNAME", "Username to log with (default: none)") do |v|
    options[:username] = v
  end
  opts.on("-p", "--passwd PASSWORD", "Password to log with (default: none)") do |v|
    options[:password] = v
  end
  opts.on("-b", "--bucket NAME", "Name of the bucket to connect to (default: #{options[:bucket]})") do |v|
    options[:bucket] = v
  end
  opts.on("-t", "--total-records NUM", Integer, "The total number of the records to generate (default: #{options[:total_records]})") do |v|
    options[:total_records] = v
  end
  opts.on("-d", "--duplicate-rate NUM", Integer, "Each NUM-th record will be duplicate (default: #{options[:duplicate_rate]})") do |v|
    options[:duplicate_rate] = v
  end
  opts.on_tail("-?", "--help", "Show this message") do
    puts opts
    exit
  end
end.parse!

connection = Couchbase.connect(options)
document = nil
options[:total_records].times do |n|
  STDERR.printf("%10d / %d\r", n + 1, options[:total_records])
  # generate a fresh document unless this is a NUM-th record,
  # in which case the previous document is stored again under a new key
  if n % options[:duplicate_rate] != 0 || !document
    document = {
      :first_name => Faker::Name.first_name,
      :last_name => Faker::Name.last_name,
      :postal_code => Faker::Address.zip_code,
    }
  end
  connection.set("id#{n}", document)
end
STDERR.puts
function (doc, meta) {
  emit([doc.first_name + doc.last_name + doc.postal_code], meta.id);
}

You have duplicate content and want none of it? ... Nada. Zip. Zilch!

As a savvy consumer, you probably love 2-for-1 offers on things you buy frequently. Normally that's a great deal, but it doesn't work the same way for data (imagine building an item catalog system that needs clean, non-duplicate data). Whether you're combining data from two different sources, recording multiple purchases from the same customer, or you just space out for a minute and submit the same web form twice, it seems like everyone faces the problem of duplicate entries at one point or another.

In this blog post, we'll look at removing duplicate documents stored in Couchbase Server in 3 easy steps. For the sake of this example, assume each document has three common user-specified fields: first_name, last_name, and postal_code. To generate sample data you can use the generator script, generate.rb. It depends on the faker gem, so install that first with gem install faker. Here is an execution sample:

$ ruby ./generate.rb --help
Usage: generate.rb [options]
    -h, --hostname HOSTNAME          Hostname to connect to (default: 127.0.0.1:8091)
    -u, --user USERNAME              Username to log with (default: none)
    -p, --passwd PASSWORD            Password to log with (default: none)
    -b, --bucket NAME                Name of the bucket to connect to (default: default)
    -t, --total-records NUM          The total number of the records to generate (default: 10000)
    -d, --duplicate-rate NUM         Each NUM-th record will be duplicate (default: 30)
    -?, --help                       Show this message
$ ruby ./generate.rb -t 1000 -d 5
      1000 / 1000
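
To see which keys end up holding duplicates, here is a minimal sketch of the generator's loop with the Couchbase and faker calls left out. `duplicate_ids` is a hypothetical helper, not part of generate.rb; it only mirrors the `n % duplicate_rate` test used in the script.

```ruby
# Hypothetical helper mirroring the loop in generate.rb: a record reuses the
# previous document when n is a multiple of duplicate_rate (except at n == 0,
# where no previous document exists yet).
def duplicate_ids(total_records, duplicate_rate)
  ids = []
  have_document = false
  total_records.times do |n|
    if n % duplicate_rate != 0 || !have_document
      have_document = true   # generate.rb would build a fresh document here
    else
      ids << "id#{n}"        # this key stores a copy of the previous document
    end
  end
  ids
end

p duplicate_ids(12, 5)   # prints ["id5", "id10"]
```

Under the same logic, the sample run with -t 1000 -d 5 would store 199 copies, one for each multiple of 5 between 5 and 995.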

Each document in Couchbase has a user-specified key, which is accessible as meta.id in the map function of the view.

Step 1

Write a custom map function that builds a key from the fields that define a duplicate (first_name, last_name, and postal_code in this case) and emits the document ID (meta.id) as the value, so that documents with matching fields end up under the same key.

function (doc, meta) {
  emit([doc.first_name + doc.last_name + doc.postal_code], meta.id);
}
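
To make the emitted rows concrete, the sketch below replays the map step in plain Ruby on three made-up documents. `map_row` is a hypothetical stand-in for the JavaScript function above, and the sample names and postal codes are invented for illustration.

```ruby
# Ruby stand-in for the JavaScript map function (not a Couchbase API):
# returns the [key, value] pair that emit(key, value) would produce.
def map_row(id, doc)
  [[doc[:first_name] + doc[:last_name] + doc[:postal_code]], id]
end

# Made-up sample documents keyed by document ID.
docs = {
  "id0" => { :first_name => "Ann", :last_name => "Lee", :postal_code => "94040" },
  "id1" => { :first_name => "Ann", :last_name => "Lee", :postal_code => "94040" },
  "id2" => { :first_name => "Bob", :last_name => "Ray", :postal_code => "10001" }
}

rows = docs.map { |id, doc| map_row(id, doc) }
rows.each { |key, value| puts "#{key.inspect} => #{value}" }
```

Here id0 and id1 share the key ["AnnLee94040"], which is what lets the view group them. One caveat worth keeping in mind: because the fields are concatenated without a separator, different field values can produce the same key ("Jo" + "Ann" and "JoA" + "nn" both give "JoAnn"), so unrelated documents could in principle be grouped together.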

Step 2

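Write a reduce function that collects the document IDs emitted for each key into a single list. The length of that list is the number of documents sharing the key.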

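function (keys, values, rereduce) {
  if (rereduce) {
    // values holds lists returned by earlier reduce calls; merge them into one
    var res = [];
    for (var i = 0; i < values.length; i++) {
      res = res.concat(values[i]);
    }
    return res;
  } else {
    // first pass: values is already the list of emitted document IDs
    return values;
  }
}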
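
The two branches can be traced in plain Ruby. This is only a sketch: `reduce` below is a stand-in for the view's reduce function, not a Couchbase API, and the document IDs are invented.

```ruby
# Ruby stand-in for the view's reduce function (not a Couchbase API).
# First pass (rereduce == false): values are document IDs emitted by the map.
# Rereduce pass: values are the lists returned by earlier reduce calls.
def reduce(values, rereduce)
  if rereduce
    values.inject([]) { |res, partial| res.concat(partial) }
  else
    values
  end
end

partial_a = reduce(["id4", "id5"], false)   # IDs reduced in one batch
partial_b = reduce(["id77"], false)         # IDs for the same key from another batch
p reduce([partial_a, partial_b], true)      # prints ["id4", "id5", "id77"]
```

Whichever path the server takes, the final value for a key is one flat list of IDs, which is why the cleaner can use its size as the duplicate count.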

Step 3

For every key whose list holds more than one document ID, keep the first document and delete the rest.
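
The selection rule for this step can be sketched without a server. `ids_to_delete` is a hypothetical helper and the grouped rows are made up; the real cleaner reads them from the view with :group => true.

```ruby
# Hypothetical grouped view result: key => list of document IDs sharing that key.
grouped = {
  ["AnnLee94040"] => ["id0", "id1", "id7"],
  ["BobRay10001"] => ["id2"]
}

# Hypothetical helper: keep the first ID of each group and
# collect every remaining ID for deletion.
def ids_to_delete(grouped)
  grouped.values.flat_map { |ids| ids.drop(1) }
end

p ids_to_delete(grouped)   # prints ["id1", "id7"]
```

Keys with a single ID contribute nothing to the list, so unique documents are left alone.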

Result

The core part of the cleaner:

require 'couchbase'

connection = Couchbase.connect(options)
ddoc = connection.design_docs[options[:design_document]]
view = ddoc.send(options[:view])
connection.run do
  view.each(:group => true) do |doc|
    dup_num = doc.value.size
    if dup_num > 1
      puts "left doc #{doc.value[0]}, "
      # delete documents from second to last
      connection.delete(doc.value[1..-1])
      puts "removed #{dup_num - 1} duplicate(s)"
    end
  end
end

You can find the complete source here: https://gist.github.com/859df6c1db21a9bb561b#file_cleanup.rb.

Execution sample:

$ ruby ./cleanup.rb --help
Usage: cleanup.rb [options]
    -h, --hostname HOSTNAME          Hostname to connect to (default: 127.0.0.1:8091)
    -u, --user USERNAME              Username to log with (default: none)
    -p, --password PASSWORD          Password to log with (default: none)
    -b, --bucket NAME                Name of the bucket to connect to (default: default)
    -d, --design-document DDOC       Name of the design document containing the view (default: users)
    -v, --view VIEW                  Name of the view (default: all)
    -?, --help                       Show this message
$ ruby ./cleanup.rb -d users -v all
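function (keys, values, rereduce) {
  if (rereduce) {
    // values holds lists returned by earlier reduce calls; merge them into one
    var res = [];
    for (var i = 0; i < values.length; i++) {
      res = res.concat(values[i]);
    }
    return res;
  } else {
    // first pass: values is already the list of emitted document IDs
    return values;
  }
}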