
Sample code for blog about duplication

cleanup.rb
Ruby
#!/usr/bin/env ruby
# encoding: utf-8
 
require 'rubygems'
require 'couchbase'
require 'optparse'
 
# just output extra empty line on CTRL-C
trap("INT") do
  STDERR.puts
  exit
end

options = {
  :bucket => "default",
  :design_document => "users",
  :view => "all",
  :hostname => nil,
  :username => nil,
  :password => nil
}

OptionParser.new do |opts|
  opts.banner = "Usage: cleanup.rb [options]"
  opts.on("-h", "--hostname HOSTNAME", "Hostname to connect to (default: 127.0.0.1:8091)") do |v|
    host, port = v.split(':')
    options[:hostname] = host.empty? ? '127.0.0.1' : host
    options[:port] = port.to_i > 0 ? port.to_i : 8091
  end
  opts.on("-u", "--user USERNAME", "Username to log with (default: none)") do |v|
    options[:username] = v
  end
  opts.on("-p", "--password PASSWORD", "Password to log with (default: none)") do |v|
    options[:password] = v
  end
  opts.on("-b", "--bucket NAME", "Name of the bucket to connect to (default: #{options[:bucket]})") do |v|
    options[:bucket] = v
  end
  opts.on("-d", "--design-document DDOC", "Name of the design document containing the view (default: #{options[:design_document]})") do |v|
    options[:design_document] = v
  end
  opts.on("-v", "--view VIEW", "Name of the view (default: #{options[:view]})") do |v|
    options[:view] = v
  end
  opts.on_tail("-?", "--help", "Show this message") do
    puts opts
    exit
  end
end.parse!
 
connection = Couchbase.connect(options)
ddoc = connection.design_docs[options[:design_document]]
view = ddoc.send(options[:view])
connection.run do
  view.each(:group => true) do |doc|
    # doc.value holds the IDs of all documents sharing this key
    dup_num = doc.value.size
    if dup_num > 1
      puts "left doc #{doc.value[0]}, "
      # delete documents from second to last
      connection.delete(doc.value[1..-1])
      # the first document is kept, so dup_num - 1 were removed
      puts "removed #{dup_num - 1} duplicate(s)"
    end
  end
end
generate.rb
Ruby
#!/usr/bin/env ruby
# encoding: utf-8
 
require 'rubygems'
require 'couchbase'
# more info about faker: http://faker.rubyforge.org
require 'faker'
require 'optparse'
 
# just output extra empty line on CTRL-C
trap("INT") do
  STDERR.puts
  exit
end

options = {
  :total_records => 10_000,
  :duplicate_rate => 30,
  :bucket => "default",
  :hostname => nil,
  :username => nil,
  :password => nil
}

OptionParser.new do |opts|
  opts.banner = "Usage: generate.rb [options]"
  opts.on("-h", "--hostname HOSTNAME", "Hostname to connect to (default: 127.0.0.1:8091)") do |v|
    host, port = v.split(':')
    options[:hostname] = host.empty? ? '127.0.0.1' : host
    options[:port] = port.to_i > 0 ? port.to_i : 8091
  end
  opts.on("-u", "--user USERNAME", "Username to log with (default: none)") do |v|
    options[:username] = v
  end
  opts.on("-p", "--passwd PASSWORD", "Password to log with (default: none)") do |v|
    options[:password] = v
  end
  opts.on("-b", "--bucket NAME", "Name of the bucket to connect to (default: #{options[:bucket]})") do |v|
    options[:bucket] = v
  end
  opts.on("-t", "--total-records NUM", Integer, "The total number of the records to generate (default: #{options[:total_records]})") do |v|
    options[:total_records] = v
  end
  opts.on("-d", "--duplicate-rate NUM", Integer, "Each NUM-th record will be duplicate (default: #{options[:duplicate_rate]})") do |v|
    options[:duplicate_rate] = v
  end
  opts.on_tail("-?", "--help", "Show this message") do
    puts opts
    exit
  end
end.parse!
 
connection = Couchbase.connect(options)
document = nil
options[:total_records].times do |n|
  STDERR.printf("%10d / %d\r", n + 1, options[:total_records])
  # build a fresh document unless n is a multiple of the duplicate rate;
  # otherwise the previous document is stored again under a new ID
  if n % options[:duplicate_rate] != 0 || !document
    document = {
      :first_name => Faker::Name.first_name,
      :last_name => Faker::Name.last_name,
      :postal_code => Faker::Address.zip_code,
    }
  end
  connection.set("id#{n}", document)
end
STDERR.puts
map.js
JavaScript
function (doc, meta) {
  emit([doc.first_name + doc.last_name + doc.postal_code], meta.id);
}
post.markdown
Markdown

You have duplicate content and want none of it? ... Nada. Zip. Zilch!

As a savvy consumer, you probably love hearing about 2-for-1 offers on something you buy frequently. Normally that's a great deal, but it doesn't work the same way for data. (Imagine building an item catalog system, where you need clean, non-duplicate data.) Whether you're combining data from two different sources, you have multiple purchases from the same customer, or you just space out for a minute and enter the same data in a web form twice, it seems like everyone faces the problem of duplicate entries at one point or another.

In this blog post, we'll look at removing duplicate documents stored in Couchbase Server in 3 easy steps. For the sake of this example, assume each document has three common user-specified fields: first_name, last_name, and postal_code. To generate sample data you can use the generate.rb script. It depends on the faker gem, so install that first with gem install faker. Here is an execution sample:

$ ruby ./generate.rb --help
Usage: generate.rb [options]
    -h, --hostname HOSTNAME          Hostname to connect to (default: 127.0.0.1:8091)
    -u, --user USERNAME              Username to log with (default: none)
    -p, --passwd PASSWORD            Password to log with (default: none)
    -b, --bucket NAME                Name of the bucket to connect to (default: default)
    -t, --total-records NUM          The total number of the records to generate (default: 10000)
    -d, --duplicate-rate NUM         Each NUM-th record will be duplicate (default: 30)
    -?, --help                       Show this message
$ ruby ./generate.rb -t 1000 -d 5
      1000 / 1000
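
To make the duplication rule concrete, here is a minimal sketch of the generator's loop with the Couchbase and Faker calls taken out. The build_records helper and the placeholder field values are assumptions made for illustration; only the condition on n mirrors generate.rb.

```ruby
# Sketch of the duplication rule in generate.rb, without Couchbase or Faker.
# build_records and the placeholder field values are hypothetical.
def build_records(total, duplicate_rate)
  document = nil
  fresh = 0
  (0...total).map do |n|
    # Same condition as generate.rb: build a new document unless n is a
    # multiple of duplicate_rate (the very first record is always new).
    if n % duplicate_rate != 0 || document.nil?
      fresh += 1
      document = {
        :first_name  => "First#{fresh}",
        :last_name   => "Last#{fresh}",
        :postal_code => format("%05d", fresh)
      }
    end
    ["id#{n}", document]
  end
end

records = build_records(10, 5)
records.each { |id, doc| puts "#{id}: #{doc[:first_name]}" }
```

With 10 records and a rate of 5, id5 reuses the document stored under id4, so the run produces 9 distinct documents.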

Each document in Couchbase has a user-specified key, which is accessible as meta.id in the map function of a view.

Step 1

Write a custom map function that emits the document ID (meta.id) of every document, keyed by the fields that define a duplicate (first_name, last_name, and postal_code in this case).

function (doc, meta) {
  emit([doc.first_name + doc.last_name + doc.postal_code], meta.id);
}
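
To see what the map step produces, here is a rough plain-Ruby simulation that needs no server. The sample documents, their IDs, and the duplicate_key helper are made up for illustration; only the key expression follows the map function.

```ruby
# Plain-Ruby simulation of the map step; the data below is hypothetical.
docs = {
  "id0" => { "first_name" => "Ann", "last_name" => "Lee", "postal_code" => "10001" },
  "id1" => { "first_name" => "Ann", "last_name" => "Lee", "postal_code" => "10001" },
  "id2" => { "first_name" => "Bob", "last_name" => "Ray", "postal_code" => "94105" }
}

# Same key the map function emits: the three fields concatenated,
# wrapped in a one-element array.
def duplicate_key(doc)
  [doc["first_name"] + doc["last_name"] + doc["postal_code"]]
end

# Mimic emit(key, meta.id): one [key, id] row per document.
rows = docs.map { |id, doc| [duplicate_key(doc), id] }

# Collect the IDs that share a key, roughly what a grouped query returns.
grouped = Hash.new { |hash, key| hash[key] = [] }
rows.each { |key, id| grouped[key] << id }

grouped.each { |key, ids| puts "#{key.inspect} => #{ids.inspect}" }
```

Documents id0 and id1 land under the same key, which is what marks them as duplicates. Note that plain concatenation has no separator, so in principle two different field combinations could produce the same key.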

Step 2

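Write a reduce function that collects the document IDs emitted for each key into a single list. The length of that list is the number of documents that share the key.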

function (keys, values, rereduce) {
  if (rereduce) {
    var res = [];
    for (var i = 0; i < values.length; i++){
      res = res.concat(values[i]);
    }
    return res;
  } else {
    return values;
  }
}
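
The two branches of the reduce step can be easier to follow in a plain-Ruby rendering. This is only an illustrative sketch: on the server the function runs as JavaScript, and the document IDs here are invented.

```ruby
# Ruby rendering of the reduce logic, for illustration only.
def reduce(_keys, values, rereduce)
  if rereduce
    # values is a list of arrays returned by earlier reduce calls;
    # merge them into one flat list of document IDs.
    values.flatten(1)
  else
    # values is already a flat list of document IDs for this key.
    values
  end
end

first_pass  = reduce(nil, ["id0", "id1"], false)
second_pass = reduce(nil, ["id5"], false)
merged      = reduce(nil, [first_pass, second_pass], true)

puts merged.inspect
```

The rereduce branch matters because the server may reduce one key's rows in several batches and then combine the partial results.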

Step 3

Drop the duplicates: for each key whose list holds more than one document ID, keep the first document and delete the rest.
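
As a small sketch of that selection, the snippet below works on an in-memory hash instead of a live view. The groups data and the ids_to_delete helper are hypothetical, and nothing is actually deleted.

```ruby
# Hypothetical grouped view output: key => list of document IDs.
groups = {
  "AnnLee10001" => ["id0", "id1", "id2"],
  "BobRay94105" => ["id3"]
}

# Hypothetical helper: for every key with more than one ID,
# keep the first ID and return all the others.
def ids_to_delete(groups)
  groups.values
        .select { |ids| ids.size > 1 }
        .flat_map { |ids| ids[1..-1] }
end

doomed = ids_to_delete(groups)
puts "would delete: #{doomed.join(', ')}"
```

Printing the list first, before handing it to a real delete call, is a cheap way to do a dry run.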

Result

The core part of the cleaner:

require 'couchbase'

connection = Couchbase.connect(options)
ddoc = connection.design_docs[options[:design_document]]
view = ddoc.send(options[:view])
connection.run do
  view.each(:group => true) do |doc|
    dup_num = doc.value.size
    if dup_num > 1
      puts "left doc #{doc.value[0]}, "
      # delete documents from second to last
      connection.delete(doc.value[1..-1])
      # the first document is kept, so dup_num - 1 were removed
      puts "removed #{dup_num - 1} duplicate(s)"
    end
  end
end

You can find the complete source here: https://gist.github.com/859df6c1db21a9bb561b#file_cleanup.rb.

Execution sample:

$ ruby ./cleanup.rb --help
Usage: cleanup.rb [options]
    -h, --hostname HOSTNAME          Hostname to connect to (default: 127.0.0.1:8091)
    -u, --user USERNAME              Username to log with (default: none)
    -p, --password PASSWORD          Password to log with (default: none)
    -b, --bucket NAME                Name of the bucket to connect to (default: default)
    -d, --design-document DDOC       Name of the design document containing the view (default: users)
    -v, --view VIEW                  Name of the view (default: all)
    -?, --help                       Show this message
$ ruby ./cleanup.rb -d users -v all
reduce.js
JavaScript
function (keys, values, rereduce) {
  if (rereduce) {
    // values is a list of arrays from earlier reduce calls; merge them
    var res = [];
    for (var i = 0; i < values.length; i++){
      res = res.concat(values[i]);
    }
    return res;
  } else {
    return values;
  }
}
