As a savvy consumer, you probably love 2-for-1 offers on something you buy frequently. Normally that's a great deal, but it doesn't work the same way for data. Imagine, for example, building an item catalog system that needs clean, non-duplicate data. Whether you're combining data from two different sources, you have multiple purchases from the same customer, or you just space out for a minute and enter the same data in a web form twice, nearly everyone faces the problem of duplicate entries at one point or another.
In this blog post, we'll look at removing duplicate documents stored in
Couchbase Server in 3 easy steps. For the sake of this example, assume
each document has three common user-specified fields: first_name,
last_name, and postal_code. To generate sample data you can use the
generator script (generate.rb). The script depends on the faker gem, so
install it first with gem install faker. Here is an execution sample:
$ ruby ./generate.rb --help
Usage: generate.rb [options]
-h, --hostname HOSTNAME Hostname to connect to (default: 127.0.0.1:8091)
-u, --user USERNAME Username to log with (default: none)
-p, --passwd PASSWORD Password to log with (default: none)
-b, --bucket NAME Name of the bucket to connect to (default: default)
-t, --total-records NUM The total number of the records to generate (default: 10000)
-d, --duplicate-rate NUM Each NUM-th record will be duplicate (default: 30)
-?, --help Show this message
$ ruby ./generate.rb -t 1000 -d 5
1000 / 1000
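To make the --duplicate-rate option more concrete, here is a minimal sketch of how such test data could be produced. It is not the actual generate.rb: it skips Faker and Couchbase entirely, and the name lists, the sample_records helper, and the fixed random seed are all assumptions made for illustration.

```ruby
# Hypothetical stand-in for generate.rb: no Faker, no Couchbase connection.
FIRST_NAMES = %w[Alice Bob Carol Dave].freeze
LAST_NAMES  = %w[Smith Jones Brown Clark].freeze

# Builds `total` records; every `duplicate_rate`-th record copies an earlier one.
def sample_records(total, duplicate_rate, rng = Random.new(42))
  records = []
  1.upto(total) do |n|
    if (n % duplicate_rate).zero? && !records.empty?
      # Re-use a random earlier record to create a duplicate
      records << records.sample(random: rng).dup
    else
      records << {
        "first_name"  => FIRST_NAMES.sample(random: rng),
        "last_name"   => LAST_NAMES.sample(random: rng),
        "postal_code" => format("%05d", rng.rand(100_000))
      }
    end
  end
  records
end

records = sample_records(20, 5)
puts "#{records.length} records, #{records.uniq.length} distinct"
```

With a rate of 5, records 5, 10, 15, and 20 are copies, so at most 16 of the 20 records are distinct.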
Each document in Couchbase has a user-specified key, which is accessible
as meta.id in the map function of the view.
Write a custom map function that emits the document ID (meta.id) of
every document, keyed by the fields that define a duplicate
(first_name, last_name, and postal_code in this case):
function (doc, meta) {
  // Emit the fields as a compound key so that adjacent values cannot run together
  emit([doc.first_name, doc.last_name, doc.postal_code], meta.id);
}
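To see what a map function like this produces, the following plain-Ruby sketch imitates the map step on a few in-memory documents. It does not talk to Couchbase; the sample documents, their IDs, and the map_rows helper are hypothetical, and the key is assumed to be the array of the three fields.

```ruby
# Hypothetical documents keyed by document ID (what the view sees as meta.id).
docs = {
  "user_1" => { "first_name" => "Alice", "last_name" => "Smith", "postal_code" => "94040" },
  "user_2" => { "first_name" => "Bob",   "last_name" => "Jones", "postal_code" => "10001" },
  "user_7" => { "first_name" => "Alice", "last_name" => "Smith", "postal_code" => "94040" }
}

# Ruby stand-in for the map step: one [key, document ID] row per document.
def map_rows(docs)
  docs.map do |id, doc|
    [doc.values_at("first_name", "last_name", "postal_code"), id]
  end
end

rows = map_rows(docs)
rows.each { |key, id| puts "#{key.inspect} => #{id}" }
```

Documents user_1 and user_7 end up with the same key, which is what lets the later steps recognize them as duplicates.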
Write a reduce function that collects the document IDs emitted for each key.
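function (keys, values, rereduce) {
  if (rereduce) {
    // values is a list of arrays returned by earlier reduce calls: merge them
    var res = [];
    for (var i = 0; i < values.length; i++) {
      res = res.concat(values[i]);
    }
    return res;
  } else {
    // values is the list of document IDs emitted for this key
    return values;
  }
}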
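The rereduce branch matters because the server may reduce a key's values in several chunks and then combine the partial results. The sketch below mirrors that logic in plain Ruby to show that both paths yield the same list; collect_ids and the sample IDs are hypothetical names for illustration, not part of any Couchbase API.

```ruby
# Ruby mirror of the reduce function's logic (not the Couchbase API itself).
def collect_ids(values, rereduce)
  if rereduce
    # values is an array of arrays from earlier reduce calls: flatten one level
    values.flatten(1)
  else
    # values is the list of emitted document IDs for this key
    values
  end
end

ids = %w[user_1 user_7 user_9 user_12]

# Everything reduced in one call
single_pass = collect_ids(ids, false)

# The same IDs reduced in two chunks, then combined by a rereduce call
partial_a = collect_ids(ids[0, 2], false)
partial_b = collect_ids(ids[2, 2], false)
two_pass  = collect_ids([partial_a, partial_b], true)

puts two_pass.inspect
```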
Walk the grouped results and, for every key that has more than one document ID, keep the first document and delete the rest.
The core part of the cleaner:
require 'couchbase'
connection = Couchbase.connect(options)
ddoc = connection.design_docs[options[:design_document]]
view = ddoc.send(options[:view])
connection.run do
  # :group => true yields one row per distinct key
  view.each(:group => true) do |doc|
    # doc.value is the list of document IDs collected by the reduce function
    dup_num = doc.value.size
    if dup_num > 1
      puts "kept doc #{doc.value[0]}"
      # delete documents from second to last
      connection.delete(doc.value[1..-1])
      puts "removed #{dup_num - 1} duplicate(s)"
    end
  end
end
You can find the complete source here: https://gist.github.com/859df6c1db21a9bb561b#file_cleanup.rb
Execution sample:
$ ruby ./cleanup.rb --help
Usage: cleanup.rb [options]
-h, --hostname HOSTNAME Hostname to connect to (default: 127.0.0.1:8091)
-u, --user USERNAME Username to log with (default: none)
-p, --password PASSWORD Password to log with (default: none)
-b, --bucket NAME Name of the bucket to connect to (default: default)
-d, --design-document DDOC Name of the design document containing the view (default: users)
-v, --view VIEW Name of the view (default: all)
-?, --help Show this message
$ ruby ./cleanup.rb -d users -v all
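Before running the cleanup against real data, it can help to preview which documents would be kept and which would be deleted. The sketch below is a dry run on an in-memory hash that stands in for the grouped view rows; the grouped data and the plan_cleanup helper are hypothetical, and nothing here connects to Couchbase.

```ruby
# Hypothetical grouped view rows: duplicate key => document IDs collected by the reduce.
grouped = {
  ["Alice", "Smith", "94040"] => %w[user_1 user_7 user_9],
  ["Bob", "Jones", "10001"]   => %w[user_2]
}

# Returns [ids_to_keep, ids_to_delete]; the first ID of each group survives.
def plan_cleanup(grouped)
  keep = []
  delete = []
  grouped.each_value do |ids|
    keep << ids.first
    delete.concat(ids.drop(1))
  end
  [keep, delete]
end

keep, delete = plan_cleanup(grouped)
puts "keep:   #{keep.inspect}"
puts "delete: #{delete.inspect}"
```

Separating the decision from the actual delete call makes it possible to review the list first, at the cost of holding all IDs in memory.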