
@tisba
Created December 22, 2010 19:51
Script to check CouchDB's _all_docs for duplicates
#!/usr/bin/env ruby
# Requirements:
# gem install yajl-ruby
#
# Usage:
#
# $ curl -\# http://localhost:5984/somedb/_all_docs | ./find_dupes.rb
#
# or
#
# $ cat all_docs_dump.json | ./find_dupes.rb
#
require "rubygems"
require "yajl"
# Collect every row's document ID from the _all_docs response read on STDIN.
ids = Yajl::Parser.parse(STDIN)["rows"].map { |doc| doc["id"] }
puts "Total IDs: #{ids.size}"
ids_unique = ids.uniq
puts "Unique IDs: #{ids_unique.size}"
puts "Delta: #{ids.size - ids_unique.size}"

# Sort so that equal IDs are adjacent, then count each repeat of the previous ID.
# The "dup count" is the number of extra occurrences beyond the first.
ids.sort!
prev_id = nil
dupes = Hash.new(0)
ids.each do |id|
  dupes[id] += 1 if prev_id && prev_id == id
  prev_id = id
end

puts "Found #{dupes.keys.size} documents with duplicates: id (dup count)"
dupes.each do |k, v|
  puts "#{k} (#{v})"
end

tisba commented Dec 22, 2010

$ curl -# http://localhost:5984/somedb/_all_docs | ./find_dupes.rb
Total IDs: 13425
Unique IDs: 12503
Delta: 922
Found 10 documents with duplicates: id (dup count)
11451 (1)
11467 (1)
11473 (1)
11477 (1)
11479 (1)
9999 (1)
somekey_1234 (43)
somekey_4223 (300)
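
The same "id (dup count)" tally can probably be produced without sorting, by counting occurrences in a hash. What follows is only a minimal sketch, not the script above: `duplicate_counts` is a hypothetical helper name, the sample JSON is made up for illustration, and the standard library's `json` is swapped in for yajl-ruby so the snippet runs without extra gems.

```ruby
require "json"

# Hypothetical helper (not part of the original script): returns a hash of
# id => number of extra occurrences beyond the first, i.e. the "dup count".
def duplicate_counts(ids)
  # Count how often each ID occurs.
  occurrences = ids.each_with_object(Hash.new(0)) { |id, seen| seen[id] += 1 }
  # Keep only IDs seen more than once and report the surplus.
  occurrences.select { |_id, n| n > 1 }.map { |id, n| [id, n - 1] }.to_h
end

# Made-up sample in the shape of an _all_docs response ("rows" with "id" fields).
sample = '{"rows":[{"id":"a"},{"id":"b"},{"id":"a"},{"id":"a"}]}'
ids = JSON.parse(sample)["rows"].map { |row| row["id"] }
duplicate_counts(ids).each { |id, n| puts "#{id} (#{n})" }
# prints "a (2)"
```

For a real dump, `JSON.parse(STDIN.read)` could stand in for the sample string. The counting is a single pass, but the full list of IDs is still held in memory, just as in the original.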
