@maxcodes
Created May 25, 2016 18:16
require 'fuzzystringmatch'

desc "This task eliminates books without ISBN, finds the duplicated books and merges them"
task find_duplicated_books: :environment do
  jarow = FuzzyStringMatch::JaroWinkler.create(:native)

  # Destroy books without isbn that have no shelf_books and progresses
  Book.where(isbn: nil).where(shelf_books_count: 0, progresses_count: 0).destroy_all

  books = Book.all # These are 30K books.
  books.each do |book|
    # This double loop is the one that concerns me the most.
    # 30,000 * 30,000 = 900,000,000 loops.
    books.each do |otherbook|
      next if same_book?(book, otherbook)
      if books_are_duplicate(jarow, book, otherbook)
        merge_books(book, otherbook)
      end
    end
  end
end

def merge_books(book, otherbook)
  sleep 2 # omitted for brevity. traversing several AR associations, some db calls, etc.
end

def same_book?(book_1, book_2)
  book_1.id == book_2.id
end

def books_are_duplicate(jarow, book, otherbook)
  title = book.title.parameterize(" ")
  other_title = otherbook.title.parameterize(" ")
  # This is a really fast implementation of the Jaro-Winkler distance for strings.
  jarow.getDistance(title, other_title) > 0.99
end
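
One way to cut down the double loop, sketched here under a few assumptions: normalize each title once up front instead of inside every comparison, visit each unordered pair only once with `Array#combination`, and skip books that have already been merged into another. The names `BookRow`, `normalize_title` and `duplicate_pairs` are hypothetical and not part of the original task; `normalize_title` is only a rough, Rails-free stand-in for `parameterize(" ")`, and `similarity` can be any callable that returns a score between 0 and 1.

```ruby
# Hypothetical stand-in for an ActiveRecord Book row, so the sketch runs without Rails.
BookRow = Struct.new(:id, :title)

# Rough approximation of the gist's `parameterize(" ")` (an assumption, not the
# Rails implementation): lowercase, then collapse non-alphanumerics to single spaces.
def normalize_title(title)
  title.to_s.downcase.gsub(/[^a-z0-9]+/, " ").strip
end

# Returns [keeper, duplicate] pairs.
# `similarity` is any callable taking two strings and returning a score in 0..1.
def duplicate_pairs(books, similarity, threshold: 0.99)
  # Normalize every title exactly once instead of twice per comparison.
  normalized = books.map { |book| [book, normalize_title(book.title)] }
  merged_ids = {}
  pairs = []

  # combination(2) yields each unordered pair once: n * (n - 1) / 2 pairs
  # instead of n * n iterations, and never pairs a book with itself.
  normalized.combination(2) do |(book, title), (other, other_title)|
    # A book already merged into another should not be compared or merged again.
    next if merged_ids[book.id] || merged_ids[other.id]
    next unless similarity.call(title, other_title) > threshold

    pairs << [book, other]
    merged_ids[other.id] = true
  end

  pairs
end
```

In the rake task this could be called as `duplicate_pairs(Book.all.to_a, ->(a, b) { jarow.getDistance(a, b) })`, passing each returned pair to `merge_books`. For 30,000 books that is still roughly 450 million comparisons, so this only halves the work and removes the repeated normalization; it does not change the quadratic growth.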