@jpmckinney
Created November 17, 2011 21:46
Google Refine fingerprint clustering algorithm in Ruby
# coding: utf-8
# blog post: http://blog.slashpoundbang.com/post/12938588984/google-refine-fingerprint-clustering-algorithm-in-ruby
require 'unicode_utils/downcase'
class String
  # Normalize spaces and fingerprint.
  # http://code.google.com/p/google-refine/wiki/ClusteringInDepth
  # http://code.google.com/p/google-refine/source/browse/trunk/main/src/com/google/refine/clustering/binning/FingerprintKeyer.java
  def fingerprint
    UnicodeUtils.downcase(gsub(/[[:space:]]+/, ' ').strip).gsub(/\p{Punct}|\p{Cntrl}/, '').split.uniq.sort.join(' ').tr(
      "ÀÁÂÃÄÅàáâãäåĀāĂ㥹ÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňŉŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž",
      "aaaaaaaaaaaaaaaaaaccccccccccddddddeeeeeeeeeeeeeeeeeegggggggghhhhiiiiiiiiiiiiiiiiiijjkkkllllllllllnnnnnnnnnnnoooooooooooooooooorrrrrrsssssssssttttttuuuuuuuuuuuuuuuuuuuuwwyyyyyyzzzzzz")
  end
end
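
A fingerprint is only a key. Clustering happens when values that share a key are grouped together. The sketch below shows that grouping step. To stay self-contained it does not call the gist's `String#fingerprint` or depend on `unicode_utils`. Instead it uses a simplified stand-in, `simple_fingerprint` (a hypothetical helper), which strips diacritics with the standard library's `unicode_normalize` rather than the `tr` table above. `fingerprint_clusters` and the sample names are also assumptions made for illustration.

```ruby
# Simplified stand-in for the gist's String#fingerprint (assumption: Ruby 2.4+,
# standard library only; diacritics are removed via NFD decomposition
# instead of the gist's tr lookup table).
def simple_fingerprint(string)
  string
    .gsub(/[[:space:]]+/, ' ')          # collapse runs of whitespace
    .strip
    .unicode_normalize(:nfd)            # split accented letters into base + mark
    .gsub(/\p{Mn}/, '')                 # drop the combining marks
    .downcase
    .gsub(/[[:punct:][:cntrl:]]/, '')   # remove punctuation and control characters
    .split
    .uniq
    .sort
    .join(' ')
end

# Group values by key and keep only groups holding more than one distinct value.
def fingerprint_clusters(values)
  values.group_by { |value| simple_fingerprint(value) }
        .values
        .select { |group| group.uniq.size > 1 }
end

names = ['Tom Cruise', 'Cruise, Tom', 'tom  cruise', 'Tom Hanks',
         'Renée Zellweger', 'Renee Zellweger']
clusters = fingerprint_clusters(names)
clusters.each { |group| puts group.inspect }
```

With the gist loaded, replacing `simple_fingerprint(value)` with `value.fingerprint` should give the same grouping logic using the original key.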
@christophermanning

This is very helpful to me. Thank you :D

Have you implemented the n-gram fingerprint keyer in Ruby?

@jpmckinney
Author

No, sorry :)
