@404pnf
Forked from plexus/partitioner.rb
Created June 5, 2014 01:02
# Very (very) naive word segmentation algorithm for Chinese
# (or any language with similar characteristics; it works at the
# character level).
class Partitioner
  attr_reader :ngrams

  # +ngrams+ Enumerable list of known n-grams (words)
  # +lookahead+ maximum length, in characters, of a candidate word
  def initialize(ngrams, lookahead = 6)
    @lookahead = lookahead
    # Store the n-grams as hash keys for constant-time lookup.
    @ngrams = {}
    ngrams.each { |ng| @ngrams[ng] = true }
  end

  # Goes from beginning to end, each time trying to find the longest
  # run of initial characters that is in the list of known n-grams.
  # A single character is always accepted as a fallback.
  def partition(text)
    chars = text.chars
    result = []
    until chars.empty?
      # Never look past the end of the remaining text.
      lookahead = [@lookahead, chars.length].min
      while lookahead > 0
        test = chars[0...lookahead].join
        if lookahead == 1 || ngrams[test]
          result << test
          chars = chars[lookahead..-1]
          break
        end
        lookahead -= 1
      end
    end
    result
  end
end
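
The greedy longest-match idea described in the comments above can be traced on a small input. The sketch below is a standalone illustration and not part of the original gist: the helper name `greedy_segment`, its `max_len` parameter, and the toy vocabulary are all made up for the example. It assumes the same rule as `Partitioner#partition`: take the longest known prefix, and fall back to a single character when nothing matches.

```ruby
require 'set'

# Hypothetical standalone helper (not from the gist): greedy
# longest-match segmentation over characters.
def greedy_segment(text, vocabulary, max_len = 6)
  known = vocabulary.to_set
  chars = text.chars
  words = []
  until chars.empty?
    # Never look past the end of the remaining text.
    longest = [max_len, chars.length].min
    # Try the longest candidate first; one character always matches.
    len = longest.downto(1).find do |n|
      n == 1 || known.include?(chars.first(n).join)
    end
    # Remove the matched characters from the front and record the word.
    words << chars.shift(len).join
  end
  words
end

# Made-up toy vocabulary, for illustration only.
vocab = %w[中国 中国人 人民 北京]

puts greedy_segment('中国人民在北京', vocab).inspect
# prints ["中国人", "民", "在", "北京"]
```

The output also shows why the approach is naive: because the longest prefix 中国人 wins at the first step, the reading 中国 / 人民 is never considered, and 民 is left as a lone character.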