@plexus
Created June 6, 2012 21:43
# Very (very) naive word segmentation algorithm for Chinese
# (or any language with similar characteristics; works at the
# character level).
class Partitioner
  attr_reader :ngrams

  # +ngrams+ Enumerable list of known n-grams (strings)
  # +lookahead+ maximum n-gram length, in characters, to try
  def initialize(ngrams, lookahead = 6)
    @lookahead = lookahead
    # Store the n-grams as hash keys for constant-time lookup.
    @ngrams = {}
    ngrams.each { |ng| @ngrams[ng] = true }
  end

  # Goes from beginning to end, each time trying to find the longest
  # initial n characters that are in the list of known n-grams.
  # Falls back to a single character when nothing longer matches.
  def partition(text)
    text = text.split('')
    result = []
    while text && !text.empty?
      lookahead = @lookahead
      while lookahead > 0
        test = text[0...lookahead].join
        if lookahead == 1 || ngrams[test]
          result << test
          # Slicing past the end of the array yields nil,
          # which ends the outer loop.
          text = text[lookahead..-1]
          break
        end
        lookahead -= 1
      end
    end
    result
  end
end
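
The greedy longest-match idea above (often called forward maximum matching) can be sketched as a standalone method to see how it behaves on real input. This is only an illustration under stated assumptions: `greedy_segment` is a hypothetical helper that is not part of the gist, and the four-word dictionary is a toy chosen for the example.

```ruby
require 'set'

# Hypothetical standalone helper (not part of the gist): greedy
# longest-match segmentation, walking the input left to right.
def greedy_segment(text, words, max_len = 6)
  known = words.to_set
  chars = text.chars
  result = []
  pos = 0
  while pos < chars.length
    # Never look past the end of the input.
    len = [max_len, chars.length - pos].min
    # Shrink the candidate until it is a known word or a single character.
    len -= 1 while len > 1 && !known.include?(chars[pos, len].join)
    result << chars[pos, len].join
    pos += len
  end
  result
end

# Toy dictionary, chosen only for illustration.
words = %w[中国 中国人 人民 我们]
p greedy_segment('我们是中国人', words) # => ["我们", "是", "中国人"]
p greedy_segment('中国人民', words)     # => ["中国人", "民"]
```

The second call shows why the approach is naive: the greedy match takes 中国人 first and strands 民, although 中国 followed by 人民 would use only dictionary words.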
plexus commented Jun 8, 2012

No, I don't plan to come up with my own scheme; there have been plenty of academic efforts already to come up with good algorithms. I'm thinking more of porting one or more of those to Ruby, or creating Ruby bindings to a C implementation.

plexus commented Jun 8, 2012

This one is really just a placeholder so I can work on other aspects of my app now, and then revisit the segmentation problem later.
