@geekdreamzz
Last active January 18, 2019 16:10
WIP: tokenizing word phrases. The plan is to develop a tokenizer that can pluck entities out of a phrase (topics, citations, jurisdictions, companies, people, etc.) and also detect things like intent, e.g. whether the phrase is a question or a statement. For now, the purpose is to tokenize every contiguous subphrase of the input, like a tree. Example: "this test phrase" get…
#!/usr/bin/env ruby
require 'pry' # debugging aid only; nothing below calls it

module Phrase
  class Tokenizer
    attr_reader :phrase

    def initialize(phrase)
      @phrase = phrase
    end

    # Whitespace-separated words of the phrase.
    def words
      @words ||= phrase.split
    end

    # Every contiguous subphrase, grouped by starting word.
    # Iterates over indices rather than words.index(word), which returns
    # the first match and so mishandles phrases with repeated words.
    def groups
      @groups ||= words.each_index.flat_map do |start_index|
        (start_index...words.length).map do |end_index|
          words[start_index..end_index].join(' ')
        end
      end
    end
  end
end

tokenizer = Phrase::Tokenizer.new('this is a test phrase')
p tokenizer.groups # print the subphrase list
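
To make the "like a tree" idea concrete, here is a minimal standalone sketch of the subphrase enumeration. The helper name `subphrases` is hypothetical and not part of the gist, and because the example in the description is cut off, the listed result is simply what this sketch produces, not necessarily the author's intended output. It walks start and end indices, so each occurrence of a repeated word starts its own branch.

```ruby
# Hypothetical helper (not part of the gist): returns every contiguous
# subphrase of the given phrase, grouped by starting word.
def subphrases(phrase)
  words = phrase.split
  words.each_index.flat_map do |start_index|
    (start_index...words.length).map do |end_index|
      words[start_index..end_index].join(' ')
    end
  end
end

short = subphrases('this test phrase')
# short == ["this", "this test", "this test phrase",
#           "test", "test phrase", "phrase"]

# A phrase with repeated words still yields one branch per position:
# 6 words give 6 + 5 + 4 + 3 + 2 + 1 = 21 subphrases.
repeated = subphrases('to be or not to be')
puts short.inspect
puts repeated.length
```

A phrase of n words always produces n(n + 1) / 2 subphrases, so the list grows quadratically with phrase length.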