@satoryu
Created November 12, 2019 07:41
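require 'matrix'
require 'natto'
require 'tf-idf-similarity'

# Wraps a MeCab node for use as a token.
class Token
  def initialize(mecab_node)
    @node = mecab_node
  end

  # Keep only nouns: the MeCab feature string begins with the part of speech,
  # and '名詞' means "noun".
  def valid?
    @node.feature.start_with?('名詞')
  end

  # The surface form, i.e. the token text as it appears in the input.
  def to_s
    @node.surface
  end
end

# Custom tokenizer that splits Japanese text into morphemes with MeCab (via natto).
class Tokenizer
  def tokenize(text)
    nm = Natto::MeCab.new
    nm.enum_parse(text).to_a.map do |node|
      Token.new(node)
    end
  end
end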

tokenizer = Tokenizer.new
options = { tokenizer: tokenizer }

# Build one document per string, all sharing the MeCab-based tokenizer.
document1 = TfIdfSimilarity::Document.new("ロン毛", options)
document2 = TfIdfSimilarity::Document.new("セミロン毛", options)
document3 = TfIdfSimilarity::Document.new("セミ", options)
corpus = [document1, document2, document3]

model = TfIdfSimilarity::TfIdfModel.new(corpus)

# Entry [i, j] is the similarity between document i and document j,
# so the diagonal is (up to floating-point rounding) 1.0.
matrix = model.similarity_matrix
puts matrix
#=> Matrix[[0.9999999999999999, 0.3360969272762574, 0.0], [0.3360969272762574, 0.9999999999999999, 0.0], [0.0, 0.0, 1.0]]
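
To help read the matrix above, here is a minimal pure-Ruby sketch of cosine similarity between term-weight vectors, which is the kind of score a tf-idf similarity matrix holds. The helper `cosine_similarity` and the terms and weights below are hypothetical and chosen only for illustration. They are not the tokens MeCab produces for these strings, nor the weights the gem computes.

```ruby
# Cosine similarity between two sparse vectors, each a Hash of term => weight.
# Hypothetical helper for illustration; not part of tf-idf-similarity.
def cosine_similarity(a, b)
  dot = a.sum { |term, weight| weight * b.fetch(term, 0.0) }
  norm_a = Math.sqrt(a.values.sum { |w| w * w })
  norm_b = Math.sqrt(b.values.sum { |w| w * w })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Made-up weights: doc_a and doc_b share the term 'y', doc_c shares nothing.
doc_a = { 'x' => 1.0, 'y' => 1.0 }
doc_b = { 'y' => 1.0, 'z' => 1.0 }
doc_c = { 'w' => 1.0 }

puts cosine_similarity(doc_a, doc_b) # strictly between 0 and 1: one shared term
puts cosine_similarity(doc_a, doc_c) # 0.0: no shared terms
puts cosine_similarity(doc_c, doc_c) # 1.0: a vector compared with itself
```

Under these assumptions, two documents score 0.0 exactly when they have no term in common, which matches the pattern in the third row and column of the matrix above.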