@youchan
Last active March 1, 2018 01:33
Trying out an implementation of the TF-IDF method
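# dataset.rb
require "sparql/client"

# Streams the abstracts of the Japanese DBpedia resources that are linked
# from the "プログラミング言語" (programming language) page.
class Dataset
  include Enumerable

  def initialize
    @sparql = SPARQL::Client.new("http://ja.dbpedia.org/sparql")
  end

  # Yields each abstract as a String.
  def each
    return enum_for(:each) unless block_given?

    # The dbpedia-owl: and rdfs: prefixes are not declared in the query;
    # they are assumed to be predefined by the endpoint.
    @sparql.query("
      SELECT ?resource ?title ?abstract
      WHERE {
        <http://ja.dbpedia.org/resource/プログラミング言語> dbpedia-owl:wikiPageWikiLink ?resource .
        ?resource rdfs:label ?title .
        ?resource <http://dbpedia.org/ontology/abstract> ?abstract .
      }
    ").each do |solution|
      # Convert the RDF literal to a plain String before tokenizing.
      yield solution[:abstract].to_s
    end
  end
end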
# index.rb
require_relative 'tokenizer'

# Tokenizes every document in the dataset and builds a positional
# inverted index over the tokens.
class Index
  attr_reader :inverse_index, :documents

  def initialize(dataset)
    @dataset = dataset
    @tokenizer = Tokenizer.new
    @inverse_index = {} # token => [[doc_id, position], ...]
    @documents = []     # doc_id => [text, tokens]
  end

  def fetch_data
    @dataset.each_with_index do |text, i|
      tokens = @tokenizer.tokenize(text)
      @documents << [text, tokens]
      tokens.each_with_index do |word, j|
        (@inverse_index[word] ||= []) << [i, j]
      end
    end
  end
end
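
As a rough illustration of the structure that `Index#fetch_data` builds, the sketch below constructs the same kind of positional inverted index for a tiny hand-made corpus. It assumes whitespace tokenization in place of MeCab, and the method name `build_inverted_index` and the sample texts are hypothetical, not part of the gist.

```ruby
# Hypothetical helper: builds a positional inverted index that maps each
# token to a list of [doc_id, position] pairs.
# Assumption: tokens are split on whitespace instead of using MeCab.
def build_inverted_index(texts)
  inverted = {}
  texts.each_with_index do |text, doc_id|
    text.split.each_with_index do |word, pos|
      (inverted[word] ||= []) << [doc_id, pos]
    end
  end
  inverted
end

sample_texts = ["ruby is a language", "a language for ruby fans"]
sample_index = build_inverted_index(sample_texts)

# Each entry lists where the token occurs as [doc_id, position].
p sample_index["ruby"]
p sample_index["language"]
```

Keeping the positions is more than a single-term ranking needs, but it makes the number of occurrences per document available as the length of each document's posting list.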
# query.rb
require_relative 'dataset'
require_relative 'index'
require_relative 'tf_idf'

dataset = Dataset.new
index = Index.new(dataset)
tf_idf = TfIdf.new(index)

# Print the five abstracts that score highest for the term '計算' ("computation").
tf_idf.ranking('計算').take(5).each do |doc_id, _score|
  puts index.documents[doc_id].first
end
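# tf_idf.rb
# Ranks documents for a single query term by TF-IDF.
class TfIdf
  def initialize(index)
    @index = index
    @index.fetch_data
  end

  # Returns [doc_id, score] pairs, highest score first.
  def ranking(q)
    # A term that never occurs has no postings, so the ranking is empty.
    postings = @index.inverse_index.fetch(q, []).group_by(&:first)
    return [] if postings.empty?

    # Document frequency: the number of documents that contain the term.
    df = postings.size
    idf = Math.log10(@index.documents.size.to_f / df)

    tf_idfs = postings.map do |doc_id, pos_list|
      n_words = @index.documents[doc_id].last.size
      # Term frequency: occurrences in this document over its token count.
      tf = pos_list.size.to_f / n_words
      [doc_id, tf * idf]
    end
    tf_idfs.sort_by(&:last).reverse
  end
end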
# tokenizer.rb
require 'natto'

# Splits Japanese text into surface forms with MeCab (via the natto gem).
class Tokenizer
  def initialize
    @mecab = Natto::MeCab.new
  end

  def tokenize(text)
    # Drop nodes whose surface is empty.
    @mecab.enum_parse(text).map(&:surface).reject(&:empty?)
  end
end

youchan commented Feb 28, 2018

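I implemented the TF-IDF method described on p. 144 of the Open University of Japan textbook "自然言語処理" (Natural Language Processing, by Prof. Kurohashi).
For the dataset, I used the abstracts of roughly 150 DBpedia entries linked from "プログラミング言語" (programming language).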
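
For reference, here is a minimal sketch of the scoring idea on a toy corpus. It assumes the common definitions tf = (occurrences of the term in the document) / (number of tokens in the document) and idf = log10(N / df), where N is the number of documents and df is the number of documents containing the term; the textbook's exact formulation may differ. The method name `tf_idf_ranking` and the sample documents are hypothetical.

```ruby
# Hypothetical sketch: rank pre-tokenized documents for a single query term.
# Assumed definitions (the textbook's exact variant may differ):
#   tf  = occurrences of the term in a doc / number of tokens in that doc
#   idf = log10(number of docs / number of docs containing the term)
def tf_idf_ranking(docs, term)
  df = docs.count { |tokens| tokens.include?(term) }
  return [] if df.zero?

  idf = Math.log10(docs.size.to_f / df)
  scores = docs.each_with_index.map do |tokens, doc_id|
    next unless tokens.include?(term) # skip documents without the term

    tf = tokens.count(term).to_f / tokens.size
    [doc_id, tf * idf]
  end
  # Highest score first.
  scores.compact.sort_by { |_, score| -score }
end

toy_docs = [
  %w[ruby is a language],
  %w[lisp is a language too],
  %w[ruby ruby everywhere],
  %w[nothing to see here]
]
toy_ranking = tf_idf_ranking(toy_docs, "ruby")
toy_ranking.each { |doc_id, score| puts format("doc %d: %.4f", doc_id, score) }
```

With a single query term, idf is the same for every document, so the order is decided by tf alone; idf starts to matter once the scores of several query terms are combined.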

The following gems are used to run MeCab and to fetch data from DBpedia.

$ gem install natto
$ gem install linkeddata
$ gem install sparql

Run it as follows:

$ ruby query.rb
