Last active: March 1, 2018 01:33
I tried implementing the TF-IDF method
dataset.rb
require "sparql/client"

# Streams the abstracts of pages linked from the DBpedia entry for
# プログラミング言語 on the Japanese DBpedia SPARQL endpoint.
class Dataset
  include Enumerable

  def initialize
    @sparql = SPARQL::Client.new("http://ja.dbpedia.org/sparql")
  end

  def each
    @sparql.query("
      SELECT ?resource ?title ?abstract
      WHERE {
        <http://ja.dbpedia.org/resource/プログラミング言語> dbpedia-owl:wikiPageWikiLink ?resource .
        ?resource rdfs:label ?title .
        ?resource <http://dbpedia.org/ontology/abstract> ?abstract .
      }
    ").each do |solution|
      yield solution[:abstract]
    end
  end
end
index.rb
require_relative 'tokenizer'

# Builds an inverted index: each word maps to a list of [doc_id, position]
# pairs, while @documents keeps [original_text, tokens] for each document.
class Index
  attr_reader :inverse_index, :documents

  def initialize(dataset)
    @dataset = dataset
    @tokenizer = Tokenizer.new
    @inverse_index = {}
    @documents = []
  end

  def fetch_data
    @dataset.each_with_index do |text, i|
      tokens = @tokenizer.tokenize(text)
      @documents << [text, tokens]
      tokens.each_with_index do |word, j|
        (@inverse_index[word] ||= []) << [i, j]
      end
    end
  end
end
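The index construction above can be sketched on a toy corpus, replacing the SPARQL dataset and MeCab with a plain string array and whitespace splitting (both are stand-ins for illustration, not the gist's actual inputs):

```ruby
# Minimal sketch of the inverted-index construction: each word maps to a
# list of [doc_id, position] pairs.
inverse_index = {}
documents = []

["a b a", "b c"].each_with_index do |text, i|
  tokens = text.split # whitespace tokenization stands in for MeCab
  documents << [text, tokens]
  tokens.each_with_index do |word, j|
    (inverse_index[word] ||= []) << [i, j]
  end
end

p inverse_index
# {"a"=>[[0, 0], [0, 2]], "b"=>[[0, 1], [1, 0]], "c"=>[[1, 1]]}
```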
(driver script)
require_relative 'dataset'
require_relative 'index'
require_relative 'tf_idf'

dataset = Dataset.new
index = Index.new(dataset)
tf_idf = TfIdf.new(index)

# Print the five abstracts that score highest for the query 計算 ("computation").
tf_idf.ranking('計算').take(5).each do |r|
  puts index.documents[r.first].first
end
tf_idf.rb
class TfIdf
  def initialize(index)
    @index = index
    @index.fetch_data
  end

  # Returns [doc_id, score] pairs for the query term q, sorted by
  # descending TF-IDF score.
  def ranking(q)
    postings = @index.inverse_index[q]
    return [] if postings.nil? # q does not occur in any document

    grouped = postings.group_by(&:first)
    # Document frequency is the number of documents containing q,
    # so idf is constant across documents for a given query.
    df = grouped.size
    idf = Math.log(@index.documents.size.to_f / df, 10)
    tf_idfs = grouped.map do |doc_id, pos_list|
      n_words = @index.documents[doc_id].last.size
      tf = pos_list.size.to_f / n_words # occurrences of q, normalized by doc length
      [doc_id, tf * idf]
    end
    tf_idfs.sort_by(&:last).reverse
  end
end
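The scoring formula — tf as query occurrences over document length, idf as log10 of total documents over document frequency — can be checked by hand on a toy corpus (a hypothetical three-document example, not the DBpedia data):

```ruby
# Hand-computed check of tf = count(q, doc) / |doc|, idf = log10(N / df).
docs = [%w[ruby is fun], %w[ruby ruby rocks], %w[python too]]
q = "ruby"

n_docs = docs.size                             # N = 3
df = docs.count { |d| d.include?(q) }          # "ruby" occurs in 2 documents
idf = Math.log(n_docs.to_f / df, 10)           # log10(1.5)

scores = docs.each_with_index.map do |d, i|
  tf = d.count(q).to_f / d.size
  [i, tf * idf]
end

p scores.max_by(&:last).first
# 1 — "ruby ruby rocks" has the highest term frequency, so it ranks first
```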
tokenizer.rb
require 'natto'

# Wraps MeCab (via the natto gem) to split Japanese text into surface forms.
class Tokenizer
  def initialize
    @mecab = Natto::MeCab.new
  end

  def tokenize(text)
    # enum_parse yields one node per morpheme; nodes with an empty
    # surface (e.g. EOS) are dropped.
    @mecab.enum_parse(text).map(&:surface).reject(&:empty?)
  end
end
I implemented the TF-IDF method from p. 144 of the Open University of Japan textbook "Natural Language Processing" (by Prof. Kurohashi).
For the dataset I used the abstracts of roughly 150 entries linked from the DBpedia article on プログラミング言語 (programming language).
The natto gem is used for MeCab, and the sparql-client gem for fetching the data from DBpedia.
The driver script above runs the whole pipeline.
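A possible setup and run sequence, assuming the files are saved as dataset.rb, index.rb, tf_idf.rb, and tokenizer.rb (matching the require_relative calls) plus a driver script whose name is hypothetical here:

```shell
# MeCab itself (and a dictionary) must be installed separately
# for the natto gem to bind to it.
gem install natto sparql-client

# Run the driver (main.rb is an assumed name for the script above):
ruby main.rb
```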