@youchan
Created April 19, 2019 11:39
A Ruby implementation of tf-idf, written at code party
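# Read the corpus: one document per line, words separated by spaces.
text = File.read("./utamap.test.txt")

# Build the vocabulary list and a word => id lookup table.
vocabularies = []
word2id = {}
text.split("\n").each do |doc|
  words = doc.split(" ")
  words.each do |word|
    unless word2id.key?(word)
      vocabularies << word
      word2id[word] = vocabularies.size - 1
    end
  end
end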

# Count how often each word occurs in each document.
hist_all = text.split("\n").map do |doc|
  hist = Hash.new(0)
  doc.split(" ").each { |word| hist[word] += 1 }
  hist
end
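
# Document frequency (df) is the number of documents containing a word;
# idf(word) = log10(N / df(word)), where N is the number of documents.
idf = hist_all.reduce(Hash.new(0)) do |df, hist|
  hist.each_key { |word| df[word] += 1 }
  df
end.map do |word, count|
  [word, Math.log10(hist_all.count.to_f / count)]
end.to_h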

# Print the tf-idf score of every word in every document.
docs = text.split("\n")
hist_all.each_with_index do |hist, index|
  puts "======================================="
  puts docs[index]
  total = hist.values.sum
  hist.each do |word, count|
    # tf is the word's share of all words in this document.
    tf = count.to_f / total
    puts "#{word} => tfidf: #{tf * idf[word]}"
  end
end
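
To try the same computation without the input file, the steps above can be wrapped in a function that takes the documents as an array. This is only a minimal sketch: the `tfidf` helper and the two-document sample corpus are hypothetical and not part of the gist, and it assumes the same input shape as the script (words separated by spaces).

```ruby
# Hypothetical helper (not part of the gist): returns one hash per
# document, mapping each word to its tf-idf score.
def tfidf(docs)
  # Per-document word counts.
  hists = docs.map do |doc|
    doc.split(" ").each_with_object(Hash.new(0)) { |word, hist| hist[word] += 1 }
  end

  # Document frequency: in how many documents each word appears.
  df = hists.each_with_object(Hash.new(0)) do |hist, acc|
    hist.each_key { |word| acc[word] += 1 }
  end

  # tf = count / words in the document, idf = log10(N / df).
  hists.map do |hist|
    total = hist.values.sum
    hist.map do |word, count|
      [word, (count.to_f / total) * Math.log10(docs.size.to_f / df[word])]
    end.to_h
  end
end

# Made-up sample corpus, for illustration only.
scores = tfidf(["a b a", "a c"])
scores.each_with_index do |doc_scores, index|
  doc_scores.each { |word, score| puts "doc #{index}: #{word} => #{score}" }
end
```

In this sample, "a" occurs in both documents, so its idf is log10(2 / 2) = 0 and its tf-idf score is 0 no matter how often it appears. Words that occur in only one document get a positive score.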