@mitmul
Last active December 11, 2015 23:38
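# encoding: utf-8
# Convert the Japanese Wikipedia title dump into dictionary CSV rows.
src = "jawiki-latest-all-titles-in-ns0"
dst = "wikipedia.csv"
File.open(dst, "w:UTF-8") do |fp|
  # File.foreach closes the input file once iteration finishes.
  File.foreach(src, encoding: "UTF-8") do |line|
    line.chomp!
    # Skip unwanted words
    next if line =~ /^\./             # titles starting with a dot
    next if line =~ /曖昧さの?回避/    # disambiguation pages, e.g. "..._(曖昧さ回避)"
    next if line =~ /^[0-9]{1,100}$/  # purely numeric titles
    next if line =~ /[0-9]{4}./       # four digits followed by any character
    next if line =~ /[*+.,]/          # titles containing *, +, . or , (a comma would break the CSV)
    if line.length > 3
      # Longer titles get a lower cost, floored at -32768.
      score = [-32768.0, 6000 - 200 * (line.size.to_f**1.3)].max.to_i
      out = "#{line},0,0,#{score},名詞,固有名詞,一般,*,*,*,#{line},*,*,wikipedia,"
      fp.puts out
    end
  end
end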
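
The cost formula and row layout above can be tried in isolation. The following is a minimal sketch, not part of the original script: the helper names `title_cost` and `dictionary_row` and the constant `MIN_COST` are hypothetical, introduced only for illustration. The row appears to follow the MeCab IPA dictionary CSV layout (surface form, left ID, right ID, cost, part-of-speech fields), but that is an assumption, since the script does not name the tool that consumes `wikipedia.csv`.

```ruby
# Hypothetical helpers that mirror the arithmetic in the script above.
MIN_COST = -32768.0 # floor used by the script (the minimum of a signed 16-bit integer)

# Cost for a title: starts near 6000 and drops as the title gets longer.
def title_cost(title)
  [MIN_COST, 6000 - 200 * (title.size.to_f**1.3)].max.to_i
end

# One CSV row in the same field order the script writes.
def dictionary_row(title)
  "#{title},0,0,#{title_cost(title)},名詞,固有名詞,一般,*,*,*,#{title},*,*,wikipedia,"
end

# Show how the cost changes with title length.
[4, 10, 57, 58].each do |n|
  puts "length #{n}: cost #{title_cost('a' * n)}"
end
puts dictionary_row("東京スカイツリー")
```

With this formula a 4-character title gets a cost of 4787 and a 10-character title gets 2009, so longer titles are cheaper. Titles of 58 characters or more are clamped to -32768.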