@maimai-swap
Created January 13, 2012 05:17
Create a CSV for registering words in a MeCab dictionary
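# -*- coding: utf-8 -*-
#
# input:  TSV of reading\tword (UTF-8?)
# output: standard output (UTF-8)
# Reference: http://d.hatena.ne.jp/code46/20090531/p1
# Reference: http://tmp.blogdns.org/archives/2009/12/mecabwikipediah.html
# Reference: http://mecab.sourceforge.net/
require 'kconv'
origin = "はてな" # source label; not used below
File.foreach(ARGV[0]) do |line|
  yomi, title = line.split("\t")
  next if title.nil? # skip lines that have no tab-separated word column
  title = title.strip.toutf8
  yomi = yomi.strip.toutf8
  # Convert the hiragana reading to katakana.
  yomi.tr!('ぁ-ん', 'ァ-ン')
  # Longer titles get a lower cost, clamped at -32768.
  score = [-32768.0, (6000 - 200 * (title.size**1.3))].max.to_i
  # Note: String#size counts characters on Ruby 1.9+ but bytes on Ruby 1.8.
  if title.size > 9
    # Columns: surface, left ID, right ID, cost, part of speech (名詞 = noun, 一般 = general), ...
    print "#{title},0,0,#{score},名詞,一般,*,*,*,*,#{title},#{yomi},#{yomi},はてなキーワード,\n"
  end
end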
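
To get a feel for the cost formula in the script, here is a minimal sketch that pulls the same arithmetic into a standalone method and prints the cost for a few title lengths. The method name `word_cost` is hypothetical and does not appear in the original script; the lengths are arbitrary sample values.

```ruby
# Hypothetical helper: same arithmetic as the `score` line in the script.
# The cost falls as the title gets longer and is clamped at -32768.
def word_cost(length)
  [-32768.0, 6000 - 200 * (length**1.3)].max.to_i
end

# Arbitrary sample lengths, chosen only for illustration.
[1, 10, 57, 58].each do |n|
  puts "#{n} characters -> cost #{word_cost(n)}"
end
```

With this formula a 1-character title costs 5800, and the clamp takes effect at about 58 characters, so every longer title ends up with the same cost.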
@maimai-swap
Author

After the script finishes, run: /usr/local/libexec/mecab/mecab-dict-index -d /usr/local/lib/mecab/dic/ipadic -u hatena.dic -t utf-8 -f utf-8 /usr/local/lib/mecab/dic/ipadic/hatena.csv
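
If you would rather drive this step from Ruby, the sketch below builds the same command as an argument array, which avoids shell quoting problems. The paths and flags are copied from the command above; the method name `dict_index_command` is hypothetical, and the paths will likely differ on your installation.

```ruby
# Paths copied from the comment above; adjust them to your installation.
MECAB_DICT_INDEX = "/usr/local/libexec/mecab/mecab-dict-index"
SYSTEM_DIC_DIR   = "/usr/local/lib/mecab/dic/ipadic"

# Hypothetical helper: builds the argument list for compiling a user dictionary
# from a CSV file, using only the flags that appear in the original command.
def dict_index_command(csv_path, dic_path)
  [MECAB_DICT_INDEX,
   "-d", SYSTEM_DIC_DIR,
   "-u", dic_path,
   "-t", "utf-8",
   "-f", "utf-8",
   csv_path]
end

cmd = dict_index_command(File.join(SYSTEM_DIC_DIR, "hatena.csv"), "hatena.dic")
# Print the command; pass the array to system(*cmd) to actually run it.
puts cmd.join(" ")
```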

@maimai-swap
Author

Beforehand, I filtered the input like this:

grep -v "^[[:space:]]" hatena.utf8.txt > hatena.yomigana.txt
grep -E -v "[A-Za-z0-9]{3,50}" hatena.yomigana.txt > hatena.noeiji.txt
grep -E -v "[0-9]{2}" hatena.noeiji.txt > hatena.nosuuji.txt
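
The three filters can also be applied in one pass in Ruby. This is a rough sketch, not the author's method: it assumes the brace quantifiers are meant as repetition counts (grep only treats them that way with -E or with escaped braces), and that lines starting with whitespace are the ones with an empty reading column. The method name `keep_line?` and the sample lines are made up for illustration.

```ruby
# Hypothetical helper: true when a line survives all three `grep -v` filters.
def keep_line?(line)
  text = line.chomp
  return false if text =~ /\A[[:space:]]/      # starts with whitespace (assumed: empty reading)
  return false if text =~ /[A-Za-z0-9]{3,50}/  # contains a run of 3+ ASCII letters or digits
  return false if text =~ /[0-9]{2}/           # contains two consecutive ASCII digits
  true
end

# Made-up sample lines in the reading<TAB>word layout.
sample = [
  "はてな\tはてな\n",
  "\tよみなし\n",
  "えーびーしー\tABC\n",
  "にせん\t2000年\n"
]
sample.each { |line| print line if keep_line?(line) }
```

To filter a real file, replace the sample array with `File.foreach("hatena.utf8.txt")` and redirect the output to a new file.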
