Last active
January 25, 2016 08:21
priancho/b04f8fbe7e1f84cbccac
tweet text filtering and normalization for NEologd
#!/usr/local/bin/ruby
# -*- coding: utf-8 -*-
require 'cgi'
require 'nkf'

def normalize_text(t)
  # Decode HTML entities (e.g. &amp;, &lt;) in the tweet text
  # > http://www.xmisao.com/2014/03/09/how-to-encode-decode-html-entities-in-ruby.html
  t = CGI.unescapeHTML(t)
  #####
  # Implement string normalization for NEologd
  # > https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp
  #####
  #
  # Replace full-width alphanumerics with half-width ones
  t.tr!('０-９ａ-ｚＡ-Ｚ', '0-9a-zA-Z')
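  # Replace half-width katakana with full-width katakana (NKF default)
  # Convert alphabetic characters and some symbols (including the full-width space) to ASCII (-Z1)
  # Replace half-width corner brackets with full-width 「」 (NKF default)
  # > http://qiita.com/wada811/items/fd7edce8ce885354fc89
  # > http://docs.ruby-lang.org/ja/1.9.3/class/NKF.html
  # Runs of whitespace are then collapsed into a single half-width space
  t = NKF.nkf("-w -Z1", t).gsub(/\s+/, ' ')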
  # Replace hyphen-minus-like characters with '-'
  # > http://d.hatena.ne.jp/y-kawaz/20101112/1289554290
  hyphen_like_chars = '\u02D7\u058A\u2010\u2011\u2012\u2013\u2043\u207B\u208B\u2212'
  t.gsub!(/[#{hyphen_like_chars}]/, '-')
  # Replace characters that look like the long-sound mark with 'ー'
  longsound_like_chars = '\u2014\u2015\u2500\u2501\uFE63\uFF0D\uFF70'
  t.gsub!(/[#{longsound_like_chars}]/, 'ー')
  # Delete tilde-like characters
  tilde_like_chars = '\u007E\u223C\u223E\u301C\u3030\uFF5E'
  t.gsub!(/[#{tilde_like_chars}]/, '')
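  # Remove half-width spaces between hiragana, full-width katakana, half-width katakana,
  # and kanji (full-width symbols have already been replaced with half-width ones)
  # Remove half-width spaces between those characters and half-width alphanumerics
  # > http://ruby-doc.org/core-1.9.3/Regexp.html
  # 'ー' (U+30FC) is listed explicitly because its Unicode script is Common,
  # so \p{Katakana} does not match it.
  lhs_chars = '\p{Hiragana}\p{Katakana}\p{Han}ー'
  rhs_chars = '\p{Hiragana}\p{Katakana}\p{Han}\p{Alnum}ー'
  t.gsub!(/([#{lhs_chars}]+)(\s+)(?=[#{rhs_chars}]+)/, '\1')
  # Same rule in the other direction (half-width alphanumerics followed by Japanese text)
  t.gsub!(/([0-9A-Za-z]+)(\s+)(?=[#{lhs_chars}])/, '\1')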
  # Remove leading and trailing half-width spaces from the text to be analyzed
  t.strip!
  return t
end
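
The character-class substitutions above can be tried in isolation. The following is a minimal sketch, not the gist's own code: the helper names `replace_lookalikes` and `remove_spaces_after_japanese` and the three constants are hypothetical, and the CGI and NKF steps are left out on purpose so that the snippet runs with core Ruby only.

```ruby
# -*- coding: utf-8 -*-
# Hypothetical helpers for illustration; they cover only the look-alike
# character replacement and the "space after Japanese text" rule.

HYPHEN_LIKE    = /[\u02D7\u058A\u2010\u2011\u2012\u2013\u2043\u207B\u208B\u2212]/
LONGSOUND_LIKE = /[\u2014\u2015\u2500\u2501\uFE63\uFF0D\uFF70]/
TILDE_LIKE     = /[\u007E\u223C\u223E\u301C\u3030\uFF5E]/

# Map hyphen-like characters to '-', long-sound-like characters to 'ー',
# and drop tilde-like characters.
def replace_lookalikes(text)
  text.gsub(HYPHEN_LIKE, '-')
      .gsub(LONGSOUND_LIKE, 'ー')
      .gsub(TILDE_LIKE, '')
end

# Remove whitespace that follows hiragana, katakana, or kanji when the next
# character is Japanese or alphanumeric. The lookahead leaves the following
# character unconsumed, so consecutive gaps are all closed in one pass.
def remove_spaces_after_japanese(text)
  japanese = '\p{Hiragana}\p{Katakana}\p{Han}'
  text.gsub(/([#{japanese}]+)\s+(?=[#{japanese}\p{Alnum}])/, '\1')
end

replace_lookalikes("2016\u20130125")                      # => "2016-0125"
replace_lookalikes("スーパ\u2015")                          # => "スーパー"
remove_spaces_after_japanese("検索 エンジン 自作 入門")       # => "検索エンジン自作入門"
remove_spaces_after_japanese("アルゴリズム C")               # => "アルゴリズムC"
remove_spaces_after_japanese("Coding the Matrix")          # => "Coding the Matrix"
```

Text made up only of ASCII words is left untouched by the second helper, because the pattern requires a Japanese character immediately before the space.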