@sasamijp
Created August 23, 2015 12:39
Replaces the Japanese proper nouns in an HTML page with わい
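# -*- encoding: utf-8 -*-
require 'natto'
require 'open-uri'

# Requires MeCab and the natto gem.
@natto = Natto::MeCab.new

# Replaces every Japanese proper noun in jp_text with 'わい',
# collapsing runs of consecutive proper nouns into a single 'わい'.
def conv(jp_text)
  words = []
  jp_text.split("\n").each do |line|
    line.split(' ').each do |chunk|
      @natto.parse(chunk) do |n|
        break if n.is_eos?
        # Pass through tokens with no kanji or kana (HTML tags, ASCII, digits).
        unless n.surface =~ /\p{Han}|\p{Hiragana}|\p{Katakana}/
          words << n.surface
          next
        end
        # Feature fields 0..1 hold the part of speech and its subcategory.
        if n.feature.split(',')[0..1] == %w(名詞 固有名詞)
          words << 'わい' unless words.last == 'わい'
        else
          words << n.surface
        end
      end
      words << ' '
    end
    words << "\n"
  end
  words.join
end

url = 'https://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E3%81%AE%E6%AD%B4%E5%8F%B2'
# URI.open is used because Kernel#open no longer accepts URLs as of Ruby 3.0.
File.write('a.html', conv(URI.open(url).read))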
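
The replacement rule depends only on the comma-separated feature string attached to each token, so it can be tried without MeCab installed. The following is a minimal sketch under stated assumptions: the helper names `proper_noun?` and `replace_proper_nouns` are hypothetical and not part of the gist, and the sample feature strings are hand-written in the IPADIC field layout rather than captured from a real MeCab run. It also omits the kanji/kana filter that the script applies first.

```ruby
# Hypothetical helpers for illustration; not part of the original gist.

REPLACEMENT = 'わい'

# True when the first two feature fields are 名詞 (noun) and 固有名詞 (proper noun).
def proper_noun?(feature)
  feature.split(',')[0..1] == %w(名詞 固有名詞)
end

# tokens: array of [surface, feature] pairs, in the order a tokenizer would yield them.
# Consecutive proper nouns collapse into a single replacement word.
def replace_proper_nouns(tokens)
  tokens.each_with_object([]) do |(surface, feature), out|
    if proper_noun?(feature)
      out << REPLACEMENT unless out.last == REPLACEMENT
    else
      out << surface
    end
  end.join
end

# Assumed sample data in IPADIC layout (hand-written, not real MeCab output).
tokens = [
  ['東京', '名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー'],
  ['大阪', '名詞,固有名詞,地域,一般,*,*,大阪,オオサカ,オーサカ'],
  ['に',   '助詞,格助詞,一般,*,*,*,に,ニ,ニ'],
  ['行く', '動詞,自立,*,*,五段・カ行促音便,基本形,行く,イク,イク']
]

puts replace_proper_nouns(tokens) # prints "わいに行く"
```

The two adjacent proper nouns produce only one わい because each replacement is compared with the last word already emitted, which is the same guard the script uses.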