@tily
Created November 21, 2010 11:24
Extract the characters from story-like Japanese text (a rough heuristic)
#!/usr/bin/env ruby
require 'MeCab'
require 'ostruct'
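# USAGE:
#   ruby extract_characters.rb < story.txt
# SPEC:
#   Nouns tagged 副詞可能 (adverbial nouns such as 今, 今度, 今日, as in 今日は)
#   are excluded.
# TODO:
#   Take anaphora into account in each_two_sentences.
#   Count first-person words as characters too.
#   Handle compound words properly, using something like Perl's Term::Extract
#   as a reference.
#   Strip honorific suffixes such as 〜さん, くん, ちゃん, 様.
# BUG:
#   On 駆け込み訴え, the common noun 人 ("person") is extracted as well.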
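class CharacterExtractor
  # A run of adjacent nodes that is treated as a single compound noun.
  class ComplexWord
    attr_reader :surface, :nouns

    def initialize(nouns)
      @nouns = nouns
      @surface = @nouns.map {|n| n.surface }.join
    end
  end

  # The topic particle は (助詞,係助詞).
  TOPIC_SUFFIX_FEATURE = %r#^助詞,係助詞,\*,\*,\*,\*,は,ハ,ワ$#u
  # Any noun (名詞).
  NOUN_FEATURE = %r#^名詞,#u
  # Noun subtypes to skip: suffix, dependent, pronoun, adverbial.
  NOUN_EXCLUDE_FEATURE = %r#^名詞,(接尾|非自立|代名詞|副詞可能)#u
  # Parts of speech that may form a compound noun: prefix, adnominal, noun.
  COMPLEX_NOUN_FEATURE = %r#^(接頭|連体|名)詞#u

  def self.extract(io)
    new.extract(io)
  end
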
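  # Counts every noun (or compound noun) that is directly followed by the
  # topic particle は. Returns [surface, count] pairs, most frequent first.
  def extract(io)
    chars = Hash.new(0)
    # Read from the io argument, not from a hard-coded STDIN.
    # _prev is unused for now; it is reserved for the anaphora TODO.
    each_two_sentences(io) do |_prev, curr|
      next if curr.nil?
      nodes = parse_complex_nouns(mecab_nodes(curr))
      nodes.each_cons(2) do |n, m|
        candidate = n.is_a?(ComplexWord) ||
                    (n.feature =~ NOUN_FEATURE && n.feature !~ NOUN_EXCLUDE_FEATURE)
        chars[n.surface] += 1 if candidate && m.feature =~ TOPIC_SUFFIX_FEATURE
      end
    end
    chars.sort_by {|_surface, count| -count }
  end
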
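  # Parses text with MeCab and copies each node into a plain OpenStruct.
  # Starting from node.next skips the leading BOS node; the trailing EOS
  # node is kept.
  def mecab_nodes(text)
    @tagger ||= MeCab::Tagger.new
    nodes = []
    node = @tagger.parseToNode(text)
    while node = node.next
      nodes << OpenStruct.new(:surface => node.surface, :feature => node.feature)
    end
    nodes
  end
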
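  # Yields (previous sentence, current sentence) for every sentence in io.
  # The previous sentence is nil for the very first sentence.
  def each_two_sentences(io)
    prev = nil
    while line = io.gets
      line.chomp!
      line.gsub!(/「(.+?)」/u, '') # drop quoted dialogue
      line.gsub!(/ /u, '')
      line.gsub!(/[\(\)]/u, '')
      sentences = line.split(/。/u)
      next if sentences.empty? # blank line: keep the earlier context
      # Yield the first sentence even on the first line, where there is no
      # predecessor; otherwise the opening sentence is never analyzed.
      yield(prev && prev.last, sentences.first)
      sentences.each_cons(2) do |before, after|
        yield(before, after)
      end
      prev = sentences
    end
  end
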
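  # Merges each run of consecutive prefix / adnominal / noun nodes into one
  # ComplexWord. A run of length one is kept as the plain node.
  def parse_complex_nouns(nodes)
    buf, res = [], []
    flush = lambda do
      res << (buf.size == 1 ? buf.first : ComplexWord.new(buf)) unless buf.empty?
      buf = []
    end
    nodes.each do |node|
      if node.feature =~ COMPLEX_NOUN_FEATURE
        buf << node
      else
        flush.call
        res << node
      end
    end
    flush.call # keep a run that reaches the end of the input
    res
  end
end
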
# extract already returns the pairs sorted by descending count,
# so print the three most frequent topic nouns directly.
chars = CharacterExtractor.extract(STDIN)
chars[0, 3].each {|k, v| puts k }
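
The core heuristic (count a noun whenever the topic particle は follows it directly) can be tried without installing MeCab. The sketch below is an illustration, not part of the script above: `Token`, `sample_tokens`, and `count_topics` are hypothetical names, and the feature strings are hand-written in the IPADIC style instead of being produced by a real MeCab run, so an actual tagger may label these words differently.

```ruby
# Stand-in for a MeCab node: just a surface form and a feature string.
Token = Struct.new(:surface, :feature)

NOUN_RE         = /\A名詞,/
NOUN_EXCLUDE_RE = /\A名詞,(接尾|非自立|代名詞|副詞可能)/
TOPIC_RE        = /\A助詞,係助詞,\*,\*,\*,\*,は,ハ,ワ\z/

# Hand-written tokens for "メロスは激怒した" followed by "今日は"
# (assumed tagging, not real tagger output).
def sample_tokens
  [
    Token.new("メロス", "名詞,固有名詞,一般,*,*,*,メロス,メロス,メロス"),
    Token.new("は",     "助詞,係助詞,*,*,*,*,は,ハ,ワ"),
    Token.new("激怒",   "名詞,サ変接続,*,*,*,*,激怒,ゲキド,ゲキド"),
    Token.new("し",     "動詞,自立,*,*,サ変・スル,連用形,する,シ,シ"),
    Token.new("た",     "助動詞,*,*,*,特殊・タ,基本形,た,タ,タ"),
    Token.new("今日",   "名詞,副詞可能,*,*,*,*,今日,キョウ,キョー"),
    Token.new("は",     "助詞,係助詞,*,*,*,*,は,ハ,ワ"),
  ]
end

# Counts nouns that are immediately followed by the topic particle,
# skipping the excluded noun subtypes.
def count_topics(tokens)
  counts = Hash.new(0)
  tokens.each_cons(2) do |word, following|
    next unless word.feature =~ NOUN_RE
    next if word.feature =~ NOUN_EXCLUDE_RE
    counts[word.surface] += 1 if following.feature =~ TOPIC_RE
  end
  counts
end

p count_topics(sample_tokens)
```

With these tokens only メロス is counted: 激怒 is not followed by は, and 今日 is dropped by the 副詞可能 exclusion, which is the case the SPEC comment describes.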