@hitode909
Forked from udonchan/tf.rb
Created May 3, 2010 12:57
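#!/usr/bin/env ruby
# -*- coding: utf-8 -*-
# Counts noun frequencies (term frequency) in the main text of a web page.
# Written for Ruby 1.8: the 'generator' library was dropped from the standard
# library and $KCODE has no effect in Ruby 1.9 and later.
require 'rubygems'
require 'MeCab'
require 'uri'
require 'open-uri'
require 'generator'
require 'extractcontent.rb'
$KCODE = 'u'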
class TF
  def initialize
    @tagger = MeCab::Tagger.new('-O wakati')
    @extract_content = ExtractContent::Extractor.new({:decay_factor => 0.75})
  end

  # Download the page at uri_str and return its body as a string.
  def fetch(uri_str)
    uri = URI.parse(URI.encode(uri_str))
    open(uri, 'User-Agent' => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.4) Gecko/20100413 Firefox/3.6.4').read
  end
  protected :fetch

  # Fetch the page at url, extract its main text, and return the first MeCab node.
  def mecab_node(url)
    @tagger.parseToNode(@extract_content.analyse(fetch(url)).first)
  end
  protected :mecab_node
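  # Return a Hash mapping each noun's surface form to its number of occurrences.
  def tf(url)
    Generator.new { |generator|
      # Walk MeCab's linked list of nodes, yielding each one in turn.
      root = mecab_node(url)
      while root
        generator.yield root
        root = root.next
      end
    }.select { |node|
      # Keep nouns only: their feature string starts with 名詞.
      node.feature =~ /^名詞/
    }.inject(Hash.new(0)) { |count, node|
      count[node.surface] += 1
      count
    }
  end
end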
TF.new.tf('http://ja.wikipedia.org/wiki/沢城みゆき').each do |k, v|
  puts "#{k} : #{v}"
end
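
On Ruby 1.9 and later, where `generator` is no longer available, the node walk in `tf` could probably be written with an Enumerator instead. The following is a minimal sketch under stated assumptions, not the gist author's code: `Node`, `each_node`, and `noun_counts` are hypothetical names, and `Node` is a hand-built stand-in for a MeCab node with simplified feature strings, so the example runs without MeCab installed.

```ruby
# Hypothetical stand-in for a MeCab node (assumption: only these three
# accessors are needed): surface form, feature string, and the next node.
Node = Struct.new(:surface, :feature, :next)

# Walk the linked list node by node. Without a block this returns an
# Enumerator, which takes over the role Generator played on Ruby 1.8.
def each_node(head)
  return enum_for(:each_node, head) unless block_given?

  node = head
  while node
    yield node
    node = node.next
  end
end

# Count how often each noun's surface form occurs in the list.
def noun_counts(head)
  each_node(head)
    .select { |node| node.feature.start_with?('名詞') }
    .each_with_object(Hash.new(0)) { |node, counts| counts[node.surface] += 1 }
end

# Hand-built sample list: 猫 (noun) -> が (particle) -> 猫 (noun).
tail   = Node.new('猫', '名詞,一般', nil)
middle = Node.new('が', '助詞,格助詞', tail)
head   = Node.new('猫', '名詞,一般', middle)

noun_counts(head).each { |surface, count| puts "#{surface} : #{count}" }
# prints "猫 : 2"
```

With the real binding, the value returned by `parseToNode` would take the place of `head`, assuming its nodes respond to `surface`, `feature`, and `next` as they do in the script above.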