@sasamijp
Created September 28, 2014 11:35
Reads URLs from a db, parses each blog article's content with nokogiri, converts it into a conversation corpus, and writes the result to a db
# -*- encoding: utf-8 -*-
require 'nokogiri'
require 'open-uri'
require 'sequel'
require './SSparser.rb'
# for ankake.blog.jp
s = SSparser.new
# Return every row of the :url table in the SQLite file dbname.
def read(dbname)
  db = Sequel.connect("sqlite://#{dbname}")
  db[:url].all
end
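# Insert each parsed line (speaker name, line of dialogue, and the line it
# replies to) into the :respond table of the SQLite file dbname.
def insert(dbname, data)
  db = Sequel.connect("sqlite://#{dbname}")
  data.each do |v|
    db[:respond].insert(name: v[:name], serif: v[:serif], in_reply_to: v[:in_reply_to])
  end
end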
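# Fetch url and return the story text of the article as one string.
def extractcontent(url)
  charset = nil
  # URI.open is required here: Kernel#open no longer accepts URLs as of Ruby 3.0.
  html = URI.open(url) do |f|
    charset = f.charset
    f.read
  end
  puts url
  doc = Nokogiri::HTML.parse(html, nil, charset)
  # Pick the child of the article body with the most opening quote marks ("「"),
  # i.e. the node that holds the most dialogue.
  re = doc.xpath('//div[@class="article-body"]')[0].children.max_by { |v| v.text.count('「') }
  # Turn <br> and closing </div> tags into newlines so each line of dialogue
  # stays on its own line after the markup is stripped.
  doc2 = Nokogiri::HTML.parse(re.to_s.gsub('<br>', "\n").gsub('</div>', "\n</div>"), nil, charset)
  ret = []
  doc2.xpath('//div')[1..-1].each do |v|
    # Skip post headers (post number, date, time, ID) and very short blocks.
    next if v.text =~ /\d+:.*\d+\/\d+\/\d+\(.\) \d+:\d+:\d+.\d+ ID:........./
    next if v.text.length < 100
    ret << v.text
  end
  ret.join('')
end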
#puts s.parse extractcontent('http://ankake.blog.jp/archives/1008457253.html')
urls = read('url.db').map { |row| row[:value] }
urls.each do |url|
  insert('main.db', s.parse(extractcontent(url)))
end
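
The two filters inside extractcontent can be tried on plain strings, without Nokogiri or a database. The sketch below is only an illustration: the helper names thread_header? and keep_block?, the two constants, and the sample strings are hypothetical, while the pattern and the 100-character cutoff are copied from the script. Because the pattern is not anchored, it appears that a long block is also dropped if a post header occurs anywhere inside it.

```ruby
# Hypothetical helpers; only the regex and the cutoff come from the script.
HEADER = /\d+:.*\d+\/\d+\/\d+\(.\) \d+:\d+:\d+.\d+ ID:........./
MIN_LENGTH = 100

# True when the text contains a post header (number, date, time, ID).
def thread_header?(text)
  HEADER.match?(text)
end

# True when extractcontent would keep the block: no header, at least 100 characters.
def keep_block?(text)
  !thread_header?(text) && text.length >= MIN_LENGTH
end

# Made-up sample inputs.
header = '1: 名無しさん 2014/09/28(日) 11:35:00.00 ID:abcdefghi'
short  = '男「おはよう」'
long   = '男「おはよう」' * 20

puts thread_header?(header) # true
puts keep_block?(short)     # false (only 7 characters)
puts keep_block?(long)      # true  (140 characters, no header)
```

Pulling the two checks out like this makes it easier to adjust the pattern for another blog layout before running the full crawl.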