@orumin
Last active July 4, 2021 07:20
#!/usr/bin/env ruby
# encoding: utf-8
#
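# Scrapes the Web Archive (Wayback Machine) to grab chapter text in the same format as
# the text download of Shōsetsuka ni Narō (小説家になろう), then writes it out in the YAML
# format narou.rb uses for chapter bodies. Assumes a table of contents named toc.yaml
# exists one directory above the directory the script is run from.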
#
require 'yaml'
require 'nokogiri'
require 'open-uri'
# Chapter entries from the narou.rb table of contents, one directory up.
toc = YAML.load_file('../toc.yaml')["subtitles"]

(1..124).each do |i|
  ans = []
  title = nil

  # Fetch the archived chapter page. Chapters 10-12 come from an earlier snapshot.
  begin
    if 10 <= i and i <= 12 then
      url = "http://web.archive.org/web/20130708184241/http://ncode.syosetu.com/n7145bl/#{i}"
    else
      url = "http://web.archive.org/web/20131127132412/http://ncode.syosetu.com/n7145bl/#{i}"
    end
    # URI.open is needed on Ruby 3.0+, where open-uri no longer handles URLs via Kernel#open.
    doc = Nokogiri.HTML(URI.open(url).read)
  rescue
    # On a fetch or parse failure, fall back to a stub page and record the failure in the body.
    doc = Nokogiri.HTML("<html><body><p>abort</p></body></html>")
    ans.push("Nokogiri aborted")
  end

  # The chapter title is whatever follows the last " - " in the page <title>.
  begin
    title_element = doc.xpath("//title").inner_text
    title = title_element.sub(/^.* - /,"")
  rescue
    title = "Unknown"
  end

  begin
    doc.xpath("//div[@class='novel_view']").each do |node|
      next if node.children.inner_text.empty?
      # Flatten <ruby> markup into the |base《reading》 text notation.
      node.search('rb').each do |rb| rb.replace('|' + rb.inner_text) end
      node.search('rt').each do |rt| rt.replace(rt.inner_text) end
      node.search('rp').each do |rp|
        rp.replace("《") if rp.inner_text == "("
        rp.replace("》") if rp.inner_text == ")"
      end
      node.search('ruby').each do |ruby| ruby.replace(ruby.inner_text) end
      ans.push(node.inner_text)
    end

    # Attach the body to this chapter's toc entry before opening the file,
    # so a missing entry does not leave an empty file behind.
    hash = toc[i-1]
    hash.store("element", {"introduction"=>'',"body"=>ans.join(""),"postscript"=>'',"data_type"=>"text"})
    File.open("#{i} #{title}.yaml", 'w') { |file|
      YAML.dump(hash,file)
    }
  rescue => e
    puts "something wrong at chapter #{i}: #{e.message}"
  end
end
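
The output step of the script can be tried in isolation, without any network access. The following is a minimal sketch using only the standard library: it reproduces the title cleanup and the "element" hash the script writes. The helper names `chapter_title` and `build_section` and the sample toc entry are hypothetical; real entries in narou.rb's toc.yaml may carry other keys.

```ruby
require 'yaml'

# Strip everything up to the last " - " from a page <title>,
# using the same pattern the script applies.
def chapter_title(page_title)
  page_title.sub(/^.* - /, "")
end

# Return a copy of a toc entry with the scraped body attached
# under "element", in the shape the script dumps to YAML.
def build_section(toc_entry, body)
  toc_entry.merge(
    "element" => {
      "introduction" => "",
      "body"         => body,
      "postscript"   => "",
      "data_type"    => "text"
    }
  )
end

# Hypothetical toc entry, for illustration only.
entry = { "index" => "1", "subtitle" => "Prologue" }
section = build_section(entry, "First line.\nSecond line.\n")
yaml_text = YAML.dump(section)
puts yaml_text
```

Unlike the script, which mutates the toc entry with `store`, this sketch uses `merge` so the original entry is left untouched; either approach produces the same YAML.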