@orumin
Last active June 22, 2023 01:13
#!/usr/bin/env ruby
# encoding: utf-8
#
# Scrapes the Web Archive and saves each chapter as text in the same format
# as the text download offered by Shōsetsuka ni Narō (小説家になろう).
#
require 'nokogiri'
require 'open-uri'
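# The chapter title is later used verbatim as part of the output file name.
# A minimal sketch, assuming titles may contain characters that are invalid
# in file names (for example "/"). `safe_filename` is a hypothetical helper,
# not part of the original script, and nothing else in this file calls it.
def safe_filename(index, title)
  # Replace path separators and other characters that common file systems reject.
  cleaned = title.gsub(%r{[/\\:*?"<>|]}, '_').strip
  # Fall back to the same placeholder the script uses for a missing title.
  cleaned = 'Unknown' if cleaned.empty?
  "#{index} #{cleaned}.txt"
end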
(1..124).each do |i|
  ans = []
  title = nil
  begin
    # Chapters 10-12 are read from an earlier snapshot than the other chapters.
    if (10..12).cover?(i)
      url = "http://web.archive.org/web/20130708184241/http://ncode.syosetu.com/n7145bl/#{i}"
    else
      url = "http://web.archive.org/web/20131127132412/http://ncode.syosetu.com/n7145bl/#{i}"
    end
    # URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs.
    doc = Nokogiri.HTML(URI.open(url).read)
  rescue
    # If the fetch or parse fails, fall back to a placeholder document.
    doc = Nokogiri.HTML("<html><body><p>abort</p></body></html>")
    ans << "Nokogiri aborted"
  end
  begin
    # Strip everything up to the last " - " in the <title> text.
    title_element = doc.xpath("//title").inner_text
    title = title_element.sub(/^.* - /, "")
  rescue
    title = "Unknown"
  end
  begin
    ans.push("#{title}\n\n")
    doc.xpath("//div[@class='novel_view']").each do |node|
      # inner_text never returns nil, so test for an empty string instead.
      next if node.inner_text.empty?
      # Convert <ruby> markup into the plain-text form |base《reading》.
      node.search('rb').each { |rb| rb.replace('|' + rb.inner_text) }
      node.search('rt').each { |rt| rt.replace(rt.inner_text) }
      node.search('rp').each do |rp|
        rp.replace("《") if rp.inner_text == "("
        rp.replace("》") if rp.inner_text == ")"
      end
      node.search('ruby').each { |ruby| ruby.replace(ruby.inner_text) }
      ans.push(node.inner_text)
    end
    # One output file per chapter, named "<chapter number> <title>.txt".
    File.open("#{i} #{title}.txt", 'w') do |file|
      file.write ans.join
    end
  rescue => e
    # Report which chapter failed and why, then continue with the next one.
    warn "chapter #{i}: something wrong (#{e.message})"
  end
end