@tdtds
Created January 19, 2012 10:14
Batch-convert the HTML-format documents on renzaburo.jp to Aozora Bunko format
#!/usr/bin/env ruby
# -*- coding: utf-8; -*-
#
# This command is obsolete; see https://github.com/tdtds/aozoragen instead.
#
# usage: renzaburo2aozora <URL>
# URL: the index page of a novel on renzaburo.jp (HTML version only, not Flash).
#
require 'open-uri'
require 'pathname'
require 'nokogiri'
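# Fetch one chapter page and return its text in Aozora Bunko format.
# book_title is stripped from each chapter heading when present.
def get_content( uri, book_title = '' )
  text = ''
  # Pages are served as CP932 (Shift_JIS). URI.open is used because the bare
  # Kernel#open no longer accepts URLs since Ruby 3.0.
  html = URI.open( uri, 'r:CP932', &:read ).encode( 'UTF-8' )
  html = html.gsub( /&mdash;/, "\u2500" ).gsub( /&quot;/, "\u201D" )
  (Nokogiri( html ) / 'div#mainContent' ).each do |content|
    (content / 'h3').each do |t|
      title = t.text.sub( /^『#{Regexp.escape( book_title )}』 /, '' )
      text << "[#改ページ]\n\n" << title << "\n\n"
    end
    (content / 'div.textBlock p' ).each do |para|
      next if /<次回につづく>/ =~ para.text # skip the "to be continued" notice
      # Indent centered paragraphs by ten spaces.
      text << ' ' * 10 if (para.attr('class') || '').index( 'txtAlignC' )
      # Node#text drops tags, so turn <br> elements into newlines first.
      (para / 'br').each {|br| br.replace( Nokogiri::XML::Text.new( "\n", br.document ) ) }
      text << para.text << "\n\n"
    end
  end
  text
end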
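# Process every index URL given on the command line.
while i = ARGV.shift do
  index = URI( i.sub( /index\.html$/, '' ) )
  bookname = Pathname( index.to_s ).basename
  title = ''
  text = ''
  html = Nokogiri( URI.open( index, 'r:CP932', &:read ) )
  (html / 'title').each do |t|
    # Keep only the part of <title> before the first pipe (ASCII or full-width).
    # An unescaped /|.*/ matches the empty string and strips nothing.
    title = t.text.sub( /\s*[|｜].*/m, '' )
    text << title << "\n"
  end
  (html / 'div.textBlock strong' ).each do |st|
    text << st.text << "\n\n\n"
  end
  # Append each chapter linked from the index, in listed order.
  (html / 'ul.btnList li.withDate a' ).each do |li|
    text << get_content( index.merge( li.attr( :href ) ), title )
  end
  # Write the whole book to <bookname>.txt in the current directory.
  File.open( "#{bookname}.txt", 'w' ) do |w|
    w.write text
  end
end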