@stormbeta
Created December 30, 2013 21:00
Simple script to scrape Worm web serial chapters into aggregate HTML
#!/usr/bin/ruby
# Scrapes a crude HTML representation of the Worm (Parahumans) web serial chapters.
# Chapter paragraphs go to stdout; progress messages go to stderr.
require 'rubygems'
require 'nokogiri'
require 'open-uri'

# URI::encode was removed in Ruby 3.0; this parser provides the same escaping
uriParser = URI::RFC2396_Parser.new

# URL of first chapter
nextChapterUri = uriParser.escape("http://parahumans.wordpress.com/category/stories-arcs-1-10/arc-1-gestation/1-01/")

loop do
  $stderr.puts "Opening #{nextChapterUri}"
  # Kernel#open no longer accepts URLs as of Ruby 3.0, so call URI.open
  currentPage = Nokogiri::HTML(URI.open(nextChapterUri))
  content = currentPage.css('div.entry-content')
  isEnd = true

  # Search for the link to the next chapter
  content.css('a').each do |n|
    if n.text == 'Next Chapter'
      isEnd = false
      nextChapterUri = uriParser.escape(n.attr('href'))
    end
  end

  # Emit the chapter's paragraphs as raw HTML
  chapterText = content.css('p').to_s
  puts chapterText

  if isEnd
    $stderr.puts "No further chapters found, exiting..."
    exit(0)
  end
end
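
The script prints bare <p> fragments, so the aggregate output has no <html> or <body> wrapper. Below is a minimal sketch of one way to wrap that output in a complete HTML document. The helper name wrap_html and the file names scrape.rb, wrap.rb and worm.html are hypothetical and not part of the original gist.

```ruby
# Hypothetical helper (not part of the original script): wraps scraped
# <p> fragments in a minimal HTML document so that browsers and e-book
# converters see a well-formed page with a declared encoding.
# The title is inserted verbatim, so it is assumed to be plain text.
def wrap_html(body, title = 'Worm')
  <<~HTML
    <!DOCTYPE html>
    <html>
    <head>
    <meta charset="utf-8">
    <title>#{title}</title>
    </head>
    <body>
    #{body}
    </body>
    </html>
  HTML
end

# Usage sketch, assuming the scraper is saved as scrape.rb and this helper,
# followed by the line `puts wrap_html($stdin.read)`, is saved as wrap.rb:
#   ruby scrape.rb | ruby wrap.rb > worm.html
```

Keeping the wrapper as a separate step leaves the scraper's stdout unchanged, so the raw fragments can still be piped elsewhere.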