@sachac
Created August 1, 2013 14:13
Concatenate WordPress blog pages for later text manipulation
#!/usr/bin/env ruby
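# Retrieves and concatenates WordPress blog pages. The first page is kept
# whole; from each later page only the BODY (or the element matched by the
# given CSS selector) is appended.
#
# Usage: concat-html.rb URL [css selector for element] [max page number]
# A max page number of 0 (the default) means "keep going until a 404".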
require 'rubygems'
require 'nokogiri'
require 'open-uri'
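base_url = ARGV[0]
abort "Usage: concat-html.rb URL [css selector for element] [max page number]" if base_url.nil?
# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open(base_url))
element = ARGV[1] || "body"
limit = (ARGV[2] || 0).to_i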
if (matches = base_url.match(/(.*)\/page\/([0-9]+)/))
  base_url = matches[1]
  page_number = matches[2].to_i + 1
else
  page_number = 2
end
body = doc.at_css(element)
abort "No element matching '#{element}' on the first page" if body.nil?
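# Fetch succeeding pages until the limit is reached or a page is not found
begin
  while limit == 0 || page_number <= limit
    $stderr.puts "Processing page #{page_number}"
    doc2 = Nokogiri::HTML(URI.open(base_url + "/page/#{page_number}"))
    body2 = doc2.at_css(element)
    # Stop if this page lacks the requested element
    break if body2.nil?
    body.inner_html += body2.inner_html
    page_number += 1
  end
rescue StandardError => e
  # A 404 just means we ran past the last page; report anything else
  not_found = e.is_a?(OpenURI::HTTPError) && e.io.status[0] == "404"
  unless not_found
    $stderr.puts e
    $stderr.puts e.backtrace
  end
end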
puts doc.to_html
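
The URL handling in the script can be tried on its own, without any network access. The sketch below pulls that logic into two helpers; the names `split_paged_url` and `page_url` and the example.com address are made up for illustration and are not part of the script above.

```ruby
# Hypothetical helper (not in the original script): split a WordPress-style
# URL into the base URL and the number of the next page to fetch.
# Uses the same pattern as the script: an optional trailing /page/N.
def split_paged_url(url)
  if (m = url.match(%r{(.*)/page/([0-9]+)}))
    [m[1], m[2].to_i + 1]
  else
    # No page number in the URL, so the given URL is page 1
    [url, 2]
  end
end

# Hypothetical helper: build the URL for a given page number.
def page_url(base, page_number)
  "#{base}/page/#{page_number}"
end

base, next_page = split_paged_url("http://example.com/blog/page/3")
puts page_url(base, next_page)  # prints http://example.com/blog/page/4
```

Starting from a URL that already names a page therefore resumes at the page after it, which is handy when an earlier run stopped partway through.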