@KTamas
Created July 2, 2012 08:45
Grabs Jenny Lawson's column from CafeMom and builds a mobi-compatible HTML file.
#!/usr/bin/env ruby
# encoding: UTF-8
require 'rubygems'
require 'open-uri'
require 'nokogiri'
PAGES = [0, 11, 21, 31]
BASE_URL = "http://thestir.cafemom.com/column/ill_advised?next="
HEADER = '<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8" /></head><body>' + "\n"
FOOTER = '</body></html>'
output = HEADER
urls = []
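# Collect the article links from each index page of the column.
# URI.open comes from open-uri (Ruby 2.5+); plain Kernel#open no longer
# accepts URLs on Ruby 3.0 and later.
PAGES.each do |page|
  index_page = Nokogiri.HTML(URI.open(BASE_URL + page.to_s))
  index_page.css("a.readMore").each do |article|
    urls.push(article['href'])
  end
end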
# Visit the collected articles in reverse order.
urls.reverse.each do |url|
  puts url
  a_page = Nokogiri.HTML(URI.open(url))
  title = a_page.css('div h1').text.strip
  content = a_page.css('div.articleBody').to_s
  to_replace = {}
  # Download every image and remember the local filename it was saved under.
  content.scan(/(<img.+?src=")(.+?)(".+?>)/) do |image|
    outfilename = (rand()*100).to_s.gsub('.', '')
    File.open("#{outfilename}.jpg", "wb") { |img_file| URI.open(image[1]) { |read_img| img_file.write(read_img.read) } }
    to_replace[image[1]] = "#{outfilename}.jpg"
  end
  # Point the image URLs in the markup at the local copies.
  to_replace.each do |k,v|
    content.gsub!(k, v)
  end
  # Put a line break after every image.
  content.gsub!(/(<img.+?>)/, '\1<br />')
  output += "<h1>#{title}</h1>\n"
  output += content
  output += "\n\n<mbp:pagebreak/>\n\n"
end
output += FOOTER
File.open("output.html", "w") { |f| f.write(output) }
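
The script saves every image as `<random digits>.jpg`, whatever the real file type is. A minimal sketch of one possible alternative follows. It assumes the image URLs parse with `URI.parse`, which raises `URI::InvalidURIError` otherwise. The helper name `local_image_name` is hypothetical and not part of the original gist.

```ruby
require 'securerandom'
require 'uri'

# Hypothetical helper (not in the original script): build a local filename
# for a downloaded image, keeping the extension found in the URL path.
def local_image_name(image_url)
  ext = File.extname(URI.parse(image_url).path)
  ext = '.jpg' if ext.empty? # fall back to the script's original assumption
  "#{SecureRandom.hex(8)}#{ext}" # 16 random hex characters plus the extension
end
```

Inside the `content.scan` loop, `outfilename = local_image_name(image[1])` could then replace the `rand()`-based name, with the `".jpg"` suffix dropped from the two places that append it.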