Created
January 17, 2012 22:57
-
-
Save danielpcox/1629614 to your computer and use it in GitHub Desktop.
Quick-and-dirty script to download and stitch together the Washington Post, sans Sports.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/home/danielpcox/.rvm/rubies/ruby-1.9.3-p0/bin/ruby | |
# ############# The Washington Post Downloader ############# | |
# | |
# Downloads and merges the day's Epaper into three sections. Ignores the Sports section. | |
# (requires pdftk, wget, and the nokogiri gem) | |
require 'open-uri' | |
require 'nokogiri' | |
todays_date = Time.now.strftime("%Y-%m-%d") | |
puts "Fetching today's paper..." | |
["A", "B", "C"].each do |section| | |
file = open("http://www.washingtonpost.com/todays_paper?dt=#{todays_date}&bk=#{section}&pg=1") | |
doc = Nokogiri::HTML(file) | |
num_pages = doc.css("li.last a").first.content.to_i | |
# if the Epaper hasn't come out yet... | |
if num_pages == 0 | |
puts "EPAPER NOT OUT YET." | |
exit 0 | |
end | |
%x(mkdir -p #{todays_date}/#{section}) | |
puts "* Getting #{section} section (#{num_pages} pages)..." | |
merge_list = "" | |
for i in 1..num_pages do | |
puts " - page " << i.to_s | |
%x(wget -q -O #{todays_date}/#{section}/#{section}x#{i}.pdf http://www.washingtonpost.com/rw/WashingtonPost/Content/Epaper/#{todays_date}/#{section}x#{i}.pdf) | |
this_pdf_path = "#{todays_date}/#{section}/#{section}x#{i}.pdf" | |
if File.zero?(this_pdf_path) | |
puts "! SKIPPING UNEXPECTEDLY EMPTY FILE #{this_pdf_path}" | |
elsif !File.exists?(this_pdf_path) | |
puts "! SKIPPING UNEXPECTEDLY MISSING FILE #{this_pdf_path}" | |
else | |
merge_list << " " << this_pdf_path | |
end | |
end | |
puts "* Merging #{section} section..." | |
%x(pdftk #{merge_list} cat output #{todays_date}/#{section}_section.pdf) | |
end | |
puts "* Cleaning up..." | |
%x(rm -rf #{todays_date}/A) | |
%x(rm -rf #{todays_date}/B) | |
%x(rm -rf #{todays_date}/C) | |
puts "Done!" |
made it more DRY and resilient
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Not DRY, "unnecessarily" uses system commands, and breaks on stuff like 2011-12-30, when pages 4 and 5 were both stuck into 4 so page 5 didn't exist. I love it anyway.