@jamesmartin
Created April 5, 2012 09:35
Little website scraping example
require 'rubygems'
require 'nokogiri'
require 'httparty'
require 'uri'
require 'pp'

# Teach HTTParty to parse text/html responses with Nokogiri, so every
# Page.get call returns a queryable Nokogiri document.
class HtmlParserIncluded < HTTParty::Parser
  SupportedFormats.merge!('text/html' => :html)

  def html
    Nokogiri::HTML(body)
  end
end

class Page
  include HTTParty
  parser HtmlParserIncluded
end

archive_directory = "#{Dir.pwd}/archive"
Dir.mkdir(archive_directory) unless File.directory?(archive_directory)

archive_page = Page.get('http://www.daringfireball.net/archive')

total_saved = 0
archive_page.css('.archive p a').each do |node|
  article_uri = URI.parse(node['href'])
  # Flatten the URL path into a filename, e.g. /2012/04/foo -> _2012_04_foo
  article_filename = article_uri.path.gsub('/', '_')

  puts "Fetching #{node['href']}"
  article_page = Page.get(node['href'])

  File.open("#{archive_directory}/#{article_filename}", 'w') do |file|
    file.puts article_page.css('.article').to_html
    total_saved += 1
  end
end

puts "Fetched and saved #{total_saved} articles."
@jamesmartin (Author)
This little script walks through the daringfireball.net archive and saves the body of each article to disk.
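Each article ends up under ./archive in a file named after its URL path, with slashes replaced by underscores. A minimal sketch of how you might read the saved pages back and list their titles (the h1 selector here is an assumption about Daring Fireball's article markup, not something the script itself relies on):

require 'nokogiri'

Dir.glob("#{Dir.pwd}/archive/*").each do |path|
  doc = Nokogiri::HTML(File.read(path))
  heading = doc.at_css('h1')   # assumed: each saved article body contains an h1 title
  puts "#{File.basename(path)}: #{heading ? heading.text.strip : '(no title found)'}"
end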

Daring Fireball's archive is a single HTML page stretching back to 2002; the format of each article page is consistent and the HTML is cleanly marked up, so it makes for a nice example.
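The script depends on exactly two CSS selectors: '.archive p a' to find the article links on the archive index, and '.article' to pull the body out of each article page. If the markup ever changes, those are the first things to check; a quick sanity check of the first selector might look like this (a sketch, not part of the gist):

require 'httparty'
require 'nokogiri'

index = Nokogiri::HTML(HTTParty.get('http://www.daringfireball.net/archive').body)
links = index.css('.archive p a')
puts "Found #{links.size} article links"
puts links.first(3).map { |a| a['href'] }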
