Created April 5, 2012 09:35
Little website scraping example
require 'rubygems'
require 'nokogiri'
require 'httparty'
require 'uri'
require 'pp'

# Teach HTTParty to treat text/html responses as HTML and hand the body to
# Nokogiri, so Page.get returns a parsed document instead of a plain string.
class HtmlParserIncluded < HTTParty::Parser
  SupportedFormats.merge!('text/html' => :html)

  def html
    Nokogiri::HTML(body)
  end
end

class Page
  include HTTParty
  parser HtmlParserIncluded
end

# Save articles into ./archive, creating the directory on first run.
archive_directory = "#{Dir.pwd}/archive"
Dir.mkdir(archive_directory) unless File.directory?(archive_directory)

archive_page = Page.get('http://www.daringfireball.net/archive')
total_saved = 0

# Every archive entry is a link inside .archive p; fetch each article and
# write its .article markup to a file named after the URL path.
archive_page.css('.archive p a').each do |node|
  article_uri = URI.parse(node['href'])
  article_filename = article_uri.path.gsub('/', '_')

  puts "Fetching #{node['href']}"
  article_page = Page.get(node['href'])

  File.open("#{archive_directory}/#{article_filename}", 'w') do |file|
    file.puts article_page.css('.article').to_html
    total_saved += 1
  end

  puts "Fetched and saved #{total_saved} articles."
end
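A quick note on the filename scheme in the loop above: each article is saved under its URL path with the slashes swapped for underscores. A small sketch of what that transformation produces (the URL here is made up for illustration):

require 'uri'

# Hypothetical article URL, used only to illustrate the naming scheme.
href = 'http://daringfireball.net/2012/04/example_article'
path = URI.parse(href).path   # => "/2012/04/example_article"
puts path.gsub('/', '_')      # => "_2012_04_example_article"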
This little script walks through the daringfireball.net archive and saves the body of each article to disk.
Daring Fireball's archive is a single HTML page stretching back to 2002; the format of each article page is consistent and the HTML is cleanly marked up, so it makes for a nice example.
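Because each saved file holds just the .article fragment, the copies can be re-parsed later with Nokogiri. A minimal sketch, assuming a filename produced by the script above and assuming the fragment keeps an h1 heading (that selector is a guess, not taken from the site):

require 'nokogiri'

# Hypothetical saved file; names follow the URL-path-with-underscores
# scheme used by the script, e.g. archive/_2012_04_example_article.
fragment = Nokogiri::HTML(File.read('archive/_2012_04_example_article'))

# Print the article title if the fragment contains an <h1> heading.
heading = fragment.at_css('h1')
puts heading.text if heading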