@willgk
Forked from falsefalse/simpledesktops_scraper.rb
Last active August 29, 2015 14:15
require 'rubygems'
require 'hpricot'
require 'open-uri'

@url = "http://simpledesktops.com/browse/"

# Download every desktop image linked from the page at url into a directory
# named after page_number, then follow the "older" pagination link.
def scrape_page(url, page_number=0)
  puts "Scraping #{url}"
  # File.exist? replaces File.exists?, which was removed in Ruby 3.2
  Dir.mkdir "#{page_number}" unless File.exist? "#{page_number}"
  # URI.open replaces Kernel#open, which no longer opens URLs as of Ruby 3.0
  doc = Hpricot( URI.open( url ) )
  ( doc/".desktop > a" ).each do |link|
    href = link.attributes["href"]
    path = "#{page_number}/#{File.basename href}"
    if File.exist? path
      puts "#{path} exists"
    else
      print "#{href}... "
      URI.open( href ) do |image|
        File.open( path, "wb" ) do |f|
          f.write image.read
          puts "Saved #{path}"
        end
      end
    end
  end
  # Recurse into the next (older) page, if the pagination block links to one
  next_page = ( doc/".pagination .older" )
  unless next_page.length == 0
    u = URI.parse url
    next_page_url = URI::HTTP.build({ :host => u.host, :path => next_page[0].attributes["href"] }).to_s
    puts "Next page: #{next_page_url}"
    page_number += 1
    scrape_page next_page_url, page_number
  end
end

scrape_page @url

willgk commented Feb 22, 2015

Just forking this to see if I can't add some progress bars.
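
One possible way to get progress bars, sketched here as an assumption and not something the gist itself implements: open-uri accepts `:content_length_proc` and `:progress_proc` options, which report the expected size and the bytes received so far. The helper names `progress_bar` and `download_with_progress` below are hypothetical, and the bar format is an arbitrary choice.

```ruby
require 'open-uri'

# Hypothetical helper: render a fixed-width text bar for done/total bytes.
# Falls back to a byte counter when the server sends no Content-Length.
def progress_bar(done, total, width = 20)
  return "[#{'?' * width}] #{done} bytes" if total.nil? || total <= 0
  ratio = [done.to_f / total, 1.0].min
  filled = (ratio * width).floor
  format("[%s%s] %3d%%", "#" * filled, " " * (width - filled), (ratio * 100).floor)
end

# Hypothetical wrapper around the download step in scrape_page:
# fetch href into path, redrawing the bar on one line as bytes arrive.
def download_with_progress(href, path)
  total = nil
  URI.open(href,
           :content_length_proc => lambda { |len| total = len },
           :progress_proc => lambda { |done| print "\r#{progress_bar(done, total)}" }) do |image|
    File.open(path, "wb") { |f| f.write image.read }
  end
  puts
end
```

Under these assumptions, the `URI.open( href ) do |image| ... end` block in `scrape_page` could be swapped for a call to `download_with_progress href, path`. Keeping the rendering in a separate function means the bar can be checked without touching the network.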
