@icco
Last active August 29, 2015 14:22
Scrape all of unsplash.com
#! /usr/bin/env ruby
require 'nokogiri'
require 'open-uri'
require 'mime/types' # MIME::Types is used below to map a Content-Type to a file extension
require 'typhoeus'
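urls = []
# Walk grid pages 1 through 19 and collect every download link.
(1...20).each do |i|
  # URI.open is required on Ruby 3.0+, where Kernel#open no longer fetches URLs.
  doc = Nokogiri::HTML(URI.open("https://unsplash.com/grid?page=#{i}"))
  doc.css('a').each do |a|
    href = a["href"]
    if href && href.include?("/download")
      urls.push "https://unsplash.com#{href}"
    end
  end
end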
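thread_count = 10
urls = urls.sort.uniq
hydra = Typhoeus::Hydra.new(max_concurrency: thread_count)
urls.each do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  request.on_complete do |response|
    # Skip failed requests rather than saving an error page
    next unless response.success?
    # https://unsplash.com/photos/TXG9VLN1J9U/download
    url_name = response.effective_url.split("/").last
    ext = File.extname(url_name)
    name = File.basename(url_name, ext)
    # Get the correct ext for this file; drop parameters such as "; charset=..." first
    content_type = response.headers["Content-Type"].to_s.split(";").first.to_s.strip
    mime_type = content_type.empty? ? nil : MIME::Types[content_type].first
    # Skip responses whose type has no known extension
    next if mime_type.nil? || mime_type.extensions.empty?
    new_ext = mime_type.extensions.first
    file_name = "#{name.downcase.gsub(/[^a-z0-9]/, '')}.#{new_ext.downcase}"
    p file_name
    # Image data is binary, so write it without newline or encoding conversion
    File.binwrite(file_name, response.body)
  end
  hydra.queue request
end
hydra.run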