@bds
Created March 13, 2014 20:47
Scraping web pages with Ruby for fun and profit
# print the numbers 1 to 5
(1..5).each do |page|
  puts page
end
# print a url string with a page number inside
(1..5).each do |page|
  puts "http://blog.scripted.com/page/#{page}"
end
# open and read the url
# This is nice, but the result is just a string: a big bunch of characters
# (URI.open is used because Kernel#open no longer fetches URLs on Ruby 3.0+)
require 'open-uri'
(1..2).each do |page|
  puts URI.open("http://blog.scripted.com/page/#{page}").read
end
# Nokogiri gives us a document object instead of a big string
require 'nokogiri'
(1..2).each do |page|
  doc = Nokogiri::HTML(URI.open("http://blog.scripted.com/page/#{page}"))
  # find all of the article elements
  puts doc.css("article")
end
# find all of the links in the article elements
(1..1).each do |page|
  doc = Nokogiri::HTML(URI.open("http://blog.scripted.com/page/#{page}"))
  puts doc.css("article a")
end
# pull the href attribute out of each link
(1..1).each do |page|
  doc = Nokogiri::HTML(URI.open("http://blog.scripted.com/page/#{page}"))
  puts doc.css("article a").collect { |link| link['href'] }
end
# only unique links plz
(1..2).each do |page|
  doc = Nokogiri::HTML(URI.open("http://blog.scripted.com/page/#{page}"))
  puts doc.css("article a").collect { |link| link['href'] }.uniq
end
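# The loop above only de-duplicates within a single page, and an href can be
# relative or missing. Here is a minimal sketch of collecting unique absolute
# links across pages using only the standard library. The helper name
# unique_absolute_links and the sample hrefs are hypothetical, made up for
# illustration; they are not part of the original script.

```ruby
require 'set'
require 'uri'

# Hypothetical helper: resolve each href against base and drop duplicates.
def unique_absolute_links(base, hrefs)
  seen = Set.new
  hrefs.compact.each do |href|       # compact skips anchors with no href (nil)
    seen << URI.join(base, href).to_s # URI.join makes relative hrefs absolute
  end
  seen.to_a
end

# Made-up sample data standing in for the hrefs scraped from two pages
page_one = ["/post/one", "http://blog.scripted.com/post/two", nil]
page_two = ["/post/one", "/post/three"]
puts unique_absolute_links("http://blog.scripted.com/", page_one + page_two)
```

# In the real loop you would pass in the collected link['href'] values instead
# of the sample arrays.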
# how much ya' bench
require 'benchmark'
puts Benchmark.measure {
  (1..10).each do |page|
    doc = Nokogiri::HTML(URI.open("http://blog.scripted.com/page/#{page}"))
    puts doc.css("article a").collect { |link| link['href'] }.uniq
  end
}
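# Benchmark.measure returns a Benchmark::Tms object, and printing it shows user,
# system, total and real (wall-clock) time. For scraping, most of the time is
# spent waiting on the network, so real is the number to watch. A small sketch,
# where sleep is an assumed stand-in for a slow request:

```ruby
require 'benchmark'

# sleep stands in for a slow network request (an assumption for this sketch)
timing = Benchmark.measure { sleep 0.05 }

# real is wall-clock seconds; utime and stime are CPU seconds
puts "wall clock: #{timing.real.round(2)}s"
puts "cpu (user + system): #{(timing.utime + timing.stime).round(2)}s"
```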
# perform the requests in parallel, much fast, wow
require 'typhoeus'
hydra = Typhoeus::Hydra.hydra
puts Benchmark.measure {
  (1..10).each do |page|
    request = Typhoeus::Request.new("http://blog.scripted.com/page/#{page}", followlocation: true)
    # runs once this request finishes, after hydra.run has started
    request.on_complete do |response|
      doc = Nokogiri::HTML(response.body)
      puts doc.css("article a").collect { |link| link['href'] }.uniq
    end
    hydra.queue request
  end
  # nothing is fetched until run is called; it blocks until every request is done
  hydra.run
}
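# If pulling in a gem is not an option, plain Ruby threads give a similar
# effect. This is a different technique from Typhoeus's hydra, sketched here
# under assumptions: fetch_all is a hypothetical helper, and the block passed
# to it is a stub fetcher so the example runs offline.

```ruby
# Hypothetical helper: call the fetcher block for every url from a small
# pool of worker threads and return a hash of url => result.
def fetch_all(urls, workers: 4, &fetcher)
  queue = Queue.new
  urls.each { |url| queue << url }
  results = {}
  lock = Mutex.new

  threads = Array.new(workers) do
    Thread.new do
      loop do
        url = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break           # no work left, let this worker finish
        end
        body = fetcher.call(url)
        lock.synchronize { results[url] = body }
      end
    end
  end

  threads.each(&:join)
  results
end

urls = (1..10).map { |page| "http://blog.scripted.com/page/#{page}" }
# Stub fetcher (assumption): sleeps briefly instead of hitting the network
pages = fetch_all(urls) { |url| sleep 0.01; "<html>#{url}</html>" }
puts pages.size
```

# For real requests the block could be { |url| URI.open(url).read }, with
# open-uri required as earlier in this script.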