@phil
Created June 11, 2014 13:13
EventMachine scraping Hacker News
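require "eventmachine"
require "em-http-request"
require "nokogiri"

EventMachine.run do
  http_hackernews = EventMachine::HttpRequest.new("https://news.ycombinator.com").get

  http_hackernews.callback do
    links = Nokogiri::HTML.parse(http_hackernews.response).css("td.title a")
    # Keep only absolute http(s) URLs; relative links such as "item?id=..." are skipped.
    hrefs = links.map { |link| link["href"].to_s }.select { |href| href.match(%r{\Ahttps?://}) }

    # Count outstanding requests instead of deleting from the NodeSet while iterating over it.
    pending = hrefs.size
    EventMachine.stop if pending.zero?

    # Called on success and on failure, so the reactor always stops once every request has finished.
    finish = lambda do
      pending -= 1
      EventMachine.stop if pending.zero?
    end

    hrefs.each do |href|
      http_site = EventMachine::HttpRequest.new(href).get

      http_site.callback do
        begin
          title = Nokogiri::HTML.parse(http_site.response).css("head>title").inner_text
          puts "#{href} - #{http_site.response_header.status} - #{title}"
        rescue StandardError
          puts "#{href} - BOOM"
        end
        finish.call
      end

      http_site.errback do
        puts "#{href} - Error - #{http_site.error.inspect}"
        finish.call
      end
    end
  end

  # Without this, a failed front-page request would leave the reactor running forever.
  http_hackernews.errback do
    puts "https://news.ycombinator.com - Error - #{http_hackernews.error.inspect}"
    EventMachine.stop
  end
end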
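
The script has to decide two things that are easy to get subtly wrong: which links are worth fetching, and when every outstanding request has finished so the reactor can stop. The sketch below isolates both using only the Ruby standard library. `absolute_http_url?` and `PendingCounter` are hypothetical names introduced for illustration; they are not part of the gist, EventMachine, or em-http-request.

```ruby
require "uri"

# Hypothetical helper (not part of the gist): true only for absolute http(s) URLs.
# URI::HTTPS is a subclass of URI::HTTP, so one check covers both schemes.
def absolute_http_url?(href)
  uri = URI.parse(href.to_s)
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  false
end

# Hypothetical helper: tracks outstanding requests and runs a block exactly
# once, when the last one reports in (whether it succeeded or failed).
class PendingCounter
  def initialize(count, &on_done)
    @count = count
    @on_done = on_done
    @on_done.call if @count.zero? # nothing to wait for
  end

  def done!
    @count -= 1
    @on_done.call if @count.zero?
  end
end
```

Inside the reactor, one would presumably build the counter with `PendingCounter.new(hrefs.size) { EventMachine.stop }` and call `done!` from both the `callback` and the `errback` of each request, so a single failed site cannot keep the process alive.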