@Burgestrand
Created August 10, 2012 09:38
Threaded scraping with Capybara, Webkit and Celluloid
# `source :rubygems` is deprecated in newer Bundler releases; use the explicit URL.
source 'https://rubygems.org'

gem 'pry'
gem 'capybara'
gem 'capybara-webkit'
gem 'celluloid'
GEM
  remote: http://rubygems.org/
  specs:
    addressable (2.3.2)
    capybara (1.1.3)
      mime-types (>= 1.16)
      nokogiri (>= 1.3.3)
      rack (>= 1.0.0)
      rack-test (>= 0.5.4)
      selenium-webdriver (~> 2.0)
      xpath (~> 0.1.4)
    capybara-webkit (0.12.1)
      capybara (>= 1.0.0, < 1.2)
      json
    celluloid (0.12.3)
      facter (>= 1.6.12)
      timers (>= 1.0.0)
    childprocess (0.3.6)
      ffi (~> 1.0, >= 1.0.6)
    coderay (1.0.8)
    facter (1.6.13)
    ffi (1.1.5)
    json (1.7.5)
    libwebsocket (0.1.5)
      addressable
    method_source (0.8.1)
    mime-types (1.19)
    multi_json (1.3.6)
    nokogiri (1.5.5)
    pry (0.9.10)
      coderay (~> 1.0.5)
      method_source (~> 0.8)
      slop (~> 3.3.1)
    rack (1.4.1)
    rack-test (0.6.2)
      rack (>= 1.0)
    rubyzip (0.9.9)
    selenium-webdriver (2.25.0)
      childprocess (>= 0.2.5)
      libwebsocket (~> 0.1.3)
      multi_json (~> 1.0)
      rubyzip
    slop (3.3.3)
    timers (1.0.1)
    xpath (0.1.4)
      nokogiri (~> 1.3)

PLATFORMS
  ruby

DEPENDENCIES
  capybara
  capybara-webkit
  celluloid
  pry
desc "Start an interactive session with the search loaded."
task :console do
  exec 'bundle exec pry -r./search -I.'
end
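require 'bundler/setup'
require 'pry'
require 'celluloid'
require 'capybara/dsl'
require 'capybara/webkit'
require 'cgi'
require 'uri' # URI() is used below; require it explicitly

Capybara.configure do |config|
  # We only drive remote sites, so no local Rack server is needed.
  config.run_server = false
  config.default_driver = :webkit
end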
class Search
  include Celluloid
  include Capybara::DSL

  class << self
    # Class-level getter/setter: `href 'https://example.com/'` in a subclass
    # body sets the default base URL; `href` with no argument reads it.
    def href(href = nil)
      @href = href if href
      @href
    end
  end

  def initialize(href = self.class.href)
    @base_href = URI(href.to_s)

    # Capybara requires all absolute URLs to start with http.
    unless %w[http https].include?(@base_href.scheme)
      raise ArgumentError, "base_href must be of http(s) scheme"
    end

    # Overrides Capybara::DSL#page (see the attr_reader below),
    # to make sure we have one session per actor.
    @page = Capybara::Session.new(Capybara.default_driver)

    # Configuration things are nice.
    yield self if block_given?
  end

  protected

  attr_reader :base_href
  attr_reader :page

  public

  # Overridden to avoid Capybara going to the server app_host
  # when given relative URLs. Only the path and query of +url+
  # are used; scheme and host always come from base_href.
  def visit(url)
    url = URI(url)
    base_href.path = url.path
    base_href.query = url.query
    Celluloid.logger.info "Visiting #{base_href}"
    super(base_href.to_s)
  end

  def title
    find('head title').text
  end
end
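
The `visit` override above keeps the scheme and host of `base_href` and takes only the path and query from the URL it is given. Below is a minimal standalone sketch of that rebasing using only Ruby's standard `uri` library. The `rebase` helper is hypothetical and not part of the gist; unlike the original, it works on a copy rather than mutating the base.

```ruby
require 'uri'

# Hypothetical helper (not in the gist): copy the path and query of +url+
# onto a fresh copy of +base+, keeping the base's scheme and host.
def rebase(base, url)
  url = URI(url.to_s)
  rebased = URI(base.to_s) # re-parse so the caller's base stays untouched
  rebased.path = url.path
  rebased.query = url.query
  rebased
end

base = URI('https://www.google.com/')
puts rebase(base, '/search?q=ruby')
# An absolute URL on another host is rebased too: its host is discarded.
puts rebase(base, 'http://example.com/about')
```

One consequence of this design is that an actor cannot leave its base host through `visit`: an absolute URL pointing elsewhere is mapped back onto `base_href`.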
class Google < Search
  href 'https://www.google.com/'

  def search(query)
    visit('/search?q=%s' % CGI.escape(query))
    all("h3.r a").map do |link|
      { title: link.text, url: link[:href].sub(%r|\A/url\?q=|, "") }
    end
  end
end
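
`Google#search` strips a leading `/url?q=` from each result link, which leaves whatever follows the target in the query string attached and the target itself still percent-encoded. As a sketch of an alternative, assuming the links have the form `/url?q=<target>&...`, the target can be read out of the query string with the standard `uri` library. The `unwrap_result_href` helper and the sample hrefs are made up for illustration and are not part of the gist.

```ruby
require 'uri'

# Hypothetical helper (not in the gist): extract the target from a
# "/url?q=<target>&..." redirect href; return any other href unchanged.
def unwrap_result_href(href)
  uri = URI(href)
  return href unless uri.path == '/url' && uri.query

  # decode_www_form yields [name, value] pairs with values percent-decoded.
  pair = URI.decode_www_form(uri.query).assoc('q')
  pair ? pair.last : href
rescue URI::InvalidURIError
  href # leave unparseable hrefs as they are
end

puts unwrap_result_href('/url?q=http://example.com/a%3Fb%3D1&sa=U')
puts unwrap_result_href('http://example.com/direct')
```

Because the check is on the parsed path rather than on the start of the string, this also handles the case where the driver reports the href as an absolute URL.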