@zealot128
Last active August 15, 2018 12:22
Web crawler helper class based on Poltergeist (PhantomJS). Using Capybara as a framework for building web crawlers is surprisingly convenient.
class ExampleCrawler < PoltergeistCrawler
  def crawl
    visit "https://news.ycombinator.com/"
    click_on "More"
    page.evaluate_script("window.location = '/'")
  end
end

ExampleCrawler.new.crawl
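
The example above clicks "More" once. A minimal sketch of following that link repeatedly is shown here. The module name `FollowMore`, the method `follow_more`, and its `link_text` and `max_pages` parameters are hypothetical and not part of the gist; the only calls it relies on are the Capybara DSL methods `visit`, `current_url`, `has_link?` and `click_on`.

```ruby
# Hypothetical mixin, not part of the original gist. It expects the including
# class to provide the Capybara::DSL methods visit, current_url, has_link?
# and click_on (PoltergeistCrawler does, via `include Capybara::DSL`).
module FollowMore
  # Visits start_url, then clicks link_text until the link is no longer
  # present or max_pages pages have been seen.
  # Returns the URLs that were visited, in order.
  def follow_more(start_url, link_text: 'More', max_pages: 3)
    visit start_url
    urls = [current_url]
    while urls.size < max_pages && has_link?(link_text)
      click_on link_text
      urls << current_url
    end
    urls
  end
end
```

To try it, add `include FollowMore` to a crawler subclass and call `follow_more("https://news.ycombinator.com/")` from `crawl`. Capping the page count keeps the crawler from walking the site indefinitely.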
require 'capybara/poltergeist'
require 'capybara/dsl'

class PoltergeistCrawler
  include Capybara::DSL

  def initialize
    Capybara.register_driver :poltergeist_crawler do |app|
      Capybara::Poltergeist::Driver.new(app, {
        :js_errors => false,
        :inspector => false,
        # discard PhantomJS output if you don't care about JS errors/console.logs;
        # the logger gets written to, so /dev/null has to be opened for writing
        phantomjs_logger: File.open('/dev/null', 'w')
      })
    end
    # newer Capybara releases rename this setting to default_max_wait_time
    Capybara.default_wait_time = 3
    Capybara.run_server = false
    Capybara.default_driver = :poltergeist_crawler
    page.driver.headers = {
      "DNT" => 1,
      "User-Agent" => "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0"
    }
  end

  # handy to peek into what the browser is doing right now
  def screenshot(name = "screenshot")
    page.driver.render("public/#{name}.jpg", full: true)
  end

  # find("path") and all("path") work OK for most cases. Sometimes I need more control, like finding hidden fields
  def doc
    Nokogiri.parse(page.body)
  end
end
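
The `doc` helper is described as a way to reach things like hidden fields. A minimal sketch of that use case follows. The module name `HiddenFields` and its method `hidden_fields` are hypothetical, not part of the gist; it only assumes that `doc` returns a Nokogiri document, whose `css` method yields nodes that expose attributes through `[]`.

```ruby
# Hypothetical mixin, not part of the original gist. It expects the including
# class to provide #doc returning a Nokogiri document (as PoltergeistCrawler does).
module HiddenFields
  # Collects name => value pairs for every <input type="hidden"> in the
  # current page. Inputs without a name attribute are skipped.
  def hidden_fields
    doc.css('input[type="hidden"]').each_with_object({}) do |input, fields|
      name = input['name']
      fields[name] = input['value'] if name
    end
  end
end
```

After `include HiddenFields` in a crawler subclass, calling `hidden_fields` once a page has been visited returns a plain hash. That is handy for picking up tokens a form expects to be posted back.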

IvRRimum commented May 9, 2016

What's the Ruby version? What are the Capybara and Poltergeist versions? This looks amazing!

IvRRimum commented May 9, 2016

Worked with this Gemfile.lock:

GEM
  remote: https://rubygems.org/
  specs:
    addressable (2.4.0)
    capybara (2.7.1)
      addressable
      mime-types (>= 1.16)
      nokogiri (>= 1.3.3)
      rack (>= 1.0.0)
      rack-test (>= 0.5.4)
      xpath (~> 2.0)
    cliver (0.3.2)
    coderay (1.1.1)
    jsoner (0.0.4)
    method_source (0.8.2)
    mime-types (3.0)
      mime-types-data (~> 3.2015)
    mime-types-data (3.2016.0221)
    mini_portile2 (2.0.0)
    multi_json (1.12.0)
    nokogiri (1.6.7.2)
      mini_portile2 (~> 2.0.0.rc2)
    poltergeist (1.9.0)
      capybara (~> 2.1)
      cliver (~> 0.3.1)
      multi_json (~> 1.0)
      websocket-driver (>= 0.2.0)
    pry (0.10.3)
      coderay (~> 1.1.0)
      method_source (~> 0.8.1)
      slop (~> 3.4)
    rack (1.6.4)
    rack-test (0.6.3)
      rack (>= 1.0)
    slop (3.6.0)
    websocket-driver (0.6.3)
      websocket-extensions (>= 0.1.0)
    websocket-extensions (0.1.2)
    xpath (2.0.0)
      nokogiri (~> 1.3)

PLATFORMS
  ruby

DEPENDENCIES
  capybara
  jsoner
  poltergeist
  pry

BUNDLED WITH
   1.11.2

@rendekarf

Thanks for this!
