Skip to content

Instantly share code, notes, and snippets.

@zealot128
Last active August 15, 2018 12:22
Show Gist options
  • Save zealot128/6524687 to your computer and use it in GitHub Desktop.
Save zealot128/6524687 to your computer and use it in GitHub Desktop.
Web Crawler Helper class based upon Poltergeist (PhantomJS).Using Capybara as framework for building webcrawlers is surprisingly convenient
class ExampleCrawler < PoltergeistCrawler
def crawl
visit "https://news.ycombinator.com/"
click_on "More"
page.evaluate_script("window.location = '/'")
end
end
ExampleCrawler.new.crawl
require 'capybara/poltergeist'
require 'capybara/dsl'
class PoltergeistCrawler
include Capybara::DSL
def initialize
Capybara.register_driver :poltergeist_crawler do |app|
Capybara::Poltergeist::Driver.new(app, {
:js_errors => false,
:inspector => false,
phantomjs_logger: open('/dev/null') # if you don't care about JS errors/console.logs
})
end
Capybara.default_wait_time = 3
Capybara.run_server = false
Capybara.default_driver = :poltergeist_crawler
page.driver.headers = {
"DNT" => 1,
"User-Agent" => "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0"
}
end
# handy to peek into what the browser is doing right now
def screenshot(name="screenshot")
page.driver.render("public/#{name}.jpg",full: true)
end
# find("path") and all("path") work ok for most cases. Sometimes I need more control, like finding hidden fields
def doc
Nokogiri.parse(page.body)
end
end
@rendekarf
Copy link

thanks for this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment