@barrbrain · Last active December 19, 2015 09:39
A quick and dirty external-site crawler using capybara-mechanize.
require 'capybara/mechanize'
require 'sinatra/base'
require 'uri'

# Drive a remote site with the Mechanize driver; no local server is started.
Capybara.run_server = false
Capybara.current_driver = :mechanize

# capybara-mechanize expects a rack app to be registered even when only
# remote URLs are visited, so provide a trivial Sinatra app.
class TestApp < Sinatra::Base
  get '/' do; end
end
Capybara.app = TestApp
Capybara.app_host = "http://127.0.0.1/"

class Spider
  include Capybara::DSL

  # Creating a Spider enqueues it; class-level state holds the work
  # queue and the set of URLs already visited.
  def initialize(url)
    @url = url
    @@queue ||= []
    @@done ||= {}
    @@queue << self
  end

  # Visit this spider's URL, then enqueue every link target and every
  # page reached by clicking a submit button.
  def search
    return unless @url.start_with? Capybara.app_host
    return if @@done.include? @url
    @@done[@url] = nil
    begin
      visit(@url) unless @url == current_url
    rescue
      return
    end
    puts @url
    all(:xpath, "//input[@type='submit']|//a[@href]").map(&:path).each do |path|
      begin
        node = find :xpath, path
        if node[:href]
          # Plain link: resolve it relative to the current page.
          url = URI.join(@url, node[:href]).to_s
        else
          # Submit button: click it, record where we landed, then
          # navigate back so the remaining nodes stay valid.
          node.click
          url = current_url
          visit @url unless @url == current_url
        end
        Spider.new url unless @@done.include? url
      rescue
        # Ignore stale nodes and navigation failures; keep crawling.
      end
    end
    self
  end

  # Process the queue in waves until a wave discovers nothing new.
  def recurse
    loop do
      wave = @@queue
      @@queue = []
      break if wave.empty?
      wave.each(&:search)
    end
  end
end

Spider.new(Capybara.app_host).recurse
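The queue/done bookkeeping can be exercised without Capybara. A minimal sketch of the same breadth-first pattern over an in-memory link graph (`GRAPH` and `crawl` are illustrative names, not part of the gist):

```ruby
# Breadth-first crawl over a fake link graph, mirroring the
# @@queue / @@done bookkeeping used by Spider above.
GRAPH = {
  "/"  => ["/a", "/b"],
  "/a" => ["/b", "/c"],
  "/b" => ["/"],
  "/c" => []
}

def crawl(start)
  queue = [start]
  done  = {}
  until queue.empty?
    # Take the current wave and reset the queue, as Spider#recurse does.
    wave = queue
    queue = []
    wave.each do |url|
      next if done.include?(url)   # skip URLs already visited
      done[url] = nil
      queue.concat(GRAPH.fetch(url, []))
    end
  end
  done.keys
end

crawl("/")  # each page is visited exactly once, cycles included
```

Note that resetting the queue before searching the wave is what lets pages discovered mid-wave be deferred to the next pass instead of mutating the collection being iterated.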