Crawler prototype
# 1. Initialize with a root URL: `crawler = Crawler.new(root: '...')`
# 2. `prepare!` calls fetch on that URL and memoizes the body in `@page_data`
# 3. `fetch` boots HTTParty and wraps the response body in a Maybe
# 4. `unwrap` returns the wrapped `#body` of the HTTParty response
# 5. The Maybe wrapper treats the response body as an optional value
# 6. `crawler._next` is called to crawl through pages, finding more pages to crawl
# 7. If there are no more pages to crawl, we know we're on a leaf
# 8. If we have a file on this leaf, upload it to our DataStore
#
require 'httparty'
require 'nokogiri' # find_url parses fetched pages with Nokogiri
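# The gist references a Maybe class but never defines one. A minimal sketch,
# assuming Maybe simply wraps a possibly-nil value and exposes it via #maybe:
class Maybe
  attr_reader :maybe

  def initialize(value)
    @maybe = value
  end
end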
class Crawler
  attr_reader :status

  def initialize(root:)
    @root = root
    @status = true
  end

  # Steps 3-4: fetch a URL with HTTParty, wrapping the response body
  # in a Maybe and unwrapping it back out.
  def fetch(url)
    unwrap(Maybe.new(HTTParty.get(url).body))
  end

  # Step 2: fetch the root page once and memoize its body.
  def prepare!
    @page_data ||= fetch(@root)
    self
  end
  # Step 6: follow the first link on the current page. When no link is
  # found, @page_data becomes nil and update_status flips @status.
  def _next
    url = find_url(@page_data)
    @page_data = url && fetch(url)
    update_status
    @page_data
  end

  def update_status
    @status = false unless @page_data
  end
  # Steps 7-8: on a leaf, upload any file we find to the DataStore.
  def leaf
    store if contains_file?
  end

  def can_continue?
    status
  end

  private
  # Grab the first <a> on the page with Nokogiri; treat the page as a
  # leaf when there is none. Note: relative hrefs are not resolved.
  def find_url(page)
    link = Nokogiri::HTML(page).at_css('a')
    leaf unless link
    link && link['href']
  end
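  # `store` and `contains_file?` are called by `leaf` but never defined in
  # the original gist; no-op placeholders (assumptions) so the class loads:
  def contains_file?
    false # placeholder: would inspect @page_data for a downloadable file
  end

  def store
    # placeholder: would upload the file to our DataStore (step 8)
  end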
  def unwrap(maybe)
    maybe.maybe
  end
end
crawler = Crawler.new(root: 'http://google.com?s=thing')
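# A driver sketch (not in the original gist): prepare the crawler, then keep
# stepping with _next until update_status flips can_continue? to false.
crawler.prepare!
crawler._next while crawler.can_continue?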