Crawler prototype
Gist by dainmiller, created April 14, 2020 06:19
# 1. Initialize with a root URL: `crawler = Crawler.new(root: '')`
# 2. `fetch` is called on that URL
# 3. Boots HTTParty and stores the response in the class
# 4. Returns the `.body` of the HTTParty response
# 5. A Maybe wraps the response body in an optional
# 6. `crawler._next` is called to crawl through pages, finding more pages to crawl
# 7. If there are no more pages to crawl, we know we're on a leaf
# 8. If we have a file on this leaf, upload it to our DataStore
#
require 'httparty'
require 'nokogiri'

# Minimal optional wrapper (step 5): exposes a possibly-nil value via #maybe.
class Maybe
  attr_reader :maybe
  def initialize(value)
    @maybe = value
  end
end

class Crawler
  attr_reader :status

  def initialize(root:)
    @root = root
    @status = true
  end

  # Fetch the URL via HTTParty; return the body, or nil on failure.
  def fetch(url)
    unwrap Maybe.new(HTTParty.get(url).body)
  rescue StandardError
    nil
  end

  def prepare!
    @page_data ||= fetch(@root)
    self
  end

  # Crawl onward: find a link on the current page, fetch it, update status.
  def _next
    url = find_url(@page_data)
    @page_data = url && fetch(url)
    update_status
  end

  def leaf
    store if contains_file?
  end

  def can_continue?
    status
  end

  private

  # First link's href on the page, or nil (a leaf) when there are none.
  def find_url(page)
    url = Nokogiri::HTML(page.to_s).at_css('a[href]')&.[]('href')
    leaf unless url
    url
  end

  def update_status
    @status = false unless @page_data
  end

  # Steps 7-8 are stubbed: there is no DataStore in this prototype.
  def contains_file?; false; end
  def store; end

  def unwrap(maybe)
    maybe.maybe
  end
end
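HTTParty is an external gem; for comparison, a dependency-free version of the fetch step could use only Ruby's stdlib Net::HTTP. A sketch, assuming `fetch_body` as a hypothetical stand-in (it is not part of the gist):

```ruby
require 'net/http'
require 'uri'

# Hypothetical stdlib stand-in for fetch: GET the URL and return the
# response body, or nil when the URL is invalid or the request fails.
def fetch_body(url)
  Net::HTTP.get_response(URI(url)).body
rescue StandardError
  nil
end
```

Returning nil on failure mirrors the crawler's status check: `update_status` flips `@status` to false whenever `@page_data` is empty.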
Crawler.new(root: 'http://google.com?s=thing')
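Nokogiri is likewise a gem; the link discovery in step 6 can be roughly approximated with a stdlib regexp. A sketch, assuming a hypothetical `first_href` helper (a regexp is not a substitute for real HTML parsing):

```ruby
# Hypothetical regexp-based link finder: returns the first href attribute
# found in the page, or nil when the page contains no links.
def first_href(page)
  match = page.to_s.match(/<a\s[^>]*href=["']([^"']+)["']/i)
  match && match[1]
end

first_href('<a href="/next">more</a>') # returns "/next"
first_href('<p>no links here</p>')     # returns nil
```

The nil return plays the same role as `find_url` returning nothing: it signals a leaf, at which point the crawler checks for a file to store.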