@kei-s
Forked from june29/pager.rb
Created December 3, 2008 12:19
Crawl paginated sites by following the AutoPagerize nextLink XPath
require "rubygems"
require "nokogiri"
require "httpclient"
require "uri"
require "json"

# Follows "next page" links using the AutoPagerize SITEINFO database on wedata.
class Pager
  @@siteinfo_url = "http://wedata.net/databases/AutoPagerize/items.json"

  attr_accessor :doc

  def initialize(url)
    @url = url
    @client = HTTPClient.new
    @doc = Nokogiri::HTML(@client.get_content(@url))
    @next_link_xpath = nil

    # Try the longest (most specific) URL patterns first and keep the first match.
    siteinfo = JSON.parse(@client.get_content(@@siteinfo_url))
    match = siteinfo
      .reject { |item| item["data"]["url"].nil? }
      .sort_by { |item| -item["data"]["url"].size }
      .find { |item| Regexp.new(item["data"]["url"]) =~ @url }
    @next_link_xpath = match["data"]["nextLink"] if match
  end

  # Loads the next page into @doc and returns it; returns nil when no next link is found.
  def next
    return nil if @next_link_xpath.nil?

    node = @doc.xpath(@next_link_xpath).first
    return nil if node.nil? || node["href"].nil?

    # URI.join resolves absolute, root-relative, and relative hrefs against the current page.
    @url = URI.join(@url, node["href"]).to_s
    @doc = Nokogiri::HTML(@client.get_content(@url))
  end
end
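
The two steps Pager relies on, picking a nextLink XPath from SITEINFO and resolving the link it finds, can be tried without network access. The following is a minimal sketch, assuming a hand-written SITEINFO array in the same shape as the wedata items (`data.url`, `data.nextLink`). The sample entries, the example.com URLs, and the helper names `pick_next_link_xpath` and `resolve_next_url` are hypothetical and are not part of the gist.

```ruby
require "uri"

# Hypothetical sample entries shaped like wedata AutoPagerize items.
sample_siteinfo = [
  { "data" => { "url" => "^https?://example\\.com/", "nextLink" => "//a[@rel='next']" } },
  { "data" => { "url" => "^https?://example\\.com/blog/", "nextLink" => "//a[@class='older']" } },
  { "data" => { "nextLink" => "//a" } } # no "url" key, so it is skipped
]

# Returns the nextLink XPath of the longest URL pattern matching page_url, or nil.
def pick_next_link_xpath(siteinfo, page_url)
  match = siteinfo
    .reject { |item| item["data"]["url"].nil? }
    .sort_by { |item| -item["data"]["url"].size }
    .find { |item| Regexp.new(item["data"]["url"]) =~ page_url }
  match && match["data"]["nextLink"]
end

# Resolves an href (absolute, root-relative, or relative) against the current page URL.
def resolve_next_url(page_url, href)
  URI.join(page_url, href).to_s
end

page = "http://example.com/blog/page/1"
puts pick_next_link_xpath(sample_siteinfo, page)
puts resolve_next_url(page, "/blog/page/2")
puts resolve_next_url(page, "2")
puts resolve_next_url(page, "http://example.com/blog/page/3")
```

Sorting by pattern length is only a heuristic for "most specific rule first". `URI.join` is used instead of concatenating scheme and host so that ports and relative paths are handled as well.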