@PandaWhisperer
Created June 5, 2020 22:08
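require 'vessel'

# Crawls a site by following links and records every URL it visits.
class AdCrawler < Vessel::Cargo
  # JavaScript that lists the Google Ad Manager slots defined on the page.
  # Sizes returned as strings (named sizes) are reported as [0, 0].
  GAM_GET_SLOTS_JS = <<~JS.gsub(/\n/, '')
    googletag.pubads().getSlots().map(slot => ({
      adUnitPath: slot.getAdUnitPath(),
      slotElementId: slot.getSlotElementId(),
      sizes: slot.getSizes().map(size =>
        typeof size == 'string' ? [0, 0] : [size.getWidth(), size.getHeight()])
    }))
  JS

  # URLs that have already been parsed, used to avoid re-queueing them.
  def urls_visited
    @urls_visited ||= []
  end

  def parse
    puts "Visiting page: #{page.current_url}"
    urls_visited << page.current_url

    # Emit one record per page.
    yield({
      url: page.current_url,
      # slots: page.evaluate(GAM_GET_SLOTS_JS)
    })

    # if next_page = at_css('nav[data-role=nav-links] a[name]')
    # if nav_links = xpath('//nav[@data-role="nav-links"]//a[@name]')
    # Only anchors that carry an href can be resolved to a URL.
    if nav_links = xpath('//a[@href]')
      nav_links.each do |next_page|
        url = absolute_url(next_page.attribute(:href))
        # Queue the link unless this page has already been parsed.
        unless urls_visited.include? url
          yield request(url: url, method: :parse)
        end
      end
    end
  end
end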