@kejadlen
Created April 6, 2016 15:07
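require 'open-uri'
require 'rexml/document'

module GettingToPhilosophy
  module Crawler
    # Follows the first lowercase link of each article until the trail
    # reaches Philosophy or revisits a page (i.e. enters a loop).
    def self.crawl(*trail)
      until trail.last == Wikipedia::PHILOSOPHY || trail.uniq.size != trail.size
        puts trail.last

        doc = Wikipedia.article(trail.last)
        links = doc.elements.to_a('//div[@id="mw-content-text"]/p//a')
        # Keep only links whose text is a single lowercase word.
        lowercase_links = links.select { |link| link.text =~ /^[[:lower:]]+$/ }
        return trail if lowercase_links.empty?

        href = lowercase_links.first.attribute(:href)
        # Resolve the (usually relative) href against the current article URL.
        trail << URI.join(trail.last, href.value).to_s
      end
      trail
    end
  end

  module Wikipedia
    BASE = 'https://en.wikipedia.org/wiki'
    SPECIAL_RANDOM = "#{BASE}/Special:Random"
    PHILOSOPHY = "#{BASE}/Philosophy"

    # Special:Random redirects to an article; its canonical link gives the URL.
    def self.random_article
      article(SPECIAL_RANDOM)
        .elements['//link[@rel="canonical"]']
        .attribute(:href).value
    end

    # Fetches a page and parses it with REXML, a strict XML parser that
    # raises REXML::ParseException when the page is not well-formed XML.
    def self.article(article)
      # URI.open is required on Ruby 3.0+, where Kernel#open no longer opens URLs.
      html = URI.open(article.to_s).read
      REXML::Document.new(html)
    end
  end
end

if __FILE__ == $0
  include GettingToPhilosophy

  # An optional first argument names the starting article, e.g. "Ruby".
  article = ARGV.shift
  article = article ? "#{Wikipedia::BASE}/#{article}" : Wikipedia.random_article
  p Crawler.crawl(article)
end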
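
The crawl loop above stops on one of three conditions: the trail reaches the target, a page repeats, or no suitable link is found. The following is a minimal offline sketch of that logic, with no network access. The `walk` and `first_lowercase` helpers and the in-memory link table are hypothetical stand-ins introduced for illustration; they are not part of the original script.

```ruby
# Hypothetical helper: pick the first link text that is a single lowercase
# word, using the same pattern as the crawler. nil entries are skipped.
def first_lowercase(texts)
  texts.find { |text| text =~ /^[[:lower:]]+$/ }
end

# Hypothetical helper: follow next_of (a Hash mapping each page to the page
# it links to) until the target is reached, a page repeats, or there is no
# next page. Uses the same termination test as the crawl loop.
def walk(start, next_of, target)
  trail = [start]
  until trail.last == target || trail.uniq.size != trail.size
    nxt = next_of[trail.last]
    break if nxt.nil?

    trail << nxt
  end
  trail
end

if __FILE__ == $0
  p first_lowercase(['Greek', nil, 'wisdom', 'knowledge'])
  p walk('A', { 'A' => 'B', 'B' => 'Philosophy' }, 'Philosophy')
  p walk('A', { 'A' => 'B', 'B' => 'A' }, 'Philosophy')
end
```

When a cycle occurs, the repeated page appears twice in the returned trail, which is how a caller can tell a loop from a successful run ending at the target.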