@Opus1no2
Last active August 29, 2015 14:13
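require 'uri'
require 'mechanize'

module Wiki
  # Follows a chain of randomly chosen article links, starting from one page.
  class Crawl
    attr_reader :start, :domain, :mech, :exclude

    def initialize(start)
      @start = start
      @domain = get_domain
      @mech = Mechanize.new
      # Skip the main page and non-article namespaces.
      @exclude = /main_page|talk:|category:|template:|special:|user:|wikipedia:|help:|portal:|file:/i
    end

    # Prints one random article link found on +page+, then repeats from that
    # link until +num+ links have been printed.
    def rand_link(page, num)
      return if num <= 0

      link = mech
             .get(page)
             .links_with(href: /^\/wiki\//i)
             .map(&:href)
             .reject { |href| href =~ exclude }
             .sample
      return if link.nil? # dead end: the page has no eligible links

      puts domain + link
      sleep(3) # pause between requests
      rand_link(domain + link, num - 1)
    end

    # Scheme and host of the start URL, e.g. "https://en.wikipedia.org".
    def get_domain
      uri = URI.parse(start)
      "#{uri.scheme}://#{uri.host}"
    end

    # Entry point: print a chain of +num+ random links beginning at the start page.
    def init(num)
      rand_link(start, num)
    end
  end
end

Wiki::Crawl.new('https://en.wikipedia.org/wiki/Main_Page').init(100)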
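
The link-selection step in `rand_link` can be tried without any network access. The sketch below applies the same two filters to a plain array of href strings. `EXCLUDE`, `eligible_links`, and the sample hrefs are hypothetical names and data introduced for illustration; they are not part of the gist. It also anchors the prefix check with `\A` instead of `^`, which is an assumption about the intent, since `^` in Ruby matches at the start of any line.

```ruby
# Hypothetical stand-alone version of the filter used in Wiki::Crawl#rand_link.
# It works on plain strings, so it needs neither Mechanize nor a network.
EXCLUDE = /main_page|talk:|category:|template:|special:|user:|wikipedia:|help:|portal:|file:/i

# Keeps hrefs that start with /wiki/ and do not match the exclude pattern.
def eligible_links(hrefs)
  hrefs.select { |href| href =~ %r{\A/wiki/}i }
       .reject { |href| href =~ EXCLUDE }
end

# Made-up sample data for illustration only.
sample_hrefs = [
  '/wiki/Ruby_(programming_language)',
  '/wiki/Main_Page',
  '/wiki/Talk:Ruby',
  '/wiki/Category:Programming_languages',
  '/w/index.php?title=Ruby',
  'https://example.org/wiki/Elsewhere',
  '/wiki/Web_crawler'
]

p eligible_links(sample_hrefs)
# prints ["/wiki/Ruby_(programming_language)", "/wiki/Web_crawler"]
```

The exclude pattern is unanchored, so it rejects any href that contains one of the listed fragments anywhere in the string, not only as a namespace prefix.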