@jdan
Created January 21, 2012 21:23
(Nokogiri) Tests the idea that the first link on each wikipedia article will eventually lead to philosophy
#!/usr/bin/env ruby
# wiki-scraper.rb by Jordan Scales
# http://jordanscales.com
# http://programthis.net
#
# Tests the idea that the first link on each wikipedia article
# will eventually lead to philosophy
#
# Usage:
# ruby wiki-scraper.rb daft punk
require 'nokogiri'
require 'open-uri'
require 'cgi'
ROOT_URL = 'http://en.wikipedia.org'

def search_url(query)
  "#{ROOT_URL}/w/index.php?search=#{CGI.escape(query)}"
end

def title_from_url(url)
  # URI.open (from open-uri) replaces Kernel#open, which no longer
  # opens URLs as of Ruby 3.0
  doc = Nokogiri::HTML(URI.open(url))
  doc.css('h1#firstHeading').first.content
end

def title_from_query(query)
  title_from_url search_url(query)
end

# Returns the URL of the first link outside parentheses, or nil if none is found
def first_link(url)
  doc = Nokogiri::HTML(URI.open(url))
  depth = 0 # current parenthesis nesting depth

  # cycle through each paragraph
  doc.css('div.mw-content-ltr > p').each do |p|
    # in each paragraph, go through each node
    p.children.each do |c|
      # if we're not inside parentheses, return the next link we see
      return ROOT_URL + c['href'] if depth.zero? && c.name == 'a' && c['href']

      # update the depth from the parentheses in this node's text; a single
      # node can hold both an opening and a closing parenthesis
      text = c.text
      depth = [depth + text.count('(') - text.count(')'), 0].max
    end
  end

  nil
end

def first_link_from_query(query)
  first_link search_url(query)
end

start = ARGV.join(' ')
url = search_url start
title = title_from_url url
puts "1: #{title}"

count = 2
while title != 'Philosophy'
  url = first_link url
  abort "No link found on #{title}" if url.nil?
  title = title_from_url url
  puts "#{count}: #{title}: #{url}"
  count += 1
end
@wondermike-zz

Somebody on reddit posted exactly what I was thinking about when I saw this:

"If you really wanted to try all the articles, here are a couple of thoughts:
- cache every article that eventually leads to Philosophy, then if a given page links to an article in the cache, it too eventually leads to Philosophy
- Wikipedia used to (probably still does) make available an archive of all the articles for download. Then, you wouldn't have to deal with network latency, and whatnot.
Sounds like a fun project!"

That would be pretty awesome. Obviously, Wikipedia is changing all the time, but you could at least get a decent snapshot of how long it takes to get to Philosophy from any English article. You'd need to keep track of the pages already visited on the current path to detect infinite loops. Alternatively, we could just write something to insert a Philosophy link as the first link in every Wikipedia article and call it a day...
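
For what it's worth, here is a rough sketch of how the caching and loop-checking ideas could fit together. It is not part of the gist: `hops_to_target`, `TOY_LINKS`, and the link graph itself are made up for illustration, and the lookup is a plain lambda over a hash so it runs without touching the network. In practice the lookup would wrap something like the gist's `first_link`.

```ruby
# Toy first-link graph (hypothetical data, not real Wikipedia links).
TOY_LINKS = {
  'Daft Punk' => 'Electronic music',
  'Electronic music' => 'Music',
  'Music' => 'Art',
  'Art' => 'Philosophy',
  'Ping' => 'Pong',
  'Pong' => 'Ping'
}.freeze

# Hypothetical helper: follows first links from `start` until it hits
# `target`, a page whose answer is already cached, a dead end, or a loop.
# `next_page` is any callable mapping a title to its first-link title (or nil).
# `cache` maps title => hops to `target` (nil if the page never gets there).
def hops_to_target(start, target, next_page, cache = {})
  path = []   # titles visited on this walk, in order
  seen = {}   # titles on the current path, for quick loop checks
  current = start

  until current.nil? || current == target || cache.key?(current) || seen[current]
    seen[current] = true
    path << current
    current = next_page.call(current)
  end

  # Hops remaining from where the walk stopped; stays nil for a dead end
  # or a loop, meaning the target is never reached.
  remaining =
    if current == target
      0
    elsif !current.nil? && cache.key?(current)
      cache[current]
    end

  # Walk back along the path and record the answer for every page on it.
  path.reverse_each do |title|
    remaining += 1 unless remaining.nil?
    cache[title] = remaining
  end

  cache.fetch(start, remaining)
end

cache = {}
lookup = ->(title) { TOY_LINKS[title] }
p hops_to_target('Daft Punk', 'Philosophy', lookup, cache) # 4
p hops_to_target('Ping', 'Philosophy', lookup, cache)      # nil (loop)
```

Every page on a walk gets its answer cached, so a later walk that lands on any of them stops right away. Pages stuck in a loop are cached as nil, which means they are not re-walked either.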

My friend showed me this because I'm going to attempt to write something that randomly selects a link from a Wikipedia article. The only hard part is that I only want links to other articles: no references, disambiguation pages, etc. (and only articles in the language of the current page). I did a prototype in VB.NET a couple of years ago (the framework used where I worked at the time). It worked pretty well and was a fun way to interact with Wikipedia.
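
A first pass at the filtering part might look something like the sketch below. The names `article_hrefs`, `random_article_href`, and `SAMPLE_HREFS` are hypothetical, and the rules are guesses based only on how the hrefs look: keep relative `/wiki/` links, and drop anything with a namespace-style colon or a title ending in `_(disambiguation)`. That is an approximation. It will wrongly drop real articles with a colon in the title, and it will miss disambiguation pages that don't follow that naming pattern.

```ruby
# Sample hrefs of the kind found in an article body (made up for illustration).
SAMPLE_HREFS = [
  '/wiki/Electronic_music',
  '/wiki/File:Daft_Punk.jpg',
  '#cite_note-1',
  '/wiki/Help:IPA',
  'https://fr.wikipedia.org/wiki/Daft_Punk',
  '/wiki/Daft_Punk_(disambiguation)',
  '/wiki/Paris#History'
].freeze

# Hypothetical filter: keeps hrefs that look like ordinary articles on the
# same wiki. Relative /wiki/ links stay on the current language edition;
# absolute URLs and bare #fragment links (references) are dropped.
def article_hrefs(hrefs)
  hrefs.select do |href|
    next false unless href.start_with?('/wiki/')

    # Title without the /wiki/ prefix or any #section suffix.
    title = href.delete_prefix('/wiki/').split('#').first.to_s
    next false if title.empty?
    next false if title.include?(':')                   # File:, Help:, Category:, ...
    next false if title.end_with?('_(disambiguation)')

    true
  end.uniq
end

# Picks one article href at random, or nil if none qualify.
def random_article_href(hrefs, rng = Random.new)
  article_hrefs(hrefs).sample(random: rng)
end

p article_hrefs(SAMPLE_HREFS)
p random_article_href(SAMPLE_HREFS)
```

With Nokogiri, the input list could presumably come from something like `doc.css('div.mw-content-ltr a').map { |a| a['href'] }.compact`, reusing the selector from the gist.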

Anyway, I just started learning Ruby and whatnot, so this looks like a good place to start. Thanks for posting this.
