@jittat
Created June 16, 2012 17:52
Wikipedia article dump
#!/usr/bin/env ruby
# Lists the names of the pages in a Wikipedia category, optionally
# recursing into its sub-categories.
require 'nokogiri'
require 'open-uri'
require 'optparse'

DEFAULT_RECURSE_LEVEL = 2
# Links whose text starts with one of these (case-insensitive) are skipped.
BAD_STARTERS = ['template','portal','user','book']
URL_BASE = 'http://en.wikipedia.org/wiki/'

rlevel = DEFAULT_RECURSE_LEVEL
OptionParser.new do |opts|
  opts.banner = 'Usage: cat.rb [options] page name'
  opts.on("-r","--recurse R","Recursive level (default=2)") do |r|
    rlevel = r.to_i
  end
end.parse!
RECURSE_LEVEL = rlevel

def read_category(name,level=1)
  # URI::encode was removed in Ruby 3.0; percent-encode the page name instead.
  encoded_page_name = URI.encode_www_form_component(name.gsub(' ','_'))
  #puts encoded_page_name
  url = URL_BASE + encoded_page_name
  # Kernel#open no longer opens URLs (Ruby 3.0+), so call URI.open explicitly.
  doc = Nokogiri::HTML(URI.open(url))

  # Print every page link in the category, except the filtered prefixes.
  doc.css("#mw-pages .mw-content-ltr a").each do |n|
    title = n.text
    bad = BAD_STARTERS.any? { |b| title.downcase.start_with?(b) }
    puts title unless bad
  end

  # Descend into sub-categories until the requested depth is reached.
  if level < RECURSE_LEVEL
    doc.css("#mw-subcategories .mw-content-ltr a").each do |n|
      read_category('Category:' + n.text, level+1)
    end
  end
end

page_name = 'Category:' + (ARGV.join(' ').split(' ').join('_'))
read_category(page_name)

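#!/usr/bin/env ruby
# Prints the plain text of a single Wikipedia article.
require 'nokogiri'
require 'open-uri'

# URI::encode was removed in Ruby 3.0; percent-encode the page name instead.
page_name = URI.encode_www_form_component(ARGV.join(' ').split(' ').join('_'))
url = 'http://en.wikipedia.org/wiki/' + page_name
# Kernel#open no longer opens URLs (Ruby 3.0+), so call URI.open explicitly.
doc = Nokogiri::HTML(URI.open(url))

# Drop navigation boxes and per-section edit links before extracting the text.
doc.css("table.navbox").remove
doc.css("span.editsection").remove
puts doc.css("#mw-content-text").text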
jittat commented Jun 16, 2012

These two Ruby scripts download articles from Wikipedia for experimenting with document classification. One script reads a Wikipedia article and outputs its text. The other, cat.rb, reads a category and outputs the names of the pages in that category; it can also recurse into sub-categories. Both scripts use Nokogiri.
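
The category walk can be tried without network access. The following is a minimal sketch, not the script itself: `SAMPLE_TREE` and `collect_pages` are hypothetical names, and the in-memory hash stands in for the pages and sub-category links that cat.rb scrapes with Nokogiri. It applies the same prefix filter and the same `level < max_level` depth check.

```ruby
# Offline sketch of the depth-limited category walk; no network access.
# SAMPLE_TREE and collect_pages are hypothetical, for illustration only.
BAD_STARTERS = %w[template portal user book].freeze

# Stand-in for what the scraper would find on each category page.
SAMPLE_TREE = {
  'Category:Fruit' => {
    pages: ['Apple', 'Template:Fruit', 'Banana'],
    subcategories: ['Category:Citrus']
  },
  'Category:Citrus' => {
    pages: ['Lemon', 'Portal:Citrus'],
    subcategories: ['Category:Oranges']
  },
  'Category:Oranges' => {
    pages: ['Blood orange'],
    subcategories: []
  }
}.freeze

# Returns the page titles reachable from +name+, visiting at most
# +max_level+ levels of categories (the starting category is level 1).
def collect_pages(tree, name, max_level, level = 1)
  node = tree.fetch(name, { pages: [], subcategories: [] })

  # Same filter as cat.rb: skip titles starting with a bad prefix.
  pages = node[:pages].reject do |title|
    BAD_STARTERS.any? { |prefix| title.downcase.start_with?(prefix) }
  end

  # Recurse only while the depth limit has not been reached.
  if level < max_level
    node[:subcategories].each do |sub|
      pages.concat(collect_pages(tree, sub, max_level, level + 1))
    end
  end
  pages
end

puts collect_pages(SAMPLE_TREE, 'Category:Fruit', 2)
```

With a limit of 2, the walk visits the starting category and its direct sub-categories but stops before `Category:Oranges`. Note that the filter is a plain prefix match on the link text, so an ordinary article whose title happens to start with one of the prefixes (for example "User interface") is skipped as well.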
