@r3bo0t
Created September 16, 2010 09:18
Ruby crawler for downloading the externallinks dumps from the MediaWiki backup index for different languages
require 'hpricot'
require 'open-uri'
require 'shellwords'
require 'work_queue'

class DownloadWiki
  def initialize
    @locales = collect_locales
    @host = "http://download.wikimedia.org/"
  end

  def start uri = "http://download.wikimedia.org/backup-index.html"
    doc = get_doc uri
    if doc
      workq = WorkQueue.new(15)
      @locales.each do |locale|
        wiki_uri = get_locale_wiki doc, locale
        next unless wiki_uri
        ext_uri = fetch_external_links @host + wiki_uri
        workq.enqueue_b { download_externallinks ext_uri } if ext_uri
      end
      workq.join
    end
  end

  def fetch_external_links link
    puts link
    doc = get_doc link
    return download_external_links(doc) if doc
    nil
  end

  def download_external_links page
    # Returns the externallinks dump URL found on the page; the actual
    # download is performed later by download_externallinks.
    get_external_uri page
  end

  private

  def download_externallinks uri
    puts uri
    # Shellwords.escape guards against shell-metacharacter injection.
    `cd /path/to/your/dir && wget #{Shellwords.escape(uri)}`
  end

  def get_external_uri page
    page.search("a[@href*='externallinks']").first[:href] rescue nil
  end

  def get_locale_wiki page, locale
    page.search("a[@href*='#{locale}wiki/']").first[:href] rescue nil
  end

  def get_doc uri
    Hpricot(open(uri)) rescue nil
  end

  def collect_locales
    ["en", "fr", "it", "tr", "hi", "ro", "vi", "pt", "hu", "de"]
  end
end
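The crawler leans on the `work_queue` gem for its enqueue-then-join concurrency: up to 15 downloads run in parallel while the main thread blocks until all are done. A minimal sketch of that same pattern, using only the standard library's `Queue` and `Thread` (the `SimplePool` class and its method names are illustrative stand-ins, not the gem's real implementation):

```ruby
require 'thread'

# A fixed-size pool of worker threads draining a shared job queue.
# Mirrors the WorkQueue.new(15) / enqueue_b / join usage above.
class SimplePool
  def initialize(size)
    @queue = Queue.new
    @workers = Array.new(size) do
      Thread.new do
        # Queue#pop blocks until a job arrives; a nil sentinel stops the worker.
        while (job = @queue.pop)
          job.call
        end
      end
    end
  end

  # Like WorkQueue#enqueue_b: queue a block for some worker to run.
  def enqueue_b(&block)
    @queue << block
  end

  # Like WorkQueue#join: push one sentinel per worker, then wait for all.
  def join
    @workers.size.times { @queue << nil }
    @workers.each(&:join)
  end
end

pool = SimplePool.new(4)
results = Queue.new
10.times { |i| pool.enqueue_b { results << i * i } }
pool.join
```

Because the queue is FIFO, every enqueued job is consumed before the sentinels, so `join` returns only after all ten blocks have run.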