@squarism
Created November 21, 2011 16:35
Wikipedia mirroring in 10 lines
# we are going to intentionally use the vanilla mongo driver
require 'mongo'
require 'nokogiri'
require 'open-uri'
require 'active_support/core_ext' # ActiveSupport (from Rails) provides Hash.from_xml
include Mongo
# Connection is the legacy 1.x mongo driver API; it was removed in the 2.x driver
pages = Connection.new('localhost', 27017).db('loadtest').collection('pages')
wikipedia_page = "http://en.wikipedia.org/wiki/Special:Export/Ford_Motor"
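# noblanks drops whitespace-only text nodes; had problems without it
# URI.open (Ruby 2.5+) is used because Kernel#open no longer fetches URLs as of Ruby 3.0
doc = Nokogiri::XML(URI.open(wikipedia_page)) { |config| config.noblanks }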
page = Hash.from_xml(doc.to_s) # here's the magical method from rails
pages.insert page["mediawiki"]["page"]
# content includes hierarchy and complete structure of the original XML document
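
The `page["mediawiki"]["page"]` lookup above works because the XML hierarchy becomes nested hashes keyed by element name. The sketch below illustrates that shape using only REXML, so it runs without Rails, MongoDB, or network access. It is a simplified stand-in, not how `Hash.from_xml` is implemented: `element_to_hash` is a hypothetical helper, it ignores attributes, and `SAMPLE_XML` is made-up data rather than real Wikipedia export output.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of ActiveSupport): recursively turn an XML
# element into nested hashes, roughly in the spirit of Hash.from_xml.
# Attributes are ignored to keep the sketch short.
def element_to_hash(element)
  children = element.elements.to_a
  # Leaf element: keep its text content as a string.
  return element.text.to_s.strip if children.empty?

  children.each_with_object({}) do |child, hash|
    value = element_to_hash(child)
    if hash.key?(child.name)
      # Repeated sibling names are collected into an array.
      existing = hash[child.name]
      hash[child.name] = existing.is_a?(Array) ? existing << value : [existing, value]
    else
      hash[child.name] = value
    end
  end
end

# Made-up sample that mimics the outer structure of an export document.
SAMPLE_XML = <<~XML
  <mediawiki>
    <page>
      <title>Example page</title>
      <revision>
        <id>1</id>
        <text>Example wikitext</text>
      </revision>
    </page>
  </mediawiki>
XML

root = REXML::Document.new(SAMPLE_XML).root
document = { root.name => element_to_hash(root) }

# Same lookup as in the script above: descend by element name.
page = document["mediawiki"]["page"]
puts page["title"]
puts page["revision"]["text"]
```

Because the result is plain nested hashes, arrays, and strings, it can be handed to a document store as-is, which is what the `pages.insert` call relies on.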