@brenes
Created September 8, 2011 13:30
Removing infoboxes from Wikipedia pages with a regexp
# Source: http://stackoverflow.com/questions/6331065/matching-balanced-parenthesis-in-ruby-using-recursive-regular-expressions-like-pe
require 'wikipedia'
# `title` must already hold the name of the article to fetch
wikipage = Wikipedia.find(title)
# Naive version: breaks on templates that contain a '}' before their closing '}}', such as nested {{...}} templates
# (e.g. http://en.wikipedia.org/wiki/The_Royal_Anthem_of_Jordan), because [^\}]* stops at the first '}'
no_infoboxes = wikipage.content.gsub(/\{\{[^\}]*\}\}/, "")
# Recursive version: \g<re> re-enters the named group <re>, so nested {{...}} structures are matched as a whole.
# Note that it removes every {{...}} template on the page, not only infoboxes.
no_infoboxes = wikipage.content.gsub(%r{(?<re>\{\{(?:(?> [^\{\}]+ )|\g<re>)*\}\})}x, "")
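
The difference between the two expressions can be checked without fetching a page. The following is a minimal sketch, assuming a made-up wikitext sample; `SAMPLE`, `SIMPLE`, `RECURSIVE` and `strip_templates` are illustrative names that are not part of the original gist, and the wikipedia gem is not needed here.

```ruby
# Hypothetical wikitext: an infobox that contains a nested {{lang|...}} template.
SAMPLE = "{{Infobox anthem | title = {{lang|ar|Al-salam}} | adopted = 1946}}Intro text."

# Naive pattern from the gist: [^\}]* cannot cross the inner '}}'.
SIMPLE = /\{\{[^\}]*\}\}/

# Recursive pattern from the gist: \g<re> re-enters the named group <re>.
RECURSIVE = /(?<re>\{\{(?:(?> [^\{\}]+ )|\g<re>)*\}\})/x

# Remove every match of `pattern` from `text`.
def strip_templates(text, pattern)
  text.gsub(pattern, "")
end

simple_result = strip_templates(SAMPLE, SIMPLE)
recursive_result = strip_templates(SAMPLE, RECURSIVE)

puts simple_result.inspect     # " | adopted = 1946}}Intro text."
puts recursive_result.inspect  # "Intro text."
```

The naive pattern stops at the `}}` that closes the inner template and leaves the tail of the infobox behind, while the recursive one consumes the whole balanced structure. Neither pattern matches a template body that contains a lone `{` or `}`, so that case would need separate handling.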