Created February 27, 2016 18:15
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"# XML Processing Example: Extract link count changes\n",
"This notebook details how to use the [mwxml]( python library to efficiently process an entire Wikipedia-sized historical XML dump. In this example, we'll extract image link-count change events from the history of Dutch Wikipedia. "
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"import mwxml"
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Gather the paths to all of the dump files\n",
"On tool labs, the XML dumps are available in `/public/dumps/public/`. We're going to use python's `glob` library to get the paths of the Dutch Wikipedia dump (December 02, 2015) that contains the text of all revisions."
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
"outputs": [
"data": {
"text/plain": [
" '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history2.xml.bz2',\n",
" '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history3.xml.bz2',\n",
" '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history1.xml.bz2']"
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
"source": [
"import glob\n",
"paths = glob.glob('/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history*.xml*.bz2')\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 2: Define the image link extractor.\n",
"Here we're using a regular expression to extract image links from the revision text of articles. Nothing fancy here."
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"import re\n",
"EXTS = [\"png\", \"gif\", \"jpg\", \"jpeg\"]\n",
"# [[(file|image):<file>.<ext>]]\n",
"IMAGE_LINK_RE = re.compile(r\"\\[\\[\" + \n",
" r\"(file|image|afbeelding|bestand):\" + # Group 1\n",
" r\"([^\\]]+.(\" + \"|\".join(EXTS) + r\"))\" + # Group 2 & 3\n",
" r\"(|[^\\]]+)?\" + # Group 4\n",
" r\"\\]\\]\")\n",
"def extract_image_links(text):\n",
" for m in IMAGE_LINK_RE.finditer(text):\n",
" yield\n"
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Run the XML dump processor on the paths\n",
"This is the part that `mwxml` can help you do easily. You need to define a `process_dump` function that takes two arguements: dump : *[mwxml.Dump](* and a path : *str* \n",
"In the example, below, we iterate through the pages in the dump, and keep track of how many image links we saw in the last revision with `last_count`. If the `delta` isn't `0`, we yield some values. It's very important that the process_dump function either yields something or returns an iterable. We'll explain why in a moment. "
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"def process_dump(dump, path):\n",
" for page in dump:\n",
" last_count = 0\n",
" for revision in page:\n",
" image_links = list(extract_image_links(revision.text or \"\"))\n",
" delta = len(image_links) - last_count\n",
" if delta != 0:\n",
" yield, revision.timestamp, delta\n",
" last_count = len(image_links)"
"cell_type": "markdown",
"metadata": {},
"source": [
"OK. Now that everything is defined, it's time to run the code. `mwxml` has a [`map()`]( function that applied the `process_dump` function each of the XML dump file in `paths` -- ***in parallel*** -- using python's `multiprocessing` library and collects all of the *yield*ed values in a generator. As the code below demonstrates, it's easy to collect this output and write it to a new output file or print it out to the console (not recommended for large amounts of output)."
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
"outputs": [
"name": "stdout",
"output_type": "stream",
"text": [
"source": [
"count = 0\n",
"for rev_id, rev_timestamp, delta in, paths):\n",
" print(\"\\t\".join(str(v) for v in [rev_id, rev_timestamp, delta]))\n",
" count += 1\n",
" if count > 15:\n",
" break"
"cell_type": "markdown",
"metadata": {
"collapsed": true
"source": [
"## Conclusion\n",
"That's it! And we only wrote ~25 lines of code. "
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": []
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.2"
"nbformat": 4,
"nbformat_minor": 0