Created
February 27, 2016 18:15
-
-
Save halfak/e1ff31e48aaa69e3bd7d to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# XML Processing Example: Extract link count changes\n", | |
"\n", | |
"This notebook details how to use the [mwxml](https://pythonhosted.org/mwxml) python library to efficiently process an entire Wikipedia-sized historical XML dump. In this example, we'll extract image link-count change events from the history of Dutch Wikipedia. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"import mwxml" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Step 1: Gather the paths to all of the dump files\n", | |
"On tool labs, the XML dumps are available in `/public/dumps/public/`. We're going to use python's `glob` library to get the paths of the Dutch Wikipedia dump (December 02, 2015) that contains the text of all revisions." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"['/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history4.xml.bz2',\n", | |
" '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history2.xml.bz2',\n", | |
" '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history3.xml.bz2',\n", | |
" '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history1.xml.bz2']" | |
] | |
}, | |
"execution_count": 2, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import glob\n", | |
"\n", | |
"paths = glob.glob('/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history*.xml*.bz2')\n", | |
"paths" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Step 2: Define the image link extractor.\n", | |
"Here we're using a regular expression to extract image links from the revision text of articles. Nothing fancy here." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"import re\n", | |
"\n", | |
"EXTS = [\"png\", \"gif\", \"jpg\", \"jpeg\"]\n", | |
"# [[(file|image):<file>.<ext>]]\n", | |
"IMAGE_LINK_RE = re.compile(r\"\\[\\[\" + \n", | |
" r\"(file|image|afbeelding|bestand):\" + # Group 1\n", | |
" r\"([^\\]]+.(\" + \"|\".join(EXTS) + r\"))\" + # Group 2 & 3\n", | |
" r\"(|[^\\]]+)?\" + # Group 4\n", | |
" r\"\\]\\]\")\n", | |
"\n", | |
"def extract_image_links(text):\n", | |
" for m in IMAGE_LINK_RE.finditer(text):\n", | |
" yield m.group(2)\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Step 3: Run the XML dump processor on the paths\n", | |
"This is the part that `mwxml` can help you do easily. You need to define a `process_dump` function that takes two arguements: dump : *[mwxml.Dump](http://pythonhosted.org/mwxml/iteration.html#mwxml.Dump)* and a path : *str* \n", | |
"\n", | |
"In the example, below, we iterate through the pages in the dump, and keep track of how many image links we saw in the last revision with `last_count`. If the `delta` isn't `0`, we yield some values. It's very important that the process_dump function either yields something or returns an iterable. We'll explain why in a moment. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def process_dump(dump, path):\n", | |
" for page in dump:\n", | |
" last_count = 0\n", | |
" for revision in page:\n", | |
" image_links = list(extract_image_links(revision.text or \"\"))\n", | |
" delta = len(image_links) - last_count\n", | |
" if delta != 0:\n", | |
" yield revision.id, revision.timestamp, delta\n", | |
" last_count = len(image_links)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"OK. Now that everything is defined, it's time to run the code. `mwxml` has a [`map()`](http://pythonhosted.org/mwxml/map.html#mwxml.map) function that applied the `process_dump` function each of the XML dump file in `paths` -- ***in parallel*** -- using python's `multiprocessing` library and collects all of the *yield*ed values in a generator. As the code below demonstrates, it's easy to collect this output and write it to a new output file or print it out to the console (not recommended for large amounts of output)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"5968207\t2006-11-30T21:34:17Z\t3\n", | |
"8992798\t2007-08-19T03:19:00Z\t-1\n", | |
"8996924\t2007-08-19T12:38:26Z\t1\n", | |
"9000899\t2007-08-19T16:05:50Z\t-3\n", | |
"5969056\t2006-11-30T22:08:43Z\t1\n", | |
"14712696\t2008-11-29T21:22:49Z\t-1\n", | |
"3110580\t2006-02-09T11:33:12Z\t27\n", | |
"8336191\t2007-06-14T21:10:42Z\t-27\n", | |
"16705457\t2009-05-06T18:33:02Z\t1\n", | |
"16750330\t2009-05-09T11:59:22Z\t1\n", | |
"16785196\t2009-05-11T21:13:18Z\t-1\n", | |
"16705928\t2009-05-06T19:04:06Z\t1\n", | |
"16738884\t2009-05-08T18:06:57Z\t-1\n", | |
"27871\t2003-01-12T22:49:21Z\t1\n", | |
"1161182\t2005-05-25T11:45:28Z\t-1\n", | |
"1162073\t2005-05-25T11:46:50Z\t1\n" | |
] | |
} | |
], | |
"source": [ | |
"count = 0\n", | |
"for rev_id, rev_timestamp, delta in mwxml.map(process_dump, paths):\n", | |
" print(\"\\t\".join(str(v) for v in [rev_id, rev_timestamp, delta]))\n", | |
" count += 1\n", | |
" if count > 15:\n", | |
" break" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": [ | |
"## Conclusion\n", | |
"That's it! And we only wrote ~25 lines of code. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.4.2" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment