Skip to content

Instantly share code, notes, and snippets.

@halfak
Created February 27, 2016 18:15
Show Gist options
  • Save halfak/e1ff31e48aaa69e3bd7d to your computer and use it in GitHub Desktop.
Save halfak/e1ff31e48aaa69e3bd7d to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# XML Processing Example: Extract link count changes\n",
"\n",
"This notebook details how to use the [mwxml](https://pythonhosted.org/mwxml) python library to efficiently process an entire Wikipedia-sized historical XML dump. In this example, we'll extract image link-count change events from the history of Dutch Wikipedia. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import mwxml"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Gather the paths to all of the dump files\n",
"On tool labs, the XML dumps are available in `/public/dumps/public/`. We're going to use python's `glob` library to get the paths of the Dutch Wikipedia dump (December 02, 2015) that contains the text of all revisions."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history4.xml.bz2',\n",
" '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history2.xml.bz2',\n",
" '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history3.xml.bz2',\n",
" '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history1.xml.bz2']"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import glob\n",
"\n",
"paths = glob.glob('/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history*.xml*.bz2')\n",
"paths"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 2: Define the image link extractor.\n",
"Here we're using a regular expression to extract image links from the revision text of articles. Nothing fancy here."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import re\n",
"\n",
"EXTS = [\"png\", \"gif\", \"jpg\", \"jpeg\"]\n",
"# [[(file|image):<file>.<ext>]]\n",
"IMAGE_LINK_RE = re.compile(r\"\\[\\[\" + \n",
" r\"(file|image|afbeelding|bestand):\" + # Group 1\n",
" r\"([^\\]]+.(\" + \"|\".join(EXTS) + r\"))\" + # Group 2 & 3\n",
" r\"(|[^\\]]+)?\" + # Group 4\n",
" r\"\\]\\]\")\n",
"\n",
"def extract_image_links(text):\n",
" for m in IMAGE_LINK_RE.finditer(text):\n",
" yield m.group(2)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Run the XML dump processor on the paths\n",
"This is the part that `mwxml` can help you do easily. You need to define a `process_dump` function that takes two arguements: dump : *[mwxml.Dump](http://pythonhosted.org/mwxml/iteration.html#mwxml.Dump)* and a path : *str* \n",
"\n",
"In the example, below, we iterate through the pages in the dump, and keep track of how many image links we saw in the last revision with `last_count`. If the `delta` isn't `0`, we yield some values. It's very important that the process_dump function either yields something or returns an iterable. We'll explain why in a moment. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def process_dump(dump, path):\n",
" for page in dump:\n",
" last_count = 0\n",
" for revision in page:\n",
" image_links = list(extract_image_links(revision.text or \"\"))\n",
" delta = len(image_links) - last_count\n",
" if delta != 0:\n",
" yield revision.id, revision.timestamp, delta\n",
" last_count = len(image_links)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK. Now that everything is defined, it's time to run the code. `mwxml` has a [`map()`](http://pythonhosted.org/mwxml/map.html#mwxml.map) function that applied the `process_dump` function each of the XML dump file in `paths` -- ***in parallel*** -- using python's `multiprocessing` library and collects all of the *yield*ed values in a generator. As the code below demonstrates, it's easy to collect this output and write it to a new output file or print it out to the console (not recommended for large amounts of output)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5968207\t2006-11-30T21:34:17Z\t3\n",
"8992798\t2007-08-19T03:19:00Z\t-1\n",
"8996924\t2007-08-19T12:38:26Z\t1\n",
"9000899\t2007-08-19T16:05:50Z\t-3\n",
"5969056\t2006-11-30T22:08:43Z\t1\n",
"14712696\t2008-11-29T21:22:49Z\t-1\n",
"3110580\t2006-02-09T11:33:12Z\t27\n",
"8336191\t2007-06-14T21:10:42Z\t-27\n",
"16705457\t2009-05-06T18:33:02Z\t1\n",
"16750330\t2009-05-09T11:59:22Z\t1\n",
"16785196\t2009-05-11T21:13:18Z\t-1\n",
"16705928\t2009-05-06T19:04:06Z\t1\n",
"16738884\t2009-05-08T18:06:57Z\t-1\n",
"27871\t2003-01-12T22:49:21Z\t1\n",
"1161182\t2005-05-25T11:45:28Z\t-1\n",
"1162073\t2005-05-25T11:46:50Z\t1\n"
]
}
],
"source": [
"count = 0\n",
"for rev_id, rev_timestamp, delta in mwxml.map(process_dump, paths):\n",
" print(\"\\t\".join(str(v) for v in [rev_id, rev_timestamp, delta]))\n",
" count += 1\n",
" if count > 15:\n",
" break"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## Conclusion\n",
"That's it! And we only wrote ~25 lines of code. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment