halfak/mwxml.py.ipynb

## mwxml.py.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# XML Processing Example: Extract link count changes\n",
    "\n",
    "This notebook details how to use the [mwxml](https://pythonhosted.org/mwxml) python library to efficiently process an entire Wikipedia-sized historical XML dump.  In this example, we'll extract image link-count change events from the history of Dutch Wikipedia.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import mwxml"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Gather the paths to all of the dump files\n",
    "On tool labs, the XML dumps are available in `/public/dumps/public/`.  We're going to use python's `glob` library to get the paths of the Dutch Wikipedia dump (December 02, 2015) that contains the text of all revisions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history4.xml.bz2',\n",
       " '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history2.xml.bz2',\n",
       " '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history3.xml.bz2',\n",
       " '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history1.xml.bz2']"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import glob\n",
    "\n",
    "paths = glob.glob('/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history*.xml*.bz2')\n",
    "paths"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 2: Define the image link extractor.\n",
    "Here we're using a regular expression to extract image links from the revision text of articles. Nothing fancy here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "EXTS = [\"png\", \"gif\", \"jpg\", \"jpeg\"]\n",
    "# [[(file|image):<file>.<ext>]]\n",
    "IMAGE_LINK_RE = re.compile(r\"\\[\\[\" + \n",
    "                           r\"(file|image|afbeelding|bestand):\" +  # Group 1\n",
    "                           r\"([^\\]]+.(\" + \"|\".join(EXTS) + r\"))\" +  # Group 2 & 3\n",
    "                           r\"(|[^\\]]+)?\" +  # Group 4\n",
    "                           r\"\\]\\]\")\n",
    "\n",
    "def extract_image_links(text):\n",
    "  for m in IMAGE_LINK_RE.finditer(text):\n",
    "    yield m.group(2)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Run the XML dump processor on the paths\n",
    "This is the part that `mwxml` can help you do easily.  You need to define a `process_dump` function that takes two arguements: dump : *[mwxml.Dump](http://pythonhosted.org/mwxml/iteration.html#mwxml.Dump)* and a path : *str* \n",
    "\n",
    "In the example, below, we iterate through the pages in the dump, and keep track of how many image links we saw in the last revision with `last_count`.  If the `delta` isn't `0`, we yield some values.  It's very important that the process_dump function either yields something or returns an iterable.  We'll explain why in a moment. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def process_dump(dump, path):\n",
    "  for page in dump:\n",
    "    last_count = 0\n",
    "    for revision in page:\n",
    "      image_links = list(extract_image_links(revision.text or \"\"))\n",
    "      delta = len(image_links) - last_count\n",
    "      if delta != 0:\n",
    "        yield revision.id, revision.timestamp, delta\n",
    "      last_count = len(image_links)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK.  Now that everything is defined, it's time to run the code.  `mwxml` has a [`map()`](http://pythonhosted.org/mwxml/map.html#mwxml.map) function that applied the `process_dump` function each of the XML dump file in `paths` -- ***in parallel*** -- using python's `multiprocessing` library and collects all of the *yield*ed values in a generator.  As the code below demonstrates, it's easy to collect this output and write it to a new output file or print it out to the console (not recommended for large amounts of output)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5968207\t2006-11-30T21:34:17Z\t3\n",
      "8992798\t2007-08-19T03:19:00Z\t-1\n",
      "8996924\t2007-08-19T12:38:26Z\t1\n",
      "9000899\t2007-08-19T16:05:50Z\t-3\n",
      "5969056\t2006-11-30T22:08:43Z\t1\n",
      "14712696\t2008-11-29T21:22:49Z\t-1\n",
      "3110580\t2006-02-09T11:33:12Z\t27\n",
      "8336191\t2007-06-14T21:10:42Z\t-27\n",
      "16705457\t2009-05-06T18:33:02Z\t1\n",
      "16750330\t2009-05-09T11:59:22Z\t1\n",
      "16785196\t2009-05-11T21:13:18Z\t-1\n",
      "16705928\t2009-05-06T19:04:06Z\t1\n",
      "16738884\t2009-05-08T18:06:57Z\t-1\n",
      "27871\t2003-01-12T22:49:21Z\t1\n",
      "1161182\t2005-05-25T11:45:28Z\t-1\n",
      "1162073\t2005-05-25T11:46:50Z\t1\n"
     ]
    }
   ],
   "source": [
    "count = 0\n",
    "for rev_id, rev_timestamp, delta in mwxml.map(process_dump, paths):\n",
    "    print(\"\\t\".join(str(v) for v in [rev_id, rev_timestamp, delta]))\n",
    "    count += 1\n",
    "    if count > 15:\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Conclusion\n",
    "That's it!  And we only wrote ~25 lines of code.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# XML Processing Example: Extract link count changes\n",
	"\n",
	"This notebook details how to use the [mwxml](https://pythonhosted.org/mwxml) python library to efficiently process an entire Wikipedia-sized historical XML dump. In this example, we'll extract image link-count change events from the history of Dutch Wikipedia. "
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"import mwxml"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Step 1: Gather the paths to all of the dump files\n",
	"On tool labs, the XML dumps are available in `/public/dumps/public/`. We're going to use python's `glob` library to get the paths of the Dutch Wikipedia dump (December 02, 2015) that contains the text of all revisions."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {
	"collapsed": false
	},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"['/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history4.xml.bz2',\n",
	" '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history2.xml.bz2',\n",
	" '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history3.xml.bz2',\n",
	" '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history1.xml.bz2']"
	]
	},
	"execution_count": 2,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"import glob\n",
	"\n",
	"paths = glob.glob('/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history.xml.bz2')\n",
	"paths"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Step 2: Define the image link extractor.\n",
	"Here we're using a regular expression to extract image links from the revision text of articles. Nothing fancy here."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"import re\n",
	"\n",
	"EXTS = [\"png\", \"gif\", \"jpg\", \"jpeg\"]\n",
	"# [[(file\|image):<file>.<ext>]]\n",
	"IMAGE_LINK_RE = re.compile(r\"\\[\\[\" + \n",
	" r\"(file\|image\|afbeelding\|bestand):\" + # Group 1\n",
	" r\"([^\\]]+.(\" + \"\|\".join(EXTS) + r\"))\" + # Group 2 & 3\n",
	" r\"(\|[^\\]]+)?\" + # Group 4\n",
	" r\"\\]\\]\")\n",
	"\n",
	"def extract_image_links(text):\n",
	" for m in IMAGE_LINK_RE.finditer(text):\n",
	" yield m.group(2)\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Step 3: Run the XML dump processor on the paths\n",
	"This is the part that `mwxml` can help you do easily. You need to define a `process_dump` function that takes two arguements: dump : [mwxml.Dump](http://pythonhosted.org/mwxml/iteration.html#mwxml.Dump) and a path : str \n",
	"\n",
	"In the example, below, we iterate through the pages in the dump, and keep track of how many image links we saw in the last revision with `last_count`. If the `delta` isn't `0`, we yield some values. It's very important that the process_dump function either yields something or returns an iterable. We'll explain why in a moment. "
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"def process_dump(dump, path):\n",
	" for page in dump:\n",
	" last_count = 0\n",
	" for revision in page:\n",
	" image_links = list(extract_image_links(revision.text or \"\"))\n",
	" delta = len(image_links) - last_count\n",
	" if delta != 0:\n",
	" yield revision.id, revision.timestamp, delta\n",
	" last_count = len(image_links)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"OK. Now that everything is defined, it's time to run the code. `mwxml` has a [`map()`](http://pythonhosted.org/mwxml/map.html#mwxml.map) function that applied the `process_dump` function each of the XML dump file in `paths` -- *in parallel* -- using python's `multiprocessing` library and collects all of the yielded values in a generator. As the code below demonstrates, it's easy to collect this output and write it to a new output file or print it out to the console (not recommended for large amounts of output)."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"metadata": {
	"collapsed": false
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"5968207\t2006-11-30T21:34:17Z\t3\n",
	"8992798\t2007-08-19T03:19:00Z\t-1\n",
	"8996924\t2007-08-19T12:38:26Z\t1\n",
	"9000899\t2007-08-19T16:05:50Z\t-3\n",
	"5969056\t2006-11-30T22:08:43Z\t1\n",
	"14712696\t2008-11-29T21:22:49Z\t-1\n",
	"3110580\t2006-02-09T11:33:12Z\t27\n",
	"8336191\t2007-06-14T21:10:42Z\t-27\n",
	"16705457\t2009-05-06T18:33:02Z\t1\n",
	"16750330\t2009-05-09T11:59:22Z\t1\n",
	"16785196\t2009-05-11T21:13:18Z\t-1\n",
	"16705928\t2009-05-06T19:04:06Z\t1\n",
	"16738884\t2009-05-08T18:06:57Z\t-1\n",
	"27871\t2003-01-12T22:49:21Z\t1\n",
	"1161182\t2005-05-25T11:45:28Z\t-1\n",
	"1162073\t2005-05-25T11:46:50Z\t1\n"
	]
	}
	],
	"source": [
	"count = 0\n",
	"for rev_id, rev_timestamp, delta in mwxml.map(process_dump, paths):\n",
	" print(\"\\t\".join(str(v) for v in [rev_id, rev_timestamp, delta]))\n",
	" count += 1\n",
	" if count > 15:\n",
	" break"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"collapsed": true
	},
	"source": [
	"## Conclusion\n",
	"That's it! And we only wrote ~25 lines of code. "
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": []
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.4.2"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 0
	}