sebastien/DeathRow-Scraping.ipynb

## DeathRow-Scraping.ipynb
{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Death Row / Last Words"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We're going to scrape and extract the data from the Texas database of executed offenders, available here http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html. We'll be using the [wwwclient library](https://github.com/sebastien/wwwclient) to scrape the website, so make sure it is installed (which is the case if your using the FF-Kit)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from wwwclient import *"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We now want to scrape the main page and return a list of of the inmates, which are represented as a table. Let's start by retrieveing the HTML."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "s = Session()\n",
      "html = s.get(\"http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html\").data()\n",
      "print html[0:1000]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\r\n",
        "<html lang=\"en-US\" xml:lang=\"en-US\" xmlns=\"http://www.w3.org/1999/xhtml\"><!-- InstanceBegin template=\"/Templates/sub_template.dwt\" codeOutsideHTMLIsLocked=\"false\" -->\r\n",
        "<head>\r\n",
        "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\" />\r\n",
        "\r\n",
        "<link href=\"../stylesheets/sub_css.css\" rel=\"stylesheet\" type=\"text/css\" />\r\n",
        "<meta http-equiv=\"Content-Style-Type\" content=\"text/css\" />\r\n",
        "\r\n",
        "<!-- InstanceBeginEditable name=\"DCMetaTags\" -->\r\n",
        "<meta name=\"DC.Title\" content=\"Texas Department of Criminal Justice\" /> \r\n",
        "<meta name=\"DC.Creator\" content=\"Texas Department of Criminal Justice\" /> \r\n",
        "<meta name=\"DC.Date\" content=\"20000302\" /> \r\n",
        "<meta name=\"DC.Format.MIME\" content=\"text/html\" />\r\n",
        "<meta name=\"DC.Format.SysReq\" content=\"Internet browser\" /> \r\n",
        "<meta name=\"DC.Identifier\" content=\"http://www.tdcj.state.tx.us/\" /> \r\n",
        "<meta name=\"DC.Subject\" content=\"criminal justice, ad\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Scraping the index"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We've now retrieved the main page as HTML, but what we'd like to do is to find the main `table` node and iterate on the rows. The `wwwclient` module offers a powerful scraping module, accessible through the `HTML` object (imported from `wwwclient.scrape`)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "tree = HTML.tree(html)\n",
      "print str(tree)[0:1000]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "#root\n",
        "   <#text:'<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\\r\\n'\n",
        "   <html[#0@1](lang=en-US,xml:lang=en-US,xmlns=http://www.w3.org/1999/xhtml)\n",
        "       <#text:'<!-- InstanceBegin template=\"/Templates/sub_template.dwt\" codeOutsideHTMLIsLocked=\"false\" -->\\r\\n'\n",
        "       <head[#1@2]\n",
        "           <#text:'\\r\\n'\n",
        "           <meta[#2@3](content=text/html; charset=ISO-8859-1,http-equiv=Content-Type)\n",
        "           <#text:'\\r\\n\\r\\n'\n",
        "           <link[#3@3](href=../stylesheets/sub_css.css,type=text/css,rel=stylesheet)\n",
        "           <#text:'\\r\\n'\n",
        "           <meta[#4@3](content=text/css,http-equiv=Content-Style-Type)\n",
        "           <#text:'\\r\\n\\r\\n<!-- InstanceBeginEditable name=\"DCMetaTags\" -->\\r\\n'\n",
        "           <meta[#5@3](content=Texas Department of Criminal Justice,name=DC.Title)\n",
        "           <#text:' \\r\\n'\n",
        "           <meta[#6@3](content=Texas Department of Criminal Justice,name=DC.Creator)\n",
        "           <#text:' \\r\\n'\n",
        "           <meta[#7@3](c\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The tree can be queried using CSS-like expressions, which is what we're going to do"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "nodes = tree.query(\"table\")\n",
      "assert len(nodes) == 1, \"We expect to have only one table node in the data, got {0}\".format(_)\n",
      "table = nodes[0]\n",
      "print str(table)[0:1000]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "table[#60@5](width=100%,class=os,summary=This table provides a summary of executed offenders with links to their last statements.)\n",
        "   <#text:'\\r\\n  '\n",
        "   <tbody[#61@6]\n",
        "       <#text:'\\r\\n  '\n",
        "       <tr[#62@7]\n",
        "           <#text:'\\r\\n    '\n",
        "           <th[#63@8](scope=col)\n",
        "               <#text:'Execution'\n",
        "           <#text:'\\r\\n    '\n",
        "           <th[#64@8](scope=col)\n",
        "               <#text:'Link'\n",
        "           <#text:'\\r\\n    '\n",
        "           <th[#65@8](scope=col)\n",
        "               <#text:'Link'\n",
        "           <#text:'\\r\\n    '\n",
        "           <th[#66@8](scope=col)\n",
        "               <#text:'Last Name'\n",
        "           <#text:'\\r\\n    '\n",
        "           <th[#67@8](scope=col)\n",
        "               <#text:'First Name'\n",
        "           <#text:'\\r\\n    '\n",
        "           <th[#68@8](scope=col)\n",
        "               <#text:'TDCJ Number'\n",
        "           <#text:'\\r\\n    '\n",
        "           <th[#69@8](scope=col)\n",
        "               <#text:'Age'\n",
        "           <#text:'\\r\\n    '\n",
        "           <th[#70@8](scope=col)\n",
        "               <#text:'Date'\n",
        "           <#text:'\\r\\n    '\n",
        "           <t\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we'd like to get the all the rows, the first one will give us the headers and the other ones will be the data. We're going to write a little script that iterates through the rows and folds them as `dict`s."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# We'll scrape the values and store them in the rows\n",
      "headers = []\n",
      "rows    = []\n",
      "for i, row in enumerate(table.query(\"tr\")):\n",
      "    if i == 0:\n",
      "        for th in row.query(\"th\"):\n",
      "            # We extract the text from the header row and convert it to camelCase\n",
      "            header = \"\".join([w.capitalize() for w in th.text().split()])\n",
      "            header = header[0].lower() + header[1:]\n",
      "            headers.append(header)\n",
      "    else:\n",
      "        # Otherwise we fold the row into a dictionary\n",
      "        rows.append(dict((headers[i],td.text()) for i,td in enumerate(row.query(\"td\"))))\n",
      "        \n",
      "# And now we do some pretty-printing so that we can see the result\n",
      "print \"Headers: {0}\".format(\", \".join(headers))\n",
      "print \"Rows   : {0}\".format(len(rows))\n",
      "print \"-\" * 80\n",
      "print (\"\\n\" + \"-\" * 80 + \"\\n\").join(map(str,rows[0:10]))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Headers: execution, link, link, lastName, firstName, tdcjNumber, age, date, race, county\n",
        "Rows   : 515\n",
        "--------------------------------------------------------------------------------\n",
        "{u'firstName': u'Jose', u'lastName': u'Villegas', u'age': u'39', u'county': u'Nueces', u'race': u'Hispanic', u'link': u'Last Statement ', u'date': u'04/16/2014', u'execution': u'515', u'tdcjNumber': u'999417'}\n",
        "--------------------------------------------------------------------------------\n",
        "{u'firstName': u'Ramiro', u'lastName': u'Hernandez', u'age': u'44', u'county': u'Kerr', u'race': u'Hispanic', u'link': u'Last Statement ', u'date': u'04/09/2014', u'execution': u'514', u'tdcjNumber': u'999342'}\n",
        "--------------------------------------------------------------------------------\n",
        "{u'firstName': u'Tommy', u'lastName': u'Sells', u'age': u'49', u'county': u'ValVerde', u'race': u'White', u'link': u'Last Statement ', u'date': u'04/03/2014', u'execution': u'513', u'tdcjNumber': u'999367'}\n",
        "--------------------------------------------------------------------------------\n",
        "{u'firstName': u'Anthony', u'lastName': u'Doyle', u'age': u'29', u'county': u'Dallas', u'race': u'Black', u'link': u'Last Statement ', u'date': u'03/27/2014', u'execution': u'512', u'tdcjNumber': u'999478'}\n",
        "--------------------------------------------------------------------------------\n",
        "{u'firstName': u'Ray', u'lastName': u'Jasper', u'age': u'33', u'county': u'Bexar', u'race': u'Black', u'link': u'Last Statement', u'date': u'03/19/2014', u'execution': u'511', u'tdcjNumber': u'999341'}\n",
        "--------------------------------------------------------------------------------\n",
        "{u'firstName': u'Suzanne', u'lastName': u'Basso', u'age': u'59', u'county': u'Harris', u'race': u'White', u'link': u'Last Statement', u'date': u'02/05/2014', u'execution': u'510', u'tdcjNumber': u'999329'}\n",
        "--------------------------------------------------------------------------------\n",
        "{u'firstName': u'Edgar', u'lastName': u'Tamayo', u'age': u'46', u'county': u'Harris', u'race': u'Hispanic', u'link': u'Last Statement', u'date': u'01/22/2014', u'execution': u'509', u'tdcjNumber': u'999130'}\n",
        "--------------------------------------------------------------------------------\n",
        "{u'firstName': u'Jerry', u'lastName': u'Martin', u'age': u'43', u'county': u'Leon', u'race': u'White', u'link': u'Last Statement', u'date': u'12/03/2013', u'execution': u'508', u'tdcjNumber': u'999552'}\n",
        "--------------------------------------------------------------------------------\n",
        "{u'firstName': u'Jamie', u'lastName': u'McCoskey', u'age': u'49', u'county': u'Harris', u'race': u'White', u'link': u'Last Statement', u'date': u'11/12/2013', u'execution': u'507', u'tdcjNumber': u'999053'}\n",
        "--------------------------------------------------------------------------------\n",
        "{u'firstName': u'Michael', u'lastName': u'Yowell', u'age': u'43', u'county': u'Lubbock', u'race': u'White ', u'link': u'Last Statement', u'date': u'10/09/2013', u'execution': u'506', u'tdcjNumber': u'999334'}\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If you take a closer look at each row, you'll notive that the numbers are encoded as strings, as well as the dates. We'll process the rows so that the dates are expressed as `(YYYY,MM,DD)` tuples and numbers as proper numbers."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def parse_date(v):\n",
      "    \"\"\"Parses a date in DD/MM/YYYY and returns a (YYYY, MM, DD) triple\"\"\"\n",
      "    d,m,y = map(int, v.split(\"/\"))\n",
      "    return (y, m, d)\n",
      "\n",
      "def normalizer( rules ):\n",
      "    \"\"\"Returns a functor that will return a noramlized copy of the given\n",
      "    object according to the given rules dict.\"\"\"\n",
      "    def f(v):\n",
      "        v = dict(v.items())\n",
      "        for k in rules:\n",
      "            if k in v:\n",
      "                v[k] = rules[k](v[k])\n",
      "        return v\n",
      "    return f\n",
      "\n",
      "# The above seems a bit technical, but it's going to be useful and\n",
      "# will simplify a lot of the work we're going to do later on\n",
      "normalize_inmate = normalizer(dict(\n",
      "    date       = parse_date,\n",
      "    age        = int,\n",
      "    execution  = int,\n",
      "    tdcjNumber = int,\n",
      "))\n",
      "\n",
      "inmates = [normalize_inmate(i) for i in rows]\n",
      "# We print an element from inmate to make sure that we processed everything properly, which is \n",
      "print inmates[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "{u'firstName': u'Jose', u'lastName': u'Villegas', u'age': 39, u'county': u'Nueces', u'race': u'Hispanic', u'link': u'Last Statement ', u'date': (2014, 16, 4), u'execution': 515, u'tdcjNumber': 999417}\n"
       ]
      }
     ],
     "prompt_number": 29
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If you're attentive, you'll have noticed that the `link` attribute always has the `Last Statement` value. Indeed, we just asked for the text, not for the `href` attribute of the `a` tag that is contained in this specific column. If you have a look at the [source page](http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html) you'll see that there are actually two links, one for the offender information, one for the last statement. Let's start by extracting the URLs from the links."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "offender_information = []\n",
      "last_statements      = []\n",
      "for tr in table.query(\"tr\")[1:]:\n",
      "    tds = tr.query(\"td\")\n",
      "    offender_information.append(tds[1].query(\"a\")[0].get(\"href\"))\n",
      "    last_statements.append(     tds[2].query(\"a\")[0].get(\"href\"))\n",
      "# We print the result to see what the URLs look like\n",
      "print offender_information[0]\n",
      "print last_statements[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "dr_info/villegasjose.html\n",
        "dr_info/villegasjoselast.html\n"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If you've browsed the different pages, you'll see that the offender information is sometimes available as an image, sometimes available as an HTML file."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "images = [_ for _ in offender_information if not _.endswith(\".html\")]\n",
      "print \"\\n\".join(images[0:10])\n",
      "print \"...\\n{0} offenders without HTML information\".format(len(images))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "dr_info/tamayoedgar.jpg\n",
        "dr_info/mccoskeyjamie.jpg\n",
        "dr_info/lewisricky.jpg\n",
        "dr_info/bluecarl.jpg\n",
        "dr_info/hughesprestoni.jpg\n",
        "dr_info/hinesbobby.jpg\n",
        "dr_info/wilsonmarvin.jpg\n",
        "dr_info/lealhumberto.jpg\n",
        "dr_info/bradfordgayland.jpg\n",
        "dr_info/cantup.jpg\n",
        "...\n",
        "372 offenders without HTML information\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So we've got more than 60% of offenders who have an information page that is an image (which we will have to try to parse using OCR if we ever want to manipulate the data). Let's focus on the `last_statements` pages instead."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print len([_ for _ in last_statements if not _.endswith(\".html\")])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "0\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "The last statements"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "OK, so we know that there's no last statement that is not in HTML, let's have a look at what [the page](http://www.tdcj.state.tx.us/death_row/dr_info/villegasjoselast.html) looks like. If you look at the source code, you'll see that the information we want (date of execution, offender and last statements) are all in paragraphs, where the title paragraphs have the `text_bold` class. If you also have a closer look, you'll see that the only piece that we want is the last statement, as we already have the other information from the main page. So let's convert our `last_statements` array into an array of `{\"written\":\"<string>\",\"spoken\":\"<string>\"}` that mirror the data present in the page."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# The URL is relative to the page so we need to prefix it to work\n",
      "t = HTML.tree(s.get(\"death_row/\" + last_statements[0]))\n",
      "paragraphs = t.query(\"p\")\n",
      "print paragraphs"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[['<p id=\"h_p_or_nocss\">', 'Texas Department of Criminal Justice', '</p>'], ['<p>', '&nbsp;', '</p>'], ['<p class=\"text_bold\">', 'Date of Execution:', '</p>'], ['<p>', 'April 16, 2014  ', '</p>'], ['<p class=\"text_bold\">', 'Offender:', '</p>'], ['<p>', 'Jose Villegas, #999417', '</p>'], ['<p class=\"text_bold\">', 'Last Statement:', '</p>'], ['<p>', '(Written statement) ', '<br /', \"\\r\\n  I always said that if I even get to this point, I would have already said everything that needed to be said to all of those who I love and have been with me throughout this whole journey. Today, I realized that I can never say everything that needed to be said, because there is still so much that needs to be said. First of all, I love you. My children, my friends, and all my brothers who have shared this experience with me on the row and who continue to experience this without me, keep your heads up. I love all of you. Secondly, I am ok. I have peace in my heart and ready for the next journey. I'm really ok. Last but not least, to my true brother in life, Crazy J, I love you, man. You and Bella have been the best. I'm sorry I couldn't talk with you before all of this, but you know me...You are my bro. I love you. I'm ok. My babies, remember what I said. We'll be together soon. I love all of you. John 14:27. \", '</p>'], ['<p>', '(Spoken statement)', '<br /', '\\r\\nYes, I left a written statement. I do \\r\\nhave a verbal statement. I would like to remind my children once again, I love them. Crazy J, I forgot to write a list. Everything is ok. I love you all, and I love my children. I am at peace. John 14:27. I am done, Warden. ', '<br /', '\\r\\n', '</p>']]\n"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "What we want is the last two paragraphs, `(Written statement)` and `(Spoken statement)`. We actually don't care about the first three nodes of each paragraph, so we can skip them. Let's starts by find the greatest index of a `p.text_bold` paragraph"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "last = max([i for i,p in enumerate(paragraphs) if p.hasClass(\"text_bold\")])\n",
      "print last\n",
      "print paragraphs[last+1:]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "6\n",
        "[['<p>', '(Written statement) ', '<br /', \"\\r\\n  I always said that if I even get to this point, I would have already said everything that needed to be said to all of those who I love and have been with me throughout this whole journey. Today, I realized that I can never say everything that needed to be said, because there is still so much that needs to be said. First of all, I love you. My children, my friends, and all my brothers who have shared this experience with me on the row and who continue to experience this without me, keep your heads up. I love all of you. Secondly, I am ok. I have peace in my heart and ready for the next journey. I'm really ok. Last but not least, to my true brother in life, Crazy J, I love you, man. You and Bella have been the best. I'm sorry I couldn't talk with you before all of this, but you know me...You are my bro. I love you. I'm ok. My babies, remember what I said. We'll be together soon. I love all of you. John 14:27. \", '</p>'], ['<p>', '(Spoken statement)', '<br /', '\\r\\nYes, I left a written statement. I do \\r\\nhave a verbal statement. I would like to remind my children once again, I love them. Crazy J, I forgot to write a list. Everything is ok. I love you all, and I love my children. I am at peace. John 14:27. I am done, Warden. ', '<br /', '\\r\\n', '</p>']]\n"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We now have the pargraph we want, so let's extract the written and spoken statements:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print \"Written:\", \"\".join([_.text() for _ in paragraphs[last+1].children[2:]])\n",
      "print \"\"\n",
      "print \"Spoken:\",  \"\".join([_.text() for _ in paragraphs[last+2].children[2:]])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Written: \r\n",
        "  I always said that if I even get to this point, I would have already said everything that needed to be said to all of those who I love and have been with me throughout this whole journey. Today, I realized that I can never say everything that needed to be said, because there is still so much that needs to be said. First of all, I love you. My children, my friends, and all my brothers who have shared this experience with me on the row and who continue to experience this without me, keep your heads up. I love all of you. Secondly, I am ok. I have peace in my heart and ready for the next journey. I'm really ok. Last but not least, to my true brother in life, Crazy J, I love you, man. You and Bella have been the best. I'm sorry I couldn't talk with you before all of this, but you know me...You are my bro. I love you. I'm ok. My babies, remember what I said. We'll be together soon. I love all of you. John 14:27. \n",
        "\n",
        "Spoken: \r\n",
        "Yes, I left a written statement. I do \r\n",
        "have a verbal statement. I would like to remind my children once again, I love them. Crazy J, I forgot to write a list. Everything is ok. I love you all, and I love my children. I am at peace. John 14:27. I am done, Warden. \r\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We're now ready to scrape all pages"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "res = []\n",
      "for url in last_statements[0:3]:\n",
      "    # The URL is relative to the page so we need to prefix it to work\n",
      "    t = HTML.tree(s.get(\"death_row/\" + url))\n",
      "    paragraphs = t.query(\"p\")\n",
      "    last = max([i for i,p in enumerate(paragraphs) if p.hasClass(\"text_bold\")])\n",
      "    statements        = paragraphs[last+1:]\n",
      "    written_statement = paragraphs[last+1].children[2:] if len(statements) > 0 else []\n",
      "    spoken_statement  = paragraphs[last+2].children[2:] if len(statements) > 1 else []\n",
      "    written_statement = \"\".join([_.text() for _ in written_statement])\n",
      "    spoken_statement  = \"\".join([_.text() for _ in spoken_statement])\n",
      "    res.append(dict(\n",
      "        written=written_statement,\n",
      "        spoken=spoken_statement    \n",
      "    ))\n",
      "res_last_statements = res"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 13
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Offender Information"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "When we're lucky to have HTML offender information (such as here http://www.tdcj.state.tx.us/death_row/dr_info/villegasjose.html) we'll be able to retrieve a lot of information. By looking at the HTML, we have two main elements:\n",
      "\n",
      "- A table with a photo and a list of properties (birth, race, gender, etc)\n",
      "- A list of parapraphs where the header is a `span.text_bold`\n",
      "\n",
      "Let's see how we can scrape data from the above page"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "t = HTML.tree(s.get(\"death_row/dr_info/villegasjose.html\"))\n",
      "table = t.query(\"table.tabledata_deathrow_table\")\n",
      "p     = t.query(\"p\")\n",
      "print str(table[0])[0:500]\n",
      "print \"-\" * 80\n",
      "print p"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "table[#87@5](cellpadding=0,cellspacing=0,border=0,class=tabledata_deathrow_table)\n",
        "   <#text:'\\r\\n  '\n",
        "   <tr[#88@6]\n",
        "       <#text:'\\r\\n    '\n",
        "       <td[#89@7](rowspan=7,valign=top)\n",
        "           <img[#90@8](src=villegasjose2.jpg,alt=Picture of Offender,class=photo_border_black_right)\n",
        "       <#text:'\\r\\n    '\n",
        "       <td[#91@7](width=50%,class=tabledata_bold_align_right_deathrow,valign=top)\n",
        "           <#text:'Name'\n",
        "       <#text:'\\r\\n    '\n",
        "       <td[#92@7](width=50%,class=tabledata_align_left_deathro\n",
        "--------------------------------------------------------------------------------\n",
        "[['<p id=\"h_p_or_nocss\">', 'Texas Department of Criminal Justice', '</p>'], ['<p>', '<span class=\"text_bold\">', 'Prior Occupation', '</span>', '<br /', '\\r\\ncook,  dishwasher, laborer ', '</p>'], ['<p>', '<span class=\"text_bold\">', 'Prior Prison Record', '</span>', '<br /', '\\r\\n  N/A ', '</p>'], ['<p>', '<span class=\"text_bold\">', 'Summary of Incident', '</span>', '<br /', '\\r\\n  On  1/22/2001, Villegas fatally stabbed three victims. A 24 year old Hispanic  female was stabbed 32 times. Her 3 year old Hispanic male son was stabbed 19  times and her 51 year old Hispanic mother was stabbed 35 times. Villegas took  the television and a vehicle from the home. ', '</p>'], ['<p>', '<span class=\"text_bold\">', 'Co-Defendants', '</span>', '<br /', '\\r\\n  None ', '</p>'], ['<p>', '<span class=\"text_bold\">', 'Race and Gender of Victim', '</span>', '<br /', '\\r\\n  Hispanic  male and Hispanic female ', '</p>']]\n"
       ]
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Retrieving the image is good to be trivial"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "image_url = s.url(table[0].query(\"img\")[0].get(\"src\"))\n",
      "print image_url"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "http://www.tdcj.state.tx.us/death_row/dr_info/villegasjose2.jpg\n"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Easy enough, now let's see how we can work out the content of the table"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "[_.text() for _ in table[0].query(\"tr\")]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 18,
       "text": [
        "[u'\\r\\n    \\r\\n    Name\\r\\n    Villegas, Jr., Jose Luis \\r\\n  ',\n",
        " u'\\r\\n    TDCJ Number\\r\\n    999417 \\r\\n  ',\n",
        " u'\\r\\n    Date of Birth\\r\\n    04/14/1975 \\r\\n  ',\n",
        " u'\\r\\n    Date Received\\r\\n    05/21/2002\\r\\n  ',\n",
        " u'\\r\\n    Age (when    Received)\\r\\n    27 \\r\\n  ',\n",
        " u'\\r\\n    Education Level (Highest Grade Completed)\\r\\n    9 \\r\\n  ',\n",
        " u'\\r\\n    Date of Offense\\r\\n    01/22/2001\\r\\n  ',\n",
        " u'\\r\\n    &nbsp;\\r\\n    Age (at the time    of Offense)\\r\\n    27 \\r\\n  ',\n",
        " u'\\r\\n    &nbsp;\\r\\n    County\\r\\n    Nueces \\r\\n  ',\n",
        " u'\\r\\n    &nbsp;\\r\\n    Race\\r\\n    Hispanic \\r\\n  ',\n",
        " u'\\r\\n    &nbsp;\\r\\n    Gender\\r\\n    Male \\r\\n  ',\n",
        " u'\\r\\n    &nbsp;\\r\\n    Hair Color\\r\\n    Black \\r\\n  ',\n",
        " u'\\r\\n    &nbsp;\\r\\n    Height\\r\\n    5 ft 7 in \\r\\n  ',\n",
        " u'\\r\\n    &nbsp;\\r\\n    Weight\\r\\n    186 \\r\\n  ',\n",
        " u'\\r\\n    &nbsp;\\r\\n    Eye Color\\r\\n    Brown \\r\\n  ',\n",
        " u'\\r\\n    &nbsp;\\r\\n    Native County\\r\\n    Nueces \\r\\n  ',\n",
        " u'\\r\\n    &nbsp;\\r\\n    Native State\\r\\n    Texas \\r\\n  ']"
       ]
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "It turns out that this would be fairly easy, we just need to strip the `&nbsp;` and the spaces,\n",
      "split by `\\r\\n` and filter out the empty cells"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "fields = [[l.strip() for l in _.text().replace(\"&nbsp;\",\"\").split(\"\\r\\n\") if l.strip()] for _ in table[0].query(\"tr\")] ; fields"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 43,
       "text": [
        "[[u'Name', u'Villegas, Jr., Jose Luis'],\n",
        " [u'TDCJ Number', u'999417'],\n",
        " [u'Date of Birth', u'04/14/1975'],\n",
        " [u'Date Received', u'05/21/2002'],\n",
        " [u'Age (when    Received)', u'27'],\n",
        " [u'Education Level (Highest Grade Completed)', u'9'],\n",
        " [u'Date of Offense', u'01/22/2001'],\n",
        " [u'Age (at the time    of Offense)', u'27'],\n",
        " [u'County', u'Nueces'],\n",
        " [u'Race', u'Hispanic'],\n",
        " [u'Gender', u'Male'],\n",
        " [u'Hair Color', u'Black'],\n",
        " [u'Height', u'5 ft 7 in'],\n",
        " [u'Weight', u'186'],\n",
        " [u'Eye Color', u'Brown'],\n",
        " [u'Native County', u'Nueces'],\n",
        " [u'Native State', u'Texas']]"
       ]
      }
     ],
     "prompt_number": 43
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can now fold this in a dictionary"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "res    = {}\n",
      "for k, v in fields:\n",
      "    k = k.split(\"(\")[0] # We get rid of the parens, if any\n",
      "    k = \"\".join([w.lower() if i == 0 else w.capitalize() for i,w in enumerate(k.split())])\n",
      "    res[k] = v\n",
      "res"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 38,
       "text": [
        "{u'age': u'27',\n",
        " u'county': u'Nueces',\n",
        " u'dateOfBirth': u'04/14/1975',\n",
        " u'dateOfOffense': u'01/22/2001',\n",
        " u'dateReceived': u'05/21/2002',\n",
        " u'educationLevel': u'9',\n",
        " u'eyeColor': u'Brown',\n",
        " u'gender': u'Male',\n",
        " u'hairColor': u'Black',\n",
        " u'height': u'5 ft 7 in',\n",
        " u'name': u'Villegas, Jr., Jose Luis',\n",
        " u'nativeCounty': u'Nueces',\n",
        " u'nativeState': u'Texas',\n",
        " u'race': u'Hispanic',\n",
        " u'tdcjNumber': u'999417',\n",
        " u'weight': u'186'}"
       ]
      }
     ],
     "prompt_number": 38
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "and we can take advantage of the `normalizer` we've defined before"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "normalize_info = normalizer(dict(\n",
      "    age=int,\n",
      "    dateOfBirth=parse_date,\n",
      "    dateOfOffense=parse_date,\n",
      "    dateReceived=parse_date,\n",
      "    weight=int,\n",
      "    tdcjNumber=int,\n",
      "    educationLevel=int\n",
      "))\n",
      "normalize_info(res)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 40,
       "text": [
        "{u'age': 27,\n",
        " u'county': u'Nueces',\n",
        " u'dateOfBirth': (1975, 14, 4),\n",
        " u'dateOfOffense': (2001, 22, 1),\n",
        " u'dateReceived': (2002, 21, 5),\n",
        " u'educationLevel': 9,\n",
        " u'eyeColor': u'Brown',\n",
        " u'gender': u'Male',\n",
        " u'hairColor': u'Black',\n",
        " u'height': u'5 ft 7 in',\n",
        " u'name': u'Villegas, Jr., Jose Luis',\n",
        " u'nativeCounty': u'Nueces',\n",
        " u'nativeState': u'Texas',\n",
        " u'race': u'Hispanic',\n",
        " u'tdcjNumber': 999417,\n",
        " u'weight': 186}"
       ]
      }
     ],
     "prompt_number": 40
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "there's still a bit of information to scrape from the paragraphs we exctracted before. Let's print them as text and see what we can do"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "[_.text() for _ in p]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 47,
       "text": [
        "[u'Texas Department of Criminal Justice',\n",
        " u'Prior Occupation\\r\\ncook,  dishwasher, laborer ',\n",
        " u'Prior Prison Record\\r\\n  N/A ',\n",
        " u'Summary of Incident\\r\\n  On  1/22/2001, Villegas fatally stabbed three victims. A 24 year old Hispanic  female was stabbed 32 times. Her 3 year old Hispanic male son was stabbed 19  times and her 51 year old Hispanic mother was stabbed 35 times. Villegas took  the television and a vehicle from the home. ',\n",
        " u'Co-Defendants\\r\\n  None ',\n",
        " u'Race and Gender of Victim\\r\\n  Hispanic  male and Hispanic female ']"
       ]
      }
     ],
     "prompt_number": 47
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "it seems that if we split by `\\r\\n` and remove the first line, we're in business"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "[[v.strip() for v in _.text().split(\"\\r\\n\")] for _ in p[1:]]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 49,
       "text": [
        "[[u'Prior Occupation', u'cook,  dishwasher, laborer'],\n",
        " [u'Prior Prison Record', u'N/A'],\n",
        " [u'Summary of Incident',\n",
        "  u'On  1/22/2001, Villegas fatally stabbed three victims. A 24 year old Hispanic  female was stabbed 32 times. Her 3 year old Hispanic male son was stabbed 19  times and her 51 year old Hispanic mother was stabbed 35 times. Villegas took  the television and a vehicle from the home.'],\n",
        " [u'Co-Defendants', u'None'],\n",
        " [u'Race and Gender of Victim', u'Hispanic  male and Hispanic female']]"
       ]
      }
     ],
     "prompt_number": 49
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can fold this the same was as before and normalize it further. Let's abstract the whole process in a function called `extract_information` and add the paragraphs information."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "normalize_info = normalizer(dict(\n",
      "    age=int,\n",
      "    dateOfBirth=parse_date,\n",
      "    dateOfOffense=parse_date,\n",
      "    dateReceived=parse_date,\n",
      "    weight=int,\n",
      "    tdcjNumber=int,\n",
      "    educationLevel=int,\n",
      "    # We add these two refinements to the normalization function\n",
      "    height=lambda v:(int(v.split()[0]), int(v.split()[2])),\n",
      "    priorOccupation=lambda v:[_.strip() for _ in v.split()]\n",
      "))\n",
      "\n",
      "def extract_information(url):\n",
      "    res   = {}\n",
      "    t     = HTML.tree(s.get(\"death_row/\" + url))\n",
      "    table = t.query(\"table.tabledata_deathrow_table\")\n",
      "    p     = t.query(\"p\")\n",
      "    # We extract the image URL\n",
      "    res[\"image\"] = s.url(table[0].query(\"img\")[0].get(\"src\"))\n",
      "    # We extract the fields\n",
      "    fields  = [[l.strip() for l in _.text().replace(\"&nbsp;\",\"\").split(\"\\r\\n\") if l.strip()] for _ in table[0].query(\"tr\")]\n",
      "    fields += [[l.strip() for l in _.text().split(\"\\r\\n\")] for _ in p[1:]]\n",
      "    for k, v in fields:\n",
      "        k = k.split(\"(\")[0] # We get rid of the parens, if any\n",
      "        k = \"\".join([w.lower() if i == 0 else w.capitalize() for i,w in enumerate(k.split())])\n",
      "        res[k] = v\n",
      "    return normalize_info(res)\n",
      "extract_information(offender_information[0])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 52,
       "text": [
        "{u'age': 27,\n",
        " u'co-defendants': u'None',\n",
        " u'county': u'Nueces',\n",
        " u'dateOfBirth': (1975, 14, 4),\n",
        " u'dateOfOffense': (2001, 22, 1),\n",
        " u'dateReceived': (2002, 21, 5),\n",
        " u'educationLevel': 9,\n",
        " u'eyeColor': u'Brown',\n",
        " u'gender': u'Male',\n",
        " u'hairColor': u'Black',\n",
        " u'height': (5, 7),\n",
        " 'image': 'http://www.tdcj.state.tx.us/death_row/dr_info/villegasjose2.jpg',\n",
        " u'name': u'Villegas, Jr., Jose Luis',\n",
        " u'nativeCounty': u'Nueces',\n",
        " u'nativeState': u'Texas',\n",
        " u'priorOccupation': [u'cook,', u'dishwasher,', u'laborer'],\n",
        " u'priorPrisonRecord': u'N/A',\n",
        " u'race': u'Hispanic',\n",
        " u'raceAndGenderOfVictim': u'Hispanic  male and Hispanic female',\n",
        " u'summaryOfIncident': u'On  1/22/2001, Villegas fatally stabbed three victims. A 24 year old Hispanic  female was stabbed 32 times. Her 3 year old Hispanic male son was stabbed 19  times and her 51 year old Hispanic mother was stabbed 35 times. Villegas took  the television and a vehicle from the home.',\n",
        " u'tdcjNumber': 999417,\n",
        " u'weight': 186}"
       ]
      }
     ],
     "prompt_number": 52
    }
   ],
   "metadata": {}
  }
 ]
}