IPython notebook for the name-PDF scraper
{
"metadata": {
"name": "Scraping a PDF with Scraperwikis PDFtoXML"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "While for simple single or double-page tables [tabula](http://jazzido.github.io/tabula/) is a viable option - if you have PDFs with tables over multiple pages you'll soon grow old marking them.\n\nThis is where you'll need some scripting. Thanks to [scraperwikis library](https://pypi.python.org/pypi/scraperwiki) (```pip install scraperwiki```) and the included pdftoxml - scraping PDFs has become a feasible task in python. On a recent Hacks/Hackers event we run into a candidate - that was quite tricky to scrape - I decided to protocol the process here."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import scraperwiki, urllib2",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "raw",
"metadata": {},
"source": "First import the scraperwiki library and urrllib2 - since the file we're using is on a webserver\n"
},
{
"cell_type": "code",
"collapsed": false,
"input": "u=urllib2.urlopen(\"http://images.derstandard.at/2013/08/12/VN2p_2012.pdf\") #open the url for the PDF\nx=scraperwiki.pdftoxml(u.read()) # interpret it as xml\nprint x[:1024] # let's see what's in there abbreviated...\n",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE pdf2xml SYSTEM \"pdf2xml.dtd\">\n\n<pdf2xml producer=\"poppler\" version=\"0.22.5\">\n<page number=\"1\" position=\"absolute\" top=\"0\" left=\"0\" height=\"1263\" width=\"892\">\n\t<fontspec id=\"0\" size=\"8\" family=\"Times\" color=\"#000000\"/>\n\t<fontspec id=\"1\" size=\"7\" family=\"Times\" color=\"#000000\"/>\n<text top=\"42\" left=\"64\" width=\"787\" height=\"12\" font=\"0\"><b>TABELLE VN2Ap/1 30/07/13 11.38.44 BLATT 1 </b></text>\n<text top=\"58\" left=\"64\" width=\"718\" height=\"12\" font=\"0\"><b>STATISTIK ALLER VORNAMEN (TEILWEISE PHONETISCH ZUSAMMENGEFASST, ALPHABETISCH SORTIERT) F\u00dcR NEUGEBORENE KNABEN MIT </b></text>\n<text top=\"73\" left=\"64\" width=\"340\" height=\"12\" font=\"0\"><b>\u00d6STERREICHISCHER STAATSB\u00dcRGERSCHAFT 2012 - \u00d6STERREICH </b></text>\n<text top=\"89\" left=\"64\" width=\"6\" height=\"12\" font=\"0\"><b> </b></text>\n<text top=\"104\" left=\"64\" width=\"769\" height=\"12\" font=\"0\"><b>VORNAMEN ABSOLUT % \n"
}
],
"prompt_number": 35
},
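{
"cell_type": "markdown",
"metadata": {},
"source": "Since fetching and converting the PDF takes a moment, you might want to cache the converted xml locally (an optional convenience - the filename here is just an example, nothing below depends on it):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "with open(\"VN2p_2012.xml\",\"w\") as f: # cache the converted xml so we don't have to re-download\n    f.write(x)",
"language": "python",
"metadata": {},
"outputs": []
},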
{
"cell_type": "markdown",
"metadata": {},
"source": "As you can see above, we have successfully loaded the PDF as xml (take a look at the PDF by just opening the url given, it should give you an idea how it is structured). \n\nThe basic structure of a pdf parsed this way will always be ```page``` tags followed by ```text``` tags contianing the information, positioning and font information. The positioning and font information can often help to get the table we want - however not in this case: everything is font=\"0\" and left=\"64\". \n\nWe can now use [xpath](http://en.wikipedia.org/wiki/XPath) to query our document..."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import lxml\nr=lxml.etree.fromstring(x)\nr.xpath('//page[@number=\"1\"]')",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 4,
"text": "[<Element page at 0x31c32d0>]"
}
],
"prompt_number": 4
},
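{
"cell_type": "markdown",
"metadata": {},
"source": "To see the positioning and font information mentioned above for yourself, you can inspect the attributes of a single ```text``` node (a quick sanity check - nothing later depends on it):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "t=r.xpath('//text')[0] # grab the first text node in the document\nprint t.attrib # the positioning and font attributes pdftoxml produced",
"language": "python",
"metadata": {},
"outputs": []
},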
{
"cell_type": "markdown",
"metadata": {},
"source": "and also get some lines out of it\n"
},
{
"cell_type": "code",
"collapsed": false,
"input": "r.xpath('//text[@left=\"64\"]/b')[0:10] #array abbreviated for legibility",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 32,
"text": "[<Element b at 0x31c3320>,\n <Element b at 0x31c3550>,\n <Element b at 0x31c35a0>,\n <Element b at 0x31c35f0>,\n <Element b at 0x31c3640>,\n <Element b at 0x31c3690>,\n <Element b at 0x31c36e0>,\n <Element b at 0x31c3730>,\n <Element b at 0x31c3780>,\n <Element b at 0x31c37d0>]"
}
],
"prompt_number": 32
},
{
"cell_type": "code",
"collapsed": false,
"input": "r.xpath('//text[@left=\"64\"]/b')[8].text",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 7,
"text": "u'Aaron * 64 0,19 91 Aim\\xe9 1 0,00 959 '"
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Great - this will help us. If we look at the document you'll notice that there are all boys names from page 1-20 and girls names from page 21-43 - let's get them seperately..."
},
{
"cell_type": "code",
"collapsed": false,
"input": "boys=r.xpath('//page[@number<=\"20\"]/text[@left=\"64\"]/b')\ngirls=r.xpath('//page[@number>\"20\" and @number<=\"43\"]/text[@left=\"64\"]/b')\nprint boys[8].text\nprint girls[8].text",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Aaron * 64 0,19 91 Aim\u00e9 1 0,00 959 \nAarina 1 0,00 1.156 Ala\u00efa 1 0,00 1.156 \n"
}
],
"prompt_number": 13
},
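{
"cell_type": "markdown",
"metadata": {},
"source": "If you want to double-check the page split, counting the ```page``` nodes should come out at 43 (a quick sanity check, assuming the document really has 43 pages as described above):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "print len(r.xpath('//page')) # total number of pages in the document\nprint len(boys), len(girls) # number of raw lines selected for each gender",
"language": "python",
"metadata": {},
"outputs": []
},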
{
"cell_type": "markdown",
"metadata": {},
"source": "fantastic - but you'll also notice something - the columns are all there, sperated by whitespaces. And also Aaron has an asterisk - we want to remove it (the asterisk is explained in the original doc).\n\nTo split it up into columns I'll create a small function using regexes to split it."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import re\n\ndef split_entry(e):\n return re.split(\"[ ]+\",e.text.replace(\"*\",\"\")) # we're removing the asterisk here as well...",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
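{
"cell_type": "markdown",
"metadata": {},
"source": "As an aside: python's built-in ```split()``` with no argument already splits on runs of whitespace and drops empty strings, so it could replace both the regex and the filtering step further down. A sketch of that alternative (not used in the rest of this notebook):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "def split_entry_alt(e):\n    return e.text.replace(\"*\",\"\").split() # split() without arguments collapses whitespace and drops empty strings",
"language": "python",
"metadata": {},
"outputs": []
},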
{
"cell_type": "markdown",
"metadata": {},
"source": "now let's apply it to boys and girls"
},
{
"cell_type": "code",
"collapsed": false,
"input": "boys=[split_entry(i) for i in boys]\ngirls=[split_entry(i) for i in girls]\nprint boys[8]\nprint girls[8]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[u'Aaron', u'64', u'0,19', u'91', u'Aim\\xe9', u'1', u'0,00', u'959', u'']\n[u'Aarina', u'1', u'0,00', u'1.156', u'Ala\\xefa', u'1', u'0,00', u'1.156', u'']\n"
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": "That worked!. Notice the empty string u'' at the end? I'd like to filter it. I'll do this using the ifilter function from itertools"
},
{
"cell_type": "code",
"collapsed": false,
"input": "import itertools\nboys=[[i for i in itertools.ifilter(lambda x: x!=\"\",j)] for j in boys]\ngirls=[[i for i in itertools.ifilter(lambda x: x!=\"\",j)] for j in girls]\nprint boys[8]\nprint girls[8]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[u'Aaron', u'64', u'0,19', u'91', u'Aim\\xe9', u'1', u'0,00', u'959']\n[u'Aarina', u'1', u'0,00', u'1.156', u'Ala\\xefa', u'1', u'0,00', u'1.156']\n"
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Worked, this cleaned up our boys and girls arrays. We want to make them properly though - there are two columns each four fields wide. I'll do this with a little function"
},
{
"cell_type": "code",
"collapsed": false,
"input": "def take4(x):\n if (len(x)>5):\n return [x[0:4],x[4:]]\n else:\n return [x[0:4]]\n \nboys=[take4(i) for i in boys]\ngirls=[take4(i) for i in girls]\nprint boys[8]\nprint girls[8]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[[u'Aaron', u'64', u'0,19', u'91'], [u'Aim\\xe9', u'1', u'0,00', u'959']]\n[[u'Aarina', u'1', u'0,00', u'1.156'], [u'Ala\\xefa', u'1', u'0,00', u'1.156']]\n"
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": "ah that worked nicely! - now let's make sure it's one array with both options in it -for this i'll use reduce"
},
{
"cell_type": "code",
"collapsed": false,
"input": "boys=reduce(lambda x,y: x+y, boys, [])\ngirls=reduce(lambda x,y: x+y, girls,[])\nprint boys[10]\nprint girls[10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['Aiden', '2', '0,01', '667']\n['Alaa', '1', '0,00', '1.156']\n"
}
],
"prompt_number": 18
},
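{
"cell_type": "markdown",
"metadata": {},
"source": "(A note on the design choice: reduce with list concatenation builds many intermediate lists and gets slow for long inputs - ```itertools.chain.from_iterable``` does the same flattening in linear time. A sketch of the equivalent helper, not applied here:)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "import itertools\nflatten=lambda lists: list(itertools.chain.from_iterable(lists)) # linear-time equivalent of the reduce above",
"language": "python",
"metadata": {},
"outputs": []
},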
{
"cell_type": "markdown",
"metadata": {},
"source": "perfect - now let's add a gender to the entries\n"
},
{
"cell_type": "code",
"collapsed": false,
"input": "for x in boys:\n x.append(\"m\")\n\nfor x in girls:\n x.append(\"f\")\n\nprint boys[10]\nprint girls[10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['Aiden', '2', '0,01', '667', 'm']\n['Alaa', '1', '0,00', '1.156', 'f']\n"
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We got that! For further processing I'll join the arrays up"
},
{
"cell_type": "code",
"collapsed": false,
"input": "names=boys+girls\nprint names[10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['Aiden', '2', '0,01', '667', 'm']\n"
}
],
"prompt_number": 29
},
{
"cell_type": "markdown",
"metadata": {},
"source": "let's take a look at the full array..."
},
{
"cell_type": "code",
"collapsed": false,
"input": "names[0:10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 30,
"text": "[['TABELLE', 'VN2Ap/1', '30/07/13', '11.38.44', 'm'],\n ['BLATT', '1', 'm'],\n [u'STATISTIK', u'ALLER', u'VORNAMEN', u'(TEILWEISE', 'm'],\n [u'PHONETISCH',\n u'ZUSAMMENGEFASST,',\n u'ALPHABETISCH',\n u'SORTIERT)',\n u'F\\xdcR',\n u'NEUGEBORENE',\n u'KNABEN',\n u'MIT',\n 'm'],\n [u'\\xd6STERREICHISCHER', u'STAATSB\\xdcRGERSCHAFT', u'2012', u'-', 'm'],\n ['m'],\n ['VORNAMEN', 'ABSOLUT', '%', 'RANG', 'm'],\n ['VORNAMEN', 'ABSOLUT', '%', 'RANG', 'm'],\n ['m'],\n ['INSGESAMT', '34.017', '100,00', '.', 'm']]"
}
],
"prompt_number": 30
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Notice there is still quite a bit of mess in there: basically all the lines starting with an all caps entry, \"der\", \"m\" or \"f\". Let's remove them...."
},
{
"cell_type": "code",
"collapsed": false,
"input": "names=itertools.ifilter(lambda x: not x[0].isupper(),names) # remove allcaps entries\nnames=[i for i in itertools.ifilter(lambda x: not (x[0] in [\"der\",\"m\",\"f\"]),names)] # remove all entries that are \"der\",\"m\" or \"f\"\nnames[0:10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 31,
"text": "[['Aiden', '2', '0,01', '667', 'm'],\n ['Aiman', '3', '0,01', '532', 'm'],\n [u'Aaron', u'64', u'0,19', u'91', 'm'],\n [u'Aim\\xe9', u'1', u'0,00', u'959', 'm'],\n ['Abbas', '2', '0,01', '667', 'm'],\n ['Ajan', '2', '0,01', '667', 'm'],\n ['Abdallrhman', '1', '0,00', '959', 'm'],\n ['Ajdin', '15', '0,04', '225', 'm'],\n ['Abdel', '1', '0,00', '959', 'm'],\n ['Ajnur', '1', '0,00', '959', 'm']]"
}
],
"prompt_number": 31
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Woohoo - we have a cleaned up list. Now let's write it as csv...."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import csv\nf=open(\"names.csv\",\"wb\") #open file for writing\nw=csv.writer(f) #open a csv writer\n\nw.writerow([\"Name\",\"Count\",\"Percent\",\"Rank\",\"Gender\"]) #write the header\n\nfor n in names:\n w.writerow([i.encode(\"utf-8\") for i in n]) #write each row\n\nf.close()",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 27
},
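{
"cell_type": "markdown",
"metadata": {},
"source": "To make sure the file came out right, we can read the first few rows back (a quick check, assuming names.csv sits in the working directory):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "with open(\"names.csv\",\"rb\") as f: # read back in binary mode, matching how we wrote it\n    for row in list(csv.reader(f))[:3]: # the header plus the first two data rows\n        print row",
"language": "python",
"metadata": {},
"outputs": []
},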
{
"cell_type": "markdown",
"metadata": {},
"source": "Done, We've scraped a multi-page PDF using python. All in all this was a fairly quick way to get the data out of a PDF using scraperwiki tools.\n"
},
{
"cell_type": "code",
"collapsed": false,
"input": "",
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}