Tutorial on web scraping for PL SC 597I: Event Data
{
"metadata": {
"name": "Scraping Tutorial"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Web Scraping\n",
"======\n",
"\n",
"> Even with the best of websites, I don\u2019t think I\u2019ve ever encountered a scraping job that couldn\u2019t be described as *\u201cA small and simple general model with heaps upon piles of annoying little exceptions\u201d* \n",
"\n",
">> \\- Swizec Teller [http://swizec.com/blog/scraping-with-mechanize-and-beautifulsoup/swizec/5039](http://swizec.com/blog/scraping-with-mechanize-and-beautifulsoup/swizec/5039)\n",
"\n",
"##What is it?\n",
"\n",
"A large portion of the data that we as social scientists are interested in resides on the web in manner. Web scraping is a method for pulling data from the structured (or not so structured!) HTML that makes up a web page. Python has numerous libraries for approaching this type of problem, many of which are incredibly powerful. If there is something you want to do, there's usually a way to accomplish it. Perhaps not easily, but it can be done. \n",
"\n",
"\n",
"\n",
"##How is it accomplished?\n",
"\n",
"In general, there are three problems that you might face when undertaking a scraping task:\n",
"\n",
"1. You have a single page, or a set of pages, that you know of and you want to scrape.\n",
"2. You have a source that generates links, e.g., [RSS feeds](http://rss.nytimes.com/services/xml/rss/nyt/World.xml), to various pages with the same structure.\n",
"3. You have a page that contains many pages of interest that are scattered across the file system and you only have general rules for reaching these pages. \n",
"\n",
"The key is that you must identify which type of problem you have. After this, you must look at the HTML structure of a webpage and construct a script that will select the parts of the page that are of interest to you.\n",
"\n",
"\n",
"\n",
"##There's a library for that! (Yea, I know...)\n",
"\n",
"As mentioned previously, Python has various libraries for scraping tasks. The ones I have found the most useful are:\n",
"\n",
"- [pattern](http://www.clips.ua.ac.be/pages/pattern)\n",
"- [lxml](http://lxml.de/)\n",
"- [requests](http://docs.python-requests.org/en/latest/)\n",
"- [Scrapy](http://doc.scrapy.org/en/0.16/)\n",
"- [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/)\n",
"\n",
"\n",
"In addition you need some method to examine the source of a webpage in a structured manner. I use Chrome which, as a WebKit browser, allows for \"Inspect Element\" functionality. Alternatively there is [Firebug](https://getfirebug.com/) for Firefox. I have no idea about Safari, Opera, or any other browser you wish to use. \n",
"\n",
"So, let's look at some [webpage source](http://www.nytimes.com/reuters/2013/01/25/world/americas/25reuters-venezuela-prison.html?partner=rss&emc=rss). I'm going to pick on the New York Times throughout (I thought about using the eventdata.psu.edu page...it actually has very well formatted HTML). \n",
"\n",
"\n",
"\n",
"##On to the Python\n",
"\n",
"First, I feel obligated to show the philosophy of Python any time I give a talk that uses Python. So, let's take a look."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import this"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"The Zen of Python, by Tim Peters\n",
"\n",
"Beautiful is better than ugly.\n",
"Explicit is better than implicit.\n",
"Simple is better than complex.\n",
"Complex is better than complicated.\n",
"Flat is better than nested.\n",
"Sparse is better than dense.\n",
"Readability counts.\n",
"Special cases aren't special enough to break the rules.\n",
"Although practicality beats purity.\n",
"Errors should never pass silently.\n",
"Unless explicitly silenced.\n",
"In the face of ambiguity, refuse the temptation to guess.\n",
"There should be one-- and preferably only one --obvious way to do it.\n",
"Although that way may not be obvious at first unless you're Dutch.\n",
"Now is better than never.\n",
"Although never is often better than *right* now.\n",
"If the implementation is hard to explain, it's a bad idea.\n",
"If the implementation is easy to explain, it may be a good idea.\n",
"Namespaces are one honking great idea -- let's do more of those!\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Okay, cool. Whatever. Let's get down to some actual webscraping. \n",
"\n",
"###Scraping a page that you know\n",
"\n",
"The easiest approach to webscraping is getting the content from a page that you know in advance. I'll go ahead and keep using that NYT page we looked at earlier. There are three basic steps to scraping a single page:\n",
"\n",
"1. Get (request) the page\n",
"2. Parse the page content\n",
"3. Select the content of interest using an XPath selector\n",
"\n",
"The following code executes these three steps and prints the result. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import requests\n",
"import lxml.html as lh\n",
"\n",
"url = 'http://www.nytimes.com/reuters/2013/01/25/world/americas/25reuters-venezuela-prison.html?partner=rss&emc=rss'\n",
"page = requests.get(url)\n",
"doc = lh.fromstring(page.content)\n",
"text = doc.xpath('//p[@itemprop=\"articleBody\"]')\n",
"finalText = str()\n",
"for par in text:\n",
" finalText += par.text_content()\n",
"\n",
"print finalText"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So we now have our lovely output. This output can be manipulated in various ways, or written to an output file.\n",
"\n",
"###Scraping generated links\n",
"\n",
"Let's say you want to get a stream of news stories in an easy manner. You could visit the homepage of the NYT and work from there, or you can use an [RSS feed](http://rss.nytimes.com/services/xml/rss/nyt/World.xml). RSS stands for Real Simple Syndication and is, at its heart, an XML document. This allows it to be easily parsed. The fantastic library `pattern` allows for easy parsing of RSS feeds. Using `pattern`'s `Newsfeed()` method, it is possible to parse a feed and obtain attributes of the XML document. The `search()` method returns an iterable composed of the individual stories. Each result has a variety of attributes such as `.url`, `.title`, `.description`, and more. The following code demonstrates these methods."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pattern.web\n",
"\n",
"url = 'http://rss.nytimes.com/services/xml/rss/nyt/World.xml'\n",
"results = pattern.web.Newsfeed().search(url, count=5)\n",
"results\n",
"\n",
"print '%s \\n\\n %s \\n\\n %s \\n\\n' % (results[0].url, results[0].title, results[0].description)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That looks pretty good, but the description looks nastier than we would generally prefer. Luckily, `pattern` provides functions to get rid of the HTML in a string. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print '%s \\n\\n %s \\n\\n %s \\n\\n' % (results[0].url, results[0].title, pattern.web.plaintext(results[0].description))"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While it's all well and good to have the title and description of a story this is often insufficient (some descriptions are just the title, which isn't particularly helpful). To get further information on the story, it is possible to combine the single-page scraping discussed previously and the results from the RSS scrape. The following code implements a function to scrape the NYT article pages, which can be done easily since the NYT is wonderfully consistent in their HTML, and then iterates over the results applying the `scrape` function to each result."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import codecs\n",
"\n",
"outputFile = codecs.open('~/tutorialOutput.txt', encoding='utf-8', mode='a')\n",
"\n",
"def scrape(url):\n",
" page = requests.get(url)\n",
" doc = lh.fromstring(page.content)\n",
" text = doc.xpath('//p[@itemprop=\"articleBody\"]')\n",
" finalText = str()\n",
" for par in text:\n",
" finalText += par.text_content()\n",
" return finalText\n",
"\n",
"for result in results:\n",
" outputText = scrape(result.url)\n",
" outputFile.write(outputText)\n",
"\n",
"outputFile.close()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###Scraping arbitrary websites\n",
"\n",
"The final approach is for a webpage that contains information you want and the pages are spread around in a fairly consistent manner, but there is no simple, straightfoward manner in which the pages are named.\n",
"\n",
"I'll offer a brief aside here to mention that it is often possible to make slight modifications to the URL of a website and obtain many different pages. For example, a website that contains Indian parliament speeches has the URL `http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl=` with differing values appended after the `=`. Thus, using a `for-loop` allows for the programatic creation of different URLs. Some sample code is below."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"url = 'http://164.100.47.132/LssNew/psearch/Result13.aspx?dbsl='\n",
"\n",
"for i in xrange(5175,5973):\n",
" newUrl = url + str(i)\n",
" print 'Scraping: %s' % newUrl"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Getting back on topic, it is often more difficult than the above to iterate over numerous webpages within a site. This is where the `Scrapy` library comes in. `Scrapy` allows for the creation of web spiders that crawl over a webpage, following any links that it finds. This is often far more difficult to implement than a simple scraper since it requires the identification of rules for link following. The [State Department](http://www.state.gov/r/pa/prs/dpb/2012/index.htm) offers a good example. I don't really have time to go into the depths of writing a `Scrapy` spider, but I thought I would put up some code to illustrate what it looks like."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from scrapy.contrib.spiders import CrawlSpider, Rule\n",
"from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor\n",
"from scrapy.selector import HtmlXPathSelector\n",
"from scrapy.item import Item\n",
"from BeautifulSoup import BeautifulSoup\n",
"import re\n",
"import codecs\n",
"\n",
"class MySpider(CrawlSpider):\n",
" name = 'statespider' #name is a name\n",
" start_urls = ['http://www.state.gov/r/pa/prs/dpb/2010/index.htm',\n",
" ] #defines the URL that the spider should start on. adjust the year.\n",
"\n",
" #defines the rules for the spider\n",
" rules = (Rule(SgmlLinkExtractor(allow=('/2010/'), restrict_xpaths=('//*[@id=\"local-nav\"]'),)), #allows only links within the navigation panel that have /year/ in them.\n",
"\n",
" Rule(SgmlLinkExtractor(restrict_xpaths=('//*[@id=\"dpb-calendar\"]',), deny=('/video/')), callback='parse_item'), #follows links within the caldendar on the index page for the individuals years, while denying any links with /video/ in them\n",
"\n",
" )\n",
"\n",
" def parse_item(self, response):\n",
" self.log('Hi, this is an item page! %s' % response.url) #prints the response.url out in the terminal to help with debugging\n",
" \n",
" #Insert code to scrape page content\n",
"\n",
" #opens the file defined above and writes 'texts' using utf-8\n",
" with codecs.open(filename, 'w', encoding='utf-8') as output:\n",
" output.write(texts)\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##The Pitfalls of Webscraping\n",
"\n",
"Web scraping is much, much, *much*, more of an art than a science. It is often non-trivial to identify the XPath selector that will get you what you want. Also, some web programmers can't seem to decide how they want to structure the pages they write, so they just change the HTML every few pages. Notice that for the NYT example if `articleBody` gets changed to `articleBody1`, everything breaks. There are ways around this that are often convoluted, messy, and hackish. Usually, however, where there is a will there is a way.\n",
"\n",
"...brief pause to demonstrate the lengths this can go to.\n",
"\n",
"##PITF Human Atrocities\n",
"\n",
"As a wrap up, I thought I would show the workflow I have been using to perform real-time scraping from various news sites of stories pertaining to human atrocities. This is illustrative both of web scraping and of the issues that can accompany programming. \n",
"\n",
"The general flow of the scraper is:\n",
"\n",
"RSS feed -> identify relevant stories -> scrape story -> place results in mongoDB -> repeat every hour"
]
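},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is a minimal sketch of the kind of workaround mentioned in the pitfalls discussion above: keep a list of candidate XPath selectors and use the first one that matches anything. The second selector in the list (`articleBody1`) is purely hypothetical and simply stands in for whatever variant markup you actually run into."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import requests\n",
"import lxml.html as lh\n",
"\n",
"#candidate selectors, tried in order; the second is a hypothetical variant\n",
"candidatePaths = ['//p[@itemprop=\"articleBody\"]',\n",
"                  '//p[@itemprop=\"articleBody1\"]']\n",
"\n",
"def resilientScrape(url):\n",
"    page = requests.get(url)\n",
"    doc = lh.fromstring(page.content)\n",
"    for path in candidatePaths:\n",
"        pars = doc.xpath(path)\n",
"        if pars: #stop at the first selector that returns anything\n",
"            return ''.join(par.text_content() for par in pars)\n",
"    return '' #nothing matched; decide how loudly you want this to fail\n",
"\n",
"nytUrl = 'http://www.nytimes.com/reuters/2013/01/25/world/americas/25reuters-venezuela-prison.html?partner=rss&emc=rss'\n",
"print resilientScrape(nytUrl)[:200]"
],
"language": "python",
"metadata": {},
"outputs": []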
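},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What follows is a bare-bones, hypothetical sketch of that hourly loop, just to make the arrows concrete. The keyword filter, the MongoDB database and collection names, and the single RSS feed are placeholders standing in for the real thing, and there is no deduplication or error handling here."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import time\n",
"import requests\n",
"import lxml.html as lh\n",
"import pattern.web\n",
"from pymongo import MongoClient\n",
"\n",
"#hypothetical keyword filter used to flag relevant stories\n",
"keywords = ['killed', 'massacre', 'atrocities']\n",
"\n",
"feedUrl = 'http://rss.nytimes.com/services/xml/rss/nyt/World.xml'\n",
"stories = MongoClient()['eventData']['stories'] #placeholder database and collection names\n",
"\n",
"def scrape(url):\n",
"    page = requests.get(url)\n",
"    doc = lh.fromstring(page.content)\n",
"    return ''.join(par.text_content() for par in doc.xpath('//p[@itemprop=\"articleBody\"]'))\n",
"\n",
"while True:\n",
"    #RSS feed -> identify relevant stories\n",
"    for result in pattern.web.Newsfeed().search(feedUrl, count=20):\n",
"        description = pattern.web.plaintext(result.description)\n",
"        if any(word in description.lower() for word in keywords):\n",
"            #scrape story -> place results in MongoDB (no check for stories already stored)\n",
"            stories.insert({'url': result.url,\n",
"                            'title': result.title,\n",
"                            'text': scrape(result.url)})\n",
"    #repeat every hour\n",
"    time.sleep(3600)"
],
"language": "python",
"metadata": {},
"outputs": []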
}
],
"metadata": {}
}
]
}