{
"metadata": {
"name": "Scraping"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": "Web Scrapping with Regular Expressions"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "This tutorial gives a brief introduction into the world of web scraping. Web scraping is the process of collecting information from websites in an automated fashion. For example, I was once part of a project that involved tens of thousands of searches using ProQuest to find out how many times different social movement organizations were in the newspaper. From each ProQuest, search, we were just looking for the number of hits\u2014a single number on the page that would be used in a regression analysis. In another instance, I scraped the content of a web forum, where I wanted to gather information about the post and poster. Here the data I acquired was both quantitative, such as the date and time the post was made, and qualitative, including the text of the post itself. \n\nSome other sociologically interesting uses of web scraping can be seen in this cool example, or might be some cool examples I made up. [more general stuff]\n\nTo provide you with the tools you need for these type of analyses, I\u2019m going to describe a fairly simple way to extract relevant information from a web page using Python. Some pages are more complex, and have to be handled using modules such as Mechanize, Scrapy or Selenium (but you don\u2019t have to worry about those to get started). While those tools are quite powerful, allowing you to scrape just about any page out there, they are also fairly complex, requiring an understating of not just Python but also HTML programming.\n\n[Web scraping is complicated because it requires] the page to first be opened in python, read as text, and then specific portions extracted. \n\nFor now, let\u2019s start with a simple case: you have one page of interest, and there is exactly one thing you want to get from it. Say, for example, you are a fan of the [American Sociological Review](http://asr.sagepub.com/), and you want to know when they last updated the section of the website where they post new articles. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "import urllib2 \n\n#You don't have to store your url as a string, but it makes your code more portable.\nurl='http://asr.sagepub.com/content/early/recent' ",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Next, you want to open up the page and read the contents. For now, we\u2019ll ignore the fact that errors might happen\u2014these could result from losing access to the Internet, websites going down, or entering a bad or no longer existent URL."
},
{
"cell_type": "code",
"collapsed": false,
"input": "journal_html=urllib2.urlopen(url).read()",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
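{
"cell_type": "markdown",
"metadata": {},
"source": "If you do want to guard against those errors, one option is to wrap the download in a try/except block. This is just a minimal sketch; what you do when a download fails (retry, skip, or stop) depends on your project."
},
{
"cell_type": "code",
"collapsed": false,
"input": "#A sketch of one way to catch download problems instead of letting them crash the script.\ntry:\n    journal_html=urllib2.urlopen(url).read()\nexcept urllib2.URLError, e:\n    print 'Could not download %s: %s' % (url, e)\n    journal_html=''",
"language": "python",
"metadata": {},
"outputs": []
},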
{
"cell_type": "markdown",
"metadata": {},
"source": "You can make sure that you downloaded the right thing by printing out the first part of it."
},
{
"cell_type": "code",
"collapsed": false,
"input": "print journal_html[:400]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "<!DOCTYPE html\n PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html\n xmlns=\"http://www.w3.org/1999/xhtml\"\n xml:lang=\"en\"\n lang=\"en\">\n <head>\n <meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />\n <title>Ahead of Print Articles (date view) </title>\n <meta name=\"googlebot\" content=\"NOODP\" />\n <met\n"
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Before you ask Python to look for the text you are interested in, you need to locate them in the coding of a webpage. What you\u2019re going to do next is copy the text that you want to scrape. You\u2019ll paste this into a find box in a minute. In this case, I\u2019m looking on the ASR page for the date the page was last updated, \u201cAugust 30, 2012\u201d, so I\u2019ll copy that down so I can look for it later. Don\u2019t worry too much about how much of the text you copy\u2014this is just so you can look for the relevant text in the raw HTML.\n\nNow that you know what you\u2019re looking for, visit this site in your browser and view the raw HTML code of the page. In Chrome, you access the code for the page by clicking \u201cview\u201d, and then \u201cView-Developer-View Source\u201d. In Safari, you do this by \u201cView-View Source.\u201d The source shows the raw text that your browser converts into a nice-looking web page. \n\nDon\u2019t worry that this page doesn\u2019t make sense. Scrapping can be a lot easier if you know about the structure of HTML, and advance scraping requires that you do, but most tasks can be handled without knowing what the HTML is doing. The key thing to know is that this page contains both things that are displayed to you, like \u201cLast updated August 30, 2012\u201d and things that tell the browser how to display that text that aren\u2019t shown, like \u201c<p class=\"pap-last-modified\">\u201d. \n\nNow that you have access the HTML, try to find the text you are looking for in the file. The easiest way to do this is by going to \u201cEdit\u201d and \u201cFind\u201d and then pasting in the text you are looking for that you copied earlier. Hopeful, you will find that one (and only one) copy of the text you are looking for is on the page. In this case, this worked perfectly. In a less ideal scenario, the text might appear in multiple places, either because the same information appears several times on the page, or because the phrase you are looking for is used in multiple contexts. If the phrase doesn\u2019t appear at the all, try a shorter version\u2014there might be some code in the middle. For example, while \u201cLast updated August 30, 2012\u201d appears together when viewing the page, it doesn\u2019t show up as one piece when viewing the source code. In the HTML, there is a \u201c<span class=\"pap-last-modified-date\">\u201d in the middle. \n\nWhat you want to find in the HTML code is some pattern of text that will uniquely identify the portion of the text you want to capture such as the date of posting, even after you remove the exact thing you are looking for, in this case, August 30, 2012. In this case, we are looking for the date the page was updated, and it won\u2019t always be August 30th\u2014if it was a constant, there would be no point in scraping the page. In an ideal world, the HTML would have something like \u201clast-modified-date>August 30, 2012<\u201d with \u201clast-modified-date\u201d not appearing anywhere else on the page. You should make sure to grab at least one character before and at least one character after the information that you want. Later on, you\u2019ll need that information so you know where to start and stop grabbing the relevant text. 
You don\u2019t need the front and the back to both uniquely identify the relevant text; most of the time the front part will have the identifier and the back will identify that you\u2019ve reached the end of the sections you\u2019re interested in scraping.\n\nTo confirm that you have selected a satisfactory section of the text that identifies the component you want to scrape, search for it on the HTLM source. You should make sure it is something that is only found once. You won\u2019t able to find something that does what you want, but we\u2019ll deal with those problems later. \n\nOnce you think you have the text, copy the whole thing, including the phrase you want to extract and paste it somewhere you can edit, like a text editor. I use Editra for text editing because Python can run the files from Editra, but there are many options out there. In my case, the text I\u2019ve selected looks like:"
},
{
"cell_type": "raw",
"metadata": {},
"source": "<span class=\"pap-last-modified-date\">August 30, 2012</span>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We want to prepare this text for use for a special kind of search involving a regular expression. Regular expressions are a very powerful way to find string patterns. You are likely familiar with something like it when you use \"*\" as a wildcard character in a search such as for all files containing the word \"Python\". Regular expressions aren\u2019t unique to Python, and some variant of them is found in many computer programming languages. Regular expressions are both powerful and complex, and the syntax can quickly become complicated. So, it makes sense to start with just one pattern. From the HTLM that you pasted, cut the text that you want to find, August 30, 2012 in my case, and replace it with `(.*?)`."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import re\nmatch='<span class=\"pap-last-modified-date\">(.*?)</span>'\ndate=re.findall(match,journal_html)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": "print date",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['February 22, 2013']\n"
}
],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": "print date[0]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "February 22, 2013\n"
}
],
"prompt_number": 11
},
{
"cell_type": "code",
"collapsed": false,
"input": "sections=the_page.split('<div class=\"gca-buttons\">')",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
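{
"cell_type": "markdown",
"metadata": {},
"source": "As a quick check, you might look at how many pieces the split produced. Keep in mind that the chunk before the first marker and the chunk after the last one won\u2019t contain article information, which is why some of the loops below print empty results for those pieces."
},
{
"cell_type": "code",
"collapsed": false,
"input": "#How many chunks did the split produce?\nprint len(sections)",
"language": "python",
"metadata": {},
"outputs": []
},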
{
"cell_type": "code",
"collapsed": false,
"input": "author_match='cit-auth cit-auth-type-author\">(.*?)<'",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 13
},
{
"cell_type": "code",
"collapsed": false,
"input": "for section in sections:\n authors=re.findall(author_match,section)\n print authors",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[]\n['Jo C. Phelan', 'Bruce G. Link', 'Naumi M. Feldman']\n['Bruce G. Link', 'Richard M. Carpiano', 'Margaret M. Weden']\n[]\n"
}
],
"prompt_number": 14
},
{
"cell_type": "code",
"collapsed": false,
"input": "for section in sections:\n authors=re.findall(author_match,section)\n print ' and '.join(authors)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "\nJo C. Phelan and Bruce G. Link and Naumi M. Feldman\nBruce G. Link and Richard M. Carpiano and Margaret M. Weden\n\n"
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": "That would work if there was always just two authors, but that isn't always the case. We can slightly modify the loop that it does something different for 1 author, two authors, or more than two authors. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "for section in sections:\n authors=re.findall(author_match,section)\n if len(authors)==1:\n print authors[0]\n elif len(authors)==2:\n print ' and '.join(authors)\n elif len(authors)>2:\n print ', '.join(authors[:-1])+', and '+authors[-1]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Jo C. Phelan, Bruce G. Link, and Naumi M. Feldman\nBruce G. Link, Richard M. Carpiano, and Margaret M. Weden\n"
}
],
"prompt_number": 18
},
{
"cell_type": "code",
"collapsed": false,
"input": "for section in sections:\n title_search=re.findall('</ul><span class=\"cit-title\">(.*?)<',section)\n subtitle_search=re.findall('class=\"cit-subtitle\">(.*?) <',section)\n try:\n title=title_search[0]\n except Exception, e:\n title='None'\n try:\n title='%s: %s' % (title_search[0],subtitle_search[0])\n except Exception, e:\n pass\n print title\n \n article_url=re.findall('href=\"(.*?full)',section)\n print article_url\n ",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "None\nThe Genomic Revolution and Beliefs about Essential Racial Differences: A Backdoor to Eugenics?\nCan Honorific Awards Give Us Clues about the Connection between Socioeconomic Status and Mortality? \nNone\n"
}
],
"prompt_number": 31
},
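{
"cell_type": "markdown",
"metadata": {},
"source": "Putting the pieces together: here is a rough sketch of how you might combine the author, title, and URL patterns from above into a single loop over the article sections. It just reuses the regular expressions already defined; a real script would probably store the results in a list or write them to a file rather than printing them."
},
{
"cell_type": "code",
"collapsed": false,
"input": "#A sketch that gathers authors, title, and URL for each article section.\nfor section in sections:\n    authors=re.findall(author_match,section)\n    titles=re.findall('</ul><span class=\"cit-title\">(.*?)<',section)\n    urls=re.findall('href=\"(.*?full)',section)\n    #Skip the chunks at the start and end of the page that aren't articles.\n    if len(titles)==0:\n        continue\n    print ', '.join(authors)\n    print titles[0]\n    print urls",
"language": "python",
"metadata": {},
"outputs": []
},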
{
"cell_type": "code",
"collapsed": false,
"input": "",
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}