@nealcaren
Created March 14, 2013 02:38
Scrapeme
{
"metadata": {
"name": "Lax"
},
"name": "Lax",
"nbformat": 2,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"source": "Web scraping in Python\n---------------------\nI find sports more exciting when I know what to expect. How many points\nis a team scoring relative to how they did in the past? Is there defense\nstopping a high scoring team or one with a struggling offense? For many sports,\nLas Vegas provides me the information I want with the line (expect difference \nbetween the home team's score and the visitor's score) and the total (the sum of the \nexpected scores). For example, at game time on Monday, March 11th,\nIona's men's basketball team was a four point favorite (-4) over Manhattan, with a total\nof 117.\n\nVegas doesn't provide lines for NCAA women's lacrosse (or any other college sport\nbut football and men's basketball. [Laxpower](http://www.laxpower.com/update13/binwom/rating01.php)\nhas a power rating. Comparing two team's ratings and adding about a home\nfield advantage does give a pretty good line,\nbut they don't provide total projections. So I decided to create my own\nVegas style lines and totals. You can see the [results](http://www.unc.edu/~ncaren/lax.html).\n\nCreating lines involves: finding and downloading game data; developing\na power rating model to rank each team; and using those models to predict \nfuture games. More generally, this is the same process used to scrape\ndata from websites for quantitative analysis. It's the same process \nI used, for example, to (analyze)[http://nealcaren.web.unc.edu/files/2012/05/smoc.pdf]\na white racist web forum.\n\nLuckily, Laxpower has all the game information, both for games played and scheduled.\nI wanted to cycle through each of the pages to get the information, but first\nI needed to know the URLs for all those pages. Luckily the [ranking page](http://www.laxpower.com/update13/binwom/rating01.php)\nhas all the teams listed along with links to their pages, so I can grab the information\nfrom there.\n\nIn my browser, I looked at the source for the ranking page--the raw HTML. I searched\nfor \"North Carolina\" so I could get a sense of what each link looked like. Fortunately, \nthe page had two pieces of information for each team list in a way that was very easy\nto extract. Each link began with a `\"` and ended in `PHP\"`. This was followed by a `>`,\nthe school's name, and then a `>'. This is a situation where a simple\n[regular expression](http://www.regular-expressions.info) would allow me to pull\nout the information I needed.\n\nTo get my list that contains all the URLS, I could take advantage of uniform\nway they were listed. In the Python variant of regular expressions, the powerful \ncombination of `.*?` will find any character, repeated any number of times, until it\nruns into something else. So searching a text for `My .*? dog` would grab all the word\nor words used between `My` and `dog`. In my case, I wanted to extract all the instances\nof text that happend between a quotion mark and `PHP` followed by a quotation mark, so\nI could search for instances of `\".*?PHP\"` in the page's text. "
},
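{
"cell_type": "markdown",
"source": "As a quick, made-up illustration of that non-greedy `.*?` behavior before turning to the real page: the sentence below is hypothetical and just shows how the pattern stops at the first `dog` after each `My`."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Hypothetical sentence, just to illustrate non-greedy matching with .*?\nimport re\nprint re.findall('My .*? dog', 'My lazy dog and My very loud dog')",
"language": "python",
"outputs": []
},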
{
"cell_type": "code",
"collapsed": false,
"input": "import urllib2\nimport re\n\nteams_html=urllib2.urlopen('http://www.laxpower.com/update13/binwom/rating01.php').read()\nteams=re.findall('\".*?PHP\"',teams_html)\nprint teams[:5]",
"language": "python",
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['\"XMADXX.PHP\"', '\"XUFLXX.PHP\"', '\"XNWSXX.PHP\"', '\"XSYRXX.PHP\"', '\"XUNCXX.PHP\"']"
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"source": "This is pretty good, but I don't want the quotation marks. I can be pickier \nabout what I extract by using parentheses, which instructs `re` to only\nreturn the stuff between parentheses."
},
{
"cell_type": "code",
"collapsed": false,
"input": "teams=re.findall('\"(.*?PHP)\"',teams_html)\nprint teams[:5]",
"language": "python",
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['XMADXX.PHP', 'XUFLXX.PHP', 'XNWSXX.PHP', 'XSYRXX.PHP', 'XUNCXX.PHP']"
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"source": "As I noted above, next to this is also the school's name. I can\nextract this as well by extending the `re` statement."
},
{
"cell_type": "code",
"collapsed": false,
"input": "teams=re.findall('\"(.*?PHP)\">(.*?)<',teams_html)\nprint teams[:5]",
"language": "python",
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[('XMADXX.PHP', 'Maryland'), ('XUFLXX.PHP', 'Florida'), ('XNWSXX.PHP', 'Northwestern'), ('XSYRXX.PHP', 'Syracuse'), ('XUNCXX.PHP', 'North Carolina')]"
}
],
"prompt_number": 18
},
{
"cell_type": "markdown",
"source": "Adding `>(.*?)<` had the effect of extending the search and returning everything between\nthe greater than and less than signs. This is returned as a list of tuples. Note that \nregular expressions are complicated and more times than not will return either nothing\nor the entire text of the document. Trial, error, and reading is the only way forward.\n\nI want to remove any duplicates by turning the returned list into \na set, and then back into a list."
},
{
"cell_type": "code",
"collapsed": false,
"input": "print len(teams)\nteams=list(set(teams))\nprint len(teams)",
"language": "python",
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "200\n100"
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"source": "I also want to store it in more useful format-I'll forget later on whether the team\nor the URL was first in the tuple."
},
{
"cell_type": "code",
"collapsed": false,
"input": "teams=[{'team id':t[0],'team name':t[1]} for t in teams]\nprint teams[:5]",
"language": "python",
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[{'team name': 'Lehigh', 'team id': 'XLEHXX.PHP'}, {'team name': 'Columbia', 'team id': 'XCMBXX.PHP'}, {'team name': 'Boston University', 'team id': 'XBOUXX.PHP'}, {'team name': 'Princeton', 'team id': 'XPRIXX.PHP'}, {'team name': 'Quinnipiac', 'team id': 'XQUIXX.PHP'}]"
}
],
"prompt_number": 20
},
{
"cell_type": "markdown",
"source": "Now that I know all the teams and where do get information about them, I want to go \nto each of those pages and get the information about each game-who,when,where, and if\nit has already been played, what the score was. A quick look at the source\nfor a [page](http://www.laxpower.com/update13/binwom/XMADXX.PHP) shows that information\nis stored in an HTML table. This is good news, and other ways of presenting data \non the page can be hard to get, and others, such as those displayed using Flash,\ncan be impossible.\n\nI'm going to use the `BeautifulSoup` module to help parse the HTML. Regular expressions\ncan get you pretty far, but modules like `BeautifulSoup` can save you a lot of time. They\nare much easier to use if you already know things like what a DOM element is, but are still\nusable for those who don't code web pages. \n\nAfter downloading, opening, and soupifying the page (see the function below), you can extract the table\nwith a simple `table = soup.find(\"table\")'. `rows=table.find_all(\"tr\")` will identify each of the\nrows, although you might not want the first or last rows, depending on how the information\nis presented. Within in each row, you can extract a list of the cells with \n`cell=row.findAll('td')'. Another powerful feature of `BeautifulSoup` is that in\ncan get rid of the HTML formatting with `.get_text()' which is a lot more efficient \nthan a complicated regular expression, which might not alwasy work.\n\nMy `get_team_page' function downloads the page and then extracts the contents of\nall the informative rows of the table and returns them as a list of lists. In \nretrospect, this should probably be two functions, with one that just reads and returns\nthe contents of the table. That would be a function that could be useful in other contexts."
},
{
"cell_type": "code",
"collapsed": true,
"input": "from bs4 import BeautifulSoup\n\ndef get_team_page(team):\n team_url='http://www.laxpower.com/update13/binwom/%s?tab=detail' % team['team id']\n team_html=urllib2.urlopen(team_url).read()\n soup = BeautifulSoup(team_html.decode('utf-8', 'ignore'))\n table = soup.find(\"table\")\n rows=[]\n for row in table.find_all(\"tr\")[3:-1]:\n data=[d.get_text().replace(u'\\xa0\\x96','') for d in row.findAll('td')]\n outline=[team['team name']]+data[:5]\n rows.append([i.encode('utf-8') for i in outline])\n return rows ",
"language": "python",
"outputs": [],
"prompt_number": 21
},
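{
"cell_type": "markdown",
"source": "Here is a rough, untested sketch of that two-function split. The names `read_table_rows` and `get_team_page_v2` are just placeholders; the version actually used in the rest of this notebook is the `get_team_page` function above."
},
{
"cell_type": "code",
"collapsed": true,
"input": "# Sketch only: a generic table reader plus a thin Laxpower-specific wrapper.\ndef read_table_rows(url):\n    # Download a page and return the text of every cell in its first table, one list per row.\n    html = urllib2.urlopen(url).read()\n    soup = BeautifulSoup(html.decode('utf-8', 'ignore'))\n    table = soup.find('table')\n    return [[cell.get_text() for cell in row.findAll('td')] for row in table.find_all('tr')]\n\ndef get_team_page_v2(team):\n    # Laxpower-specific cleanup: build the URL, drop the header and footer rows,\n    # strip the stray characters, and prepend the team name.\n    url = 'http://www.laxpower.com/update13/binwom/%s?tab=detail' % team['team id']\n    rows = []\n    for data in read_table_rows(url)[3:-1]:\n        data = [d.replace(u'\xa0\x96', '') for d in data]\n        outline = [team['team name']] + data[:5]\n        rows.append([i.encode('utf-8') for i in outline])\n    return rows",
"language": "python",
"outputs": []
},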
{
"cell_type": "markdown",
"source": "For maximum flexiblity, I want to output all the data to a tab separated file. I\ndo this with the `csv' module. The `\\t` tells the writer to use a tab instead\nof the default comma between items."
},
{
"cell_type": "code",
"collapsed": true,
"input": "import csv\noutfile=csv.writer(open('lax_13.tsv','wb'),delimiter='\\t')",
"language": "python",
"outputs": []
},
{
"cell_type": "markdown",
"source": "In order to be polite to the website, I want to pause a second between\neach page. Generally, I try to save the contents of the page locally\nso that I only have to download it once. In this case, I'll be running it\neveryday and I want the most recent results, so I'm not going to save each\npage. Additionally, the function above will crash if the web server is down\nor any other sort of HTML error. A better function would put the `urllib2.urlopen()` \nin a `try:` so that it can skip over those pages (if you think that is acceptable. \nOtherwise, you might have it so that if it can't download the page, it loads up\nthe most recent locally saved version. All depends on what the data is and\nwhat you want to do with it.) \n\nI have a `print` statement in the loop that goes through each team, downloads the page,\nreturns the table, and then writes the results to file so that I can watch it go. It takes\nabout two minutes because of the `sleep(1)` pause, when all is working, and I \nlike to make sure it isn't caught on anythig."
},
{
"cell_type": "code",
"collapsed": true,
"input": "from time import sleep\n\nfor team in teams:\n print team['team name']\n rows=get_team_page(team)\n outfile.writerows(rows)\n sleep(1)",
"language": "python",
"outputs": []
},
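{
"cell_type": "markdown",
"source": "And here is a minimal sketch of the more forgiving loop mentioned above, assuming that skipping a team whose page fails to download is acceptable. It is not the version that produced the results; it just shows where the `try:` would go."
},
{
"cell_type": "code",
"collapsed": true,
"input": "# Sketch only: skip a team if its page can't be downloaded instead of crashing the whole run.\nfor team in teams:\n    try:\n        rows = get_team_page(team)\n    except (urllib2.URLError, IOError):\n        print 'Could not download page for', team['team name']\n        continue\n    outfile.writerows(rows)\n    sleep(1)",
"language": "python",
"outputs": []
},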
{
"cell_type": "markdown",
"source": "In part II, I'll describe the power ranking and prediction models."
}
]
}
]
}