Skip to content

Instantly share code, notes, and snippets.

@thouis
Last active December 22, 2015 22:39
Show Gist options
  • Save thouis/6541913 to your computer and use it in GitHub Desktop.
Save thouis/6541913 to your computer and use it in GitHub Desktop.
Lab 2
Display the source blob
Display the rendered blob
Raw
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "code",
"collapsed": false,
"input": [
"import cs109style\n",
"cs109style.customize_mpl()\n",
"cs109style.customize_css()\n",
"\n",
"# special IPython command to prepare the notebook for matplotlib\n",
"%matplotlib inline \n",
"\n",
"from collections import defaultdict\n",
"\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import requests\n",
"from pattern import web\n",
"\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fetching population data from Wikipedia\n",
"\n",
"In this example we will fetch data about countries and their population from Wikipedia.\n",
"\n",
"http://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population has several tables for individual countries, subcontinents as well as different years. We will combine the data for all countries and all years in a single panda dataframe and visualize the change in population for different countries.\n",
"\n",
"###We will go through the following steps:\n",
"* fetching html with embedded data\n",
"* parsing html to extract the data\n",
"* collecting the data in a panda dataframe\n",
"* displaying the data\n",
"\n",
"To give you some starting points for your homework, we will also show the different sub-steps that can be taken to reach the presented solution."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fetching the Wikipedia site"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"url = 'http://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population'\n",
"website_html = requests.get(url).text\n",
"#print website_html"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parsing html data"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def get_population_html_tables(html):\n",
" \"\"\"Parse html and return html tables of wikipedia population data.\"\"\"\n",
"\n",
" dom = web.Element(html)\n",
"\n",
" ### 0. step: look at html source!\n",
" \n",
" #### 1. step: get all tables\n",
"\n",
" #### 2. step: get all tables we care about\n",
"\n",
" return tbls\n",
"\n",
"tables = get_population_html_tables(website_html)\n",
"print \"table length: %d\" %len(tables)\n",
"for t in tables:\n",
" print t.attributes\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def table_type(tbl):\n",
" ### Extract the table type\n",
"\n",
"# group the tables by type\n",
"tables_by_type = defaultdict(list) # defaultdicts have a default value that is inserted when a new key is accessed\n",
"for tbl in tables:\n",
" tables_by_type[table_type(tbl)].append(tbl)\n",
"\n",
"print tables_by_type"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extracting data and filling it into a dictionary"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def get_countries_population(tables):\n",
" \"\"\"Extract population data for countries from all tables and store it in dictionary.\"\"\"\n",
" \n",
" result = defaultdict(dict)\n",
"\n",
" # 1. step: try to extract data for a single table\n",
"\n",
" # 2. step: iterate over all tables, extract headings and actual data and combine data into single dict\n",
" \n",
" return result\n",
"\n",
"\n",
"result = get_countries_population(tables_by_type['Country or territory'])\n",
"print result"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a dataframe from a dictionary"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# create dataframe\n",
"\n",
"df = pd.DataFrame.from_dict(result, orient='index')\n",
"# sort based on year\n",
"df.sort(axis=1,inplace=True)\n",
"print df\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Some data accessing functions for a panda dataframe"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"subtable = df.iloc[0:2, 0:2]\n",
"print \"subtable\"\n",
"print subtable\n",
"print \"\"\n",
"\n",
"column = df[1955]\n",
"print \"column\"\n",
"print column\n",
"print \"\"\n",
"\n",
"row = df.ix[0] #row 0\n",
"print \"row\"\n",
"print row\n",
"print \"\"\n",
"\n",
"rows = df.ix[:2] #rows 0,1\n",
"print \"rows\"\n",
"print rows\n",
"print \"\"\n",
"\n",
"element = df.ix[0,1955] #element\n",
"print \"element\"\n",
"print element\n",
"print \"\"\n",
"\n",
"# max along column\n",
"print \"max\"\n",
"print df[1950].max()\n",
"print \"\"\n",
"\n",
"# axes\n",
"print \"axes\"\n",
"print df.axes\n",
"print \"\"\n",
"\n",
"row = df.ix[0]\n",
"print \"row info\"\n",
"print row.name\n",
"print row.index\n",
"print \"\"\n",
"\n",
"countries = df.index\n",
"print \"countries\"\n",
"print countries\n",
"print \"\"\n",
"\n",
"print \"Austria\"\n",
"print df.ix['Austria']"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plotting population of 4 countries"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"plotCountries = ['Austria', 'Germany', 'United States', 'France']\n",
" \n",
"for country in plotCountries:\n",
" row = df.ix[country]\n",
" plt.plot(row.index, row, label=row.name ) \n",
" \n",
"plt.ylim(ymin=0) # start y axis at 0\n",
"\n",
"plt.xticks(rotation=70)\n",
"plt.legend(loc='best')\n",
"plt.xlabel(\"Year\")\n",
"plt.ylabel(\"# people (million)\")\n",
"plt.title(\"Population of countries\")"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plot 5 most populous countries from 2010 and 2060"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def plot_populous(df, year):\n",
" # sort table depending on data value in year column\n",
" df_by_year = df.sort(year, ascending=False)\n",
" \n",
" plt.figure()\n",
" for i in range(5): \n",
" row = df_by_year.ix[i]\n",
" plt.plot(row.index, row, label=row.name ) \n",
" \n",
" plt.ylim(ymin=0)\n",
" \n",
" plt.xticks(rotation=70)\n",
" plt.legend(loc='best')\n",
" plt.xlabel(\"Year\")\n",
" plt.ylabel(\"# people (million)\")\n",
" plt.title(\"Most populous countries in %d\" % year)\n",
"\n",
"plot_populous(df, 2010)\n",
"plot_populous(df, 2050)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
View raw

(Sorry about that, but we can’t show files that are this big right now.)

View raw

(Sorry about that, but we can’t show files that are this big right now.)

View raw

(Sorry about that, but we can’t show files that are this big right now.)

Display the source blob
Display the rendered blob
Raw
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Display the source blob
Display the rendered blob
Raw
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Display the source blob
Display the rendered blob
Raw
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment