thouis/Lab_2_A_Johanna.ipynb

## Lab_2_A_Johanna.ipynb

      
## Lab_2_A_Live.ipynb
{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import cs109style\n",
      "cs109style.customize_mpl()\n",
      "cs109style.customize_css()\n",
      "\n",
      "# special IPython command to prepare the notebook for matplotlib\n",
      "%matplotlib inline \n",
      "\n",
      "from collections import defaultdict\n",
      "\n",
      "import pandas as pd\n",
      "import matplotlib.pyplot as plt\n",
      "import requests\n",
      "from pattern import web\n",
      "\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Fetching population data from Wikipedia\n",
      "\n",
      "In this example we will fetch data about countries and their population from Wikipedia.\n",
      "\n",
      "http://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population has several tables covering individual countries and subcontinents for different years. We will combine the data for all countries and all years into a single pandas DataFrame and visualize how the population of different countries changes over time.\n",
      "\n",
      "### We will go through the following steps:\n",
      "* fetching HTML with embedded data\n",
      "* parsing the HTML to extract the data\n",
      "* collecting the data in a pandas DataFrame\n",
      "* displaying the data\n",
      "\n",
      "To give you some starting points for your homework, we will also show the different sub-steps that can be taken to reach the presented solution."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Fetching the Wikipedia site"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "url = 'http://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population'\n",
      "website_html = requests.get(url).text\n",
      "#print website_html"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Parsing html data"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def get_population_html_tables(html):\n",
      "    \"\"\"Parse html and return html tables of wikipedia population data.\"\"\"\n",
      "\n",
      "    dom = web.Element(html)\n",
      "\n",
      "    ### 0. step: look at html source!\n",
      "    \n",
      "    #### 1. step: get all tables\n",
      "\n",
      "    #### 2. step: get all tables we care about\n",
      "\n",
      "    return tbls\n",
      "\n",
      "tables = get_population_html_tables(website_html)\n",
      "print \"table length: %d\" %len(tables)\n",
      "for t in tables:\n",
      "    print t.attributes\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def table_type(tbl):\n",
      "    ### Extract the table type\n",
      "\n",
      "# group the tables by type\n",
      "tables_by_type = defaultdict(list)  # defaultdicts have a default value that is inserted when a new key is accessed\n",
      "for tbl in tables:\n",
      "    tables_by_type[table_type(tbl)].append(tbl)\n",
      "\n",
      "print tables_by_type"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Extracting data and filling it into a dictionary"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def get_countries_population(tables):\n",
      "    \"\"\"Extract population data for countries from all tables and store it in dictionary.\"\"\"\n",
      "    \n",
      "    result = defaultdict(dict)\n",
      "\n",
      "    # 1. step: try to extract data for a single table\n",
      "\n",
      "    # 2. step: iterate over all tables, extract headings and actual data and combine data into single dict\n",
      "    \n",
      "    return result\n",
      "\n",
      "\n",
      "result = get_countries_population(tables_by_type['Country or territory'])\n",
      "print result"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Creating a dataframe from a dictionary"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# create dataframe\n",
      "\n",
      "df = pd.DataFrame.from_dict(result, orient='index')\n",
      "# sort the columns by year\n",
      "df.sort_index(axis=1, inplace=True)\n",
      "print df\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Accessing data in a pandas DataFrame"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "subtable = df.iloc[0:2, 0:2]\n",
      "print \"subtable\"\n",
      "print subtable\n",
      "print \"\"\n",
      "\n",
      "column = df[1955]\n",
      "print \"column\"\n",
      "print column\n",
      "print \"\"\n",
      "\n",
      "row = df.iloc[0] # row 0, selected by position\n",
      "print \"row\"\n",
      "print row\n",
      "print \"\"\n",
      "\n",
      "rows = df.iloc[:2] # rows 0 and 1\n",
      "print \"rows\"\n",
      "print rows\n",
      "print \"\"\n",
      "\n",
      "element = df.loc[df.index[0], 1955] # element in row 0, column 1955\n",
      "print \"element\"\n",
      "print element\n",
      "print \"\"\n",
      "\n",
      "# max along column\n",
      "print \"max\"\n",
      "print df[1950].max()\n",
      "print \"\"\n",
      "\n",
      "# axes\n",
      "print \"axes\"\n",
      "print df.axes\n",
      "print \"\"\n",
      "\n",
      "row = df.iloc[0]\n",
      "print \"row info\"\n",
      "print row.name\n",
      "print row.index\n",
      "print \"\"\n",
      "\n",
      "countries = df.index\n",
      "print \"countries\"\n",
      "print countries\n",
      "print \"\"\n",
      "\n",
      "print \"Austria\"\n",
      "print df.loc['Austria'] # row selected by label"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Plotting population of 4 countries"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "plotCountries = ['Austria', 'Germany', 'United States', 'France']\n",
      "    \n",
      "for country in plotCountries:\n",
      "    row = df.loc[country]\n",
      "    plt.plot(row.index, row, label=row.name)\n",
      "    \n",
      "plt.ylim(bottom=0) # start y axis at 0\n",
      "\n",
      "plt.xticks(rotation=70)\n",
      "plt.legend(loc='best')\n",
      "plt.xlabel(\"Year\")\n",
      "plt.ylabel(\"# people (million)\")\n",
      "plt.title(\"Population of countries\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Plot the 5 most populous countries in 2010 and 2050"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def plot_populous(df, year):\n",
      "    # sort the table by the values in the given year's column, largest first\n",
      "    df_by_year = df.sort_values(year, ascending=False)\n",
      "    \n",
      "    plt.figure()\n",
      "    for i in range(5):\n",
      "        row = df_by_year.iloc[i]\n",
      "        plt.plot(row.index, row, label=row.name)\n",
      "            \n",
      "    plt.ylim(bottom=0)\n",
      "    \n",
      "    plt.xticks(rotation=70)\n",
      "    plt.legend(loc='best')\n",
      "    plt.xlabel(\"Year\")\n",
      "    plt.ylabel(\"# people (million)\")\n",
      "    plt.title(\"Most populous countries in %d\" % year)\n",
      "\n",
      "plot_populous(df, 2010)\n",
      "plot_populous(df, 2050)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}
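
The notebook above leaves the grouping and extraction steps open for the live session. As a minimal sketch of the two `defaultdict` patterns it sets up (`defaultdict(list)` to group tables by type, `defaultdict(dict)` to merge several year tables into one dictionary per country), the example below works on already-parsed rows instead of HTML. The `parsed_tables` structure, the helper names `to_int`, `group_by_type` and `merge_country_tables`, and all numbers are assumptions made for illustration; they are placeholders, not real population data and not the lab's own solution.

```python
from collections import defaultdict

# Hypothetical stand-in for parsed HTML tables: (table type, header row, data rows).
# All figures are placeholders, not real population data.
parsed_tables = [
    ("Country or territory", ["Country or territory", "1950", "1955"],
     [["Austria", "1,000", "1,100"], ["France", "2,000", "2,100"]]),
    ("Subcontinent", ["Subcontinent", "1950"],
     [["Europe", "9,000"]]),
    ("Country or territory", ["Country or territory", "1960"],
     [["Austria", "1,200"]]),
]


def to_int(text):
    """Convert a string such as '1,234' to the integer 1234."""
    return int(text.replace(",", ""))


def group_by_type(tables):
    """Group tables by their type; a missing key starts out as an empty list."""
    grouped = defaultdict(list)
    for table in tables:
        grouped[table[0]].append(table)
    return grouped


def merge_country_tables(tables):
    """Merge the year columns of all tables into one dict per country."""
    result = defaultdict(dict)
    for _table_type, header, rows in tables:
        # the first header cell names the row label, the rest are years
        years = [int(cell) for cell in header[1:]]
        for row in rows:
            country = row[0]
            values = [to_int(cell) for cell in row[1:]]
            # a missing country starts out as an empty dict
            result[country].update(zip(years, values))
    return result


grouped = group_by_type(parsed_tables)
result = merge_country_tables(grouped["Country or territory"])
```

A nested dictionary of this shape (country, then year, then value) is what the notebook passes to `pd.DataFrame.from_dict(result, orient='index')`, which yields one row per country and one column per year.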

## README

      
## cs109style.py

      
## custom.css

      
## Lab_2_A_Live_Ray_Final.ipynb

      
## Lab_2_B.ipynb

      
## Lab_2_B_Live.ipynb

      