Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Created on Cognitive Class Labs
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"<a href=\"https://cognitiveclass.ai\"><img src = \"https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png\" width = 400> </a>\n",
"\n",
"<h1 align=center><font size = 5>Introduction to Matplotlib and Line Plots</font></h1>"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"## Introduction\n",
"\n",
"The aim of these labs is to introduce you to data visualization with Python as concrete and as consistent as possible. \n",
"Speaking of consistency, because there is no *best* data visualization library avaiblable for Python - up to creating these labs - we have to introduce different libraries and show their benefits when we are discussing new visualization concepts. Doing so, we hope to make students well-rounded with visualization libraries and concepts so that they are able to judge and decide on the best visualitzation technique and tool for a given problem _and_ audience.\n",
"\n",
"Please make sure that you have completed the prerequisites for this course, namely <a href='http://cocl.us/PY0101EN_DV0101EN_LAB1_Coursera'>**Python for Data Science**</a> and <a href='http://cocl.us/DA0101EN_DV0101EN_LAB1_Coursera'>**Data Analysis with Python**</a>, which are part of this specialization. \n",
"\n",
"**Note**: The majority of the plots and visualizations will be generated using data stored in *pandas* dataframes. Therefore, in this lab, we provide a brief crash course on *pandas*. However, if you are interested in learning more about the *pandas* library, detailed description and explanation of how to use it and how to clean, munge, and process data stored in a *pandas* dataframe are provided in our course <a href='http://cocl.us/DA0101EN_DV0101EN_LAB1_Coursera'>**Data Analysis with Python**</a>, which is also part of this specialization. \n",
"\n",
"------------"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"## Table of Contents\n",
"\n",
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
"\n",
"1. [Exploring Datasets with *pandas*](#0)<br>\n",
"1.1 [The Dataset: Immigration to Canada from 1980 to 2013](#2)<br>\n",
"1.2 [*pandas* Basics](#4) <br>\n",
"1.3 [*pandas* Intermediate: Indexing and Selection](#6) <br>\n",
"2. [Visualizing Data using Matplotlib](#8) <br>\n",
"2.1 [Matplotlib: Standard Python Visualization Library](#10) <br>\n",
"3. [Line Plots](#12)\n",
"</div>\n",
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"# Exploring Datasets with *pandas* <a id=\"0\"></a>\n",
"\n",
"*pandas* is an essential data analysis toolkit for Python. From their [website](http://pandas.pydata.org/):\n",
">*pandas* is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, **real world** data analysis in Python.\n",
"\n",
"The course heavily relies on *pandas* for data wrangling, analysis, and visualization. We encourage you to spend some time and familizare yourself with the *pandas* API Reference: http://pandas.pydata.org/pandas-docs/stable/api.html."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"## The Dataset: Immigration to Canada from 1980 to 2013 <a id=\"2\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Dataset Source: [International migration flows to and from selected countries - The 2015 revision](http://www.un.org/en/development/desa/population/migration/data/empirical2/migrationflows.shtml).\n",
"\n",
"The dataset contains annual data on the flows of international immigrants as recorded by the countries of destination. The data presents both inflows and outflows according to the place of birth, citizenship or place of previous / next residence both for foreigners and nationals. The current version presents data pertaining to 45 countries.\n",
"\n",
"In this lab, we will focus on the Canadian immigration data.\n",
"\n",
"<img src = \"https://ibm.box.com/shared/static/mb48k9fiylkd7z3a21cq38xxfy1wni2y.png\" align=\"center\" width=900>\n",
"\n",
"For sake of simplicity, Canada's immigration data has been extracted and uploaded to one of IBM servers. You can fetch the data from [here](https://ibm.box.com/shared/static/lw190pt9zpy5bd1ptyg2aw15awomz9pu.xlsx).\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"## *pandas* Basics<a id=\"4\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"The first thing we'll do is import two key data analysis modules: *pandas* and **Numpy**."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"button": false,
"collapsed": true,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
},
"scrolled": false
},
"outputs": [],
"source": [
"import numpy as np # useful for many scientific computing in Python\n",
"import pandas as pd # primary data structure library"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Let's download and import our primary Canadian Immigration dataset using *pandas* `read_excel()` method. Normally, before we can do that, we would need to download a module which *pandas* requires to read in excel files. This module is **xlrd**. For your convenience, we have pre-installed this module, so you would not have to worry about that. Otherwise, you would need to run the following line of code to install the **xlrd** module:\n",
"```\n",
"!conda install -c anaconda xlrd --yes\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Now we are ready to read in our data."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
},
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data read into a pandas dataframe!\n"
]
}
],
"source": [
"df_can = pd.read_excel('https://ibm.box.com/shared/static/lw190pt9zpy5bd1ptyg2aw15awomz9pu.xlsx',\n",
" sheet_name='Canada by Citizenship',\n",
" skiprows=range(20),\n",
" skipfooter=2)\n",
"\n",
"print ('Data read into a pandas dataframe!')"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Let's view the top 5 rows of the dataset using the `head()` function."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Type</th>\n",
" <th>Coverage</th>\n",
" <th>OdName</th>\n",
" <th>AREA</th>\n",
" <th>AreaName</th>\n",
" <th>REG</th>\n",
" <th>RegName</th>\n",
" <th>DEV</th>\n",
" <th>DevName</th>\n",
" <th>1980</th>\n",
" <th>...</th>\n",
" <th>2004</th>\n",
" <th>2005</th>\n",
" <th>2006</th>\n",
" <th>2007</th>\n",
" <th>2008</th>\n",
" <th>2009</th>\n",
" <th>2010</th>\n",
" <th>2011</th>\n",
" <th>2012</th>\n",
" <th>2013</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Immigrants</td>\n",
" <td>Foreigners</td>\n",
" <td>Afghanistan</td>\n",
" <td>935</td>\n",
" <td>Asia</td>\n",
" <td>5501</td>\n",
" <td>Southern Asia</td>\n",
" <td>902</td>\n",
" <td>Developing regions</td>\n",
" <td>16</td>\n",
" <td>...</td>\n",
" <td>2978</td>\n",
" <td>3436</td>\n",
" <td>3009</td>\n",
" <td>2652</td>\n",
" <td>2111</td>\n",
" <td>1746</td>\n",
" <td>1758</td>\n",
" <td>2203</td>\n",
" <td>2635</td>\n",
" <td>2004</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Immigrants</td>\n",
" <td>Foreigners</td>\n",
" <td>Albania</td>\n",
" <td>908</td>\n",
" <td>Europe</td>\n",
" <td>925</td>\n",
" <td>Southern Europe</td>\n",
" <td>901</td>\n",
" <td>Developed regions</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>1450</td>\n",
" <td>1223</td>\n",
" <td>856</td>\n",
" <td>702</td>\n",
" <td>560</td>\n",
" <td>716</td>\n",
" <td>561</td>\n",
" <td>539</td>\n",
" <td>620</td>\n",
" <td>603</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Immigrants</td>\n",
" <td>Foreigners</td>\n",
" <td>Algeria</td>\n",
" <td>903</td>\n",
" <td>Africa</td>\n",
" <td>912</td>\n",
" <td>Northern Africa</td>\n",
" <td>902</td>\n",
" <td>Developing regions</td>\n",
" <td>80</td>\n",
" <td>...</td>\n",
" <td>3616</td>\n",
" <td>3626</td>\n",
" <td>4807</td>\n",
" <td>3623</td>\n",
" <td>4005</td>\n",
" <td>5393</td>\n",
" <td>4752</td>\n",
" <td>4325</td>\n",
" <td>3774</td>\n",
" <td>4331</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Immigrants</td>\n",
" <td>Foreigners</td>\n",
" <td>American Samoa</td>\n",
" <td>909</td>\n",
" <td>Oceania</td>\n",
" <td>957</td>\n",
" <td>Polynesia</td>\n",
" <td>902</td>\n",
" <td>Developing regions</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Immigrants</td>\n",
" <td>Foreigners</td>\n",
" <td>Andorra</td>\n",
" <td>908</td>\n",
" <td>Europe</td>\n",
" <td>925</td>\n",
" <td>Southern Europe</td>\n",
" <td>901</td>\n",
" <td>Developed regions</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 43 columns</p>\n",
"</div>"
],
"text/plain": [
" Type Coverage OdName AREA AreaName REG \\\n",
"0 Immigrants Foreigners Afghanistan 935 Asia 5501 \n",
"1 Immigrants Foreigners Albania 908 Europe 925 \n",
"2 Immigrants Foreigners Algeria 903 Africa 912 \n",
"3 Immigrants Foreigners American Samoa 909 Oceania 957 \n",
"4 Immigrants Foreigners Andorra 908 Europe 925 \n",
"\n",
" RegName DEV DevName 1980 ... 2004 2005 2006 \\\n",
"0 Southern Asia 902 Developing regions 16 ... 2978 3436 3009 \n",
"1 Southern Europe 901 Developed regions 1 ... 1450 1223 856 \n",
"2 Northern Africa 902 Developing regions 80 ... 3616 3626 4807 \n",
"3 Polynesia 902 Developing regions 0 ... 0 0 1 \n",
"4 Southern Europe 901 Developed regions 0 ... 0 0 1 \n",
"\n",
" 2007 2008 2009 2010 2011 2012 2013 \n",
"0 2652 2111 1746 1758 2203 2635 2004 \n",
"1 702 560 716 561 539 620 603 \n",
"2 3623 4005 5393 4752 4325 3774 4331 \n",
"3 0 0 0 0 0 0 0 \n",
"4 1 0 0 0 0 1 1 \n",
"\n",
"[5 rows x 43 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_can.head()\n",
"# tip: You can specify the number of rows you'd like to see as follows: df_can.head(10) "
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"We can also veiw the bottom 5 rows of the dataset using the `tail()` function."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Type</th>\n",
" <th>Coverage</th>\n",
" <th>OdName</th>\n",
" <th>AREA</th>\n",
" <th>AreaName</th>\n",
" <th>REG</th>\n",
" <th>RegName</th>\n",
" <th>DEV</th>\n",
" <th>DevName</th>\n",
" <th>1980</th>\n",
" <th>...</th>\n",
" <th>2004</th>\n",
" <th>2005</th>\n",
" <th>2006</th>\n",
" <th>2007</th>\n",
" <th>2008</th>\n",
" <th>2009</th>\n",
" <th>2010</th>\n",
" <th>2011</th>\n",
" <th>2012</th>\n",
" <th>2013</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>190</th>\n",
" <td>Immigrants</td>\n",
" <td>Foreigners</td>\n",
" <td>Viet Nam</td>\n",
" <td>935</td>\n",
" <td>Asia</td>\n",
" <td>920</td>\n",
" <td>South-Eastern Asia</td>\n",
" <td>902</td>\n",
" <td>Developing regions</td>\n",
" <td>1191</td>\n",
" <td>...</td>\n",
" <td>1816</td>\n",
" <td>1852</td>\n",
" <td>3153</td>\n",
" <td>2574</td>\n",
" <td>1784</td>\n",
" <td>2171</td>\n",
" <td>1942</td>\n",
" <td>1723</td>\n",
" <td>1731</td>\n",
" <td>2112</td>\n",
" </tr>\n",
" <tr>\n",
" <th>191</th>\n",
" <td>Immigrants</td>\n",
" <td>Foreigners</td>\n",
" <td>Western Sahara</td>\n",
" <td>903</td>\n",
" <td>Africa</td>\n",
" <td>912</td>\n",
" <td>Northern Africa</td>\n",
" <td>902</td>\n",
" <td>Developing regions</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192</th>\n",
" <td>Immigrants</td>\n",
" <td>Foreigners</td>\n",
" <td>Yemen</td>\n",
" <td>935</td>\n",
" <td>Asia</td>\n",
" <td>922</td>\n",
" <td>Western Asia</td>\n",
" <td>902</td>\n",
" <td>Developing regions</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>124</td>\n",
" <td>161</td>\n",
" <td>140</td>\n",
" <td>122</td>\n",
" <td>133</td>\n",
" <td>128</td>\n",
" <td>211</td>\n",
" <td>160</td>\n",
" <td>174</td>\n",
" <td>217</td>\n",
" </tr>\n",
" <tr>\n",
" <th>193</th>\n",
" <td>Immigrants</td>\n",
" <td>Foreigners</td>\n",
" <td>Zambia</td>\n",
" <td>903</td>\n",
" <td>Africa</td>\n",
" <td>910</td>\n",
" <td>Eastern Africa</td>\n",
" <td>902</td>\n",
" <td>Developing regions</td>\n",
" <td>11</td>\n",
" <td>...</td>\n",
" <td>56</td>\n",
" <td>91</td>\n",
" <td>77</td>\n",
" <td>71</td>\n",
" <td>64</td>\n",
" <td>60</td>\n",
" <td>102</td>\n",
" <td>69</td>\n",
" <td>46</td>\n",
" <td>59</td>\n",
" </tr>\n",
" <tr>\n",
" <th>194</th>\n",
" <td>Immigrants</td>\n",
" <td>Foreigners</td>\n",
" <td>Zimbabwe</td>\n",
" <td>903</td>\n",
" <td>Africa</td>\n",
" <td>910</td>\n",
" <td>Eastern Africa</td>\n",
" <td>902</td>\n",
" <td>Developing regions</td>\n",
" <td>72</td>\n",
" <td>...</td>\n",
" <td>1450</td>\n",
" <td>615</td>\n",
" <td>454</td>\n",
" <td>663</td>\n",
" <td>611</td>\n",
" <td>508</td>\n",
" <td>494</td>\n",
" <td>434</td>\n",
" <td>437</td>\n",
" <td>407</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 43 columns</p>\n",
"</div>"
],
"text/plain": [
" Type Coverage OdName AREA AreaName REG \\\n",
"190 Immigrants Foreigners Viet Nam 935 Asia 920 \n",
"191 Immigrants Foreigners Western Sahara 903 Africa 912 \n",
"192 Immigrants Foreigners Yemen 935 Asia 922 \n",
"193 Immigrants Foreigners Zambia 903 Africa 910 \n",
"194 Immigrants Foreigners Zimbabwe 903 Africa 910 \n",
"\n",
" RegName DEV DevName 1980 ... 2004 2005 2006 \\\n",
"190 South-Eastern Asia 902 Developing regions 1191 ... 1816 1852 3153 \n",
"191 Northern Africa 902 Developing regions 0 ... 0 0 1 \n",
"192 Western Asia 902 Developing regions 1 ... 124 161 140 \n",
"193 Eastern Africa 902 Developing regions 11 ... 56 91 77 \n",
"194 Eastern Africa 902 Developing regions 72 ... 1450 615 454 \n",
"\n",
" 2007 2008 2009 2010 2011 2012 2013 \n",
"190 2574 1784 2171 1942 1723 1731 2112 \n",
"191 0 0 0 0 0 0 0 \n",
"192 122 133 128 211 160 174 217 \n",
"193 71 64 60 102 69 46 59 \n",
"194 663 611 508 494 434 437 407 \n",
"\n",
"[5 rows x 43 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_can.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"When analyzing a dataset, it's always a good idea to start by getting basic information about your dataframe. We can do this by using the `info()` method."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
},
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 195 entries, 0 to 194\n",
"Data columns (total 43 columns):\n",
"Type 195 non-null object\n",
"Coverage 195 non-null object\n",
"OdName 195 non-null object\n",
"AREA 195 non-null int64\n",
"AreaName 195 non-null object\n",
"REG 195 non-null int64\n",
"RegName 195 non-null object\n",
"DEV 195 non-null int64\n",
"DevName 195 non-null object\n",
"1980 195 non-null int64\n",
"1981 195 non-null int64\n",
"1982 195 non-null int64\n",
"1983 195 non-null int64\n",
"1984 195 non-null int64\n",
"1985 195 non-null int64\n",
"1986 195 non-null int64\n",
"1987 195 non-null int64\n",
"1988 195 non-null int64\n",
"1989 195 non-null int64\n",
"1990 195 non-null int64\n",
"1991 195 non-null int64\n",
"1992 195 non-null int64\n",
"1993 195 non-null int64\n",
"1994 195 non-null int64\n",
"1995 195 non-null int64\n",
"1996 195 non-null int64\n",
"1997 195 non-null int64\n",
"1998 195 non-null int64\n",
"1999 195 non-null int64\n",
"2000 195 non-null int64\n",
"2001 195 non-null int64\n",
"2002 195 non-null int64\n",
"2003 195 non-null int64\n",
"2004 195 non-null int64\n",
"2005 195 non-null int64\n",
"2006 195 non-null int64\n",
"2007 195 non-null int64\n",
"2008 195 non-null int64\n",
"2009 195 non-null int64\n",
"2010 195 non-null int64\n",
"2011 195 non-null int64\n",
"2012 195 non-null int64\n",
"2013 195 non-null int64\n",
"dtypes: int64(37), object(6)\n",
"memory usage: 65.6+ KB\n"
]
}
],
"source": [
"df_can.info()"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"To get the list of column headers we can call upon the dataframe's `.columns` parameter."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
},
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"array(['Type', 'Coverage', 'OdName', 'AREA', 'AreaName', 'REG', 'RegName',\n",
" 'DEV', 'DevName', 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,\n",
" 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,\n",
" 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,\n",
" 2010, 2011, 2012, 2013], dtype=object)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_can.columns.values "
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Similarly, to get the list of indicies we use the `.index` parameter."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,\n",
" 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,\n",
" 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,\n",
" 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,\n",
" 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,\n",
" 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,\n",
" 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,\n",
" 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103,\n",
" 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,\n",
" 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,\n",
" 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,\n",
" 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,\n",
" 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,\n",
" 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,\n",
" 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_can.index.values"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Note: The default type of index and columns is NOT list."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
},
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.indexes.base.Index'>\n",
"<class 'pandas.core.indexes.range.RangeIndex'>\n"
]
}
],
"source": [
"print(type(df_can.columns))\n",
"print(type(df_can.index))"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"To get the index and columns as lists, we can use the `tolist()` method."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
},
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'list'>\n",
"<class 'list'>\n"
]
}
],
"source": [
"df_can.columns.tolist()\n",
"df_can.index.tolist()\n",
"\n",
"print (type(df_can.columns.tolist()))\n",
"print (type(df_can.index.tolist()))"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"To view the dimensions of the dataframe, we use the `.shape` parameter."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
},
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"(195, 43)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# size of dataframe (rows, columns)\n",
"df_can.shape "
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Note: The main types stored in *pandas* objects are *float*, *int*, *bool*, *datetime64[ns]* and *datetime64[ns, tz] (in >= 0.17.0)*, *timedelta[ns]*, *category (in >= 0.15.0)*, and *object* (string). In addition these dtypes have item sizes, e.g. int64 and int32. "
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Let's clean the data set to remove a few unnecessary columns. We can use *pandas* `drop()` method as follows:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
},
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>OdName</th>\n",
" <th>AreaName</th>\n",
" <th>RegName</th>\n",
" <th>DevName</th>\n",
" <th>1980</th>\n",
" <th>1981</th>\n",
" <th>1982</th>\n",
" <th>1983</th>\n",
" <th>1984</th>\n",
" <th>1985</th>\n",
" <th>...</th>\n",
" <th>2004</th>\n",
" <th>2005</th>\n",
" <th>2006</th>\n",
" <th>2007</th>\n",
" <th>2008</th>\n",
" <th>2009</th>\n",
" <th>2010</th>\n",
" <th>2011</th>\n",
" <th>2012</th>\n",
" <th>2013</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Afghanistan</td>\n",
" <td>Asia</td>\n",
" <td>Southern Asia</td>\n",
" <td>Developing regions</td>\n",
" <td>16</td>\n",
" <td>39</td>\n",
" <td>39</td>\n",
" <td>47</td>\n",
" <td>71</td>\n",
" <td>340</td>\n",
" <td>...</td>\n",
" <td>2978</td>\n",
" <td>3436</td>\n",
" <td>3009</td>\n",
" <td>2652</td>\n",
" <td>2111</td>\n",
" <td>1746</td>\n",
" <td>1758</td>\n",
" <td>2203</td>\n",
" <td>2635</td>\n",
" <td>2004</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Albania</td>\n",
" <td>Europe</td>\n",
" <td>Southern Europe</td>\n",
" <td>Developed regions</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>1450</td>\n",
" <td>1223</td>\n",
" <td>856</td>\n",
" <td>702</td>\n",
" <td>560</td>\n",
" <td>716</td>\n",
" <td>561</td>\n",
" <td>539</td>\n",
" <td>620</td>\n",
" <td>603</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2 rows × 38 columns</p>\n",
"</div>"
],
"text/plain": [
" OdName AreaName RegName DevName 1980 1981 \\\n",
"0 Afghanistan Asia Southern Asia Developing regions 16 39 \n",
"1 Albania Europe Southern Europe Developed regions 1 0 \n",
"\n",
" 1982 1983 1984 1985 ... 2004 2005 2006 2007 2008 2009 2010 \\\n",
"0 39 47 71 340 ... 2978 3436 3009 2652 2111 1746 1758 \n",
"1 0 0 0 0 ... 1450 1223 856 702 560 716 561 \n",
"\n",
" 2011 2012 2013 \n",
"0 2203 2635 2004 \n",
"1 539 620 603 \n",
"\n",
"[2 rows x 38 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# in pandas axis=0 represents rows (default) and axis=1 represents columns.\n",
"df_can.drop(['AREA','REG','DEV','Type','Coverage'], axis=1, inplace=True)\n",
"df_can.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Let's rename the columns so that they make sense. We can use `rename()` method by passing in a dictionary of old and new names as follows:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
},
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index([ 'Country', 'Continent', 'Region', 'DevName', 1980,\n",
" 1981, 1982, 1983, 1984, 1985,\n",
" 1986, 1987, 1988, 1989, 1990,\n",
" 1991, 1992, 1993, 1994, 1995,\n",
" 1996, 1997, 1998, 1999, 2000,\n",
" 2001, 2002, 2003, 2004, 2005,\n",
" 2006, 2007, 2008, 2009, 2010,\n",
" 2011, 2012, 2013],\n",
" dtype='object')"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent', 'RegName':'Region'}, inplace=True)\n",
"df_can.columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"We will also add a 'Total' column that sums up the total immigrants by country over the entire period 1980 - 2013, as follows:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"button": false,
"collapsed": true,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
},
"scrolled": false
},
"outputs": [],
"source": [
"df_can['Total'] = df_can.sum(axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Country</th>\n",
" <th>Continent</th>\n",
" <th>Region</th>\n",
" <th>DevName</th>\n",
" <th>1980</th>\n",
" <th>1981</th>\n",
" <th>1982</th>\n",
" <th>1983</th>\n",
" <th>1984</th>\n",
" <th>1985</th>\n",
" <th>...</th>\n",
" <th>2005</th>\n",
" <th>2006</th>\n",
" <th>2007</th>\n",
" <th>2008</th>\n",
" <th>2009</th>\n",
" <th>2010</th>\n",
" <th>2011</th>\n",
" <th>2012</th>\n",
" <th>2013</th>\n",
" <th>Total</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Afghanistan</td>\n",
" <td>Asia</td>\n",
" <td>Southern Asia</td>\n",
" <td>Developing regions</td>\n",
" <td>16</td>\n",
" <td>39</td>\n",
" <td>39</td>\n",
" <td>47</td>\n",
" <td>71</td>\n",
" <td>340</td>\n",
" <td>...</td>\n",
" <td>3436</td>\n",
" <td>3009</td>\n",
" <td>2652</td>\n",
" <td>2111</td>\n",
" <td>1746</td>\n",
" <td>1758</td>\n",
" <td>2203</td>\n",
" <td>2635</td>\n",
" <td>2004</td>\n",
" <td>58639</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Albania</td>\n",
" <td>Europe</td>\n",
" <td>Southern Europe</td>\n",
" <td>Developed regions</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>1223</td>\n",
" <td>856</td>\n",
" <td>702</td>\n",
" <td>560</td>\n",
" <td>716</td>\n",
" <td>561</td>\n",
" <td>539</td>\n",
" <td>620</td>\n",
" <td>603</td>\n",
" <td>15699</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Algeria</td>\n",
" <td>Africa</td>\n",
" <td>Northern Africa</td>\n",
" <td>Developing regions</td>\n",
" <td>80</td>\n",
" <td>67</td>\n",
" <td>71</td>\n",
" <td>69</td>\n",
" <td>63</td>\n",
" <td>44</td>\n",
" <td>...</td>\n",
" <td>3626</td>\n",
" <td>4807</td>\n",
" <td>3623</td>\n",
" <td>4005</td>\n",
" <td>5393</td>\n",
" <td>4752</td>\n",
" <td>4325</td>\n",
" <td>3774</td>\n",
" <td>4331</td>\n",
" <td>69439</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>American Samoa</td>\n",
" <td>Oceania</td>\n",
" <td>Polynesia</td>\n",
" <td>Developing regions</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Andorra</td>\n",
" <td>Europe</td>\n",
" <td>Southern Europe</td>\n",
" <td>Developed regions</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>15</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 39 columns</p>\n",
"</div>"
],
"text/plain": [
" Country Continent Region DevName 1980 1981 \\\n",
"0 Afghanistan Asia Southern Asia Developing regions 16 39 \n",
"1 Albania Europe Southern Europe Developed regions 1 0 \n",
"2 Algeria Africa Northern Africa Developing regions 80 67 \n",
"3 American Samoa Oceania Polynesia Developing regions 0 1 \n",
"4 Andorra Europe Southern Europe Developed regions 0 0 \n",
"\n",
" 1982 1983 1984 1985 ... 2005 2006 2007 2008 2009 2010 2011 \\\n",
"0 39 47 71 340 ... 3436 3009 2652 2111 1746 1758 2203 \n",
"1 0 0 0 0 ... 1223 856 702 560 716 561 539 \n",
"2 71 69 63 44 ... 3626 4807 3623 4005 5393 4752 4325 \n",
"3 0 0 0 0 ... 0 1 0 0 0 0 0 \n",
"4 0 0 0 0 ... 0 1 1 0 0 0 0 \n",
"\n",
" 2012 2013 Total \n",
"0 2635 2004 58639 \n",
"1 620 603 15699 \n",
"2 3774 4331 69439 \n",
"3 0 0 6 \n",
"4 1 1 15 \n",
"\n",
"[5 rows x 39 columns]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_can.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"We can check to see how many null objects we have in the dataset as follows:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Country 0\n",
"Continent 0\n",
"Region 0\n",
"DevName 0\n",
"1980 0\n",
"1981 0\n",
"1982 0\n",
"1983 0\n",
"1984 0\n",
"1985 0\n",
"1986 0\n",
"1987 0\n",
"1988 0\n",
"1989 0\n",
"1990 0\n",
"1991 0\n",
"1992 0\n",
"1993 0\n",
"1994 0\n",
"1995 0\n",
"1996 0\n",
"1997 0\n",
"1998 0\n",
"1999 0\n",
"2000 0\n",
"2001 0\n",
"2002 0\n",
"2003 0\n",
"2004 0\n",
"2005 0\n",
"2006 0\n",
"2007 0\n",
"2008 0\n",
"2009 0\n",
"2010 0\n",
"2011 0\n",
"2012 0\n",
"2013 0\n",
"Total 0\n",
"dtype: int64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_can.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Finally, let's view a quick summary of each column in our dataframe using the `describe()` method."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
},
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Country</th>\n",
" <th>Continent</th>\n",
" <th>Region</th>\n",
" <th>DevName</th>\n",
" <th>1980</th>\n",
" <th>1981</th>\n",
" <th>1982</th>\n",
" <th>1983</th>\n",
" <th>1984</th>\n",
" <th>1985</th>\n",
" <th>...</th>\n",
" <th>2005</th>\n",
" <th>2006</th>\n",
" <th>2007</th>\n",
" <th>2008</th>\n",
" <th>2009</th>\n",
" <th>2010</th>\n",
" <th>2011</th>\n",
" <th>2012</th>\n",
" <th>2013</th>\n",
" <th>Total</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>195</td>\n",
" <td>195</td>\n",
" <td>195</td>\n",
" <td>195</td>\n",
" <td>195.000000</td>\n",
" <td>195.000000</td>\n",
" <td>195.000000</td>\n",
" <td>195.000000</td>\n",
" <td>195.000000</td>\n",
" <td>195.000000</td>\n",
" <td>...</td>\n",
" <td>195.000000</td>\n",
" <td>195.000000</td>\n",
" <td>195.000000</td>\n",
" <td>195.000000</td>\n",
" <td>195.000000</td>\n",
" <td>195.000000</td>\n",
" <td>195.000000</td>\n",
" <td>195.000000</td>\n",
" <td>195.000000</td>\n",
" <td>195.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>195</td>\n",
" <td>6</td>\n",
" <td>22</td>\n",
" <td>2</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>Albania</td>\n",
" <td>Africa</td>\n",
" <td>Western Asia</td>\n",
" <td>Developing regions</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>1</td>\n",
" <td>54</td>\n",
" <td>18</td>\n",
" <td>147</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>508.394872</td>\n",
" <td>566.989744</td>\n",
" <td>534.723077</td>\n",
" <td>387.435897</td>\n",
" <td>376.497436</td>\n",
" <td>358.861538</td>\n",
" <td>...</td>\n",
" <td>1320.292308</td>\n",
" <td>1266.958974</td>\n",
" <td>1191.820513</td>\n",
" <td>1246.394872</td>\n",
" <td>1275.733333</td>\n",
" <td>1420.287179</td>\n",
" <td>1262.533333</td>\n",
" <td>1313.958974</td>\n",
" <td>1320.702564</td>\n",
" <td>32867.451282</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1949.588546</td>\n",
" <td>2152.643752</td>\n",
" <td>1866.997511</td>\n",
" <td>1204.333597</td>\n",
" <td>1198.246371</td>\n",
" <td>1079.309600</td>\n",
" <td>...</td>\n",
" <td>4425.957828</td>\n",
" <td>3926.717747</td>\n",
" <td>3443.542409</td>\n",
" <td>3694.573544</td>\n",
" <td>3829.630424</td>\n",
" <td>4462.946328</td>\n",
" <td>4030.084313</td>\n",
" <td>4247.555161</td>\n",
" <td>4237.951988</td>\n",
" <td>91785.498686</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>28.500000</td>\n",
" <td>25.000000</td>\n",
" <td>31.000000</td>\n",
" <td>31.000000</td>\n",
" <td>36.000000</td>\n",
" <td>40.500000</td>\n",
" <td>37.500000</td>\n",
" <td>42.500000</td>\n",
" <td>45.000000</td>\n",
" <td>952.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>13.000000</td>\n",
" <td>10.000000</td>\n",
" <td>11.000000</td>\n",
" <td>12.000000</td>\n",
" <td>13.000000</td>\n",
" <td>17.000000</td>\n",
" <td>...</td>\n",
" <td>210.000000</td>\n",
" <td>218.000000</td>\n",
" <td>198.000000</td>\n",
" <td>205.000000</td>\n",
" <td>214.000000</td>\n",
" <td>211.000000</td>\n",
" <td>179.000000</td>\n",
" <td>233.000000</td>\n",
" <td>213.000000</td>\n",
" <td>5018.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>251.500000</td>\n",
" <td>295.500000</td>\n",
" <td>275.000000</td>\n",
" <td>173.000000</td>\n",
" <td>181.000000</td>\n",
" <td>197.000000</td>\n",
" <td>...</td>\n",
" <td>832.000000</td>\n",
" <td>842.000000</td>\n",
" <td>899.000000</td>\n",
" <td>934.500000</td>\n",
" <td>888.000000</td>\n",
" <td>932.000000</td>\n",
" <td>772.000000</td>\n",
" <td>783.000000</td>\n",
" <td>796.000000</td>\n",
" <td>22239.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>22045.000000</td>\n",
" <td>24796.000000</td>\n",
" <td>20620.000000</td>\n",
" <td>10015.000000</td>\n",
" <td>10170.000000</td>\n",
" <td>9564.000000</td>\n",
" <td>...</td>\n",
" <td>42584.000000</td>\n",
" <td>33848.000000</td>\n",
" <td>28742.000000</td>\n",
" <td>30037.000000</td>\n",
" <td>29622.000000</td>\n",
" <td>38617.000000</td>\n",
" <td>36765.000000</td>\n",
" <td>34315.000000</td>\n",
" <td>34129.000000</td>\n",
" <td>691904.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>11 rows × 39 columns</p>\n",
"</div>"
],
"text/plain": [
" Country Continent Region DevName 1980 \\\n",
"count 195 195 195 195 195.000000 \n",
"unique 195 6 22 2 NaN \n",
"top Albania Africa Western Asia Developing regions NaN \n",
"freq 1 54 18 147 NaN \n",
"mean NaN NaN NaN NaN 508.394872 \n",
"std NaN NaN NaN NaN 1949.588546 \n",
"min NaN NaN NaN NaN 0.000000 \n",
"25% NaN NaN NaN NaN 0.000000 \n",
"50% NaN NaN NaN NaN 13.000000 \n",
"75% NaN NaN NaN NaN 251.500000 \n",
"max NaN NaN NaN NaN 22045.000000 \n",
"\n",
" 1981 1982 1983 1984 1985 \\\n",
"count 195.000000 195.000000 195.000000 195.000000 195.000000 \n",
"unique NaN NaN NaN NaN NaN \n",
"top NaN NaN NaN NaN NaN \n",
"freq NaN NaN NaN NaN NaN \n",
"mean 566.989744 534.723077 387.435897 376.497436 358.861538 \n",
"std 2152.643752 1866.997511 1204.333597 1198.246371 1079.309600 \n",
"min 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
"25% 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
"50% 10.000000 11.000000 12.000000 13.000000 17.000000 \n",
"75% 295.500000 275.000000 173.000000 181.000000 197.000000 \n",
"max 24796.000000 20620.000000 10015.000000 10170.000000 9564.000000 \n",
"\n",
" ... 2005 2006 2007 2008 \\\n",
"count ... 195.000000 195.000000 195.000000 195.000000 \n",
"unique ... NaN NaN NaN NaN \n",
"top ... NaN NaN NaN NaN \n",
"freq ... NaN NaN NaN NaN \n",
"mean ... 1320.292308 1266.958974 1191.820513 1246.394872 \n",
"std ... 4425.957828 3926.717747 3443.542409 3694.573544 \n",
"min ... 0.000000 0.000000 0.000000 0.000000 \n",
"25% ... 28.500000 25.000000 31.000000 31.000000 \n",
"50% ... 210.000000 218.000000 198.000000 205.000000 \n",
"75% ... 832.000000 842.000000 899.000000 934.500000 \n",
"max ... 42584.000000 33848.000000 28742.000000 30037.000000 \n",
"\n",
" 2009 2010 2011 2012 2013 \\\n",
"count 195.000000 195.000000 195.000000 195.000000 195.000000 \n",
"unique NaN NaN NaN NaN NaN \n",
"top NaN NaN NaN NaN NaN \n",
"freq NaN NaN NaN NaN NaN \n",
"mean 1275.733333 1420.287179 1262.533333 1313.958974 1320.702564 \n",
"std 3829.630424 4462.946328 4030.084313 4247.555161 4237.951988 \n",
"min 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
"25% 36.000000 40.500000 37.500000 42.500000 45.000000 \n",
"50% 214.000000 211.000000 179.000000 233.000000 213.000000 \n",
"75% 888.000000 932.000000 772.000000 783.000000 796.000000 \n",
"max 29622.000000 38617.000000 36765.000000 34315.000000 34129.000000 \n",
"\n",
" Total \n",
"count 195.000000 \n",
"unique NaN \n",
"top NaN \n",
"freq NaN \n",
"mean 32867.451282 \n",
"std 91785.498686 \n",
"min 1.000000 \n",
"25% 952.000000 \n",
"50% 5018.000000 \n",
"75% 22239.500000 \n",
"max 691904.000000 \n",
"\n",
"[11 rows x 39 columns]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_can.describe(include='all')"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"---\n",
"## *pandas* Intermediate: Indexing and Selection (slicing)<a id=\"6\"></a>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"### Select Column\n",
"**There are two ways to filter on a column name:**\n",
"\n",
"Method 1: Quick and easy, but only works if the column name does NOT have spaces or special characters.\n",
"```python\n",
" df.column_name \n",
" (returns series)\n",
"```\n",
"\n",
"Method 2: More robust, and can filter on multiple columns.\n",
"\n",
"```python\n",
" df['column'] \n",
" (returns series)\n",
"```\n",
"\n",
"```python \n",
" df[['column 1', 'column 2']] \n",
" (returns dataframe)\n",
"```\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Example: Let's try filtering on the list of countries ('Country')."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"0 Afghanistan\n",
"1 Albania\n",
"2 Algeria\n",
"3 American Samoa\n",
"4 Andorra\n",
"5 Angola\n",
"6 Antigua and Barbuda\n",
"7 Argentina\n",
"8 Armenia\n",
"9 Australia\n",
"10 Austria\n",
"11 Azerbaijan\n",
"12 Bahamas\n",
"13 Bahrain\n",
"14 Bangladesh\n",
"15 Barbados\n",
"16 Belarus\n",
"17 Belgium\n",
"18 Belize\n",
"19 Benin\n",
"20 Bhutan\n",
"21 Bolivia (Plurinational State of)\n",
"22 Bosnia and Herzegovina\n",
"23 Botswana\n",
"24 Brazil\n",
"25 Brunei Darussalam\n",
"26 Bulgaria\n",
"27 Burkina Faso\n",
"28 Burundi\n",
"29 Cabo Verde\n",
" ... \n",
"165 Suriname\n",
"166 Swaziland\n",
"167 Sweden\n",
"168 Switzerland\n",
"169 Syrian Arab Republic\n",
"170 Tajikistan\n",
"171 Thailand\n",
"172 The former Yugoslav Republic of Macedonia\n",
"173 Togo\n",
"174 Tonga\n",
"175 Trinidad and Tobago\n",
"176 Tunisia\n",
"177 Turkey\n",
"178 Turkmenistan\n",
"179 Tuvalu\n",
"180 Uganda\n",
"181 Ukraine\n",
"182 United Arab Emirates\n",
"183 United Kingdom of Great Britain and Northern I...\n",
"184 United Republic of Tanzania\n",
"185 United States of America\n",
"186 Uruguay\n",
"187 Uzbekistan\n",
"188 Vanuatu\n",
"189 Venezuela (Bolivarian Republic of)\n",
"190 Viet Nam\n",
"191 Western Sahara\n",
"192 Yemen\n",
"193 Zambia\n",
"194 Zimbabwe\n",
"Name: Country, Length: 195, dtype: object"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_can.Country # returns a series"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,