Skip to content

Instantly share code, notes, and snippets.

@ulgens
Last active June 17, 2018 22:51
Show Gist options
  • Save ulgens/a967f9a971940c83a922f3f4cf450912 to your computer and use it in GitHub Desktop.
Save ulgens/a967f9a971940c83a922f3f4cf450912 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Agenda\n",
"- Python Basics\n",
"- Data Wrangling & Understanding with Pandas\n",
"- Data Visualizing with Matplotlib\n",
"- Model Building with Scikit-Learn"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Python Basics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### There are many ways to run Python code.\n",
"Command Line Interpreter \n",
" - Usually installed as /usr/local/bin/python\n",
"\n",
"Command Line File Execution \n",
" - Basic: SublimeText2, Atom, Notepad ++\n",
" - Advanced: Vim, Emacs \n",
"\n",
"IDE (Integrated Development Environments)\n",
" - Canopy\n",
" - PyCharm\n",
" - Spyder\n",
"\n",
"Jupyter Notebook (formerly iPython Notebook)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What is Jupyter Notebook?\n",
"It is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What are the key Python libraries for data science?\n",
"\n",
"1. **Pandas** – R-like dataframes, data cleaning, data munging, data exploration\n",
"\n",
"2. **Matplotlib** – plotting environment\n",
"\n",
"3. **NumPy** – fast arrays/matrices and vector operations\n",
"\n",
"4. **Scikit-learn** – maching learning, data mining\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import Python Libraries"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"\n",
"import matplotlib\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from sklearn.datasets import make_blobs\n",
"from sklearn.cross_validation import train_test_split\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.linear_model import LogisticRegression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas: Data Wrangling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. The Basics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data1 = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],\n",
" 'year': [2000, 2001, 2002, 2001, 2002],\n",
" 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]\n",
" }\n",
"df1 = pd.DataFrame(data1, index=['one', 'two', 'three', 'four', 'five'])"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>pop</th>\n",
" <th>state</th>\n",
" <th>year</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>one</th>\n",
" <td>1.5</td>\n",
" <td>Ohio</td>\n",
" <td>2000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>two</th>\n",
" <td>1.7</td>\n",
" <td>Ohio</td>\n",
" <td>2001</td>\n",
" </tr>\n",
" <tr>\n",
" <th>three</th>\n",
" <td>3.6</td>\n",
" <td>Ohio</td>\n",
" <td>2002</td>\n",
" </tr>\n",
" <tr>\n",
" <th>four</th>\n",
" <td>2.4</td>\n",
" <td>Nevada</td>\n",
" <td>2001</td>\n",
" </tr>\n",
" <tr>\n",
" <th>five</th>\n",
" <td>2.9</td>\n",
" <td>Nevada</td>\n",
" <td>2002</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" pop state year\n",
"one 1.5 Ohio 2000\n",
"two 1.7 Ohio 2001\n",
"three 3.6 Ohio 2002\n",
"four 2.4 Nevada 2001\n",
"five 2.9 Nevada 2002"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"one Ohio\n",
"two Ohio\n",
"three Ohio\n",
"four Nevada\n",
"five Nevada\n",
"Name: state, dtype: object"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1['state']"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"one Ohio\n",
"two Ohio\n",
"three Ohio\n",
"four Nevada\n",
"five Nevada\n",
"Name: state, dtype: object"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.state"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rows can also be retrieved by position or name by a couple of methods, such as the ix indexing field:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"pop 3.6\n",
"state Ohio\n",
"year 2002\n",
"Name: three, dtype: object"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.ix['three']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>pop</th>\n",
" <th>state</th>\n",
" <th>year</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>one</th>\n",
" <td>1.5</td>\n",
" <td>Ohio</td>\n",
" <td>2000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>two</th>\n",
" <td>1.7</td>\n",
" <td>Ohio</td>\n",
" <td>2001</td>\n",
" </tr>\n",
" <tr>\n",
" <th>three</th>\n",
" <td>3.6</td>\n",
" <td>Ohio</td>\n",
" <td>2002</td>\n",
" </tr>\n",
" <tr>\n",
" <th>four</th>\n",
" <td>2.4</td>\n",
" <td>Nevada</td>\n",
" <td>2001</td>\n",
" </tr>\n",
" <tr>\n",
" <th>five</th>\n",
" <td>2.9</td>\n",
" <td>Nevada</td>\n",
" <td>2002</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" pop state year\n",
"one 1.5 Ohio 2000\n",
"two 1.7 Ohio 2001\n",
"three 3.6 Ohio 2002\n",
"four 2.4 Nevada 2001\n",
"five 2.9 Nevada 2002"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df1['new_var'] = 16.5"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>pop</th>\n",
" <th>state</th>\n",
" <th>year</th>\n",
" <th>new_var</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>one</th>\n",
" <td>1.5</td>\n",
" <td>Ohio</td>\n",
" <td>2000</td>\n",
" <td>16.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>two</th>\n",
" <td>1.7</td>\n",
" <td>Ohio</td>\n",
" <td>2001</td>\n",
" <td>16.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>three</th>\n",
" <td>3.6</td>\n",
" <td>Ohio</td>\n",
" <td>2002</td>\n",
" <td>16.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>four</th>\n",
" <td>2.4</td>\n",
" <td>Nevada</td>\n",
" <td>2001</td>\n",
" <td>16.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>five</th>\n",
" <td>2.9</td>\n",
" <td>Nevada</td>\n",
" <td>2002</td>\n",
" <td>16.5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" pop state year new_var\n",
"one 1.5 Ohio 2000 16.5\n",
"two 1.7 Ohio 2001 16.5\n",
"three 3.6 Ohio 2002 16.5\n",
"four 2.4 Nevada 2001 16.5\n",
"five 2.9 Nevada 2002 16.5"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0., 1., 2., 3., 4.])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.arange(5.)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df1['new_var'] = np.arange(5.)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>pop</th>\n",
" <th>state</th>\n",
" <th>year</th>\n",
" <th>new_var</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>one</th>\n",
" <td>1.5</td>\n",
" <td>Ohio</td>\n",
" <td>2000</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>two</th>\n",
" <td>1.7</td>\n",
" <td>Ohio</td>\n",
" <td>2001</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>three</th>\n",
" <td>3.6</td>\n",
" <td>Ohio</td>\n",
" <td>2002</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>four</th>\n",
" <td>2.4</td>\n",
" <td>Nevada</td>\n",
" <td>2001</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>five</th>\n",
" <td>2.9</td>\n",
" <td>Nevada</td>\n",
" <td>2002</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" pop state year new_var\n",
"one 1.5 Ohio 2000 0\n",
"two 1.7 Ohio 2001 1\n",
"three 3.6 Ohio 2002 2\n",
"four 2.4 Nevada 2001 3\n",
"five 2.9 Nevada 2002 4"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, it will be instead conformed exactly to the DataFrame’s index, inserting missing values in any holes:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"two -1.2\n",
"four -1.5\n",
"five -1.7\n",
"dtype: float64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"val"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>pop</th>\n",
" <th>state</th>\n",
" <th>year</th>\n",
" <th>new_var</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>one</th>\n",
" <td>1.5</td>\n",
" <td>Ohio</td>\n",
" <td>2000</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>two</th>\n",
" <td>1.7</td>\n",
" <td>Ohio</td>\n",
" <td>2001</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>three</th>\n",
" <td>3.6</td>\n",
" <td>Ohio</td>\n",
" <td>2002</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>four</th>\n",
" <td>2.4</td>\n",
" <td>Nevada</td>\n",
" <td>2001</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>five</th>\n",
" <td>2.9</td>\n",
" <td>Nevada</td>\n",
" <td>2002</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" pop state year new_var\n",
"one 1.5 Ohio 2000 0\n",
"two 1.7 Ohio 2001 1\n",
"three 3.6 Ohio 2002 2\n",
"four 2.4 Nevada 2001 3\n",
"five 2.9 Nevada 2002 4"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df1['new_var'] = val"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>pop</th>\n",
" <th>state</th>\n",
" <th>year</th>\n",
" <th>new_var</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>one</th>\n",
" <td>1.5</td>\n",
" <td>Ohio</td>\n",
" <td>2000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>two</th>\n",
" <td>1.7</td>\n",
" <td>Ohio</td>\n",
" <td>2001</td>\n",
" <td>-1.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>three</th>\n",
" <td>3.6</td>\n",
" <td>Ohio</td>\n",
" <td>2002</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>four</th>\n",
" <td>2.4</td>\n",
" <td>Nevada</td>\n",
" <td>2001</td>\n",
" <td>-1.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>five</th>\n",
" <td>2.9</td>\n",
" <td>Nevada</td>\n",
" <td>2002</td>\n",
" <td>-1.7</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" pop state year new_var\n",
"one 1.5 Ohio 2000 NaN\n",
"two 1.7 Ohio 2001 -1.2\n",
"three 3.6 Ohio 2002 NaN\n",
"four 2.4 Nevada 2001 -1.5\n",
"five 2.9 Nevada 2002 -1.7"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The del keyword will delete columns as with a dict:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>pop</th>\n",
" <th>state</th>\n",
" <th>year</th>\n",
" <th>new_var</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>one</th>\n",
" <td>1.5</td>\n",
" <td>Ohio</td>\n",
" <td>2000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>two</th>\n",
" <td>1.7</td>\n",
" <td>Ohio</td>\n",
" <td>2001</td>\n",
" <td>-1.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>three</th>\n",
" <td>3.6</td>\n",
" <td>Ohio</td>\n",
" <td>2002</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>four</th>\n",
" <td>2.4</td>\n",
" <td>Nevada</td>\n",
" <td>2001</td>\n",
" <td>-1.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>five</th>\n",
" <td>2.9</td>\n",
" <td>Nevada</td>\n",
" <td>2002</td>\n",
" <td>-1.7</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" pop state year new_var\n",
"one 1.5 Ohio 2000 NaN\n",
"two 1.7 Ohio 2001 -1.2\n",
"three 3.6 Ohio 2002 NaN\n",
"four 2.4 Nevada 2001 -1.5\n",
"five 2.9 Nevada 2002 -1.7"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"del df1['new_var']"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>pop</th>\n",
" <th>state</th>\n",
" <th>year</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>one</th>\n",
" <td>1.5</td>\n",
" <td>Ohio</td>\n",
" <td>2000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>two</th>\n",
" <td>1.7</td>\n",
" <td>Ohio</td>\n",
" <td>2001</td>\n",
" </tr>\n",
" <tr>\n",
" <th>three</th>\n",
" <td>3.6</td>\n",
" <td>Ohio</td>\n",
" <td>2002</td>\n",
" </tr>\n",
" <tr>\n",
" <th>four</th>\n",
" <td>2.4</td>\n",
" <td>Nevada</td>\n",
" <td>2001</td>\n",
" </tr>\n",
" <tr>\n",
" <th>five</th>\n",
" <td>2.9</td>\n",
" <td>Nevada</td>\n",
" <td>2002</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" pop state year\n",
"one 1.5 Ohio 2000\n",
"two 1.7 Ohio 2001\n",
"three 3.6 Ohio 2002\n",
"four 2.4 Nevada 2001\n",
"five 2.9 Nevada 2002"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data2 = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],\n",
" 'year': [2000, 2001, 2002, 2001, 2002],\n",
" 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]\n",
" }\n",
"df2 = pd.DataFrame(data2, index=['one', 'two', 'three', 'four', 'five'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print the entire dataframe \"df2\""
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>pop</th>\n",
" <th>state</th>\n",
" <th>year</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>one</th>\n",
" <td>1.5</td>\n",
" <td>Ohio</td>\n",
" <td>2000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>two</th>\n",
" <td>1.7</td>\n",
" <td>Ohio</td>\n",
" <td>2001</td>\n",
" </tr>\n",
" <tr>\n",
" <th>three</th>\n",
" <td>3.6</td>\n",
" <td>Ohio</td>\n",
" <td>2002</td>\n",
" </tr>\n",
" <tr>\n",
" <th>four</th>\n",
" <td>2.4</td>\n",
" <td>Nevada</td>\n",
" <td>2001</td>\n",
" </tr>\n",
" <tr>\n",
" <th>five</th>\n",
" <td>2.9</td>\n",
" <td>Nevada</td>\n",
" <td>2002</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" pop state year\n",
"one 1.5 Ohio 2000\n",
"two 1.7 Ohio 2001\n",
"three 3.6 Ohio 2002\n",
"four 2.4 Nevada 2001\n",
"five 2.9 Nevada 2002"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print only column \"pop\""
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"one 1.5\n",
"two 1.7\n",
"three 3.6\n",
"four 2.4\n",
"five 2.9\n",
"Name: pop, dtype: float64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2[\"pop\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print only row 2"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"pop 1.7\n",
"state Ohio\n",
"year 2001\n",
"new_var2 2\n",
"Name: two, dtype: object"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.ix[\"two\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a new column \"new_var2\" in with values (1,2,3,4,5). (Hint: there are multiple ways of doing this.)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df2[\"new_var2\"] = np.arange(1,6)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>pop</th>\n",
" <th>state</th>\n",
" <th>year</th>\n",
" <th>new_var2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>one</th>\n",
" <td>1.5</td>\n",
" <td>Ohio</td>\n",
" <td>2000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>two</th>\n",
" <td>1.7</td>\n",
" <td>Ohio</td>\n",
" <td>2001</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>three</th>\n",
" <td>3.6</td>\n",
" <td>Ohio</td>\n",
" <td>2002</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>four</th>\n",
" <td>2.4</td>\n",
" <td>Nevada</td>\n",
" <td>2001</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>five</th>\n",
" <td>2.9</td>\n",
" <td>Nevada</td>\n",
" <td>2002</td>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" pop state year new_var2\n",
"one 1.5 Ohio 2000 1\n",
"two 1.7 Ohio 2001 2\n",
"three 3.6 Ohio 2002 3\n",
"four 2.4 Nevada 2001 4\n",
"five 2.9 Nevada 2002 5"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Select and Filter"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df3 = pd.DataFrame(np.arange(16).reshape((4, 4)),\n",
" index=['Ohio', 'Colorado', 'Utah', 'New York'],\n",
" columns=['one', 'two', 'three', 'four'])"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" <th>four</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Ohio</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Utah</th>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>12</td>\n",
" <td>13</td>\n",
" <td>14</td>\n",
" <td>15</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three four\n",
"Ohio 0 1 2 3\n",
"Colorado 4 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Indexing into a DataFrame is for retrieving one or more columns either with a single value or sequence:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" <th>four</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Ohio</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three four\n",
"Ohio 0 1 2 3\n",
"Colorado 4 5 6 7"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3[0:2]"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Ohio 1\n",
"Colorado 5\n",
"Utah 9\n",
"New York 13\n",
"Name: two, dtype: int64"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3['two']"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>three</th>\n",
" <th>one</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Ohio</th>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>6</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Utah</th>\n",
" <td>10</td>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>14</td>\n",
" <td>12</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" three one\n",
"Ohio 2 0\n",
"Colorado 6 4\n",
"Utah 10 8\n",
"New York 14 12"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3[['three', 'one']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can select rows by slicing or a boolean array:"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" <th>four</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Ohio</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Utah</th>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>12</td>\n",
" <td>13</td>\n",
" <td>14</td>\n",
" <td>15</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three four\n",
"Ohio 0 1 2 3\n",
"Colorado 4 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" <th>four</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Ohio</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three four\n",
"Ohio 0 1 2 3\n",
"Colorado 4 5 6 7"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3[:2]"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" <th>four</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Utah</th>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>12</td>\n",
" <td>13</td>\n",
" <td>14</td>\n",
" <td>15</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three four\n",
"Colorado 4 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3[df3['three'] > 5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar comparison:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" <th>four</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Ohio</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Utah</th>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>12</td>\n",
" <td>13</td>\n",
" <td>14</td>\n",
" <td>15</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three four\n",
"Ohio 0 1 2 3\n",
"Colorado 4 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" <th>four</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Ohio</th>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Utah</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three four\n",
"Ohio True True True True\n",
"Colorado True False False False\n",
"Utah False False False False\n",
"New York False False False False"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3 < 5"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" <th>four</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Ohio</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>0</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Utah</th>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>12</td>\n",
" <td>13</td>\n",
" <td>14</td>\n",
" <td>15</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three four\n",
"Ohio 0 0 0 0\n",
"Colorado 0 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3[df3 < 5] = 0\n",
"df3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also drop columns using index:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" <th>four</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Ohio</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>0</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Utah</th>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>12</td>\n",
" <td>13</td>\n",
" <td>14</td>\n",
" <td>15</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three four\n",
"Ohio 0 0 0 0\n",
"Colorado 0 5 6 7\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" <th>four</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Utah</th>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>12</td>\n",
" <td>13</td>\n",
" <td>14</td>\n",
" <td>15</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three four\n",
"Utah 8 9 10 11\n",
"New York 12 13 14 15"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3.drop(['Colorado', 'Ohio'])"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>three</th>\n",
" <th>four</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Ohio</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Utah</th>\n",
" <td>8</td>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>12</td>\n",
" <td>14</td>\n",
" <td>15</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one three four\n",
"Ohio 0 2 3\n",
"Colorado 4 6 7\n",
"Utah 8 10 11\n",
"New York 12 14 15"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3.drop('two', axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>three</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Ohio</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>0</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Utah</th>\n",
" <td>8</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>12</td>\n",
" <td>14</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one three\n",
"Ohio 0 0\n",
"Colorado 0 6\n",
"Utah 8 10\n",
"New York 12 14"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3.drop(['two', 'four'], axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One Final Note: While many pandas functions (like reindex) require that the labels be unique, it’s not mandatory. Data selection is one of the main things that behaves differently with duplicates. "
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 0\n",
"a 1\n",
"b 2\n",
"b 3\n",
"c 4\n",
"dtype: int64"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The index’s is_unique property can tell you whether its values are unique or not. "
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj.index.is_unique"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Indexing a value with multiple entries returns a Series "
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 0\n",
"a 1\n",
"dtype: int64"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj['a']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Indexing a value with single entry return a scalar value:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obj['c']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df4 = pd.DataFrame(np.arange(25).reshape((5, 5)),\n",
" index=['Ohio', 'Colorado', 'Utah', 'New York', 'California'],\n",
" columns=['one', 'two', 'three', 'four', 'five'])"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" <th>four</th>\n",
" <th>five</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Ohio</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Utah</th>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" <td>12</td>\n",
" <td>13</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>15</td>\n",
" <td>16</td>\n",
" <td>17</td>\n",
" <td>18</td>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>California</th>\n",
" <td>20</td>\n",
" <td>21</td>\n",
" <td>22</td>\n",
" <td>23</td>\n",
" <td>24</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three four five\n",
"Ohio 0 1 2 3 4\n",
"Colorado 5 6 7 8 9\n",
"Utah 10 11 12 13 14\n",
"New York 15 16 17 18 19\n",
"California 20 21 22 23 24"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"select rows \"Ohio\", \"Colorado\", and all columns"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" <th>four</th>\n",
" <th>five</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Ohio</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three four five\n",
"Ohio 0 1 2 3 4\n",
"Colorado 5 6 7 8 9"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4[0:2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"select columns \"one\", \"two\", and all rows"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Ohio</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Utah</th>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>15</td>\n",
" <td>16</td>\n",
" </tr>\n",
" <tr>\n",
" <th>California</th>\n",
" <td>20</td>\n",
" <td>21</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two\n",
"Ohio 0 1\n",
"Colorado 5 6\n",
"Utah 10 11\n",
"New York 15 16\n",
"California 20 21"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4[[\"one\", \"two\"]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"select all rows where column \"three\" = 7"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" <th>four</th>\n",
" <th>five</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three four five\n",
"Colorado 5 6 7 8 9"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4[df4[\"three\"]==7]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"drop column \"three\""
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>four</th>\n",
" <th>five</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Ohio</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Utah</th>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" <td>13</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>15</td>\n",
" <td>16</td>\n",
" <td>18</td>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>California</th>\n",
" <td>20</td>\n",
" <td>21</td>\n",
" <td>23</td>\n",
" <td>24</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two four five\n",
"Ohio 0 1 3 4\n",
"Colorado 5 6 8 9\n",
"Utah 10 11 13 14\n",
"New York 15 16 18 19\n",
"California 20 21 23 24"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4.drop([\"three\"], axis = 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"drop column \"three\" again, but in a different way"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>four</th>\n",
" <th>five</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Ohio</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Colorado</th>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Utah</th>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" <td>13</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>New York</th>\n",
" <td>15</td>\n",
" <td>16</td>\n",
" <td>18</td>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>California</th>\n",
" <td>20</td>\n",
" <td>21</td>\n",
" <td>23</td>\n",
" <td>24</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two four five\n",
"Ohio 0 1 3 4\n",
"Colorado 5 6 8 9\n",
"Utah 10 11 13 14\n",
"New York 15 16 18 19\n",
"California 20 21 23 24"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"del df4[\"three\"]\n",
"df4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**CHALLENGE**: select all columns where row \"Colorado\" >= 7. [Hint](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.transpose.html)"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Ohio</th>\n",
" <th>Colorado</th>\n",
" <th>Utah</th>\n",
" <th>New York</th>\n",
" <th>California</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>four</th>\n",
" <td>3</td>\n",
" <td>8</td>\n",
" <td>13</td>\n",
" <td>18</td>\n",
" <td>23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>five</th>\n",
" <td>4</td>\n",
" <td>9</td>\n",
" <td>14</td>\n",
" <td>19</td>\n",
" <td>24</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Ohio Colorado Utah New York California\n",
"four 3 8 13 18 23\n",
"five 4 9 14 19 24"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4.transpose = df4.T\n",
"df4.transpose\n",
"df4.transpose[df4.transpose[\"Colorado\"] >= 7]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Join Dataframes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are multiple ways of joining dataframes using pandas. The three most common ways are pandas.merge, pandas.concat, and combine_first. The method that is most familiar to relationship database (SQL) users is the pandas.merge command, so that's what we will focus on."
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df5_a = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})\n",
"df5_b = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>data1</th>\n",
" <th>key</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>b</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>b</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>a</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>c</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>a</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5</td>\n",
" <td>a</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>6</td>\n",
" <td>b</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" data1 key\n",
"0 0 b\n",
"1 1 b\n",
"2 2 a\n",
"3 3 c\n",
"4 4 a\n",
"5 5 a\n",
"6 6 b"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df5_a"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>data2</th>\n",
" <th>key</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>a</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>b</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>d</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" data2 key\n",
"0 0 a\n",
"1 1 b\n",
"2 2 d"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df5_b"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is an example of a many-to-one merge situation; the data in df5_a has multiple rows labeled a and b, whereas df5_b has only merge with these objects we obtain:"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>data1</th>\n",
" <th>key</th>\n",
" <th>data2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>b</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>b</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>6</td>\n",
" <td>b</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>a</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>a</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5</td>\n",
" <td>a</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" data1 key data2\n",
"0 0 b 1\n",
"1 1 b 1\n",
"2 6 b 1\n",
"3 2 a 0\n",
"4 4 a 0\n",
"5 5 a 0"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(df5_a, df5_b)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that I didn’t specify which column to join on. If not specified, merge uses the overlapping column names as the keys. It’s a good practice to specify explicitly, though:"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>data1</th>\n",
" <th>key</th>\n",
" <th>data2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>b</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>b</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>6</td>\n",
" <td>b</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>a</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>a</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5</td>\n",
" <td>a</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" data1 key data2\n",
"0 0 b 1\n",
"1 1 b 1\n",
"2 6 b 1\n",
"3 2 a 0\n",
"4 4 a 0\n",
"5 5 a 0"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(df5_a, df5_b, on='key')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the column names are different in each object, you can specify the key to join by explicitly. Join df6_a and df6_b below, where the parameters are left_on=? and right_on= ?"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df6_a = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})\n",
"df6_b = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>data1</th>\n",
" <th>lkey</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>b</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>b</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>a</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>c</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>a</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5</td>\n",
" <td>a</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>6</td>\n",
" <td>b</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" data1 lkey\n",
"0 0 b\n",
"1 1 b\n",
"2 2 a\n",
"3 3 c\n",
"4 4 a\n",
"5 5 a\n",
"6 6 b"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df6_a"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>data2</th>\n",
" <th>rkey</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>a</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>b</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>d</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" data2 rkey\n",
"0 0 a\n",
"1 1 b\n",
"2 2 d"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df6_b"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>data1</th>\n",
" <th>lkey</th>\n",
" <th>data2</th>\n",
" <th>rkey</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>b</td>\n",
" <td>1</td>\n",
" <td>b</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>b</td>\n",
" <td>1</td>\n",
" <td>b</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>6</td>\n",
" <td>b</td>\n",
" <td>1</td>\n",
" <td>b</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>a</td>\n",
" <td>0</td>\n",
" <td>a</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>a</td>\n",
" <td>0</td>\n",
" <td>a</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5</td>\n",
" <td>a</td>\n",
" <td>0</td>\n",
" <td>a</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>d</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" data1 lkey data2 rkey\n",
"0 0 b 1 b\n",
"1 1 b 1 b\n",
"2 6 b 1 b\n",
"3 2 a 0 a\n",
"4 4 a 0 a\n",
"5 5 a 0 a\n",
"6 NaN NaN 2 d"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(df6_a, df6_b, right_on=\"rkey\", left_on=\"lkey\", how=\"right\")\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Questions?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas: Data Understanding"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Descriptive Statistics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series or a Series of values from the rows or columns of a DataFrame. Compared with the equivalent methods of vanilla NumPy arrays, they are all built from the ground up to exclude missing data."
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df7 = pd.DataFrame([[1.4, np.nan]\n",
" , [7.1, -4.5]\n",
" , [np.nan, np.nan]\n",
" , [0.75, -1.3]]\n",
" , index=['a', 'b', 'c', 'd']\n",
" , columns=['one', 'two'])"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>a</th>\n",
" <td>1.40</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>b</th>\n",
" <td>7.10</td>\n",
" <td>-4.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>c</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>d</th>\n",
" <td>0.75</td>\n",
" <td>-1.3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two\n",
"a 1.40 NaN\n",
"b 7.10 -4.5\n",
"c NaN NaN\n",
"d 0.75 -1.3"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df7"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calling DataFrame’s sum method returns a Series containing column sums:"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"one 9.25\n",
"two -5.80\n",
"dtype: float64"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df7.sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Passing axis=1 sums over the rows instead:"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 1.40\n",
"b 2.60\n",
"c 0.00\n",
"d -0.55\n",
"dtype: float64"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df7.sum(axis = 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NA values are excluded unless the entire slice (row or column in this case) is NA. This can be disabled using the skipna option:"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a NaN\n",
"b 1.300\n",
"c NaN\n",
"d -0.275\n",
"dtype: float64"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df7.mean(axis=1, skipna=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Other cumulative functions also include: sum, mean, median, var, std, skew, etc."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df8 = pd.DataFrame([[1.4, np.nan, 1]\n",
" , [7.1, -4.5, 2]\n",
" , [np.nan, np.nan, 3]\n",
" , [0.75, -1.3, np.nan]]\n",
" , index=['a', 'b', 'c', 'd']\n",
" , columns=['one', 'two', 'three'])"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>a</th>\n",
" <td>1.40</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>b</th>\n",
" <td>7.10</td>\n",
" <td>-4.5</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>c</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>d</th>\n",
" <td>0.75</td>\n",
" <td>-1.3</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three\n",
"a 1.40 NaN 1\n",
"b 7.10 -4.5 2\n",
"c NaN NaN 3\n",
"d 0.75 -1.3 NaN"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df8"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find the median of df8 along axis = 1, skipping over missing values"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a 1.200000\n",
"b 1.533333\n",
"c 3.000000\n",
"d -0.275000\n",
"dtype: float64"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df8.mean(axis = 1, skipna = True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find the standard deviation (std) of df8 along axis = 0, including the missing values"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"one NaN\n",
"two NaN\n",
"three NaN\n",
"dtype: float64"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df8.std(axis = 0, skipna = False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Count() and Describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Describe() produces multiple summary statistics in one shot:"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df9 = pd.DataFrame([[1.4, np.nan, 1]\n",
" , [7.1, -4.5, 2]\n",
" , [np.nan, np.nan, 3]\n",
" , [0.75, -1.3, np.nan]]\n",
" , index=['a', 'b', 'c', 'd']\n",
" , columns=['one', 'two', 'three'])"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>a</th>\n",
" <td>1.40</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>b</th>\n",
" <td>7.10</td>\n",
" <td>-4.5</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>c</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>d</th>\n",
" <td>0.75</td>\n",
" <td>-1.3</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three\n",
"a 1.40 NaN 1\n",
"b 7.10 -4.5 2\n",
"c NaN NaN 3\n",
"d 0.75 -1.3 NaN"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df9"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>3.000000</td>\n",
" <td>2.000000</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>3.083333</td>\n",
" <td>-2.900000</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>3.493685</td>\n",
" <td>2.262742</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.750000</td>\n",
" <td>-4.500000</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>1.075000</td>\n",
" <td>-3.700000</td>\n",
" <td>1.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>1.400000</td>\n",
" <td>-2.900000</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>4.250000</td>\n",
" <td>-2.100000</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>7.100000</td>\n",
" <td>-1.300000</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three\n",
"count 3.000000 2.000000 3.0\n",
"mean 3.083333 -2.900000 2.0\n",
"std 3.493685 2.262742 1.0\n",
"min 0.750000 -4.500000 1.0\n",
"25% 1.075000 -3.700000 1.5\n",
"50% 1.400000 -2.900000 2.0\n",
"75% 4.250000 -2.100000 2.5\n",
"max 7.100000 -1.300000 3.0"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df9.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df10 = pd.DataFrame([[1.4, np.nan, 1, 3]\n",
" , [7.1, -4.5, 2, 5]\n",
" , [np.nan, np.nan, 3, 7]\n",
" , [0.75, -1.3, np.nan, 8]]\n",
" , index=['a', 'b', 'c', 'd']\n",
" , columns=['one', 'two', 'three', 'four'])"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" <th>four</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>a</th>\n",
" <td>1.40</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>b</th>\n",
" <td>7.10</td>\n",
" <td>-4.5</td>\n",
" <td>2</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>c</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>3</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>d</th>\n",
" <td>0.75</td>\n",
" <td>-1.3</td>\n",
" <td>NaN</td>\n",
" <td>8</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three four\n",
"a 1.40 NaN 1 3\n",
"b 7.10 -4.5 2 5\n",
"c NaN NaN 3 7\n",
"d 0.75 -1.3 NaN 8"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df10 "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the describe function on df10. Anything interesting?"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>two</th>\n",
" <th>three</th>\n",
" <th>four</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>3.000000</td>\n",
" <td>2.000000</td>\n",
" <td>3.0</td>\n",
" <td>4.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>3.083333</td>\n",
" <td>-2.900000</td>\n",
" <td>2.0</td>\n",
" <td>5.750000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>3.493685</td>\n",
" <td>2.262742</td>\n",
" <td>1.0</td>\n",
" <td>2.217356</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.750000</td>\n",
" <td>-4.500000</td>\n",
" <td>1.0</td>\n",
" <td>3.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>1.075000</td>\n",
" <td>-3.700000</td>\n",
" <td>1.5</td>\n",
" <td>4.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>1.400000</td>\n",
" <td>-2.900000</td>\n",
" <td>2.0</td>\n",
" <td>6.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>4.250000</td>\n",
" <td>-2.100000</td>\n",
" <td>2.5</td>\n",
" <td>7.250000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>7.100000</td>\n",
" <td>-1.300000</td>\n",
" <td>3.0</td>\n",
" <td>8.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one two three four\n",
"count 3.000000 2.000000 3.0 4.000000\n",
"mean 3.083333 -2.900000 2.0 5.750000\n",
"std 3.493685 2.262742 1.0 2.217356\n",
"min 0.750000 -4.500000 1.0 3.000000\n",
"25% 1.075000 -3.700000 1.5 4.500000\n",
"50% 1.400000 -2.900000 2.0 6.000000\n",
"75% 4.250000 -2.100000 2.5 7.250000\n",
"max 7.100000 -1.300000 3.0 8.000000"
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df10.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Duplicates"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df11 = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4, 4, 4],\n",
" 'Qu2': [2, 3, 1, 2, 3, 3, 3],\n",
" 'Qu3': [1, 5, 2, 4, 4, 4, 4]})"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Qu1</th>\n",
" <th>Qu2</th>\n",
" <th>Qu3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Qu1 Qu2 Qu3\n",
"0 1 2 1\n",
"1 3 3 5\n",
"2 4 1 2\n",
"3 3 2 4\n",
"4 4 3 4\n",
"5 4 3 4\n",
"6 4 3 4"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df11"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, drop_duplicates removes entire rows of duplicates, keeping the first occurence of duplication."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Qu1</th>\n",
" <th>Qu2</th>\n",
" <th>Qu3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Qu1 Qu2 Qu3\n",
"0 1 2 1\n",
"1 3 3 5\n",
"2 4 1 2\n",
"3 3 2 4\n",
"4 4 3 4"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df11.drop_duplicates()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exerise"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df12 = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4, 4, 4, 1, 0, 1],\n",
" 'Qu2': [2, 3, 1, 2, 3, 3, 3, 1, 0, 1],\n",
" 'Qu3': [1, 5, 2, 4, 4, 4, 4, 1, 0, 1]})"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Qu1</th>\n",
" <th>Qu2</th>\n",
" <th>Qu3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Qu1 Qu2 Qu3\n",
"0 1 2 1\n",
"1 3 3 5\n",
"2 4 1 2\n",
"3 3 2 4\n",
"4 4 3 4\n",
"5 4 3 4\n",
"6 4 3 4\n",
"7 1 1 1\n",
"8 0 0 0\n",
"9 1 1 1"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df12"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read up documentation for drop_duplicates and discuss. What other alternatives are there for this command?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Qu1</th>\n",
" <th>Qu2</th>\n",
" <th>Qu3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Qu1 Qu2 Qu3\n",
"0 1 2 1\n",
"1 3 3 5\n",
"2 4 1 2\n",
"3 3 2 4\n",
"4 4 3 4\n",
"7 1 1 1\n",
"8 0 0 0"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df12.drop_duplicates()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How would you drop duplicates in place?"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Qu1</th>\n",
" <th>Qu2</th>\n",
" <th>Qu3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Qu1 Qu2 Qu3\n",
"0 1 2 1\n",
"1 3 3 5\n",
"2 4 1 2\n",
"3 3 2 4\n",
"4 4 3 4\n",
"7 1 1 1\n",
"8 0 0 0"
]
},
"execution_count": 112,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df12.drop_duplicates(inplace = True)\n",
"df12"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How would you drop all duplicates?"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Qu1</th>\n",
" <th>Qu2</th>\n",
" <th>Qu3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Qu1 Qu2 Qu3\n",
"0 1 2 1\n",
"1 3 3 5\n",
"2 4 1 2\n",
"3 3 2 4\n",
"4 4 3 4\n",
"7 1 1 1\n",
"8 0 0 0"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df12.drop_duplicates(keep = False)\n",
"df12"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. Correlation and Covariance"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let’s consider some DataFrames of stock prices and volumes obtained from Yahoo! Finance:"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df14_returns = pd.DataFrame([\n",
" [-0.084311, 0.014634, -0.086538]\n",
" , [-0.033944, 0.035137, -0.017241]\n",
" , [-0.033780, 0.010544, -0.033498]\n",
" ]\n",
" , index=['2000-01-03', '2000-01-03', '2000-01-03']\n",
" , columns=['AAPL', 'IBM', 'MSFT'])"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>AAPL</th>\n",
" <th>IBM</th>\n",
" <th>MSFT</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2000-01-03</th>\n",
" <td>-0.084311</td>\n",
" <td>0.014634</td>\n",
" <td>-0.086538</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2000-01-03</th>\n",
" <td>-0.033944</td>\n",
" <td>0.035137</td>\n",
" <td>-0.017241</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2000-01-03</th>\n",
" <td>-0.033780</td>\n",
" <td>0.010544</td>\n",
" <td>-0.033498</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" AAPL IBM MSFT\n",
"2000-01-03 -0.084311 0.014634 -0.086538\n",
"2000-01-03 -0.033944 0.035137 -0.017241\n",
"2000-01-03 -0.033780 0.010544 -0.033498"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df14_returns.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"DataFrame’s corr and cov methods, return a full correlation or covariance matrix as a DataFrame, respectively:"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>AAPL</th>\n",
" <th>IBM</th>\n",
" <th>MSFT</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>AAPL</th>\n",
" <td>1.000000</td>\n",
" <td>0.356919</td>\n",
" <td>0.973884</td>\n",
" </tr>\n",
" <tr>\n",
" <th>IBM</th>\n",
" <td>0.356919</td>\n",
" <td>1.000000</td>\n",
" <td>0.559689</td>\n",
" </tr>\n",
" <tr>\n",
" <th>MSFT</th>\n",
" <td>0.973884</td>\n",
" <td>0.559689</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" AAPL IBM MSFT\n",
"AAPL 1.000000 0.356919 0.973884\n",
"IBM 0.356919 1.000000 0.559689\n",
"MSFT 0.973884 0.559689 1.000000"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df14_returns.corr()"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>AAPL</th>\n",
" <th>IBM</th>\n",
" <th>MSFT</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>AAPL</th>\n",
" <td>0.000848</td>\n",
" <td>0.000137</td>\n",
" <td>0.001028</td>\n",
" </tr>\n",
" <tr>\n",
" <th>IBM</th>\n",
" <td>0.000137</td>\n",
" <td>0.000174</td>\n",
" <td>0.000267</td>\n",
" </tr>\n",
" <tr>\n",
" <th>MSFT</th>\n",
" <td>0.001028</td>\n",
" <td>0.000267</td>\n",
" <td>0.001313</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" AAPL IBM MSFT\n",
"AAPL 0.000848 0.000137 0.001028\n",
"IBM 0.000137 0.000174 0.000267\n",
"MSFT 0.001028 0.000267 0.001313"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df14_returns.cov()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we only want to look at the correlation and covariance between two stock prices:"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.55968873895232241"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df14_returns.MSFT.corr(df14_returns.IBM)"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.00026727853200000001"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df14_returns.MSFT.cov(df14_returns.IBM)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the same dataset as above, what is the correlation between AAPL and IBM?"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.35691922517121777"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df14_returns.AAPL.corr(df14_returns.IBM)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the same dataset as above, what is the covariance between AAPL and IBM?"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.00013699492649999997"
]
},
"execution_count": 118,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df14_returns.AAPL.cov(df14_returns.IBM)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This will take a bit of researching online, but instead of Pearson correlation, how do I output the Spearman correlation instead? Once you find out, please output the correlation matrix for Spearman."
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>AAPL</th>\n",
" <th>IBM</th>\n",
" <th>MSFT</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>AAPL</th>\n",
" <td>1.0</td>\n",
" <td>-0.5</td>\n",
" <td>0.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>IBM</th>\n",
" <td>-0.5</td>\n",
" <td>1.0</td>\n",
" <td>0.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>MSFT</th>\n",
" <td>0.5</td>\n",
" <td>0.5</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" AAPL IBM MSFT\n",
"AAPL 1.0 -0.5 0.5\n",
"IBM -0.5 1.0 0.5\n",
"MSFT 0.5 0.5 1.0"
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df14_returns.corr(method = \"spearman\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5. How to Handle Missing Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Missing data is common in most data analysis applications. The default for most commands on pandas objects is to exclude missing data. Pandas uses the floating point value NaN (Not a Number) to represent missing data in both floating as well as in non-floating point arrays."
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df15 = pd.DataFrame([[1., 6.5, 3.], \n",
" [1., np.nan, np.nan],\n",
" [np.nan, np.nan, np.nan], \n",
" [np.nan, 6.5, 3.]])"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 6.5 3\n",
"1 1 NaN NaN\n",
"2 NaN NaN NaN\n",
"3 NaN 6.5 3"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df15"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are a few ways to filter out / drop missing values. By default, dropna() drops all rows that have any missing values"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 6.5 3"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df15.dropna()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By passing how='all', this will only drop rows that are all NA:"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 6.5 3\n",
"1 1 NaN NaN\n",
"3 NaN 6.5 3"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df15.dropna(how = 'all')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If instead of dropping rows, we want to drop columns, then change the axis:"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 6.5 3\n",
"1 1 NaN NaN\n",
"2 NaN NaN NaN\n",
"3 NaN 6.5 3"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df15.dropna(axis = 1, how = 'all')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Okay, what if we don't want to drop, and we want to impute the data instead? Use fillna()"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 6.5 3\n",
"1 1 6.5 3\n",
"2 1 6.5 3\n",
"3 1 6.5 3"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df15.fillna(method='ffill')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are the 5 options for \"method\" in the fillna command? What does each of them mean?"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 6.5 3\n",
"1 1 6.5 3\n",
"2 NaN 6.5 3\n",
"3 NaN 6.5 3"
]
},
"execution_count": 122,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df15.fillna(method='bfill')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the following dataset, visualize each of the five fillna() methods. What is the difference?"
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df16 = pd.DataFrame([[1., 6.5, 3.], \n",
" [1., np.nan, np.nan],\n",
" [np.nan, np.nan, np.nan], \n",
" [np.nan, 6.5, 3.]])"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 6.5 3\n",
"1 1 NaN NaN\n",
"2 NaN NaN NaN\n",
"3 NaN 6.5 3"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df16"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 6.5 3\n",
"1 1 6.5 3\n",
"2 NaN 6.5 3\n",
"3 NaN 6.5 3"
]
},
"execution_count": 125,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df16.fillna(method='backfill')"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 6.5 3\n",
"1 1 6.5 3\n",
"2 NaN 6.5 3\n",
"3 NaN 6.5 3"
]
},
"execution_count": 129,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df16.fillna(method='bfill')"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 6.5 3\n",
"1 1 6.5 3\n",
"2 1 6.5 3\n",
"3 1 6.5 3"
]
},
"execution_count": 130,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df16.fillna(method='pad')"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 6.5 3\n",
"1 1 6.5 3\n",
"2 1 6.5 3\n",
"3 1 6.5 3"
]
},
"execution_count": 131,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df16.fillna(method='ffill')"
]
},
{
"cell_type": "code",
"execution_count": 145,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>None</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 6.5 3\n",
"1 1 None None\n",
"2 None None None\n",
"3 None 6.5 3"
]
},
"execution_count": 145,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df16.fillna('None')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the replace() command and df16, how would you replace all missing values with -999 instead?\n",
"Hint: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html"
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>-999.0</td>\n",
" <td>-999</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-999</td>\n",
" <td>-999.0</td>\n",
" <td>-999</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-999</td>\n",
" <td>6.5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 6.5 3\n",
"1 1 -999.0 -999\n",
"2 -999 -999.0 -999\n",
"3 -999 6.5 3"
]
},
"execution_count": 142,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df15.replace(to_replace = \"NaN\", value = -999)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note**: \n",
"\n",
"The ffill() function is equivalent to fillna(method='ffill') and bfill() is equivalent to fillna(method='bfill').\n",
"\n",
"pad / ffill = fill values forward\n",
"bfill / backfill = fill values backward"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Questions?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Matplotlib: Data Visualizing"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Matplotlib is the most popular Python library for producing plots and other 2D data visualizations. Basic matplotlib graphs are easy to make. The tricker parts are in the details."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"My favorite tutorial on matplotlib: https://github.com/rougier/matplotlib-tutorial"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Two Simple Sine Graphs"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {
"collapsed": false
},
"source": [
"X = np.linspace(-np.pi, np.pi, 256, endpoint=True)\n",
"C, S = np.cos(X), np.sin(X)\n",
"\n",
"plt.plot(X, C)\n",
"plt.plot(X, S)\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Scatterplot"
]
},
{
"cell_type": "code",
"execution_count": 153,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<matplotlib.figure.Figure at 0x1177f53d0>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"n = 1024\n",
"X = np.random.normal(0,1,n)\n",
"Y = np.random.normal(0,1,n)\n",
"\n",
"plt.scatter(X,Y)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Adding some formatting, can you tell what we did?"
]
},
{
"cell_type": "code",
"execution_count": 148,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<matplotlib.figure.Figure at 0x1177e6350>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"n = 1024\n",
"X = np.random.normal(0,1,n)\n",
"Y = np.random.normal(0,1,n)\n",
"T = np.arctan2(Y,X)\n",
"\n",
"plt.axes([0.025,0.025,0.95,0.95])\n",
"plt.scatter(X,Y, s=75, c=T, alpha=.5)\n",
"\n",
"plt.xlim(-1.5,1.5), plt.xticks([])\n",
"plt.ylim(-1.5,1.5), plt.yticks([])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Bar Plots"
]
},
{
"cell_type": "code",
"execution_count": 159,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<matplotlib.figure.Figure at 0x1181cab50>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"n = 12\n",
"X = np.arange(n)\n",
"Y2 = (1-X/float(n)) * np.random.uniform(0.5,1.0,n)\n",
"\n",
"plt.bar(X, Y2, facecolor = 'pink', edgecolor='None')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Challenge Exercise\n",
"Supposed we have want to combine the two graphs above, and make the graph below, what should we do?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![hello](http://www.labri.fr/perso/nrougier/teaching/matplotlib/figures/bar_ex.png)"
]
},
{
"cell_type": "code",
"execution_count": 170,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<matplotlib.figure.Figure at 0x117882650>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"n = 12\n",
"X = np.arange(n)\n",
"Y1 = (1-X/float(n)) * np.random.uniform(0.5,1.0,n)\n",
"Y2 = (1-X/float(n)) * np.random.uniform(0.5,1.0,n)\n",
"\n",
"## enter code here\n",
"plt.bar(X, Y1, facecolor = 'blue', edgecolor='None')\n",
"plt.bar(X, -Y2, facecolor = 'pink', edgecolor='None')\n",
"\n",
"for x,y in zip(X,Y1):\n",
" plt.text(x+0.4, y+0.05, '%.2f' % y, ha='center', va= 'bottom')\n",
" \n",
"plt.ylim(-1.25, +1.25)\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Questions?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Scikit-Learn: Model Building"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Where does data cleaning, data exploration, data visualization and modeling building all fit in the grand scheme of things?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![hello](http://cacm.acm.org/system/assets/0001/3678/rp-overview.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###### Source: http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### To understand some of the common terminologies of machine learning, first go through this interactive tutorial:\n",
"### [R2D3's Visual Introduction to Machine Learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Let's go over the following terminology:\n",
"- Features / predictors / variables / X\n",
"- Outcome / predictions / Y\n",
"- Classification / regression \n",
"- Training dataset / testing dataset\n",
"- Overfitting"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.datasets import make_blobs\n",
"from sklearn import datasets\n",
"from sklearn.cross_validation import train_test_split\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.linear_model import LogisticRegression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Regression Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In regression, the predictor variable is continuous. Let's start with a simple dataset of sin curve with noise."
]
},
{
"cell_type": "code",
"execution_count": 172,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"x = np.linspace(-3, 3, 100)\n",
"y = x + np.random.normal(0,1,100)"
]
},
{
"cell_type": "code",
"execution_count": 173,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[<matplotlib.lines.Line2D at 0x117e98410>]"
]
},
"execution_count": 173,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"<matplotlib.figure.Figure at 0x117a2e410>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.plot(x, y, 'o')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the simplest models again is a regression, which fits a line to the scatterplot data. This is called ordinary least squares linear regression. The function in sklearn is LinearRegression. \n",
"\n",
"Before applying the regression however, X must be a 2d-array:"
]
},
{
"cell_type": "code",
"execution_count": 174,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(100,)\n",
"(100, 1)\n"
]
}
],
"source": [
"print(x.shape)\n",
"X = x[:, np.newaxis]\n",
"print(X.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We split our data in a training and a test set, to alleviate overfitting."
]
},
{
"cell_type": "code",
"execution_count": 175,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Build our regression model on training data"
]
},
{
"cell_type": "code",
"execution_count": 176,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"
]
},
"execution_count": 176,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"regressor = LinearRegression()\n",
"regressor.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, evaluate the model on training data, the same data we built the model on."
]
},
{
"cell_type": "code",
"execution_count": 177,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"y_pred_train = regressor.predict(X_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's visualize the line and the data."
]
},
{
"cell_type": "code",
"execution_count": 178,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.legend.Legend at 0x11789a6d0>"
]
},
"execution_count": 178,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"<matplotlib.figure.Figure at 0x11826c390>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.plot(X_train, y_train, 'o', label=\"data\")\n",
"plt.plot(X_train, y_pred_train, 'o', label=\"prediction\")\n",
"plt.legend(loc='best')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, evaluate the model on testing data, which was purposely kept out of the model training process. "
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"y_pred_test = regressor.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.legend.Legend at 0x7f535ec11390>"
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"<matplotlib.figure.Figure at 0x7f535edabb90>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.plot(X_test, y_test, 'o', label=\"data\")\n",
"plt.plot(X_test, y_pred_test, 'o', label=\"prediction\")\n",
"plt.legend(loc='best')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To quantitatively evaluate how well the model is performing, we use the score function. For regression models, the common methodology for scoring is using R2 or mean squared error (MSE)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compare model performance for training and testing."
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.8045065263359914"
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"regressor.score(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.65666716237561662"
]
},
"execution_count": 100,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"regressor.score(X_test, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fit LinearRegression on the boston housing dataset. You can load the dataset using sklearn.datasets.load_boston."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"((506, 13), (506,))\n",
"(379, 13) (127, 13) (379,) (127,)\n"
]
},
{
"data": {
"text/plain": [
"<matplotlib.legend.Legend at 0x10e3c2590>"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"<matplotlib.figure.Figure at 0x10b0d1390>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"boston = datasets.load_boston()\n",
"Xboston = boston.data\n",
"Yboston = boston.target\n",
"print (Xboston.shape, Yboston.shape)\n",
" \n",
"Xboston_train, Xboston_test, Yboston_train, Yboston_test = train_test_split(Xboston, Yboston)\n",
"print Xboston_train.shape, Xboston_test.shape, Yboston_train.shape, Yboston_test.shape\n",
" \n",
"regressor = LinearRegression()\n",
"regressor.fit(Xboston_train, Yboston_train)\n",
" \n",
"y_pred_train = regressor.predict(Xboston_train) \n",
"\n",
"plt.plot(Xboston_train, Yboston_train, 'o', label=\"data\")\n",
"plt.plot(Xboston_train, y_pred_train, 'o', label=\"prediction\")\n",
"plt.legend(loc='best')\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interactive Demo: Classification Model"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(150, 2) (150,)\n",
"(112, 2) (38, 2) (112,) (38,)\n",
"[2 0 2 2 0 2 1 2 0 1 1 2 0 2 1 1 1 2 2 2 1 2 0 0 2 2 2 0 2 2 0 0 0 0 0 2 1\n",
" 0] [2 0 1 2 0 1 1 2 0 2 2 2 0 2 1 2 1 1 2 2 1 1 0 0 2 2 1 0 1 2 0 0 0 0 0 2 1\n",
" 0]\n",
"0.776785714286\n",
"0.763157894737\n"
]
}
],
"source": [
"# # import some data to play with\n",
"iris = datasets.load_iris()\n",
"X = iris.data[:, :2]\n",
"y = iris.target\n",
"print X.shape, y.shape\n",
"\n",
"# # split the data\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
"print X_train.shape, X_test.shape, y_train.shape, y_test.shape\n",
"\n",
"classifier = LogisticRegression()\n",
"classifier.fit(X_train, y_train)\n",
"prediction = classifier.predict(X_test)\n",
"print prediction, y_test\n",
"\n",
"print classifier.score(X_train, y_train)\n",
"print classifier.score(X_test, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Re-run the classification model above, but this time use the entire iris dataset."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(150, 4) (150,)\n",
"(112, 4) (38, 4) (112,) (38,)\n",
"[2 0 2 1 2 1 1 0 2 1 1 1 0 0 2 2 1 0 0 0 2 0 0 0 1 0 1 0 0 2 2 0 2 0 2 2 0\n",
" 1] [2 0 2 1 2 1 1 0 2 1 1 1 0 0 1 2 1 0 0 0 2 0 0 0 1 0 1 0 0 2 2 0 2 0 2 1 0\n",
" 1]\n",
"0.973214285714\n",
"0.947368421053\n"
]
}
],
"source": [
"xiris = iris.data\n",
"yiris = iris.target\n",
"print xiris.shape, yiris.shape\n",
"\n",
"xiris_train, xiris_test, yiris_train, yiris_test = train_test_split(xiris, yiris)\n",
"print xiris_train.shape, xiris_test.shape, yiris_train.shape, yiris_test.shape\n",
"\n",
"classifier = LogisticRegression()\n",
"classifier.fit(xiris_train, yiris_train)\n",
"prediction = classifier.predict(xiris_test)\n",
"print prediction, yiris_test\n",
"\n",
"print classifier.score(xiris_train, yiris_train)\n",
"print classifier.score(xiris_test, yiris_test)\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Further Resources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Learn Python at your own pace:\n",
"- [Learn Python the Hard Way](http://learnpythonthehardway.org/)\n",
"- [Python Codeacademy](https://www.codecademy.com/learn/python)\n",
"- [Hackerrank Challenges in Python](https://www.hackerrank.com/)\n",
"\n",
"Online data science & Python classes:\n",
"- [Coursera: Learn to Program and Analyze Data with Python](https://www.coursera.org/specializations/python)\n",
"- [Coursera: Machine Learning with University of Washington](https://www.coursera.org/specializations/machine-learning)\n",
"\n",
"Python Reference Book:\n",
"- [Python for Data Analysis, O'Reilly Media](http://shop.oreilly.com/product/0636920023784.do)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment