Last active
June 17, 2018 22:51
-
-
Save ulgens/a967f9a971940c83a922f3f4cf450912 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Agenda\n", | |
"- Python Basics\n", | |
"- Data Wrangling & Understanding with Pandas\n", | |
"- Data Visualizing with Matplotlib\n", | |
"- Model Building with Scikit-Learn" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Python Basics" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### There are many ways to run Python code.\n", | |
"Command Line Interpreter \n", | |
" - Usually installed as /usr/local/bin/python\n", | |
"\n", | |
"Command Line File Execution \n", | |
" - Basic: SublimeText2, Atom, Notepad ++\n", | |
" - Advanced: Vim, Emacs \n", | |
"\n", | |
"IDE (Integrated Development Environments)\n", | |
" - Canopy\n", | |
" - PyCharm\n", | |
" - Spyder\n", | |
"\n", | |
"Jupyter Notebook (formerly iPython Notebook)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### What is Jupyter Notebook?\n", | |
"It is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### What are the key Python libraries for data science?\n", | |
"\n", | |
"1. **Pandas** – R-like dataframes, data cleaning, data munging, data exploration\n", | |
"\n", | |
"2. **Matplotlib** – plotting environment\n", | |
"\n", | |
"3. **NumPy** – fast arrays/matrices and vector operations\n", | |
"\n", | |
"4. **Scikit-learn** – maching learning, data mining\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Import Python Libraries" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"%matplotlib inline\n", | |
"\n", | |
"import matplotlib\n", | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"import matplotlib.pyplot as plt\n", | |
"\n", | |
"from sklearn.datasets import make_blobs\n", | |
"from sklearn.cross_validation import train_test_split\n", | |
"from sklearn.linear_model import LinearRegression\n", | |
"from sklearn.linear_model import LogisticRegression" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Pandas: Data Wrangling" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 1. The Basics" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"data1 = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],\n", | |
" 'year': [2000, 2001, 2002, 2001, 2002],\n", | |
" 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]\n", | |
" }\n", | |
"df1 = pd.DataFrame(data1, index=['one', 'two', 'three', 'four', 'five'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>pop</th>\n", | |
" <th>state</th>\n", | |
" <th>year</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>one</th>\n", | |
" <td>1.5</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>two</th>\n", | |
" <td>1.7</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2001</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>three</th>\n", | |
" <td>3.6</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2002</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>four</th>\n", | |
" <td>2.4</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2001</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>five</th>\n", | |
" <td>2.9</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2002</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" pop state year\n", | |
"one 1.5 Ohio 2000\n", | |
"two 1.7 Ohio 2001\n", | |
"three 3.6 Ohio 2002\n", | |
"four 2.4 Nevada 2001\n", | |
"five 2.9 Nevada 2002" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df1" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"one Ohio\n", | |
"two Ohio\n", | |
"three Ohio\n", | |
"four Nevada\n", | |
"five Nevada\n", | |
"Name: state, dtype: object" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df1['state']" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"one Ohio\n", | |
"two Ohio\n", | |
"three Ohio\n", | |
"four Nevada\n", | |
"five Nevada\n", | |
"Name: state, dtype: object" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df1.state" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Rows can also be retrieved by position or name by a couple of methods, such as the ix indexing field:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"pop 3.6\n", | |
"state Ohio\n", | |
"year 2002\n", | |
"Name: three, dtype: object" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df1.ix['three']" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>pop</th>\n", | |
" <th>state</th>\n", | |
" <th>year</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>one</th>\n", | |
" <td>1.5</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>two</th>\n", | |
" <td>1.7</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2001</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>three</th>\n", | |
" <td>3.6</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2002</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>four</th>\n", | |
" <td>2.4</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2001</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>five</th>\n", | |
" <td>2.9</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2002</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" pop state year\n", | |
"one 1.5 Ohio 2000\n", | |
"two 1.7 Ohio 2001\n", | |
"three 3.6 Ohio 2002\n", | |
"four 2.4 Nevada 2001\n", | |
"five 2.9 Nevada 2002" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"df1['new_var'] = 16.5" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>pop</th>\n", | |
" <th>state</th>\n", | |
" <th>year</th>\n", | |
" <th>new_var</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>one</th>\n", | |
" <td>1.5</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2000</td>\n", | |
" <td>16.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>two</th>\n", | |
" <td>1.7</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2001</td>\n", | |
" <td>16.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>three</th>\n", | |
" <td>3.6</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2002</td>\n", | |
" <td>16.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>four</th>\n", | |
" <td>2.4</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2001</td>\n", | |
" <td>16.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>five</th>\n", | |
" <td>2.9</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2002</td>\n", | |
" <td>16.5</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" pop state year new_var\n", | |
"one 1.5 Ohio 2000 16.5\n", | |
"two 1.7 Ohio 2001 16.5\n", | |
"three 3.6 Ohio 2002 16.5\n", | |
"four 2.4 Nevada 2001 16.5\n", | |
"five 2.9 Nevada 2002 16.5" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([ 0., 1., 2., 3., 4.])" | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.arange(5.)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"df1['new_var'] = np.arange(5.)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>pop</th>\n", | |
" <th>state</th>\n", | |
" <th>year</th>\n", | |
" <th>new_var</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>one</th>\n", | |
" <td>1.5</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2000</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>two</th>\n", | |
" <td>1.7</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2001</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>three</th>\n", | |
" <td>3.6</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2002</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>four</th>\n", | |
" <td>2.4</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2001</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>five</th>\n", | |
" <td>2.9</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2002</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" pop state year new_var\n", | |
"one 1.5 Ohio 2000 0\n", | |
"two 1.7 Ohio 2001 1\n", | |
"three 3.6 Ohio 2002 2\n", | |
"four 2.4 Nevada 2001 3\n", | |
"five 2.9 Nevada 2002 4" | |
] | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df1" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"When assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, it will be instead conformed exactly to the DataFrame’s index, inserting missing values in any holes:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"two -1.2\n", | |
"four -1.5\n", | |
"five -1.7\n", | |
"dtype: float64" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"val" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>pop</th>\n", | |
" <th>state</th>\n", | |
" <th>year</th>\n", | |
" <th>new_var</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>one</th>\n", | |
" <td>1.5</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2000</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>two</th>\n", | |
" <td>1.7</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2001</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>three</th>\n", | |
" <td>3.6</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2002</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>four</th>\n", | |
" <td>2.4</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2001</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>five</th>\n", | |
" <td>2.9</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2002</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" pop state year new_var\n", | |
"one 1.5 Ohio 2000 0\n", | |
"two 1.7 Ohio 2001 1\n", | |
"three 3.6 Ohio 2002 2\n", | |
"four 2.4 Nevada 2001 3\n", | |
"five 2.9 Nevada 2002 4" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"df1['new_var'] = val" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>pop</th>\n", | |
" <th>state</th>\n", | |
" <th>year</th>\n", | |
" <th>new_var</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>one</th>\n", | |
" <td>1.5</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2000</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>two</th>\n", | |
" <td>1.7</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2001</td>\n", | |
" <td>-1.2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>three</th>\n", | |
" <td>3.6</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2002</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>four</th>\n", | |
" <td>2.4</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2001</td>\n", | |
" <td>-1.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>five</th>\n", | |
" <td>2.9</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2002</td>\n", | |
" <td>-1.7</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" pop state year new_var\n", | |
"one 1.5 Ohio 2000 NaN\n", | |
"two 1.7 Ohio 2001 -1.2\n", | |
"three 3.6 Ohio 2002 NaN\n", | |
"four 2.4 Nevada 2001 -1.5\n", | |
"five 2.9 Nevada 2002 -1.7" | |
] | |
}, | |
"execution_count": 16, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df1" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The del keyword will delete columns as with a dict:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>pop</th>\n", | |
" <th>state</th>\n", | |
" <th>year</th>\n", | |
" <th>new_var</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>one</th>\n", | |
" <td>1.5</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2000</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>two</th>\n", | |
" <td>1.7</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2001</td>\n", | |
" <td>-1.2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>three</th>\n", | |
" <td>3.6</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2002</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>four</th>\n", | |
" <td>2.4</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2001</td>\n", | |
" <td>-1.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>five</th>\n", | |
" <td>2.9</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2002</td>\n", | |
" <td>-1.7</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" pop state year new_var\n", | |
"one 1.5 Ohio 2000 NaN\n", | |
"two 1.7 Ohio 2001 -1.2\n", | |
"three 3.6 Ohio 2002 NaN\n", | |
"four 2.4 Nevada 2001 -1.5\n", | |
"five 2.9 Nevada 2002 -1.7" | |
] | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"del df1['new_var']" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>pop</th>\n", | |
" <th>state</th>\n", | |
" <th>year</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>one</th>\n", | |
" <td>1.5</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>two</th>\n", | |
" <td>1.7</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2001</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>three</th>\n", | |
" <td>3.6</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2002</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>four</th>\n", | |
" <td>2.4</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2001</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>five</th>\n", | |
" <td>2.9</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2002</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" pop state year\n", | |
"one 1.5 Ohio 2000\n", | |
"two 1.7 Ohio 2001\n", | |
"three 3.6 Ohio 2002\n", | |
"four 2.4 Nevada 2001\n", | |
"five 2.9 Nevada 2002" | |
] | |
}, | |
"execution_count": 19, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df1" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Exercise" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"data2 = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],\n", | |
" 'year': [2000, 2001, 2002, 2001, 2002],\n", | |
" 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]\n", | |
" }\n", | |
"df2 = pd.DataFrame(data2, index=['one', 'two', 'three', 'four', 'five'])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Print the entire dataframe \"df2\"" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>pop</th>\n", | |
" <th>state</th>\n", | |
" <th>year</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>one</th>\n", | |
" <td>1.5</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>two</th>\n", | |
" <td>1.7</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2001</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>three</th>\n", | |
" <td>3.6</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2002</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>four</th>\n", | |
" <td>2.4</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2001</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>five</th>\n", | |
" <td>2.9</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2002</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" pop state year\n", | |
"one 1.5 Ohio 2000\n", | |
"two 1.7 Ohio 2001\n", | |
"three 3.6 Ohio 2002\n", | |
"four 2.4 Nevada 2001\n", | |
"five 2.9 Nevada 2002" | |
] | |
}, | |
"execution_count": 16, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df2" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Print only column \"pop\"" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"one 1.5\n", | |
"two 1.7\n", | |
"three 3.6\n", | |
"four 2.4\n", | |
"five 2.9\n", | |
"Name: pop, dtype: float64" | |
] | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df2[\"pop\"]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Print only row 2" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 48, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"pop 1.7\n", | |
"state Ohio\n", | |
"year 2001\n", | |
"new_var2 2\n", | |
"Name: two, dtype: object" | |
] | |
}, | |
"execution_count": 48, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df2.ix[\"two\"]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Create a new column \"new_var2\" in with values (1,2,3,4,5). (Hint: there are multiple ways of doing this.)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"df2[\"new_var2\"] = np.arange(1,6)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>pop</th>\n", | |
" <th>state</th>\n", | |
" <th>year</th>\n", | |
" <th>new_var2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>one</th>\n", | |
" <td>1.5</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2000</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>two</th>\n", | |
" <td>1.7</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2001</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>three</th>\n", | |
" <td>3.6</td>\n", | |
" <td>Ohio</td>\n", | |
" <td>2002</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>four</th>\n", | |
" <td>2.4</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2001</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>five</th>\n", | |
" <td>2.9</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>2002</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" pop state year new_var2\n", | |
"one 1.5 Ohio 2000 1\n", | |
"two 1.7 Ohio 2001 2\n", | |
"three 3.6 Ohio 2002 3\n", | |
"four 2.4 Nevada 2001 4\n", | |
"five 2.9 Nevada 2002 5" | |
] | |
}, | |
"execution_count": 34, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df2" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 2. Select and Filter" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"df3 = pd.DataFrame(np.arange(16).reshape((4, 4)),\n", | |
" index=['Ohio', 'Colorado', 'Utah', 'New York'],\n", | |
" columns=['one', 'two', 'three', 'four'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" <th>four</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Ohio</th>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>4</td>\n", | |
" <td>5</td>\n", | |
" <td>6</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Utah</th>\n", | |
" <td>8</td>\n", | |
" <td>9</td>\n", | |
" <td>10</td>\n", | |
" <td>11</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>New York</th>\n", | |
" <td>12</td>\n", | |
" <td>13</td>\n", | |
" <td>14</td>\n", | |
" <td>15</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three four\n", | |
"Ohio 0 1 2 3\n", | |
"Colorado 4 5 6 7\n", | |
"Utah 8 9 10 11\n", | |
"New York 12 13 14 15" | |
] | |
}, | |
"execution_count": 36, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df3" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Indexing into a DataFrame is for retrieving one or more columns either with a single value or sequence:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 37, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" <th>four</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Ohio</th>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>4</td>\n", | |
" <td>5</td>\n", | |
" <td>6</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three four\n", | |
"Ohio 0 1 2 3\n", | |
"Colorado 4 5 6 7" | |
] | |
}, | |
"execution_count": 37, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df3[0:2]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Ohio 1\n", | |
"Colorado 5\n", | |
"Utah 9\n", | |
"New York 13\n", | |
"Name: two, dtype: int64" | |
] | |
}, | |
"execution_count": 24, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df3['two']" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>three</th>\n", | |
" <th>one</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Ohio</th>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>6</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Utah</th>\n", | |
" <td>10</td>\n", | |
" <td>8</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>New York</th>\n", | |
" <td>14</td>\n", | |
" <td>12</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" three one\n", | |
"Ohio 2 0\n", | |
"Colorado 6 4\n", | |
"Utah 10 8\n", | |
"New York 14 12" | |
] | |
}, | |
"execution_count": 25, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df3[['three', 'one']]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"You can select rows by slicing or a boolean array:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 42, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" <th>four</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Ohio</th>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>4</td>\n", | |
" <td>5</td>\n", | |
" <td>6</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Utah</th>\n", | |
" <td>8</td>\n", | |
" <td>9</td>\n", | |
" <td>10</td>\n", | |
" <td>11</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>New York</th>\n", | |
" <td>12</td>\n", | |
" <td>13</td>\n", | |
" <td>14</td>\n", | |
" <td>15</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three four\n", | |
"Ohio 0 1 2 3\n", | |
"Colorado 4 5 6 7\n", | |
"Utah 8 9 10 11\n", | |
"New York 12 13 14 15" | |
] | |
}, | |
"execution_count": 42, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df3" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 39, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" <th>four</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Ohio</th>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>4</td>\n", | |
" <td>5</td>\n", | |
" <td>6</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three four\n", | |
"Ohio 0 1 2 3\n", | |
"Colorado 4 5 6 7" | |
] | |
}, | |
"execution_count": 39, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df3[:2]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" <th>four</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>4</td>\n", | |
" <td>5</td>\n", | |
" <td>6</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Utah</th>\n", | |
" <td>8</td>\n", | |
" <td>9</td>\n", | |
" <td>10</td>\n", | |
" <td>11</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>New York</th>\n", | |
" <td>12</td>\n", | |
" <td>13</td>\n", | |
" <td>14</td>\n", | |
" <td>15</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three four\n", | |
"Colorado 4 5 6 7\n", | |
"Utah 8 9 10 11\n", | |
"New York 12 13 14 15" | |
] | |
}, | |
"execution_count": 28, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df3[df3['three'] > 5]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar comparison:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" <th>four</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Ohio</th>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>4</td>\n", | |
" <td>5</td>\n", | |
" <td>6</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Utah</th>\n", | |
" <td>8</td>\n", | |
" <td>9</td>\n", | |
" <td>10</td>\n", | |
" <td>11</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>New York</th>\n", | |
" <td>12</td>\n", | |
" <td>13</td>\n", | |
" <td>14</td>\n", | |
" <td>15</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three four\n", | |
"Ohio 0 1 2 3\n", | |
"Colorado 4 5 6 7\n", | |
"Utah 8 9 10 11\n", | |
"New York 12 13 14 15" | |
] | |
}, | |
"execution_count": 29, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df3" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" <th>four</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Ohio</th>\n", | |
" <td>True</td>\n", | |
" <td>True</td>\n", | |
" <td>True</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>True</td>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Utah</th>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>New York</th>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" <td>False</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three four\n", | |
"Ohio True True True True\n", | |
"Colorado True False False False\n", | |
"Utah False False False False\n", | |
"New York False False False False" | |
] | |
}, | |
"execution_count": 30, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df3 < 5" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" <th>four</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Ohio</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>6</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Utah</th>\n", | |
" <td>8</td>\n", | |
" <td>9</td>\n", | |
" <td>10</td>\n", | |
" <td>11</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>New York</th>\n", | |
" <td>12</td>\n", | |
" <td>13</td>\n", | |
" <td>14</td>\n", | |
" <td>15</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three four\n", | |
"Ohio 0 0 0 0\n", | |
"Colorado 0 5 6 7\n", | |
"Utah 8 9 10 11\n", | |
"New York 12 13 14 15" | |
] | |
}, | |
"execution_count": 31, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df3[df3 < 5] = 0\n", | |
"df3" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"You can also drop columns using index:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 32, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" <th>four</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Ohio</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>6</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Utah</th>\n", | |
" <td>8</td>\n", | |
" <td>9</td>\n", | |
" <td>10</td>\n", | |
" <td>11</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>New York</th>\n", | |
" <td>12</td>\n", | |
" <td>13</td>\n", | |
" <td>14</td>\n", | |
" <td>15</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three four\n", | |
"Ohio 0 0 0 0\n", | |
"Colorado 0 5 6 7\n", | |
"Utah 8 9 10 11\n", | |
"New York 12 13 14 15" | |
] | |
}, | |
"execution_count": 32, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df3" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" <th>four</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Utah</th>\n", | |
" <td>8</td>\n", | |
" <td>9</td>\n", | |
" <td>10</td>\n", | |
" <td>11</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>New York</th>\n", | |
" <td>12</td>\n", | |
" <td>13</td>\n", | |
" <td>14</td>\n", | |
" <td>15</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three four\n", | |
"Utah 8 9 10 11\n", | |
"New York 12 13 14 15" | |
] | |
}, | |
"execution_count": 33, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df3.drop(['Colorado', 'Ohio'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 43, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>three</th>\n", | |
" <th>four</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Ohio</th>\n", | |
" <td>0</td>\n", | |
" <td>2</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>4</td>\n", | |
" <td>6</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Utah</th>\n", | |
" <td>8</td>\n", | |
" <td>10</td>\n", | |
" <td>11</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>New York</th>\n", | |
" <td>12</td>\n", | |
" <td>14</td>\n", | |
" <td>15</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one three four\n", | |
"Ohio 0 2 3\n", | |
"Colorado 4 6 7\n", | |
"Utah 8 10 11\n", | |
"New York 12 14 15" | |
] | |
}, | |
"execution_count": 43, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df3.drop('two', axis=1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>three</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Ohio</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>0</td>\n", | |
" <td>6</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Utah</th>\n", | |
" <td>8</td>\n", | |
" <td>10</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>New York</th>\n", | |
" <td>12</td>\n", | |
" <td>14</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one three\n", | |
"Ohio 0 0\n", | |
"Colorado 0 6\n", | |
"Utah 8 10\n", | |
"New York 12 14" | |
] | |
}, | |
"execution_count": 35, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df3.drop(['two', 'four'], axis=1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"One Final Note: While many pandas functions (like reindex) require that the labels be unique, it’s not mandatory. Data selection is one of the main things that behaves differently with duplicates. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 37, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"a 0\n", | |
"a 1\n", | |
"b 2\n", | |
"b 3\n", | |
"c 4\n", | |
"dtype: int64" | |
] | |
}, | |
"execution_count": 37, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"obj" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The index’s is_unique property can tell you whether its values are unique or not. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 38, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"False" | |
] | |
}, | |
"execution_count": 38, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"obj.index.is_unique" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Indexing a value with multiple entries returns a Series " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 39, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"a 0\n", | |
"a 1\n", | |
"dtype: int64" | |
] | |
}, | |
"execution_count": 39, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"obj['a']" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Indexing a value with single entry return a scalar value:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 40, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"4" | |
] | |
}, | |
"execution_count": 40, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"obj['c']" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Exercise" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 44, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"df4 = pd.DataFrame(np.arange(25).reshape((5, 5)),\n", | |
" index=['Ohio', 'Colorado', 'Utah', 'New York', 'California'],\n", | |
" columns=['one', 'two', 'three', 'four', 'five'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 45, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" <th>four</th>\n", | |
" <th>five</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Ohio</th>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>3</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>5</td>\n", | |
" <td>6</td>\n", | |
" <td>7</td>\n", | |
" <td>8</td>\n", | |
" <td>9</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Utah</th>\n", | |
" <td>10</td>\n", | |
" <td>11</td>\n", | |
" <td>12</td>\n", | |
" <td>13</td>\n", | |
" <td>14</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>New York</th>\n", | |
" <td>15</td>\n", | |
" <td>16</td>\n", | |
" <td>17</td>\n", | |
" <td>18</td>\n", | |
" <td>19</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>California</th>\n", | |
" <td>20</td>\n", | |
" <td>21</td>\n", | |
" <td>22</td>\n", | |
" <td>23</td>\n", | |
" <td>24</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three four five\n", | |
"Ohio 0 1 2 3 4\n", | |
"Colorado 5 6 7 8 9\n", | |
"Utah 10 11 12 13 14\n", | |
"New York 15 16 17 18 19\n", | |
"California 20 21 22 23 24" | |
] | |
}, | |
"execution_count": 45, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df4" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"select rows \"Ohio\", \"Colorado\", and all columns" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 51, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" <th>four</th>\n", | |
" <th>five</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Ohio</th>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>3</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>5</td>\n", | |
" <td>6</td>\n", | |
" <td>7</td>\n", | |
" <td>8</td>\n", | |
" <td>9</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three four five\n", | |
"Ohio 0 1 2 3 4\n", | |
"Colorado 5 6 7 8 9" | |
] | |
}, | |
"execution_count": 51, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df4[0:2]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"select columns \"one\", \"two\", and all rows" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 53, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Ohio</th>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>5</td>\n", | |
" <td>6</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Utah</th>\n", | |
" <td>10</td>\n", | |
" <td>11</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>New York</th>\n", | |
" <td>15</td>\n", | |
" <td>16</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>California</th>\n", | |
" <td>20</td>\n", | |
" <td>21</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two\n", | |
"Ohio 0 1\n", | |
"Colorado 5 6\n", | |
"Utah 10 11\n", | |
"New York 15 16\n", | |
"California 20 21" | |
] | |
}, | |
"execution_count": 53, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df4[[\"one\", \"two\"]]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"select all rows where column \"three\" = 7" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 56, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" <th>four</th>\n", | |
" <th>five</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>5</td>\n", | |
" <td>6</td>\n", | |
" <td>7</td>\n", | |
" <td>8</td>\n", | |
" <td>9</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three four five\n", | |
"Colorado 5 6 7 8 9" | |
] | |
}, | |
"execution_count": 56, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df4[df4[\"three\"]==7]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"drop column \"three\"" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 59, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>four</th>\n", | |
" <th>five</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Ohio</th>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>3</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>5</td>\n", | |
" <td>6</td>\n", | |
" <td>8</td>\n", | |
" <td>9</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Utah</th>\n", | |
" <td>10</td>\n", | |
" <td>11</td>\n", | |
" <td>13</td>\n", | |
" <td>14</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>New York</th>\n", | |
" <td>15</td>\n", | |
" <td>16</td>\n", | |
" <td>18</td>\n", | |
" <td>19</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>California</th>\n", | |
" <td>20</td>\n", | |
" <td>21</td>\n", | |
" <td>23</td>\n", | |
" <td>24</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two four five\n", | |
"Ohio 0 1 3 4\n", | |
"Colorado 5 6 8 9\n", | |
"Utah 10 11 13 14\n", | |
"New York 15 16 18 19\n", | |
"California 20 21 23 24" | |
] | |
}, | |
"execution_count": 59, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df4.drop([\"three\"], axis = 1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"drop column \"three\" again, but in a different way" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 62, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>four</th>\n", | |
" <th>five</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Ohio</th>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>3</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Colorado</th>\n", | |
" <td>5</td>\n", | |
" <td>6</td>\n", | |
" <td>8</td>\n", | |
" <td>9</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Utah</th>\n", | |
" <td>10</td>\n", | |
" <td>11</td>\n", | |
" <td>13</td>\n", | |
" <td>14</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>New York</th>\n", | |
" <td>15</td>\n", | |
" <td>16</td>\n", | |
" <td>18</td>\n", | |
" <td>19</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>California</th>\n", | |
" <td>20</td>\n", | |
" <td>21</td>\n", | |
" <td>23</td>\n", | |
" <td>24</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two four five\n", | |
"Ohio 0 1 3 4\n", | |
"Colorado 5 6 8 9\n", | |
"Utah 10 11 13 14\n", | |
"New York 15 16 18 19\n", | |
"California 20 21 23 24" | |
] | |
}, | |
"execution_count": 62, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"del df4[\"three\"]\n", | |
"df4" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**CHALLENGE**: select all columns where row \"Colorado\" >= 7. [Hint](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.transpose.html)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 66, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Ohio</th>\n", | |
" <th>Colorado</th>\n", | |
" <th>Utah</th>\n", | |
" <th>New York</th>\n", | |
" <th>California</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>four</th>\n", | |
" <td>3</td>\n", | |
" <td>8</td>\n", | |
" <td>13</td>\n", | |
" <td>18</td>\n", | |
" <td>23</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>five</th>\n", | |
" <td>4</td>\n", | |
" <td>9</td>\n", | |
" <td>14</td>\n", | |
" <td>19</td>\n", | |
" <td>24</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Ohio Colorado Utah New York California\n", | |
"four 3 8 13 18 23\n", | |
"five 4 9 14 19 24" | |
] | |
}, | |
"execution_count": 66, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df4.transpose = df4.T\n", | |
"df4.transpose\n", | |
"df4.transpose[df4.transpose[\"Colorado\"] >= 7]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 3. Join Dataframes" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"There are multiple ways of joining dataframes using pandas. The three most common ways are pandas.merge, pandas.concat, and combine_first. The method that is most familiar to relationship database (SQL) users is the pandas.merge command, so that's what we will focus on." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 67, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"df5_a = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})\n", | |
"df5_b = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 68, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>data1</th>\n", | |
" <th>key</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0</td>\n", | |
" <td>b</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>b</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2</td>\n", | |
" <td>a</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>3</td>\n", | |
" <td>c</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>4</td>\n", | |
" <td>a</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>5</td>\n", | |
" <td>a</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>6</td>\n", | |
" <td>b</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" data1 key\n", | |
"0 0 b\n", | |
"1 1 b\n", | |
"2 2 a\n", | |
"3 3 c\n", | |
"4 4 a\n", | |
"5 5 a\n", | |
"6 6 b" | |
] | |
}, | |
"execution_count": 68, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df5_a" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 69, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>data2</th>\n", | |
" <th>key</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0</td>\n", | |
" <td>a</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>b</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2</td>\n", | |
" <td>d</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" data2 key\n", | |
"0 0 a\n", | |
"1 1 b\n", | |
"2 2 d" | |
] | |
}, | |
"execution_count": 69, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df5_b" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This is an example of a many-to-one merge situation; the data in df5_a has multiple rows labeled a and b, whereas df5_b has only merge with these objects we obtain:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 70, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>data1</th>\n", | |
" <th>key</th>\n", | |
" <th>data2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0</td>\n", | |
" <td>b</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>b</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>6</td>\n", | |
" <td>b</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2</td>\n", | |
" <td>a</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>4</td>\n", | |
" <td>a</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>5</td>\n", | |
" <td>a</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" data1 key data2\n", | |
"0 0 b 1\n", | |
"1 1 b 1\n", | |
"2 6 b 1\n", | |
"3 2 a 0\n", | |
"4 4 a 0\n", | |
"5 5 a 0" | |
] | |
}, | |
"execution_count": 70, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"pd.merge(df5_a, df5_b)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Note that I didn’t specify which column to join on. If not specified, merge uses the overlapping column names as the keys. It’s a good practice to specify explicitly, though:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 71, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>data1</th>\n", | |
" <th>key</th>\n", | |
" <th>data2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0</td>\n", | |
" <td>b</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>b</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>6</td>\n", | |
" <td>b</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2</td>\n", | |
" <td>a</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>4</td>\n", | |
" <td>a</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>5</td>\n", | |
" <td>a</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" data1 key data2\n", | |
"0 0 b 1\n", | |
"1 1 b 1\n", | |
"2 6 b 1\n", | |
"3 2 a 0\n", | |
"4 4 a 0\n", | |
"5 5 a 0" | |
] | |
}, | |
"execution_count": 71, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"pd.merge(df5_a, df5_b, on='key')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Exercise" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"If the column names are different in each object, you can specify the key to join by explicitly. Join df6_a and df6_b below, where the parameters are left_on=? and right_on= ?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 81, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"df6_a = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})\n", | |
"df6_b = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 80, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>data1</th>\n", | |
" <th>lkey</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0</td>\n", | |
" <td>b</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>b</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2</td>\n", | |
" <td>a</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>3</td>\n", | |
" <td>c</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>4</td>\n", | |
" <td>a</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>5</td>\n", | |
" <td>a</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>6</td>\n", | |
" <td>b</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" data1 lkey\n", | |
"0 0 b\n", | |
"1 1 b\n", | |
"2 2 a\n", | |
"3 3 c\n", | |
"4 4 a\n", | |
"5 5 a\n", | |
"6 6 b" | |
] | |
}, | |
"execution_count": 80, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df6_a" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 50, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>data2</th>\n", | |
" <th>rkey</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0</td>\n", | |
" <td>a</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>b</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2</td>\n", | |
" <td>d</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" data2 rkey\n", | |
"0 0 a\n", | |
"1 1 b\n", | |
"2 2 d" | |
] | |
}, | |
"execution_count": 50, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df6_b" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 82, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>data1</th>\n", | |
" <th>lkey</th>\n", | |
" <th>data2</th>\n", | |
" <th>rkey</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0</td>\n", | |
" <td>b</td>\n", | |
" <td>1</td>\n", | |
" <td>b</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>b</td>\n", | |
" <td>1</td>\n", | |
" <td>b</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>6</td>\n", | |
" <td>b</td>\n", | |
" <td>1</td>\n", | |
" <td>b</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2</td>\n", | |
" <td>a</td>\n", | |
" <td>0</td>\n", | |
" <td>a</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>4</td>\n", | |
" <td>a</td>\n", | |
" <td>0</td>\n", | |
" <td>a</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>5</td>\n", | |
" <td>a</td>\n", | |
" <td>0</td>\n", | |
" <td>a</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>2</td>\n", | |
" <td>d</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" data1 lkey data2 rkey\n", | |
"0 0 b 1 b\n", | |
"1 1 b 1 b\n", | |
"2 6 b 1 b\n", | |
"3 2 a 0 a\n", | |
"4 4 a 0 a\n", | |
"5 5 a 0 a\n", | |
"6 NaN NaN 2 d" | |
] | |
}, | |
"execution_count": 82, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"pd.merge(df6_a, df6_b, right_on=\"rkey\", left_on=\"lkey\", how=\"right\")\n", | |
"\n", | |
"\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Questions?" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Pandas: Data Understanding" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 1. Descriptive Statistics" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series or a Series of values from the rows or columns of a DataFrame. Compared with the equivalent methods of vanilla NumPy arrays, they are all built from the ground up to exclude missing data." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 83, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"df7 = pd.DataFrame([[1.4, np.nan]\n", | |
" , [7.1, -4.5]\n", | |
" , [np.nan, np.nan]\n", | |
" , [0.75, -1.3]]\n", | |
" , index=['a', 'b', 'c', 'd']\n", | |
" , columns=['one', 'two'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 84, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>a</th>\n", | |
" <td>1.40</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>b</th>\n", | |
" <td>7.10</td>\n", | |
" <td>-4.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>c</th>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>d</th>\n", | |
" <td>0.75</td>\n", | |
" <td>-1.3</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two\n", | |
"a 1.40 NaN\n", | |
"b 7.10 -4.5\n", | |
"c NaN NaN\n", | |
"d 0.75 -1.3" | |
] | |
}, | |
"execution_count": 84, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df7" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Calling DataFrame’s sum method returns a Series containing column sums:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 85, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"one 9.25\n", | |
"two -5.80\n", | |
"dtype: float64" | |
] | |
}, | |
"execution_count": 85, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df7.sum()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Passing axis=1 sums over the rows instead:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 86, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"a 1.40\n", | |
"b 2.60\n", | |
"c 0.00\n", | |
"d -0.55\n", | |
"dtype: float64" | |
] | |
}, | |
"execution_count": 86, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df7.sum(axis = 1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"NA values are excluded unless the entire slice (row or column in this case) is NA. This can be disabled using the skipna option:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 55, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"a NaN\n", | |
"b 1.300\n", | |
"c NaN\n", | |
"d -0.275\n", | |
"dtype: float64" | |
] | |
}, | |
"execution_count": 55, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df7.mean(axis=1, skipna=False)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Other cumulative functions also include: sum, mean, median, var, std, skew, etc." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Exercise" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 87, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"df8 = pd.DataFrame([[1.4, np.nan, 1]\n", | |
" , [7.1, -4.5, 2]\n", | |
" , [np.nan, np.nan, 3]\n", | |
" , [0.75, -1.3, np.nan]]\n", | |
" , index=['a', 'b', 'c', 'd']\n", | |
" , columns=['one', 'two', 'three'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 57, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>a</th>\n", | |
" <td>1.40</td>\n", | |
" <td>NaN</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>b</th>\n", | |
" <td>7.10</td>\n", | |
" <td>-4.5</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>c</th>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>d</th>\n", | |
" <td>0.75</td>\n", | |
" <td>-1.3</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three\n", | |
"a 1.40 NaN 1\n", | |
"b 7.10 -4.5 2\n", | |
"c NaN NaN 3\n", | |
"d 0.75 -1.3 NaN" | |
] | |
}, | |
"execution_count": 57, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df8" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Find the median of df8 along axis = 1, skipping over missing values" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 97, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"a 1.200000\n", | |
"b 1.533333\n", | |
"c 3.000000\n", | |
"d -0.275000\n", | |
"dtype: float64" | |
] | |
}, | |
"execution_count": 97, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df8.mean(axis = 1, skipna = True)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Find the standard deviation (std) of df8 along axis = 0, including the missing values" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 96, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"one NaN\n", | |
"two NaN\n", | |
"three NaN\n", | |
"dtype: float64" | |
] | |
}, | |
"execution_count": 96, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df8.std(axis = 0, skipna = False)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 2. Count() and Describe()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Describe() produces multiple summary statistics in one shot:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 58, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"df9 = pd.DataFrame([[1.4, np.nan, 1]\n", | |
" , [7.1, -4.5, 2]\n", | |
" , [np.nan, np.nan, 3]\n", | |
" , [0.75, -1.3, np.nan]]\n", | |
" , index=['a', 'b', 'c', 'd']\n", | |
" , columns=['one', 'two', 'three'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 59, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>a</th>\n", | |
" <td>1.40</td>\n", | |
" <td>NaN</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>b</th>\n", | |
" <td>7.10</td>\n", | |
" <td>-4.5</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>c</th>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>d</th>\n", | |
" <td>0.75</td>\n", | |
" <td>-1.3</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three\n", | |
"a 1.40 NaN 1\n", | |
"b 7.10 -4.5 2\n", | |
"c NaN NaN 3\n", | |
"d 0.75 -1.3 NaN" | |
] | |
}, | |
"execution_count": 59, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df9" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 60, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>count</th>\n", | |
" <td>3.000000</td>\n", | |
" <td>2.000000</td>\n", | |
" <td>3.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>mean</th>\n", | |
" <td>3.083333</td>\n", | |
" <td>-2.900000</td>\n", | |
" <td>2.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>std</th>\n", | |
" <td>3.493685</td>\n", | |
" <td>2.262742</td>\n", | |
" <td>1.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>min</th>\n", | |
" <td>0.750000</td>\n", | |
" <td>-4.500000</td>\n", | |
" <td>1.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>25%</th>\n", | |
" <td>1.075000</td>\n", | |
" <td>-3.700000</td>\n", | |
" <td>1.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>50%</th>\n", | |
" <td>1.400000</td>\n", | |
" <td>-2.900000</td>\n", | |
" <td>2.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>75%</th>\n", | |
" <td>4.250000</td>\n", | |
" <td>-2.100000</td>\n", | |
" <td>2.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>max</th>\n", | |
" <td>7.100000</td>\n", | |
" <td>-1.300000</td>\n", | |
" <td>3.0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three\n", | |
"count 3.000000 2.000000 3.0\n", | |
"mean 3.083333 -2.900000 2.0\n", | |
"std 3.493685 2.262742 1.0\n", | |
"min 0.750000 -4.500000 1.0\n", | |
"25% 1.075000 -3.700000 1.5\n", | |
"50% 1.400000 -2.900000 2.0\n", | |
"75% 4.250000 -2.100000 2.5\n", | |
"max 7.100000 -1.300000 3.0" | |
] | |
}, | |
"execution_count": 60, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df9.describe()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Exercise" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 100, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"df10 = pd.DataFrame([[1.4, np.nan, 1, 3]\n", | |
" , [7.1, -4.5, 2, 5]\n", | |
" , [np.nan, np.nan, 3, 7]\n", | |
" , [0.75, -1.3, np.nan, 8]]\n", | |
" , index=['a', 'b', 'c', 'd']\n", | |
" , columns=['one', 'two', 'three', 'four'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 62, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" <th>four</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>a</th>\n", | |
" <td>1.40</td>\n", | |
" <td>NaN</td>\n", | |
" <td>1</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>b</th>\n", | |
" <td>7.10</td>\n", | |
" <td>-4.5</td>\n", | |
" <td>2</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>c</th>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>3</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>d</th>\n", | |
" <td>0.75</td>\n", | |
" <td>-1.3</td>\n", | |
" <td>NaN</td>\n", | |
" <td>8</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three four\n", | |
"a 1.40 NaN 1 3\n", | |
"b 7.10 -4.5 2 5\n", | |
"c NaN NaN 3 7\n", | |
"d 0.75 -1.3 NaN 8" | |
] | |
}, | |
"execution_count": 62, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df10 " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Use the describe function on df10. Anything interesting?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 102, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>one</th>\n", | |
" <th>two</th>\n", | |
" <th>three</th>\n", | |
" <th>four</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>count</th>\n", | |
" <td>3.000000</td>\n", | |
" <td>2.000000</td>\n", | |
" <td>3.0</td>\n", | |
" <td>4.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>mean</th>\n", | |
" <td>3.083333</td>\n", | |
" <td>-2.900000</td>\n", | |
" <td>2.0</td>\n", | |
" <td>5.750000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>std</th>\n", | |
" <td>3.493685</td>\n", | |
" <td>2.262742</td>\n", | |
" <td>1.0</td>\n", | |
" <td>2.217356</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>min</th>\n", | |
" <td>0.750000</td>\n", | |
" <td>-4.500000</td>\n", | |
" <td>1.0</td>\n", | |
" <td>3.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>25%</th>\n", | |
" <td>1.075000</td>\n", | |
" <td>-3.700000</td>\n", | |
" <td>1.5</td>\n", | |
" <td>4.500000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>50%</th>\n", | |
" <td>1.400000</td>\n", | |
" <td>-2.900000</td>\n", | |
" <td>2.0</td>\n", | |
" <td>6.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>75%</th>\n", | |
" <td>4.250000</td>\n", | |
" <td>-2.100000</td>\n", | |
" <td>2.5</td>\n", | |
" <td>7.250000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>max</th>\n", | |
" <td>7.100000</td>\n", | |
" <td>-1.300000</td>\n", | |
" <td>3.0</td>\n", | |
" <td>8.000000</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" one two three four\n", | |
"count 3.000000 2.000000 3.0 4.000000\n", | |
"mean 3.083333 -2.900000 2.0 5.750000\n", | |
"std 3.493685 2.262742 1.0 2.217356\n", | |
"min 0.750000 -4.500000 1.0 3.000000\n", | |
"25% 1.075000 -3.700000 1.5 4.500000\n", | |
"50% 1.400000 -2.900000 2.0 6.000000\n", | |
"75% 4.250000 -2.100000 2.5 7.250000\n", | |
"max 7.100000 -1.300000 3.0 8.000000" | |
] | |
}, | |
"execution_count": 102, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df10.describe()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 3. Duplicates" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 63, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"df11 = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4, 4, 4],\n", | |
" 'Qu2': [2, 3, 1, 2, 3, 3, 3],\n", | |
" 'Qu3': [1, 5, 2, 4, 4, 4, 4]})" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 64, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Qu1</th>\n", | |
" <th>Qu2</th>\n", | |
" <th>Qu3</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>4</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>3</td>\n", | |
" <td>2</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>4</td>\n", | |
" <td>3</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>4</td>\n", | |
" <td>3</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>4</td>\n", | |
" <td>3</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Qu1 Qu2 Qu3\n", | |
"0 1 2 1\n", | |
"1 3 3 5\n", | |
"2 4 1 2\n", | |
"3 3 2 4\n", | |
"4 4 3 4\n", | |
"5 4 3 4\n", | |
"6 4 3 4" | |
] | |
}, | |
"execution_count": 64, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df11" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"By default, drop_duplicates removes entire rows of duplicates, keeping the first occurence of duplication." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 65, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Qu1</th>\n", | |
" <th>Qu2</th>\n", | |
" <th>Qu3</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>4</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>3</td>\n", | |
" <td>2</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>4</td>\n", | |
" <td>3</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Qu1 Qu2 Qu3\n", | |
"0 1 2 1\n", | |
"1 3 3 5\n", | |
"2 4 1 2\n", | |
"3 3 2 4\n", | |
"4 4 3 4" | |
] | |
}, | |
"execution_count": 65, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df11.drop_duplicates()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Exerise" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 103, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"df12 = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4, 4, 4, 1, 0, 1],\n", | |
" 'Qu2': [2, 3, 1, 2, 3, 3, 3, 1, 0, 1],\n", | |
" 'Qu3': [1, 5, 2, 4, 4, 4, 4, 1, 0, 1]})" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 67, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Qu1</th>\n", | |
" <th>Qu2</th>\n", | |
" <th>Qu3</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>4</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>3</td>\n", | |
" <td>2</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>4</td>\n", | |
" <td>3</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>4</td>\n", | |
" <td>3</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>4</td>\n", | |
" <td>3</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Qu1 Qu2 Qu3\n", | |
"0 1 2 1\n", | |
"1 3 3 5\n", | |
"2 4 1 2\n", | |
"3 3 2 4\n", | |
"4 4 3 4\n", | |
"5 4 3 4\n", | |
"6 4 3 4\n", | |
"7 1 1 1\n", | |
"8 0 0 0\n", | |
"9 1 1 1" | |
] | |
}, | |
"execution_count": 67, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df12" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Read up documentation for drop_duplicates and discuss. What other alternatives are there for this command?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"#http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 109, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Qu1</th>\n", | |
" <th>Qu2</th>\n", | |
" <th>Qu3</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>4</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>3</td>\n", | |
" <td>2</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>4</td>\n", | |
" <td>3</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Qu1 Qu2 Qu3\n", | |
"0 1 2 1\n", | |
"1 3 3 5\n", | |
"2 4 1 2\n", | |
"3 3 2 4\n", | |
"4 4 3 4\n", | |
"7 1 1 1\n", | |
"8 0 0 0" | |
] | |
}, | |
"execution_count": 109, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df12.drop_duplicates()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"How would you drop duplicates in place?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 112, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Qu1</th>\n", | |
" <th>Qu2</th>\n", | |
" <th>Qu3</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>4</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>3</td>\n", | |
" <td>2</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>4</td>\n", | |
" <td>3</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Qu1 Qu2 Qu3\n", | |
"0 1 2 1\n", | |
"1 3 3 5\n", | |
"2 4 1 2\n", | |
"3 3 2 4\n", | |
"4 4 3 4\n", | |
"7 1 1 1\n", | |
"8 0 0 0" | |
] | |
}, | |
"execution_count": 112, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df12.drop_duplicates(inplace = True)\n", | |
"df12" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"How would you drop all duplicates?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 113, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Qu1</th>\n", | |
" <th>Qu2</th>\n", | |
" <th>Qu3</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>4</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>3</td>\n", | |
" <td>2</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>4</td>\n", | |
" <td>3</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Qu1 Qu2 Qu3\n", | |
"0 1 2 1\n", | |
"1 3 3 5\n", | |
"2 4 1 2\n", | |
"3 3 2 4\n", | |
"4 4 3 4\n", | |
"7 1 1 1\n", | |
"8 0 0 0" | |
] | |
}, | |
"execution_count": 113, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df12.drop_duplicates(keep = False)\n", | |
"df12" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 4. Correlation and Covariance" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let’s consider some DataFrames of stock prices and volumes obtained from Yahoo! Finance:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 114, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"df14_returns = pd.DataFrame([\n", | |
" [-0.084311, 0.014634, -0.086538]\n", | |
" , [-0.033944, 0.035137, -0.017241]\n", | |
" , [-0.033780, 0.010544, -0.033498]\n", | |
" ]\n", | |
" , index=['2000-01-03', '2000-01-03', '2000-01-03']\n", | |
" , columns=['AAPL', 'IBM', 'MSFT'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 69, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>AAPL</th>\n", | |
" <th>IBM</th>\n", | |
" <th>MSFT</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>2000-01-03</th>\n", | |
" <td>-0.084311</td>\n", | |
" <td>0.014634</td>\n", | |
" <td>-0.086538</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2000-01-03</th>\n", | |
" <td>-0.033944</td>\n", | |
" <td>0.035137</td>\n", | |
" <td>-0.017241</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2000-01-03</th>\n", | |
" <td>-0.033780</td>\n", | |
" <td>0.010544</td>\n", | |
" <td>-0.033498</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" AAPL IBM MSFT\n", | |
"2000-01-03 -0.084311 0.014634 -0.086538\n", | |
"2000-01-03 -0.033944 0.035137 -0.017241\n", | |
"2000-01-03 -0.033780 0.010544 -0.033498" | |
] | |
}, | |
"execution_count": 69, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df14_returns.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"DataFrame’s corr and cov methods, return a full correlation or covariance matrix as a DataFrame, respectively:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 70, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>AAPL</th>\n", | |
" <th>IBM</th>\n", | |
" <th>MSFT</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>AAPL</th>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.356919</td>\n", | |
" <td>0.973884</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>IBM</th>\n", | |
" <td>0.356919</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.559689</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>MSFT</th>\n", | |
" <td>0.973884</td>\n", | |
" <td>0.559689</td>\n", | |
" <td>1.000000</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" AAPL IBM MSFT\n", | |
"AAPL 1.000000 0.356919 0.973884\n", | |
"IBM 0.356919 1.000000 0.559689\n", | |
"MSFT 0.973884 0.559689 1.000000" | |
] | |
}, | |
"execution_count": 70, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df14_returns.corr()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 71, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>AAPL</th>\n", | |
" <th>IBM</th>\n", | |
" <th>MSFT</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>AAPL</th>\n", | |
" <td>0.000848</td>\n", | |
" <td>0.000137</td>\n", | |
" <td>0.001028</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>IBM</th>\n", | |
" <td>0.000137</td>\n", | |
" <td>0.000174</td>\n", | |
" <td>0.000267</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>MSFT</th>\n", | |
" <td>0.001028</td>\n", | |
" <td>0.000267</td>\n", | |
" <td>0.001313</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" AAPL IBM MSFT\n", | |
"AAPL 0.000848 0.000137 0.001028\n", | |
"IBM 0.000137 0.000174 0.000267\n", | |
"MSFT 0.001028 0.000267 0.001313" | |
] | |
}, | |
"execution_count": 71, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df14_returns.cov()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"If we only want to look at the correlation and covariance between two stock prices:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 72, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.55968873895232241" | |
] | |
}, | |
"execution_count": 72, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df14_returns.MSFT.corr(df14_returns.IBM)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 73, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.00026727853200000001" | |
] | |
}, | |
"execution_count": 73, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df14_returns.MSFT.cov(df14_returns.IBM)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Exercise" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Using the same dataset as above, what is the correlation between AAPL and IBM?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 117, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.35691922517121777" | |
] | |
}, | |
"execution_count": 117, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df14_returns.AAPL.corr(df14_returns.IBM)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Using the same dataset as above, what is the covariance between AAPL and IBM?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 118, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.00013699492649999997" | |
] | |
}, | |
"execution_count": 118, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df14_returns.AAPL.cov(df14_returns.IBM)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This will take a bit of researching online, but instead of Pearson correlation, how do I output the Spearman correlation instead? Once you find out, please output the correlation matrix for Spearman." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 119, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>AAPL</th>\n", | |
" <th>IBM</th>\n", | |
" <th>MSFT</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>AAPL</th>\n", | |
" <td>1.0</td>\n", | |
" <td>-0.5</td>\n", | |
" <td>0.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>IBM</th>\n", | |
" <td>-0.5</td>\n", | |
" <td>1.0</td>\n", | |
" <td>0.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>MSFT</th>\n", | |
" <td>0.5</td>\n", | |
" <td>0.5</td>\n", | |
" <td>1.0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" AAPL IBM MSFT\n", | |
"AAPL 1.0 -0.5 0.5\n", | |
"IBM -0.5 1.0 0.5\n", | |
"MSFT 0.5 0.5 1.0" | |
] | |
}, | |
"execution_count": 119, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df14_returns.corr(method = \"spearman\")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 5. How to Handle Missing Data" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Missing data is common in most data analysis applications. The default for most commands on pandas objects is to exclude missing data. Pandas uses the floating point value NaN (Not a Number) to represent missing data in both floating as well as in non-floating point arrays." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 121, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"df15 = pd.DataFrame([[1., 6.5, 3.], \n", | |
" [1., np.nan, np.nan],\n", | |
" [np.nan, np.nan, np.nan], \n", | |
" [np.nan, 6.5, 3.]])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 75, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>NaN</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2\n", | |
"0 1 6.5 3\n", | |
"1 1 NaN NaN\n", | |
"2 NaN NaN NaN\n", | |
"3 NaN 6.5 3" | |
] | |
}, | |
"execution_count": 75, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df15" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"There are a few ways to filter out / drop missing values. By default, dropna() drops all rows that have any missing values" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 76, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2\n", | |
"0 1 6.5 3" | |
] | |
}, | |
"execution_count": 76, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df15.dropna()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"By passing how='all', this will only drop rows that are all NA:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 77, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>NaN</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2\n", | |
"0 1 6.5 3\n", | |
"1 1 NaN NaN\n", | |
"3 NaN 6.5 3" | |
] | |
}, | |
"execution_count": 77, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df15.dropna(how = 'all')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"If instead of dropping rows, we want to drop columns, then change the axis:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 78, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>NaN</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2\n", | |
"0 1 6.5 3\n", | |
"1 1 NaN NaN\n", | |
"2 NaN NaN NaN\n", | |
"3 NaN 6.5 3" | |
] | |
}, | |
"execution_count": 78, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df15.dropna(axis = 1, how = 'all')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Okay, what if we don't want to drop, and we want to impute the data instead? Use fillna()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 79, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2\n", | |
"0 1 6.5 3\n", | |
"1 1 6.5 3\n", | |
"2 1 6.5 3\n", | |
"3 1 6.5 3" | |
] | |
}, | |
"execution_count": 79, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df15.fillna(method='ffill')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Exercise" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"What are the 5 options for \"method\" in the fillna command? What does each of them mean?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 122, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>NaN</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>NaN</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2\n", | |
"0 1 6.5 3\n", | |
"1 1 6.5 3\n", | |
"2 NaN 6.5 3\n", | |
"3 NaN 6.5 3" | |
] | |
}, | |
"execution_count": 122, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df15.fillna(method='bfill')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Use the following dataset, visualize each of the five fillna() methods. What is the difference?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 123, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"df16 = pd.DataFrame([[1., 6.5, 3.], \n", | |
" [1., np.nan, np.nan],\n", | |
" [np.nan, np.nan, np.nan], \n", | |
" [np.nan, 6.5, 3.]])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 81, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>NaN</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2\n", | |
"0 1 6.5 3\n", | |
"1 1 NaN NaN\n", | |
"2 NaN NaN NaN\n", | |
"3 NaN 6.5 3" | |
] | |
}, | |
"execution_count": 81, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df16" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 125, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>NaN</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>NaN</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2\n", | |
"0 1 6.5 3\n", | |
"1 1 6.5 3\n", | |
"2 NaN 6.5 3\n", | |
"3 NaN 6.5 3" | |
] | |
}, | |
"execution_count": 125, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df16.fillna(method='backfill')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 129, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>NaN</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>NaN</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2\n", | |
"0 1 6.5 3\n", | |
"1 1 6.5 3\n", | |
"2 NaN 6.5 3\n", | |
"3 NaN 6.5 3" | |
] | |
}, | |
"execution_count": 129, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df16.fillna(method='bfill')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 130, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2\n", | |
"0 1 6.5 3\n", | |
"1 1 6.5 3\n", | |
"2 1 6.5 3\n", | |
"3 1 6.5 3" | |
] | |
}, | |
"execution_count": 130, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df16.fillna(method='pad')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 131, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2\n", | |
"0 1 6.5 3\n", | |
"1 1 6.5 3\n", | |
"2 1 6.5 3\n", | |
"3 1 6.5 3" | |
] | |
}, | |
"execution_count": 131, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df16.fillna(method='ffill')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 145, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>None</td>\n", | |
" <td>None</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>None</td>\n", | |
" <td>None</td>\n", | |
" <td>None</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>None</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2\n", | |
"0 1 6.5 3\n", | |
"1 1 None None\n", | |
"2 None None None\n", | |
"3 None 6.5 3" | |
] | |
}, | |
"execution_count": 145, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df16.fillna('None')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Using the replace() command and df16, how would you replace all missing values with -999 instead?\n", | |
"Hint: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 142, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>-999.0</td>\n", | |
" <td>-999</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>-999</td>\n", | |
" <td>-999.0</td>\n", | |
" <td>-999</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>-999</td>\n", | |
" <td>6.5</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2\n", | |
"0 1 6.5 3\n", | |
"1 1 -999.0 -999\n", | |
"2 -999 -999.0 -999\n", | |
"3 -999 6.5 3" | |
] | |
}, | |
"execution_count": 142, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df15.replace(to_replace = \"NaN\", value = -999)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**Note**: \n", | |
"\n", | |
"The ffill() function is equivalent to fillna(method='ffill') and bfill() is equivalent to fillna(method='bfill').\n", | |
"\n", | |
"pad / ffill = fill values forward\n", | |
"bfill / backfill = fill values backward" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Questions?" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Matplotlib: Data Visualizing" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 37, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"%matplotlib inline\n", | |
"import matplotlib\n", | |
"import matplotlib.pyplot as plt" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Matplotlib is the most popular Python library for producing plots and other 2D data visualizations. Basic matplotlib graphs are easy to make. The tricker parts are in the details." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"My favorite tutorial on matplotlib: https://github.com/rougier/matplotlib-tutorial" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 1. Two Simple Sine Graphs" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 83, | |
"metadata": { | |
"collapsed": false | |
}, | |
"source": [ | |
"X = np.linspace(-np.pi, np.pi, 256, endpoint=True)\n", | |
"C, S = np.cos(X), np.sin(X)\n", | |
"\n", | |
"plt.plot(X, C)\n", | |
"plt.plot(X, S)\n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 2. Scatterplot" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 153, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x1177f53d0>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"n = 1024\n", | |
"X = np.random.normal(0,1,n)\n", | |
"Y = np.random.normal(0,1,n)\n", | |
"\n", | |
"plt.scatter(X,Y)\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Adding some formatting, can you tell what we did?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 148, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x1177e6350>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"n = 1024\n", | |
"X = np.random.normal(0,1,n)\n", | |
"Y = np.random.normal(0,1,n)\n", | |
"T = np.arctan2(Y,X)\n", | |
"\n", | |
"plt.axes([0.025,0.025,0.95,0.95])\n", | |
"plt.scatter(X,Y, s=75, c=T, alpha=.5)\n", | |
"\n", | |
"plt.xlim(-1.5,1.5), plt.xticks([])\n", | |
"plt.ylim(-1.5,1.5), plt.yticks([])\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 3. Bar Plots" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 159, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x1181cab50>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"n = 12\n", | |
"X = np.arange(n)\n", | |
"Y2 = (1-X/float(n)) * np.random.uniform(0.5,1.0,n)\n", | |
"\n", | |
"plt.bar(X, Y2, facecolor = 'pink', edgecolor='None')\n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Challenge Exercise\n", | |
"Supposed we have want to combine the two graphs above, and make the graph below, what should we do?" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"![hello](http://www.labri.fr/perso/nrougier/teaching/matplotlib/figures/bar_ex.png)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 170, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x117882650>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"n = 12\n", | |
"X = np.arange(n)\n", | |
"Y1 = (1-X/float(n)) * np.random.uniform(0.5,1.0,n)\n", | |
"Y2 = (1-X/float(n)) * np.random.uniform(0.5,1.0,n)\n", | |
"\n", | |
"## enter code here\n", | |
"plt.bar(X, Y1, facecolor = 'blue', edgecolor='None')\n", | |
"plt.bar(X, -Y2, facecolor = 'pink', edgecolor='None')\n", | |
"\n", | |
"for x,y in zip(X,Y1):\n", | |
" plt.text(x+0.4, y+0.05, '%.2f' % y, ha='center', va= 'bottom')\n", | |
" \n", | |
"plt.ylim(-1.25, +1.25)\n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Questions?" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Scikit-Learn: Model Building" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Where does data cleaning, data exploration, data visualization and modeling building all fit in the grand scheme of things?" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"![hello](http://cacm.acm.org/system/assets/0001/3678/rp-overview.jpg)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"###### Source: http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### To understand some of the common terminologies of machine learning, first go through this interactive tutorial:\n", | |
"### [R2D3's Visual Introduction to Machine Learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)\n", | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Let's go over the following terminology:\n", | |
"- Features / predictors / variables / X\n", | |
"- Outcome / predictions / Y\n", | |
"- Classification / regression \n", | |
"- Training dataset / testing dataset\n", | |
"- Overfitting" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.datasets import make_blobs\n", | |
"from sklearn import datasets\n", | |
"from sklearn.cross_validation import train_test_split\n", | |
"from sklearn.linear_model import LinearRegression\n", | |
"from sklearn.linear_model import LogisticRegression" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 1. Regression Model" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In regression, the predictor variable is continuous. Let's start with a simple dataset of sin curve with noise." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 172, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"x = np.linspace(-3, 3, 100)\n", | |
"y = x + np.random.normal(0,1,100)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 173, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[<matplotlib.lines.Line2D at 0x117e98410>]" | |
] | |
}, | |
"execution_count": 173, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x117a2e410>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"plt.plot(x, y, 'o')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"One of the simplest models again is a regression, which fits a line to the scatterplot data. This is called ordinary least squares linear regression. The function in sklearn is LinearRegression. \n", | |
"\n", | |
"Before applying the regression however, X must be a 2d-array:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 174, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"(100,)\n", | |
"(100, 1)\n" | |
] | |
} | |
], | |
"source": [ | |
"print(x.shape)\n", | |
"X = x[:, np.newaxis]\n", | |
"print(X.shape)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We split our data in a training and a test set, to alleviate overfitting." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 175, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"X_train, X_test, y_train, y_test = train_test_split(X, y)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Build our regression model on training data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 176, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)" | |
] | |
}, | |
"execution_count": 176, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"regressor = LinearRegression()\n", | |
"regressor.fit(X_train, y_train)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"First, evaluate the model on training data, the same data we built the model on." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 177, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"y_pred_train = regressor.predict(X_train)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let's visualize the line and the data." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 178, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<matplotlib.legend.Legend at 0x11789a6d0>" | |
] | |
}, | |
"execution_count": 178, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x11826c390>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"plt.plot(X_train, y_train, 'o', label=\"data\")\n", | |
"plt.plot(X_train, y_pred_train, 'o', label=\"prediction\")\n", | |
"plt.legend(loc='best')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now, evaluate the model on testing data, which was purposely kept out of the model training process. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 97, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"y_pred_test = regressor.predict(X_test)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 98, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<matplotlib.legend.Legend at 0x7f535ec11390>" | |
] | |
}, | |
"execution_count": 98, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x7f535edabb90>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"plt.plot(X_test, y_test, 'o', label=\"data\")\n", | |
"plt.plot(X_test, y_pred_test, 'o', label=\"prediction\")\n", | |
"plt.legend(loc='best')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"To quantitatively evaluate how well the model is performing, we use the score function. For regression models, the common methodology for scoring is using R2 or mean squared error (MSE)." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Compare model performance for training and testing." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 99, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.8045065263359914" | |
] | |
}, | |
"execution_count": 99, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"regressor.score(X_train, y_train)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 100, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.65666716237561662" | |
] | |
}, | |
"execution_count": 100, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"regressor.score(X_test, y_test)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Exercise" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Fit LinearRegression on the boston housing dataset. You can load the dataset using sklearn.datasets.load_boston." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 38, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"((506, 13), (506,))\n", | |
"(379, 13) (127, 13) (379,) (127,)\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"<matplotlib.legend.Legend at 0x10e3c2590>" | |
] | |
}, | |
"execution_count": 38, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x10b0d1390>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"boston = datasets.load_boston()\n", | |
"Xboston = boston.data\n", | |
"Yboston = boston.target\n", | |
"print (Xboston.shape, Yboston.shape)\n", | |
" \n", | |
"Xboston_train, Xboston_test, Yboston_train, Yboston_test = train_test_split(Xboston, Yboston)\n", | |
"print Xboston_train.shape, Xboston_test.shape, Yboston_train.shape, Yboston_test.shape\n", | |
" \n", | |
"regressor = LinearRegression()\n", | |
"regressor.fit(Xboston_train, Yboston_train)\n", | |
" \n", | |
"y_pred_train = regressor.predict(Xboston_train) \n", | |
"\n", | |
"plt.plot(Xboston_train, Yboston_train, 'o', label=\"data\")\n", | |
"plt.plot(Xboston_train, y_pred_train, 'o', label=\"prediction\")\n", | |
"plt.legend(loc='best')\n", | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Interactive Demo: Classification Model" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"(150, 2) (150,)\n", | |
"(112, 2) (38, 2) (112,) (38,)\n", | |
"[2 0 2 2 0 2 1 2 0 1 1 2 0 2 1 1 1 2 2 2 1 2 0 0 2 2 2 0 2 2 0 0 0 0 0 2 1\n", | |
" 0] [2 0 1 2 0 1 1 2 0 2 2 2 0 2 1 2 1 1 2 2 1 1 0 0 2 2 1 0 1 2 0 0 0 0 0 2 1\n", | |
" 0]\n", | |
"0.776785714286\n", | |
"0.763157894737\n" | |
] | |
} | |
], | |
"source": [ | |
"# # import some data to play with\n", | |
"iris = datasets.load_iris()\n", | |
"X = iris.data[:, :2]\n", | |
"y = iris.target\n", | |
"print X.shape, y.shape\n", | |
"\n", | |
"# # split the data\n", | |
"X_train, X_test, y_train, y_test = train_test_split(X, y)\n", | |
"print X_train.shape, X_test.shape, y_train.shape, y_test.shape\n", | |
"\n", | |
"classifier = LogisticRegression()\n", | |
"classifier.fit(X_train, y_train)\n", | |
"prediction = classifier.predict(X_test)\n", | |
"print prediction, y_test\n", | |
"\n", | |
"print classifier.score(X_train, y_train)\n", | |
"print classifier.score(X_test, y_test)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Exercise" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Re-run the classification model above, but this time use the entire iris dataset." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"(150, 4) (150,)\n", | |
"(112, 4) (38, 4) (112,) (38,)\n", | |
"[2 0 2 1 2 1 1 0 2 1 1 1 0 0 2 2 1 0 0 0 2 0 0 0 1 0 1 0 0 2 2 0 2 0 2 2 0\n", | |
" 1] [2 0 2 1 2 1 1 0 2 1 1 1 0 0 1 2 1 0 0 0 2 0 0 0 1 0 1 0 0 2 2 0 2 0 2 1 0\n", | |
" 1]\n", | |
"0.973214285714\n", | |
"0.947368421053\n" | |
] | |
} | |
], | |
"source": [ | |
"xiris = iris.data\n", | |
"yiris = iris.target\n", | |
"print xiris.shape, yiris.shape\n", | |
"\n", | |
"xiris_train, xiris_test, yiris_train, yiris_test = train_test_split(xiris, yiris)\n", | |
"print xiris_train.shape, xiris_test.shape, yiris_train.shape, yiris_test.shape\n", | |
"\n", | |
"classifier = LogisticRegression()\n", | |
"classifier.fit(xiris_train, yiris_train)\n", | |
"prediction = classifier.predict(xiris_test)\n", | |
"print prediction, yiris_test\n", | |
"\n", | |
"print classifier.score(xiris_train, yiris_train)\n", | |
"print classifier.score(xiris_test, yiris_test)\n", | |
"\n", | |
"\n", | |
"\n", | |
"\n", | |
"\n", | |
"\n", | |
"\n", | |
"\n", | |
"\n", | |
"\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Further Resources" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Learn Python at your own pace:\n", | |
"- [Learn Python the Hard Way](http://learnpythonthehardway.org/)\n", | |
"- [Python Codeacademy](https://www.codecademy.com/learn/python)\n", | |
"- [Hackerrank Challenges in Python](https://www.hackerrank.com/)\n", | |
"\n", | |
"Online data science & Python classes:\n", | |
"- [Coursera: Learn to Program and Analyze Data with Python](https://www.coursera.org/specializations/python)\n", | |
"- [Coursera: Machine Learning with University of Washington](https://www.coursera.org/specializations/machine-learning)\n", | |
"\n", | |
"Python Reference Book:\n", | |
"- [Python for Data Analysis, O'Reilly Media](http://shop.oreilly.com/product/0636920023784.do)" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 2", | |
"language": "python", | |
"name": "python2" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 2 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython2", | |
"version": "2.7.11" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment