Skip to content

Instantly share code, notes, and snippets.

@rdempsey
Last active October 10, 2015 13:35
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save rdempsey/35219dcc03a23e9295aa to your computer and use it in GitHub Desktop.
Save rdempsey/35219dcc03a23e9295aa to your computer and use it in GitHub Desktop.
Three Pandas Tips for Pandas Noobs
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Three Pandas Tips For Noobs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For those new to Pandas, you'll learn a number of tips that will help with your data engineering and analysis tasks. You may find these buried in the documentation or StackOverflow posts, but I'm consolidating them here for you.\n",
"\n",
"Here's what's covered:\n",
"\n",
"1. Ensuring changes you make to DataFrames stick\n",
"2. Applying a function with no arguments to a DataFrame\n",
"3. Applying a function with arguments to a DataFrame\n",
"\n",
"Here's the link to the original dataset we're using:\n",
"\n",
"* [UK Road Safety Data](https://data.gov.uk/dataset/road-accidents-safety-data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Additional Resources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This and much more is covered in my upcoming book: [Python Business Intelligence Cookbook](http://robertwdempsey.com/uo6b), now available for pre-order from Packt Publishing."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import The Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first thing we need to do is import the data into a DataFrame. I suggest using the [read_csv()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) method from Pandas for this."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Import the Python libraries we need\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>Accident_Index</th>\n",
" <th>Location_Easting_OSGR</th>\n",
" <th>Location_Northing_OSGR</th>\n",
" <th>Longitude</th>\n",
" <th>Latitude</th>\n",
" <th>Police_Force</th>\n",
" <th>Accident_Severity</th>\n",
" <th>Number_of_Vehicles</th>\n",
" <th>Number_of_Casualties</th>\n",
" <th>...</th>\n",
" <th>Pedestrian_Crossing-Human_Control</th>\n",
" <th>Pedestrian_Crossing-Physical_Facilities</th>\n",
" <th>Light_Conditions</th>\n",
" <th>Weather_Conditions</th>\n",
" <th>Road_Surface_Conditions</th>\n",
" <th>Special_Conditions_at_Site</th>\n",
" <th>Carriageway_Hazards</th>\n",
" <th>Urban_or_Rural_Area</th>\n",
" <th>Did_Police_Officer_Attend_Scene_of_Accident</th>\n",
" <th>LSOA_of_Accident_Location</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>197901A11AD14</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>1</td>\n",
" <td>8</td>\n",
" <td>1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>197901A1BAW34</td>\n",
" <td>198460</td>\n",
" <td>894000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>4</td>\n",
" <td>8</td>\n",
" <td>3</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>197901A1BFD77</td>\n",
" <td>406380</td>\n",
" <td>307000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>4</td>\n",
" <td>8</td>\n",
" <td>3</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>197901A1BGC20</td>\n",
" <td>281680</td>\n",
" <td>440000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>4</td>\n",
" <td>8</td>\n",
" <td>3</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>197901A1BGF95</td>\n",
" <td>153960</td>\n",
" <td>795000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 33 columns</p>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 Accident_Index Location_Easting_OSGR Location_Northing_OSGR \\\n",
"0 0 197901A11AD14 NaN NaN \n",
"1 1 197901A1BAW34 198460 894000 \n",
"2 2 197901A1BFD77 406380 307000 \n",
"3 3 197901A1BGC20 281680 440000 \n",
"4 4 197901A1BGF95 153960 795000 \n",
"\n",
" Longitude Latitude Police_Force Accident_Severity Number_of_Vehicles \\\n",
"0 NaN NaN 1 3 2 \n",
"1 NaN NaN 1 3 1 \n",
"2 NaN NaN 1 3 2 \n",
"3 NaN NaN 1 3 2 \n",
"4 NaN NaN 1 2 2 \n",
"\n",
" Number_of_Casualties ... \\\n",
"0 1 ... \n",
"1 1 ... \n",
"2 3 ... \n",
"3 2 ... \n",
"4 1 ... \n",
"\n",
" Pedestrian_Crossing-Human_Control Pedestrian_Crossing-Physical_Facilities \\\n",
"0 -1 -1 \n",
"1 -1 -1 \n",
"2 -1 -1 \n",
"3 -1 -1 \n",
"4 -1 -1 \n",
"\n",
" Light_Conditions Weather_Conditions Road_Surface_Conditions \\\n",
"0 1 8 1 \n",
"1 4 8 3 \n",
"2 4 8 3 \n",
"3 4 8 3 \n",
"4 4 3 3 \n",
"\n",
" Special_Conditions_at_Site Carriageway_Hazards Urban_or_Rural_Area \\\n",
"0 -1 0 -1 \n",
"1 -1 0 -1 \n",
"2 -1 0 -1 \n",
"3 -1 0 -1 \n",
"4 -1 0 -1 \n",
"\n",
" Did_Police_Officer_Attend_Scene_of_Accident LSOA_of_Accident_Location \n",
"0 -1 NaN \n",
"1 -1 NaN \n",
"2 -1 NaN \n",
"3 -1 NaN \n",
"4 -1 NaN \n",
"\n",
"[5 rows x 33 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Define a variable for the accidents data file\n",
"f = './data/accidents1k.csv'\n",
"\n",
"# Use read_csv() to import the data\n",
"accidents = pd.read_csv(f,\n",
" sep=',',\n",
" header=0,\n",
" index_col=False,\n",
" parse_dates=True,\n",
" tupleize_cols=False,\n",
" error_bad_lines=False,\n",
" warn_bad_lines=True,\n",
" skip_blank_lines=True,\n",
" low_memory=False\n",
" )\n",
"# Run the head() command to see the top 5 rows of the data\n",
"accidents.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Ensuring Your Changes Stick"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many ways to fill in missing (NaN) values in a DataFrame; some people use the mean of the column, others enter 0. You can do whatever you want. However, just because you tell Pandas to fill in the missing values doesn't mean the change will stick.\n",
"\n",
"Let's use the [fillna()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) method of the DataFrame and see what happens."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>Accident_Index</th>\n",
" <th>Location_Easting_OSGR</th>\n",
" <th>Location_Northing_OSGR</th>\n",
" <th>Longitude</th>\n",
" <th>Latitude</th>\n",
" <th>Police_Force</th>\n",
" <th>Accident_Severity</th>\n",
" <th>Number_of_Vehicles</th>\n",
" <th>Number_of_Casualties</th>\n",
" <th>...</th>\n",
" <th>Pedestrian_Crossing-Human_Control</th>\n",
" <th>Pedestrian_Crossing-Physical_Facilities</th>\n",
" <th>Light_Conditions</th>\n",
" <th>Weather_Conditions</th>\n",
" <th>Road_Surface_Conditions</th>\n",
" <th>Special_Conditions_at_Site</th>\n",
" <th>Carriageway_Hazards</th>\n",
" <th>Urban_or_Rural_Area</th>\n",
" <th>Did_Police_Officer_Attend_Scene_of_Accident</th>\n",
" <th>LSOA_of_Accident_Location</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>197901A11AD14</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>1</td>\n",
" <td>8</td>\n",
" <td>1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>197901A1BAW34</td>\n",
" <td>198460</td>\n",
" <td>894000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>4</td>\n",
" <td>8</td>\n",
" <td>3</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>197901A1BFD77</td>\n",
" <td>406380</td>\n",
" <td>307000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>4</td>\n",
" <td>8</td>\n",
" <td>3</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>197901A1BGC20</td>\n",
" <td>281680</td>\n",
" <td>440000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>4</td>\n",
" <td>8</td>\n",
" <td>3</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>197901A1BGF95</td>\n",
" <td>153960</td>\n",
" <td>795000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 33 columns</p>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 Accident_Index Location_Easting_OSGR Location_Northing_OSGR \\\n",
"0 0 197901A11AD14 0 0 \n",
"1 1 197901A1BAW34 198460 894000 \n",
"2 2 197901A1BFD77 406380 307000 \n",
"3 3 197901A1BGC20 281680 440000 \n",
"4 4 197901A1BGF95 153960 795000 \n",
"\n",
" Longitude Latitude Police_Force Accident_Severity Number_of_Vehicles \\\n",
"0 0 0 1 3 2 \n",
"1 0 0 1 3 1 \n",
"2 0 0 1 3 2 \n",
"3 0 0 1 3 2 \n",
"4 0 0 1 2 2 \n",
"\n",
" Number_of_Casualties ... \\\n",
"0 1 ... \n",
"1 1 ... \n",
"2 3 ... \n",
"3 2 ... \n",
"4 1 ... \n",
"\n",
" Pedestrian_Crossing-Human_Control Pedestrian_Crossing-Physical_Facilities \\\n",
"0 -1 -1 \n",
"1 -1 -1 \n",
"2 -1 -1 \n",
"3 -1 -1 \n",
"4 -1 -1 \n",
"\n",
" Light_Conditions Weather_Conditions Road_Surface_Conditions \\\n",
"0 1 8 1 \n",
"1 4 8 3 \n",
"2 4 8 3 \n",
"3 4 8 3 \n",
"4 4 3 3 \n",
"\n",
" Special_Conditions_at_Site Carriageway_Hazards Urban_or_Rural_Area \\\n",
"0 -1 0 -1 \n",
"1 -1 0 -1 \n",
"2 -1 0 -1 \n",
"3 -1 0 -1 \n",
"4 -1 0 -1 \n",
"\n",
" Did_Police_Officer_Attend_Scene_of_Accident LSOA_of_Accident_Location \n",
"0 -1 0 \n",
"1 -1 0 \n",
"2 -1 0 \n",
"3 -1 0 \n",
"4 -1 0 \n",
"\n",
"[5 rows x 33 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fill in the NaN values and check the DataFrame\n",
"accidents.fillna(value=0).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hrm, it looks like the DataFrame is updated, but is it? I think not!"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>Accident_Index</th>\n",
" <th>Location_Easting_OSGR</th>\n",
" <th>Location_Northing_OSGR</th>\n",
" <th>Longitude</th>\n",
" <th>Latitude</th>\n",
" <th>Police_Force</th>\n",
" <th>Accident_Severity</th>\n",
" <th>Number_of_Vehicles</th>\n",
" <th>Number_of_Casualties</th>\n",
" <th>...</th>\n",
" <th>Pedestrian_Crossing-Human_Control</th>\n",
" <th>Pedestrian_Crossing-Physical_Facilities</th>\n",
" <th>Light_Conditions</th>\n",
" <th>Weather_Conditions</th>\n",
" <th>Road_Surface_Conditions</th>\n",
" <th>Special_Conditions_at_Site</th>\n",
" <th>Carriageway_Hazards</th>\n",
" <th>Urban_or_Rural_Area</th>\n",
" <th>Did_Police_Officer_Attend_Scene_of_Accident</th>\n",
" <th>LSOA_of_Accident_Location</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>197901A11AD14</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>1</td>\n",
" <td>8</td>\n",
" <td>1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>197901A1BAW34</td>\n",
" <td>198460</td>\n",
" <td>894000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>4</td>\n",
" <td>8</td>\n",
" <td>3</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>197901A1BFD77</td>\n",
" <td>406380</td>\n",
" <td>307000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>4</td>\n",
" <td>8</td>\n",
" <td>3</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>197901A1BGC20</td>\n",
" <td>281680</td>\n",
" <td>440000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>4</td>\n",
" <td>8</td>\n",
" <td>3</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>197901A1BGF95</td>\n",
" <td>153960</td>\n",
" <td>795000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 33 columns</p>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 Accident_Index Location_Easting_OSGR Location_Northing_OSGR \\\n",
"0 0 197901A11AD14 NaN NaN \n",
"1 1 197901A1BAW34 198460 894000 \n",
"2 2 197901A1BFD77 406380 307000 \n",
"3 3 197901A1BGC20 281680 440000 \n",
"4 4 197901A1BGF95 153960 795000 \n",
"\n",
" Longitude Latitude Police_Force Accident_Severity Number_of_Vehicles \\\n",
"0 NaN NaN 1 3 2 \n",
"1 NaN NaN 1 3 1 \n",
"2 NaN NaN 1 3 2 \n",
"3 NaN NaN 1 3 2 \n",
"4 NaN NaN 1 2 2 \n",
"\n",
" Number_of_Casualties ... \\\n",
"0 1 ... \n",
"1 1 ... \n",
"2 3 ... \n",
"3 2 ... \n",
"4 1 ... \n",
"\n",
" Pedestrian_Crossing-Human_Control Pedestrian_Crossing-Physical_Facilities \\\n",
"0 -1 -1 \n",
"1 -1 -1 \n",
"2 -1 -1 \n",
"3 -1 -1 \n",
"4 -1 -1 \n",
"\n",
" Light_Conditions Weather_Conditions Road_Surface_Conditions \\\n",
"0 1 8 1 \n",
"1 4 8 3 \n",
"2 4 8 3 \n",
"3 4 8 3 \n",
"4 4 3 3 \n",
"\n",
" Special_Conditions_at_Site Carriageway_Hazards Urban_or_Rural_Area \\\n",
"0 -1 0 -1 \n",
"1 -1 0 -1 \n",
"2 -1 0 -1 \n",
"3 -1 0 -1 \n",
"4 -1 0 -1 \n",
"\n",
" Did_Police_Officer_Attend_Scene_of_Accident LSOA_of_Accident_Location \n",
"0 -1 NaN \n",
"1 -1 NaN \n",
"2 -1 NaN \n",
"3 -1 NaN \n",
"4 -1 NaN \n",
"\n",
"[5 rows x 33 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"accidents.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What the heck?! The missing values haven't actually been updated. So how do we make the change stick? Using the inplace=True argument like so..."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>Accident_Index</th>\n",
" <th>Location_Easting_OSGR</th>\n",
" <th>Location_Northing_OSGR</th>\n",
" <th>Longitude</th>\n",
" <th>Latitude</th>\n",
" <th>Police_Force</th>\n",
" <th>Accident_Severity</th>\n",
" <th>Number_of_Vehicles</th>\n",
" <th>Number_of_Casualties</th>\n",
" <th>...</th>\n",
" <th>Pedestrian_Crossing-Human_Control</th>\n",
" <th>Pedestrian_Crossing-Physical_Facilities</th>\n",
" <th>Light_Conditions</th>\n",
" <th>Weather_Conditions</th>\n",
" <th>Road_Surface_Conditions</th>\n",
" <th>Special_Conditions_at_Site</th>\n",
" <th>Carriageway_Hazards</th>\n",
" <th>Urban_or_Rural_Area</th>\n",
" <th>Did_Police_Officer_Attend_Scene_of_Accident</th>\n",
" <th>LSOA_of_Accident_Location</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>197901A11AD14</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>1</td>\n",
" <td>8</td>\n",
" <td>1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>197901A1BAW34</td>\n",
" <td>198460</td>\n",
" <td>894000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>4</td>\n",
" <td>8</td>\n",
" <td>3</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>197901A1BFD77</td>\n",
" <td>406380</td>\n",
" <td>307000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>4</td>\n",
" <td>8</td>\n",
" <td>3</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>197901A1BGC20</td>\n",
" <td>281680</td>\n",
" <td>440000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>4</td>\n",
" <td>8</td>\n",
" <td>3</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>197901A1BGF95</td>\n",
" <td>153960</td>\n",
" <td>795000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 33 columns</p>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 Accident_Index Location_Easting_OSGR Location_Northing_OSGR \\\n",
"0 0 197901A11AD14 0 0 \n",
"1 1 197901A1BAW34 198460 894000 \n",
"2 2 197901A1BFD77 406380 307000 \n",
"3 3 197901A1BGC20 281680 440000 \n",
"4 4 197901A1BGF95 153960 795000 \n",
"\n",
" Longitude Latitude Police_Force Accident_Severity Number_of_Vehicles \\\n",
"0 0 0 1 3 2 \n",
"1 0 0 1 3 1 \n",
"2 0 0 1 3 2 \n",
"3 0 0 1 3 2 \n",
"4 0 0 1 2 2 \n",
"\n",
" Number_of_Casualties ... \\\n",
"0 1 ... \n",
"1 1 ... \n",
"2 3 ... \n",
"3 2 ... \n",
"4 1 ... \n",
"\n",
" Pedestrian_Crossing-Human_Control Pedestrian_Crossing-Physical_Facilities \\\n",
"0 -1 -1 \n",
"1 -1 -1 \n",
"2 -1 -1 \n",
"3 -1 -1 \n",
"4 -1 -1 \n",
"\n",
" Light_Conditions Weather_Conditions Road_Surface_Conditions \\\n",
"0 1 8 1 \n",
"1 4 8 3 \n",
"2 4 8 3 \n",
"3 4 8 3 \n",
"4 4 3 3 \n",
"\n",
" Special_Conditions_at_Site Carriageway_Hazards Urban_or_Rural_Area \\\n",
"0 -1 0 -1 \n",
"1 -1 0 -1 \n",
"2 -1 0 -1 \n",
"3 -1 0 -1 \n",
"4 -1 0 -1 \n",
"\n",
" Did_Police_Officer_Attend_Scene_of_Accident LSOA_of_Accident_Location \n",
"0 -1 0 \n",
"1 -1 0 \n",
"2 -1 0 \n",
"3 -1 0 \n",
"4 -1 0 \n",
"\n",
"[5 rows x 33 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fill the NaN values and ensure the DataFrame is indeed updated.\n",
"accidents.fillna(value=0,\n",
" inplace=True)\n",
"\n",
"accidents.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Success! The DataFrame has now been updated."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Applying a Function With No Arguments to a DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the reasons Pandas rocks is that you can apply a function to either a single column of a DataFrame or an entire DataFrame, using the [apply()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) function. You'll be using this often, so here's how."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 18/01/1979\n",
"1 01/01/1979\n",
"2 01/01/1979\n",
"3 01/01/1979\n",
"4 01/01/1979\n",
"Name: Date, dtype: object"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Let's take a look at the Date column\n",
"accidents['Date'].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"According to Pandas, the Date is an object, meaning it doesn't actually see it as a date. Let's change that."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Define a function to convert a string to a date.\n",
"\n",
"def convert_string_to_date(s):\n",
" \"\"\"\n",
" Given a string, use the to_datetime function of Pandas to convert\n",
" it to a datetime, and then return it.\n",
" \"\"\"\n",
" return pd.to_datetime(s)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 1979-01-18\n",
"1 1979-01-01\n",
"2 1979-01-01\n",
"3 1979-01-01\n",
"4 1979-01-01\n",
"Name: Date, dtype: datetime64[ns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Apply the function to the Data column using the apply() function.\n",
"# Note: we do not have to explicitly pass in the value in the row being processed.\n",
"accidents['Date'] = accidents['Date'].apply(convert_string_to_date)\n",
"\n",
"# Let's check it out.\n",
"accidents['Date'].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Voila! Our data column is now a datetime."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Applying a Function With Arguments to a DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Along with applying a function to a single column, another common task is to create an additional column based on the values in two or more columns. In order to do that, we need create a function that takes multiple parameters, and then apply it to the DataFrame.\n",
"\n",
"We'll be using the same [apply()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) function we used in the previous tip, plus a little lambda magic."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Create a few dicts and a DataFrame to hold the mappings for the accident data\n",
"\n",
"# Accident severity\n",
"severity = {\n",
" 1: 'fatal',\n",
" 2: 'serious',\n",
" 3: 'fairly serious'\n",
"}\n",
"\n",
"# Day of Week\n",
"days_of_week = {\n",
" 1: 'Sunday',\n",
" 2: 'Monday',\n",
" 3: 'Tuesday',\n",
" 4: 'Wednesday',\n",
" 5: 'Thursday',\n",
" 6: 'Friday',\n",
" 7: 'Saturday',\n",
" 0: 'Earlier this week'\n",
"}\n",
"\n",
"# Road surfaces, updated to fit the sensationalism of a news broadcast\n",
"road_surfaces = {\n",
" 1: 'dry',\n",
" 2: 'wet',\n",
" 3: 'snow-covered',\n",
" 4: 'frosty',\n",
" 5: 'flooded',\n",
" 6: 'oily',\n",
" 7: 'muddy',\n",
" -1: 'Data missing or out of range',\n",
"}\n",
"\n",
"# Local Authority (District) - create a DataFrame from the CSV file\n",
"f = './data/accidents1k.csv'\n",
"\n",
"# Use read_csv() to create a DataFrame from the local_authority_district mapping tab of the data dictionary.\n",
"# There are almost 1000 districts, hence I put them into a CSV file.\n",
"districts = pd.read_csv('./data/local_authority_district.csv',\n",
" sep=',',\n",
" header=0,\n",
" index_col=0,\n",
" parse_dates=False,\n",
" tupleize_cols=False,\n",
" error_bad_lines=False,\n",
" warn_bad_lines=True,\n",
" skip_blank_lines=True,\n",
" low_memory=False\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Define a function to create a one-sentence summary of the record.\n",
"def create_summary(day_of_week, accident_severity, road_surface, local_authority_district):\n",
" \"\"\"\n",
" Create a one-sentence summary of the record.\n",
" Parameters: integer values for the Day_of_Week, Accident_Severity,\n",
" Road_Surface_Conditions and Local_Authority_(District) columns\n",
" \"\"\"\n",
" \n",
" # Perform the value lookups in the dicts and DataFrame\n",
" dow = days_of_week[day_of_week]\n",
" sev = severity[accident_severity]\n",
" road = road_surfaces[road_surface]\n",
" lad = districts.loc[local_authority_district].label\n",
" \n",
" # If the day of week was specified use the first sentence variation, otherwise use the second\n",
" # Yes, this is redundant and we could optimize it. I leave that to you!\n",
" if day_of_week != 0:\n",
" return \"On {} a {} accident occured on a {} road in {}\".format(dow, sev, road, lad)\n",
" else:\n",
" return \"{} a {} accident occured on a {} road in {}\".format(dow, sev, road, lad)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 On Thursday a fairly serious accident occured ...\n",
"1 On Monday a fairly serious accident occured on...\n",
"2 On Monday a fairly serious accident occured on...\n",
"3 On Monday a fairly serious accident occured on...\n",
"4 On Monday a serious accident occured on a snow...\n",
"Name: summary, dtype: object"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a new column in the DataFrame and fill it with the summary produced by the create_summary function\n",
"# Pass in the parameters needed to create the summary\n",
"accidents['summary'] = accidents.apply(lambda x: create_summary(x['Day_of_Week'],\n",
" x['Accident_Severity'],\n",
" x['Road_Surface_Conditions'],\n",
" x['Local_Authority_(District)']), axis=1)\n",
"\n",
"# Let's see some results!\n",
"accidents['summary'].head()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'On Thursday a fairly serious accident occured on a dry road in Hammersmith and Fulham'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Let's view an entire summary\n",
"accidents['summary'][0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The lambda function can easily throw you for a curve. For more information on what they are and how to use them check out the [Python Tutorial: Lambda, Filter, Reduce and Map](http://www.python-course.eu/lambda.php)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## And Go!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With these three tips you're well on your way to data engineering your day away."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment