Created
August 3, 2018 05:27
-
-
Save pmarcelino/0b6920ae6d54995ea380d9792418304a to your computer and use it in GitHub Desktop.
Data cleaning - General - Data types
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Data types" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let's use the Kaggle Titanic dataset as an example." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>PassengerId</th>\n", | |
" <th>Survived</th>\n", | |
" <th>Pclass</th>\n", | |
" <th>Name</th>\n", | |
" <th>Sex</th>\n", | |
" <th>Age</th>\n", | |
" <th>SibSp</th>\n", | |
" <th>Parch</th>\n", | |
" <th>Ticket</th>\n", | |
" <th>Fare</th>\n", | |
" <th>Cabin</th>\n", | |
" <th>Embarked</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>3</td>\n", | |
" <td>Braund, Mr. Owen Harris</td>\n", | |
" <td>male</td>\n", | |
" <td>22</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>A/5 21171</td>\n", | |
" <td>7.2500</td>\n", | |
" <td>NaN</td>\n", | |
" <td>S</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n", | |
" <td>female</td>\n", | |
" <td>38</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>PC 17599</td>\n", | |
" <td>71.2833</td>\n", | |
" <td>C85</td>\n", | |
" <td>C</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>3</td>\n", | |
" <td>1</td>\n", | |
" <td>3</td>\n", | |
" <td>Heikkinen, Miss. Laina</td>\n", | |
" <td>female</td>\n", | |
" <td>26</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>STON/O2. 3101282</td>\n", | |
" <td>7.9250</td>\n", | |
" <td>NaN</td>\n", | |
" <td>S</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>4</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n", | |
" <td>female</td>\n", | |
" <td>35</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>113803</td>\n", | |
" <td>53.1000</td>\n", | |
" <td>C123</td>\n", | |
" <td>S</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>5</td>\n", | |
" <td>0</td>\n", | |
" <td>3</td>\n", | |
" <td>Allen, Mr. William Henry</td>\n", | |
" <td>male</td>\n", | |
" <td>35</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>373450</td>\n", | |
" <td>8.0500</td>\n", | |
" <td>NaN</td>\n", | |
" <td>S</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" PassengerId Survived Pclass \\\n", | |
"0 1 0 3 \n", | |
"1 2 1 1 \n", | |
"2 3 1 3 \n", | |
"3 4 1 1 \n", | |
"4 5 0 3 \n", | |
"\n", | |
" Name Sex Age SibSp \\\n", | |
"0 Braund, Mr. Owen Harris male 22 1 \n", | |
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 \n", | |
"2 Heikkinen, Miss. Laina female 26 0 \n", | |
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 \n", | |
"4 Allen, Mr. William Henry male 35 0 \n", | |
"\n", | |
" Parch Ticket Fare Cabin Embarked \n", | |
"0 0 A/5 21171 7.2500 NaN S \n", | |
"1 0 PC 17599 71.2833 C85 C \n", | |
"2 0 STON/O2. 3101282 7.9250 NaN S \n", | |
"3 0 113803 53.1000 C123 S \n", | |
"4 0 373450 8.0500 NaN S " | |
] | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Import data \n", | |
"import pandas as pd\n", | |
"df = pd.read_csv('./data/titanic.csv') # Kaggle Titanic dataset\n", | |
"\n", | |
"# Corrupt dataset to make the example meaningful\n", | |
"df['Age'] = df['Age'].astype('object')\n", | |
"\n", | |
"# Show dataset\n", | |
"df.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"If you need to remind the meaning of each feature, check this [link](https://www.kaggle.com/pmarcelino/data-analysis-and-feature-extraction-with-python)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',\n", | |
" 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],\n", | |
" dtype='object')" | |
] | |
}, | |
"execution_count": 18, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Get list of features\n", | |
"df.columns" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"What we expect to see when checking features' data types:\n", | |
"\n", | |
"| Feature | Data type | Comment |\n", | |
"|-------------|------------------|------------------------------------------------------------|\n", | |
"| PassengerId | int64 | It's a categorical feature. |\n", | |
"| Survived | int64 | It's a categorical feature. |\n", | |
"| Pclass | int64 | It's an ordinal feature. |\n", | |
"| Name | object | It's character data (string). |\n", | |
"| Sex | int64 or object | Categorical feature. It can be [0,1] or ['male','female']. |\n", | |
"| Age | int64 or float64 | float64 if age = date of birth - current date. |\n", | |
"| SibSp | int64 | It's continuous. |\n", | |
"| Parch | int64 | It's continuous. |\n", | |
"| Ticket | object or int64 | It's categorical. |\n", | |
"| Fare | float64 | It's continuous. |\n", | |
"| Cabin | int64 | It's categorical. |\n", | |
"| Embarked | int64 or object | It's categorical. |\n", | |
"\n", | |
"Looks cool, isn't it? If you also want to do fancy tables, use [this](https://www.tablesgenerator.com/markdown_tables#)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"PassengerId int64\n", | |
"Survived int64\n", | |
"Pclass int64\n", | |
"Name object\n", | |
"Sex object\n", | |
"Age object\n", | |
"SibSp int64\n", | |
"Parch int64\n", | |
"Ticket object\n", | |
"Fare float64\n", | |
"Cabin object\n", | |
"Embarked object\n", | |
"dtype: object" | |
] | |
}, | |
"execution_count": 19, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Get data types\n", | |
"df.dtypes" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The only feature that it's not according to our expectation is 'Age' (what a surprise). This feature is defined as character data (object), when it should be numeric data (int64 or float64).\n", | |
"\n", | |
"We can use [pandas.to_numeric](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html#pandas.to_numeric) to correct this situation." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"PassengerId int64\n", | |
"Survived int64\n", | |
"Pclass int64\n", | |
"Name object\n", | |
"Sex object\n", | |
"Age float64\n", | |
"SibSp int64\n", | |
"Parch int64\n", | |
"Ticket object\n", | |
"Fare float64\n", | |
"Cabin object\n", | |
"Embarked object\n", | |
"dtype: object" | |
] | |
}, | |
"execution_count": 20, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Correct data type\n", | |
"df['Age'] = pd.to_numeric(df['Age'])\n", | |
"\n", | |
"# Debug\n", | |
"df.dtypes" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Clean! This is the type of data cleaning operations related to data types that we usually do.\n", | |
"\n", | |
"Other common data type transformations are:\n", | |
"\n", | |
"* [pandas.to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html#pandas.to_datetime) - Converts argument to datetime.\n", | |
"* [pandas.to_timedelta](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_timedelta.html#pandas.to_timedelta) - Converts argument to timedelta.\n", | |
"* [pandas.DataFrame.astype('object')](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html#pandas.DataFrame.astype) - Converts argument to object. This is a generic transformation. You can use it to transform data to other data types than object." | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.2" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment