pmarcelino/data_cleaning_general_data_types.ipynb

## data_cleaning_general_data_types.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data types"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use the Kaggle Titanic dataset as an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Ticket</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Cabin</th>\n",
       "      <th>Embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Braund, Mr. Owen Harris</td>\n",
       "      <td>male</td>\n",
       "      <td>22</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>A/5 21171</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
       "      <td>female</td>\n",
       "      <td>38</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>PC 17599</td>\n",
       "      <td>71.2833</td>\n",
       "      <td>C85</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>Heikkinen, Miss. Laina</td>\n",
       "      <td>female</td>\n",
       "      <td>26</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>STON/O2. 3101282</td>\n",
       "      <td>7.9250</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
       "      <td>female</td>\n",
       "      <td>35</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>113803</td>\n",
       "      <td>53.1000</td>\n",
       "      <td>C123</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Allen, Mr. William Henry</td>\n",
       "      <td>male</td>\n",
       "      <td>35</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>373450</td>\n",
       "      <td>8.0500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   PassengerId  Survived  Pclass  \\\n",
       "0            1         0       3   \n",
       "1            2         1       1   \n",
       "2            3         1       3   \n",
       "3            4         1       1   \n",
       "4            5         0       3   \n",
       "\n",
       "                                                Name     Sex Age  SibSp  \\\n",
       "0                            Braund, Mr. Owen Harris    male  22      1   \n",
       "1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38      1   \n",
       "2                             Heikkinen, Miss. Laina  female  26      0   \n",
       "3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35      1   \n",
       "4                           Allen, Mr. William Henry    male  35      0   \n",
       "\n",
       "   Parch            Ticket     Fare Cabin Embarked  \n",
       "0      0         A/5 21171   7.2500   NaN        S  \n",
       "1      0          PC 17599  71.2833   C85        C  \n",
       "2      0  STON/O2. 3101282   7.9250   NaN        S  \n",
       "3      0            113803  53.1000  C123        S  \n",
       "4      0            373450   8.0500   NaN        S  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Import data \n",
    "import pandas as pd\n",
    "df = pd.read_csv('./data/titanic.csv')  # Kaggle Titanic dataset\n",
    "\n",
    "# Corrupt dataset to make the example meaningful\n",
    "df['Age'] = df['Age'].astype('object')\n",
    "\n",
    "# Show dataset\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you need a reminder of what each feature means, check this [link](https://www.kaggle.com/pmarcelino/data-analysis-and-feature-extraction-with-python)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',\n",
       "       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get list of features\n",
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What we expect to see when checking features' data types:\n",
    "\n",
    "| Feature     | Data type        | Comment                                                    |\n",
    "|-------------|------------------|------------------------------------------------------------|\n",
    "| PassengerId | int64            | It's an identifier (one unique value per passenger).       |\n",
    "| Survived    | int64            | It's a categorical feature.                                |\n",
    "| Pclass      | int64            | It's an ordinal feature.                                   |\n",
    "| Name        | object           | It's character data (string).                              |\n",
    "| Sex         | int64 or object  | Categorical feature. It can be [0,1] or ['male','female']. |\n",
    "| Age         | int64 or float64 | float64 if age = current date - date of birth.             |\n",
    "| SibSp       | int64            | It's numeric (a discrete count).                           |\n",
    "| Parch       | int64            | It's numeric (a discrete count).                           |\n",
    "| Ticket      | object or int64  | It's categorical.                                          |\n",
    "| Fare        | float64          | It's continuous.                                           |\n",
    "| Cabin       | object           | It's categorical.                                          |\n",
    "| Embarked    | int64 or object  | It's categorical.                                          |\n",
    "\n",
    "Looks cool, doesn't it? If you also want to make fancy tables, use [this](https://www.tablesgenerator.com/markdown_tables#)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PassengerId      int64\n",
       "Survived         int64\n",
       "Pclass           int64\n",
       "Name            object\n",
       "Sex             object\n",
       "Age             object\n",
       "SibSp            int64\n",
       "Parch            int64\n",
       "Ticket          object\n",
       "Fare           float64\n",
       "Cabin           object\n",
       "Embarked        object\n",
       "dtype: object"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get data types\n",
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The only feature that doesn't match our expectations is 'Age' (what a surprise). It is stored as object, the dtype pandas uses for strings and mixed data, when it should be numeric (int64 or float64).\n",
    "\n",
    "We can use [pandas.to_numeric](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html#pandas.to_numeric) to correct this situation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PassengerId      int64\n",
       "Survived         int64\n",
       "Pclass           int64\n",
       "Name            object\n",
       "Sex             object\n",
       "Age            float64\n",
       "SibSp            int64\n",
       "Parch            int64\n",
       "Ticket          object\n",
       "Fare           float64\n",
       "Cabin           object\n",
       "Embarked        object\n",
       "dtype: object"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Correct data type\n",
    "df['Age'] = pd.to_numeric(df['Age'])\n",
    "\n",
    "# Check the result\n",
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clean! This is the kind of data type cleaning we usually do.\n",
    "\n",
    "Other common data type transformations are:\n",
    "\n",
    "* [pandas.to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html#pandas.to_datetime) - Converts argument to datetime.\n",
    "* [pandas.to_timedelta](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_timedelta.html#pandas.to_timedelta) - Converts argument to timedelta.\n",
    "* [pandas.DataFrame.astype('object')](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html#pandas.DataFrame.astype) - Converts the data to object. `astype` is a generic transformation: pass a different dtype to convert to types other than object."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
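
In the notebook above, `pd.to_numeric(df['Age'])` works because every value in the column can be parsed as a number. Real columns are often messier. The sketch below is not part of the original notebook: it uses a small made-up Series (a hypothetical stand-in for 'Age') and assumes we are happy to turn unparseable entries into missing values with `errors="coerce"`. With the default setting, `pd.to_numeric` would raise a `ValueError` on the bad entry instead.

```python
import pandas as pd

# Hypothetical stand-in for an 'Age' column that arrived as strings,
# including one entry that cannot be parsed as a number.
age = pd.Series(["22", "38", "unknown", "35"], dtype="object")

# errors="coerce" replaces unparseable entries with NaN instead of raising.
age_numeric = pd.to_numeric(age, errors="coerce")

print(age_numeric.dtype)         # dtype of the converted column
print(age_numeric.isna().sum())  # how many entries failed to parse
```

Counting the NaN values after the conversion is a cheap way to see how much data was lost, so we can decide whether to fix the source values or handle them as missing data.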