pybokeh/cleaning_columns_of_specific_types.ipynb

## cleaning_columns_of_specific_types.ipynb
{
  "cells": [
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### BACKGROUND:`"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Sample data is from UC Davis Machine Learning Repository [adult data set](http://archive.ics.uci.edu/ml/machine-learning-databases/adult/)"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Data set has white space that is prepended or in front of each value. We want to remove the white space from columns of type string. Even the column names have a white space in front. Instances such as theses where the source data has these quirky data anomalies are quite common in the real world. Most often we can't control the source data, but we can clean the data afterwards. Depending on the level of control, below are 3 different methods."
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "import pandas as pd",
      "execution_count": 1,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "df = pd.read_csv('adult_data.csv')",
      "execution_count": 2,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "cell_type": "code",
      "source": "df.head()",
      "execution_count": 3,
      "outputs": [
        {
          "data": {
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>age</th>\n      <th>workclass</th>\n      <th>fnlwgt</th>\n      <th>education</th>\n      <th>education-num</th>\n      <th>marital-status</th>\n      <th>occupation</th>\n      <th>relationship</th>\n      <th>race</th>\n      <th>sex</th>\n      <th>capital-gain</th>\n      <th>capital-loss</th>\n      <th>hours-per-week</th>\n      <th>native-country</th>\n      <th>&lt;=50k</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>39</td>\n      <td>State-gov</td>\n      <td>77516</td>\n      <td>Bachelors</td>\n      <td>13</td>\n      <td>Never-married</td>\n      <td>Adm-clerical</td>\n      <td>Not-in-family</td>\n      <td>White</td>\n      <td>Male</td>\n      <td>2174</td>\n      <td>0</td>\n      <td>40</td>\n      <td>United-States</td>\n      <td>&lt;=50K</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>50</td>\n      <td>Self-emp-not-inc</td>\n      <td>83311</td>\n      <td>Bachelors</td>\n      <td>13</td>\n      <td>Married-civ-spouse</td>\n      <td>Exec-managerial</td>\n      <td>Husband</td>\n      <td>White</td>\n      <td>Male</td>\n      <td>0</td>\n      <td>0</td>\n      <td>13</td>\n      <td>United-States</td>\n      <td>&lt;=50K</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>38</td>\n      <td>Private</td>\n      <td>215646</td>\n      <td>HS-grad</td>\n      <td>9</td>\n      <td>Divorced</td>\n      <td>Handlers-cleaners</td>\n      <td>Not-in-family</td>\n      <td>White</td>\n      <td>Male</td>\n      <td>0</td>\n      <td>0</td>\n      <td>40</td>\n      <td>United-States</td>\n      <td>&lt;=50K</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>53</td>\n      <td>Private</td>\n      <td>234721</td>\n      <td>11th</td>\n      <td>7</td>\n      <td>Married-civ-spouse</td>\n      <td>Handlers-cleaners</td>\n      <td>Husband</td>\n      <td>Black</td>\n      <td>Male</td>\n      <td>0</td>\n      <td>0</td>\n      <td>40</td>\n      <td>United-States</td>\n      <td>&lt;=50K</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>28</td>\n      <td>Private</td>\n      <td>338409</td>\n      <td>Bachelors</td>\n      <td>13</td>\n      <td>Married-civ-spouse</td>\n      <td>Prof-specialty</td>\n      <td>Wife</td>\n      <td>Black</td>\n      <td>Female</td>\n      <td>0</td>\n      <td>0</td>\n      <td>40</td>\n      <td>Cuba</td>\n      <td>&lt;=50K</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
            "text/plain": "   age          workclass   fnlwgt   education   education-num  \\\n0   39          State-gov    77516   Bachelors              13   \n1   50   Self-emp-not-inc    83311   Bachelors              13   \n2   38            Private   215646     HS-grad               9   \n3   53            Private   234721        11th               7   \n4   28            Private   338409   Bachelors              13   \n\n        marital-status          occupation    relationship    race      sex  \\\n0        Never-married        Adm-clerical   Not-in-family   White     Male   \n1   Married-civ-spouse     Exec-managerial         Husband   White     Male   \n2             Divorced   Handlers-cleaners   Not-in-family   White     Male   \n3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   \n4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   \n\n    capital-gain   capital-loss   hours-per-week  native-country   <=50k  \n0           2174              0               40   United-States   <=50K  \n1              0              0               13   United-States   <=50K  \n2              0              0               40   United-States   <=50K  \n3              0              0               40   United-States   <=50K  \n4              0              0               40            Cuba   <=50K  "
          },
          "execution_count": 3,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Except for the first column (```age```), the column names have a white space in front:"
    },
    {
      "metadata": {
        "trusted": false
      },
      "cell_type": "code",
      "source": "df.columns",
      "execution_count": 4,
      "outputs": [
        {
          "data": {
            "text/plain": "Index(['age', ' workclass', ' fnlwgt', ' education', ' education-num',\n       ' marital-status', ' occupation', ' relationship', ' race', ' sex',\n       ' capital-gain', ' capital-loss', ' hours-per-week', ' native-country',\n       ' <=50k'],\n      dtype='object')"
          },
          "execution_count": 4,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Let's fix the column names first, then the column values:"
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "df.columns = [column.strip() for column in df.columns]",
      "execution_count": 5,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "cell_type": "code",
      "source": "df.columns",
      "execution_count": 6,
      "outputs": [
        {
          "data": {
            "text/plain": "Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',\n       'marital-status', 'occupation', 'relationship', 'race', 'sex',\n       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',\n       '<=50k'],\n      dtype='object')"
          },
          "execution_count": 6,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Columns of type string or ```str``` also have a white space in front of each value.  In pandas dataframe, ```str``` columns are represented as ```object``` data type:"
    },
    {
      "metadata": {
        "trusted": false
      },
      "cell_type": "code",
      "source": "df.dtypes",
      "execution_count": 7,
      "outputs": [
        {
          "data": {
            "text/plain": "age                int64\nworkclass         object\nfnlwgt             int64\neducation         object\neducation-num      int64\nmarital-status    object\noccupation        object\nrelationship      object\nrace              object\nsex               object\ncapital-gain       int64\ncapital-loss       int64\nhours-per-week     int64\nnative-country    object\n<=50k             object\ndtype: object"
          },
          "execution_count": 7,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### We can use ```value_counts()``` method to give us counts by data type:"
    },
    {
      "metadata": {
        "trusted": false
      },
      "cell_type": "code",
      "source": "df.dtypes.value_counts()",
      "execution_count": 8,
      "outputs": [
        {
          "data": {
            "text/plain": "object    9\nint64     6\ndtype: int64"
          },
          "execution_count": 8,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Let's take a peek at a couple of the string columns:"
    },
    {
      "metadata": {
        "trusted": false
      },
      "cell_type": "code",
      "source": "df['workclass'].values",
      "execution_count": 9,
      "outputs": [
        {
          "data": {
            "text/plain": "array([' State-gov', ' Self-emp-not-inc', ' Private', ..., ' Private',\n       ' Private', ' Self-emp-inc'], dtype=object)"
          },
          "execution_count": 9,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "trusted": false
      },
      "cell_type": "code",
      "source": "df['education'].values",
      "execution_count": 10,
      "outputs": [
        {
          "data": {
            "text/plain": "array([' Bachelors', ' Bachelors', ' HS-grad', ..., ' HS-grad',\n       ' HS-grad', ' HS-grad'], dtype=object)"
          },
          "execution_count": 10,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "From above, we see that indeed there is a white space in front of each value.  There are a few different ways to remove these white spaces."
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Method #1: Test type using ```isinstance()``` method:"
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "def strip_whitespace(df):\n    for column in df.columns:\n        # if first value of column is of type str, then strip the white space from all values\n        if isinstance(df[column][0], str):\n            df[column] = df[column].str.strip()",
      "execution_count": 11,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "strip_whitespace(df)",
      "execution_count": 12,
      "outputs": []
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Let's double-check that the function did its job:"
    },
    {
      "metadata": {
        "trusted": false
      },
      "cell_type": "code",
      "source": "df['sex'].values",
      "execution_count": 13,
      "outputs": [
        {
          "data": {
            "text/plain": "array(['Male', 'Male', 'Male', ..., 'Female', 'Male', 'Female'],\n      dtype=object)"
          },
          "execution_count": 13,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "trusted": false
      },
      "cell_type": "code",
      "source": "df['education'].values",
      "execution_count": 14,
      "outputs": [
        {
          "data": {
            "text/plain": "array(['Bachelors', 'Bachelors', 'HS-grad', ..., 'HS-grad', 'HS-grad',\n       'HS-grad'], dtype=object)"
          },
          "execution_count": 14,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Method #2: Using ```dtype``` to test for data type:"
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "df = pd.read_csv('adult_data.csv')\ndf.columns = [column.strip() for column in df.columns]",
      "execution_count": 15,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "def strip_whitespace2(df):\n    for column in df.columns:\n        # if column is of type 'object', then strip the white space\n        if df[column].dtype == 'object':\n            df[column] = df[column].str.strip()",
      "execution_count": 16,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "strip_whitespace2(df)",
      "execution_count": 17,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "cell_type": "code",
      "source": "df['education'].values",
      "execution_count": 18,
      "outputs": [
        {
          "data": {
            "text/plain": "array(['Bachelors', 'Bachelors', 'HS-grad', ..., 'HS-grad', 'HS-grad',\n       'HS-grad'], dtype=object)"
          },
          "execution_count": 18,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Method #3: Using pandas ```is_string_dtype``` or ```is_numeric_dtype```:"
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "from pandas.api.types import is_string_dtype\nfrom pandas.api.types import is_numeric_dtype",
      "execution_count": 19,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "df = pd.read_csv('adult_data.csv')\ndf.columns = [column.strip() for column in df.columns]",
      "execution_count": 20,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "def strip_whitespace3(df):\n    for column in df.columns:\n        # if column is of a string data type, then strip the white space\n        if is_string_dtype(df[column]):\n            df[column] = df[column].str.strip()",
      "execution_count": 21,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "cell_type": "code",
      "source": "df['education'].values",
      "execution_count": 22,
      "outputs": [
        {
          "data": {
            "text/plain": "array([' Bachelors', ' Bachelors', ' HS-grad', ..., ' HS-grad',\n       ' HS-grad', ' HS-grad'], dtype=object)"
          },
          "execution_count": 22,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "strip_whitespace3(df)",
      "execution_count": 23,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "cell_type": "code",
      "source": "df['education'].values",
      "execution_count": 24,
      "outputs": [
        {
          "data": {
            "text/plain": "array(['Bachelors', 'Bachelors', 'HS-grad', ..., 'HS-grad', 'HS-grad',\n       'HS-grad'], dtype=object)"
          },
          "execution_count": 24,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    }
  ],
  "metadata": {
    "hide_input": false,
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "version": "3.6.5",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "varInspector": {
      "cols": {
        "lenName": 16,
        "lenType": 16,
        "lenVar": 40
      },
      "kernels_config": {
        "python": {
          "delete_cmd_postfix": "",
          "delete_cmd_prefix": "del ",
          "library": "var_list.py",
          "varRefreshCmd": "print(var_dic_list())"
        },
        "r": {
          "delete_cmd_postfix": ") ",
          "delete_cmd_prefix": "rm(",
          "library": "var_list.r",
          "varRefreshCmd": "cat(var_dic_list()) "
        }
      },
      "types_to_exclude": [
        "module",
        "function",
        "builtin_function_or_method",
        "instance",
        "_Feature"
      ],
      "window_display": false
    },
    "gist": {
      "id": "",
      "data": {
        "description": "data_cleaning/cleaning_columns_of_specific_types.ipynb",
        "public": true
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}
	{
	"cells": [
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### BACKGROUND:`"
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### Sample data is from UC Davis Machine Learning Repository [adult data set](http://archive.ics.uci.edu/ml/machine-learning-databases/adult/)"
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "Data set has white space that is prepended or in front of each value. We want to remove the white space from columns of type string. Even the column names have a white space in front. Instances such as theses where the source data has these quirky data anomalies are quite common in the real world. Most often we can't control the source data, but we can clean the data afterwards. Depending on the level of control, below are 3 different methods."
	},
	{
	"metadata": {
	"trusted": true
	},
	"cell_type": "code",
	"source": "import pandas as pd",
	"execution_count": 1,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true
	},
	"cell_type": "code",
	"source": "df = pd.read_csv('adult_data.csv')",
	"execution_count": 2,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": false
	},
	"cell_type": "code",
	"source": "df.head()",
	"execution_count": 3,
	"outputs": [
	{
	"data": {
	"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>age</th>\n <th>workclass</th>\n <th>fnlwgt</th>\n <th>education</th>\n <th>education-num</th>\n <th>marital-status</th>\n <th>occupation</th>\n <th>relationship</th>\n <th>race</th>\n <th>sex</th>\n <th>capital-gain</th>\n <th>capital-loss</th>\n <th>hours-per-week</th>\n <th>native-country</th>\n <th><=50k</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>39</td>\n <td>State-gov</td>\n <td>77516</td>\n <td>Bachelors</td>\n <td>13</td>\n <td>Never-married</td>\n <td>Adm-clerical</td>\n <td>Not-in-family</td>\n <td>White</td>\n <td>Male</td>\n <td>2174</td>\n <td>0</td>\n <td>40</td>\n <td>United-States</td>\n <td><=50K</td>\n </tr>\n <tr>\n <th>1</th>\n <td>50</td>\n <td>Self-emp-not-inc</td>\n <td>83311</td>\n <td>Bachelors</td>\n <td>13</td>\n <td>Married-civ-spouse</td>\n <td>Exec-managerial</td>\n <td>Husband</td>\n <td>White</td>\n <td>Male</td>\n <td>0</td>\n <td>0</td>\n <td>13</td>\n <td>United-States</td>\n <td><=50K</td>\n </tr>\n <tr>\n <th>2</th>\n <td>38</td>\n <td>Private</td>\n <td>215646</td>\n <td>HS-grad</td>\n <td>9</td>\n <td>Divorced</td>\n <td>Handlers-cleaners</td>\n <td>Not-in-family</td>\n <td>White</td>\n <td>Male</td>\n <td>0</td>\n <td>0</td>\n <td>40</td>\n <td>United-States</td>\n <td><=50K</td>\n </tr>\n <tr>\n <th>3</th>\n <td>53</td>\n <td>Private</td>\n <td>234721</td>\n <td>11th</td>\n <td>7</td>\n <td>Married-civ-spouse</td>\n <td>Handlers-cleaners</td>\n <td>Husband</td>\n <td>Black</td>\n <td>Male</td>\n <td>0</td>\n <td>0</td>\n <td>40</td>\n <td>United-States</td>\n <td><=50K</td>\n </tr>\n <tr>\n <th>4</th>\n <td>28</td>\n <td>Private</td>\n <td>338409</td>\n <td>Bachelors</td>\n <td>13</td>\n <td>Married-civ-spouse</td>\n <td>Prof-specialty</td>\n <td>Wife</td>\n <td>Black</td>\n <td>Female</td>\n <td>0</td>\n <td>0</td>\n <td>40</td>\n <td>Cuba</td>\n <td><=50K</td>\n </tr>\n </tbody>\n</table>\n</div>",
	"text/plain": " age workclass fnlwgt education education-num \\\n0 39 State-gov 77516 Bachelors 13 \n1 50 Self-emp-not-inc 83311 Bachelors 13 \n2 38 Private 215646 HS-grad 9 \n3 53 Private 234721 11th 7 \n4 28 Private 338409 Bachelors 13 \n\n marital-status occupation relationship race sex \\\n0 Never-married Adm-clerical Not-in-family White Male \n1 Married-civ-spouse Exec-managerial Husband White Male \n2 Divorced Handlers-cleaners Not-in-family White Male \n3 Married-civ-spouse Handlers-cleaners Husband Black Male \n4 Married-civ-spouse Prof-specialty Wife Black Female \n\n capital-gain capital-loss hours-per-week native-country <=50k \n0 2174 0 40 United-States <=50K \n1 0 0 13 United-States <=50K \n2 0 0 40 United-States <=50K \n3 0 0 40 United-States <=50K \n4 0 0 40 Cuba <=50K "
	},
	"execution_count": 3,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### Except for the first column (```age```), the column names have a white space in front:"
	},
	{
	"metadata": {
	"trusted": false
	},
	"cell_type": "code",
	"source": "df.columns",
	"execution_count": 4,
	"outputs": [
	{
	"data": {
	"text/plain": "Index(['age', ' workclass', ' fnlwgt', ' education', ' education-num',\n ' marital-status', ' occupation', ' relationship', ' race', ' sex',\n ' capital-gain', ' capital-loss', ' hours-per-week', ' native-country',\n ' <=50k'],\n dtype='object')"
	},
	"execution_count": 4,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### Let's fix the column names first, then the column values:"
	},
	{
	"metadata": {
	"trusted": true
	},
	"cell_type": "code",
	"source": "df.columns = [column.strip() for column in df.columns]",
	"execution_count": 5,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": false
	},
	"cell_type": "code",
	"source": "df.columns",
	"execution_count": 6,
	"outputs": [
	{
	"data": {
	"text/plain": "Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',\n 'marital-status', 'occupation', 'relationship', 'race', 'sex',\n 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',\n '<=50k'],\n dtype='object')"
	},
	"execution_count": 6,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### Columns of type string or ```str``` also have a white space in front of each value. In pandas dataframe, ```str``` columns are represented as ```object``` data type:"
	},
	{
	"metadata": {
	"trusted": false
	},
	"cell_type": "code",
	"source": "df.dtypes",
	"execution_count": 7,
	"outputs": [
	{
	"data": {
	"text/plain": "age int64\nworkclass object\nfnlwgt int64\neducation object\neducation-num int64\nmarital-status object\noccupation object\nrelationship object\nrace object\nsex object\ncapital-gain int64\ncapital-loss int64\nhours-per-week int64\nnative-country object\n<=50k object\ndtype: object"
	},
	"execution_count": 7,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### We can use ```value_counts()``` method to give us counts by data type:"
	},
	{
	"metadata": {
	"trusted": false
	},
	"cell_type": "code",
	"source": "df.dtypes.value_counts()",
	"execution_count": 8,
	"outputs": [
	{
	"data": {
	"text/plain": "object 9\nint64 6\ndtype: int64"
	},
	"execution_count": 8,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### Let's take a peek at a couple of the string columns:"
	},
	{
	"metadata": {
	"trusted": false
	},
	"cell_type": "code",
	"source": "df['workclass'].values",
	"execution_count": 9,
	"outputs": [
	{
	"data": {
	"text/plain": "array([' State-gov', ' Self-emp-not-inc', ' Private', ..., ' Private',\n ' Private', ' Self-emp-inc'], dtype=object)"
	},
	"execution_count": 9,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {
	"trusted": false
	},
	"cell_type": "code",
	"source": "df['education'].values",
	"execution_count": 10,
	"outputs": [
	{
	"data": {
	"text/plain": "array([' Bachelors', ' Bachelors', ' HS-grad', ..., ' HS-grad',\n ' HS-grad', ' HS-grad'], dtype=object)"
	},
	"execution_count": 10,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "From above, we see that indeed there is a white space in front of each value. There are a few different ways to remove these white spaces."
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### Method #1: Test type using ```isinstance()``` method:"
	},
	{
	"metadata": {
	"trusted": true
	},
	"cell_type": "code",
	"source": "def strip_whitespace(df):\n for column in df.columns:\n # if first value of column is of type str, then strip the white space from all values\n if isinstance(df[column][0], str):\n df[column] = df[column].str.strip()",
	"execution_count": 11,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true
	},
	"cell_type": "code",
	"source": "strip_whitespace(df)",
	"execution_count": 12,
	"outputs": []
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### Let's double-check that the function did its job:"
	},
	{
	"metadata": {
	"trusted": false
	},
	"cell_type": "code",
	"source": "df['sex'].values",
	"execution_count": 13,
	"outputs": [
	{
	"data": {
	"text/plain": "array(['Male', 'Male', 'Male', ..., 'Female', 'Male', 'Female'],\n dtype=object)"
	},
	"execution_count": 13,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {
	"trusted": false
	},
	"cell_type": "code",
	"source": "df['education'].values",
	"execution_count": 14,
	"outputs": [
	{
	"data": {
	"text/plain": "array(['Bachelors', 'Bachelors', 'HS-grad', ..., 'HS-grad', 'HS-grad',\n 'HS-grad'], dtype=object)"
	},
	"execution_count": 14,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### Method #2: Using ```dtype``` to test for data type:"
	},
	{
	"metadata": {
	"trusted": true
	},
	"cell_type": "code",
	"source": "df = pd.read_csv('adult_data.csv')\ndf.columns = [column.strip() for column in df.columns]",
	"execution_count": 15,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true
	},
	"cell_type": "code",
	"source": "def strip_whitespace2(df):\n for column in df.columns:\n # if column is of type 'object', then strip the white space\n if df[column].dtype == 'object':\n df[column] = df[column].str.strip()",
	"execution_count": 16,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true
	},
	"cell_type": "code",
	"source": "strip_whitespace2(df)",
	"execution_count": 17,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": false
	},
	"cell_type": "code",
	"source": "df['education'].values",
	"execution_count": 18,
	"outputs": [
	{
	"data": {
	"text/plain": "array(['Bachelors', 'Bachelors', 'HS-grad', ..., 'HS-grad', 'HS-grad',\n 'HS-grad'], dtype=object)"
	},
	"execution_count": 18,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### Method #3: Using pandas ```is_string_dtype``` or ```is_numeric_dtype```:"
	},
	{
	"metadata": {
	"trusted": true
	},
	"cell_type": "code",
	"source": "from pandas.api.types import is_string_dtype\nfrom pandas.api.types import is_numeric_dtype",
	"execution_count": 19,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true
	},
	"cell_type": "code",
	"source": "df = pd.read_csv('adult_data.csv')\ndf.columns = [column.strip() for column in df.columns]",
	"execution_count": 20,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": true
	},
	"cell_type": "code",
	"source": "def strip_whitespace3(df):\n for column in df.columns:\n # if column is of a string data type, then strip the white space\n if is_string_dtype(df[column]):\n df[column] = df[column].str.strip()",
	"execution_count": 21,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": false
	},
	"cell_type": "code",
	"source": "df['education'].values",
	"execution_count": 22,
	"outputs": [
	{
	"data": {
	"text/plain": "array([' Bachelors', ' Bachelors', ' HS-grad', ..., ' HS-grad',\n ' HS-grad', ' HS-grad'], dtype=object)"
	},
	"execution_count": 22,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {
	"trusted": true
	},
	"cell_type": "code",
	"source": "strip_whitespace3(df)",
	"execution_count": 23,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": false
	},
	"cell_type": "code",
	"source": "df['education'].values",
	"execution_count": 24,
	"outputs": [
	{
	"data": {
	"text/plain": "array(['Bachelors', 'Bachelors', 'HS-grad', ..., 'HS-grad', 'HS-grad',\n 'HS-grad'], dtype=object)"
	},
	"execution_count": 24,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	}
	],
	"metadata": {
	"hide_input": false,
	"kernelspec": {
	"name": "python3",
	"display_name": "Python 3",
	"language": "python"
	},
	"language_info": {
	"name": "python",
	"version": "3.6.5",
	"mimetype": "text/x-python",
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"pygments_lexer": "ipython3",
	"nbconvert_exporter": "python",
	"file_extension": ".py"
	},
	"varInspector": {
	"cols": {
	"lenName": 16,
	"lenType": 16,
	"lenVar": 40
	},
	"kernels_config": {
	"python": {
	"delete_cmd_postfix": "",
	"delete_cmd_prefix": "del ",
	"library": "var_list.py",
	"varRefreshCmd": "print(var_dic_list())"
	},
	"r": {
	"delete_cmd_postfix": ") ",
	"delete_cmd_prefix": "rm(",
	"library": "var_list.r",
	"varRefreshCmd": "cat(var_dic_list()) "
	}
	},
	"types_to_exclude": [
	"module",
	"function",
	"builtin_function_or_method",
	"instance",
	"_Feature"
	],
	"window_display": false
	},
	"gist": {
	"id": "",
	"data": {
	"description": "data_cleaning/cleaning_columns_of_specific_types.ipynb",
	"public": true
	}
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}