Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save pybokeh/f7aeafc10ae3d84705419547a79f1f2c to your computer and use it in GitHub Desktop.
Save pybokeh/f7aeafc10ae3d84705419547a79f1f2c to your computer and use it in GitHub Desktop.
data_cleaning/cleaning_columns_of_specific_types.ipynb
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "### BACKGROUND:`"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Sample data is from UC Davis Machine Learning Repository [adult data set](http://archive.ics.uci.edu/ml/machine-learning-databases/adult/)"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Data set has white space that is prepended or in front of each value. We want to remove the white space from columns of type string. Even the column names have a white space in front. Instances such as theses where the source data has these quirky data anomalies are quite common in the real world. Most often we can't control the source data, but we can clean the data afterwards. Depending on the level of control, below are 3 different methods."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import pandas as pd",
"execution_count": 1,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df = pd.read_csv('adult_data.csv')",
"execution_count": 2,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "df.head()",
"execution_count": 3,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>age</th>\n <th>workclass</th>\n <th>fnlwgt</th>\n <th>education</th>\n <th>education-num</th>\n <th>marital-status</th>\n <th>occupation</th>\n <th>relationship</th>\n <th>race</th>\n <th>sex</th>\n <th>capital-gain</th>\n <th>capital-loss</th>\n <th>hours-per-week</th>\n <th>native-country</th>\n <th>&lt;=50k</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>39</td>\n <td>State-gov</td>\n <td>77516</td>\n <td>Bachelors</td>\n <td>13</td>\n <td>Never-married</td>\n <td>Adm-clerical</td>\n <td>Not-in-family</td>\n <td>White</td>\n <td>Male</td>\n <td>2174</td>\n <td>0</td>\n <td>40</td>\n <td>United-States</td>\n <td>&lt;=50K</td>\n </tr>\n <tr>\n <th>1</th>\n <td>50</td>\n <td>Self-emp-not-inc</td>\n <td>83311</td>\n <td>Bachelors</td>\n <td>13</td>\n <td>Married-civ-spouse</td>\n <td>Exec-managerial</td>\n <td>Husband</td>\n <td>White</td>\n <td>Male</td>\n <td>0</td>\n <td>0</td>\n <td>13</td>\n <td>United-States</td>\n <td>&lt;=50K</td>\n </tr>\n <tr>\n <th>2</th>\n <td>38</td>\n <td>Private</td>\n <td>215646</td>\n <td>HS-grad</td>\n <td>9</td>\n <td>Divorced</td>\n <td>Handlers-cleaners</td>\n <td>Not-in-family</td>\n <td>White</td>\n <td>Male</td>\n <td>0</td>\n <td>0</td>\n <td>40</td>\n <td>United-States</td>\n <td>&lt;=50K</td>\n </tr>\n <tr>\n <th>3</th>\n <td>53</td>\n <td>Private</td>\n <td>234721</td>\n <td>11th</td>\n <td>7</td>\n <td>Married-civ-spouse</td>\n <td>Handlers-cleaners</td>\n <td>Husband</td>\n <td>Black</td>\n <td>Male</td>\n <td>0</td>\n <td>0</td>\n <td>40</td>\n <td>United-States</td>\n <td>&lt;=50K</td>\n </tr>\n <tr>\n <th>4</th>\n <td>28</td>\n <td>Private</td>\n <td>338409</td>\n <td>Bachelors</td>\n <td>13</td>\n <td>Married-civ-spouse</td>\n <td>Prof-specialty</td>\n <td>Wife</td>\n <td>Black</td>\n <td>Female</td>\n <td>0</td>\n <td>0</td>\n <td>40</td>\n <td>Cuba</td>\n <td>&lt;=50K</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " age workclass fnlwgt education education-num \\\n0 39 State-gov 77516 Bachelors 13 \n1 50 Self-emp-not-inc 83311 Bachelors 13 \n2 38 Private 215646 HS-grad 9 \n3 53 Private 234721 11th 7 \n4 28 Private 338409 Bachelors 13 \n\n marital-status occupation relationship race sex \\\n0 Never-married Adm-clerical Not-in-family White Male \n1 Married-civ-spouse Exec-managerial Husband White Male \n2 Divorced Handlers-cleaners Not-in-family White Male \n3 Married-civ-spouse Handlers-cleaners Husband Black Male \n4 Married-civ-spouse Prof-specialty Wife Black Female \n\n capital-gain capital-loss hours-per-week native-country <=50k \n0 2174 0 40 United-States <=50K \n1 0 0 13 United-States <=50K \n2 0 0 40 United-States <=50K \n3 0 0 40 United-States <=50K \n4 0 0 40 Cuba <=50K "
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Except for the first column (```age```), the column names have a white space in front:"
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "df.columns",
"execution_count": 4,
"outputs": [
{
"data": {
"text/plain": "Index(['age', ' workclass', ' fnlwgt', ' education', ' education-num',\n ' marital-status', ' occupation', ' relationship', ' race', ' sex',\n ' capital-gain', ' capital-loss', ' hours-per-week', ' native-country',\n ' <=50k'],\n dtype='object')"
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Let's fix the column names first, then the column values:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.columns = [column.strip() for column in df.columns]",
"execution_count": 5,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "df.columns",
"execution_count": 6,
"outputs": [
{
"data": {
"text/plain": "Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',\n 'marital-status', 'occupation', 'relationship', 'race', 'sex',\n 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',\n '<=50k'],\n dtype='object')"
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Columns of type string or ```str``` also have a white space in front of each value. In pandas dataframe, ```str``` columns are represented as ```object``` data type:"
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "df.dtypes",
"execution_count": 7,
"outputs": [
{
"data": {
"text/plain": "age int64\nworkclass object\nfnlwgt int64\neducation object\neducation-num int64\nmarital-status object\noccupation object\nrelationship object\nrace object\nsex object\ncapital-gain int64\ncapital-loss int64\nhours-per-week int64\nnative-country object\n<=50k object\ndtype: object"
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### We can use ```value_counts()``` method to give us counts by data type:"
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "df.dtypes.value_counts()",
"execution_count": 8,
"outputs": [
{
"data": {
"text/plain": "object 9\nint64 6\ndtype: int64"
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Let's take a peek at a couple of the string columns:"
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "df['workclass'].values",
"execution_count": 9,
"outputs": [
{
"data": {
"text/plain": "array([' State-gov', ' Self-emp-not-inc', ' Private', ..., ' Private',\n ' Private', ' Self-emp-inc'], dtype=object)"
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "df['education'].values",
"execution_count": 10,
"outputs": [
{
"data": {
"text/plain": "array([' Bachelors', ' Bachelors', ' HS-grad', ..., ' HS-grad',\n ' HS-grad', ' HS-grad'], dtype=object)"
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "From above, we see that indeed there is a white space in front of each value. There are a few different ways to remove these white spaces."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Method #1: Test type using ```isinstance()``` method:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "def strip_whitespace(df):\n for column in df.columns:\n # if first value of column is of type str, then strip the white space from all values\n if isinstance(df[column][0], str):\n df[column] = df[column].str.strip()",
"execution_count": 11,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "strip_whitespace(df)",
"execution_count": 12,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Let's double-check that the function did its job:"
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "df['sex'].values",
"execution_count": 13,
"outputs": [
{
"data": {
"text/plain": "array(['Male', 'Male', 'Male', ..., 'Female', 'Male', 'Female'],\n dtype=object)"
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "df['education'].values",
"execution_count": 14,
"outputs": [
{
"data": {
"text/plain": "array(['Bachelors', 'Bachelors', 'HS-grad', ..., 'HS-grad', 'HS-grad',\n 'HS-grad'], dtype=object)"
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Method #2: Using ```dtype``` to test for data type:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df = pd.read_csv('adult_data.csv')\ndf.columns = [column.strip() for column in df.columns]",
"execution_count": 15,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "def strip_whitespace2(df):\n for column in df.columns:\n # if column is of type 'object', then strip the white space\n if df[column].dtype == 'object':\n df[column] = df[column].str.strip()",
"execution_count": 16,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "strip_whitespace2(df)",
"execution_count": 17,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "df['education'].values",
"execution_count": 18,
"outputs": [
{
"data": {
"text/plain": "array(['Bachelors', 'Bachelors', 'HS-grad', ..., 'HS-grad', 'HS-grad',\n 'HS-grad'], dtype=object)"
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Method #3: Using pandas ```is_string_dtype``` or ```is_numeric_dtype```:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from pandas.api.types import is_string_dtype\nfrom pandas.api.types import is_numeric_dtype",
"execution_count": 19,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df = pd.read_csv('adult_data.csv')\ndf.columns = [column.strip() for column in df.columns]",
"execution_count": 20,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "def strip_whitespace3(df):\n for column in df.columns:\n # if column is of a string data type, then strip the white space\n if is_string_dtype(df[column]):\n df[column] = df[column].str.strip()",
"execution_count": 21,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "df['education'].values",
"execution_count": 22,
"outputs": [
{
"data": {
"text/plain": "array([' Bachelors', ' Bachelors', ' HS-grad', ..., ' HS-grad',\n ' HS-grad', ' HS-grad'], dtype=object)"
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "strip_whitespace3(df)",
"execution_count": 23,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"cell_type": "code",
"source": "df['education'].values",
"execution_count": 24,
"outputs": [
{
"data": {
"text/plain": "array(['Bachelors', 'Bachelors', 'HS-grad', ..., 'HS-grad', 'HS-grad',\n 'HS-grad'], dtype=object)"
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
]
}
],
"metadata": {
"hide_input": false,
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.6.5",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
},
"gist": {
"id": "",
"data": {
"description": "data_cleaning/cleaning_columns_of_specific_types.ipynb",
"public": true
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment