Data cleaning - General - Values out of range
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Values out of range"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section, we deal only with values that are clearly wrong because they are impossible. For example, no one can have a negative age (not even Benjamin Button). A different kind of out-of-range value belongs to outlier analysis: values that are anomalous, but not necessarily impossible. We will analyse outliers in a different notebook.\n",
"\n",
"For now, let's check what's going on in our favorite dataset."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>player</th>\n",
" <th>team</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>71</td>\n",
" <td>eusebio</td>\n",
" <td>benfica</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>55</td>\n",
" <td>magnusson</td>\n",
" <td>benfica</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-68</td>\n",
" <td>nene</td>\n",
" <td>benfica</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age player team\n",
"0 71 eusebio benfica\n",
"1 55 magnusson benfica\n",
"2 -68 nene benfica"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create example\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({'player':['eusebio', 'magnusson', 'nene'],\n",
" 'team':['benfica', 'benfica', 'benfica'],\n",
" 'age':[71,55,-68]})\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In a real case, I'd probably detect that Nené's age is wrong through one of two processes:\n",
"1. Impose up front that all age values must be above 0 (a kind of sanity check).\n",
"1. Check the minimum and maximum values in the descriptive statistics.\n",
"\n",
"Let's see how to proceed in each case."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sanity check"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>player</th>\n",
" <th>team</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>71</td>\n",
" <td>eusebio</td>\n",
" <td>benfica</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>55</td>\n",
" <td>magnusson</td>\n",
" <td>benfica</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age player team\n",
"0 71 eusebio benfica\n",
"1 55 magnusson benfica"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Impose condition for age values\n",
"df_sane = df[df['age']>0]\n",
"df_sane"
]
},
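{
"cell_type": "markdown",
"metadata": {},
"source": [
"A variation on the filter above, as a minimal sketch rather than the method used in this notebook: instead of dropping the bad rows silently, keep them aside so they can be inspected. The names `players`, `AGE_MIN`, `AGE_MAX`, `players_sane` and `players_rejected` are hypothetical, and the upper bound of 120 is an assumption made only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical bounds, assumed for illustration only\n",
"AGE_MIN, AGE_MAX = 0, 120\n",
"\n",
"# Same toy data as above, under a separate name so df is left untouched\n",
"players = pd.DataFrame({'player': ['eusebio', 'magnusson', 'nene'],\n",
"                        'team': ['benfica', 'benfica', 'benfica'],\n",
"                        'age': [71, 55, -68]})\n",
"\n",
"# True where the age is plausible: above AGE_MIN and at most AGE_MAX\n",
"in_range = (players['age'] > AGE_MIN) & (players['age'] <= AGE_MAX)\n",
"\n",
"# Rows that pass the check\n",
"players_sane = players[in_range]\n",
"\n",
"# Rows that fail the check, kept aside for inspection\n",
"players_rejected = players[~in_range]\n",
"players_rejected"
]
},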
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Descriptive statistics"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>3.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>19.333333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>76.054805</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>-68.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>-6.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>55.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>63.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>71.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age\n",
"count 3.000000\n",
"mean 19.333333\n",
"std 76.054805\n",
"min -68.000000\n",
"25% -6.500000\n",
"50% 55.000000\n",
"75% 63.000000\n",
"max 71.000000"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Descriptive statistics: look at min and max\n",
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the things I pay attention to when analysing descriptive statistics is the *min* and *max* values. It's one of the best ways I know to catch those gross errors that would embarrass you in front of your boss.\n",
"\n",
"In this case, the value -68 would get my attention and I'd remove it."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>player</th>\n",
" <th>team</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-68</td>\n",
" <td>nene</td>\n",
" <td>benfica</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age player team\n",
"2 -68 nene benfica"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# See what's going on\n",
"df[df['age']==df['age'].min()]"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>player</th>\n",
" <th>team</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>71</td>\n",
" <td>eusebio</td>\n",
" <td>benfica</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>55</td>\n",
" <td>magnusson</td>\n",
" <td>benfica</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age player team\n",
"0 71 eusebio benfica\n",
"1 55 magnusson benfica"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Drop the row holding the minimum age (run once: re-running would drop the next minimum too)\n",
"df = df[df['age']>df['age'].min()]\n",
"df"
]
},
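{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dropping the row also throws away the player and team values, which may be perfectly fine. If you'd rather keep them, one alternative (a sketch, not what this notebook does) is to blank out only the impossible value with `Series.where`, which puts `NaN` wherever the condition is false. The names `players` and `players_masked` are hypothetical. The masked value then becomes a missing-data problem, and `describe()` leaves it out of its count."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Same toy data as above, under a separate name so df is left untouched\n",
"players = pd.DataFrame({'player': ['eusebio', 'magnusson', 'nene'],\n",
"                        'team': ['benfica', 'benfica', 'benfica'],\n",
"                        'age': [71, 55, -68]})\n",
"\n",
"# Keep ages above 0; any other age becomes NaN while the row itself stays\n",
"players_masked = players.copy()\n",
"players_masked['age'] = players_masked['age'].where(players_masked['age'] > 0)\n",
"players_masked"
]
},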
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's good practice to check the descriptive statistics again. We never know what else is hiding there."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>2.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>63.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>11.313708</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>55.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>59.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>63.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>67.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>71.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age\n",
"count 2.000000\n",
"mean 63.000000\n",
"std 11.313708\n",
"min 55.000000\n",
"25% 59.000000\n",
"50% 63.000000\n",
"75% 67.000000\n",
"max 71.000000"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Descriptive statistics (take 2)\n",
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks nice. Mission accomplished!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}