Naïve Bayes Classification with Python and Scikit-Learn
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Naive Bayes Classification with Python and Scikit-Learn\n",
"\n",
"\n",
"In this project, I implement Naive Bayes Classification algorithm with Python and Scikit-Learn. I build a Naive Bayes Classifier to predict whether a person makes over 50K a year. I have used the **Adult Data Set** for this project. I have downloaded this dataset from the UCI Machine Learning Repository website. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"1.\tIntroduction to Naive Bayes Classification algorithm\n",
"2.\tNaive Bayes algorithm intuition\n",
"3.\tThe problem statement\n",
"4.\tDataset description\n",
"5.\tImport libraries\n",
"6.\tImport dataset\n",
"7.\tExploratory data analysis\n",
"8.\tDeclare feature vector and target variable\n",
"9.\tSplit data into separate training and test set\n",
"10.\tFeature engineering\n",
"11.\tFeature scaling\n",
"12.\tModel training\n",
"13.\tPredict the test-set results\n",
"14.\tCheck the accuracy score\n",
"15.\tConfusion matrix\n",
"16.\tClassification metrices\n",
"17.\tCalculate class probabilities\n",
"18.\tROC - AUC\n",
"19.\tk-Fold Cross Validation\n",
"20.\tResults and conclusion\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Introduction to Naive Bayes Classification algorithm\n",
"\n",
"\n",
"In machine learning, Naïve Bayes classification is a straightforward and powerful algorithm for the classification task. Naïve Bayes classification is based on applying Bayes’ theorem with strong independence assumption between the features. Naïve Bayes classification produces good results when we use it for textual data analysis such as Natural Language Processing.\n",
"\n",
"\n",
"Naïve Bayes models are also known as `simple Bayes` or `independent Bayes`. All these names refer to the application of Bayes’ theorem in the classifier’s decision rule. Naïve Bayes classifier applies the Bayes’ theorem in practice. This classifier brings the power of Bayes’ theorem to machine learning.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Naive Bayes algorithm intuition\n",
"\n",
"\n",
"Naïve Bayes Classifier uses the Bayes’ theorem to predict membership probabilities for each class such as the probability that given record or data point belongs to a particular class. The class with the highest probability is considered as the most likely class. This is also known as the **Maximum A Posteriori (MAP)**. \n",
"\n",
"The **MAP for a hypothesis with 2 events A and B is**\n",
"\n",
"**MAP (A)**\n",
"\n",
"= max (P (A | B))\n",
"\n",
"= max (P (B | A) * P (A))/P (B)\n",
"\n",
"= max (P (B | A) * P (A))\n",
"\n",
"\n",
"Here, P (B) is evidence probability. It is used to normalize the result. It remains the same, So, removing it would not affect the result.\n",
"\n",
"\n",
"Naïve Bayes Classifier assumes that all the features are unrelated to each other. Presence or absence of a feature does not influence the presence or absence of any other feature. \n",
"\n",
"\n",
"In real world datasets, we test a hypothesis given multiple evidence on features. So, the calculations become quite complicated. To simplify the work, the feature independence approach is used to uncouple multiple evidence and treat each as an independent one.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. The problem statement\n",
"\n",
"\n",
"In this project, I try to make predictions where the prediction task is to determine whether a person makes over 50K a year. I implement Naive Bayes Classification with Python and Scikit-Learn. So, to answer the question, I build a Naive Bayes classifier to predict whether a person makes over 50K a year."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Dataset description\n",
"\n",
"\n",
"I have used the **Adult Data Set** for this project. I have downloaded this dataset from the UCI Machine Learning Repository website. The data set can be found at the following url:-\n",
"\n",
"\n",
"https://archive.ics.uci.edu/ml/datasets/Adult\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Import libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Import dataset"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"data = 'C:/datasets/adult.data'\n",
"\n",
"df = pd.read_csv(data, header=None, sep=',\\s')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Exploratory data analysis\n",
"\n",
"\n",
"Now, I will explore the data to gain insights about the data. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(32561, 15)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view dimensions of dataset\n",
"\n",
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 32561 instances and 15 attributes in the data set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View top 5 rows of dataset"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>10</th>\n",
" <th>11</th>\n",
" <th>12</th>\n",
" <th>13</th>\n",
" <th>14</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>39</td>\n",
" <td>State-gov</td>\n",
" <td>77516</td>\n",
" <td>Bachelors</td>\n",
" <td>13</td>\n",
" <td>Never-married</td>\n",
" <td>Adm-clerical</td>\n",
" <td>Not-in-family</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>2174</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>50</td>\n",
" <td>Self-emp-not-inc</td>\n",
" <td>83311</td>\n",
" <td>Bachelors</td>\n",
" <td>13</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Exec-managerial</td>\n",
" <td>Husband</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>13</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>38</td>\n",
" <td>Private</td>\n",
" <td>215646</td>\n",
" <td>HS-grad</td>\n",
" <td>9</td>\n",
" <td>Divorced</td>\n",
" <td>Handlers-cleaners</td>\n",
" <td>Not-in-family</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>53</td>\n",
" <td>Private</td>\n",
" <td>234721</td>\n",
" <td>11th</td>\n",
" <td>7</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Handlers-cleaners</td>\n",
" <td>Husband</td>\n",
" <td>Black</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>28</td>\n",
" <td>Private</td>\n",
" <td>338409</td>\n",
" <td>Bachelors</td>\n",
" <td>13</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Prof-specialty</td>\n",
" <td>Wife</td>\n",
" <td>Black</td>\n",
" <td>Female</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>Cuba</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 \\\n",
"0 39 State-gov 77516 Bachelors 13 Never-married \n",
"1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse \n",
"2 38 Private 215646 HS-grad 9 Divorced \n",
"3 53 Private 234721 11th 7 Married-civ-spouse \n",
"4 28 Private 338409 Bachelors 13 Married-civ-spouse \n",
"\n",
" 6 7 8 9 10 11 12 \\\n",
"0 Adm-clerical Not-in-family White Male 2174 0 40 \n",
"1 Exec-managerial Husband White Male 0 0 13 \n",
"2 Handlers-cleaners Not-in-family White Male 0 0 40 \n",
"3 Handlers-cleaners Husband Black Male 0 0 40 \n",
"4 Prof-specialty Wife Black Female 0 0 40 \n",
"\n",
" 13 14 \n",
"0 United-States <=50K \n",
"1 United-States <=50K \n",
"2 United-States <=50K \n",
"3 United-States <=50K \n",
"4 Cuba <=50K "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# preview the dataset\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rename column names\n",
"\n",
"We can see that the dataset does not have proper column names. The columns are merely labelled as 0,1,2.... and so on. We should give proper names to the columns. I will do it as follows:-"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',\n",
" 'marital_status', 'occupation', 'relationship', 'race', 'sex',\n",
" 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',\n",
" 'income'],\n",
" dtype='object')"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',\n",
" 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']\n",
"\n",
"df.columns = col_names\n",
"\n",
"df.columns"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>workclass</th>\n",
" <th>fnlwgt</th>\n",
" <th>education</th>\n",
" <th>education_num</th>\n",
" <th>marital_status</th>\n",
" <th>occupation</th>\n",
" <th>relationship</th>\n",
" <th>race</th>\n",
" <th>sex</th>\n",
" <th>capital_gain</th>\n",
" <th>capital_loss</th>\n",
" <th>hours_per_week</th>\n",
" <th>native_country</th>\n",
" <th>income</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>39</td>\n",
" <td>State-gov</td>\n",
" <td>77516</td>\n",
" <td>Bachelors</td>\n",
" <td>13</td>\n",
" <td>Never-married</td>\n",
" <td>Adm-clerical</td>\n",
" <td>Not-in-family</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>2174</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>50</td>\n",
" <td>Self-emp-not-inc</td>\n",
" <td>83311</td>\n",
" <td>Bachelors</td>\n",
" <td>13</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Exec-managerial</td>\n",
" <td>Husband</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>13</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>38</td>\n",
" <td>Private</td>\n",
" <td>215646</td>\n",
" <td>HS-grad</td>\n",
" <td>9</td>\n",
" <td>Divorced</td>\n",
" <td>Handlers-cleaners</td>\n",
" <td>Not-in-family</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>53</td>\n",
" <td>Private</td>\n",
" <td>234721</td>\n",
" <td>11th</td>\n",
" <td>7</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Handlers-cleaners</td>\n",
" <td>Husband</td>\n",
" <td>Black</td>\n",
" <td>Male</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>28</td>\n",
" <td>Private</td>\n",
" <td>338409</td>\n",
" <td>Bachelors</td>\n",
" <td>13</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Prof-specialty</td>\n",
" <td>Wife</td>\n",
" <td>Black</td>\n",
" <td>Female</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" <td>Cuba</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age workclass fnlwgt education education_num \\\n",
"0 39 State-gov 77516 Bachelors 13 \n",
"1 50 Self-emp-not-inc 83311 Bachelors 13 \n",
"2 38 Private 215646 HS-grad 9 \n",
"3 53 Private 234721 11th 7 \n",
"4 28 Private 338409 Bachelors 13 \n",
"\n",
" marital_status occupation relationship race sex \\\n",
"0 Never-married Adm-clerical Not-in-family White Male \n",
"1 Married-civ-spouse Exec-managerial Husband White Male \n",
"2 Divorced Handlers-cleaners Not-in-family White Male \n",
"3 Married-civ-spouse Handlers-cleaners Husband Black Male \n",
"4 Married-civ-spouse Prof-specialty Wife Black Female \n",
"\n",
" capital_gain capital_loss hours_per_week native_country income \n",
"0 2174 0 40 United-States <=50K \n",
"1 0 0 13 United-States <=50K \n",
"2 0 0 40 United-States <=50K \n",
"3 0 0 40 United-States <=50K \n",
"4 0 0 40 Cuba <=50K "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# let's agian preview the dataset\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the column names are renamed. Now, the columns have meaningful names."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View summary of dataset"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 32561 entries, 0 to 32560\n",
"Data columns (total 15 columns):\n",
"age 32561 non-null int64\n",
"workclass 32561 non-null object\n",
"fnlwgt 32561 non-null int64\n",
"education 32561 non-null object\n",
"education_num 32561 non-null int64\n",
"marital_status 32561 non-null object\n",
"occupation 32561 non-null object\n",
"relationship 32561 non-null object\n",
"race 32561 non-null object\n",
"sex 32561 non-null object\n",
"capital_gain 32561 non-null int64\n",
"capital_loss 32561 non-null int64\n",
"hours_per_week 32561 non-null int64\n",
"native_country 32561 non-null object\n",
"income 32561 non-null object\n",
"dtypes: int64(6), object(9)\n",
"memory usage: 3.7+ MB\n"
]
}
],
"source": [
"# view summary of dataset\n",
"\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are no missing values in the dataset. I will confirm this further."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Types of variables\n",
"\n",
"\n",
"In this section, I segregate the dataset into categorical and numerical variables. There are a mixture of categorical and numerical variables in the dataset. Categorical variables have data type object. Numerical variables have data type int64.\n",
"\n",
"\n",
"First of all, I will explore categorical variables."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore categorical variables"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 9 categorical variables\n",
"\n",
"The categorical variables are :\n",
"\n",
" ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country', 'income']\n"
]
}
],
"source": [
"# find categorical variables\n",
"\n",
"categorical = [var for var in df.columns if df[var].dtype=='O']\n",
"\n",
"print('There are {} categorical variables\\n'.format(len(categorical)))\n",
"\n",
"print('The categorical variables are :\\n\\n', categorical)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>workclass</th>\n",
" <th>education</th>\n",
" <th>marital_status</th>\n",
" <th>occupation</th>\n",
" <th>relationship</th>\n",
" <th>race</th>\n",
" <th>sex</th>\n",
" <th>native_country</th>\n",
" <th>income</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>State-gov</td>\n",
" <td>Bachelors</td>\n",
" <td>Never-married</td>\n",
" <td>Adm-clerical</td>\n",
" <td>Not-in-family</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Self-emp-not-inc</td>\n",
" <td>Bachelors</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Exec-managerial</td>\n",
" <td>Husband</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Private</td>\n",
" <td>HS-grad</td>\n",
" <td>Divorced</td>\n",
" <td>Handlers-cleaners</td>\n",
" <td>Not-in-family</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Private</td>\n",
" <td>11th</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Handlers-cleaners</td>\n",
" <td>Husband</td>\n",
" <td>Black</td>\n",
" <td>Male</td>\n",
" <td>United-States</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Private</td>\n",
" <td>Bachelors</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Prof-specialty</td>\n",
" <td>Wife</td>\n",
" <td>Black</td>\n",
" <td>Female</td>\n",
" <td>Cuba</td>\n",
" <td>&lt;=50K</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" workclass education marital_status occupation \\\n",
"0 State-gov Bachelors Never-married Adm-clerical \n",
"1 Self-emp-not-inc Bachelors Married-civ-spouse Exec-managerial \n",
"2 Private HS-grad Divorced Handlers-cleaners \n",
"3 Private 11th Married-civ-spouse Handlers-cleaners \n",
"4 Private Bachelors Married-civ-spouse Prof-specialty \n",
"\n",
" relationship race sex native_country income \n",
"0 Not-in-family White Male United-States <=50K \n",
"1 Husband White Male United-States <=50K \n",
"2 Not-in-family White Male United-States <=50K \n",
"3 Husband Black Male United-States <=50K \n",
"4 Wife Black Female Cuba <=50K "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view the categorical variables\n",
"\n",
"df[categorical].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary of categorical variables\n",
"\n",
"\n",
"- There are 9 categorical variables. \n",
"\n",
"\n",
"- The categorical variables are given by `workclass`, `education`, `marital_status`, `occupation`, `relationship`, `race`, `sex`, `native_country` and `income`.\n",
"\n",
"\n",
"- `income` is the target variable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore problems within categorical variables\n",
"\n",
"\n",
"First, I will explore the categorical variables.\n",
"\n",
"\n",
"### Missing values in categorical variables"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"workclass 0\n",
"education 0\n",
"marital_status 0\n",
"occupation 0\n",
"relationship 0\n",
"race 0\n",
"sex 0\n",
"native_country 0\n",
"income 0\n",
"dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in categorical variables\n",
"\n",
"df[categorical].isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are no missing values in the categorical variables. I will confirm this further."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Frequency counts of categorical variables\n",
"\n",
"\n",
"Now, I will check the frequency counts of categorical variables."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Private 22696\n",
"Self-emp-not-inc 2541\n",
"Local-gov 2093\n",
"? 1836\n",
"State-gov 1298\n",
"Self-emp-inc 1116\n",
"Federal-gov 960\n",
"Without-pay 14\n",
"Never-worked 7\n",
"Name: workclass, dtype: int64\n",
"HS-grad 10501\n",
"Some-college 7291\n",
"Bachelors 5355\n",
"Masters 1723\n",
"Assoc-voc 1382\n",
"11th 1175\n",
"Assoc-acdm 1067\n",
"10th 933\n",
"7th-8th 646\n",
"Prof-school 576\n",
"9th 514\n",
"12th 433\n",
"Doctorate 413\n",
"5th-6th 333\n",
"1st-4th 168\n",
"Preschool 51\n",
"Name: education, dtype: int64\n",
"Married-civ-spouse 14976\n",
"Never-married 10683\n",
"Divorced 4443\n",
"Separated 1025\n",
"Widowed 993\n",
"Married-spouse-absent 418\n",
"Married-AF-spouse 23\n",
"Name: marital_status, dtype: int64\n",
"Prof-specialty 4140\n",
"Craft-repair 4099\n",
"Exec-managerial 4066\n",
"Adm-clerical 3770\n",
"Sales 3650\n",
"Other-service 3295\n",
"Machine-op-inspct 2002\n",
"? 1843\n",
"Transport-moving 1597\n",
"Handlers-cleaners 1370\n",
"Farming-fishing 994\n",
"Tech-support 928\n",
"Protective-serv 649\n",
"Priv-house-serv 149\n",
"Armed-Forces 9\n",
"Name: occupation, dtype: int64\n",
"Husband 13193\n",
"Not-in-family 8305\n",
"Own-child 5068\n",
"Unmarried 3446\n",
"Wife 1568\n",
"Other-relative 981\n",
"Name: relationship, dtype: int64\n",
"White 27816\n",
"Black 3124\n",
"Asian-Pac-Islander 1039\n",
"Amer-Indian-Eskimo 311\n",
"Other 271\n",
"Name: race, dtype: int64\n",
"Male 21790\n",
"Female 10771\n",
"Name: sex, dtype: int64\n",
"United-States 29170\n",
"Mexico 643\n",
"? 583\n",
"Philippines 198\n",
"Germany 137\n",
"Canada 121\n",
"Puerto-Rico 114\n",
"El-Salvador 106\n",
"India 100\n",
"Cuba 95\n",
"England 90\n",
"Jamaica 81\n",
"South 80\n",
"China 75\n",
"Italy 73\n",
"Dominican-Republic 70\n",
"Vietnam 67\n",
"Guatemala 64\n",
"Japan 62\n",
"Poland 60\n",
"Columbia 59\n",
"Taiwan 51\n",
"Haiti 44\n",
"Iran 43\n",
"Portugal 37\n",
"Nicaragua 34\n",
"Peru 31\n",
"France 29\n",
"Greece 29\n",
"Ecuador 28\n",
"Ireland 24\n",
"Hong 20\n",
"Trinadad&Tobago 19\n",
"Cambodia 19\n",
"Thailand 18\n",
"Laos 18\n",
"Yugoslavia 16\n",
"Outlying-US(Guam-USVI-etc) 14\n",
"Honduras 13\n",
"Hungary 13\n",
"Scotland 12\n",
"Holand-Netherlands 1\n",
"Name: native_country, dtype: int64\n",
"<=50K 24720\n",
">50K 7841\n",
"Name: income, dtype: int64\n"
]
}
],
"source": [
"# view frequency counts of values in categorical variables\n",
"\n",
"for var in categorical: \n",
" \n",
" print(df[var].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Private 0.697030\n",
"Self-emp-not-inc 0.078038\n",
"Local-gov 0.064279\n",
"? 0.056386\n",
"State-gov 0.039864\n",
"Self-emp-inc 0.034274\n",
"Federal-gov 0.029483\n",
"Without-pay 0.000430\n",
"Never-worked 0.000215\n",
"Name: workclass, dtype: float64\n",
"HS-grad 0.322502\n",
"Some-college 0.223918\n",
"Bachelors 0.164461\n",
"Masters 0.052916\n",
"Assoc-voc 0.042443\n",
"11th 0.036086\n",
"Assoc-acdm 0.032769\n",
"10th 0.028654\n",
"7th-8th 0.019840\n",
"Prof-school 0.017690\n",
"9th 0.015786\n",
"12th 0.013298\n",
"Doctorate 0.012684\n",
"5th-6th 0.010227\n",
"1st-4th 0.005160\n",
"Preschool 0.001566\n",
"Name: education, dtype: float64\n",
"Married-civ-spouse 0.459937\n",
"Never-married 0.328092\n",
"Divorced 0.136452\n",
"Separated 0.031479\n",
"Widowed 0.030497\n",
"Married-spouse-absent 0.012837\n",
"Married-AF-spouse 0.000706\n",
"Name: marital_status, dtype: float64\n",
"Prof-specialty 0.127146\n",
"Craft-repair 0.125887\n",
"Exec-managerial 0.124873\n",
"Adm-clerical 0.115783\n",
"Sales 0.112097\n",
"Other-service 0.101195\n",
"Machine-op-inspct 0.061485\n",
"? 0.056601\n",
"Transport-moving 0.049046\n",
"Handlers-cleaners 0.042075\n",
"Farming-fishing 0.030527\n",
"Tech-support 0.028500\n",
"Protective-serv 0.019932\n",
"Priv-house-serv 0.004576\n",
"Armed-Forces 0.000276\n",
"Name: occupation, dtype: float64\n",
"Husband 0.405178\n",
"Not-in-family 0.255060\n",
"Own-child 0.155646\n",
"Unmarried 0.105832\n",
"Wife 0.048156\n",
"Other-relative 0.030128\n",
"Name: relationship, dtype: float64\n",
"White 0.854274\n",
"Black 0.095943\n",
"Asian-Pac-Islander 0.031909\n",
"Amer-Indian-Eskimo 0.009551\n",
"Other 0.008323\n",
"Name: race, dtype: float64\n",
"Male 0.669205\n",
"Female 0.330795\n",
"Name: sex, dtype: float64\n",
"United-States 0.895857\n",
"Mexico 0.019748\n",
"? 0.017905\n",
"Philippines 0.006081\n",
"Germany 0.004207\n",
"Canada 0.003716\n",
"Puerto-Rico 0.003501\n",
"El-Salvador 0.003255\n",
"India 0.003071\n",
"Cuba 0.002918\n",
"England 0.002764\n",
"Jamaica 0.002488\n",
"South 0.002457\n",
"China 0.002303\n",
"Italy 0.002242\n",
"Dominican-Republic 0.002150\n",
"Vietnam 0.002058\n",
"Guatemala 0.001966\n",
"Japan 0.001904\n",
"Poland 0.001843\n",
"Columbia 0.001812\n",
"Taiwan 0.001566\n",
"Haiti 0.001351\n",
"Iran 0.001321\n",
"Portugal 0.001136\n",
"Nicaragua 0.001044\n",
"Peru 0.000952\n",
"France 0.000891\n",
"Greece 0.000891\n",
"Ecuador 0.000860\n",
"Ireland 0.000737\n",
"Hong 0.000614\n",
"Trinadad&Tobago 0.000584\n",
"Cambodia 0.000584\n",
"Thailand 0.000553\n",
"Laos 0.000553\n",
"Yugoslavia 0.000491\n",
"Outlying-US(Guam-USVI-etc) 0.000430\n",
"Honduras 0.000399\n",
"Hungary 0.000399\n",
"Scotland 0.000369\n",
"Holand-Netherlands 0.000031\n",
"Name: native_country, dtype: float64\n",
"<=50K 0.75919\n",
">50K 0.24081\n",
"Name: income, dtype: float64\n"
]
}
],
"source": [
"# view frequency distribution of categorical variables\n",
"\n",
"for var in categorical: \n",
" \n",
" print(df[var].value_counts()/np.float(len(df)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can see that there are several variables like `workclass`, `occupation` and `native_country` which contain missing values. Generally, the missing values are coded as `NaN` and python will detect them with the usual command of `df.isnull().sum()`.\n",
"\n",
"But, in this case the missing values are coded as `?`. Python fail to detect these as missing values because it do not consider `?` as missing values. So, I have to replace `?` with `NaN` so that Python can detect these missing values.\n",
"\n",
"I will explore these variables and replace `?` with `NaN`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore workclass variable"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov',\n",
" 'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'],\n",
" dtype=object)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check labels in workclass variable\n",
"\n",
"df.workclass.unique()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Private 22696\n",
"Self-emp-not-inc 2541\n",
"Local-gov 2093\n",
"? 1836\n",
"State-gov 1298\n",
"Self-emp-inc 1116\n",
"Federal-gov 960\n",
"Without-pay 14\n",
"Never-worked 7\n",
"Name: workclass, dtype: int64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check frequency distribution of values in workclass variable\n",
"\n",
"df.workclass.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 1836 values encoded as `?` in workclass variable. I will replace these `?` with `NaN`."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# replace '?' values in workclass variable with `NaN`\n",
"\n",
"\n",
"df['workclass'].replace('?', np.NaN, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Private 22696\n",
"Self-emp-not-inc 2541\n",
"Local-gov 2093\n",
"State-gov 1298\n",
"Self-emp-inc 1116\n",
"Federal-gov 960\n",
"Without-pay 14\n",
"Never-worked 7\n",
"Name: workclass, dtype: int64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# again check the frequency distribution of values in workclass variable\n",
"\n",
"df.workclass.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can see that there are no values encoded as `?` in the `workclass` variable.\n",
"\n",
"I will adopt similar approach with `occupation` and `native_country` column."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore occupation variable"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners',\n",
" 'Prof-specialty', 'Other-service', 'Sales', 'Craft-repair',\n",
" 'Transport-moving', 'Farming-fishing', 'Machine-op-inspct',\n",
" 'Tech-support', '?', 'Protective-serv', 'Armed-Forces',\n",
" 'Priv-house-serv'], dtype=object)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check labels in occupation variable\n",
"\n",
"df.occupation.unique()\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Prof-specialty 4140\n",
"Craft-repair 4099\n",
"Exec-managerial 4066\n",
"Adm-clerical 3770\n",
"Sales 3650\n",
"Other-service 3295\n",
"Machine-op-inspct 2002\n",
"? 1843\n",
"Transport-moving 1597\n",
"Handlers-cleaners 1370\n",
"Farming-fishing 994\n",
"Tech-support 928\n",
"Protective-serv 649\n",
"Priv-house-serv 149\n",
"Armed-Forces 9\n",
"Name: occupation, dtype: int64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check frequency distribution of values in occupation variable\n",
"\n",
"df.occupation.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 1843 values encoded as `?` in `occupation` variable. I will replace these `?` with `NaN`."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# replace '?' values in occupation variable with `NaN`\n",
"\n",
"df['occupation'].replace('?', np.NaN, inplace=True)\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Prof-specialty 4140\n",
"Craft-repair 4099\n",
"Exec-managerial 4066\n",
"Adm-clerical 3770\n",
"Sales 3650\n",
"Other-service 3295\n",
"Machine-op-inspct 2002\n",
"Transport-moving 1597\n",
"Handlers-cleaners 1370\n",
"Farming-fishing 994\n",
"Tech-support 928\n",
"Protective-serv 649\n",
"Priv-house-serv 149\n",
"Armed-Forces 9\n",
"Name: occupation, dtype: int64"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# again check the frequency distribution of values in occupation variable\n",
"\n",
"df.occupation.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore native_country variable\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['United-States', 'Cuba', 'Jamaica', 'India', '?', 'Mexico',\n",
" 'South', 'Puerto-Rico', 'Honduras', 'England', 'Canada', 'Germany',\n",
" 'Iran', 'Philippines', 'Italy', 'Poland', 'Columbia', 'Cambodia',\n",
" 'Thailand', 'Ecuador', 'Laos', 'Taiwan', 'Haiti', 'Portugal',\n",
" 'Dominican-Republic', 'El-Salvador', 'France', 'Guatemala',\n",
" 'China', 'Japan', 'Yugoslavia', 'Peru',\n",
" 'Outlying-US(Guam-USVI-etc)', 'Scotland', 'Trinadad&Tobago',\n",
" 'Greece', 'Nicaragua', 'Vietnam', 'Hong', 'Ireland', 'Hungary',\n",
" 'Holand-Netherlands'], dtype=object)"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check labels in native_country variable\n",
"\n",
"df.native_country.unique()\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"United-States 29170\n",
"Mexico 643\n",
"? 583\n",
"Philippines 198\n",
"Germany 137\n",
"Canada 121\n",
"Puerto-Rico 114\n",
"El-Salvador 106\n",
"India 100\n",
"Cuba 95\n",
"England 90\n",
"Jamaica 81\n",
"South 80\n",
"China 75\n",
"Italy 73\n",
"Dominican-Republic 70\n",
"Vietnam 67\n",
"Guatemala 64\n",
"Japan 62\n",
"Poland 60\n",
"Columbia 59\n",
"Taiwan 51\n",
"Haiti 44\n",
"Iran 43\n",
"Portugal 37\n",
"Nicaragua 34\n",
"Peru 31\n",
"France 29\n",
"Greece 29\n",
"Ecuador 28\n",
"Ireland 24\n",
"Hong 20\n",
"Trinadad&Tobago 19\n",
"Cambodia 19\n",
"Thailand 18\n",
"Laos 18\n",
"Yugoslavia 16\n",
"Outlying-US(Guam-USVI-etc) 14\n",
"Honduras 13\n",
"Hungary 13\n",
"Scotland 12\n",
"Holand-Netherlands 1\n",
"Name: native_country, dtype: int64"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check frequency distribution of values in native_country variable\n",
"\n",
"df.native_country.value_counts()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that 583 values are encoded as `?` in the `native_country` variable. I will replace these `?` values with `NaN`.\n"
]
},
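{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# replace '?' values in native_country variable with `NaN`\n",
"# (assign the result back rather than using inplace=True on a selected column;\n",
"# np.nan is the spelling that works across NumPy versions)\n",
"\n",
"df['native_country'] = df['native_country'].replace('?', np.nan)"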
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"United-States 29170\n",
"Mexico 643\n",
"Philippines 198\n",
"Germany 137\n",
"Canada 121\n",
"Puerto-Rico 114\n",
"El-Salvador 106\n",
"India 100\n",
"Cuba 95\n",
"England 90\n",
"Jamaica 81\n",
"South 80\n",
"China 75\n",
"Italy 73\n",
"Dominican-Republic 70\n",
"Vietnam 67\n",
"Guatemala 64\n",
"Japan 62\n",
"Poland 60\n",
"Columbia 59\n",
"Taiwan 51\n",
"Haiti 44\n",
"Iran 43\n",
"Portugal 37\n",
"Nicaragua 34\n",
"Peru 31\n",
"France 29\n",
"Greece 29\n",
"Ecuador 28\n",
"Ireland 24\n",
"Hong 20\n",
"Trinadad&Tobago 19\n",
"Cambodia 19\n",
"Thailand 18\n",
"Laos 18\n",
"Yugoslavia 16\n",
"Outlying-US(Guam-USVI-etc) 14\n",
"Honduras 13\n",
"Hungary 13\n",
"Scotland 12\n",
"Holand-Netherlands 1\n",
"Name: native_country, dtype: int64"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# again check the frequency distribution of values in native_country variable\n",
"\n",
"df.native_country.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check missing values in categorical variables again"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"workclass 1836\n",
"education 0\n",
"marital_status 0\n",
"occupation 1843\n",
"relationship 0\n",
"race 0\n",
"sex 0\n",
"native_country 583\n",
"income 0\n",
"dtype: int64"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[categorical].isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can see that the `workclass`, `occupation` and `native_country` variables contain missing values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Number of labels: cardinality\n",
"\n",
"\n",
"The number of labels within a categorical variable is known as its **cardinality**. A large number of labels within a variable is known as **high cardinality**. High cardinality can cause problems for a machine learning model: for example, one-hot encoding such a variable produces many sparse columns, and rare labels may appear in the test set but not in the training set. So, I will check for high cardinality."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"workclass contains 9 labels\n",
"education contains 16 labels\n",
"marital_status contains 7 labels\n",
"occupation contains 15 labels\n",
"relationship contains 6 labels\n",
"race contains 5 labels\n",
"sex contains 2 labels\n",
"native_country contains 42 labels\n",
"income contains 2 labels\n"
]
}
],
"source": [
"# check for cardinality in categorical variables\n",
"# note: unique() also counts NaN, so columns with missing values report one extra label\n",
"\n",
"for var in categorical:\n",
"    print(var, ' contains ', len(df[var].unique()), ' labels')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the `native_country` column contains a relatively large number of labels compared to the other columns. I will check for cardinality again after the train-test split."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore Numerical Variables"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 6 numerical variables\n",
"\n",
"The numerical variables are : ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']\n"
]
}
],
"source": [
"# find numerical variables\n",
"\n",
"numerical = [var for var in df.columns if df[var].dtype!='O']\n",
"\n",
"print('There are {} numerical variables\\n'.format(len(numerical)))\n",
"\n",
"print('The numerical variables are :', numerical)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>fnlwgt</th>\n",
" <th>education_num</th>\n",
" <th>capital_gain</th>\n",
" <th>capital_loss</th>\n",
" <th>hours_per_week</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>39</td>\n",
" <td>77516</td>\n",
" <td>13</td>\n",
" <td>2174</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>50</td>\n",
" <td>83311</td>\n",
" <td>13</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>38</td>\n",
" <td>215646</td>\n",
" <td>9</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>53</td>\n",
" <td>234721</td>\n",
" <td>7</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>28</td>\n",
" <td>338409</td>\n",
" <td>13</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>40</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age fnlwgt education_num capital_gain capital_loss hours_per_week\n",
"0 39 77516 13 2174 0 40\n",
"1 50 83311 13 0 0 13\n",
"2 38 215646 9 0 0 40\n",
"3 53 234721 7 0 0 40\n",
"4 28 338409 13 0 0 40"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view the numerical variables\n",
"\n",
"df[numerical].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary of numerical variables\n",
"\n",
"\n",
"- There are 6 numerical variables. \n",
"\n",
"\n",
"- These are given by `age`, `fnlwgt`, `education_num`, `capital_gain`, `capital_loss` and `hours_per_week`.\n",
"\n",
"\n",
"- All of the numerical variables are of discrete (integer) data type."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore problems within numerical variables\n",
"\n",
"\n",
"Now, I will explore the numerical variables.\n",
"\n",
"\n",
"### Missing values in numerical variables"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"age 0\n",
"fnlwgt 0\n",
"education_num 0\n",
"capital_gain 0\n",
"capital_loss 0\n",
"hours_per_week 0\n",
"dtype: int64"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in numerical variables\n",
"\n",
"df[numerical].isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that none of the 6 numerical variables contains missing values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Declare feature vector and target variable"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"X = df.drop(['income'], axis=1)\n",
"\n",
"y = df['income']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Split data into separate training and test set"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# split X and y into training and testing sets\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((22792, 14), (9769, 14))"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check the shape of X_train and X_test\n",
"\n",
"X_train.shape, X_test.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Feature Engineering\n",
"\n",
"\n",
"**Feature Engineering** is the process of transforming raw data into useful features that help us to understand our model better and increase its predictive power. I will carry out feature engineering on different types of variables.\n",
"\n",
"\n",
"First, I will display the categorical and numerical variables again separately."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"age int64\n",
"workclass object\n",
"fnlwgt int64\n",
"education object\n",
"education_num int64\n",
"marital_status object\n",
"occupation object\n",
"relationship object\n",
"race object\n",
"sex object\n",
"capital_gain int64\n",
"capital_loss int64\n",
"hours_per_week int64\n",
"native_country object\n",
"dtype: object"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check data types in X_train\n",
"\n",
"X_train.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['workclass',\n",
" 'education',\n",
" 'marital_status',\n",
" 'occupation',\n",
" 'relationship',\n",
" 'race',\n",
" 'sex',\n",
" 'native_country']"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# display categorical variables\n",
"\n",
"categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']\n",
"\n",
"categorical"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['age',\n",
" 'fnlwgt',\n",
" 'education_num',\n",
" 'capital_gain',\n",
" 'capital_loss',\n",
" 'hours_per_week']"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# display numerical variables\n",
"\n",
"numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']\n",
"\n",
"numerical"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Engineering missing values in categorical variables"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"workclass 0.055985\n",
"education 0.000000\n",
"marital_status 0.000000\n",
"occupation 0.056072\n",
"relationship 0.000000\n",
"race 0.000000\n",
"sex 0.000000\n",
"native_country 0.018164\n",
"dtype: float64"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# print the fraction of missing values in the categorical variables in the training set\n",
"\n",
"X_train[categorical].isnull().mean()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"workclass 0.055984555984555984\n",
"occupation 0.05607230607230607\n",
"native_country 0.018164268164268166\n"
]
}
],
"source": [
"# print categorical variables with missing data\n",
"\n",
"for col in categorical:\n",
"    if X_train[col].isnull().mean()>0:\n",
"        print(col, (X_train[col].isnull().mean()))"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"# impute missing categorical variables with the most frequent value,\n",
"# computed on the training set only so that no test-set information leaks in\n",
"\n",
"for df2 in [X_train, X_test]:\n",
"    for col in ['workclass', 'occupation', 'native_country']:\n",
"        # assign back instead of calling fillna(inplace=True) on a selected column\n",
"        df2[col] = df2[col].fillna(X_train[col].mode()[0])"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"workclass 0\n",
"education 0\n",
"marital_status 0\n",
"occupation 0\n",
"relationship 0\n",
"race 0\n",
"sex 0\n",
"native_country 0\n",
"dtype: int64"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in categorical variables in X_train\n",
"\n",
"X_train[categorical].isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"workclass 0\n",
"education 0\n",
"marital_status 0\n",
"occupation 0\n",
"relationship 0\n",
"race 0\n",
"sex 0\n",
"native_country 0\n",
"dtype: int64"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in categorical variables in X_test\n",
"\n",
"X_test[categorical].isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a final check, I will check for missing values in X_train and X_test."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"age 0\n",
"workclass 0\n",
"fnlwgt 0\n",
"education 0\n",
"education_num 0\n",
"marital_status 0\n",
"occupation 0\n",
"relationship 0\n",
"race 0\n",
"sex 0\n",
"capital_gain 0\n",
"capital_loss 0\n",
"hours_per_week 0\n",
"native_country 0\n",
"dtype: int64"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in X_train\n",
"\n",
"X_train.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"age 0\n",
"workclass 0\n",
"fnlwgt 0\n",
"education 0\n",
"education_num 0\n",
"marital_status 0\n",
"occupation 0\n",
"relationship 0\n",
"race 0\n",
"sex 0\n",
"capital_gain 0\n",
"capital_loss 0\n",
"hours_per_week 0\n",
"native_country 0\n",
"dtype: int64"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in X_test\n",
"\n",
"X_test.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are no missing values in X_train and X_test."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode categorical variables"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['workclass',\n",
" 'education',\n",
" 'marital_status',\n",
" 'occupation',\n",
" 'relationship',\n",
" 'race',\n",
" 'sex',\n",
" 'native_country']"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# print categorical variables\n",
"\n",
"categorical"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>workclass</th>\n",
" <th>education</th>\n",
" <th>marital_status</th>\n",
" <th>occupation</th>\n",
" <th>relationship</th>\n",
" <th>race</th>\n",
" <th>sex</th>\n",
" <th>native_country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>32098</th>\n",
" <td>Private</td>\n",
" <td>HS-grad</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Craft-repair</td>\n",
" <td>Husband</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>United-States</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25206</th>\n",
" <td>State-gov</td>\n",
" <td>HS-grad</td>\n",
" <td>Divorced</td>\n",
" <td>Adm-clerical</td>\n",
" <td>Unmarried</td>\n",
" <td>White</td>\n",
" <td>Female</td>\n",
" <td>United-States</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23491</th>\n",
" <td>Private</td>\n",
" <td>Some-college</td>\n",
" <td>Married-civ-spouse</td>\n",
" <td>Sales</td>\n",
" <td>Husband</td>\n",
" <td>White</td>\n",
" <td>Male</td>\n",
" <td>United-States</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12367</th>\n",
" <td>Private</td>\n",
" <td>HS-grad</td>\n",
" <td>Never-married</td>\n",
" <td>Craft-repair</td>\n",
" <td>Not-in-family</td>\n",