@kiwidamien
Created August 26, 2019 04:57
Category Encoders companion gist
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Install category_encoders using pip (not conda, which is\n",
"# on an old version)\n",
"#!pip install category_encoders"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to category encoders"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's read in the example data that was used in the article [\"Encoding Categorical Variables\"](https://kiwidamien.gihub.io/encoding-categorical-variables.html). We are deliberately using a small dataset, so that it is easy to see what the encoders are doing."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"You are using category encoders version 2.0.0\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>annual_income</th>\n",
" <th>debt_to_income</th>\n",
" <th>loan_amount</th>\n",
" <th>purpose</th>\n",
" <th>grade</th>\n",
" <th>repaid</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>120000</td>\n",
" <td>0.100</td>\n",
" <td>3500</td>\n",
" <td>medical</td>\n",
" <td>A</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>130000</td>\n",
" <td>0.500</td>\n",
" <td>13800</td>\n",
" <td>medical</td>\n",
" <td>C</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>220000</td>\n",
" <td>0.400</td>\n",
" <td>33500</td>\n",
" <td>medical</td>\n",
" <td>B</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>65000</td>\n",
" <td>0.250</td>\n",
" <td>2000</td>\n",
" <td>refinance</td>\n",
" <td>B</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>60000</td>\n",
" <td>0.200</td>\n",
" <td>2200</td>\n",
" <td>refinance</td>\n",
" <td>B</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>45000</td>\n",
" <td>0.312</td>\n",
" <td>5500</td>\n",
" <td>auto</td>\n",
" <td>D</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>75000</td>\n",
" <td>0.111</td>\n",
" <td>2000</td>\n",
" <td>auto</td>\n",
" <td>B</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>24000</td>\n",
" <td>0.400</td>\n",
" <td>500</td>\n",
" <td>other</td>\n",
" <td>C</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" annual_income debt_to_income loan_amount purpose grade repaid\n",
"0 120000 0.100 3500 medical A True\n",
"1 130000 0.500 13800 medical C False\n",
"2 220000 0.400 33500 medical B False\n",
"3 65000 0.250 2000 refinance B False\n",
"4 60000 0.200 2200 refinance B True\n",
"5 45000 0.312 5500 auto D True\n",
"6 75000 0.111 2000 auto B True\n",
"7 24000 0.400 500 other C False"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import category_encoders as ce\n",
"import pandas as pd\n",
"\n",
"print(f\"You are using category encoders version {ce.__version__}\")\n",
"if int(ce.__version__.split('.')[0]) < 2:\n",
" print(\"Install version 2.0.0 or higher!\")\n",
" \n",
"df_train = pd.read_csv('https://raw.githubusercontent.com/kiwidamien/StackedTurtles/master/content/preprocessing/simple_loan_example.csv')\n",
"df_train"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ordinal encoder"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Used for ordered categories (e.g. grade, where `A` is better than `B`, `B` is better than `C`, etc.)\n",
"* The actual values used **don't** matter for tree-based models; only the order matters\n",
"* The actual values used **do** matter for models based on linear coefficients.\n",
"\n",
"Let's start with the default encoding:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>annual_income</th>\n",
" <th>debt_to_income</th>\n",
" <th>loan_amount</th>\n",
" <th>purpose</th>\n",
" <th>grade</th>\n",
" <th>repaid</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>120000</td>\n",
" <td>0.100</td>\n",
" <td>3500</td>\n",
" <td>medical</td>\n",
" <td>1</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>130000</td>\n",
" <td>0.500</td>\n",
" <td>13800</td>\n",
" <td>medical</td>\n",
" <td>2</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>220000</td>\n",
" <td>0.400</td>\n",
" <td>33500</td>\n",
" <td>medical</td>\n",
" <td>3</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>65000</td>\n",
" <td>0.250</td>\n",
" <td>2000</td>\n",
" <td>refinance</td>\n",
" <td>3</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>60000</td>\n",
" <td>0.200</td>\n",
" <td>2200</td>\n",
" <td>refinance</td>\n",
" <td>3</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>45000</td>\n",
" <td>0.312</td>\n",
" <td>5500</td>\n",
" <td>auto</td>\n",
" <td>4</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>75000</td>\n",
" <td>0.111</td>\n",
" <td>2000</td>\n",
" <td>auto</td>\n",
" <td>3</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>24000</td>\n",
" <td>0.400</td>\n",
" <td>500</td>\n",
" <td>other</td>\n",
" <td>2</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" annual_income debt_to_income loan_amount purpose grade repaid\n",
"0 120000 0.100 3500 medical 1 True\n",
"1 130000 0.500 13800 medical 2 False\n",
"2 220000 0.400 33500 medical 3 False\n",
"3 65000 0.250 2000 refinance 3 False\n",
"4 60000 0.200 2200 refinance 3 True\n",
"5 45000 0.312 5500 auto 4 True\n",
"6 75000 0.111 2000 auto 3 True\n",
"7 24000 0.400 500 other 2 False"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"encoder_grade = ce.OrdinalEncoder(cols=['grade'], return_df=True)\n",
"encoder_grade.fit_transform(df_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What happens if we have a new grade (e.g. `E`) that we didn't see in training?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>annual_income</th>\n",
" <th>debt_to_income</th>\n",
" <th>loan_amount</th>\n",
" <th>purpose</th>\n",
" <th>grade</th>\n",
" <th>repaid</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>120000</td>\n",
" <td>0.100</td>\n",
" <td>3500</td>\n",
" <td>medical</td>\n",
" <td>-1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>130000</td>\n",
" <td>0.500</td>\n",
" <td>13800</td>\n",
" <td>medical</td>\n",
" <td>2.0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>220000</td>\n",
" <td>0.400</td>\n",
" <td>33500</td>\n",
" <td>medical</td>\n",
" <td>3.0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>65000</td>\n",
" <td>0.250</td>\n",
" <td>2000</td>\n",
" <td>refinance</td>\n",
" <td>3.0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>60000</td>\n",
" <td>0.200</td>\n",
" <td>2200</td>\n",
" <td>refinance</td>\n",
" <td>3.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>45000</td>\n",
" <td>0.312</td>\n",
" <td>5500</td>\n",
" <td>auto</td>\n",
" <td>4.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>75000</td>\n",
" <td>0.111</td>\n",
" <td>2000</td>\n",
" <td>auto</td>\n",
" <td>3.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>24000</td>\n",
" <td>0.400</td>\n",
" <td>500</td>\n",
" <td>other</td>\n",
" <td>2.0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" annual_income debt_to_income loan_amount purpose grade repaid\n",
"0 120000 0.100 3500 medical -1.0 True\n",
"1 130000 0.500 13800 medical 2.0 False\n",
"2 220000 0.400 33500 medical 3.0 False\n",
"3 65000 0.250 2000 refinance 3.0 False\n",
"4 60000 0.200 2200 refinance 3.0 True\n",
"5 45000 0.312 5500 auto 4.0 True\n",
"6 75000 0.111 2000 auto 3.0 True\n",
"7 24000 0.400 500 other 2.0 False"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_test = df_train.copy()\n",
"df_test.loc[0, 'grade'] = 'E'\n",
"\n",
"encoder_grade.transform(df_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `E` was mapped to the value `-1`."
]
},
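{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the behaviour seen in the two outputs above, here is a minimal pandas-only sketch. It is not the `category_encoders` implementation: it simply assumes that levels are numbered from `1` in order of first appearance and that unseen levels map to `-1`. The names `train_grades`, `test_grades` and `level_to_code` are made up for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Grades in the same order as the training data shown earlier.\n",
"train_grades = pd.Series(['A', 'C', 'B', 'B', 'B', 'D', 'B', 'C'])\n",
"\n",
"# Number the levels 1, 2, 3, ... in order of first appearance.\n",
"level_to_code = {level: code\n",
"                 for code, level in enumerate(train_grades.unique(), start=1)}\n",
"\n",
"# 'E' was never seen in training, so it falls back to -1.\n",
"test_grades = pd.Series(['E', 'C', 'B'])\n",
"print(test_grades.map(level_to_code).fillna(-1).astype(int).tolist())  # [-1, 2, 3]"
]
},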
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Custom map\n",
"\n",
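"By default, the encoder labels the levels seen in training with consecutive integers starting at `1`. In the output above the labels follow the order in which each level first appears (`A` &rightarrow; `1`, `C` &rightarrow; `2`, `B` &rightarrow; `3`, `D` &rightarrow; `4`), not alphabetical order, so the default does not respect the natural ranking of the grades.\n",
"\n",
"Let's say we wanted `A` &rightarrow; `1`, `B` &rightarrow; `3`, `C` &rightarrow; `5`, and anything worse than `C` to go to `10`. We can implement our own map using a function:"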
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>annual_income</th>\n",
" <th>debt_to_income</th>\n",
" <th>loan_amount</th>\n",
" <th>purpose</th>\n",
" <th>grade</th>\n",
" <th>repaid</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>120000</td>\n",
" <td>0.100</td>\n",
" <td>3500</td>\n",
" <td>medical</td>\n",
" <td>1</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>130000</td>\n",
" <td>0.500</td>\n",
" <td>13800</td>\n",
" <td>medical</td>\n",
" <td>5</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>220000</td>\n",
" <td>0.400</td>\n",
" <td>33500</td>\n",
" <td>medical</td>\n",
" <td>3</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>65000</td>\n",
" <td>0.250</td>\n",
" <td>2000</td>\n",
" <td>refinance</td>\n",
" <td>3</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>60000</td>\n",
" <td>0.200</td>\n",
" <td>2200</td>\n",
" <td>refinance</td>\n",
" <td>3</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>45000</td>\n",
" <td>0.312</td>\n",
" <td>5500</td>\n",
" <td>auto</td>\n",
" <td>10</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>75000</td>\n",
" <td>0.111</td>\n",
" <td>2000</td>\n",
" <td>auto</td>\n",
" <td>3</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>24000</td>\n",
" <td>0.400</td>\n",
" <td>500</td>\n",
" <td>other</td>\n",
" <td>5</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" annual_income debt_to_income loan_amount purpose grade repaid\n",
"0 120000 0.100 3500 medical 1 True\n",
"1 130000 0.500 13800 medical 5 False\n",
"2 220000 0.400 33500 medical 3 False\n",
"3 65000 0.250 2000 refinance 3 False\n",
"4 60000 0.200 2200 refinance 3 True\n",
"5 45000 0.312 5500 auto 10 True\n",
"6 75000 0.111 2000 auto 3 True\n",
"7 24000 0.400 500 other 5 False"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def custom_grade(grade):\n",
" encoding = {'A': 1, 'B': 3, 'C': 5}\n",
" return encoding.get(grade, 10)\n",
"\n",
"encoder_grade = ce.OrdinalEncoder(mapping=[{'col': 'grade', 'mapping': custom_grade}], return_df=True)\n",
"encoder_grade.fit_transform(df_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is particularly useful if neither the default nor the lexicographic ordering matches your intended ordering (e.g. sorting `A+`, `A`, `A-` as strings gives `A`, `A+`, `A-`, which is not the order you would typically want)."
]
},
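{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the `A+` / `A` / `A-` point concrete, here is a small pandas-only sketch. The grade list `grade_order` and the helper dictionary `grade_to_rank` are hypothetical names for this example; the idea is just to write the intended order down once and derive the ranks from it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical grades, listed from best to worst.\n",
"grade_order = ['A+', 'A', 'A-', 'B+', 'B']\n",
"grade_to_rank = {grade: rank for rank, grade in enumerate(grade_order, start=1)}\n",
"\n",
"grades = pd.Series(['A', 'B+', 'A+', 'A-'])\n",
"\n",
"# Plain string sorting puts 'A' before 'A+', which is not the intended order.\n",
"print(sorted(grades))                      # ['A', 'A+', 'A-', 'B+']\n",
"\n",
"# The explicit mapping gives the ranks we actually want.\n",
"print(grades.map(grade_to_rank).tolist())  # [2, 4, 1, 3]"
]
},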
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## One Hot Encoder"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One hot encoding is used for non-ordered categories if there are only a few levels. \n",
"\n",
"In this case, `purpose` has only 4 different levels, as we can see with `value_counts`:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"medical 3\n",
"refinance 2\n",
"auto 2\n",
"other 1\n",
"Name: purpose, dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['purpose'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each level of the `purpose` feature gets its own column:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>annual_income</th>\n",
" <th>debt_to_income</th>\n",
" <th>loan_amount</th>\n",
" <th>purpose_medical</th>\n",
" <th>purpose_refinance</th>\n",
" <th>purpose_auto</th>\n",
" <th>purpose_other</th>\n",
" <th>grade</th>\n",
" <th>repaid</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>120000</td>\n",
" <td>0.100</td>\n",
" <td>3500</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>A</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>130000</td>\n",
" <td>0.500</td>\n",
" <td>13800</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>C</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>220000</td>\n",
" <td>0.400</td>\n",
" <td>33500</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>B</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>65000</td>\n",
" <td>0.250</td>\n",
" <td>2000</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>B</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>60000</td>\n",
" <td>0.200</td>\n",
" <td>2200</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>B</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>45000</td>\n",
" <td>0.312</td>\n",
" <td>5500</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>D</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>75000</td>\n",
" <td>0.111</td>\n",
" <td>2000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>B</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>24000</td>\n",
" <td>0.400</td>\n",
" <td>500</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>C</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" annual_income debt_to_income loan_amount purpose_medical \\\n",
"0 120000 0.100 3500 1 \n",
"1 130000 0.500 13800 1 \n",
"2 220000 0.400 33500 1 \n",
"3 65000 0.250 2000 0 \n",
"4 60000 0.200 2200 0 \n",
"5 45000 0.312 5500 0 \n",
"6 75000 0.111 2000 0 \n",
"7 24000 0.400 500 0 \n",
"\n",
" purpose_refinance purpose_auto purpose_other grade repaid \n",
"0 0 0 0 A True \n",
"1 0 0 0 C False \n",
"2 0 0 0 B False \n",
"3 1 0 0 B False \n",
"4 1 0 0 B True \n",
"5 0 1 0 D True \n",
"6 0 1 0 B True \n",
"7 0 0 1 C False "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"encoder_purpose = ce.OneHotEncoder(cols='purpose', use_cat_names=True, return_df=True)\n",
"encoder_purpose.fit_transform(df_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rows with an unknown level get zeros in all of the encoded columns, as we can see by setting `purpose` in the first row to `\"tuition\"`:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>annual_income</th>\n",
" <th>debt_to_income</th>\n",
" <th>loan_amount</th>\n",
" <th>purpose_medical</th>\n",
" <th>purpose_refinance</th>\n",
" <th>purpose_auto</th>\n",
" <th>purpose_other</th>\n",
" <th>grade</th>\n",
" <th>repaid</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>120000</td>\n",
" <td>0.100</td>\n",
" <td>3500</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>A</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>130000</td>\n",
" <td>0.500</td>\n",
" <td>13800</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>C</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>220000</td>\n",
" <td>0.400</td>\n",
" <td>33500</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>B</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>65000</td>\n",
" <td>0.250</td>\n",
" <td>2000</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>B</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>60000</td>\n",
" <td>0.200</td>\n",
" <td>2200</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>B</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>45000</td>\n",
" <td>0.312</td>\n",
" <td>5500</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>D</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>75000</td>\n",
" <td>0.111</td>\n",
" <td>2000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>B</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>24000</td>\n",
" <td>0.400</td>\n",
" <td>500</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>C</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" annual_income debt_to_income loan_amount purpose_medical \\\n",
"0 120000 0.100 3500 0 \n",
"1 130000 0.500 13800 1 \n",
"2 220000 0.400 33500 1 \n",
"3 65000 0.250 2000 0 \n",
"4 60000 0.200 2200 0 \n",
"5 45000 0.312 5500 0 \n",
"6 75000 0.111 2000 0 \n",
"7 24000 0.400 500 0 \n",
"\n",
" purpose_refinance purpose_auto purpose_other grade repaid \n",
"0 0 0 0 A True \n",
"1 0 0 0 C False \n",
"2 0 0 0 B False \n",
"3 1 0 0 B False \n",
"4 1 0 0 B True \n",
"5 0 1 0 D True \n",
"6 0 1 0 B True \n",
"7 0 0 1 C False "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_test = df_train.copy()\n",
"df_test.loc[0, 'purpose'] = \"tuition\"\n",
"\n",
"encoder_purpose.transform(df_test)"
]
},
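{
"cell_type": "markdown",
"metadata": {},
"source": [
"The all-zeros behaviour for unknown levels can be imitated with plain pandas. This is only a sketch of the idea, not how `category_encoders` does it internally: we remember the dummy columns seen in training (the variable `train_columns` is a name chosen for this example) and reindex the test dummies onto them. Note that `pd.get_dummies` orders the columns alphabetically, unlike the output above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"train = pd.Series(['medical', 'refinance', 'auto', 'other'], name='purpose')\n",
"test = pd.Series(['tuition', 'medical'], name='purpose')\n",
"\n",
"# Remember which dummy columns exist after seeing the training data.\n",
"train_columns = pd.get_dummies(train, prefix='purpose').columns\n",
"\n",
"# Encode the test data, keep only the training columns, and fill gaps with 0.\n",
"encoded = (pd.get_dummies(test, prefix='purpose')\n",
"           .reindex(columns=train_columns, fill_value=0)\n",
"           .astype(int))\n",
"\n",
"print(list(encoded.columns))\n",
"# ['purpose_auto', 'purpose_medical', 'purpose_other', 'purpose_refinance']\n",
"print(encoded.values.tolist())  # [[0, 0, 0, 0], [0, 1, 0, 0]]"
]
},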
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Target Encoder"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`TargetEncoder` uses the average value of the target within each level as the encoded value for that level. In our case, the target is binary (`repaid`), so the average of a level is the fraction of that level that repaid.\n",
"\n",
"By default, it smooths the value between the overall average and the average of the group. This helps prevent overfitting by giving more weight to the overall average when we only have a few examples in that level. You should keep this smoothing in actual problems, but we will turn it off here (`smoothing=0.0`) as it makes it easier to see what the `TargetEncoder` is doing.\n",
"\n",
"First, let's show what fraction of each `purpose` ended up repaying their loan:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"purpose\n",
"auto 1.000000\n",
"medical 0.333333\n",
"other 0.000000\n",
"refinance 0.500000\n",
"Name: repaid, dtype: float64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train.groupby(['purpose'])['repaid'].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is the original dataframe again, so that the fractions above are easy to verify:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>annual_income</th>\n",
" <th>debt_to_income</th>\n",
" <th>loan_amount</th>\n",
" <th>purpose</th>\n",
" <th>grade</th>\n",
" <th>repaid</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>120000</td>\n",
" <td>0.100</td>\n",
" <td>3500</td>\n",
" <td>medical</td>\n",
" <td>A</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>130000</td>\n",
" <td>0.500</td>\n",
" <td>13800</td>\n",
" <td>medical</td>\n",
" <td>C</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>220000</td>\n",
" <td>0.400</td>\n",
" <td>33500</td>\n",
" <td>medical</td>\n",
" <td>B</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>65000</td>\n",
" <td>0.250</td>\n",
" <td>2000</td>\n",
" <td>refinance</td>\n",
" <td>B</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>60000</td>\n",
" <td>0.200</td>\n",
" <td>2200</td>\n",
" <td>refinance</td>\n",
" <td>B</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>45000</td>\n",
" <td>0.312</td>\n",
" <td>5500</td>\n",
" <td>auto</td>\n",
" <td>D</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>75000</td>\n",
" <td>0.111</td>\n",
" <td>2000</td>\n",
" <td>auto</td>\n",
" <td>B</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>24000</td>\n",
" <td>0.400</td>\n",
" <td>500</td>\n",
" <td>other</td>\n",
" <td>C</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" annual_income debt_to_income loan_amount purpose grade repaid\n",
"0 120000 0.100 3500 medical A True\n",
"1 130000 0.500 13800 medical C False\n",
"2 220000 0.400 33500 medical B False\n",
"3 65000 0.250 2000 refinance B False\n",
"4 60000 0.200 2200 refinance B True\n",
"5 45000 0.312 5500 auto D True\n",
"6 75000 0.111 2000 auto B True\n",
"7 24000 0.400 500 other C False"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now look at the encoding:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>annual_income</th>\n",
" <th>debt_to_income</th>\n",
" <th>loan_amount</th>\n",
" <th>purpose</th>\n",
" <th>grade</th>\n",
" <th>repaid</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>120000</td>\n",
" <td>0.100</td>\n",
" <td>3500</td>\n",
" <td>0.333333</td>\n",
" <td>A</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>130000</td>\n",
" <td>0.500</td>\n",
" <td>13800</td>\n",
" <td>0.333333</td>\n",
" <td>C</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>220000</td>\n",
" <td>0.400</td>\n",
" <td>33500</td>\n",
" <td>0.333333</td>\n",
" <td>B</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>65000</td>\n",
" <td>0.250</td>\n",
" <td>2000</td>\n",
" <td>0.500000</td>\n",
" <td>B</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>60000</td>\n",
" <td>0.200</td>\n",
" <td>2200</td>\n",
" <td>0.500000</td>\n",
" <td>B</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>45000</td>\n",
" <td>0.312</td>\n",
" <td>5500</td>\n",
" <td>1.000000</td>\n",
" <td>D</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>75000</td>\n",
" <td>0.111</td>\n",
" <td>2000</td>\n",
" <td>1.000000</td>\n",
" <td>B</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>24000</td>\n",
" <td>0.400</td>\n",
" <td>500</td>\n",
" <td>0.500000</td>\n",
" <td>C</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" annual_income debt_to_income loan_amount purpose grade repaid\n",
"0 120000 0.100 3500 0.333333 A True\n",
"1 130000 0.500 13800 0.333333 C False\n",
"2 220000 0.400 33500 0.333333 B False\n",
"3 65000 0.250 2000 0.500000 B False\n",
"4 60000 0.200 2200 0.500000 B True\n",
"5 45000 0.312 5500 1.000000 D True\n",
"6 75000 0.111 2000 1.000000 B True\n",
"7 24000 0.400 500 0.500000 C False"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"encoder_purpose = ce.TargetEncoder(cols='purpose', smoothing=0.0, return_df=True)\n",
"encoder_purpose.fit_transform(df_train, df_train.repaid)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the exception of the \"other\" category, each `purpose` level was replaced with the average repayment rate for that purpose. If a level has only one example (as \"other\" does) or is new, it is replaced with the overall average repayment rate (`0.5` here)."
]
},
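{
"cell_type": "markdown",
"metadata": {},
"source": [
"The numbers in the encoded column can be reproduced with a few lines of pandas. This is a sketch under the assumptions described above (no smoothing, and levels with a single example fall back to the overall mean); it is not the library's code, and the names `level_stats` and `level_value` are made up for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# The `purpose` and `repaid` columns from the dataframe above (repaid as 1.0 / 0.0).\n",
"purpose = pd.Series(['medical', 'medical', 'medical', 'refinance',\n",
"                     'refinance', 'auto', 'auto', 'other'])\n",
"repaid = pd.Series([1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0])\n",
"\n",
"overall_mean = repaid.mean()  # 4 of 8 loans were repaid\n",
"\n",
"# Mean target and number of rows for each level.\n",
"level_stats = repaid.groupby(purpose).agg(['mean', 'count'])\n",
"\n",
"# Assumption: a level seen only once falls back to the overall mean.\n",
"level_value = level_stats['mean'].where(level_stats['count'] > 1, overall_mean)\n",
"\n",
"# Levels never seen at all would also get the overall mean.\n",
"encoded = purpose.map(level_value).fillna(overall_mean)\n",
"print(encoded.round(3).tolist())\n",
"# [0.333, 0.333, 0.333, 0.5, 0.5, 1.0, 1.0, 0.5]"
]
},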
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Warning:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When using the target encoder, you are using the values of the output. When you are doing cross-validation, it is critical that you fit the encoder within each fold, rather than encoding everything and then doing cross-validation. Otherwise your cross-validation will \"know\" about the hold-out set, making your cross-validation scores higher than they will be on the test set (and on new data)."
]
},
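{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to picture encoding within each fold is the following pandas sketch. It assumes scikit-learn's `KFold` for the splits and uses unsmoothed group means with a fall-back to the fold's overall mean; the variable names are chosen for this example. The key point is that each row is encoded using target values from the *other* folds only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import KFold\n",
"\n",
"# Toy copy of the `purpose` and `repaid` columns used in this notebook.\n",
"purpose = pd.Series(['medical', 'medical', 'medical', 'refinance',\n",
"                     'refinance', 'auto', 'auto', 'other'])\n",
"repaid = pd.Series([1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0])\n",
"\n",
"encoded = pd.Series(index=purpose.index, dtype=float)\n",
"splitter = KFold(n_splits=4, shuffle=True, random_state=0)\n",
"\n",
"for fit_idx, holdout_idx in splitter.split(purpose):\n",
"    # Learn the level means from the fitting rows only.\n",
"    fit_means = repaid.iloc[fit_idx].groupby(purpose.iloc[fit_idx]).mean()\n",
"    fallback = repaid.iloc[fit_idx].mean()\n",
"\n",
"    # Encode the held-out rows; levels missing from the fitting rows get the fallback.\n",
"    holdout_values = purpose.iloc[holdout_idx].map(fit_means).fillna(fallback)\n",
"    encoded.iloc[holdout_idx] = holdout_values.to_numpy()\n",
"\n",
"print(pd.DataFrame({'purpose': purpose, 'repaid': repaid, 'encoded': encoded}))"
]
},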
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Hash Encoder"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The hash encoder maps each feature value to `n_components` binary columns. Because it doesn't memorize the levels during training, it can be good if you have a **lot** of categories. The function can also translate new (unseen) levels at test time.\n",
"\n",
"It helps with tree-based models, because roughly half the levels will have a 0 or 1 in each column, so if there are relationships between levels the hope is that some of the columns will have common values for the related levels.\n",
"\n",
"Drawbacks:\n",
"\n",
"* It is hard to get interpretable results from a `HashingEncoder` column\n",
"* If you choose a small number of components, or are unlucky, you can get _collisions_ where distinct levels get mapped to the same encoding. Below we see that `medical` and `refinance` are both mapped to `(0, 0, 1)`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col_0</th>\n",
" <th>col_1</th>\n",
" <th>col_2</th>\n",
" <th>annual_income</th>\n",
" <th>debt_to_income</th>\n",
" <th>loan_amount</th>\n",
" <th>grade</th>\n",
" <th>repaid</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>120000</td>\n",
" <td>0.100</td>\n",
" <td>3500</td>\n",
" <td>A</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>130000</td>\n",
" <td>0.500</td>\n",
" <td>13800</td>\n",
" <td>C</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>220000</td>\n",
" <td>0.400</td>\n",
" <td>33500</td>\n",
" <td>B</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>65000</td>\n",
" <td>0.250</td>\n",
" <td>2000</td>\n",
" <td>B</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>60000</td>\n",
" <td>0.200</td>\n",
" <td>2200</td>\n",
" <td>B</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>45000</td>\n",
" <td>0.312</td>\n",
" <td>5500</td>\n",
" <td>D</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>75000</td>\n",
" <td>0.111</td>\n",
" <td>2000</td>\n",
" <td>B</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>24000</td>\n",
" <td>0.400</td>\n",
" <td>500</td>\n",
" <td>C</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col_0 col_1 col_2 annual_income debt_to_income loan_amount grade \\\n",
"0 0 0 1 120000 0.100 3500 A \n",
"1 0 0 1 130000 0.500 13800 C \n",
"2 0 0 1 220000 0.400 33500 B \n",
"3 0 0 1 65000 0.250 2000 B \n",
"4 0 0 1 60000 0.200 2200 B \n",
"5 1 0 0 45000 0.312 5500 D \n",
"6 1 0 0 75000 0.111 2000 B \n",
"7 0 1 0 24000 0.400 500 C \n",
"\n",
" repaid \n",
"0 True \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 True \n",
"5 True \n",
"6 True \n",
"7 False "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"encoder_purpose = ce.HashingEncoder(n_components=3, cols=['purpose'])\n",
"encoder_purpose.fit_transform(df_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Encoding multiple columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's encode \n",
"\n",
"* `grades` using `OneHotEncoder` (usually you would use \"OrdinalEncoder\")\n",
"* `purpose` using `TargetEncoder`\n",
"\n",
"We will do it in two steps, then use a pipeline, to ensure that we are able to do cross-validation correctly:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Two steps\n",
"\n",
"First, let's do it _incorrectly_:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"encoder_grade = ce.OneHotEncoder(cols=['grade'], return_df=True).fit(df_train)\n",
"encoder_purpose = ce.TargetEncoder(cols=['purpose'], return_df=True).fit(df_train, df_train['repaid'])"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>annual_income</th>\n",
" <th>debt_to_income</th>\n",
" <th>loan_amount</th>\n",
" <th>purpose</th>\n",
" <th>grade_1</th>\n",
" <th>grade_2</th>\n",
" <th>grade_3</th>\n",
" <th>grade_4</th>\n",
" <th>repaid</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>120000</td>\n",
" <td>0.100</td>\n",
" <td>3500</td>\n",
" <td>medical</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>130000</td>\n",
" <td>0.500</td>\n",
" <td>13800</td>\n",
" <td>medical</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>220000</td>\n",
" <td>0.400</td>\n",
" <td>33500</td>\n",
" <td>medical</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>65000</td>\n",
" <td>0.250</td>\n",
" <td>2000</td>\n",
" <td>refinance</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>60000</td>\n",
" <td>0.200</td>\n",
" <td>2200</td>\n",
" <td>refinance</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>45000</td>\n",
" <td>0.312</td>\n",
" <td>5500</td>\n",
" <td>auto</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>75000</td>\n",
" <td>0.111</td>\n",
" <td>2000</td>\n",
" <td>auto</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>24000</td>\n",
" <td>0.400</td>\n",
" <td>500</td>\n",
" <td>other</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" annual_income debt_to_income loan_amount purpose grade_1 grade_2 \\\n",
"0 120000 0.100 3500 medical 1 0 \n",
"1 130000 0.500 13800 medical 0 1 \n",
"2 220000 0.400 33500 medical 0 0 \n",
"3 65000 0.250 2000 refinance 0 0 \n",
"4 60000 0.200 2200 refinance 0 0 \n",
"5 45000 0.312 5500 auto 0 0 \n",
"6 75000 0.111 2000 auto 0 0 \n",
"7 24000 0.400 500 other 0 1 \n",
"\n",
" grade_3 grade_4 repaid \n",
"0 0 0 True \n",
"1 0 0 False \n",
"2 1 0 False \n",
"3 1 0 False \n",
"4 1 0 True \n",
"5 0 1 True \n",
"6 1 0 True \n",
"7 0 0 False "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Can we encode?\n",
"df_train_grade_encoded = encoder_grade.transform(df_train)\n",
"df_train_grade_encoded"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's encode pupose of this dataframe...."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "Unexpected input dimension 9, expected 6",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-15-f82b0015ffa7>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf_train_all\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mencoder_purpose\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtransform\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf_train_grade_encoded\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/anaconda3/lib/python3.6/site-packages/category_encoders/target_encoder.py\u001b[0m in \u001b[0;36mtransform\u001b[0;34m(self, X, y, override_return_df)\u001b[0m\n\u001b[1;32m 214\u001b[0m \u001b[0;31m# then make sure that it is the right size\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 215\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m!=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_dim\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 216\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Unexpected input dimension %d, expected %d'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_dim\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 217\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 218\u001b[0m \u001b[0;31m# if we are encoding the training data, we have to check the target\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mValueError\u001b[0m: Unexpected input dimension 9, expected 6"
]
}
],
"source": [
"df_train_all = encoder_purpose.transform(df_train_grade_encoded)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What happened?\n",
"\n",
"Our `encoder_purpose` was trained on `df_train`, which had only 6 columns. Here we asked it to transform _after_ the one hot encoder had expanded to 9 columns! If we reverse the order, however, we are fine:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>annual_income</th>\n",
" <th>debt_to_income</th>\n",
" <th>loan_amount</th>\n",
" <th>purpose</th>\n",
" <th>grade</th>\n",
" <th>repaid</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>120000</td>\n",
" <td>0.100</td>\n",
" <td>3500</td>\n",
" <td>0.353200</td>\n",
" <td>A</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>130000</td>\n",
" <td>0.500</td>\n",
" <td>13800</td>\n",
" <td>0.353200</td>\n",
" <td>C</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>220000</td>\n",
" <td>0.400</td>\n",
" <td>33500</td>\n",
" <td>0.353200</td>\n",
" <td>B</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>65000</td>\n",
" <td>0.250</td>\n",
" <td>2000</td>\n",
" <td>0.500000</td>\n",
" <td>B</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>60000</td>\n",
" <td>0.200</td>\n",
" <td>2200</td>\n",
" <td>0.500000</td>\n",
" <td>B</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>45000</td>\n",
" <td>0.312</td>\n",
" <td>5500</td>\n",
" <td>0.865529</td>\n",
" <td>D</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>75000</td>\n",
" <td>0.111</td>\n",
" <td>2000</td>\n",
" <td>0.865529</td>\n",
" <td>B</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>24000</td>\n",
" <td>0.400</td>\n",
" <td>500</td>\n",
" <td>0.500000</td>\n",
" <td>C</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" annual_income debt_to_income loan_amount purpose grade repaid\n",
"0 120000 0.100 3500 0.353200 A True\n",
"1 130000 0.500 13800 0.353200 C False\n",
"2 220000 0.400 33500 0.353200 B False\n",
"3 65000 0.250 2000 0.500000 B False\n",
"4 60000 0.200 2200 0.500000 B True\n",
"5 45000 0.312 5500 0.865529 D True\n",
"6 75000 0.111 2000 0.865529 B True\n",
"7 24000 0.400 500 0.500000 C False"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This encoding doesn't change the number of columns\n",
"df_train_purpose_encoded = encoder_purpose.transform(df_train)\n",
"df_train_purpose_encoded"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
".... so we _can_ pass this along to one hot encoding:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>annual_income</th>\n",
" <th>debt_to_income</th>\n",
" <th>loan_amount</th>\n",
" <th>purpose</th>\n",
" <th>grade_1</th>\n",
" <th>grade_2</th>\n",
" <th>grade_3</th>\n",
" <th>grade_4</th>\n",
" <th>repaid</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>120000</td>\n",
" <td>0.100</td>\n",
" <td>3500</td>\n",
" <td>0.353200</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>130000</td>\n",
" <td>0.500</td>\n",
" <td>13800</td>\n",
" <td>0.353200</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>220000</td>\n",
" <td>0.400</td>\n",
" <td>33500</td>\n",
" <td>0.353200</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>65000</td>\n",
" <td>0.250</td>\n",
" <td>2000</td>\n",
" <td>0.500000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>60000</td>\n",
" <td>0.200</td>\n",
" <td>2200</td>\n",
" <td>0.500000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>45000</td>\n",
" <td>0.312</td>\n",
" <td>5500</td>\n",
" <td>0.865529</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>75000</td>\n",
" <td>0.111</td>\n",
" <td>2000</td>\n",
" <td>0.865529</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>24000</td>\n",
" <td>0.400</td>\n",
" <td>500</td>\n",
" <td>0.500000</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" annual_income debt_to_income loan_amount purpose grade_1 grade_2 \\\n",
"0 120000 0.100 3500 0.353200 1 0 \n",
"1 130000 0.500 13800 0.353200 0 1 \n",
"2 220000 0.400 33500 0.353200 0 0 \n",
"3 65000 0.250 2000 0.500000 0 0 \n",
"4 60000 0.200 2200 0.500000 0 0 \n",
"5 45000 0.312 5500 0.865529 0 0 \n",
"6 75000 0.111 2000 0.865529 0 0 \n",
"7 24000 0.400 500 0.500000 0 1 \n",
"\n",
" grade_3 grade_4 repaid \n",
"0 0 0 True \n",
"1 0 0 False \n",
"2 1 0 False \n",
"3 1 0 False \n",
"4 1 0 True \n",
"5 0 1 True \n",
"6 1 0 True \n",
"7 0 0 False "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train_all_encoded = encoder_grade.transform(df_train_purpose_encoded)\n",
"df_train_all_encoded"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A better way: pipelines!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That was really annoying! We would hope to have a better way, and there is -- use a pipeline! The pipeline trains all at once, with each step trained on the output of the previous step. Therefore we don't need to keep track of which step we do first. They also work nicely with `GridSearch` and `cross_val_score` as we do the encoding on each set of training folds, so we know there is no data leakage into the validation set.\n",
"\n",
"Let's do an example with the OneHotEncoder first:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(memory=None,\n",
" steps=[('encode_grade', OneHotEncoder(cols=['grade'], drop_invariant=False, handle_missing='value',\n",
" handle_unknown='value', return_df=True, use_cat_names=False,\n",
" verbose=0)), ('encode_purpose', TargetEncoder(cols=['purpose'], drop_invariant=False, handle_missing='value',\n",
" handle_unknown='value', min_samples_leaf=1, return_df=True,\n",
" smoothing=1.0, verbose=0))])"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.pipeline import Pipeline\n",
"\n",
"# We can put these in either order, the second one\n",
"# fits/transforms on the output of the first!\n",
"encoding_pipeline = Pipeline([\n",
" ('encode_grade', encoder_grade),\n",
" ('encode_purpose', encoder_purpose)\n",
"])\n",
"\n",
"encoding_pipeline.fit(df_train, df_train['repaid'])"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>annual_income</th>\n",
" <th>debt_to_income</th>\n",
" <th>loan_amount</th>\n",
" <th>purpose</th>\n",
" <th>grade_1</th>\n",
" <th>grade_2</th>\n",
" <th>grade_3</th>\n",
" <th>grade_4</th>\n",
" <th>repaid</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>120000</td>\n",
" <td>0.100</td>\n",
" <td>3500</td>\n",
" <td>0.353200</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>130000</td>\n",
" <td>0.500</td>\n",
" <td>13800</td>\n",
" <td>0.353200</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>220000</td>\n",
" <td>0.400</td>\n",
" <td>33500</td>\n",
" <td>0.353200</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>65000</td>\n",
" <td>0.250</td>\n",
" <td>2000</td>\n",
" <td>0.500000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>60000</td>\n",
" <td>0.200</td>\n",
" <td>2200</td>\n",
" <td>0.500000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>45000</td>\n",
" <td>0.312</td>\n",
" <td>5500</td>\n",
" <td>0.865529</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>75000</td>\n",
" <td>0.111</td>\n",
" <td>2000</td>\n",
" <td>0.865529</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>24000</td>\n",
" <td>0.400</td>\n",
" <td>500</td>\n",
" <td>0.500000</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" annual_income debt_to_income loan_amount purpose grade_1 grade_2 \\\n",
"0 120000 0.100 3500 0.353200 1 0 \n",
"1 130000 0.500 13800 0.353200 0 1 \n",
"2 220000 0.400 33500 0.353200 0 0 \n",
"3 65000 0.250 2000 0.500000 0 0 \n",
"4 60000 0.200 2200 0.500000 0 0 \n",
"5 45000 0.312 5500 0.865529 0 0 \n",
"6 75000 0.111 2000 0.865529 0 0 \n",
"7 24000 0.400 500 0.500000 0 1 \n",
"\n",
" grade_3 grade_4 repaid \n",
"0 0 0 True \n",
"1 0 0 False \n",
"2 1 0 False \n",
"3 1 0 False \n",
"4 1 0 True \n",
"5 0 1 True \n",
"6 1 0 True \n",
"7 0 0 False "
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Note we don't pass in the target values!\n",
"encoding_pipeline.transform(df_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more information on pipelines, see the article [\"An introduction to pipelines\"](https://kiwidamien.github.io/introduction-to-pipelines.html)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@marcelkore

Your notebooks/posts are so easy to follow! I have spent the better part of this afternoon reviewing your posts. Thanks for sharing!

@kiwidamien
Author

@marcelkore thanks for taking the time to drop a comment -- it helps to know I am not talking to myself and someone finds this useful! =)

@HaeHwan

HaeHwan commented Mar 4, 2020

I had a problem with the HashingEncoder, and it seems the problem may also happen with your code, since I used exactly the same code.
Would you mind visiting my GitHub to see the problem?
Here is the URL: https://github.com/HaeHwan/hello-world/blob/master/Hashing(2).ipynb

The main problem is that the HashingEncoder doesn't change the columns at all, as you can see at the URL above.

Thanks.

@kiwidamien
Author

Hi @HaeHwan

That's strange .... I cannot duplicate your error. If you try running the following file in the terminal, what do you get?

import pandas as pd
import category_encoders as ce

print(f"""
  Version check:
  --------------
      Pandas version:            {pd.__version__}
      Category Encoders version: {ce.__version__}
""")

df_train = pd.read_csv('https://raw.githubusercontent.com/kiwidamien/StackedTurtles/master/content/preprocessing/simple_loan_example.csv')

encoder_purpose = ce.HashingEncoder(n_components=3, cols=['purpose'])
df_transform = encoder_purpose.fit_transform(df_train)

print(df_transform)

For reference, my output is

  Version check:
  --------------
      Pandas version:            0.24.2
      Category Encoders version: 2.1.0

   col_0  col_1  col_2  annual_income  debt_to_income  loan_amount grade  repaid
0      0      0      1         120000           0.100         3500     A    True
1      0      0      1         130000           0.500        13800     C   False
2      0      0      1         220000           0.400        33500     B   False
3      0      0      1          65000           0.250         2000     B   False
4      0      0      1          60000           0.200         2200     B    True
5      1      0      0          45000           0.312         5500     D    True
6      1      0      0          75000           0.111         2000     B    True
7      0      1      0          24000           0.400          500     C   False

@HaeHwan

HaeHwan commented Mar 5, 2020

Oh, I finally solved it! Maybe the problem was the number of processes on my laptop. I set "max_process = 1" and now it works. Thank you for your kindness.

@ThisIsVenkatesh

Hi @kiwidamien, Thanks for sharing this. It helps me a lot. I'm unable to open the link to "An introduction to pipelines". Can you please look into this?