XGBoost with Python and Scikit-Learn
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# XGBoost with Python and Scikit-Learn \n",
"\n",
"\n",
"**XGBoost** is an acronym for **Extreme Gradient Boosting**. It is a powerful machine learning algorithm that can be used to solve classification and regression problems. In this project, I implement XGBoost with Python and Scikit-Learn to solve a classification problem. The problem is to classify the customers from two different channels as Horeca (Hotel/Retail/Café) customers or Retail channel (nominal) customers.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"\n",
"\n",
"1.\tIntroduction to XGBoost algorithm\n",
"2.\tXGBoost algorithm intuition\n",
"3.\tThe problem statement\n",
"4.\tDataset description\n",
"5.\tImport libraries\n",
"6.\tImport dataset\n",
"7.\tExploratory data analysis\n",
"8.\tDeclare feature vector and target variable\n",
"9.\tSplit data into separate training and test set\n",
"10.\tTrain the XGBoost classifier\n",
"11.\tMake predictions with XGBoost classifier\n",
"12.\tCheck accuracy score\n",
"13.\tk-fold Cross Validation using XGBoost\n",
"14.\tFeature importance with XGBoost\n",
"15.\tResults and conclusion\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Introduction to XGBoost algorithm\n",
"\n",
"\n",
"**XGBoost** stands for **Extreme Gradient Boosting**. XGBoost is a powerful machine learning algorithm that is dominating the world of applied machine learning and Kaggle competitions. It is an implementation of gradient boosted trees designed for speed and accuracy.\n",
"\n",
"\n",
"**XGBoost (Extreme Gradient Boosting)** is an advanced implementation of the gradient boosting algorithm. It has proved to be a highly effective machine learning algorithm extensively used in machine learning competitions. XGBoost has high predictive power and is almost 10 times faster than other gradient boosting techniques. It also includes a variety of regularization parameters which reduces overfitting and improves overall performance. Hence, it is also known as **regularized boosting** technique.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. XGBoost algorithm intuition\n",
"\n",
"\n",
"XGBoost (Extreme Gradient Boosting) belongs to a family of boosting algorithms. It uses the gradient boosting (GBM) framework at its core. So, first of all we should know about gradient boosting.\n",
"\n",
"\n",
"### Gradient boosting\n",
"\n",
"Gradient boosting is a supervised machine learning algorithm, which tries to predict a target variable by combining the estimates of a set of simpler, weaker models. In boosting, the trees are built in a sequential manner such that each subsequent tree aims to reduce the errors of the previous tree. The misclassified labels are given higher weights. Each tree learns from its predecessors and tries to reduce the residual errors. So, the tree next in sequence will learn from the previous tree residuals.\n",
"\n",
"\n",
"### XGBoost\n",
"\n",
"In XGBoost, we try to fit a model on the gradient of the loss function generated from the previous step. So, in XGBoost we modified our gradient boosting algorithm so that it works with any differentiable loss function.\n"
]
},
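{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea of fitting each new tree to the residuals of the current ensemble can be sketched in plain Python. This is only a toy illustration under simplifying assumptions: squared-error loss, a single feature, and depth-1 trees (stumps). The helper names `fit_stump`, `predict_stump` and `boost`, and the small data set, are made up for this sketch and are not part of XGBoost.\n",
"\n",
"```python\n",
"# Toy gradient boosting for squared-error regression with depth-1 trees (stumps).\n",
"# Illustrative only; this is not how XGBoost is implemented internally.\n",
"\n",
"def fit_stump(xs, residuals):\n",
"    # Try every threshold and keep the one with the lowest squared error.\n",
"    best = None\n",
"    for threshold in sorted(set(xs))[:-1]:\n",
"        left = [r for x, r in zip(xs, residuals) if x <= threshold]\n",
"        right = [r for x, r in zip(xs, residuals) if x > threshold]\n",
"        left_mean = sum(left) / len(left)\n",
"        right_mean = sum(right) / len(right)\n",
"        sse = (sum((r - left_mean) ** 2 for r in left)\n",
"               + sum((r - right_mean) ** 2 for r in right))\n",
"        if best is None or sse < best[0]:\n",
"            best = (sse, threshold, left_mean, right_mean)\n",
"    return best[1:]  # (threshold, left_value, right_value)\n",
"\n",
"\n",
"def predict_stump(stump, x):\n",
"    threshold, left_value, right_value = stump\n",
"    return left_value if x <= threshold else right_value\n",
"\n",
"\n",
"def boost(xs, ys, n_rounds=20, learning_rate=0.3):\n",
"    base = sum(ys) / len(ys)  # initial prediction: the mean of the targets\n",
"    preds = [base] * len(ys)\n",
"    stumps, history = [], []\n",
"    for _ in range(n_rounds):\n",
"        # For squared error, the negative gradient is simply the residual.\n",
"        residuals = [y - p for y, p in zip(ys, preds)]\n",
"        stump = fit_stump(xs, residuals)\n",
"        stumps.append(stump)\n",
"        preds = [p + learning_rate * predict_stump(stump, x)\n",
"                 for p, x in zip(preds, xs)]\n",
"        # Record the training mean squared error after this round.\n",
"        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))\n",
"    return base, stumps, history\n",
"\n",
"\n",
"xs = [1, 2, 3, 4, 5, 6, 7, 8]\n",
"ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]\n",
"base, stumps, history = boost(xs, ys)\n",
"print('MSE after round 1: {:.4f}'.format(history[0]))\n",
"print('MSE after round 20: {:.4f}'.format(history[-1]))\n",
"```\n",
"\n",
"The training error shrinks from round to round because every stump corrects part of what the previous rounds left unexplained. The `learning_rate` plays the same shrinkage role as the XGBoost parameter of the same name.\n"
]
},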
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. The problem statement\n",
"\n",
"In this project, I try to solve a classification problem. The problem is to classify the customers from two different channels as Horeca (Hotel/Retail/Café) customers or Retail channel (nominal) customers. I implement XGBoost with Python and Scikit-Learn to solve the classification problem. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Dataset description\n",
"\n",
"\n",
"I have used the `Wholesale customers data set` for this project, downloaded from the UCI Machine learning repository. \n",
"This dataset can be found at the following url-\n",
"\n",
"\n",
"https://archive.ics.uci.edu/ml/datasets/Wholesale+customers\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Import libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Import dataset"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Import dataset\n",
"\n",
"data = 'C:/datasets/Wholesale customers data.csv'\n",
"\n",
"df = pd.read_csv(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Exploratory Data Analysis\n",
"\n",
"\n",
"I will start off by checking the shape of the dataset."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(440, 8)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 440 instances and 8 attributes in the dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preview dataset"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Channel</th>\n",
" <th>Region</th>\n",
" <th>Fresh</th>\n",
" <th>Milk</th>\n",
" <th>Grocery</th>\n",
" <th>Frozen</th>\n",
" <th>Detergents_Paper</th>\n",
" <th>Delicassen</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>12669</td>\n",
" <td>9656</td>\n",
" <td>7561</td>\n",
" <td>214</td>\n",
" <td>2674</td>\n",
" <td>1338</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>7057</td>\n",
" <td>9810</td>\n",
" <td>9568</td>\n",
" <td>1762</td>\n",
" <td>3293</td>\n",
" <td>1776</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>6353</td>\n",
" <td>8808</td>\n",
" <td>7684</td>\n",
" <td>2405</td>\n",
" <td>3516</td>\n",
" <td>7844</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>13265</td>\n",
" <td>1196</td>\n",
" <td>4221</td>\n",
" <td>6404</td>\n",
" <td>507</td>\n",
" <td>1788</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>22615</td>\n",
" <td>5410</td>\n",
" <td>7198</td>\n",
" <td>3915</td>\n",
" <td>1777</td>\n",
" <td>5185</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen\n",
"0 2 3 12669 9656 7561 214 2674 1338\n",
"1 2 3 7057 9810 9568 1762 3293 1776\n",
"2 2 3 6353 8808 7684 2405 3516 7844\n",
"3 1 3 13265 1196 4221 6404 507 1788\n",
"4 2 3 22615 5410 7198 3915 1777 5185"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that `Channel` variable contains values as `1` and `2`. These two values classify the customers from two different channels as 1 for Horeca (Hotel/Retail/Café) customers and 2 for Retail channel (nominal) customers."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View summary of dataframe"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 440 entries, 0 to 439\n",
"Data columns (total 8 columns):\n",
"Channel 440 non-null int64\n",
"Region 440 non-null int64\n",
"Fresh 440 non-null int64\n",
"Milk 440 non-null int64\n",
"Grocery 440 non-null int64\n",
"Frozen 440 non-null int64\n",
"Detergents_Paper 440 non-null int64\n",
"Delicassen 440 non-null int64\n",
"dtypes: int64(8)\n",
"memory usage: 27.6 KB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are only numerical variables in the dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View summary statistics of dataframe"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Channel</th>\n",
" <th>Region</th>\n",
" <th>Fresh</th>\n",
" <th>Milk</th>\n",
" <th>Grocery</th>\n",
" <th>Frozen</th>\n",
" <th>Detergents_Paper</th>\n",
" <th>Delicassen</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>440.000000</td>\n",
" <td>440.000000</td>\n",
" <td>440.000000</td>\n",
" <td>440.000000</td>\n",
" <td>440.000000</td>\n",
" <td>440.000000</td>\n",
" <td>440.000000</td>\n",
" <td>440.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>1.322727</td>\n",
" <td>2.543182</td>\n",
" <td>12000.297727</td>\n",
" <td>5796.265909</td>\n",
" <td>7951.277273</td>\n",
" <td>3071.931818</td>\n",
" <td>2881.493182</td>\n",
" <td>1524.870455</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>0.468052</td>\n",
" <td>0.774272</td>\n",
" <td>12647.328865</td>\n",
" <td>7380.377175</td>\n",
" <td>9503.162829</td>\n",
" <td>4854.673333</td>\n",
" <td>4767.854448</td>\n",
" <td>2820.105937</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>3.000000</td>\n",
" <td>55.000000</td>\n",
" <td>3.000000</td>\n",
" <td>25.000000</td>\n",
" <td>3.000000</td>\n",
" <td>3.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>1.000000</td>\n",
" <td>2.000000</td>\n",
" <td>3127.750000</td>\n",
" <td>1533.000000</td>\n",
" <td>2153.000000</td>\n",
" <td>742.250000</td>\n",
" <td>256.750000</td>\n",
" <td>408.250000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>1.000000</td>\n",
" <td>3.000000</td>\n",
" <td>8504.000000</td>\n",
" <td>3627.000000</td>\n",
" <td>4755.500000</td>\n",
" <td>1526.000000</td>\n",
" <td>816.500000</td>\n",
" <td>965.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>2.000000</td>\n",
" <td>3.000000</td>\n",
" <td>16933.750000</td>\n",
" <td>7190.250000</td>\n",
" <td>10655.750000</td>\n",
" <td>3554.250000</td>\n",
" <td>3922.000000</td>\n",
" <td>1820.250000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>2.000000</td>\n",
" <td>3.000000</td>\n",
" <td>112151.000000</td>\n",
" <td>73498.000000</td>\n",
" <td>92780.000000</td>\n",
" <td>60869.000000</td>\n",
" <td>40827.000000</td>\n",
" <td>47943.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Channel Region Fresh Milk Grocery \\\n",
"count 440.000000 440.000000 440.000000 440.000000 440.000000 \n",
"mean 1.322727 2.543182 12000.297727 5796.265909 7951.277273 \n",
"std 0.468052 0.774272 12647.328865 7380.377175 9503.162829 \n",
"min 1.000000 1.000000 3.000000 55.000000 3.000000 \n",
"25% 1.000000 2.000000 3127.750000 1533.000000 2153.000000 \n",
"50% 1.000000 3.000000 8504.000000 3627.000000 4755.500000 \n",
"75% 2.000000 3.000000 16933.750000 7190.250000 10655.750000 \n",
"max 2.000000 3.000000 112151.000000 73498.000000 92780.000000 \n",
"\n",
" Frozen Detergents_Paper Delicassen \n",
"count 440.000000 440.000000 440.000000 \n",
"mean 3071.931818 2881.493182 1524.870455 \n",
"std 4854.673333 4767.854448 2820.105937 \n",
"min 25.000000 3.000000 3.000000 \n",
"25% 742.250000 256.750000 408.250000 \n",
"50% 1526.000000 816.500000 965.500000 \n",
"75% 3554.250000 3922.000000 1820.250000 \n",
"max 60869.000000 40827.000000 47943.000000 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check for missing values"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Channel 0\n",
"Region 0\n",
"Fresh 0\n",
"Milk 0\n",
"Grocery 0\n",
"Frozen 0\n",
"Detergents_Paper 0\n",
"Delicassen 0\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are no missing values in the dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Declare feature vector and target variable"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"X = df.drop('Channel', axis=1)\n",
"\n",
"y = df['Channel']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### let's take a look at feature vector(X) and target variable(y)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Region</th>\n",
" <th>Fresh</th>\n",
" <th>Milk</th>\n",
" <th>Grocery</th>\n",
" <th>Frozen</th>\n",
" <th>Detergents_Paper</th>\n",
" <th>Delicassen</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3</td>\n",
" <td>12669</td>\n",
" <td>9656</td>\n",
" <td>7561</td>\n",
" <td>214</td>\n",
" <td>2674</td>\n",
" <td>1338</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>7057</td>\n",
" <td>9810</td>\n",
" <td>9568</td>\n",
" <td>1762</td>\n",
" <td>3293</td>\n",
" <td>1776</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>6353</td>\n",
" <td>8808</td>\n",
" <td>7684</td>\n",
" <td>2405</td>\n",
" <td>3516</td>\n",
" <td>7844</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>13265</td>\n",
" <td>1196</td>\n",
" <td>4221</td>\n",
" <td>6404</td>\n",
" <td>507</td>\n",
" <td>1788</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3</td>\n",
" <td>22615</td>\n",
" <td>5410</td>\n",
" <td>7198</td>\n",
" <td>3915</td>\n",
" <td>1777</td>\n",
" <td>5185</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen\n",
"0 3 12669 9656 7561 214 2674 1338\n",
"1 3 7057 9810 9568 1762 3293 1776\n",
"2 3 6353 8808 7684 2405 3516 7844\n",
"3 3 13265 1196 4221 6404 507 1788\n",
"4 3 22615 5410 7198 3915 1777 5185"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 2\n",
"1 2\n",
"2 2\n",
"3 1\n",
"4 2\n",
"Name: Channel, dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the y labels contain values as 1 and 2. I will need to convert it into 0 and 1 for further analysis. I will do it as follows-"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# convert labels into binary values\n",
"\n",
"y[y == 2] = 0\n",
"\n",
"y[y == 1] = 1"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 0\n",
"2 0\n",
"3 1\n",
"4 0\n",
"Name: Channel, dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# again preview the y label\n",
"\n",
"y.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, I will convert the dataset into an optimized data structure called **Dmatrix** that XGBoost supports and gives it acclaimed performance and efficiency gains. I will do it as follows."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# import XGBoost\n",
"import xgboost as xgb\n",
"\n",
"\n",
"# define data_dmatrix\n",
"data_dmatrix = xgb.DMatrix(data=X,label=y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Split data into separate training and test set"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# split X and y into training and testing sets\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Train the XGBoost classifier\n",
"\n",
"\n",
"- Now, I will train the XGBoost classifier. We need to know different parameters that XGBoost provides. There are three types of parameters that we must set before running XGBoost. These parameters are as follows:-\n",
"\n",
"\n",
"### General parameters\n",
"\n",
"These parameters relate to which booster we are doing boosting. The common ones are tree or linear model.\n",
"\n",
"\n",
"### Booster parameters\n",
"\n",
"It depends on which booster we have chosen for boosting.\n",
"\n",
"\n",
"### Learning task parameters\n",
"\n",
"These parameters decide on the learning scenario. For example, regression tasks may use different parameters than ranking tasks. \n",
"\n",
"\n",
"### Command line parameters\n",
"\n",
"In addition there are command line parameters which relate to behaviour of CLI version of XGBoost.\n",
"\n",
"\n",
"The most important parameters that we should know about are as follows:-\n",
"\n",
"\n",
"**learning_rate** - It gives us the step size shrinkage which is used to prevent overfitting. Its range is [0,1].\n",
"\n",
"**max_depth** - It determines how deeply each tree is allowed to grow during any boosting round.\n",
"\n",
"**subsample** - It determines the percentage of samples used per tree. Low value of subsample can lead to underfitting.\n",
"\n",
"**colsample_bytree** - It determines the percentage of features used per tree. High value of it can lead to overfitting.\n",
"\n",
"**n_estimators** - It is the number of trees we want to build.\n",
"\n",
"**objective** - It determines the loss function to be used in the process. For example, `reg:linear` for regression problems, `reg:logistic` for classification problems with only decision, `binary:logistic` for classification problems with probability.\n",
"\n",
"\n",
"XGBoost also supports regularization parameters to penalize models as they become more complex and reduce them to simple models. These regularization parameters are as follows:-\n",
"\n",
"\n",
"**gamma** - It controls whether a given node will split based on the expected reduction in loss after the split. A higher value leads to fewer splits. It is supported only for tree-based learners.\n",
"\n",
"**alpha** - It gives us the `L1` regularization on leaf weights. A large value of it leads to more regularization.\n",
"\n",
"**lambda** - It gives us the `L2` regularization on leaf weights and is smoother than `L1` regularization.\n",
"\n",
"Though we are using trees as our base learners, we can also use XGBoost’s relatively less popular linear base learners and one other tree learner known as `dart`. We have to set the `booster` parameter to either `gbtree` (default), `gblinear` or `dart`.\n"
]
},
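{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the effect of `gamma`, `alpha` and `lambda` more concrete, the following plain-Python sketch computes the regularized leaf weight and the gain of a candidate split from the summed first derivatives (`G`) and second derivatives (`H`) of the loss in a node. The `lambda` and `gamma` terms follow the formulas in the XGBoost paper. The soft-thresholding used for `alpha` reflects my understanding of how L1 regularization is applied and should be treated as an assumption. All function names are hypothetical and are not part of the XGBoost API.\n",
"\n",
"```python\n",
"# Regularized leaf weight and split gain for a tree node.\n",
"# G and H are the sums of first and second derivatives of the loss over the\n",
"# samples in the node. Function names are illustrative, not XGBoost API.\n",
"\n",
"def soft_threshold(g, alpha):\n",
"    # L1 regularization shrinks the gradient sum towards zero.\n",
"    if g > alpha:\n",
"        return g - alpha\n",
"    if g < -alpha:\n",
"        return g + alpha\n",
"    return 0.0\n",
"\n",
"\n",
"def leaf_weight(G, H, reg_lambda=1.0, alpha=0.0):\n",
"    # A larger reg_lambda pulls the leaf weight towards zero.\n",
"    return -soft_threshold(G, alpha) / (H + reg_lambda)\n",
"\n",
"\n",
"def leaf_score(G, H, reg_lambda=1.0, alpha=0.0):\n",
"    return soft_threshold(G, alpha) ** 2 / (H + reg_lambda)\n",
"\n",
"\n",
"def split_gain(GL, HL, GR, HR, reg_lambda=1.0, alpha=0.0, gamma=0.0):\n",
"    # Gain = improvement from splitting minus the complexity penalty gamma.\n",
"    parent = leaf_score(GL + GR, HL + HR, reg_lambda, alpha)\n",
"    children = (leaf_score(GL, HL, reg_lambda, alpha)\n",
"                + leaf_score(GR, HR, reg_lambda, alpha))\n",
"    return 0.5 * (children - parent) - gamma\n",
"\n",
"\n",
"print(leaf_weight(G=-4.0, H=3.0, reg_lambda=1.0))                  # 1.0\n",
"print(leaf_weight(G=-4.0, H=3.0, reg_lambda=1.0, alpha=2.0))       # 0.5\n",
"print(split_gain(-4.0, 3.0, 4.0, 3.0, reg_lambda=1.0))             # 4.0\n",
"print(split_gain(-4.0, 3.0, 4.0, 3.0, reg_lambda=1.0, gamma=5.0))  # -1.0\n",
"```\n",
"\n",
"In the last call the gain becomes negative once `gamma` exceeds the raw improvement, which is why a higher `gamma` leads to fewer splits.\n"
]
},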
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,\n",
" colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0,\n",
" max_delta_step=0, max_depth=4, min_child_weight=1, missing=None,\n",
" n_estimators=100, n_jobs=1, nthread=None,\n",
" objective='binary:logistic', random_state=0, reg_alpha=0,\n",
" reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,\n",
" subsample=1, verbosity=1)"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# import XGBClassifier\n",
"from xgboost import XGBClassifier\n",
"\n",
"\n",
"# declare parameters\n",
"params = {\n",
" 'objective':'binary:logistic',\n",
" 'max_depth': 4,\n",
" 'alpha': 10,\n",
" 'learning_rate': 1.0,\n",
" 'n_estimators':100\n",
" }\n",
" \n",
" \n",
" \n",
"# instantiate the classifier \n",
"xgb_clf = XGBClassifier(**params)\n",
"\n",
"\n",
"\n",
"# fit the classifier to the training data\n",
"xgb_clf.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,\n",
" colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1.0,\n",
" max_delta_step=0, max_depth=4, min_child_weight=1, missing=None,\n",
" n_estimators=100, n_jobs=1, nthread=None,\n",
" objective='binary:logistic', random_state=0, reg_alpha=0,\n",
" reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,\n",
" subsample=1, verbosity=1)\n"
]
}
],
"source": [
"# alternatively view the parameters of the xgb trained model\n",
"print(xgb_clf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Make predictions with XGBoost Classifier"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# make predictions on test data\n",
"y_pred = xgb_clf.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Check accuracy score"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"XGBoost model accuracy score: 0.9167\n"
]
}
],
"source": [
"# check accuracy score\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"print('XGBoost model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that XGBoost obtain very high accuracy score of 91.67%."
]
},
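{
"cell_type": "markdown",
"metadata": {},
"source": [
"An accuracy score is easier to interpret next to a simple baseline. The summary statistics above give a mean of 1.3227 for `Channel`, so roughly 68% of the customers belong to channel 1, and a model that always predicts the majority class would already reach about that accuracy on the full data. The sketch below shows, in plain Python, how accuracy, confusion counts and a majority-class baseline can be computed. The labels are made up for illustration and are not the predictions of this notebook, and the helper names are hypothetical.\n",
"\n",
"```python\n",
"# Accuracy, confusion counts and a majority-class baseline in plain Python.\n",
"# The labels below are made up; they are not the notebook's predictions.\n",
"from collections import Counter\n",
"\n",
"\n",
"def accuracy(y_true, y_pred):\n",
"    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)\n",
"\n",
"\n",
"def confusion_counts(y_true, y_pred):\n",
"    # Keys are (true_label, predicted_label) pairs.\n",
"    return Counter(zip(y_true, y_pred))\n",
"\n",
"\n",
"def majority_baseline(y_true):\n",
"    # Accuracy obtained by always predicting the most frequent class.\n",
"    return Counter(y_true).most_common(1)[0][1] / len(y_true)\n",
"\n",
"\n",
"y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]\n",
"y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]\n",
"\n",
"print(accuracy(y_true, y_pred))                  # 0.8\n",
"print(majority_baseline(y_true))                 # 0.6\n",
"print(confusion_counts(y_true, y_pred)[(1, 0)])  # 1\n",
"```\n",
"\n",
"With real data, scikit-learn's `confusion_matrix` and `classification_report` in `sklearn.metrics` give the same kind of breakdown.\n"
]
},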
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. k-fold Cross Validation using XGBoost\n",
"\n",
"\n",
"To build more robust models with XGBoost, we must do k-fold cross validation. In this way, we ensure that the original training dataset is used for both training and validation. Also, each entry is used for validation just once. XGBoost supports k-fold cross validation using the `cv()` method. In this method, we will specify several parameters which are as follows:- \n",
"\n",
"\n",
"**nfolds** - This parameter specifies the number of cross-validation sets we want to build. \n",
"\n",
"**num_boost_round** - It denotes the number of trees we build.\n",
"\n",
"**metrics** - It is the performance evaluation metrics to be considered during CV.\n",
"\n",
"**as_pandas** - It is used to return the results in a pandas DataFrame.\n",
"\n",
"**early_stopping_rounds** - This parameter stops training of the model early if the hold-out metric does not improve for a given number of rounds.\n",
"\n",
"**seed** - This parameter is used for reproducibility of results.\n",
"\n",
"We can use these parameters to build a k-fold cross-validation model by calling `XGBoost's CV()` method.\n"
]
},
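{
"cell_type": "markdown",
"metadata": {},
"source": [
"The way k-fold cross-validation partitions the rows can be sketched in plain Python. XGBoost builds its folds internally, so this is only an illustration of the idea. The function name `k_fold_indices` is hypothetical, and the sizes mirror this dataset (440 rows, 3 folds).\n",
"\n",
"```python\n",
"# Plain-Python sketch of how k-fold cross-validation partitions row indices.\n",
"# XGBoost constructs its own folds; this only illustrates the idea.\n",
"import random\n",
"\n",
"\n",
"def k_fold_indices(n_samples, n_folds, seed=123):\n",
"    indices = list(range(n_samples))\n",
"    random.Random(seed).shuffle(indices)\n",
"    # Fold i takes every n_folds-th shuffled index, starting at position i.\n",
"    folds = [indices[i::n_folds] for i in range(n_folds)]\n",
"    for i, valid in enumerate(folds):\n",
"        # All other folds together form the training set for this round.\n",
"        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]\n",
"        yield train, valid\n",
"\n",
"\n",
"for train_idx, valid_idx in k_fold_indices(n_samples=440, n_folds=3):\n",
"    print(len(train_idx), len(valid_idx))\n",
"```\n",
"\n",
"Every row index appears in exactly one validation fold, and the training set of each round consists of the remaining folds.\n"
]
},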
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"from xgboost import cv\n",
"\n",
"params = {\"objective\":\"binary:logistic\",'colsample_bytree': 0.3,'learning_rate': 0.1,\n",
" 'max_depth': 5, 'alpha': 10}\n",
"\n",
"xgb_cv = cv(dtrain=data_dmatrix, params=params, nfold=3,\n",
" num_boost_round=50, early_stopping_rounds=10, metrics=\"auc\", as_pandas=True, seed=123)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`xgb_cv` contains train and test `auc` metrics for each boosting round. Let's preview `xgb_cv`."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>train-auc-mean</th>\n",
" <th>train-auc-std</th>\n",
" <th>test-auc-mean</th>\n",
" <th>test-auc-std</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.914998</td>\n",
" <td>0.009704</td>\n",
" <td>0.880965</td>\n",
" <td>0.021050</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.934374</td>\n",
" <td>0.013263</td>\n",
" <td>0.923561</td>\n",
" <td>0.022810</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.936252</td>\n",
" <td>0.013723</td>\n",
" <td>0.924433</td>\n",
" <td>0.025777</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.943878</td>\n",
" <td>0.009032</td>\n",
" <td>0.927152</td>\n",
" <td>0.022228</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.957880</td>\n",
" <td>0.008845</td>\n",
" <td>0.935191</td>\n",
" <td>0.016437</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" train-auc-mean train-auc-std test-auc-mean test-auc-std\n",
"0 0.914998 0.009704 0.880965 0.021050\n",
"1 0.934374 0.013263 0.923561 0.022810\n",
"2 0.936252 0.013723 0.924433 0.025777\n",
"3 0.943878 0.009032 0.927152 0.022228\n",
"4 0.957880 0.008845 0.935191 0.016437"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xgb_cv.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Feature importance with XGBoost\n",
"\n",
"\n",
"XGBoost provides a way to examine the importance of each feature in the original dataset within the model. It involves counting the number of times each feature is split on across all boosting trees in the model. Then we visualize the result as a bar graph, with the features ordered according to how many times they appear. \n",
"\n",
"XGBoost has a **plot_importance()** function that helps us to achieve this task. Then we can visualize the features that has been given the highest important score among all the features. Thus XGBoost provides us a way to do feature selection.\n",
"\n",
"I will proceed as follows:-\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"xgb.plot_importance(xgb_clf)\n",
"plt.rcParams['figure.figsize'] = [6, 4]\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the feature `Grocery` has been given the highest importance score among all the features. Thus XGBoost also gives us a way to do Feature Selection."
]
},
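{
"cell_type": "markdown",
"metadata": {},
"source": [
"The split-count importance described in this section can be illustrated with a small plain-Python sketch. The nested-dict tree format, the feature names `f0` and `f1`, and the function names are all hypothetical and chosen only for illustration; they do not reflect how XGBoost stores its trees or the model trained above.\n",
"\n",
"```python\n",
"# Counting how often each feature is used for a split across a list of trees.\n",
"# The nested-dict tree format is hypothetical, chosen only for illustration.\n",
"from collections import Counter\n",
"\n",
"\n",
"def count_splits(node, counts):\n",
"    # Leaves carry a 'leaf' value; internal nodes carry a 'feature' and two children.\n",
"    if 'leaf' in node:\n",
"        return\n",
"    counts[node['feature']] += 1\n",
"    count_splits(node['left'], counts)\n",
"    count_splits(node['right'], counts)\n",
"\n",
"\n",
"def split_count_importance(trees):\n",
"    counts = Counter()\n",
"    for tree in trees:\n",
"        count_splits(tree, counts)\n",
"    # Most frequently used features come first.\n",
"    return counts.most_common()\n",
"\n",
"\n",
"trees = [\n",
"    {'feature': 'f0', 'left': {'leaf': -0.4},\n",
"     'right': {'feature': 'f1', 'left': {'leaf': 0.1}, 'right': {'leaf': 0.5}}},\n",
"    {'feature': 'f0', 'left': {'leaf': -0.2}, 'right': {'leaf': 0.3}},\n",
"]\n",
"print(split_count_importance(trees))  # [('f0', 2), ('f1', 1)]\n",
"```\n",
"\n",
"Split counts are only one notion of importance. The `plot_importance()` function also accepts an `importance_type` argument (`weight`, `gain` or `cover`), and the ranking of features can differ between these types.\n"
]
},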
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 15. Results and conclusion\n",
"\n",
"\n",
"1.\tIn this project, I implement XGBoost with Python and Scikit-Learn to classify the customers from two different channels as Horeca (Hotel/Retail/Café) customers or Retail channel (nominal) customers.\n",
"\n",
"2.\tThe y labels contain values as 1 and 2. I have converted them into 0 and 1 for further analysis.\n",
"3.\tI have trained the XGBoost classifier and found the accuracy score to be 91.67%.\n",
"\n",
"4.\tI have done the hyperparameter tuning in XGBoost by doing k-fold cross-validation.\n",
"\n",
"5.\tI find the most important feature in XGBoost to be `Grocey`. I did it using the **plot_importance()** function in XGBoost that helps us to achieve this task. \n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@kmlknta21

-Nice. It is helpful to run in Jupyter Notebook. Thank you

@malambomutila

From the Feature Importance graph, Delicassen has the highest F score. Doesn't this mean that Delicassen was the most important feature as opposed to Grocery which was fourth best?

@ajitbalakrishnan

No answer to malambomutila comment?
