{{ message }}

Instantly share code, notes, and snippets.

# Yorko/topic5_medium.md Secret

Last active Mar 4, 2018

# Topic 5. Composition of algorithms, random forest

## Part 1. Bagging

In the previous articles, you saw different algorithms for classification as well as techniques for how to properly validate and evaluate the quality of your models.

Now, suppose that you have chosen the best possible model for a particular problem and are struggling to further improve its accuracy. In this case, you would need to apply some more advanced techniques of machine learning that are collectively referred to as ensembles.

An ensemble is a set of elements that collectively contribute to a whole. A familiar example is a musical ensemble, which blends the sounds of several musical instruments to create a beautiful harmony, or architectural ensembles, which are a set of buildings designed as a unit. In ensembles, the (whole) harmonious outcome is more important than the performance of any individual part.

### Article outline

1. Ensembles
2. Bootstrapping
3. Bagging
4. Out-of-bag error $\DeclareMathOperator{\Var}{Var}$ $\DeclareMathOperator{\Cov}{Cov}$ $\DeclareMathOperator{\Corr}{Corr}$ $\DeclareMathOperator{\Err}{Err}$ $\DeclareMathOperator{\Bias}{Bias}$ $\DeclareMathOperator{\E}{\mathbb{E}}$

### 1. Ensembles

One good example of ensembles is Condorcet's jury theorem (1784), which states that, if each member of the jury makes an independent judgement and the probability of the correct decision by each juror is more than 0.5, then the probability of the correct decision by the whole jury increases with the total number of jurors and tends to one. On the other hand, if the probability of being right is less than 0.5 for each juror, then the probability of the correct decision by the whole jury decreases with the number of jurors and tends to zero.

Let's write an analytic expression for this theorem:

• $\large N$ is the total number of jurors;
• $\large m$ is a minimal number of jurors that would make a majority, that is $\large m = \frac{N + 1}{2}$;
• $\large {N \choose i}$ is the number of $\large i$-combinations from a set with $\large N$ elements.
• $\large p$ is the probability of the correct decision by a juror;
• $\large \mu$ is the probability of the correct decision by the whole jury.

Then:

$$\large \mu = \sum_{i=m}^{N}{N\choose i}p^i(1-p)^{N-i}$$

It can be seen that if $\large p > 0.5$, then $\large \mu > p$. In addition, if $\large N \rightarrow \infty$, then $\large \mu \rightarrow 1$.

Let's look at another example of ensembles: an observation known as Wisdom of the crowd. In 1906, Francis Galton visited a country fair in Plymouth where he saw a contest being held for farmers. 800 participants tried to estimate the weight of a slaughtered bull. The real weight of the bull was 1198 pounds. Although none of the farmers could guess the exact weight of the animal, the average of their predictions was 1197 pounds.

A similar idea for error reduction was adopted in the field of Machine Learning.

### 2. Bootstrapping

Bagging (also known as Bootstrap aggregation) is one of the first and most basic ensemble techniques. It was proposed by Leo Breiman in 1994. Bagging is based on the statistical method of bootstrapping, which makes the evaluation of many statistics of complex models feasible.

The bootstrap method goes as follows. Let there be a sample $\large X$ of size $\large N$. We can make a new sample from the original sample by drawing $\large N$ elements from the latter randomly and uniformly, with replacement. In other words, we select a random element from the original sample of size $\large N$ and do this $\large N$ times. All elements are equally likely to be selected, thus each element is drawn with the equal probability $\large \frac{1}{N}$.

Let's say we are drawing balls from a bag one at a time. At each step, the selected ball is put back into the bag so that the next selection is made equiprobably i.e. from the same number of balls $\large N$. Note that, because we put the balls back, there may be duplicates in the new sample. Let's call this new sample $\large X_1$.

By repeating this procedure $\large M$ times, we create $\large M$ bootstrap samples $\large X_1, \dots, X_M$. In the end, we have a sufficient number of samples and can compute various statistics of the original distribution. For our example, we'll use the familiar telecom_churn dataset. Previously, when we discussed feature importance, we saw that one of the most important features in this dataset is the number of calls to customer service. Let's vizualize the data and look at the distribution of this feature.

import pandas as pd
from matplotlib import pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = 10, 6
import seaborn as sns
%matplotlib inline

fig = sns.kdeplot(telecom_data[telecom_data['Churn'] == False]['Customer service calls'], label = 'Loyal')
fig = sns.kdeplot(telecom_data[telecom_data['Churn'] == True]['Customer service calls'], label = 'Churn')
fig.set(xlabel='Number of calls', ylabel='Density')
plt.show() As you can see, loyal customers make fewer calls to customer service than those who eventually left. Now, it might be a good idea to estimate the average number of customer service calls in each group. Since our dataset is small, we would not get a good estimate by simply calculating the mean of the original sample. We will be better off applying the bootstrap method. Let's generate 1000 new bootstrap samples from our original population and produce an interval estimate of the mean.

import numpy as np

def get_bootstrap_samples(data, n_samples):
"""Generate bootstrap samples using the bootstrap method."""
indices = np.random.randint(0, len(data), (n_samples, len(data)))
samples = data[indices]
return samples
def stat_intervals(stat, alpha):
"""Produce an interval estimate."""
boundaries = np.percentile(stat, [100 * alpha / 2., 100 * (1 - alpha / 2.)])
return boundaries

# Save the data about the loyal and former customers to split the dataset
loyal_calls = telecom_data[telecom_data['Churn'] == False]['Customer service calls'].values
churn_calls= telecom_data[telecom_data['Churn'] == True]['Customer service calls'].values

# Set the seed for reproducibility of the results
np.random.seed(0)

# Generate the samples using bootstrapping and calculate the mean for each of them
loyal_mean_scores = [np.mean(sample)
for sample in get_bootstrap_samples(loyal_calls, 1000)]
churn_mean_scores = [np.mean(sample)
for sample in get_bootstrap_samples(churn_calls, 1000)]

# Print the resulting interval estimates
print("Service calls from loyal: mean interval", stat_intervals(loyal_mean_scores, 0.05))
print("Service calls from churn: mean interval", stat_intervals(churn_mean_scores, 0.05))
Service calls from loyal:  mean interval [1.4077193  1.49473684]
Service calls from churn:  mean interval [2.0621118  2.39761905]


In the end, we see that, with 95% probability, the average number of customer service calls from loyal customers lies between 1.40 and 1.50 while the churned clients called 2.06 through 2.40 times on average. Also, note that the interval for the loyal customers is narrower, which is reasonable since they make fewer calls (0, 1 or 2) in comparison with the churned clients who called until they became fed up and switched providers.

### 3. Bagging

Now that you've grasped the idea of bootstrapping, we can move on to bagging.

Suppose that we have a training set $\large X$. Using bootstrapping, we generate samples $\large X_1, \dots, X_M$. Now, for each bootstrap sample, we train its own classifier $\large a_i(x)$. The final classifier will average the outputs from all these individual classifiers. In the case of classification, this technique corresponds to voting: $$\large a(x) = \frac{1}{M}\sum_{i = 1}^M a_i(x).$$

The picture below illustrates this algorithm: Let's consider a regression problem with base algorithms $\large b_1(x), \dots , b_n(x)$. Assume that there exists an ideal target function of true answers $\large y(x)$ defined for all inputs and that the distribution $\large p(x)$ is defined. We can then express the error for each regression function as follows:

$$\large \varepsilon_i(x) = b_i(x) - y(x), \quad i = 1, \dots, n$$

And the expected value of the mean squared error:

$$\large \E_x\left[\left(b_i(x) - y(x)\right)^{2}\right] = \E_x\left[\varepsilon_i^{2}(x)\right].$$

Then, the mean error over all the regression functions will look as follows:
$$\large \E_1 = \frac{1}{n} \E_x\left[\varepsilon_i^{2}(x)\right]$$

We'll assume that the errors are unbiased and uncorrelated, that is:

$$\large \begin{array}{rcl} \E_x\left[\varepsilon_i(x)\right] &=& 0, \ \E_x\left[\varepsilon_i(x)\varepsilon_j(x)\right] &=& 0, \quad i \neq j. \end{array}$$

Now, let's construct a new regression function that will average the values from the individual functions:

$$\large a(x) = \frac{1}{n}\sum_{i=1}^{n}b_i(x)$$

Let's find its mean squared error:

$$\large \begin{array}{rcl}\E_n &=& \E_x\left[\frac{1}{n}\sum_{i=1}^{n}b_i(x)-y(x)\right]^2 \ &=& \E_x\left[\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\right]^2 \ &=& \frac{1}{n^2}\E_x\left[\sum_{i=1}^{n}\varepsilon_i^2(x) + \sum_{i \neq j}\varepsilon_i(x)\varepsilon_j(x)\right] \ &=& \frac{1}{n}\E_1\end{array}$$

Thus, by averaging the individual answers, we reduced the mean squared error by a factor of $\large n$.

From our previous lesson, let's recall the components that make up the total out-of-sample error:

$$\large \begin{array}{rcl} \Err\left(\vec{x}\right) &=& \E\left[\left(y - \hat{f}\left(\vec{x}\right)\right)^2\right] \ &=& \sigma^2 + f^2 + \Var\left(\hat{f}\right) + \E\left[\hat{f}\right]^2 - 2f\E\left[\hat{f}\right] \ &=& \left(f - \E\left[\hat{f}\right]\right)^2 + \Var\left(\hat{f}\right) + \sigma^2 \ &=& \Bias\left(\hat{f}\right)^2 + \Var\left(\hat{f}\right) + \sigma^2 \end{array}$$

Bagging reduces the variance of a classifier by decreasing the difference in error when we train the model on different datasets. In other words, bagging prevents overfitting. The efficacy of bagging comes from the fact that the individual models are quite different due to the different training data and their errors cancel out during voting. Additionally, outliers are likely omitted in some of the training bootstap samples.

The scikit-learn library supports bagging with meta-estimators BaggingRegressor and BaggingClassifier. You can use most of the algorithms in the library as a base.

Let's examine how bagging works in practice and compare it with the decision tree. For this, we will use an example from sklearn's documentation. The error for the decision tree: $$\large 0.0255 , (\Err) = 0.0003 , (\Bias^2) + 0.0152 , (\Var) + 0.0098 , (\sigma^2)$$

The error when using bagging: $$\large 0.0196 , (\Err) = 0.0004 , (\Bias^2) + 0.0092 , (\Var) + 0.0098 , (\sigma^2)$$

As you can see from the graph above, the variance in the error is much lower for bagging. Remember that we have already proved this theoretically.

Bagging is effective on small datasets. Dropping even a small part of training data leads to constructing substantially different base classifiers. If you have a large dataset, you would generate bootstrap samples of a much smaller size.

The example above is unlikely to be applicable to any real work. This is because we made the strong assumption that our individual errors are uncorrelated. More often than not, this is way too optimistic for real-world applications. When this assumption is false, the reduction in error will not be as significant. In the following lectures, we will discuss some more sophisticated ensemble methods, which enable more accurate predictions in real-world problems.

### 4. Out-of-bag error

Looking ahead, in the case of random forest models, there is no need to use cross-validation or hold-out samples in order to get an unbiased error estimation. Why? Because, in ensemble techniques, the error estimation takes place internally.

Random trees are constructed using different bootstrap samples of the original dataset. Approximately 37% of inputs are left out of a particular bootstrap sample and are not used in the construction of the $\large k$-th tree.

This is easy to prove. Suppose there are $\large \ell$ examples in our dataset. At each step, each data point has equal probability of ending up in a bootstrap sample with replacement, probability $\large\frac{1}{\ell}.$ The probability that there is no such bootstrap sample that contains a particular dataset element (i.e. it has been omitted $\large \ell$ times) equals $\large (1 - \frac{1}{\ell})^\ell$. When $\large \ell \rightarrow +\infty$, it becomes equal to the Second Remarkable Limit $\large \frac{1}{e}$. Then, the probability of selecting a specific example is $\large \approx 1 - \frac{1}{e} \approx 63%$.

Let's visualize how Out-of-Bag Error (or OOBE) estimation works: The top part of the figure above represents our original dataset. We split it into the training (left) and test (right) sets. In the left image, we draw a grid that perfectly divides our dataset according to classes. Now, we use the same grid to estimate the share of the correct answers on our test set. We can see that our classifier gave incorrect answers in those 4 cases that have not been used during training (on the left). Hence, the accuracy of our classifier is $\large \frac{11}{15}*100% = 73.33%$.

To sum up, each base algorithm is trained on $\large \approx 63%$ of the original examples. It can be validated on the remaining $\large \approx 37%$. The Out-of-Bag estimate is nothing more than the mean estimate of the base algorithms on those $\large \approx 37%$ of inputs that were left out of training.

# Topic 5. Composition of algorithms, random forest

## Part 2. Random forest

Leo Breiman managed to apply bootstrapping not only in statistics but also in machine learning. He, along with Adel Cutler, extended and improved the random forest algorithm proposed by Tin Kam Ho. They combined the construction of uncorrelated trees using CART, bagging, and the random subspace method.

Decision trees are a good choice for the base classifier in bagging because they are quite sophisticated and can achieve zero classification error on any sample. The random subspace method reduces the correlation between the trees and thus prevents overfitting. With bagging, the base algorithms are trained on different random subsets of the original feature set.

The following algorithm constructs an ensemble of models using the random subspace method:

1. Let the number of input-output examples be equal to $\large \ell$, and the number of feature dimensions be equal to $\large d$.
2. Choose $\large L$ as the number of individual models in the ensemble.
3. For each model $\large l$, choose the number of features $\large dl < d$. As a rule, the same value of $\large dl$ is used for all the models.
4. For each model $\large l$, create a training set by selecting $\large dl$ features at random from the whole set of $\large d$ feature dimensions.
5. Train each model.
6. Apply the resulting ensemble model to a new input by combining the results from all the models in $\large L$. You can use either majority voting or aggregation of the posterior probabilities.

### Article outline

1. Algorithm
2. Comparison with Decision Trees and Bagging
3. Parameters
4. Variance and Decorrelation
5. Bias
6. Extremely Randomized Trees
7. Similarities between Random Forest and k-Nearest Neighbors techniques
8. Transformation of a dataset into a high-dimensional representation
9. Pros and cons of random forests
10. Useful resources $\DeclareMathOperator{\Var}{Var}$ $\DeclareMathOperator{\Cov}{Cov}$ $\DeclareMathOperator{\Corr}{Corr}$ $\DeclareMathOperator{\Err}{Err}$ $\DeclareMathOperator{\Bias}{Bias}$ $\DeclareMathOperator{\E}{\mathbb{E}}$

### 1. Algorithm

The algorithm for constructing a random forest of $\large N$ trees goes as follows:

• For each $\large k = 1, \dots, N$:

• Generate a bootstrap sample $\large X_k$.

• Build a decision tree $\large b_k$ on the sample $\large X_k$:

• Pick the best feature dimension according to the given criteria. Split the sample by this feature to create a new tree level. Repeat this procedure until the sample is exhausted.
• Building the tree until any of its leaves contains no more than $\large n_\text{min}$ examples or until a certain depth is reached.
• For each split, we first randomly pick $\large m$ features from the $\large d$ original ones and then search for the next best split only among the subset.

The final classifier is defined by: $$\large a(x) = \frac{1}{N}\sum_{k = 1}^N b_k(x)$$

We use the majority voting for classification, the mean for regression.

For classification problems, it is advisable to set $\large m = \sqrt{d}$. For regression problems, we usually take $\large m = \frac{d}{3}$, where $\large d$ is the number of features. It is recommended to build each tree until all of its leaves contain only $\large n_\text{min} = 1$ examples for classification and $\large n_\text{min} = 5$ examples for regression.

You can see random forest as bagging of decision trees with the modification of selecting a random subset of feature dimensions at each candidate split.

### 2. Comparison with Decision Trees and Bagging

from __future__ import division, print_function

# Disable warnings in Anaconda
import warnings
import numpy as np
warnings.filterwarnings('ignore')

%matplotlib inline
from matplotlib import pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = 10, 6
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

n_train = 150
n_test = 1000
noise = 0.1

# Generate data
def f(x):
x = x.ravel()
return np.exp(-x ** 2) + 1.5 * np.exp(-(x - 2) ** 2)

def generate(n_samples, noise):
X = np.random.rand(n_samples) * 10 - 5
X = np.sort(X).ravel()
y = np.exp(-X ** 2) + 1.5 * np.exp(-(X - 2) ** 2)\
+ np.random.normal(0.0, noise, n_samples)
X = X.reshape((n_samples, 1))

return X, y

X_train, y_train = generate(n_samples=n_train, noise=noise)
X_test, y_test = generate(n_samples=n_test, noise=noise)

# One decision tree regressor
dtree = DecisionTreeRegressor().fit(X_train, y_train)
d_predict = dtree.predict(X_test)

plt.figure(figsize=(10, 6))
plt.plot(X_test, f(X_test), "b")
plt.scatter(X_train, y_train, c="b", s=20)
plt.plot(X_test, d_predict, "g", lw=2)
plt.xlim([-5, 5])
plt.title("Decision tree, MSE = %.2f"
% np.sum((y_test - d_predict) ** 2))

# Bagging with a decision tree regressor
bdt = BaggingRegressor(DecisionTreeRegressor()).fit(X_train, y_train)
bdt_predict = bdt.predict(X_test)

plt.figure(figsize=(10, 6))
plt.plot(X_test, f(X_test), "b")
plt.scatter(X_train, y_train, c="b", s=20)
plt.plot(X_test, bdt_predict, "y", lw=2)
plt.xlim([-5, 5])
plt.title("Bagging for decision trees, MSE = %.2f" % np.sum((y_test - bdt_predict) ** 2));

# Random Forest
rf = RandomForestRegressor(n_estimators=10).fit(X_train, y_train)
rf_predict = rf.predict(X_test)

plt.figure(figsize=(10, 6))
plt.plot(X_test, f(X_test), "b")
plt.scatter(X_train, y_train, c="b", s=20)
plt.plot(X_test, rf_predict, "r", lw=2)
plt.xlim([-5, 5])
plt.title("Random forest, MSE = %.2f" % np.sum((y_test - rf_predict) ** 2));   As we can see from our graphs and the MSE values above, a random forest of 10 trees achieves a better result than a single decision tree or bagging with 10 trees. The main difference between random forests and bagging is that, in a random forest, the best feature for a split is selected from a random subset of the available features while, in bagging, all features are considered for the next best split.

We can also look at the advantages of random forests and bagging in classification problems:

np.random.seed(42)
X, y = make_circles(n_samples=500, factor=0.1, noise=0.35, random_state=42)
X_train_circles, X_test_circles, y_train_circles, y_test_circles = train_test_split(X, y, test_size=0.2)

dtree = DecisionTreeClassifier(random_state=42)
dtree.fit(X_train_circles, y_train_circles)

x_range = np.linspace(X.min(), X.max(), 100)
xx1, xx2 = np.meshgrid(x_range, x_range)
y_hat = dtree.predict(np.c_[xx1.ravel(), xx2.ravel()])
y_hat = y_hat.reshape(xx1.shape)
plt.contourf(xx1, xx2, y_hat, alpha=0.2)
plt.scatter(X[:,0], X[:,1], c=y, cmap='autumn')
plt.title("Decision tree")
plt.show()

b_dtree = BaggingClassifier(DecisionTreeClassifier(),n_estimators=300, random_state=42)
b_dtree.fit(X_train_circles, y_train_circles)

x_range = np.linspace(X.min(), X.max(), 100)
xx1, xx2 = np.meshgrid(x_range, x_range)
y_hat = b_dtree.predict(np.c_[xx1.ravel(), xx2.ravel()])
y_hat = y_hat.reshape(xx1.shape)
plt.contourf(xx1, xx2, y_hat, alpha=0.2)
plt.scatter(X[:,0], X[:,1], c=y, cmap='autumn')
plt.title("Bagging (decision trees)")
plt.show()

rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train_circles, y_train_circles)

x_range = np.linspace(X.min(), X.max(), 100)
xx1, xx2 = np.meshgrid(x_range, x_range)
y_hat = rf.predict(np.c_[xx1.ravel(), xx2.ravel()])
y_hat = y_hat.reshape(xx1.shape)
plt.contourf(xx1, xx2, y_hat, alpha=0.2)
plt.scatter(X[:,0], X[:,1], c=y, cmap='autumn')
plt.title("Random forest")
plt.show()   The figures above show that the separating boundary of the decision tree is quite jagged and has a lot of acute angles that suggest overfitting and a weak ability to generalize. We would have trouble making reliable predictions on new test data. In contrast, the bagging algorithm has a rather smooth boundary and has no obvious signs of overfitting.

Now, let's investigate some parameters which can help us increase the model accuracy.

### 3. Parameters

The scikit-learn library implements random forests by providing two estimators: RandomForestClassifier and RandomForestRegressor.

The full list of random forest parameters for regression is shown below:

class sklearn.ensemble.RandomForestRegressor(
n_estimators вЂ” the number of trees in the forest (default = 10)
criterion вЂ” the function used to measure the quality of a split. Supported criteria are вЂњmseвЂќ for the mean squared error, which is equal to variance reduction as feature selection criterion, and вЂњmaeвЂќ for the mean absolute error (default = "mse")
max_features вЂ” the number of features to consider when looking for the best split. You can specify the number or percentage of features, or choose from the available values: "auto" (all features), "sqrt", "log2". (default = "auto")
max_depth вЂ” the maximum depth of the tree (default means that nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples)
min_samples_split вЂ” the minimum number of samples required to split an internal node. Can be specified as the number or as a percentage of a total number of samples (default = 2)
min_samples_leaf вЂ” the minimum number of samples required at a leaf node(default = 1)
min_weight_fraction_leaf вЂ” the minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided (default = 0)
max_leaf_nodes вЂ” the maximum number of leaves (default = no restrictions)
min_impurity_split вЂ” threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf (default = 1Рµ-7)
bootstrap вЂ” whether bootstrap samples are used when building trees(default = True)
oob_score вЂ” whether to use out-of-bag samples to estimate the R^2 on unseen data (default = False)
n_jobs вЂ” the number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores (default = 1)
random_state вЂ” if int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random (default = None)
verbose вЂ” controls the verbosity of the tree building process (default = 0)
warm_start вЂ” when set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest (default = False)
)
  File "<ipython-input-1-3e14f956f4ac>", line 1
class sklearn.ensemble.RandomForestRegressor(
^
SyntaxError: invalid syntax


For classification, the parameters are mostly the same. Only the following differ between RandomForestClassifier and RandomForestRegressor:

class sklearn.ensemble.RandomForestClassifier(
criterion вЂ” the function used to measure the quality of a split. Supported criteria are вЂњginiвЂќ for the Gini impurity and вЂњentropyвЂќ for the information gain. Note: this parameter is tree-specific (default = "gini")
class_weight вЂ” the weight of each class (by default all weights equal to 1, but you can create a dictionary with weights or specify it as "balanced" - uses the values of classes to automatically adjust weights inversely proportional to class frequencies in the input data or as "balanced_subsample" - the same as вЂњbalancedвЂќ except that weights are computed based on the bootstrap sample for every tree grown)
)
  File "<ipython-input-2-253fea7aa72b>", line 1
class sklearn.ensemble.RandomForestClassifier(
^
SyntaxError: invalid syntax


Below are the parameters which we need to pay attention to when we are building a new model:

• n_estimators вЂ” the number of trees in the forest;
• criterion вЂ” the function used to measure the quality of a split;
• max_features вЂ” the number of features to consider when looking for the best split;
• min_samples_leaf вЂ” the minimum number of samples required to be at a leaf node;
• max_depth вЂ” the maximum depth of the tree.

#### Practice with random forests in a real problem

We will use the problem of fraud detection as our example. This is a classification problem, and we will use accuracy for model evaluation.

First, let's build a simple classifier which we will use as a baseline. For the sake of simplicity, we will use only numeric features.

import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score

# Choose the numeric features
cols = []
for i in df.columns:
if (df[i].dtype == "float64") or (df[i].dtype == 'int64'):
cols.append(i)

# Divide the dataset into the input and target
X, y = df[cols].copy(), np.asarray(df["Churn"],dtype='int8')

# Initialize a stratified split of our dataset for the validation process
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Initialize the classifier with the default parameters
rfc = RandomForestClassifier(random_state=42, n_jobs=-1, oob_score=True)

# Train it on the training set
results = cross_val_score(rfc, X, y, cv=skf)

# Evaluate the accuracy on the test set
print("CV accuracy score: {:.2f}%".format(results.mean()*100))
CV accuracy score: 91.48%


We have accuracy equal to 91.48%. Now, let's try to impove this result, and take a look at the behaviour of the learning curves when we change the basic parameters.

# Initialize the validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Create lists to save the values of accuracy on training and test sets
train_acc = []
test_acc = []
temp_train_acc = []
temp_test_acc = []
trees_grid = [5, 10, 15, 20, 30, 50, 75, 100]

# Train on the training set
for ntrees in trees_grid:
rfc = RandomForestClassifier(n_estimators=ntrees, random_state=42, n_jobs=-1, oob_score=True)
temp_train_acc = []
temp_test_acc = []
for train_index, test_index in skf.split(X, y):
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y[train_index], y[test_index]
rfc.fit(X_train, y_train)
temp_train_acc.append(rfc.score(X_train, y_train))
temp_test_acc.append(rfc.score(X_test, y_test))
train_acc.append(temp_train_acc)
test_acc.append(temp_test_acc)

train_acc, test_acc = np.asarray(train_acc), np.asarray(test_acc)
print("Best accuracy on CV is {:.2f}% with {} trees".format(max(test_acc.mean(axis=1))*100,
trees_grid[np.argmax(test_acc.mean(axis=1))]))
Best accuracy on CV is 92.44% with 50 trees

plt.style.use('ggplot')

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(trees_grid, train_acc.mean(axis=1), alpha=0.5, color='blue', label='train')
ax.plot(trees_grid, test_acc.mean(axis=1), alpha=0.5, color='red', label='cv')
ax.fill_between(trees_grid, test_acc.mean(axis=1) - test_acc.std(axis=1), test_acc.mean(axis=1) + test_acc.std(axis=1), color='#888888', alpha=0.4)
ax.fill_between(trees_grid, test_acc.mean(axis=1) - 2*test_acc.std(axis=1), test_acc.mean(axis=1) + 2*test_acc.std(axis=1), color='#888888', alpha=0.2)
ax.legend(loc='best')
ax.set_ylim([0.88,1.02])
ax.set_ylabel("Accuracy")
ax.set_xlabel("N_estimators"); As you can see, when a certain number of trees is reached, our accuracy on the test set is very close to the asymptote. You can decide by yourself which value would be the optimal number of trees for your problem.

The figures also show that we achieved 100% accuracy on the training set, which tells us we overfit. In order to avoid the overfitting, we need to add regularization parameters to our model.

We will start with the maximum depth of trees max_depth and fix the number of trees at 100:

# Create lists to save accuracy values on the training and test sets
train_acc = []
test_acc = []
temp_train_acc = []
temp_test_acc = []
max_depth_grid = [3, 5, 7, 9, 11, 13, 15, 17, 20, 22, 24]

# Train on the training set
for max_depth in max_depth_grid:
rfc = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1, oob_score=True, max_depth=max_depth)
temp_train_acc = []
temp_test_acc = []
for train_index, test_index in skf.split(X, y):
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y[train_index], y[test_index]
rfc.fit(X_train, y_train)
temp_train_acc.append(rfc.score(X_train, y_train))
temp_test_acc.append(rfc.score(X_test, y_test))
train_acc.append(temp_train_acc)
test_acc.append(temp_test_acc)

train_acc, test_acc = np.asarray(train_acc), np.asarray(test_acc)
print("Best accuracy on CV is {:.2f}% with {} max_depth".format(max(test_acc.mean(axis=1))*100,
max_depth_grid[np.argmax(test_acc.mean(axis=1))]))

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(max_depth_grid, train_acc.mean(axis=1), alpha=0.5, color='blue', label='train')
ax.plot(max_depth_grid, test_acc.mean(axis=1), alpha=0.5, color='red', label='cv')
ax.fill_between(max_depth_grid, test_acc.mean(axis=1) - test_acc.std(axis=1), test_acc.mean(axis=1) + test_acc.std(axis=1), color='#888888', alpha=0.4)
ax.fill_between(max_depth_grid, test_acc.mean(axis=1) - 2*test_acc.std(axis=1), test_acc.mean(axis=1) + 2*test_acc.std(axis=1), color='#888888', alpha=0.2)
ax.legend(loc='best')
ax.set_ylim([0.88,1.02])
ax.set_ylabel("Accuracy")
ax.set_xlabel("Max_depth");
Best accuracy on CV is 92.68% with 17 max_depth Parameter max_depth copes well with the regularization of our model and it does not overfit as badly as before. The model accuracy has increased slightly.

Another important parameter worth tuning is min_samples_leaf. It also contributes to regularization.

# Create lists to save accuracy values on the training and test sets
train_acc = []
test_acc = []
temp_train_acc = []
temp_test_acc = []
min_samples_leaf_grid = [1, 3, 5, 7, 9, 11, 13, 15, 17, 20, 22, 24]

# Train on the training set
for min_samples_leaf in min_samples_leaf_grid:
rfc = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1,
oob_score=True, min_samples_leaf=min_samples_leaf)
temp_train_acc = []
temp_test_acc = []
for train_index, test_index in skf.split(X, y):
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y[train_index], y[test_index]
rfc.fit(X_train, y_train)
temp_train_acc.append(rfc.score(X_train, y_train))
temp_test_acc.append(rfc.score(X_test, y_test))
train_acc.append(temp_train_acc)
test_acc.append(temp_test_acc)

train_acc, test_acc = np.asarray(train_acc), np.asarray(test_acc)
print("Best accuracy on CV is {:.2f}% with {} min_samples_leaf".format(max(test_acc.mean(axis=1))*100,
min_samples_leaf_grid[np.argmax(test_acc.mean(axis=1))]))
Best accuracy on CV is 92.41% with 3 min_samples_leaf

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(min_samples_leaf_grid, train_acc.mean(axis=1), alpha=0.5, color='blue', label='train')
ax.plot(min_samples_leaf_grid, test_acc.mean(axis=1), alpha=0.5, color='red', label='cv')
ax.fill_between(min_samples_leaf_grid, test_acc.mean(axis=1) - test_acc.std(axis=1), test_acc.mean(axis=1) + test_acc.std(axis=1), color='#888888', alpha=0.4)
ax.fill_between(min_samples_leaf_grid, test_acc.mean(axis=1) - 2*test_acc.std(axis=1), test_acc.mean(axis=1) + 2*test_acc.std(axis=1), color='#888888', alpha=0.2)
ax.legend(loc='best')
ax.set_ylim([0.88,1.02])
ax.set_ylabel("Accuracy")
ax.set_xlabel("Min_samples_leaf"); In this case, we do not see an improvement in accuracy on the validation set, but we significantly reduce the overfitting down to 2% while keeping the accuracy at about 92%.

Let's consider the parameter max_features. For classification, the value $\large \sqrt{d}$ (the total number of features) is typically used as the default choice. Let's check whether it would be optimal to use 4 features in our case:

# Create lists to save accuracy values on the training and test sets
train_acc = []
test_acc = []
temp_train_acc = []
temp_test_acc = []
max_features_grid = [2, 4, 6, 8, 10, 12, 14, 16]

# Train on the training dataset
for max_features in max_features_grid:
rfc = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1,
oob_score=True, max_features=max_features)
temp_train_acc = []
temp_test_acc = []
for train_index, test_index in skf.split(X, y):
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y[train_index], y[test_index]
rfc.fit(X_train, y_train)
temp_train_acc.append(rfc.score(X_train, y_train))
temp_test_acc.append(rfc.score(X_test, y_test))
train_acc.append(temp_train_acc)
test_acc.append(temp_test_acc)

train_acc, test_acc = np.asarray(train_acc), np.asarray(test_acc)
print("Best accuracy on CV is {:.2f}% with {} max_features".format(max(test_acc.mean(axis=1))*100,
max_features_grid[np.argmax(test_acc.mean(axis=1))]))

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(max_features_grid, train_acc.mean(axis=1), alpha=0.5, color='blue', label='train')
ax.plot(max_features_grid, test_acc.mean(axis=1), alpha=0.5, color='red', label='cv')
ax.fill_between(max_features_grid, test_acc.mean(axis=1) - test_acc.std(axis=1), test_acc.mean(axis=1) + test_acc.std(axis=1), color='#888888', alpha=0.4)
ax.fill_between(max_features_grid, test_acc.mean(axis=1) - 2*test_acc.std(axis=1), test_acc.mean(axis=1) + 2*test_acc.std(axis=1), color='#888888', alpha=0.2)
ax.legend(loc='best')
ax.set_ylim([0.88,1.02])
ax.set_ylabel("Accuracy")
ax.set_xlabel("Max_features");
Best accuracy on CV is 92.59% with 12 max_features In our case, the optimal number of features is equal 10. This is the value at which the best result is achieved.

We have seen how the learning curves change with different values of the basic parameters. Now, let's use GridSearch to find the optimal parameters for our example:

# Initialize the set of parameters for exhaustive search and fit
parameters = {'max_features': [4, 7, 10, 13], 'min_samples_leaf': [1, 3, 5, 7], 'max_depth': [5,10,15,20]}
rfc = RandomForestClassifier(n_estimators=100, random_state=42,
n_jobs=-1, oob_score=True)
gcv = GridSearchCV(rfc, parameters, n_jobs=-1, cv=skf, verbose=1)
gcv.fit(X, y)
Fitting 5 folds for each of 64 candidates, totalling 320 fits

[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    5.5s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   35.0s
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed:  1.1min finished

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=True),
error_score='raise',
estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
oob_score=True, random_state=42, verbose=0, warm_start=False),
fit_params=None, iid=True, n_jobs=-1,
param_grid={'max_features': [4, 7, 10, 13], 'min_samples_leaf': [1, 3, 5, 7], 'max_depth': [5, 10, 15, 20]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=1)

gcv.best_estimator_, gcv.best_score_
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=10, max_features=10, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
oob_score=True, random_state=42, verbose=0, warm_start=False),
0.9270927092709271)


The best accuracy we were able to achieve with GridSearch is 92.83%. We found the following parameter values: max_depth: 15, max_features: 7, min_samples_leaf: 3.

### 4. Variance and Decorrelation

Let's write the variance of a random forest as

$$\large \Var f(x) = \rho(x)\sigma^2(x)$$

$$\large \rho(x) = \Corr\left[T(x,\Theta_1(Z)),T(x_2,\Theta_2(Z))\right],$$

where

• $\large \rho(x)$ is the sample correlation coefficient between any two trees used in averaging:
• $\large \Theta_1(Z)$ and $\large \Theta_2(Z)$ are a randomly selected pair of trees on randomly selected elements of the sample $Z$;
• $\large T(x,\Theta_i(Z))$ is the output of the $\large i$-th tree classifier on an input vector $\large x$;
• $\large \sigma^2(x)$ is the sample variance of any randomly selected tree:

$$\large \sigma^2(x) = \Var\left[T(x,\Theta(X))\right]$$

It is easy to confuse $\large \rho(x)$ with the average correlation between the trained trees in a given random forest when we consider trees as N-vectors and calculate the average pairwise correlation between them. But this is not the case.

In fact, this conditional correlation is not directly related to the averaging process, and the dependence of $\large \rho(x)$ on $\large x$ warns us of this difference. $\large \rho(x)$ is the theoretical correlation between a pair of random trees estimated on the input $\large x$. Its value comes from the repeated sampling of the training set from the population $\large Z$ and the subsequent random choice of a pair of trees. In statistics jargon, this is the correlation caused by the sampling distribution of $\large Z$ and $\large \Theta$.

The conditional covariance of any pair of trees is equal to 0 because the bootstrapping and feature selection are independent and identically distributed.

If we consider the variance of a single tree, it barely depends on the parameters of the splitting ($\large m$). But they are crucial for ensembles. The variance of a tree is much higher than the one of an ensemble. The book The Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani Рё Jerome Friedman) has a great example that demostrates this fact: ### 5. Bias

Just as in bagging, the bias of a random forest is the same as the bias of a single tree $\large T(x,\Theta(Z))$:

$$\large \begin{array}{rcl} \Bias &=& \mu(x) - \E_Z , f_{rf}(x) \ &=& \mu(x) - \E_Z , \E_{\Theta | Z} , T(x,\Theta(Z))\end{array}$$

In absolute value, the bias is usually higher than that of an unprunned tree because randomization and sample space reduction impose their own restrictions on the model. Therefore, the improvements in prediction accuracy obtained by bagging and random forests are solely the result of variance reduction.

### 6. Extremely Randomized Trees

Extremely Randomized Trees employ a greater degree of randomization at the cut-point choice when splitting a tree node. As in random forests, a random subset of features is used. But, instead of the search for the optimal thresholds, their values are selected at random for each possible feature, and the best one among these randomly generated thresholds is used as the best rule to split the node. This usually trades off a slight reduction in the model variance with a small increase of the bias.

In the scikit-learn library, there are 2 implementations of Extremely Randomized Trees: ExtraTreesClassifier and ExtraTreesRegressor.

This method should be used if you have greatly overfit with random forests or gradient boosting.

### 7. Similarities between Random Forest and k-Nearest Neighbors techniques

The random forest method is similar to the nearest neighbors technique. Random forests predictions are based on labels of alike examples from the training set. The more often these examples appear in the same leaf of a tree, the higher their similarity. Let's prove this formally.

Let's consider a regression problem with the quadratic loss function. Let $\large T_n(x)$ be the number of the leaf of the $\large n$-th tree in a random forest with input $\large x$. The algorithm response for the input vector $\large x$ equals the averaged response over all the examples of the training sample that fall into the leaf $\large T_n(x)$. This can be written as

$$\large b_n(x) = \sum_{i=1}^{l}w_n(x,x_i)y_i,$$

where

$$\large w_n(x, x_i) = \frac{\left[T_n(x) = T_n(x_i)\right]}{\sum_{j=1}^{l}\left[T_n(x) = T_n(x_j)\right]}$$

Then, the response of the composition is

$$\large \begin{array}{rcl} a_n(x) &=& \frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{l}w_n(x,x_i)y_i \ &=& \sum_{i=1}^{l}\left(\frac{1}{N}\sum_{j=1}^{N}w_n(x,x_j)\right)y_i \end{array}$$

You can see that the response of a random forest is a weighted sum of responses over all training examples.

It is also worth noting that the number of the leaf $\large T_n(x)$, where the instance $\large x$ ended up, is a valuable feature by itself. For example, the following approach works well: 1) A composition of a small number of trees is trained on a sample using a random forest or gradient boosting. 2) The categorical features $\large T_1(x), \dots, T_n(x)$ are added to the sample.

These new features are the result of the non-linear space splitting, and they provide information about similarity between examples. In the book The Elements of Statistical Learning, there is a good illustrative example that demonstrates this similarity between random forests and the k-nearest neighbors technique: ### 8. Transformation of a dataset into a high-dimensional representation

Random forests are mostly used in supervised learning, but there is a way to apply them in the unsupervised setting.

Using the scikit-learn method RandomTreesEmbedding, we can transform our dataset into a high-dimensional, sparse representation. We first build extremely randomized trees and then use the index of the leaf containing the example as a new feature.

For example, if the input appears in the first leaf, we assign $1$ as the feature value; if not, we assign $0$. This is a so-called binary coding. We can control the number of features and the sparseness of data by increasing or decreasing the number of trees and their depth. Because nearby data points are likely to fall into the same leaf, this transformation provides an implicit nonparametric estimate of their density.

### 9. Pros and cons of random forests

• High prediction accuracy; will perform better than linear algorithms in most problems; the accuracy is comparable with that of boosting.
• Robust to outliers, thanks to random sampling.
• Insensitive to the scaling of features as well as any other monotonic transformations due to the random subspace selection.
• Doesn't require fine-grained parameter tuning, works quite well out-of-the-box. With tuning, it is possible to achieve a 0.5вЂ“3% gain in accuracy, depending on the problem setting and data.
• Efficient for datasets with a large number of features and classes.
• Handles both continuous and discrete variables equally well.
• Rarely overfits. In practice, an increase in the tree number almost always improves the composition. But, after reaching a certain number of trees, the learning curve is very close to the asymptote.
• There are developed methods to estimate feature significance.
• Works well with missing data and maintains good accuracy levels even when a large part of data is missing.
• Provides means to weight classes on the whole dataset as well as for each tree sample.
• Under the hood, calculates proximities between pairs of examples that can subsequantly be used in clustering, outlier detection, or interesting data representations (by way of scaling).
• The above functionality and properties may be extended to unlabeled data to enable unsupervised clustering, data visualization, and outlier detection.
• Easily parallelized and highly scalable.

• In comparison with a single decision tree, Random Forest's output is more difficult to interpret.
• There are no formal p-values for feature significance estimation.
• Performs worse than linear methods in the case of sparse data: text inputs, bag of words, etc.
• Unlike linear regression, Random Forest is unable to extrapolate. But, this can be also regarded as an advantage because outliers do not cause extreme values in Random Forests.
• Prone to overfitting in some problems, especially, when dealing with noisy data.
• In the case of categorical variables with varying level numbers, random forests favor variables with a greater number of levels. The tree will fit more towards a feature with many levels because this gains greater accuracy.
• If a dataset contains groups of correlated features with similar importance for predicted classes, then the preference will be given to smaller groups.
• The resulting model is large. $\large O(NK)$ memory is required to store the model, where $\large K$ is the number of trees.

### 10. Useful resources

• Chapter 15 of the book вЂњElements of Statistical LearningвЂќ by Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie.
• Aleksander Dyakonov's blog.
• More about practical applications of random forests and other algorithms can be found in the official documentation of scikit-learn.
• Machine learning course by Eugeniy Sokolov (GitHub materials). In this course you can also find additional practice exercises to deepen your knowledge.
• For more in-depth discussion of variance and decorrelation of random forests, see the original paper.

# Topic 5. Composition of algorithms, random forest

## Part 3. Feature importance

Often times, you would like to make out the exact reasons why the algorithm outputted a particular answer, if not to be able to understand it completely, then at least to find out which input variables contributed the most to the result. With Random Forest, you can obtain such information quite easily.

### Article outline

1. Essence of the method
2. Practical example

### 1. Essence of the method

From the picture below, it is intuitively clear that, in our credit scoring problem, Age is much more important than Income. This can be formally explained using the concept of information gain. In the case of many decision trees or a random forest, the closer the mean position of a feature over all the trees to the root, the more significant it is for a given classification or regression problem. Gains in the splitting criterion, such as the Gini impurity, obtained at each optimal split in every tree is a measure of importance that is directly associated with the splitting feature. The value of this score is distinct for each feature and accumulates over all the trees.

Let's go a little deeper into the details.

The average reduction in accuracy caused by a variable is determined during the calculation of the out-of-bag error. The greater the reduction in accuracy due to an exclusion or permutation of the variable, the higher its importance score. For this reason, variables with a greater average reduction in accuracy are generally more significant for classification.

The average reduction in the Gini impurity вЂ“ or MSE for regression вЂ“ represents the contribution of each variable to the homogeneity of nodes and leaves in the resulting Random Forest model. Every time a selected variable is used for splitting, the Gini impurity of the child nodes is calculated and compared with that of the original node.

The Gini impurity is a score of homogenity with the range from $0$ (homogeneous) to $1$ (heterogeneous). The changes in the value of the splitting criterion are accumulated for each variable and normalized at the end of the calculation. A higher reduction in the Gini impurity signals that splitting results by this variable results in nodes with higher purity.

Now, let's represent this method in analytic form:

$$\large VI^{T} = \frac{\sum_{i \in \mathfrak{B}^T}I \Big(y_i=\hat{y}i^{T}\Big)}{\Big |\mathfrak{B}^T\Big |} - \frac{\sum{i \in \mathfrak{B}^T}I \Big(y_i=\hat{y}_{i,\pi_j}^{T}\Big)}{\Big |\mathfrak{B}^T\Big |}$$

where

• $\large \hat{y}_i^{(T)} = f^{T}(x_i)$ is a class prediction before the feature permutation or exclusion;
• $\large \hat{y}{i,\pi_j}^{(T)} = f^{T}(x{i,\pi_j})$ is a class prediction after the feature permutation or exclusion;
• and $\large x_{i,\pi_j} = (x_{i,1}, \dots , x_{i,j-1}, \quad x_{\pi_j(i),j}, \quad x_{i,j+1}, \dots , x_{i,p})$.

Note that $\large VI^{T}(x_j) = 0$ if $\large x_j$ is not in the tree $\large T$.

Now, we can give the feature importance calculation for ensembles

• not normalized:

$$\large VI(x_j) = \frac{\sum_{T=1}^{N}VI^{T}(x_j)}{N}$$

• normalized by the standard deviation of the differences:

$$\large z_j = \frac{VI(x_j)}{\frac{\hat{\sigma}}{\sqrt{N}}}$$

### 2. Practical example

Let's consider the results of a survey given to visitors of hostels listed on Booking.com and TripAdvisor.com. Our features here are the average ratings for different categories include service quality, room condition, value for money, etc. Our target variable is the hostel's overall rating on the website.

from __future__ import division, print_function

# Disable warnings in Anaconda
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns

# Set the font for titles in Russian
from matplotlib import rc
font = {'family': 'Verdana',
'weight': 'normal'}
rc('font', **font)

import pandas as pd
import numpy as np
from sklearn.ensemble.forest import RandomForestRegressor
hostel_data = pd.read_csv("../../data/hostel_factors.csv")
features = {"f1":u"Staff",
"f2":u"Hostel booking",
"f3":u"Check-in and check-out",
"f4":u"Room condition",
"f5":u"Shared kitchen condition",
"f6":u"Shared space condition",
"f7":u"Extra services",
"f8":u"General conditions & conveniences",
"f9":u"Value for money",
"f10":u"Co-opting Customer Competence"}

forest = RandomForestRegressor(n_estimators=1000, max_features=10,
random_state=0)

forest.fit(hostel_data.drop(['hostel', 'rating'], axis=1),
hostel_data['rating'])
importances = forest.feature_importances_

indices = np.argsort(importances)[::-1]
# Plot the feature importancies of the forest
num_to_plot = 10
feature_indices = [ind+1 for ind in indices[:num_to_plot]]

# Print the feature ranking
print("Feature ranking:")

for f in range(num_to_plot):
print("%d. %s %f " % (f + 1,
features["f"+str(feature_indices[f])],
importances[indices[f]]))
plt.figure(figsize=(15,5))
plt.title(u"Feature Importance")
bars = plt.bar(range(num_to_plot),
importances[indices[:num_to_plot]],
color=([str(i/float(num_to_plot+1))
for i in range(num_to_plot)]),
align="center")
ticks = plt.xticks(range(num_to_plot),
feature_indices)
plt.xlim([-1, num_to_plot])
plt.legend(bars, [u''.join(features["f"+str(i)])
for i in feature_indices]);
Feature ranking:
1. Staff 0.182757
2. Value for money 0.148373
3. Shared space condition 0.128296
4. Extra services 0.116604
5. Co-opting Customer Competence 0.106668
6. General conditions & conveniences 0.088589
7. Shared kitchen condition 0.074273
8. Check-in and check-out 0.061521
9. Hostel booking 0.053615
10. Room condition 0.039305 The picture above shows that, more often than not, customers pay great attention to the staff and the price-quality ratio. This couple of factors affects the resulting overall rating the most. But, the difference between these two features and the others, less significant features is not very large. We can therefore conclude that the exclusion of any of these features would lead to a reduction in model accuracy. Based on our analysis, we could recommend hostel owners to focus primarily on staff training and price-to-quality ratio.