@dyerrington
Created May 12, 2017 19:10
Re: Why are decision trees prone to overfitting? I’ll do my best --

Overfitting in decision trees

Overfitting is one way of describing a model that no longer generalizes well.

Overfitting happens when a learning process over-optimizes training-set error at the cost of test error. It’s possible for training and test sets to perform equally well in cross-validation, but that can simply be the result of the two sets being very similar in their characteristics, which may not be a huge problem. Decision trees can learn a training set to a degree of granularity that makes them easily overfit: allowing a decision tree to split down to that granular level is the behavior that makes this model prone to learning every point extremely well, to the point of perfect classification, ie: overfitting.
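As a quick illustration, here is a minimal sketch of that behavior. The data comes from `make_classification` purely as a placeholder; an unconstrained sklearn tree will typically reach perfect training accuracy while doing noticeably worse on held-out data:

```python
# Minimal sketch (synthetic placeholder data): an unconstrained tree
# memorizes the training set but scores lower on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0)  # no depth / leaf constraints
tree.fit(X_train, y_train)

print("train accuracy:", tree.score(X_train, y_train))  # typically 1.0
print("test accuracy: ", tree.score(X_test, y_test))    # noticeably lower
```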

I recommend the following steps to avoid overfitting:

  • Use a test set that is not exactly like the training set, or at least different enough that differences in error rates are easy to see.

What is different enough? Enough to generalize what is being predicted. The problem should dictate this somewhat: if you are predicting on a baseline that is anomalous, your validation set needs to approximate it closely enough. If the problem is more general, such as spam classification, having enough data will usually provide enough entropy to account for the variance, so any disparity should show up in cross-validation. Additionally, you can use a hold-out dataset for extra assurance. You can be more scientific and use something like the Minimum Description Length principle, which relates the size of the error to the size of the tree, but that’s getting a bit into the weeds.
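As a rough sketch of that workflow (continuing with the X and y from the sketch above; the split proportions are arbitrary), carve off a hold-out set first, do model selection with cross-validation on the rest, and only touch the hold-out once at the end:

```python
# Minimal sketch: keep a final hold-out set outside the train/validation
# cycle. X, y are the arrays generated in the earlier sketch.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=5, random_state=0)
print(cross_val_score(model, X_train, y_train, cv=5).mean())  # model selection
model.fit(X_train, y_train)
print(model.score(X_holdout, y_holdout))  # touched once, at the very end
```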

  • Ensure you have enough data.

It’s possible that you don’t have enough representative data (more to my first point in this recommendation): there may not be enough examples to describe a specific case. Suppose you’re classifying houses vs. apartments and you don’t have enough data on apartments within a given range of values, such as square feet and bedrooms. The model may not learn that a home above 2 bedrooms and 2,000 square feet in the city can be either a house or an apartment, and if houses are over-represented in the dataset relative to real life, the tree will conclude it is less likely to be an apartment; the decision tree can only evaluate the information gain of a split in that range of variables from the data it actually observes. I can’t think of a better example…

CHECK FOR CLASS IMBALANCE!
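A quick way to check, sketched with numpy on the y_train from the splits above (class_weight is just one sklearn-level mitigation, not a full fix):

```python
# Minimal sketch: inspect the class distribution before trusting any tree.
import numpy as np

classes, counts = np.unique(y_train, return_counts=True)
print(dict(zip(classes, counts)))   # raw counts per class
print(counts / counts.sum())        # class proportions

# If the classes are heavily skewed, one model-level mitigation is
# DecisionTreeClassifier(class_weight="balanced"); resampling is another.
```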

  • Reduce the complexity of the decision tree model

There’s a great quote for this: “Everything should be made as simple as possible, but no simpler.”

Pruning a tree can be done before or after it is fully grown, but post-pruning isn’t possible in sklearn. You can, however, set the minimum amount of data required to create a leaf node, which further qualifies the conditions under which a split can occur. Additionally, you can set the max depth to control complexity, which limits the number of nodes quite a bit. You could adjust this parameter and check cross-validation each time, after you’ve grid-searched a range that tends to perform well on the classification metric of your choice.
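A minimal sketch of that search (the parameter ranges and the f1 scoring choice are arbitrary placeholders; X_train and y_train come from the earlier splits):

```python
# Minimal sketch: grid-search the complexity parameters discussed above.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": [3, 5, 10, None],      # None = grow until leaves are pure
    "min_samples_leaf": [1, 5, 20, 50],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)
```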

  • Use decision trees in an ensemble

Since decision trees are greedy, they are prone to finding locally optimal solutions without considering broader properties of the data. This is one reason (in my opinion) that makes them ideal for ensembles. Ensemble methods (bagging / random forests, boosting, etc.) weight different aspects of decision trees: bootstrapping, random variable selection, weighting of weak learners, and sample-weight selection are the main aspects that different ensembles mix and match to generalize better.
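A minimal sketch comparing a single tree against two of those ensembles (estimator counts are arbitrary; data as in the earlier sketches):

```python
# Minimal sketch: the same kind of base learner inside bagging- and
# boosting-style ensembles usually generalizes better than one deep tree.
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "adaboost (stumps)": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=200, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X_train, y_train, cv=5).mean())
```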

  • Reduce the dimensionality of your data

Additionally, choose better features. Reducing the complexity of your data also reduces the complexity of the model: the potential splits (ie: the complexity) can be reduced before you even feed your data to the model.
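For instance, a minimal sketch putting univariate feature selection in front of the tree (SelectKBest with k=10 is an arbitrary illustration; any dimensionality-reduction or feature-selection step plays the same role):

```python
# Minimal sketch: drop weakly informative features before fitting the tree,
# which removes candidate splits and therefore model complexity.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

pipeline = make_pipeline(
    SelectKBest(score_func=f_classif, k=10),
    DecisionTreeClassifier(max_depth=5, random_state=0),
)
print(cross_val_score(pipeline, X_train, y_train, cv=5).mean())
```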
