3. Regression with Decision Tree and kNN

Decision Tree Classification vs Regression

Definitions

Decision trees used in data mining are of two main types[1]:

  • A Decision Tree Classifier predicts discrete class labels associated with the data.
  • A Decision Tree Regressor predicts real/continuous outcome values (e.g. the price of a house, or a patient's length of stay in a hospital).

Decision Tree Regression

The regression problem addresses predicting a real value from a data point with many attributes. The Information Gain metric[2], which measures purity over discrete label distributions, is therefore not well suited to a continuous range of outcome values. Alternatively, one of the best metrics for the regression setting is Standard Deviation Reduction (SDR)[3]. Intuitively, Information Gain favors the splitting attribute that yields the lowest-entropy outcome distribution; Standard Deviation Reduction likewise favors the splitting attribute that yields the lowest outcome variance, which has the same effect of decreasing the uncertainty in the outcome prediction.

Factors

For each node and each candidate splitting attribute, these factors are calculated for the outcome:

  • Average outcome at the node:

    $$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

  • Outcome standard deviation at the node:

    $$S(Y) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

  • Coefficient of Variation (CV), used to decide when to stop branching (e.g. typically stop when CV < 10%):

    $$CV = \frac{S(Y)}{\bar{y}} \times 100\%$$

  • Standard deviation for the outcome given a candidate splitting attribute X, weighting each subset by the fraction P(c) of data points with attribute value c:

    $$S(Y, X) = \sum_{c \in \mathrm{values}(X)} P(c)\, S(Y \mid X = c)$$

  • Finally, the Standard Deviation Reduction (SDR) metric votes for the best attribute to split (a Python sketch of these computations follows this list):

    $$SDR(Y, X) = S(Y) - S(Y, X)$$
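
To make these factors concrete, here is a minimal Python sketch; the function names `std`, `cv`, and `sdr` and the toy data are illustrative, not from the references:

```python
import numpy as np

def std(y):
    """Population standard deviation S(Y) of the outcomes at a node."""
    return np.sqrt(np.mean((y - np.mean(y)) ** 2))

def cv(y):
    """Coefficient of Variation in percent, used as a stopping criterion."""
    return std(y) / np.mean(y) * 100

def sdr(x, y):
    """Standard Deviation Reduction when splitting outcome y on attribute x.

    S(Y, X) weights the standard deviation of each subset by the
    fraction P(c) of data points carrying attribute value c.
    """
    s_yx = sum((np.sum(x == c) / len(y)) * std(y[x == c]) for c in np.unique(x))
    return std(y) - s_yx

# Toy data in the spirit of the hours-played example in [3].
outlook = np.array(["sunny", "sunny", "overcast", "rainy", "rainy"])
hours = np.array([26.0, 30.0, 46.0, 45.0, 52.0])
print(sdr(outlook, hours))  # a larger SDR indicates a more useful split
```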

Procedures

For each node (split), these steps are executed:

  • Step 1: Calculate the SDR for each candidate splitting attribute of the current node.
  • Step 2: Choose the attribute with the highest SDR.
  • Step 3: For each value of the chosen attribute, calculate the average outcome, the outcome standard deviation, and the CV.
  • Step 4: For each value of the chosen attribute, decide whether to stop branching based on the CV or the number of remaining data points. If not stopping, branch the tree on that attribute value; if stopping, output the calculated average outcome as the prediction of that leaf node. A minimal sketch of the whole procedure is given after this list.
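
Below is a minimal recursive sketch of this procedure, assuming categorical attributes stored column-wise as a dict of numpy arrays and reusing the illustrative `cv` and `sdr` helpers sketched above; the 10% CV threshold and the minimum-sample stop are assumed defaults, not values fixed by the references:

```python
def build_tree(X, y, cv_threshold=10.0, min_samples=3):
    """Grow a regression tree with the SDR criterion (Steps 1-4 above).

    X: dict mapping attribute name -> numpy array of categorical values
    y: numpy array of continuous outcomes (assumed to have positive mean,
       so that the CV stopping rule is well defined)
    Returns a nested dict for internal nodes, or the mean outcome for leaves.
    """
    # Step 4 stopping rules: node homogeneous enough, too small, or no attributes left.
    if not X or len(y) < min_samples or cv(y) < cv_threshold:
        return float(np.mean(y))

    # Steps 1-2: pick the candidate attribute with the highest SDR.
    best = max(X, key=lambda name: sdr(X[name], y))

    # Steps 3-4: branch on each value of the chosen attribute.
    rest = {name: col for name, col in X.items() if name != best}
    node = {"attribute": best, "branches": {}}
    for c in np.unique(X[best]):
        mask = X[best] == c
        subset = {name: col[mask] for name, col in rest.items()}
        node["branches"][c] = build_tree(subset, y[mask], cv_threshold, min_samples)
    return node
```

Each leaf stores the average outcome of its subset, so predicting a new point is just a walk down the nested dict until a number is reached.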

For a worked example of predicting hours played based on the weather with a Decision Tree, please visit [3].

kNN Regression

In the classification problem, kNN picks from the top k candidates the majority class (with uniform weights) or the class with the largest accumulated weight (with distance weights). To apply kNN to regression, a simple modification is to output the distance-weighted average of the k most similar training target values:

$$\hat{y}(x^*) = \frac{\sum_{i=1}^{k} w_i\, y_i}{\sum_{i=1}^{k} w_i}, \qquad w_i = \frac{1}{d(x^*, x_i)}$$

where $\{x_i\}$ is the set of the $k$ data points most similar to the new sample $x^*$, $\{y_i\}$ are the corresponding target values, and $d(\cdot,\cdot)$ is the distance function.
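
A minimal sketch of this distance-weighted kNN regression, assuming Euclidean distance and inverse-distance weights; the epsilon guard and all names are illustrative:

```python
import numpy as np

def knn_regress(X_train, y_train, x_star, k=3, eps=1e-9):
    """Predict a continuous target as the distance-weighted average
    of the k nearest training targets."""
    # Euclidean distances from the query point to every training point.
    dists = np.linalg.norm(X_train - x_star, axis=1)
    # Indices of the k most similar (closest) training points.
    nearest = np.argsort(dists)[:k]
    # Inverse-distance weights; eps guards against division by zero.
    w = 1.0 / (dists[nearest] + eps)
    return np.sum(w * y_train[nearest]) / np.sum(w)

# Toy usage with a single feature: nearby targets dominate the prediction.
X_train = np.array([[1.0], [2.0], [3.0], [6.0], [7.0]])
y_train = np.array([1.2, 1.9, 3.1, 6.2, 6.8])
print(knn_regress(X_train, y_train, np.array([2.5])))
```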

References

[1] Wikipedia contributors. (2020, May 18). Decision tree learning. In Wikipedia, The Free Encyclopedia. Retrieved 11:02, May 26, 2020, from https://en.wikipedia.org/w/index.php?title=Decision_tree_learning&oldid=957324010

[2] Prof. Dr. Steffen Staab. (2020, May 25). Machine Learning - 5 Decision Trees. In Machine Learning course, IPVS, University of Stuttgart.

[3] Saed Sayad. Decision Tree - Regression. Retrieved 12:01, May 26, 2020, from https://saedsayad.com/decision_tree_reg.htm
