@veekaybee
Created February 2, 2023 02:49

Isolation forests versus decision trees

Isolation forest paper

[Three screenshots from the paper, taken 2023-02-01]

  • Anomalous points should be isolated after fewer splits, so they end up closer to the root of the tree

Isolation Forest is similar in principle to Random Forest: both are ensembles of decision trees. Isolation Forest, however, identifies anomalies or outliers directly rather than profiling normal data points. It isolates observations by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature. How anomalous a point looks depends on how many such splits it takes to separate it from the rest of the data.

Anomaly detection:

Anomaly detection is a common data science problem where the goal is to identify odd or suspicious observations, events, or items in our data. These may point to issues in the data collection process (broken sensors, typos in collected forms, etc.) or to unexpected events such as security breaches and server failures.

The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.
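The partitioning described above can be sketched in plain Python. This is a minimal illustration, not the paper's or scikit-learn's implementation: the names `path_length` and `mean_path_length`, the toy data, and the `max_depth` cap are all assumptions made for the example, and each "tree" here uses the full data set rather than a random subsample.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed to isolate `point` within `data`.

    `data` is a list of equal-length tuples that includes `point`.
    """
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Only features that still vary in this partition can be split on.
    candidates = [
        f for f in range(len(point))
        if min(row[f] for row in data) < max(row[f] for row in data)
    ]
    if not candidates:
        return depth
    feature = rng.choice(candidates)
    values = [row[feature] for row in data]
    split = rng.uniform(min(values), max(values))
    # Follow only the side of the split that contains the point.
    if point[feature] < split:
        side = [row for row in data if row[feature] < split]
    else:
        side = [row for row in data if row[feature] >= split]
    return path_length(point, side, rng, depth + 1, max_depth)

def mean_path_length(point, data, n_trees=200, seed=0):
    """Repeat the random partitioning many times and average the depth."""
    rng = random.Random(seed)
    total = sum(path_length(point, data, rng) for _ in range(n_trees))
    return total / n_trees

# Toy data (hypothetical): a tight cluster plus one far-away point.
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(50)]
outlier = (10.0, 10.0)
inlier = cluster[0]
data = cluster + [outlier]

print("outlier:", mean_path_length(outlier, data))
print("inlier: ", mean_path_length(inlier, data))
```

The far-away point should need far fewer splits on average than a point inside the cluster, which is the property the rest of the method relies on.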

This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.
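To turn an average path length into a score, the Isolation Forest paper normalizes it by the expected path length of an unsuccessful binary search tree lookup over n points. The sketch below follows that formula as I understand it from the paper; the function names `c` and `anomaly_score` are chosen here for illustration and are not a library API.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); values near 1 suggest an anomaly."""
    return 2.0 ** (-mean_path / c(n))

n = 256
print(anomaly_score(2.0, n))    # short average path, score well above 0.5
print(anomaly_score(c(n), n))   # average-length path, score of 0.5
print(anomaly_score(20.0, n))   # long average path, score well below 0.5
```

Under this normalization, a point whose average path length equals c(n) scores 0.5, shorter paths push the score toward 1, and longer paths push it toward 0.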

Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, those samples are highly likely to be anomalies.
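In practice this is usually done through scikit-learn's `IsolationForest`. The following is a small usage sketch on made-up data (a Gaussian cluster plus one distant point, both assumptions for the example); it requires `numpy` and `scikit-learn` to be installed.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: 200 "normal" points and one obvious outlier.
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X = np.vstack([X_normal, [[8.0, 8.0]]])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)

scores = forest.score_samples(X)  # lower means more abnormal
labels = forest.predict(X)        # -1 for outliers, 1 for inliers

print("most abnormal index:", scores.argmin())
print("label of the distant point:", labels[-1])
```

Note that `score_samples` follows scikit-learn's sign convention (lower is more abnormal), which is the opposite of the paper's score where values near 1 indicate anomalies.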
