@tspspi
Created August 9, 2022 19:36
ID3 algorithm demonstration
{
"cells": [
{
"cell_type": "markdown",
"id": "1d58a0a6",
"metadata": {},
"source": [
"# Introduction\n",
"\n",
"This notebook contains a fairly simple implementation of the ID3 algorithm for discrete data points. The algorithm is described in the [accompanying blog post](https://www.tspi.at/2022/08/09/id3algorithm.html); the comments in this notebook only cover implementation details. The implementation has been extracted from a larger analysis application, which might explain some design decisions that do not make sense for a short demonstration. Note that a slow Python implementation is usually not usable for larger datasets. ID3 requires many probability calculations, so one has to take care when running it on even medium amounts of data, independent of the programming language. One might have to apply heuristics and optimizations that reduce the number of counting operations."
]
},
{
"cell_type": "markdown",
"id": "db9fa1a7",
"metadata": {},
"source": [
"# Attribute handling"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a5bdf115",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import csv\n",
"import math"
]
},
{
"cell_type": "markdown",
"id": "e023517c",
"metadata": {},
"source": [
"First we define some classes that make attribute handling easier. The base class is ```ID3Attribute```. This is just a wrapper that might be used later for automatic class determination for continuous variables. ```ID3Attribute_Discrete``` is the abstract base class and ```ID3Attribute_DiscreteLabel``` the implementation of a wrapper that keeps track of a simple named set of values. The data sources later register all values they encounter using the ```registerValue``` method, so the attribute classes build up a collection of all possible values for the given attribute."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "1abb86cf",
"metadata": {},
"outputs": [],
"source": [
"class ID3Attribute:\n",
"    def __init__(self, name, typestring, isTarget = False):\n",
"        self.name = name\n",
"        self.typestring = typestring\n",
"        self.is_target = isTarget\n",
"\n",
"class ID3Attribute_Discrete(ID3Attribute):\n",
"    def __init__(self, name, typestring, isTarget = False):\n",
"        super().__init__(name, typestring, isTarget = isTarget)\n",
"        self.distinctValueCount = 0\n",
"\n",
"    def getValueByIndex(self, idx):\n",
"        raise NotImplementedError()\n",
"    def getIndexByValue(self, val):\n",
"        raise NotImplementedError()\n",
"\n",
"    def registerValue(self, value):\n",
"        raise NotImplementedError()\n",
"\n",
"class ID3Attribute_DiscreteLabel(ID3Attribute_Discrete):\n",
"    def __init__(self, name, isTarget = False):\n",
"        super().__init__(name, \"Discrete label\", isTarget = isTarget)\n",
"        self.labelmap = []\n",
"\n",
"    def getValueByIndex(self, idx):\n",
"        if (idx < 0) or (idx >= len(self.labelmap)):\n",
"            raise ValueError(\"Index is out of range of known values\")\n",
"        return self.labelmap[idx]\n",
"\n",
"    def getIndexByValue(self, value):\n",
"        if value in self.labelmap:\n",
"            return self.labelmap.index(value)\n",
"        else:\n",
"            self.labelmap.append(value)\n",
"            return len(self.labelmap)-1\n",
"\n",
"    def registerValue(self, value):\n",
"        if value not in self.labelmap:\n",
"            self.labelmap.append(value)\n",
"        self.distinctValueCount = len(self.labelmap)"
]
},
{
"cell_type": "markdown",
"id": "df21b880",
"metadata": {},
"source": [
"# The data source\n",
"\n",
"The data source wrapper allows one to access different types of data sources (databases, CSV files, etc.) in a modular fashion. This notebook currently only contains the CSV version. The base class provides the following methods:\n",
"\n",
"* ```addAttributeByName``` adds an attribute (described by one of the classes above) to the list of known attributes. This is usually used when fields are not autodetected later on. For the CSV data source this can only be used when a header line is present.\n",
"* ```scanAttributes``` has to be implemented by the data source. It fills all registered attributes with scanned data such as the unique available values for discrete ones, the range for continuous binning ones, etc.\n",
"* ```getTargetCounts``` iterates over all possible target states (determined by attribute registration) and returns a list with the number of elements in the given subset that are assigned to each target state.\n",
"* ```_postScanAttributes``` is an internal method that the subclass has to call after its ```scanAttributes``` has registered all value sets. It creates all possible target states and filters for later processing.\n",
"* ```getElementCount``` is the workhorse of the whole algorithm. It counts how many elements in the whole dataset match the given filter. Many data sources will just iterate over all of the data, which is of course pretty inefficient. On the other hand the algorithm might require counts for all possible combinations of filters, which is one of the reasons it requires so much processing power even for moderate datasets."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "87da8fdf",
"metadata": {},
"outputs": [],
"source": [
"class ID3DataSource:\n",
"    def __init__(self):\n",
"        self.attributes = []\n",
"        self.scanned = False\n",
"\n",
"    def addAttributeByName(self, attDescription):\n",
"        self.attributes.append(attDescription)\n",
"\n",
"    def scanAttributes(self):\n",
"        raise NotImplementedError()\n",
"\n",
"    def getElementCount(self, attFilter = None):\n",
"        raise NotImplementedError()\n",
"\n",
"    def getTargetCounts(self, attFilter = None):\n",
"        if not self.scanned:\n",
"            raise ValueError(\"Attributes have not been scanned up until now\")\n",
"\n",
"        # Avoid a shared mutable default argument\n",
"        if attFilter is None:\n",
"            attFilter = []\n",
"\n",
"        # We count the target class elements ...\n",
"        res = []\n",
"\n",
"        for targetSelection in self.targetFilters:\n",
"            tempAttFilter = attFilter + targetSelection\n",
"            cnt = self.getElementCount(tempAttFilter)\n",
"            res.append(cnt)\n",
"\n",
"        return res\n",
"\n",
"    def _postScanAttributes(self):\n",
"        # Count the number of target classes\n",
"        self.targetClasses = 0\n",
"        self.targetAttributes = []\n",
"        self.targetFilters = []\n",
"        for idx, att in enumerate(self.attributes):\n",
"            if att.is_target:\n",
"                if self.targetClasses == 0:\n",
"                    self.targetClasses = att.distinctValueCount\n",
"                    for i in range(att.distinctValueCount):\n",
"                        self.targetFilters.append([(idx, i)])\n",
"                else:\n",
"                    self.targetClasses = self.targetClasses * att.distinctValueCount\n",
"\n",
"                    # We have to build the cross products ...\n",
"                    newTargetFilters = []\n",
"                    for i in range(att.distinctValueCount):\n",
"                        for otherfilter in self.targetFilters:\n",
"                            newTargetFilters.append([ (idx, i) ] + otherfilter)\n",
"\n",
"                    self.targetFilters = newTargetFilters\n",
"                self.targetAttributes.append(idx)\n",
"\n",
"        self.scanned = True\n"
]
},
{
"cell_type": "markdown",
"id": "a617bbda",
"metadata": {},
"source": [
"# The actual DataSource(s)\n",
"\n",
"## The CSV data source\n",
"\n",
"The CSV data source accesses a simple CSV file. If a header is present one can use the ```addAttributesAsLabelsByHeader``` method to auto-detect _all_ columns as possible candidates; all of them are then assumed to be discrete labels. The target column(s) can be specified by the ```targets``` list. In case this is not supplied it's assumed that the first column is the target.\n",
"\n",
"The CSV data source really iterates over all records of the file whenever counting or scanning is requested."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a68a62ca",
"metadata": {},
"outputs": [],
"source": [
"class ID3DataSource_CSV(ID3DataSource):\n",
"    def __init__(\n",
"        self,\n",
"        filename,\n",
"        hasHeader = False\n",
"    ):\n",
"        if not os.path.exists(filename):\n",
"            raise FileNotFoundError(filename)\n",
"        self._hasheader = hasHeader\n",
"        self._filename = filename\n",
"        self._columnNames = None\n",
"        super().__init__()\n",
"\n",
"    def addAttributesAsLabelsByHeader(self, targets = (0,)):\n",
"        with open(self._filename) as srcfile:\n",
"            rdr = csv.reader(srcfile)\n",
"            for row in rdr:\n",
"                # We only process first row and then break\n",
"                for idx, title in enumerate(row):\n",
"                    targ = idx in targets\n",
"                    self.addAttributeByName(ID3Attribute_DiscreteLabel(title, isTarget = targ))\n",
"                break\n",
"\n",
"    def addAttributeByColumnIndex(self, index, attDescription):\n",
"        attDescription.csvIndex = index\n",
"        self.attributes.append(attDescription)\n",
"\n",
"    def scanAttributes(self):\n",
"        scannedRecords = 0\n",
"\n",
"        with open(self._filename) as srcfile:\n",
"            rdr = csv.reader(srcfile)\n",
"            firstrow = True\n",
"            for row in rdr:\n",
"                if self._hasheader and firstrow:\n",
"                    firstrow = False\n",
"                    self._columnNames = row\n",
"                    # Resolve column indices for attributes that were added by name\n",
"                    for att in self.attributes:\n",
"                        if not hasattr(att, \"csvIndex\"):\n",
"                            if att.name not in self._columnNames:\n",
"                                raise ValueError(f\"Unknown column name {att.name} not found in CSV\")\n",
"                            att.csvIndex = self._columnNames.index(att.name)\n",
"                    continue\n",
"\n",
"                scannedRecords = scannedRecords + 1\n",
"\n",
"                # Register all values ...\n",
"                for att in self.attributes:\n",
"                    att.registerValue(row[att.csvIndex])\n",
"\n",
"        self._postScanAttributes()\n",
"\n",
"        return (scannedRecords, len(self.attributes))\n",
"\n",
"    def getElementCount(self, attFilter = None):\n",
"        # Always iterates again over the whole file ...\n",
"        cnt = 0\n",
"        with open(self._filename) as srcfile:\n",
"            rdr = csv.reader(srcfile)\n",
"            firstrow = True\n",
"            for row in rdr:\n",
"                if self._hasheader and firstrow:\n",
"                    firstrow = False\n",
"                    continue\n",
"\n",
"                # Check if the row matches every entry of the attribute filter.\n",
"                # Each filter entry is an (attribute index, value index) tuple\n",
"                if attFilter is not None:\n",
"                    noMatch = False\n",
"                    for afilter in attFilter:\n",
"                        csvIdx = self.attributes[afilter[0]].csvIndex\n",
"                        if afilter[1] != self.attributes[afilter[0]].getIndexByValue(row[csvIdx]):\n",
"                            noMatch = True\n",
"                            break\n",
"                    if noMatch:\n",
"                        continue\n",
"\n",
"                # Match ...\n",
"                cnt = cnt + 1\n",
"\n",
"        return cnt"
]
},
{
"cell_type": "markdown",
"id": "3a698723",
"metadata": {},
"source": [
"# The ID3 tree builder\n",
"\n",
"The tree builder drives the building of the tree. At each level it does the following:\n",
"\n",
"* It calculates the current Shannon entropy.\n",
"* It calculates the currently possible outcomes, their probabilities and their confidence intervals.\n",
"* It iterates over all remaining attributes (those that are neither already filtered nor target attributes) and calculates the information gain for each candidate.\n",
"* It selects the candidate with the highest information gain to recurse further.\n",
"* For each subtree it recursively does exactly the same after applying the filter for the given subset.\n",
"* When it reaches an empty subset or a state where no further improvement is possible, it stops, adds a terminal node and returns to the level above.\n",
"* In case a threshold has been set and the improvement is not good enough, the algorithm also stops and returns to the level above. This prevents it from recursing into unnecessarily deep levels with no gain."
]
},
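{
"cell_type": "markdown",
"id": "c4e1b7a2",
"metadata": {},
"source": [
"The two quantities computed at every level can be summarized as follows. The notation is introduced here only for illustration and does not appear in the code: $S$ is the current subset, $A$ a candidate attribute, $S_v$ the records in $S$ for which $A$ takes the value $v$, and $N_c$ the number of records in $S$ that belong to target class $c$.\n",
"\n",
"$$H(S) = -\\sum_{c} p_c \\log_2 p_c, \\qquad p_c = \\frac{N_c}{|S|}$$\n",
"\n",
"$$\\mathrm{Gain}(S, A) = H(S) - \\sum_{v} \\frac{|S_v|}{|S|} H(S_v)$$\n",
"\n",
"As a sanity check against the weather example further below (14 records, 9 of them `Yes` and 5 of them `No`):\n",
"\n",
"$$H(S) = -\\tfrac{9}{14}\\log_2\\tfrac{9}{14} - \\tfrac{5}{14}\\log_2\\tfrac{5}{14} \\approx 0.940$$\n",
"\n",
"Splitting on `Outlook` yields subsets of 5, 4 and 5 records with entropies of about 0.971, 0 and 0.971, so\n",
"\n",
"$$\\mathrm{Gain}(S, \\mathrm{Outlook}) \\approx 0.940 - \\tfrac{5}{14} \\cdot 0.971 - \\tfrac{4}{14} \\cdot 0 - \\tfrac{5}{14} \\cdot 0.971 \\approx 0.247$$\n",
"\n",
"which agrees with the values printed in the trace output of that example."
]
},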
{
"cell_type": "code",
"execution_count": 5,
"id": "87fea7c8",
"metadata": {},
"outputs": [],
"source": [
"class ID3TreeBuilder:\n",
"    def __init__(\n",
"        self,\n",
"        debugTrace = False,\n",
"        z = 2.58\n",
"    ):\n",
"        self._debugTrace = debugTrace\n",
"        self._confidenceIntervalStandardScore = z\n",
"\n",
"    def _get_branching_candidates(self, dataSource, currentFilter):\n",
"        if len(currentFilter) >= (len(dataSource.attributes) - len(dataSource.targetAttributes)):\n",
"            return None\n",
"\n",
"        # Build a list of all elements that are already filtered or are target attributes ...\n",
"        filteredElements = []\n",
"        for f in currentFilter:\n",
"            if f[0] not in filteredElements:\n",
"                filteredElements.append(f[0])\n",
"        for f in dataSource.targetAttributes:\n",
"            if f not in filteredElements:\n",
"                filteredElements.append(f)\n",
"\n",
"        # ... and now invert the list to include only the attributes that have not already been filtered or are targets\n",
"        candidates = []\n",
"        for i in range(len(dataSource.attributes)):\n",
"            if i not in filteredElements:\n",
"                candidates.append(i)\n",
"\n",
"        return candidates\n",
"\n",
"    def _get_shanon_entropy(self, dataSource, attfilter):\n",
"        # This method calculates the Shannon entropy of the specified subset after applying\n",
"        # the passed attribute filter\n",
"\n",
"        targetCounts = dataSource.getTargetCounts(attfilter)\n",
"        N = sum(targetCounts)\n",
"        shanonEntropy = 0\n",
"\n",
"        for cnt in targetCounts:\n",
"            # The \"if\" is required to prevent log2(0) calls\n",
"            if cnt > 0:\n",
"                p = cnt/N\n",
"                shanonEntropy = shanonEntropy - p * math.log2(p)\n",
"\n",
"        return shanonEntropy\n",
"\n",
"    def _get_target_probabilities(self, dataSource, attfilter):\n",
"        z = self._confidenceIntervalStandardScore\n",
"\n",
"        # This method calculates the probabilities of all possible target outcomes\n",
"        # after applying the specified attribute filter. The \"z\" value is the standard score\n",
"        # that describes the width of the confidence interval. The bounds computed below have\n",
"        # the form of the Wilson score interval, which respects the limits 0 and 1 and is\n",
"        # therefore asymmetric around p\n",
"        #\n",
"        # Typical values:\n",
"        #   * 1.96 for 95% confidence interval\n",
"        #   * 2.58 for 99% confidence interval\n",
"\n",
"        nOutcomes = dataSource.getElementCount(attfilter)\n",
"        if nOutcomes == 0:\n",
"            return None\n",
"\n",
"        targets = []\n",
"        for target in dataSource.targetFilters:\n",
"            nElements = dataSource.getElementCount(attfilter + target)\n",
"            p = nElements / nOutcomes\n",
"\n",
"            # Calculate the Wilson score interval (p1 is the center, p2 the half width)\n",
"            p1 = 1 / (1 + z*z / nOutcomes) * (p + z*z/(2 * nOutcomes))\n",
"            p2 = z / (1 + z*z / nOutcomes) * math.sqrt(p * (1 - p) / nOutcomes + z * z / (4 * nOutcomes * nOutcomes))\n",
"            targets.append({\n",
"                'p' : round(p, 7),\n",
"                'confidence' : [round(p1 - p2, 7), round(p1 + p2, 7)],\n",
"                'filter' : target\n",
"            })\n",
"        return targets\n",
"\n",
"    def _buildTreeRecursive(self, dataSource, currentFilter = None, parentShanonEntropy = None, parentCount = None, gainThreshold = 0.1):\n",
"        # The recursive tree building routine is the workhorse of the tree-builder. After each subdivision step\n",
"        # that applies a new filter onto the tree it:\n",
"        #\n",
"        # - Sets up a list of branching candidates. If none are encountered it stops\n",
"        # - Calculates its own Shannon entropy and element count (this can be passed by argument to speed up the method a little bit)\n",
"        # - Checks if we are already decided (Shannon entropy of 0) so we don't do anything any more\n",
"        # - Iterates over all branching candidates and calculates their Shannon entropy and information gain. This is the most time\n",
"        #   consuming step in the whole program\n",
"        # - Selects the maximum available gain and checks if it's above threshold to make a decision. If it's not above\n",
"        #   threshold the algorithm decides that no more useful distinctions are possible and finishes up the current\n",
"        #   branch else it recurses into the next level of the tree after recording some essential information\n",
"\n",
"        # Avoid a shared mutable default argument\n",
"        if currentFilter is None:\n",
"            currentFilter = []\n",
"\n",
"        if self._debugTrace:\n",
"            print(f\"Inside subtree {self._filterToString(dataSource, currentFilter)}\")\n",
"\n",
"        # First get a list of branching candidates ...\n",
"        candidates = self._get_branching_candidates(dataSource, currentFilter)\n",
"\n",
"        # The None check has to happen before the candidates are listed in the trace output\n",
"        if candidates is None:\n",
"            if self._debugTrace:\n",
"                print(\"No new candidates for branching, finishing up\")\n",
"\n",
"            # We just can count probabilities in this subtree ...\n",
"            return {\n",
"                'targets' : self._get_target_probabilities(dataSource, currentFilter)\n",
"            }\n",
"\n",
"        if self._debugTrace:\n",
"            strCand = \", \".join(dataSource.attributes[nxCand].name for nxCand in candidates)\n",
"            print(f\"Branching candidates {strCand}\")\n",
"\n",
"        if parentShanonEntropy is None:\n",
"            parentShanonEntropy = self._get_shanon_entropy(dataSource, currentFilter)\n",
"\n",
"        if parentShanonEntropy == 0:\n",
"            if self._debugTrace:\n",
"                print(\"Our entropy is already 0 - finishing up\")\n",
"\n",
"            # We just can count probabilities in this subtree ...\n",
"            return {\n",
"                'targets' : self._get_target_probabilities(dataSource, currentFilter)\n",
"            }\n",
"\n",
"        if parentCount is None:\n",
"            parentCount = dataSource.getElementCount(currentFilter)\n",
"\n",
"        if self._debugTrace:\n",
"            print(f\"Element count: {parentCount}, Shanon entropy: {parentShanonEntropy}\")\n",
"\n",
"        # Iterate over all candidates and calculate information gain ...\n",
"        gains = []\n",
"        for cand in candidates:\n",
"            gain = parentShanonEntropy\n",
"            for a in range(dataSource.attributes[cand].distinctValueCount):\n",
"                newFilter = currentFilter + [ (cand, a) ]\n",
"                if parentCount > 0:\n",
"                    gain = gain - dataSource.getElementCount(newFilter) / parentCount * self._get_shanon_entropy(dataSource, newFilter)\n",
"            gains.append(gain)\n",
"            if self._debugTrace:\n",
"                print(f\"Possible gain for {dataSource.attributes[cand].name}: {gain}\")\n",
"\n",
"        maxGainIndex = gains.index(max(gains))\n",
"        maxGain = gains[maxGainIndex]\n",
"\n",
"        if self._debugTrace:\n",
"            print(f\"Selected maximum gain {maxGain} for candidate {dataSource.attributes[candidates[maxGainIndex]].name}\")\n",
"\n",
"        if (gainThreshold is not None) and (abs(maxGain) < gainThreshold):\n",
"            if self._debugTrace:\n",
"                print(\"New gain below threshold, finalizing branch\")\n",
"            return {\n",
"                'targets' : self._get_target_probabilities(dataSource, currentFilter)\n",
"            }\n",
"\n",
"        # We select this element to branch ...\n",
"        res = {\n",
"            'branch_att' : candidates[maxGainIndex],\n",
"            'shanonentropy' : parentShanonEntropy,\n",
"            'filter' : currentFilter,\n",
"            'elementcount' : parentCount,\n",
"            'gain' : maxGain,\n",
"            'branch_name' : dataSource.attributes[candidates[maxGainIndex]].name,\n",
"            'targets' : self._get_target_probabilities(dataSource, currentFilter),\n",
"            'children' : []\n",
"        }\n",
"        for iChild in range(dataSource.attributes[candidates[maxGainIndex]].distinctValueCount):\n",
"            # Hand the threshold down so that all levels use the same value\n",
"            subtree = self._buildTreeRecursive(dataSource, currentFilter + [ (candidates[maxGainIndex], iChild )], gainThreshold = gainThreshold)\n",
"            res['children'].append(\n",
"                {\n",
"                    'value_idx' : iChild,\n",
"                    'value' : dataSource.attributes[candidates[maxGainIndex]].getValueByIndex(iChild),\n",
"                    'filter' : (candidates[maxGainIndex], iChild),\n",
"                    'tree' : subtree\n",
"                }\n",
"            )\n",
"        return res\n",
"\n",
"    def _filterToString(self, dataSource, attfilter):\n",
"        # A simple helper routine to display a list of filter tuples in string\n",
"        # form. This is used for debugging and pretty printing\n",
"\n",
"        res = None\n",
"        if not isinstance(attfilter, list):\n",
"            attfilter = [ attfilter ]\n",
"\n",
"        for f in attfilter:\n",
"            attname = dataSource.attributes[f[0]].name\n",
"            attvalue = dataSource.attributes[f[0]].getValueByIndex(f[1])\n",
"            if res is None:\n",
"                res = f\"{attname} = {attvalue}\"\n",
"            else:\n",
"                res = res + f\", {attname} = {attvalue}\"\n",
"        return res\n",
"\n",
"    def prettyPrintTree(self, dataSource, tree, level = 1, onlyTerminalProbabilities = False):\n",
"        # A simple pretty(er) print of the tree. Not optimal but should still be readable\n",
"\n",
"        if level > 0:\n",
"            indent = \"| \" * level\n",
"        else:\n",
"            indent = \"+\"\n",
"\n",
"        if \"children\" in tree:\n",
"            targets = self._get_target_probabilities(dataSource, tree['filter'])\n",
"            print(f\"{indent}Entropy: {tree['shanonentropy']}, gain: {tree['gain']}\")\n",
"            if targets is not None:\n",
"                if not onlyTerminalProbabilities:\n",
"                    for targ in targets:\n",
"                        if targ['p'] > 1e-4:\n",
"                            print(f\"{indent}{self._filterToString(dataSource, targ['filter'])}: {targ['p']*100.0}% [{targ['confidence'][0]*100}%, {targ['confidence'][1]*100}%]\")\n",
"            else:\n",
"                print(f\"{indent}Not possible according to known data\")\n",
"            if not onlyTerminalProbabilities:\n",
"                print(f\"{indent}Branching on {tree['branch_name']}\")\n",
"            for childdecissions in tree['children']:\n",
"                print(f\"{indent}{self._filterToString(dataSource, childdecissions['filter'])}:\")\n",
"                self.prettyPrintTree(dataSource, childdecissions['tree'], level = level + 2, onlyTerminalProbabilities = onlyTerminalProbabilities)\n",
"        else:\n",
"            if not onlyTerminalProbabilities:\n",
"                print(f\"{indent}Terminal: \")\n",
"            if tree['targets'] is not None:\n",
"                for targ in tree['targets']:\n",
"                    if targ['p'] > 1e-4:\n",
"                        print(f\"{indent}{self._filterToString(dataSource, targ['filter'])}: {targ['p']*100.0}% [{targ['confidence'][0]*100}%, {targ['confidence'][1]*100}%]\")\n",
"            else:\n",
"                print(f\"{indent}Not possible according to known data\")\n",
"\n",
"    def buildTree(self,dataSource):\n",
"        # Basically we do calculate the information gain for each and every new branching candidate at every level\n",
"        # of the Tree in a recursive fashion, select the next candidate and repeat. When there are no more candidates\n",
"        # or when the gain is below a threshold we are done on this descent and jump up one step again\n",
"        return self._buildTreeRecursive(dataSource)"
]
},
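{
"cell_type": "markdown",
"id": "9d2f6e30",
"metadata": {},
"source": [
"The confidence bounds computed in `_get_target_probabilities` can be written compactly. As far as the expression in the code goes, it has the form of what is usually called the Wilson score interval. Here $n$ stands for `nOutcomes`, and the symbols $p_\\mathrm{lo}$ and $p_\\mathrm{hi}$ are introduced only for illustration.\n",
"\n",
"$$p_{\\mathrm{lo},\\,\\mathrm{hi}} = \\frac{1}{1 + z^2/n}\\left(p + \\frac{z^2}{2n}\\right) \\mp \\frac{z}{1 + z^2/n}\\sqrt{\\frac{p(1-p)}{n} + \\frac{z^2}{4n^2}}$$\n",
"\n",
"For a pure subset ($p = 1$) this reduces to $p_\\mathrm{lo} = n / (n + z^2)$ and $p_\\mathrm{hi} = 1$. With the default $z = 2.58$ and the $n = 4$ records of `Outlook = Overcast` in the weather example below ($14 - 5 - 5$), the lower bound is $4 / (4 + 6.6564) \\approx 0.375$, which matches the printed `37.53613%`."
]
},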
{
"cell_type": "markdown",
"id": "f19fd9a9",
"metadata": {},
"source": [
"# Try on weather example"
]
},
{
"cell_type": "markdown",
"id": "3f311f19",
"metadata": {},
"source": [
"The first test of the implementation is done against a dataset that describes whether a person wants to go for a walk, go riding, etc. This dataset can be found on a myriad of webpages that discuss the ID3 algorithm, so using the same dataset is also a nice way to verify this implementation against their results.\n",
"\n",
"First the data source is created and configured to process a CSV header, so it could also load the attributes from the header when one wants to do that. In this case we're going to add them manually by name (i.e. the strings below have to match the column titles in the header line of the CSV)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "93c0e6e6",
"metadata": {},
"outputs": [],
"source": [
"tstSource = ID3DataSource_CSV(\"../../Data/Classification/test_weather.csv\", hasHeader = True)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "56f2eb42",
"metadata": {},
"outputs": [],
"source": [
"tstSource.addAttributeByName(ID3Attribute_DiscreteLabel(\"Outlook\"))\n",
"tstSource.addAttributeByName(ID3Attribute_DiscreteLabel(\"Temperature\"))\n",
"tstSource.addAttributeByName(ID3Attribute_DiscreteLabel(\"Humidity\"))\n",
"tstSource.addAttributeByName(ID3Attribute_DiscreteLabel(\"Wind\"))\n",
"tstSource.addAttributeByName(ID3Attribute_DiscreteLabel(\"Decision\", isTarget = True))"
]
},
{
"cell_type": "markdown",
"id": "318abe8b",
"metadata": {},
"source": [
"Next we make an inventory of all of the attributes by iterating over the whole dataset - this scanning records which values are possible for every attribute. The return value tells us how many records and how many different attributes have been scanned"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c67a5cfc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(14, 5)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tstSource.scanAttributes()\n"
]
},
{
"cell_type": "markdown",
"id": "48e20fd6",
"metadata": {},
"source": [
"Now we run the tree builder and pretty print the tree. Since the dataset is small, we enable trace output to show how the algorithm proceeds."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "c9460b4f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Inside subtree None\n",
"Branching candidates Outlook, Temperature, Humidity, Wind\n",
"Element count: 14, Shanon entropy: 0.9402859586706309\n",
"Possible gain for Outlook: 0.2467498197744391\n",
"Possible gain for Temperature: 0.029222565658954647\n",
"Possible gain for Humidity: 0.15183550136234142\n",
"Possible gain for Wind: 0.04812703040826932\n",
"Selected maximum gain 0.2467498197744391 for candidate Outlook\n",
"Inside subtree Outlook = Sunny\n",
"Branching candidates Temperature, Humidity, Wind\n",
"Element count: 5, Shanon entropy: 0.9709505944546686\n",
"Possible gain for Temperature: 0.5709505944546686\n",
"Possible gain for Humidity: 0.9709505944546686\n",
"Possible gain for Wind: 0.01997309402197489\n",
"Selected maximum gain 0.9709505944546686 for candidate Humidity\n",
"Inside subtree Outlook = Sunny, Humidity = High\n",
"Branching candidates Temperature, Wind\n",
"Our entropy is already 0 - finishing up\n",
"Inside subtree Outlook = Sunny, Humidity = Normal\n",
"Branching candidates Temperature, Wind\n",
"Our entropy is already 0 - finishing up\n",
"Inside subtree Outlook = Overcast\n",
"Branching candidates Temperature, Humidity, Wind\n",
"Our entropy is already 0 - finishing up\n",
"Inside subtree Outlook = Rain\n",
"Branching candidates Temperature, Humidity, Wind\n",
"Element count: 5, Shanon entropy: 0.9709505944546686\n",
"Possible gain for Temperature: 0.01997309402197489\n",
"Possible gain for Humidity: 0.01997309402197489\n",
"Possible gain for Wind: 0.9709505944546686\n",
"Selected maximum gain 0.9709505944546686 for candidate Wind\n",
"Inside subtree Outlook = Rain, Wind = Weak\n",
"Branching candidates Temperature, Humidity\n",
"Our entropy is already 0 - finishing up\n",
"Inside subtree Outlook = Rain, Wind = Strong\n",
"Branching candidates Temperature, Humidity\n",
"Our entropy is already 0 - finishing up\n",
"\n",
"| Entropy: 0.9402859586706309, gain: 0.2467498197744391\n",
"| Decision = No: 35.71429% [12.73086%, 67.90469%]\n",
"| Decision = Yes: 64.28571% [32.09531%, 87.26914000000001%]\n",
"| Branching on Outlook\n",
"| Outlook = Sunny:\n",
"| | | Entropy: 0.9709505944546686, gain: 0.9709505944546686\n",
"| | | Decision = No: 60.0% [16.83108%, 91.7479%]\n",
"| | | Decision = Yes: 40.0% [8.2521%, 83.16892%]\n",
"| | | Branching on Humidity\n",
"| | | Humidity = High:\n",
"| | | | | Terminal: \n",
"| | | | | Decision = No: 100.0% [31.067479999999996%, 100.0%]\n",
"| | | Humidity = Normal:\n",
"| | | | | Terminal: \n",
"| | | | | Decision = Yes: 100.0% [23.10429%, 100.0%]\n",
"| Outlook = Overcast:\n",
"| | | Terminal: \n",
"| | | Decision = Yes: 100.0% [37.53613%, 100.0%]\n",
"| Outlook = Rain:\n",
"| | | Entropy: 0.9709505944546686, gain: 0.9709505944546686\n",
"| | | Decision = No: 40.0% [8.2521%, 83.16892%]\n",
"| | | Decision = Yes: 60.0% [16.83108%, 91.7479%]\n",
"| | | Branching on Wind\n",
"| | | Wind = Weak:\n",
"| | | | | Terminal: \n",
"| | | | | Decision = Yes: 100.0% [31.067479999999996%, 100.0%]\n",
"| | | Wind = Strong:\n",
"| | | | | Terminal: \n",
"| | | | | Decision = No: 100.0% [23.10429%, 100.0%]\n"
]
}
],
"source": [
"builder = ID3TreeBuilder(debugTrace = True)\n",
"tree = builder.buildTree(tstSource)\n",
"print(\"\")\n",
"builder.prettyPrintTree(tstSource, tree)"
]
},
{
"cell_type": "markdown",
"id": "3b6596ba",
"metadata": {},
"source": [
"## Weather example without humidity"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "bd204f45",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"| Entropy: 0.9402859586706309, gain: 0.2467498197744391\n",
"| Decision = No: 35.71429% [12.73086%, 67.90469%]\n",
"| Decision = Yes: 64.28571% [32.09531%, 87.26914000000001%]\n",
"| Branching on Outlook\n",
"| Outlook = Sunny:\n",
"| | | Entropy: 0.9709505944546686, gain: 0.5709505944546686\n",
"| | | Decision = No: 60.0% [16.83108%, 91.7479%]\n",
"| | | Decision = Yes: 40.0% [8.2521%, 83.16892%]\n",
"| | | Branching on Temperature\n",
"| | | Temperature = Hot:\n",
"| | | | | Terminal: \n",
"| | | | | Decision = No: 100.0% [23.10429%, 100.0%]\n",
"| | | Temperature = Mild:\n",
"| | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | Decision = No: 50.0% [6.1549%, 93.8451%]\n",
"| | | | | Decision = Yes: 50.0% [6.1549%, 93.8451%]\n",
"| | | | | Branching on Wind\n",
"| | | | | Wind = Weak:\n",
"| | | | | | | Terminal: \n",
"| | | | | | | Decision = No: 100.0% [13.06097%, 100.0%]\n",
"| | | | | Wind = Strong:\n",
"| | | | | | | Terminal: \n",
"| | | | | | | Decision = Yes: 100.0% [13.06097%, 100.0%]\n",
"| | | Temperature = Cool:\n",
"| | | | | Terminal: \n",
"| | | | | Decision = Yes: 100.0% [13.06097%, 100.0%]\n",
"| Outlook = Overcast:\n",
"| | | Terminal: \n",
"| | | Decision = Yes: 100.0% [37.53613%, 100.0%]\n",
"| Outlook = Rain:\n",
"| | | Entropy: 0.9709505944546686, gain: 0.9709505944546686\n",
"| | | Decision = No: 40.0% [8.2521%, 83.16892%]\n",
"| | | Decision = Yes: 60.0% [16.83108%, 91.7479%]\n",
"| | | Branching on Wind\n",
"| | | Wind = Weak:\n",
"| | | | | Terminal: \n",
"| | | | | Decision = Yes: 100.0% [31.067479999999996%, 100.0%]\n",
"| | | Wind = Strong:\n",
"| | | | | Terminal: \n",
"| | | | | Decision = No: 100.0% [23.10429%, 100.0%]\n",
"CPU times: user 10.8 ms, sys: 0 ns, total: 10.8 ms\n",
"Wall time: 10.9 ms\n"
]
}
],
"source": [
"%%time\n",
"tstSource = ID3DataSource_CSV(\"../../Data/Classification/test_weather.csv\", hasHeader = True)\n",
"tstSource.addAttributeByName(ID3Attribute_DiscreteLabel(\"Outlook\"))\n",
"tstSource.addAttributeByName(ID3Attribute_DiscreteLabel(\"Temperature\"))\n",
"tstSource.addAttributeByName(ID3Attribute_DiscreteLabel(\"Wind\"))\n",
"tstSource.addAttributeByName(ID3Attribute_DiscreteLabel(\"Decision\", isTarget = True))\n",
"tstSource.scanAttributes()\n",
"\n",
"builder = ID3TreeBuilder(debugTrace = False)\n",
"tree = builder.buildTree(tstSource)\n",
"builder.prettyPrintTree(tstSource, tree)"
]
},
{
"cell_type": "markdown",
"id": "7b36bf05",
"metadata": {},
"source": [
"## Only removing outlook column"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "54976a49",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"| Entropy: 0.9402859586706309, gain: 0.15183550136234142\n",
"| Decision = No: 35.71429% [12.73086%, 67.90469%]\n",
"| Decision = Yes: 64.28571% [32.09531%, 87.26914000000001%]\n",
"| Branching on Humidity\n",
"| Humidity = High:\n",
"| | | Terminal: \n",
"| | | Decision = No: 57.14286% [18.93662%, 88.38595000000001%]\n",
"| | | Decision = Yes: 42.85714% [11.614049999999999%, 81.06338%]\n",
"| Humidity = Normal:\n",
"| | | Entropy: 0.5916727785823275, gain: 0.19811742113040343\n",
"| | | Decision = No: 14.285709999999998% [1.6956700000000002%, 61.69146%]\n",
"| | | Decision = Yes: 85.71429% [38.30854%, 98.30433%]\n",
"| | | Branching on Wind\n",
"| | | Wind = Weak:\n",
"| | | | | Terminal: \n",
"| | | | | Decision = Yes: 100.0% [37.53613%, 100.0%]\n",
"| | | Wind = Strong:\n",
"| | | | | Entropy: 0.9182958340544896, gain: 0.2516291673878229\n",
"| | | | | Decision = No: 33.33333% [4.03207%, 85.6121%]\n",
"| | | | | Decision = Yes: 66.66667% [14.3879%, 95.96793%]\n",
"| | | | | Branching on Temperature\n",
"| | | | | Temperature = Hot:\n",
"| | | | | | | Terminal: \n",
"| | | | | | | Not possible according to known data\n",
"| | | | | Temperature = Mild:\n",
"| | | | | | | Terminal: \n",
"| | | | | | | Decision = Yes: 100.0% [13.06097%, 100.0%]\n",
"| | | | | Temperature = Cool:\n",
"| | | | | | | Terminal: \n",
"| | | | | | | Decision = No: 50.0% [6.1549%, 93.8451%]\n",
"| | | | | | | Decision = Yes: 50.0% [6.1549%, 93.8451%]\n"
]
}
],
"source": [
"tstSource = ID3DataSource_CSV(\"../../Data/Classification/test_weather.csv\", hasHeader = True)\n",
"tstSource.addAttributeByName(ID3Attribute_DiscreteLabel(\"Temperature\"))\n",
"tstSource.addAttributeByName(ID3Attribute_DiscreteLabel(\"Humidity\"))\n",
"tstSource.addAttributeByName(ID3Attribute_DiscreteLabel(\"Wind\"))\n",
"tstSource.addAttributeByName(ID3Attribute_DiscreteLabel(\"Decision\", isTarget = True))\n",
"tstSource.scanAttributes()\n",
"\n",
"builder = ID3TreeBuilder(debugTrace = False)\n",
"tree = builder.buildTree(tstSource)\n",
"builder.prettyPrintTree(tstSource, tree)"
]
},
{
"cell_type": "markdown",
"id": "18e3f464",
"metadata": {},
"source": [
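"The root node of the weather tree above can be checked by hand. This is only a sketch that assumes the counts are 5 `No` and 9 `Yes` out of 14 rows, which is what the printed proportions of 35.71% and 64.29% suggest; the symbol $H$ for the entropy is notation chosen here and is not taken from the code:\n",
"\n",
"$$H(\\text{Decision}) = -\\frac{5}{14}\\log_2\\frac{5}{14} - \\frac{9}{14}\\log_2\\frac{9}{14} \\approx 0.5305 + 0.4098 = 0.9403$$\n",
"\n",
"Splitting on Humidity, and again assuming 7 rows per branch (4 `No` and 3 `Yes` for High, 1 `No` and 6 `Yes` for Normal), gives\n",
"\n",
"$$H(\\text{High}) \\approx 0.9852, \\qquad H(\\text{Normal}) \\approx 0.5917$$\n",
"\n",
"$$\\text{gain} = 0.9403 - \\frac{7}{14} \\cdot 0.9852 - \\frac{7}{14} \\cdot 0.5917 \\approx 0.1518$$\n",
"\n",
"which agrees with the entropy and gain printed for the root node.\n",
"\n",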
"# Try on mushrooms dataset\n",
"\n",
"To make life a little more interesting, the next example uses a frequently used dataset that Jeff Schlimmer extracted from the section on Agaricus and Lepiota in the Audubon Society Field Guide and that has been publicly available. It contains 22 properties of 8124 mushrooms from those two mushroom families. Please note that this should not be used as a guide. There is no guarantee this decision tree won’t kill you.\n",
"\n",
"* Edible (Target)\n",
"* Cap Shape\n",
"* Cap Surface\n",
"* Cap Color\n",
"* Bruises\n",
"* Odor\n",
"* Gill attachment\n",
"* Gill spacing\n",
"* Gill size\n",
"* Gill color\n",
"* Stalk shape\n",
"* Stalk root\n",
"* Stalk surface above ring\n",
"* Stalk surface below ring\n",
"* Stalk color above ring\n",
"* Stalk color below ring\n",
"* Veil type\n",
"* Veil color\n",
"* Ring number\n",
"* Ring type\n",
"* Spore print color\n",
"* Population\n",
"* Habitat"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "1a95a1d9",
"metadata": {},
"outputs": [],
"source": [
"tstSource = ID3DataSource_CSV(\"../../Data/Classification/test_fungi.csv\", hasHeader = True)"
]
},
{
"cell_type": "markdown",
"id": "fe977159",
"metadata": {},
"source": [
"In this case we index all attributes by their header names and use the first column as the target (result) column."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "1d7cb456",
"metadata": {},
"outputs": [],
"source": [
"tstSource.addAttributesAsLabelsByHeader()\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "332aa2cd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(8416, 23)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tstSource.scanAttributes()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "96acf7ca",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"| Entropy: 0.9968038285222955, gain: 0.9054400254210326\n",
"| Edible = EDIBLE: 53.327000000000005% [51.921870000000006%, 54.726870000000005%]\n",
"| Edible = POISONOUS: 46.672999999999995% [45.27313%, 48.07813%]\n",
"| Branching on Odor\n",
"| Odor = ALMOND:\n",
"| | | Terminal: \n",
"| | | Edible = EDIBLE: 100.0% [98.36314%, 100.0%]\n",
"| Odor = ANISE:\n",
"| | | Terminal: \n",
"| | | Edible = EDIBLE: 100.0% [98.36314%, 100.0%]\n",
"| Odor = NONE:\n",
"| | | Entropy: 0.20192168248430362, gain: 0.1370967382781343\n",
"| | | Edible = EDIBLE: 96.84874% [96.03266%, 97.50132%]\n",
"| | | Edible = POISONOUS: 3.15126% [2.49868%, 3.9673399999999996%]\n",
"| | | Branching on Spore print color\n",
"| | | Spore print color = PURPLE:\n",
"| | | | | Terminal: \n",
"| | | | | Not possible according to known data\n",
"| | | Spore print color = BROWN:\n",
"| | | | | Terminal: \n",
"| | | | | Edible = EDIBLE: 100.0% [99.54983%, 100.0%]\n",
"| | | Spore print color = BLACK:\n",
"| | | | | Terminal: \n",
"| | | | | Edible = EDIBLE: 100.0% [99.53473000000001%, 100.0%]\n",
"| | | Spore print color = CHOCOLATE:\n",
"| | | | | Terminal: \n",
"| | | | | Edible = EDIBLE: 100.0% [87.82137%, 100.0%]\n",
"| | | Spore print color = GREEN:\n",
"| | | | | Terminal: \n",
"| | | | | Edible = POISONOUS: 100.0% [91.53737%, 100.0%]\n",
"| | | Spore print color = WHITE:\n",
"| | | | | Entropy: 0.3809465857053901, gain: 0.2434890145144485\n",
"| | | | | Edible = EDIBLE: 92.59259% [89.48345%, 94.83559%]\n",
"| | | | | Edible = POISONOUS: 7.4074100000000005% [5.16441%, 10.516549999999999%]\n",
"| | | | | Branching on Habitat\n",
"| | | | | Habitat = WOODS:\n",
"| | | | | | | Entropy: 0.7219280948873623, gain: 0.7219280948873623\n",
"| | | | | | | Edible = EDIBLE: 20.0% [8.576920000000001%, 39.98319%]\n",
"| | | | | | | Edible = POISONOUS: 80.0% [60.01681%, 91.42308%]\n",
"| | | | | | | Branching on Gill size\n",
"| | | | | | | Gill size = NARROW:\n",
"| | | | | | | | | Terminal: \n",
"| | | | | | | | | Edible = POISONOUS: 100.0% [82.7806%, 100.0%]\n",
"| | | | | | | Gill size = BROAD:\n",
"| | | | | | | | | Terminal: \n",
"| | | | | | | | | Edible = EDIBLE: 100.0% [54.58366%, 100.0%]\n",
"| | | | | Habitat = MEADOWS:\n",
"| | | | | | | Terminal: \n",
"| | | | | | | Not possible according to known data\n",
"| | | | | Habitat = GRASSES:\n",
"| | | | | | | Terminal: \n",
"| | | | | | | Edible = EDIBLE: 100.0% [97.74096%, 100.0%]\n",
"| | | | | Habitat = PATHS:\n",
"| | | | | | | Terminal: \n",
"| | | | | | | Edible = EDIBLE: 100.0% [85.73315000000001%, 100.0%]\n",
"| | | | | Habitat = URBAN:\n",
"| | | | | | | Terminal: \n",
"| | | | | | | Not possible according to known data\n",
"| | | | | Habitat = LEAVES:\n",
"| | | | | | | Entropy: 0.6840384356390417, gain: 0.6840384356390417\n",
"| | | | | | | Edible = EDIBLE: 81.81818% [69.11085%, 90.0505%]\n",
"| | | | | | | Edible = POISONOUS: 18.181820000000002% [9.9495%, 30.889149999999997%]\n",
"| | | | | | | Branching on Cap Color\n",
"| | | | | | | Cap Color = WHITE:\n",
"| | | | | | | | | Terminal: \n",
"| | | | | | | | | Edible = POISONOUS: 100.0% [54.58366%, 100.0%]\n",
"| | | | | | | Cap Color = YELLOW:\n",
"| | | | | | | | | Terminal: \n",
"| | | | | | | | | Edible = POISONOUS: 100.0% [54.58366%, 100.0%]\n",
"| | | | | | | Cap Color = BROWN:\n",
"| | | | | | | | | Terminal: \n",
"| | | | | | | | | Edible = EDIBLE: 100.0% [87.82137%, 100.0%]\n",
"| | | | | | | Cap Color = GRAY:\n",
"| | | | | | | | | Terminal: \n",
"| | | | | | | | | Not possible according to known data\n",
"| | | | | | | Cap Color = RED:\n",
"| | | | | | | | | Terminal: \n",
"| | | | | | | | | Not possible according to known data\n",
"| | | | | | | Cap Color = PINK:\n",
"| | | | | | | | | Terminal: \n",
"| | | | | | | | | Not possible according to known data\n",
"| | | | | | | Cap Color = PURPLE:\n",
"| | | | | | | | | Terminal: \n",
"| | | | | | | | | Not possible according to known data\n",
"| | | | | | | Cap Color = GREEN:\n",
"| | | | | | | | | Terminal: \n",
"| | | | | | | | | Not possible according to known data\n",
"| | | | | | | Cap Color = BUFF:\n",
"| | | | | | | | | Terminal: \n",
"| | | | | | | | | Not possible according to known data\n",
"| | | | | | | Cap Color = CINNAMON:\n",
"| | | | | | | | | Terminal: \n",
"| | | | | | | | | Edible = EDIBLE: 100.0% [78.28708%, 100.0%]\n",
"| | | | | Habitat = WASTE:\n",
"| | | | | | | Terminal: \n",
"| | | | | | | Edible = EDIBLE: 100.0% [96.64929%, 100.0%]\n",
"| | | Spore print color = YELLOW:\n",
"| | | | | Terminal: \n",
"| | | | | Edible = EDIBLE: 100.0% [87.82137%, 100.0%]\n",
"| | | Spore print color = ORANGE:\n",
"| | | | | Terminal: \n",
"| | | | | Edible = EDIBLE: 100.0% [87.82137%, 100.0%]\n",
"| | | Spore print color = BUFF:\n",
"| | | | | Terminal: \n",
"| | | | | Edible = EDIBLE: 100.0% [87.82137%, 100.0%]\n",
"| Odor = PUNGENT:\n",
"| | | Terminal: \n",
"| | | Edible = POISONOUS: 100.0% [97.46574%, 100.0%]\n",
"| Odor = CREOSOTE:\n",
"| | | Terminal: \n",
"| | | Edible = POISONOUS: 100.0% [96.64929%, 100.0%]\n",
"| Odor = FOUL:\n",
"| | | Terminal: \n",
"| | | Edible = POISONOUS: 100.0% [99.69278%, 100.0%]\n",
"| Odor = FISHY:\n",
"| | | Terminal: \n",
"| | | Edible = POISONOUS: 100.0% [98.85758%, 100.0%]\n",
"| Odor = SPICY:\n",
"| | | Terminal: \n",
"| | | Edible = POISONOUS: 100.0% [98.85758%, 100.0%]\n",
"| Odor = MUSTY:\n",
"| | | Terminal: \n",
"| | | Edible = POISONOUS: 100.0% [87.82137%, 100.0%]\n",
"CPU times: user 56.8 s, sys: 526 ms, total: 57.3 s\n",
"Wall time: 57.4 s\n"
]
}
],
"source": [
"%%time\n",
"builder = ID3TreeBuilder(debugTrace = False)\n",
"tree = builder.buildTree(tstSource)\n",
"builder.prettyPrintTree(tstSource, tree)"
]
},
{
"cell_type": "markdown",
"id": "4333cbd2",
"metadata": {},
"source": [
"# Applying to simple medical data\n",
"\n",
"As a last example, the algorithm is applied to simple medical data. The data source has been available on [Kaggle](https://www.kaggle.com/datasets/itachi9604/disease-symptom-description-dataset?resource=download); only a small subset of the data will be used. The main CSV file contains a list of diseases and, for each, a list of the symptoms encountered. The simplest approach is to build a binary table that tells whether a patient has a given symptom or not. This does not account for the severity of symptoms, though, which would be really important for a real-world application (and pretty simple to implement)."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "f5da2f72",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Extracted 131 symptoms\n"
]
}
],
"source": [
"symptoms = []\n",
"\n",
"with open(\"../../Data/Classification/test_disease_source.csv\") as srcfile:\n",
"    rsr = csv.reader(srcfile)\n",
"    firstLine = True\n",
"    for row in rsr:\n",
"        # Skip the header line\n",
"        if firstLine:\n",
"            firstLine = False\n",
"            continue\n",
"        # First we extract a list of all possible symptoms\n",
"        for i in range(1, len(row)):\n",
"            sympt = row[i].strip()\n",
"            if sympt and sympt not in symptoms:\n",
"                symptoms.append(sympt)\n",
"\n",
"print(f\"Extracted {len(symptoms)} symptoms\")\n",
"\n",
"# Now build the table ...\n",
"with open(\"../../Data/Classification/test_disease_source_binary.csv\", 'w') as dstfile:\n",
"    headerline = \"Disease\"\n",
"    for sympt in symptoms:\n",
"        headerline = headerline + \",\" + sympt\n",
"    dstfile.write(headerline + \"\\n\")\n",
"\n",
"    with open(\"../../Data/Classification/test_disease_source.csv\") as srcfile:\n",
"        rsr = csv.reader(srcfile)\n",
"        firstLine = True\n",
"        for row in rsr:\n",
"            # Skip the header line\n",
"            if firstLine:\n",
"                firstLine = False\n",
"                continue\n",
"\n",
"            # The first column is the disease name, the remaining columns are symptoms\n",
"            line = row.pop(0).strip()\n",
"\n",
"            for sympt in symptoms:\n",
"                # Check if this symptom is present\n",
"                hasSympt = False\n",
"                for ent in row:\n",
"                    if ent.strip() == sympt:\n",
"                        hasSympt = True\n",
"                        break\n",
"                if hasSympt:\n",
"                    line = line + \",YES\"\n",
"                else:\n",
"                    line = line + \",NO\"\n",
"\n",
"            dstfile.write(line + \"\\n\")\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "2ab73189",
"metadata": {},
"source": [
"## Applying ID3 Algorithm"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "f5e70c59",
"metadata": {},
"outputs": [],
"source": [
"tstSource = ID3DataSource_CSV(\"../../Data/Classification/test_disease_source_binary.csv\", hasHeader = True)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "cc698dad",
"metadata": {},
"outputs": [],
"source": [
"tstSource.addAttributesAsLabelsByHeader()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "596164ef",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(4920, 132)"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tstSource.scanAttributes()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "2c5aa7b0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 19h 1min 57s, sys: 11min 16s, total: 19h 13min 13s\n",
"Wall time: 19h 14min 18s\n"
]
}
],
"source": [
"%%time\n",
"builder = ID3TreeBuilder(debugTrace = False)\n",
"tree = builder.buildTree(tstSource)\n",
"# builder.prettyPrintTree(tstSource, tree)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "017d74c2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"| Entropy: 5.357552004618081, gain: 0.8458372108970891\n",
"| fatigue = NO:\n",
"| | | Entropy: 4.786255977156568, gain: 0.8175104414185426\n",
"| | | vomiting = NO:\n",
"| | | | | Entropy: 4.2634667101159796, gain: 0.7724397067592355\n",
"| | | | | skin_rash = YES:\n",
"| | | | | | | Entropy: 2.3817253589307477, gain: 0.7884441273192315\n",
"| | | | | | | itching = YES:\n",
"| | | | | | | | | Entropy: 1.161378479448699, gain: 0.7287131042890482\n",
"| | | | | | | | | stomach_pain = NO:\n",
"| | | | | | | | | | | Entropy: 0.7742433029172697, gain: 0.48546076074591343\n",
"| | | | | | | | | | | burning_micturition = NO:\n",
"| | | | | | | | | | | | | Entropy: 0.3227569588973982, gain: 0.3227569588973982\n",
"| | | | | | | | | | | | | loss_of_appetite = NO:\n",
"| | | | | | | | | | | | | | | Disease = Fungal infection: 100.0% [93.51585%, 100.0%]\n",
"| | | | | | | | | | | | | loss_of_appetite = YES:\n",
"| | | | | | | | | | | | | | | Disease = Chicken pox: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | burning_micturition = YES:\n",
"| | | | | | | | | | | | | Disease = Drug Reaction: 100.0% [64.32109%, 100.0%]\n",
"| | | | | | | | | stomach_pain = YES:\n",
"| | | | | | | | | | | Disease = Drug Reaction: 100.0% [93.11334%, 100.0%]\n",
"| | | | | | | itching = NO:\n",
"| | | | | | | | | Entropy: 1.838026124503779, gain: 0.7870913537395843\n",
"| | | | | | | | | joint_pain = NO:\n",
"| | | | | | | | | | | Entropy: 1.5013353868059924, gain: 0.8506573567612395\n",
"| | | | | | | | | | | blister = NO:\n",
"| | | | | | | | | | | | | Entropy: 1.1386865525783176, gain: 0.48654136697818307\n",
"| | | | | | | | | | | | | pus_filled_pimples = NO:\n",
"| | | | | | | | | | | | | | | Entropy: 2.2359263506290326, gain: 0.863120568566631\n",
"| | | | | | | | | | | | | | | nodal_skin_eruptions = YES:\n",
"| | | | | | | | | | | | | | | | | Disease = Fungal infection: 100.0% [64.32109%, 100.0%]\n",
"| | | | | | | | | | | | | | | nodal_skin_eruptions = NO:\n",
"| | | | | | | | | | | | | | | | | Entropy: 1.9219280948873623, gain: 0.9709505944546687\n",
"| | | | | | | | | | | | | | | | | blackheads = NO:\n",
"| | | | | | | | | | | | | | | | | | | Entropy: 1.584962500721156, gain: 0.9182958340544894\n",
"| | | | | | | | | | | | | | | | | | | stomach_pain = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | | | | | high_fever = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | Disease = Psoriasis: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | high_fever = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | Disease = Impetigo: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | stomach_pain = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | Disease = Drug Reaction: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | blackheads = YES:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Acne: 100.0% [64.32109%, 100.0%]\n",
"| | | | | | | | | | | | | pus_filled_pimples = YES:\n",
"| | | | | | | | | | | | | | | Disease = Acne: 100.0% [93.87389999999999%, 100.0%]\n",
"| | | | | | | | | | | blister = YES:\n",
"| | | | | | | | | | | | | Disease = Impetigo: 100.0% [94.19448%, 100.0%]\n",
"| | | | | | | | | joint_pain = YES:\n",
"| | | | | | | | | | | Disease = Psoriasis: 100.0% [94.19448%, 100.0%]\n",
"| | | | | skin_rash = NO:\n",
"| | | | | | | Entropy: 3.9828871664512895, gain: 0.6468749738357373\n",
"| | | | | | | headache = NO:\n",
"| | | | | | | | | Entropy: 3.7560383874069343, gain: 0.6991724211329374\n",
"| | | | | | | | | swelling_joints = NO:\n",
"| | | | | | | | | | | Entropy: 3.648994047474087, gain: 0.6239247592651546\n",
"| | | | | | | | | | | dizziness = NO:\n",
"| | | | | | | | | | | | | Entropy: 3.4860352643091366, gain: 0.6908574161896694\n",
"| | | | | | | | | | | | | high_fever = NO:\n",
"| | | | | | | | | | | | | | | Entropy: 3.2946870140830082, gain: 0.6952585028600295\n",
"| | | | | | | | | | | | | | | constipation = NO:\n",
"| | | | | | | | | | | | | | | | | Entropy: 3.336579880077256, gain: 0.7747942362434017\n",
"| | | | | | | | | | | | | | | | | bladder_discomfort = NO:\n",
"| | | | | | | | | | | | | | | | | | | Entropy: 3.5758257945180882, gain: 0.7590191722627639\n",
"| | | | | | | | | | | | | | | | | | | continuous_sneezing = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | Entropy: 4.506890595608519, gain: 0.7219280948873625\n",
"| | | | | | | | | | | | | | | | | | | | | itching = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.9182958340544893, gain: 0.9182958340544893\n",
"| | | | | | | | | | | | | | | | | | | | | | | nodal_skin_eruptions = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | Disease = Fungal infection: 100.0% [64.32109%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | nodal_skin_eruptions = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.5, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | stomach_pain = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | nausea = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Hepatitis B: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | nausea = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Chronic cholestasis: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | stomach_pain = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Drug Reaction: 100.0% [64.32109%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | itching = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | Entropy: 4.251629167387823, gain: 0.6500224216483548\n",
"| | | | | | | | | | | | | | | | | | | | | | | diarrhoea = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 4.021928094887362, gain: 0.7219280948873619\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | chest_pain = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 3.875, gain: 0.5435644431995956\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | yellowish_skin = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 3.6644977792004623, gain: 0.5916727785823288\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | shivering = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 3.584962500721156, gain: 0.6500224216483542\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | indigestion = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 3.321928094887362, gain: 0.7219280948873619\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | obesity = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 3.0, gain: 0.8112781244591329\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | neck_pain = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 2.584962500721156, gain: 0.6500224216483541\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | burning_micturition = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 2.321928094887362, gain: 0.7219280948873621\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | muscle_wasting = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 2.0, gain: 0.8112781244591329\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | stiff_neck = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.584962500721156, gain: 0.9182958340544894\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | joint_pain = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pain_during_bowel_movements = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Acne: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pain_during_bowel_movements = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Dimorphic hemmorhoids(piles): 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | joint_pain = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Psoriasis: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | stiff_neck = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Arthritis: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | muscle_wasting = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = AIDS: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | burning_micturition = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Urinary tract infection: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | neck_pain = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | loss_of_balance = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Osteoarthristis: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | loss_of_balance = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Cervical spondylosis: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | obesity = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | weight_loss = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Varicose veins: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | weight_loss = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Diabetes: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | indigestion = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | acidity = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Peptic ulcer diseae: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | acidity = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Migraine: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | shivering = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Allergy: 100.0% [64.32109%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | yellowish_skin = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | nausea = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Alcoholic hepatitis: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | nausea = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Hepatitis C: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | chest_pain = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | stomach_pain = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Heart attack: 100.0% [64.32109%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | stomach_pain = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = GERD: 100.0% [64.32109%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | diarrhoea = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.5, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | sunken_eyes = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | yellowish_skin = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Hyperthyroidism: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | yellowish_skin = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = hepatitis A: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | sunken_eyes = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Gastroenteritis: 100.0% [64.32109%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | continuous_sneezing = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | Disease = Allergy: 100.0% [94.19448%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | bladder_discomfort = YES:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Urinary tract infection: 100.0% [94.48318%, 100.0%]\n",
"| | | | | | | | | | | | | | | constipation = YES:\n",
"| | | | | | | | | | | | | | | | | Disease = Dimorphic hemmorhoids(piles): 100.0% [94.48318%, 100.0%]\n",
"| | | | | | | | | | | | | high_fever = YES:\n",
"| | | | | | | | | | | | | | | Entropy: 0.9274479232123118, gain: 0.558629373452199\n",
"| | | | | | | | | | | | | | | cough = NO:\n",
"| | | | | | | | | | | | | | | | | Entropy: 0.28639695711595625, gain: 0.28639695711595625\n",
"| | | | | | | | | | | | | | | | | blister = NO:\n",
"| | | | | | | | | | | | | | | | | | | Disease = AIDS: 100.0% [94.48318%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | blister = YES:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Impetigo: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | cough = YES:\n",
"| | | | | | | | | | | | | | | | | Entropy: 0.9182958340544896, gain: 0.9182958340544896\n",
"| | | | | | | | | | | | | | | | | chills = NO:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Bronchial Asthma: 100.0% [64.32109%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | chills = YES:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Pneumonia: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | dizziness = YES:\n",
"| | | | | | | | | | | | | Entropy: 0.8404914014731815, gain: 0.5096374678020158\n",
"| | | | | | | | | | | | | neck_pain = NO:\n",
"| | | | | | | | | | | | | | | Entropy: 1.5219280948873621, gain: 0.9709505944546685\n",
"| | | | | | | | | | | | | | | chest_pain = NO:\n",
"| | | | | | | | | | | | | | | | | Entropy: 0.9182958340544896, gain: 0.9182958340544896\n",
"| | | | | | | | | | | | | | | | | lethargy = NO:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Cervical spondylosis: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | lethargy = YES:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Hypothyroidism: 100.0% [64.32109%, 100.0%]\n",
"| | | | | | | | | | | | | | | chest_pain = YES:\n",
"| | | | | | | | | | | | | | | | | Disease = Hypertension: 100.0% [64.32109%, 100.0%]\n",
"| | | | | | | | | | | | | neck_pain = YES:\n",
"| | | | | | | | | | | | | | | Disease = Cervical spondylosis: 100.0% [94.19448%, 100.0%]\n",
"| | | | | | | | | swelling_joints = YES:\n",
"| | | | | | | | | | | Entropy: 1.0, gain: 0.8492647594126546\n",
"| | | | | | | | | | | stiff_neck = NO:\n",
"| | | | | | | | | | | | | Entropy: 0.28639695711595625, gain: 0.28639695711595625\n",
"| | | | | | | | | | | | | muscle_weakness = NO:\n",
"| | | | | | | | | | | | | | | Disease = Osteoarthristis: 100.0% [94.48318%, 100.0%]\n",
"| | | | | | | | | | | | | muscle_weakness = YES:\n",
"| | | | | | | | | | | | | | | Disease = Arthritis: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | stiff_neck = YES:\n",
"| | | | | | | | | | | | | Disease = Arthritis: 100.0% [94.19448%, 100.0%]\n",
"| | | | | | | headache = YES:\n",
"| | | | | | | | | Entropy: 1.6359061660790049, gain: 0.8525666663983983\n",
"| | | | | | | | | loss_of_balance = NO:\n",
"| | | | | | | | | | | Entropy: 1.1386865525783176, gain: 0.575779260731362\n",
"| | | | | | | | | | | acidity = NO:\n",
"| | | | | | | | | | | | | Entropy: 2.2516291673878226, gain: 0.9182958340544893\n",
"| | | | | | | | | | | | | chills = NO:\n",
"| | | | | | | | | | | | | | | Entropy: 1.5, gain: 1.0\n",
"| | | | | | | | | | | | | | | weakness_of_one_body_side = NO:\n",
"| | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | chest_pain = NO:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Migraine: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | chest_pain = YES:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Hypertension: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | weakness_of_one_body_side = YES:\n",
"| | | | | | | | | | | | | | | | | Disease = Paralysis (brain hemorrhage): 100.0% [64.32109%, 100.0%]\n",
"| | | | | | | | | | | | | chills = YES:\n",
"| | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | continuous_sneezing = NO:\n",
"| | | | | | | | | | | | | | | | | Disease = Malaria: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | continuous_sneezing = YES:\n",
"| | | | | | | | | | | | | | | | | Disease = Common Cold: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | acidity = YES:\n",
"| | | | | | | | | | | | | Disease = Migraine: 100.0% [94.19448%, 100.0%]\n",
"| | | | | | | | | loss_of_balance = YES:\n",
"| | | | | | | | | | | Entropy: 0.3095434291503252, gain: 0.3095434291503252\n",
"| | | | | | | | | | | nausea = NO:\n",
"| | | | | | | | | | | | | Disease = Hypertension: 100.0% [93.87389999999999%, 100.0%]\n",
"| | | | | | | | | | | nausea = YES:\n",
"| | | | | | | | | | | | | Disease = (vertigo) Paroymsal Positional Vertigo: 100.0% [47.40685%, 100.0%]\n",
"| | | vomiting = YES:\n",
"| | | | | Entropy: 3.4990336640731607, gain: 0.8507115768962774\n",
"| | | | | nausea = NO:\n",
"| | | | | | | Entropy: 2.8781892225870314, gain: 0.8236948259200888\n",
"| | | | | | | abdominal_pain = NO:\n",
"| | | | | | | | | Entropy: 2.367635889995596, gain: 0.849308608237843\n",
"| | | | | | | | | chest_pain = NO:\n",
"| | | | | | | | | | | Entropy: 1.8180959929710643, gain: 0.8525666663983981\n",
"| | | | | | | | | | | diarrhoea = NO:\n",
"| | | | | | | | | | | | | Entropy: 1.457518749639422, gain: 0.6387068973726207\n",
"| | | | | | | | | | | | | altered_sensorium = NO:\n",
"| | | | | | | | | | | | | | | Entropy: 2.807354922057604, gain: 0.8631205685666311\n",
"| | | | | | | | | | | | | | | headache = NO:\n",
"| | | | | | | | | | | | | | | | | Entropy: 2.321928094887362, gain: 0.7219280948873621\n",
"| | | | | | | | | | | | | | | | | stomach_pain = NO:\n",
"| | | | | | | | | | | | | | | | | | | Entropy: 2.0, gain: 0.8112781244591329\n",
"| | | | | | | | | | | | | | | | | | | yellowish_skin = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | Entropy: 1.584962500721156, gain: 0.9182958340544894\n",
"| | | | | | | | | | | | | | | | | | | | | loss_of_appetite = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | | | | | | | sunken_eyes = NO:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | Disease = Heart attack: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | | | sunken_eyes = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | | | Disease = Gastroenteritis: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | | | loss_of_appetite = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | | | Disease = Peptic ulcer diseae: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | | | yellowish_skin = YES:\n",
"| | | | | | | | | | | | | | | | | | | | | Disease = Alcoholic hepatitis: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | stomach_pain = YES:\n",
"| | | | | | | | | | | | | | | | | | | Disease = GERD: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | headache = YES:\n",
"| | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | loss_of_balance = NO:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Paralysis (brain hemorrhage): 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | loss_of_balance = YES:\n",
"| | | | | | | | | | | | | | | | | | | Disease = (vertigo) Paroymsal Positional Vertigo: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | altered_sensorium = YES:\n",
"| | | | | | | | | | | | | | | Disease = Paralysis (brain hemorrhage): 100.0% [93.87389999999999%, 100.0%]\n",
"| | | | | | | | | | | diarrhoea = YES:\n",
"| | | | | | | | | | | | | Entropy: 0.3095434291503252, gain: 0.3095434291503252\n",
"| | | | | | | | | | | | | chills = NO:\n",
"| | | | | | | | | | | | | | | Disease = Gastroenteritis: 100.0% [93.87389999999999%, 100.0%]\n",
"| | | | | | | | | | | | | chills = YES:\n",
"| | | | | | | | | | | | | | | Disease = Malaria: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | chest_pain = YES:\n",
"| | | | | | | | | | | Entropy: 1.1586048283017796, gain: 0.8426433989885903\n",
"| | | | | | | | | | | cough = NO:\n",
"| | | | | | | | | | | | | Entropy: 0.3095434291503252, gain: 0.3095434291503252\n",
"| | | | | | | | | | | | | stomach_pain = NO:\n",
"| | | | | | | | | | | | | | | Disease = Heart attack: 100.0% [93.87389999999999%, 100.0%]\n",
"| | | | | | | | | | | | | stomach_pain = YES:\n",
"| | | | | | | | | | | | | | | Disease = GERD: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | cough = YES:\n",
"| | | | | | | | | | | | | Entropy: 0.3227569588973982, gain: 0.3227569588973982\n",
"| | | | | | | | | | | | | chills = NO:\n",
"| | | | | | | | | | | | | | | Disease = GERD: 100.0% [93.51585%, 100.0%]\n",
"| | | | | | | | | | | | | chills = YES:\n",
"| | | | | | | | | | | | | | | Disease = Tuberculosis: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | abdominal_pain = YES:\n",
"| | | | | | | | | Entropy: 1.4362406790693445, gain: 0.8566594912242682\n",
"| | | | | | | | | yellowish_skin = NO:\n",
"| | | | | | | | | | | Entropy: 0.2974722489192896, gain: 0.2974722489192896\n",
"| | | | | | | | | | | swelling_of_stomach = NO:\n",
"| | | | | | | | | | | | | Disease = Peptic ulcer diseae: 100.0% [94.19448%, 100.0%]\n",
"| | | | | | | | | | | swelling_of_stomach = YES:\n",
"| | | | | | | | | | | | | Disease = Alcoholic hepatitis: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | yellowish_skin = YES:\n",
"| | | | | | | | | | | Entropy: 0.847584679824574, gain: 0.46899559358928133\n",
"| | | | | | | | | | | itching = YES:\n",
"| | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | loss_of_appetite = NO:\n",
"| | | | | | | | | | | | | | | Disease = Jaundice: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | loss_of_appetite = YES:\n",
"| | | | | | | | | | | | | | | Disease = Chronic cholestasis: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | itching = NO:\n",
"| | | | | | | | | | | | | Entropy: 0.3095434291503252, gain: 0.3095434291503252\n",
"| | | | | | | | | | | | | loss_of_appetite = NO:\n",
"| | | | | | | | | | | | | | | Disease = Alcoholic hepatitis: 100.0% [93.87389999999999%, 100.0%]\n",
"| | | | | | | | | | | | | loss_of_appetite = YES:\n",
"| | | | | | | | | | | | | | | Disease = hepatitis A: 100.0% [47.40685%, 100.0%]\n",
"| | | | | nausea = YES:\n",
"| | | | | | | Entropy: 2.297472248919289, gain: 0.9995003941817583\n",
"| | | | | | | muscle_pain = NO:\n",
"| | | | | | | | | Entropy: 1.4362406790693445, gain: 0.8566594912242681\n",
"| | | | | | | | | yellowish_skin = NO:\n",
"| | | | | | | | | | | Entropy: 0.5689955935892812, gain: 0.33125121848110783\n",
"| | | | | | | | | | | loss_of_balance = NO:\n",
"| | | | | | | | | | | | | Entropy: 1.584962500721156, gain: 0.9182958340544894\n",
"| | | | | | | | | | | | | itching = YES:\n",
"| | | | | | | | | | | | | | | Disease = Chronic cholestasis: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | itching = NO:\n",
"| | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | blurred_and_distorted_vision = NO:\n",
"| | | | | | | | | | | | | | | | | Disease = (vertigo) Paroymsal Positional Vertigo: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | blurred_and_distorted_vision = YES:\n",
"| | | | | | | | | | | | | | | | | Disease = Hypoglycemia: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | loss_of_balance = YES:\n",
"| | | | | | | | | | | | | Disease = (vertigo) Paroymsal Positional Vertigo: 100.0% [93.87389999999999%, 100.0%]\n",
"| | | | | | | | | yellowish_skin = YES:\n",
"| | | | | | | | | | | Entropy: 0.5907239186406502, gain: 0.4854607607459134\n",
"| | | | | | | | | | | dark_urine = NO:\n",
"| | | | | | | | | | | | | Disease = Chronic cholestasis: 100.0% [93.87389999999999%, 100.0%]\n",
"| | | | | | | | | | | dark_urine = YES:\n",
"| | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | high_fever = NO:\n",
"| | | | | | | | | | | | | | | Disease = Hepatitis D: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | high_fever = YES:\n",
"| | | | | | | | | | | | | | | Disease = Hepatitis E: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | muscle_pain = YES:\n",
"| | | | | | | | | Entropy: 1.1522290399012944, gain: 0.9994730201859836\n",
"| | | | | | | | | yellowing_of_eyes = NO:\n",
"| | | | | | | | | | | Entropy: 0.2974722489192896, gain: 0.2974722489192896\n",
"| | | | | | | | | | | skin_rash = YES:\n",
"| | | | | | | | | | | | | Disease = Dengue: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | skin_rash = NO:\n",
"| | | | | | | | | | | | | Disease = Malaria: 100.0% [94.19448%, 100.0%]\n",
"| | | | | | | | | yellowing_of_eyes = YES:\n",
"| | | | | | | | | | | Disease = hepatitis A: 100.0% [94.19448%, 100.0%]\n",
"| fatigue = YES:\n",
"| | | Entropy: 4.087113833003859, gain: 0.9011019245114962\n",
"| | | loss_of_appetite = NO:\n",
"| | | | | Entropy: 3.4394507096117723, gain: 0.8817069873806092\n",
"| | | | | high_fever = NO:\n",
"| | | | | | | Entropy: 2.7185962248540316, gain: 0.9914266810680207\n",
"| | | | | | | irritability = NO:\n",
"| | | | | | | | | Entropy: 1.9047143071995363, gain: 0.9824740868386415\n",
"| | | | | | | | | increased_appetite = NO:\n",
"| | | | | | | | | | | Entropy: 1.596184996778472, gain: 0.6731080737015489\n",
"| | | | | | | | | | | obesity = NO:\n",
"| | | | | | | | | | | | | Entropy: 3.0, gain: 1.0\n",
"| | | | | | | | | | | | | yellowish_skin = NO:\n",
"| | | | | | | | | | | | | | | Entropy: 2.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | chills = NO:\n",
"| | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | cough = NO:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Varicose veins: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | cough = YES:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Bronchial Asthma: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | chills = YES:\n",
"| | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | continuous_sneezing = NO:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Pneumonia: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | continuous_sneezing = YES:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Common Cold: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | yellowish_skin = YES:\n",
"| | | | | | | | | | | | | | | Entropy: 2.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | itching = YES:\n",
"| | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | vomiting = NO:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Hepatitis B: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | vomiting = YES:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Jaundice: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | itching = NO:\n",
"| | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | | | | | | | | | vomiting = NO:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Hepatitis C: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | | | | | vomiting = YES:\n",
"| | | | | | | | | | | | | | | | | | | Disease = Hepatitis D: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | obesity = YES:\n",
"| | | | | | | | | | | | | Disease = Varicose veins: 100.0% [94.19448%, 100.0%]\n",
"| | | | | | | | | increased_appetite = YES:\n",
"| | | | | | | | | | | Disease = Diabetes: 100.0% [94.48318%, 100.0%]\n",
"| | | | | | | irritability = YES:\n",
"| | | | | | | | | Entropy: 1.5844996446144277, gain: 0.9241335419915457\n",
"| | | | | | | | | abnormal_menstruation = NO:\n",
"| | | | | | | | | | | Disease = Hypoglycemia: 100.0% [94.48318%, 100.0%]\n",
"| | | | | | | | | abnormal_menstruation = YES:\n",
"| | | | | | | | | | | Entropy: 0.9994730201859836, gain: 0.9994730201859836\n",
"| | | | | | | | | | | depression = NO:\n",
"| | | | | | | | | | | | | Disease = Hyperthyroidism: 100.0% [94.48318%, 100.0%]\n",
"| | | | | | | | | | | depression = YES:\n",
"| | | | | | | | | | | | | Disease = Hypothyroidism: 100.0% [94.19448%, 100.0%]\n",
"| | | | | high_fever = YES:\n",
"| | | | | | | Entropy: 2.381155648699536, gain: 0.9656361333706103\n",
"| | | | | | | chest_pain = NO:\n",
"| | | | | | | | | Entropy: 1.6826392037546638, gain: 0.9402859586706308\n",
"| | | | | | | | | chills = NO:\n",
"| | | | | | | | | | | Entropy: 1.1547717145751624, gain: 0.8452282854248372\n",
"| | | | | | | | | | | itching = YES:\n",
"| | | | | | | | | | | | | Entropy: 0.3095434291503252, gain: 0.3095434291503252\n",
"| | | | | | | | | | | | | skin_rash = YES:\n",
"| | | | | | | | | | | | | | | Disease = Chicken pox: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | | | skin_rash = NO:\n",
"| | | | | | | | | | | | | | | Disease = Jaundice: 100.0% [93.87389999999999%, 100.0%]\n",
"| | | | | | | | | | | itching = NO:\n",
"| | | | | | | | | | | | | Entropy: 0.3095434291503252, gain: 0.3095434291503252\n",
"| | | | | | | | | | | | | vomiting = NO:\n",
"| | | | | | | | | | | | | | | Disease = Bronchial Asthma: 100.0% [93.87389999999999%, 100.0%]\n",
"| | | | | | | | | | | | | vomiting = YES:\n",
"| | | | | | | | | | | | | | | Disease = Jaundice: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | chills = YES:\n",
"| | | | | | | | | | | Disease = Typhoid: 100.0% [94.74452%, 100.0%]\n",
"| | | | | | | chest_pain = YES:\n",
"| | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | muscle_pain = NO:\n",
"| | | | | | | | | | | Disease = Pneumonia: 100.0% [94.19448%, 100.0%]\n",
"| | | | | | | | | muscle_pain = YES:\n",
"| | | | | | | | | | | Disease = Common Cold: 100.0% [94.19448%, 100.0%]\n",
"| | | loss_of_appetite = YES:\n",
"| | | | | Entropy: 2.806836027747821, gain: 0.943622285167955\n",
"| | | | | malaise = NO:\n",
"| | | | | | | Entropy: 1.6854277290691868, gain: 0.9241335419915458\n",
"| | | | | | | coma = NO:\n",
"| | | | | | | | | Entropy: 1.1522290399012944, gain: 0.8488843249236633\n",
"| | | | | | | | | vomiting = NO:\n",
"| | | | | | | | | | | Entropy: 0.2974722489192896, gain: 0.2974722489192896\n",
"| | | | | | | | | | | abdominal_pain = NO:\n",
"| | | | | | | | | | | | | Disease = Hepatitis C: 100.0% [94.19448%, 100.0%]\n",
"| | | | | | | | | | | abdominal_pain = YES:\n",
"| | | | | | | | | | | | | Disease = Hepatitis D: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | vomiting = YES:\n",
"| | | | | | | | | | | Entropy: 0.3095434291503252, gain: 0.3095434291503252\n",
"| | | | | | | | | | | skin_rash = YES:\n",
"| | | | | | | | | | | | | Disease = Dengue: 100.0% [47.40685%, 100.0%]\n",
"| | | | | | | | | | | skin_rash = NO:\n",
"| | | | | | | | | | | | | Disease = Hepatitis D: 100.0% [93.87389999999999%, 100.0%]\n",
"| | | | | | | coma = YES:\n",
"| | | | | | | | | Disease = Hepatitis E: 100.0% [94.48318%, 100.0%]\n",
"| | | | | malaise = YES:\n",
"| | | | | | | Entropy: 1.9995975337661407, gain: 0.9998646331239298\n",
"| | | | | | | yellowing_of_eyes = NO:\n",
"| | | | | | | | | Entropy: 1.0, gain: 1.0\n",
"| | | | | | | | | nausea = NO:\n",
"| | | | | | | | | | | Disease = Chicken pox: 100.0% [94.19448%, 100.0%]\n",
"| | | | | | | | | nausea = YES:\n",
"| | | | | | | | | | | Disease = Dengue: 100.0% [94.19448%, 100.0%]\n",
"| | | | | | | yellowing_of_eyes = YES:\n",
"| | | | | | | | | Entropy: 0.9994730201859836, gain: 0.9994730201859836\n",
"| | | | | | | | | chest_pain = NO:\n",
"| | | | | | | | | | | Disease = Hepatitis B: 100.0% [94.19448%, 100.0%]\n",
"| | | | | | | | | chest_pain = YES:\n",
"| | | | | | | | | | | Disease = Tuberculosis: 100.0% [94.48318%, 100.0%]\n",
"CPU times: user 4min 37s, sys: 3.07 s, total: 4min 41s\n",
"Wall time: 4min 41s\n"
]
}
],
"source": [
"%%time\n",
"builder = ID3TreeBuilder(debugTrace = False)\n",
"builder.prettyPrintTree(tstSource, tree, onlyTerminalProbabilities = True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "088a2ce2",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}