{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Decision Trees\n",
"### Author: _Calvin Chi_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Introduction\n",
"Decision tree is a type of machine learning algorithm that can be used for both classification and regression. The central idea of a decision tree is to literally build a decision tree on different features of the data until we can confidently separate classes from each other. This concept is probably best illustrated through a diagram.\n",
"\n",
"<img src=\"http://i.imgur.com/TDQLt7D.png\", width=500, height=500>\n",
"\n",
"The example above is a simple decision tree for deciding whether to go out or not. The decision to go out or not can be viewed as the label, and the attributes that the decision depends on in the above diagram are outlook, humidity, and wind. \n",
"\n",
"What makes a good attribute to decide on? Intuitively, it would be an attribute that can separate one class from the other. For example, if a person is strongly likely to go out if the wind is weak and not likely otherwise, then the attribute wind is a good attribute. \n",
"\n",
"In creating a decision tree, this objective is stated as the attribute that minimizes uncertainty, with entropy being a common measure of uncertainty. Qualitatively, high entropy implies high uncertainty, while low entropy implies low uncertainty. If a certain attribute can well-separate our samples into their respective classes, then the entropy is reduced because we are more certain about the class of our samples. For example, once we know whether the wind is strong or not, we are then more certain whether this person went out or not. Before diving into decision trees, we must first understand what entropy is. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Entropy\n",
"Qualitatively, entropy measures uncertainty or disorder. High entropy indicates high disorder. High entropy is qualitatively expressed as:\n",
"\n",
"$$H = \\sum_{i=1}^{m} p_{i}log_{2}\\:\\frac{1}{p_{i}} = -\\sum_{i=1}^{m} p_{i}log_{2}\\:p_{i}$$\n",
"\n",
"Where $i$ is a class and $p_{i}$ is the probability of class $i$. \n",
"\n",
"Let us plot entropy values for two classes across the range of values $p$ can take on. In our example, if $p_{1} = 0.3$, then $p_{2}$ is necessarily $0.7$."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYgAAAEZCAYAAACNebLAAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xnc1XP+//HHq1K2KbIXGZPJMtaQdbhSQxIRJvtg0NjN\nd36YGVuWr+Vm7I1MGEuEIYxdomtGURoTkVZLqNS3EJLU1ev3x+uky+Vc13W6Op/zOcvzfrudW+dz\nPp/rc17Xp+t8Xue9m7sjIiJSV7O0AxARkeKkBCEiIlkpQYiISFZKECIikpUShIiIZKUEISIiWSlB\niIhIVkoQUjHM7EMz+8bMvjSzrzL/3pLDz40ws5MKEaNIMWmRdgAiBeTAge4+Ip8nNbPm7l6Tz3OK\nFAOVIKTS2I9eMPuNmb1iZteZ2Wdm9p6Z7Z/ZdyXwS2BA7RKHmS01s9PNbAowJfPaHmb2upl9bmZj\nzGz3Wu8xwsyuyrw+38weN7O1MvueNrMz6sT0lpn1TuwqiORACUIkdAEmAusA1wF/B3D3i4BXgDPd\nvbW7n13rZ3oDuwBbm9nawNPATZlz3Ag8k3l9meOAE4ANgRpgWfXWvZl9AJjZ9kA74Jn8/ooiK0YJ\nQirNE5lSwueZf3+beX26u//dY3Kye4GNzGz9Rs51lbvPd/dFwIHAFHcf4u5L3f0hYBJwUK3jB7v7\nRHdfCFwM9DUzA54Efm5mHTPHHQs87O5L8vQ7izSJEoRUmt7u3tbd1878e1fm9U+XHZC5gQOs2ci5\nPqn1vB0wvc7+6UD7Wtsf19m3CrBuJsH8Azg2kzCOAgbn9NuIJEgJQirNj9ogclDflMe1X58J/LTO\n/g7AjFrbm9R6vinwHTA3s30vUXLoBixw9zFNiFMkr5QgRBo3G/hZI8c8S1QTHWlmzc2sL7AV0S6x\nzLFmtqWZrQ5cBjySqdLC3UcDS4HrUelBioQShFSap+qMgxhK9hJC7dduBo4ws3lmdlOW/bj7Z0Av\n4P8RpYL/R3Sp/azWYYOJksJMoCVwTp33vA/YBri/ab+aSH5ZkgsGmdldxIdmtrtvl2X/0cAFmc2v\ngNPc/e3EAhJJiZmNIBqp/97AMccBp7j73oWLTKR+SZcg7gb2b2D/+8De7r49cCVwR8LxiBSlTLXT\n6cDf0o5FZJlEE4S7jwQ+b2D/aHefn9kczQ97fIiUk3qL6ma2HzAHmAU8WLCIRBpRTFNtnAw8l3YQ\nIklw930b2DeMxrvUihRcUSQIM+sKnAjslXYsIiISUk8QZrYdMAjo4e71VkeZWXKt6SIiZczdmzL+\npyDdXI16BieZWQdgKHCcu7/X2IncXQ93Lr300tRjKJaHroWuha5Fw4+VkWgJwsyGAFXAOmb2EXAp\n0f/b3X0QMR9NW+C2zBQDi929S5IxiYhIbhJNEO5+dCP7TwFOSTIGERFpGo2kLkFVVVVph1A0dC2W\n07VYTtciPxIdSZ1PZualEquISLEwM7yIG6lFRKQEKUGIiEhWShAiIpKVEoSIiGSlBCEiIlkpQYiI\nSFZKECIikpUShIiIZKUEISIiWSlBiIhIVkoQIiKSlRKEiIhkpQQhIiJZKUGIiEhWShAiIpKVEoSI\niGSlBCEiIlkpQYiISFZKECIikpUShIiIZKUEISIiWSlBiIhIVkoQIiKSlRKEiIhkpQQhIiJZKUGI\niEhWiSYIM7vLzGab2fgGjrnFzKaa2ZtmtkOS8YiISO6SLkHcDexf304zOwDo6O4/B/oBtyccj4iI\n5CjRBOHuI4HPGzikN3Bf5tgxQBsz2yDJmEREJDdpt0G0Bz6utT0j85qIiKSsRdoBiBSbJUtg9myY\nNSsec+bAF1/A55/HvwsXwrffxqOmZvnPmcGqq8Jqq8W/bdrAWmvFY731YKONlj9atUrv9xPJVdoJ\nYgawSa3tjTOvZdW/f//vn1dVVVFVVZVUXFLmli6FDz+EiRPjMWkSvP8+fPABzJwJbdsuv5mvvz6s\nvXbc6Dt1gtVXjyTQqhW0qPUJqqmBRYsicXzzDXz5JcydC1Onxr+zZsGnn8Zj3XXhpz+FzTaDLbaA\nrbaKR6dOsMoqaV0VKQfV1dVUV1fn5Vzm7nk5Ub1vYPZT4Cl33zbLvp7AGe5+oJntBtzk7rvVcx5P\nOlYpT+7w3nvw2mswdiyMGwdvvRU3/K23hi23jEfHjnHD7tABWrZMLp6amkhCH3wQSWnSpHi8+y58\n8knEtOOOsNNOsPvusM020Lx5cvFIeTMz3N2a9LNJ3nTNbAhQBawDzAYuBVoC7u6DMscMAHoAC4AT\n3f2/9ZxLCUJysnQpTJgAI0bAyy/Dq6/GDX/33aFLF+jcGXbYAdZZJ+1If2zBAhg/PpLY2LGR1GbO\njLirqmDffWGXXVTKkNwVbYLIJyUIaci8eTBsGDz7LLzwArRuHTfTrl3hl7+EjTdOO8KmmzcvEsWy\nhPf++5EsevaEAw6IEo9IfZQgpCJ99BE8/jg89lh841520+zRI+r3y9XcufDii5EMn38e2rWDPn3i\nsc020VgusowShFSM2bPh4YdhyBCYNg0OPjhujN27R8+hSlNTE6WLxx6LR6tWcNRRcPTR0eAtogQh\nZW3RInjiCbj7bhgzBg46CI45Brp1+2EvokrnDv/5DzzwADz0EGyyCZxwQiSLtddOOzpJixKElKWJ\nE+Fvf4sb3vbbw0knwSGHRDdTadiSJfDSS5FUn38eevWCU0+N9hhVQVUWJQgpG0uWwFNPwYAB0e3z\nt7+NxPCzn6UdWemaNw8GD4bbb4/eXGeeGSWwNdZIOzIpBCUIKXlffw133QU33hiD0846Cw4/PNnx\nCJXGPUoVAwbAyJFRojj7bNhww7QjkyStTIJIey4mqXBz58JFF8UAtVdeiQbo116LenMlh/wyi8b8\nJ56Itpz582P09sknx2hvkbqUICQVs2fDeedFT5u5c2H0aHj0Udh117QjqwwdO8Jf/xqJYeONYY89\n4Nhjo91HZBklCCmoefMiMWy1VfROGj8+6sY7dkw7ssq07rrQv39MRbL11rDPPtE+MW1a2pFJMVCC\nkIL46iu4/PKYmO7rr+Gdd+CWW0p7hHM5ad0a/vznGKW91Vaw227wu9/BjHqnzpRKoAQhiVqyBAYN\niqqkyZOj7nvgwBj9K8VnzTWjTWjy5JiufLvt4JJLIqlL5VGCkMQMGxazkg4ZAk8/HeMZVJVUGtZZ\nB669NqYwef/9KPn9/e8/XP9Cyp+6uUreffgh/P738PbbcP31MR2GBmeVtrFj4dxz4bvvonG7S5e0\nI5JcqZurFIVvv4UrroCdd47HO+9A795KDuVgl11i7MRZZ8Vo9pNPjt5nUt6UICQv/v3vWGPhjTfi\nceGFlTl5Xjkzg+OPj66wa64ZM8cOHhwD8KQ8qYpJVsrnn8P558Nzz8Gtt8Khh6YdkRTKf/4Dp5wS\nXWXVVbl4qYpJUvHMM7DttrG62YQJSg6
VZued4fXX4Ve/igGOAwbEan5SPlSCkBX2xRfwP/8D1dUx\nf1LXrmlHJGmbPDmmFl911ejttNlmaUcky6gEIQVTXR1Tb6+6aoyCVnIQiG6wI0fCgQdGD6e771bb\nRDlQCUJy8t13MWBq8OAoNfTokXZEUqzeeWf5inaDBkHbtmlHVNlUgpBETZ0Ku+8e6zO8+aaSgzRs\nm22ibaJDhyhtVlenHZE0lRKENOjBB2Omz5NOgn/+E9ZbL+2IpBSsuirccAPceWeskX355RqFXYpU\nxSRZLVwYI2dffhn+8Y+YMkOkKWbOjCqnFi3g/vu1QFGhqYpJ8uqDD6LUMH9+DHpTcpCV0a4dDB8e\nf1M77wyjRqUdkeRKCUJ+YNiwmOr5hBOieql167QjknLQokVUMw0aFONl/vpX9XIqBapiEiA+rNdc\nE6OhH3wwFo4RScJ770WS2HFH+NvfNCVL0lTFJCtl4cKoI3788Zi1U8lBktSxY6w7vmhR/K3NmpV2\nRFIfJYgKN3Mm7L13TMT2r39B+/ZpRySVYI01oqTaq1dM0/HGG2lHJNkoQVSw//43PpyHHhqL+ay2\nWtoRSSUxg4svhhtvjLE1Q4emHZHUlXiCMLMeZjbJzKaY2QVZ9rc2syfN7E0ze9vMTkg6JomJ9vbf\nPz6cf/6z1myQ9Bx2GLzwApxzToydUFNj8Ui0kdrMmgFTgG7ATGAscKS7T6p1zJ+A1u7+JzNbF5gM\nbODuS+qcS43UeTJwIFx2WbQ57L572tGIhI8+gp49oaoKbr4ZmjdPO6LyUMyN1F2Aqe4+3d0XAw8B\nvesc48BPMs9/AsyrmxwkP9yjtHDDDTGxmpKDFJMOHeLvctIk6NMnOk9IupJOEO2Bj2ttf5J5rbYB\nwNZmNhN4Czgn4Zgq0pIlsUzk8OHRg2TzzdOOSOTH1loLnn0WVl89qkC/+CLtiCpbi7QDAPYHxrn7\nvmbWEXjRzLZz96/rHti/f//vn1dVVVFVVVWwIEvZwoUxH84338TUGWuumXZEIvVr2TI6TZx7bnSD\nff552GijtKMqHdXV1VTnaYbEpNsgdgP6u3uPzPYfAXf3a2sd8zRwtbuPymy/BFzg7v+pcy61QTTB\nV1/BQQfF/Df33RcfPpFS4A5XXRULEA0frkWImqqY2yDGApub2aZm1hI4EniyzjHTge4AZrYB0Al4\nP+G4KsLnn8dykFtsEd/IlByklJjBhRfCH/4QJYnJk9OOqPIkWsXk7jVmdiYwjEhGd7n7RDPrF7t9\nEHAlcI+Zjc/82Pnu/lmScVWCOXNgv/1ixbcbblA3Vildp58eA+u6do3qpu22SzuiyqG5mMrQrFnQ\nrVv0L7/8ciUHKQ//+AecdVY0Yu+0U9rRlI6VqWIqhkZqyaNZs2DffeGYY+Cii9KORiR/fv1raNUq\nxkooSRSGEkQZ+fRTJQcpb70zo6iUJApDCaJMfPpp1NEqOUi5q50knnsOOndON55ypgRRBubNi95K\nRx2l5CCVoXfv6Abbs2d0gd1mm7QjKk9KECVu/vwYcdqrV8yMKVIpDjkEvv02/v6rq+HnP087ovKj\nBFHCFiyAAw+MOZWuukq9laTyHHlkzBDQvTv8+9+w6aZpR1RelCBK1KJFsY5Dp04x86WSg1Sqk06K\nL0vdusGoUbDBBmlHVD40DqIELV0aS4QuXhx9wzUtskiM+Xniiahuat067WiKx8qMg1CCKDHucPbZ\n8M470YNDC76LBPcYSPfuu9EFVp+NoARRQa64IpZm/Ne/oE2btKMRKS41NdGbr6ZGpetlinmyPsmj\nu++Ge+6J+WiUHER+rHlzGDw41pE491wtX7qylCBKxIsvwp/+FEXnDTdMOxqR4tWqFTz2WLRF3Hhj\n2tGUNvViKgHjx8cI6ccei6m7RaRhbdrEl6k99oilTA8/PO2ISpMSRJGbMSMGwd16K+y1V9rRiJSO\nTTaBp56Kae/btYtkIStGVUxF7Jtv4OCDYz78vn3Tjkak9OywA9x7b0x9/+GHaUdTetSLqUgtXRpJ\nYfXVo2FaA+FEmu7mm+HOO+HVV+EnP0k7msJSN9cydMkl8NJL8PLL0egmIk3nDv36xXopTzxRWd1f\n1c21zDz4INx3Hzz+uJKDSD6YwYAB8NVX0RtQcqMEUWTGjYuR0v/8J6y/ftrRiJSPli1jkOmjj8aX\nMGmcejEVkblzoU+f+Kaz/fZpRyNSftZZJ0rm3bvDVltFI7bUTyWIIrFkSUxdfMQR6rEkkqTtt49u\n44ceGottSf3USF0kzjsP3norJuCrpAY0kbScdx68+WZ85lqUcV2KGqlLXO16USUHkcK4+uroTt6/\nf9qRFC+VIFI2dWqM8HzuOdh557SjEaksc+ZA584waFCsb12OVIIoUQsXxhwxl1+u5CCShvXXh4ce\nghNPhOnT046m+KgEkaLf/jaSxAMPaKS0SJr+8hd45BF45ZXoDltOVIIoQffdF8P+Bw1SchBJ2x/+\nEBP6nX9+2pEUF5UgUjBlCuy5Z0yjse22aUcjIgCffQY77gh//WvMoFwuiroEYWY9zGySmU0xswvq\nOabKzMaZ2TtmNiLpmNK0aFGMd7jsMiUHkWLStm1U9558ckyzLwmXIMysGTAF6AbMBMYCR7r7pFrH\ntAFeBfZz9xlmtq67z81yrrIoQfz+99EYNnSoqpZEitEVV0Tpfvjw8uh2XswliC7AVHef7u6LgYeA\n3nWOORoY6u4zALIlh3LxzDOxKtyddyo5iBSrP/85/r366nTjKAZJJ4j2wMe1tj/JvFZbJ6CtmY0w\ns7FmdlzCMaVizpwout5/fxRlRaQ4NW8en9Nbb4XXX087mnQ1OMDczG4F6q3Xcfez8xRDZ2BfYA3g\nNTN7zd2n1T2wf60hj1VVVVRVVeXh7ZPnHsnhhBPgl79MOxoRaUz79jFp5rHHxgzLa6yRdkS5q66u\nprq6Oi/narANwsx+U2vzMuDS2vvd/d4GT262G9Df3Xtktv8YP+bX1jrmAmBVd78ss30n8Jy7D61z\nrpJtg7jjDrjtNhgzpvz6WIuUs+OPj+QwcGDakTRdQVaUM7Nx7r7jCgbWHJhMNFLPAl4HjnL3ibWO\n2RK4FegBtALGAH3d/d065yrJBDFtGuy2G/zrX/CLX6QdjYisiPnzY/bX224r3ak4ViZBrMgchit8\nd3b3GjM7ExhGtHfc5e4Tzaxf7PZB7j7JzF4AxgM1wKC6yaFU1dTEN5CLL1ZyEClFbdrEoNYjj4Tx\n42HdddOOqLBWpATxX3fvnHA8Db1/yZUgrrsOnn021pZupjHrIiXrD3+AmTNLcyW6xKqYzOwrlpcc\nVge+WbaLKAG0bsqbNkWpJYiJE6NBeuxY2GyztKMRkZWxcGGsPnf11bHqYykpSBtE2kopQdTUxFQa\nxx8Pp5+edjQikg+vvgqHHQZvv11aVU3FPFCuIl1/Pay+Ovzud2lHIiL5sscecMwxcOaZaUdSOCpB
\n5NmkSbDXXqpaEilHpVjVpCqmIrF0KeyzD/z613DWWWlHIyJJGDkS+vaFCRNgrbXSjqZxqmIqEnfc\nAYsXq91BpJzttRccfDBckHVu6vKiEkSezJwZA2q0xoNI+Zs/P8Y2DRkCe++ddjQNUwmiCJx5ZjRK\nKzmIlL82bWKuplNOgW+/TTua5ChB5MHjj8O778KFF6YdiYgUyiGHwDbbwP/+b9qRJEdVTCvp669h\n661jOH6JTC4rInkycyZstx2MGgVbbJF2NNmpF1OKLrgg/kgGD047EhFJw003xWJgw4YV50JgShAp\nmTAhSg1vvw0bbph2NCKShiVLYKedYiW6vn3TjubHlCBS4A5du8Lhh1fWyEoR+bFRo2L808SJ0Lpg\nM9TlRr2YUvDAA/Dll3DaaWlHIiJp23NP6NEDLr208WNLiUoQTfDll7DlltF7addd045GRIrB3LnR\nYeWll4qru7uqmArsggtgzhy4++60IxGRYjJgQHxxHD68eBqslSAKaOpU2H33aJjeaKO0oxGRYrJk\nSUzmd+WVMU6iGChBFNDBB0d9YyXMwyIiK274cOjXL3o5rrpq2tGokbpgXnghRkyfe27akYhIsere\nPQbP3XRT2pGsPJUgcrR4cUzGd801UYoQEanPe+9FB5bx46Fdu3RjUQmiAO64I/6jDzoo7UhEpNh1\n7AgnnwyXXJJ2JCtHJYgcfPkldOoEzz8fDVAiIo2ZPz/uG8OHp9vtVSWIhF17Ley/v5KDiOSuTZuY\n4fn889OOpOlUgmjEJ59E28Obb8ImmxT87UWkhH33XSwsNHBgNF6nQSWIBF18MZx6qpKDiKy4li3h\n6qvhvPNizfpSowTRgPHj4dln4Y9/TDsSESlVhx0Gq60G99+fdiQrTlVMDejVC/bbD84+u6BvKyJl\nZtQoOOYYmDwZWrUq7HuriikBo0bFdBr9+qUdiYiUuj33jLaIO+5IO5IVk3iCMLMeZjbJzKaYWb0T\nVJjZLma22Mz6JB1TY9xj8Y/+/Quf7UWkPF15ZaxfvWBB2pHkLtEEYWbNgAHA/sAvgKPMbMt6jrsG\neCHJeHL14oswezYcd1zakYhIudhxR9h7b7j11rQjyV3SJYguwFR3n+7ui4GHgN5ZjjsLeBSYk3A8\njVpWerjiCmjRIu1oRKScXH45XH89fPFF2pHkJukE0R74uNb2J5nXvmdm7YBD3H0gkPoM6o8/DjU1\n0fNARCSfttgi5nL7y1/SjiQ3xfAd+SagdttEvUmif//+3z+vqqqiqqoqr4EsXRpzp1x7LTRT872I\nJOCSS6K66ZxzYL318n/+6upqqqur83KuRLu5mtluQH9375HZ/iPg7n5trWPeX/YUWBdYAJzq7k/W\nOVfi3VwfeQSuuw7GjCme1aBEpPz87new9toxiC5pRbtgkJk1ByYD3YBZwOvAUe4+sZ7j7waecvfH\nsuxLNEEsXRpzLV19NRx4YGJvIyLC9OnQuTNMmQLrrJPsexXtOAh3rwHOBIYBE4CH3H2imfUzs1Oz\n/UiS8TTkiSdiWHzPnmlFICKVYtNNoU8fuPHGtCNpmEZSE6WHzp2jh4EWAxKRQvjgA9h551jnvm3b\n5N6naEsQpeKpp6LNQYsBiUihbLYZHHII3Hxz2pHUr+JLEO6RxS+6CA49NO+nFxGp17KlSadNg7XW\nSuY9VIJYCS+8EHO29842fE9EJEEdO0anmNtuSzuS7Cq+BFFVFWvHHnts3k8tItKoCROgW7dok1ht\ntfyfXyWIJhozBj78EPr2TTsSEalUv/gFdOkC99yTdiQ/VtEliD59oGtXOOusvJ5WRGSFjBoFxx8f\n60Xkew44lSCaYNIkGDkSTjop7UhEpNLtuSdstBEMHZp2JD9UsQniuuvgzDNhjTXSjkREJJY2vuaa\n6FlZLCoyQcyYEbO2nnFG2pGIiISePWHx4liPplhUZIIYMCAWA0p6DhQRkVw1awbnn19cU4FXXCP1\nN9/EPChjxsDPfpaHwERE8mTRIvjpT2H48OjdlA9qpF4BgwdHg5CSg4gUm1at4LTT4JZb0o4kVFQJ\nYunSyMoDB8YAORGRYjNnTqw8N21afqrBVYLI0YsvRobeZ5+0IxERyW799WNeuEGD0o6kwkoQBxwQ\no6ZPOCE/MYmIJOGtt2KOpg8+gFVWWblzqQSRg4kTYdw4OPLItCMREWnY9ttDp07w6KPpxlExCeKW\nW2Id2FVXTTsSEZHGnXsu3HRTujFURBXTl19G19YJE6BduzwHJiKSgJqamA586FDYaaemn0dVTI14\n4IGYTlfJQURKRfPmcOqpcPvt6cVQ9iUI96jPu+EG6N49gcBERBLy6aew1VaxLEGbNk07h0oQDRg9\nGhYuhH33TTsSEZEVs+GG8Ktfwf33p/P+ZZ8gBg6MxulmZf+bikg5Ou20uI+lUdlT1lVM8+ZFI8+0\nabDuugkFJiKSIHfYcku46y7Ya68V/3lVMdXj3nvhoIOUHESkdJlFLUgajdVlW4Jwj/lM7r47JucT\nESlVn30WE4xOnQrrrbdiP6sSRBYjR8barnvskXYkIiIrp21b6NULhgwp7PuWbYK45x448cQonomI\nlLoTT4z7WiGVZRXTggWw8cbw7ruxELiISKlbujQWE3rySdhhh9x/rqirmMysh5lNMrMpZnZBlv1H\nm9lbmcdIM9t2Zd/z8cejaknJQUTKRbNm8JvfROebQkm0BGFmzYApQDdgJjAWONLdJ9U6ZjdgorvP\nN7MeQH933y3LuXIuQXTvDv36wRFH5OO3EBEpDtOmxZffTz6Bli1z+5liLkF0Aaa6+3R3Xww8BPSu\nfYC7j3b3+ZnN0UD7lXnD6dPhzTeje6uISDnZfPPonfncc4V5v6QTRHvg41rbn9BwAjgZWKlfffDg\nWBRI03qLSDk64YTCNVYXTS8mM+sKnAj8qJ0iV+5x4bRinIiUqyOOgBEj4P/+L/n3apHw+WcAHWpt\nb5x57QfMbDtgENDD3T+v72T9+/f//nlVVRVVVVU/2D9qVNTL7bzzSsUsIlK0WreOKvQhQ+Ccc368\nv7q6murq6ry8V9KN1M2ByUQj9SzgdeAod59Y65gOwEvAce4+uoFzNdpIfcYZsebDhRfmI3oRkeL0\n/PNw2WXw2muNH7syjdSJj4PI9Ey6majOusvdrzGzfoC7+yAzuwPoA0wHDFjs7l2ynKfBBFFTA+3b\nxwjqzTdP5FcRESkKixdHN/433ojVMhtS1AkiXxpLEC+/DOedFxdMRKTcnXIKdOoU972GFHM314J5\n+OHovSQiUgn69o37XpLKogSxeHG0PYwdG0PRRUTK3ZIlUa0+alTD1eoVX4J4+eVYGEjJQUQqRYsW\ncNhh8Mgjyb1HWSQIVS+JSCVKupqp5KuYvvsuFvYePz5mcBURqRQ1NbDJJjFwbostsh9T0VVMw4bB\n1lsrOYhI5WneHA4/PLlSRMkniEceUfWSiFSuvn2Ta4c
o6SqmmhrYYIPcBouIiJSjpUtj0Nzo0bDZ\nZj/eX7FVTGPGRPdWJQcRqVTNmkHPnvDMMwmcO/+nLJynn46FvEVEKlmvXnE/zDclCBGREverX8Gr\nr8LXX+f3vCWbID76CGbNgl13TTsSEZF0tW4d98KXXsrveUs2QTzzDBxwQHTzEhGpdElUM5VsglD1\nkojIcr16xRfnpUvzd86STBALFsArr8B++6UdiYhIcejYEdq0gXHj8nfOkkwQL78cy4qutVbakYiI\nFI98VzOVZIJQ9ZKIyI/lO0GU3Ehq95h3acSIWE1JRETC4sUxu8SECTG6GipsJLU7DB6s5CAiUtcq\nq0RDdZs2+TlfyZUgREQkdxVVghARkcJQghARkayUIEREJCslCBERyUoJQkREslKCEBGRrJQgREQk\nKyUIERHJKvEEYWY9zGySmU0xswvqOeYWM5tqZm+a2Q5JxyQiIo1LNEGYWTNgALA/8AvgKDPbss4x\nBwAd3f3nQD/g9iRjKgfV1dVph1A0dC2W07VYTtciP5IuQXQBprr7dHdfDDwE9K5zTG/gPgB3HwO0\nMbMNEo6rpOmPfzldi+V0LZbTtciPpBNEe+DjWtufZF5r6JgZWY4REZECUyO1iIhklehsrma2G9Df\n3Xtktv8VJmhYAAAEQ0lEQVQIuLtfW+uY24ER7v5wZnsSsI+7z65zLk3lKiLSBE2dzbVFvgOpYyyw\nuZltCswCjgSOqnPMk8AZwMOZhPJF3eQATf8FRUSkaRJNEO5eY2ZnAsOI6qy73H2imfWL3T7I3Z81\ns55mNg1YAJyYZEwiIpKbklkwSERECqvoGqk1sG65xq6FmR1tZm9lHiPNbNs04iyEXP4uMsftYmaL\nzaxPIeMrpBw/I1VmNs7M3jGzEYWOsVBy+Iy0NrMnM/eKt83shBTCTJyZ3WVms81sfAPHrPh9092L\n5kEkrGnApsAqwJvAlnWOOQB4JvN8V2B02nGneC12A9pknveo5GtR67iXgKeBPmnHneLfRRtgAtA+\ns71u2nGneC3+BFy97DoA84AWaceewLXYC9gBGF/P/ibdN4utBKGBdcs1ei3cfbS7z89sjqZ8x4/k\n8ncBcBbwKDCnkMEVWC7X4mhgqLvPAHD3uQWOsVByuRYO/CTz/CfAPHdfUsAYC8LdRwKfN3BIk+6b\nxZYgNLBuuVyuRW0nA88lGlF6Gr0WZtYOOMTdBwLl3OMtl7+LTkBbMxthZmPN7LiCRVdYuVyLAcDW\nZjYTeAs4p0CxFZsm3TeT7uYqBWBmXYneX3ulHUuKbgJq10GXc5JoTAugM7AvsAbwmpm95u7T0g0r\nFfsD49x9XzPrCLxoZtu5+9dpB1YKii1BzAA61NreOPNa3WM2aeSYcpDLtcDMtgMGAT3cvaEiZinL\n5VrsDDxkZkbUNR9gZovd/ckCxVgouVyLT4C57v4t8K2Z/RvYnqivLye5XIsTgasB3P09M/sA2BL4\nT0EiLB5Num8WWxXT9wPrzKwlMbCu7gf8SeB4+H6kdtaBdWWg0WthZh2AocBx7v5eCjEWSqPXwt1/\nlnlsRrRDnF6GyQFy+4z8E9jLzJqb2epEo+TEAsdZCLlci+lAd4BMnXsn4P2CRlk4Rv0l5ybdN4uq\nBOEaWPe9XK4FcDHQFrgt8815sbt3SS/qZOR4LX7wIwUPskBy/IxMMrMXgPFADTDI3d9NMexE5Ph3\ncSVwT63un+e7+2cphZwYMxsCVAHrmNlHwKVAS1byvqmBciIiklWxVTGJiEiRUIIQEZGslCBERCQr\nJQgREclKCUJERLJSghARkayUIETywMzOyEylXGNmbdOORyQflCBE8mMk0I0YuStSFopqJLVIscus\nr/488AYxId47wPHu/lZmfyVPEihlRiUIkRW3BTDA3bcGvgJOTzkekUQoQYisuI/cfXTm+f1U9jTr\nUsaUIERWntfzXKSkKUGIrLgOZrZr5vnRRAP1Mg1NuSxSUpQgRFbcZOAMM3sXaAMMNLOzzOxjYhnH\nt8ys7hTkIiVH032LrIBML6an3X3btGMRSZpKECIrTt+qpCKoBCEiIlmpBCEiIlkpQYiISFZKECIi\nkpUShIiIZKUEISIiWSlBiIhIVv8flHc5+jhqubUAAAAASUVORK5CYII=\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x105969be0>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import math\n",
"p = np.linspace(0.01, 0.99, 100, endpoint=True)\n",
"\n",
"def entropy(p):\n",
" q = 1 - p\n",
" return -(p*np.log2(p) + q*np.log2(q))\n",
"\n",
"h = entropy(p)\n",
"plt.plot(p, h)\n",
"plt.title(\"Entropy\")\n",
"plt.xlabel(\"p1\")\n",
"plt.ylabel(\"H\")\n",
"plt.ylim(0, 1.2)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that entropy is maximized at $p = 0.5$. A value of $p = 0.5$ means we are $50\\%$ certain that a sample belongs to class, thus the uncertainty is great. However, at $p = 0.9$, we are $90\\%$ certain, so entropy is low. At $p = 0.1$, although we are only $10\\%$ sure that the sample belongs to one class, but that also means we are $90\\%$ sure it belongs to the other, thus our certainty is still high.\n",
"\n",
"What is the intuition behind the entropy equation? Let's say we have an item belonging to a set of classes with a probability distribution, and we would like to guess what class that item belongs to by asking whether an item belongs to class $i$. To minimize the questions we need to ask, we would start by asking whether the item belongs to a class in decreasing class probability. The expression $\\frac{1}{p_{i}}$ captures the number of questions we need to ask according to this scheme to discover the item belonging to class $i$. Thus, we will need to ask more questions to get to the class occuring with low probability. Now it is clear that entropy equation represents the expected number of questions we need to ask given a certain class probability distribution to identify the class of the item. \n",
"\n",
"Lastly, it is worthy to note that given the number of classes $m$, the values that entropy can take on is $H \\in [0,\\:log_{2}\\:m]$. Also, when values of entropy are only used for comparison, it does not matter the base of the log, since changing the base only changes the value of entropy by a constant factor according to the mathematical rule: \n",
"\n",
"$$log_{b}a = \\frac{log_{c}a}{log_{c}b}$$"
]
},
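{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check of the formula above, here is a minimal sketch (added for illustration; `entropy_from_probs` is a hypothetical helper, separate from the implementation later in this notebook) that computes the entropy of an arbitrary discrete distribution and confirms the maximum value $log_{2}\\:m$ for a uniform distribution over $m$ classes.\n",
"\n",
"```\n",
"import numpy as np\n",
"\n",
"def entropy_from_probs(probs, base=2):\n",
"    # entropy of a discrete distribution given as a list of probabilities\n",
"    probs = np.asarray(probs, dtype=float)\n",
"    probs = probs[probs > 0]  # treat 0 * log(0) as 0\n",
"    return -np.sum(probs * np.log(probs)) / np.log(base)\n",
"\n",
"print(entropy_from_probs([0.5, 0.25, 0.25]))  # 1.5 bits\n",
"print(entropy_from_probs([1/3, 1/3, 1/3]))    # log2(3) = 1.585, the maximum for m = 3\n",
"```"
]
},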
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Optimization\n",
"With the definition of entropy set, we now state that the goal of each decision in a decision tree is to reduce entropy. Given a collection of classes, a decision should divide the collection into two groups such that the average entropy of the two groups is lower than the original entropy in the collection. \n",
"\n",
"In the implementation of this tutorial, our simple decision tree will have the following features:\n",
"\n",
"+ At each node, it chooses the attribute (feature) and attribute value that maximizes the information gain from splitting. \n",
" + Categorical attributes: pick a class that we will denote $a$ , then calculate for each category the number of samples of that category with class $a$ divided by the total number of samples with class $a$. Sort categories according to the proportion and choose the best split point going from left to right. \n",
" + Continuous attributes: sort by attribute value, then choose best split point going from left to right only at attribute values where the class changes from the current sample to the next sample. Choosing split points where class changes will always yield lower entropy than choosing split points from a stretch of sorted attribute values of samples with the same class. \n",
"+ Once a parent node chooses an attribute, it will never be used by its children.\n",
"+ Stopping criteria: the rationale for a stopping criteria is to prevent the tree from growing too deep. This prevents the pitfall of a decision tree to overfit and thus not generalize well to other datasets. \n",
" + Entropy at a node is less than $p\\%$ of the maximum entropy while having less than $n$ samples.\n",
" + All labels at a node are the same.\n",
" + No more attributes left to split on.\n",
"+ The class outputted by leaf nodes will be the majority class of samples that \"arrive\" at the node. \n"
]
},
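{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the split-selection step concrete, here is a minimal sketch (added for illustration; `information_gain` and the toy data are hypothetical and separate from the implementation below) of scoring one candidate threshold on a continuous attribute by information gain.\n",
"\n",
"```\n",
"import numpy as np\n",
"\n",
"def label_entropy(labels):\n",
"    # entropy of an array of class labels\n",
"    _, counts = np.unique(labels, return_counts=True)\n",
"    p = counts / counts.sum()\n",
"    return -np.sum(p * np.log2(p))\n",
"\n",
"def information_gain(values, labels, threshold):\n",
"    # parent entropy minus the weighted average entropy of the two children\n",
"    left, right = labels[values <= threshold], labels[values > threshold]\n",
"    n = len(labels)\n",
"    children = (len(left) / n) * label_entropy(left) + (len(right) / n) * label_entropy(right)\n",
"    return label_entropy(labels) - children\n",
"\n",
"values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])\n",
"labels = np.array([0, 0, 0, 1, 1, 1])\n",
"print(information_gain(values, labels, threshold=3.0))  # 1.0, a perfect split\n",
"```"
]
},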
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4. Ensembling\n",
"Ensembling is the method of combining multiple weak classsifiers to form one strong classifier. Why does this work? Ensembling is inspired by the \"wisdom of crowd\" concept best explained [here](http://www.npr.org/2015/08/20/432978431/wighty-issue-cow-guessing-game-helps-to-explain-the-stock-market). \n",
"\n",
"Ensembling with decision trees gives rise to random forests. In our random forest implementation, we shall perform:\n",
"\n",
"## 4.1 Bagging\n",
"Bagging: generate bootstrap replicates of a training set. Each one of these replicates will be used to construct a decision tree. What is a bootstrap? Pretend we have $n$ training samples, a boostrap is a random sampling with replacement of $n$ samples from our original $n$ training samples. It is often viewed as a way to artificially generate different datasets. On average, how much of the original dataset is represented in each boostrap? Let us utilize the concepts of probability. The probability of a sample not getting selected to a bootstrap is:\n",
"\n",
"> $$p = 1 - \\frac{1}{n}$$\n",
"\n",
"> As each sample has a $\\frac{1}{n}$ chance of getting selected in each sampling attempt while constructing the boostrap, the probability that a sample will never make it to a boostrap is\n",
"\n",
"> $$p = \\big(1 - \\frac{1}{n}\\big)^{n} \\approx \\frac{1}{e} \\approx 0.368$$\n",
"\n",
"> The probability that a sample will appear at least once in a boostrap is thus $1 - 0.368 = 0.632$. Taking a frequentist interpretation to probability, we can expect each bootstrap to contain about $0.632\\%$ of the original dataset with replicates.\n",
"\n",
"## 4.2 Randomness\n",
"Even with slightly different datasets, given that the attributes to choose from are approximately the same at each node, each tree in the random forest will be correlated and this result will counter to our goal of introducing randomness to our ensembling model. To remedy this, we will randomly select a $p\\%$ of attributes to consider at each node. \n",
"\n",
"## 4.3 Adaboost\n",
"When multiple classifiers are trained, adaboost assigns increasing importance to samples that were misclassified in the previous classifier, with the hope of increasing accuracy in later classifiers.Although we will not implement adaboost in our random forest, we can walk through an example training dataset. When ensembling with multiple classifiers, adaboost assigns appropriate weights to each sample for each successive training dataset and assigns a weight to each classifier. Specifically:\n",
"\n",
"1. Misclassified samples from the previous classifier are assigned more weight(importance) for the next classifier\n",
"2. After all classifiers are trained, a weight is assigned to each classifier\n",
"\n",
"To formalize the adaboost algorithm, suppose the training dataset consists of $m$ samples with labels $y \\in \\{-1, +1\\}$. The adaboost algorithm starts by initiating every sample with weight $w_{i} = \\frac{1}{m}$. Then the algorithm runs as follows assuming a total of $T$ different classifiers.\n",
"\n",
"for t = classifier 1 to t = classifer $T$:\n",
"\n",
"1. Train classifier $h_{t}$ and compute loss $\\epsilon_{t}$ based of $w$\n",
"2. Compute hypothesis weight $\\alpha_{t} = \\frac{1}{2}ln\\Big(\\frac{1 - \\epsilon_{t}}{\\epsilon_{t}}\\Big)$\n",
"3. Update weights for each training sample $x_{i}$:\n",
"\n",
"$$w_{t+1}(i)=\\frac{w_{t}(i)}{z_{t}}\\begin{cases}\n",
" e^{-\\alpha_{t}}, & \\text{if $h_{t}(x_{i}) = y_{i}$}\\\\\n",
" e^{\\alpha_{t}}, & \\text{if $h_{t}(x_{i}) \\ne y_{i}$}\\\\\n",
" \\end{cases}$$\n",
" $$\\text{Where $z_{t}$ is normalization factor for classifer $t$}$$\n",
" \n",
"Finally, output:\n",
"\n",
"$$H(x) = sign\\Big(\\sum_{t=1}^{T}\\alpha_{t}h_{t}(x)\\Big)$$\n",
"\n",
"Where $sign(x)$ equals 1 if $x > 0$ and equals -1 if $x < 0$. \n",
"\n",
"To see a runthrough of an example dataset, click [here](http://faculty.utpa.edu/kimd1/class/csci4352-sp16/17_AdaBoostExample.pdf). Note that for this tutorial, $I(x)$ equals -1 if $x < 0$ and equals 1 if $x > 0$. "
]
},
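{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving on, here are two minimal sketches added for illustration; the helper names are hypothetical and separate from the implementation below. The first empirically checks the $63.2\\%$ bootstrap coverage figure from Section 4.1:\n",
"\n",
"```\n",
"import numpy as np\n",
"\n",
"n = 10000\n",
"bootstrap = np.random.choice(n, size=n, replace=True)\n",
"print(len(np.unique(bootstrap)) / n)  # roughly 0.632\n",
"```\n",
"\n",
"The second sketch performs one round of the adaboost weight update shown above, given a vector of current weights, a boolean mask of correctly classified samples, and the weighted error of the current classifier:\n",
"\n",
"```\n",
"import numpy as np\n",
"\n",
"def adaboost_update(w, correct, eps):\n",
"    # w: current sample weights; correct: mask of h_t(x_i) == y_i; eps: weighted error\n",
"    alpha = 0.5 * np.log((1 - eps) / eps)\n",
"    w_new = w * np.where(correct, np.exp(-alpha), np.exp(alpha))\n",
"    return w_new / w_new.sum(), alpha  # dividing by the sum is the z_t normalization\n",
"\n",
"w = np.full(5, 1 / 5)\n",
"correct = np.array([True, True, True, True, False])\n",
"w, alpha = adaboost_update(w, correct, eps=0.2)\n",
"print(w)  # the single misclassified sample now carries half of the total weight\n",
"```"
]
},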
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5. Implementation\n",
"With these specifications we are ready to implement our simple decision tree and random forest algorithms! We will have single class called `DecisionTree` that can train a single decision tree classifier or a random forests classifier based on user input. Here is the usage of initiating a `DecisionTree` class:\n",
"\n",
"```\n",
"classifier = DecisionTree(method=\"simple_tree\", num_trees=None, stop=0.3, minSize=100, subset=None)\n",
"```\n",
"\n",
"Here are the parameters the user can specify:\n",
"\n",
"+ method: simple_tree or random_forest.\n",
"+ num_trees: if training a random forest, the number of trees in a random forest.\n",
"+ stop: a percentage indicating percentage of maximum entropy (determined by number of classes) to stop tree growth.\n",
"+ minSize: a number indicating the size below which nodes with samples with entropy below a fraction of maximum entropy (provided by `stop`) would stop growing.\n",
"+ subset: a percentage indicating percentage of attributes to randomly choose from at each node during the training of random forests.\n",
"\n",
"What that in mind here is our implementation!"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import copy\n",
"import math\n",
"from collections import Counter\n",
"import pandas as pd\n",
"import numpy as np\n",
"import sys\n",
"pd.options.mode.chained_assignment = None\n",
"\n",
"class DecisionTree:\n",
"\n",
" def __init__(self, method=\"simple_tree\", num_trees=None, stop=0.3, \n",
" minSize=10, subset=None):\n",
" self.method = method\n",
" self.num_trees = num_trees\n",
" self.stop = stop\n",
" self.minSize = minSize\n",
" self.subset = subset\n",
"\n",
" def train(self, data, labels):\n",
" def simple_tree(data, labels):\n",
" '''Method for building a simple decision tree'''\n",
" attributes = copy.deepcopy(data.columns.values.tolist())\n",
" data['labels'] = pd.Series(labels, index=data.index)\n",
" self.classes = data.ix[:, 'labels'].unique()\n",
" tree = simple_tree_helper(data, attributes)\n",
" return tree\n",
"\n",
" def simple_tree_helper(data, attributes, random=False):\n",
" '''Recursive method that builds the decision tree'''\n",
" if data.shape[0] > 0 and len(attributes) > 0: \n",
" # If all the labels in this group are the same, \n",
" # then we are done and return the majority label\n",
" if len(data[\"labels\"].unique()) == 1:\n",
" return int(data[\"labels\"].values[0])\n",
" # Else if the group entropy is less than p% of the original \n",
" elif (calculate_entropy(data[\"labels\"].values) < \n",
" self.stop * math.log(len(self.classes), 2)): \n",
" return majority(data[\"labels\"].values)\n",
" elif (data.shape[0] < self.minSize):\n",
" return majority(data[\"labels\"].values)\n",
" else:\n",
" # Tree decorrelation step of random forest: random choose \n",
" # a fraction of attributes to choose from\n",
" if random:\n",
" indices = np.random.choice(len(attributes), \n",
" size=math.ceil(self.subset*len(attributes)), replace=False)\n",
" subset = [attributes[i] for i in indices]\n",
" bestAttribute, splitVal = choose_attribute(data, subset)\n",
" else:\n",
" bestAttribute, splitVal = choose_attribute(data, attributes)\n",
" attributes.remove(bestAttribute)\n",
" if isinstance(splitVal, list):\n",
" left = data[data[bestAttribute].isin(splitVal[0])]\n",
" right = data[data[bestAttribute].isin(splitVal[1])]\n",
" else:\n",
" left = data[data[bestAttribute] <= splitVal]\n",
" right = data[data[bestAttribute] > splitVal]\n",
" tree = BinaryTree((bestAttribute, splitVal))\n",
" if len(left) == 0 or len(right) == 0:\n",
" return majority(data[\"labels\"].values)\n",
" tree.insertLeft(simple_tree_helper(left, attributes))\n",
" tree.insertRight(simple_tree_helper(right, attributes))\n",
" return tree\n",
" elif data.shape[0] > 0 and len(attributes) == 0:\n",
" return majority(data[\"labels\"].values)\n",
" else:\n",
" return None\n",
"\n",
" def random_forest(data, labels):\n",
" '''Method for building a random forest'''\n",
" trees = []\n",
" attributes = copy.deepcopy(data.columns.values.tolist())\n",
" attributesTemp = copy.deepcopy(attributes)\n",
" data['labels'] = pd.Series(labels, index=data.index)\n",
" self.classes = data.ix[:, 'labels'].unique()\n",
" for i in range(self.num_trees):\n",
" # Bagging: randomly picking 2/3 of the data for making a simple \n",
" # decision tree\n",
" baggingIdx = np.random.choice(data.shape[0], size=data.shape[0], \n",
" replace=True)\n",
" bag = data.iloc[baggingIdx, :]\n",
" tree = simple_tree_helper(bag, attributesTemp, True)\n",
" trees.append(tree)\n",
" attributesTemp = copy.deepcopy(attributes)\n",
" return trees\n",
"\n",
" # @ input: data frame, available attributes\n",
" # @ output: attribute and attribute value to split on that maximizes difference \n",
" # between parent entropy and \n",
" # average children entropy\n",
" def choose_attribute(data, attributes):\n",
" bestGain = 0\n",
" bestAttribute = None\n",
" bestSplit = None\n",
" parentEntropy = calculate_entropy(data.ix[:, \"labels\"].values)\n",
" for attribute in attributes:\n",
" if len(data[attribute].unique()) == 1:\n",
" gain = 0\n",
" bestAttribute = attribute\n",
" bestSplit = data[attribute].values[0]\n",
" else:\n",
" gain, splitVal = calculate_gain(data.ix[:, [attribute, \"labels\"]], \n",
" attribute, parentEntropy)\n",
" if gain >= bestGain:\n",
" bestGain = gain\n",
" bestSplit = splitVal\n",
" bestAttribute = attribute\n",
" return bestAttribute, bestSplit\n",
"\n",
" # @ input: data frame, selected attribute, parent entropy\n",
" # @ output: best entropy gain and best split for that attribute\n",
" def calculate_gain(data, attribute, parentEntropy):\n",
" total = data.shape[0]\n",
" minEntropy = float(\"inf\")\n",
" splitVal = None\n",
" bestLeft = None\n",
" bestRight = None\n",
" # If attribute is categorical, sort attribute value by class proportion for \n",
" # that attribute value, and choose best split point going from least class \n",
" # proportion to greatest class proportion\n",
" if type(data[attribute].values[0]) == str: \n",
" label = self.classes[0]\n",
" n = len(data[data[\"labels\"] == label])\n",
" attributeVals = data[attribute].unique()\n",
" proportions = [(val, data[(data[attribute] == val) & \n",
" (data[\"labels\"] == label)].shape[0] / n) for val in attributeVals]\n",
" proportions = sorted(proportions, key=lambda x: x[1])\n",
" if len(proportions) == 1:\n",
" leftLabels = data[\"labels\"].values\n",
" minEntropy = calculate_entropy(leftLabels)\n",
" splitVal = [[proportions[0][0]], []]\n",
" else: \n",
" for i in range(1, len(proportions)):\n",
" leftVal = [tup[0] for tup in proportions[:i]]\n",
" rightVal = [tup[0] for tup in proportions[i:len(proportions)]]\n",
" leftLabels = data[data[attribute].isin(leftVal)][\"labels\"].values\n",
" rightLabels = data[data[attribute].isin(rightVal)][\"labels\"].values\n",
" leftEntropy = calculate_entropy(leftLabels)\n",
" rightEntropy = calculate_entropy(rightLabels)\n",
" avgEntropy = ((len(leftLabels) / total) * leftEntropy \n",
" + (len(rightLabels) / total) * rightEntropy)\n",
" if avgEntropy < minEntropy:\n",
" minEntropy = avgEntropy\n",
" splitVal = [leftVal, rightVal]\n",
" bestLeft = leftEntropy\n",
" bestRight = rightEntropy\n",
" # If attribute is continuous, sort attribute value, then choose best split \n",
" # point from least to greatest only from points corresponding to class \n",
" # label change.\n",
" else: \n",
" data = data.sort_values(attribute, axis=0)\n",
" diff = data[\"labels\"].diff(1).values\n",
" idx = np.where(diff == -1)[0] - 1\n",
" if len(idx) > 50:\n",
" step = len(idx) // 50\n",
" idxindex = list(range(0, len(idx) + step, step))\n",
" idxindex = idxindex[:len(idxindex) - 1]\n",
" idx = idx[idxindex]\n",
" if len(idx) == 0:\n",
" minEntropy = calculate_entropy(data[\"labels\"].values)\n",
" splitVal = data[attribute].values[0]\n",
" else: \n",
" attributeVals = data.iloc[idx, :][attribute].unique()\n",
" if len(attributeVals) > 1:\n",
" end = len(attributeVals) - 1\n",
" else:\n",
" end = 1\n",
" for val in attributeVals[:end]:\n",
" leftLabels = data[data[attribute] <= val][\"labels\"].values\n",
" rightLabels = data[data[attribute] > val][\"labels\"].values\n",
" leftEntropy = calculate_entropy(leftLabels)\n",
" rightEntropy = calculate_entropy(rightLabels)\n",
" if rightEntropy == None:\n",
" avgEntropy = leftEntropy\n",
" else:\n",
" avgEntropy = ((len(leftLabels) / total) * leftEntropy + \n",
" (len(rightLabels) / total) * rightEntropy)\n",
" if avgEntropy < minEntropy:\n",
" minEntropy = avgEntropy\n",
" splitVal = val\n",
" return parentEntropy - minEntropy, splitVal\n",
"\n",
" def calculate_entropy(data):\n",
" '''Calculate entropy of given list of class labels'''\n",
" if not isinstance(data, np.ndarray):\n",
" data = np.array(data)\n",
" n = len(data)\n",
" if n == 0:\n",
" return None\n",
" entropy = 0.0\n",
" for label in self.classes:\n",
" subset = np.where(data == label)[0]\n",
" p = len(subset)/n + 1e-6\n",
" entropy -= p * math.log(p, 2)\n",
" return entropy\n",
"\n",
" def majority(labels):\n",
" '''Returns the majority label'''\n",
" a = Counter(labels)\n",
" vote = a.most_common(1)[0][0]\n",
" return int(vote)\n",
"\n",
" class BinaryTree:\n",
" def __init__(self, decision):\n",
" self.decision = decision\n",
" self.leftChild = None\n",
" self.rightChild = None\n",
" def insertLeft(self, newNode):\n",
" self.leftChild = newNode\n",
" def insertRight(self, newNode):\n",
" self.rightChild = newNode\n",
" def getRightChild(self):\n",
" return self.rightChild\n",
" def getLeftChild(self):\n",
" return self.leftChild\n",
" def setNode(self, decision):\n",
" self.decision = decision\n",
" def getNode(self):\n",
" return self.decision\n",
"\n",
" if self.method == \"random_forest\": \n",
" trees = random_forest(data, labels)\n",
" self.tree = trees\n",
" elif self.method == \"simple_tree\": \n",
" tree = simple_tree(data, labels)\n",
" self.tree = tree\n",
" def predict(self, dataset):\n",
" def transverse_classify(tree, sample):\n",
" if isinstance(tree, int):\n",
" return int(tree)\n",
" else:\n",
" decision = tree.getNode()\n",
" attribute = decision[0]\n",
" splitVal = decision[1]\n",
" sampleVal = sample[attribute]\n",
" if isinstance(splitVal, list):\n",
" if sampleVal in splitVal[0]:\n",
" tree = tree.getLeftChild()\n",
" else:\n",
" tree = tree.getRightChild()\n",
" else:\n",
" if sampleVal <= splitVal:\n",
" tree = tree.getLeftChild()\n",
" else:\n",
" tree = tree.getRightChild()\n",
" return transverse_classify(tree, sample)\n",
"\n",
" def majority(labels):\n",
" '''Returns the majority label'''\n",
" a = Counter(labels)\n",
" vote = a.most_common(1)[0][0]\n",
" return int(vote)\n",
"\n",
" tree = self.tree\n",
" predictions = []\n",
" if self.method == \"simple_tree\":\n",
" for rownum in range(dataset.shape[0]):\n",
" predict = transverse_classify(tree, dataset.iloc[rownum, :])\n",
" predictions.append(predict)\n",
" # In random forest, the majority vote among all trees is the prediction\n",
" elif self.method == \"random_forest\":\n",
" for rownum in range(dataset.shape[0]):\n",
" ballot = []\n",
" for t in tree:\n",
" predict = transverse_classify(t, dataset.iloc[rownum, :])\n",
" ballot.append(predict)\n",
" vote = majority(ballot)\n",
" predictions.append(vote)\n",
" return predictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset we will be testing our implementation on is going to be a census dataset with features such as age, workclass, education, race...etc. The class we are going to predict is whether a person given a set of features earns greater than $50K a year, which is represented in binary. However, the dataset has a few missing values represented by \"?\", which means either samples with missing values have to be disgarded or imputed. We will take the latter direction and crudely impute with the majority value of that attribute. Let us define methods for imputing missing values."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def fill_missing(data, attributes):\n",
" for attribute in attributes:\n",
" if \"?\" in data[attribute].unique().tolist():\n",
" mode = majority(data[attribute].values)\n",
" indices = data[data[attribute] == \"?\"].index.tolist()\n",
" data.ix[indices, attribute] = mode\n",
" return data\n",
"\n",
"def majority(labels):\n",
" '''Returns the majority label'''\n",
" a = Counter(labels)\n",
" vote = a.most_common(1)[0][0]\n",
" return vote"
]
},
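{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of `fill_missing` (on a hypothetical toy frame, not the census data loaded below), each \"?\" is replaced by the most common value of its column:\n",
"\n",
"```\n",
"toy = pd.DataFrame({\"workclass\": [\"Private\", \"?\", \"Private\", \"State-gov\"]})\n",
"print(fill_missing(toy, [\"workclass\"])[\"workclass\"].tolist())\n",
"# ['Private', 'Private', 'Private', 'State-gov']\n",
"```"
]
},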
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let us load our dataset, split into training and test datasets according, and build and evaluate a decision tree and random forest respectively."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Simple decision tree\n",
"Accuracy: 0.798655667583\n",
"Random Forest\n",
"Accuracy: 0.762603116407\n"
]
}
],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"def evaluate_accuracy(predictions, validationLabel):\n",
" n = len(predictions)\n",
" return sum(predictions == validationLabel) / n\n",
"\n",
"data = pd.read_csv(\"datasets/data.csv\")\n",
"attributes = data.columns.values\n",
"data = fill_missing(data, attributes)\n",
"n = data.shape[0]\n",
"randomIdx = np.random.permutation(n)\n",
"data = data.iloc[randomIdx, :]\n",
"trainData = data.iloc[:9*n//10, :]\n",
"validationData = data.iloc[9*n//10:n, :]\n",
"trainLabel = trainData.ix[:, \"label\"].values\n",
"validationLabel = validationData.ix[:, \"label\"].values\n",
"trainData2 = copy.deepcopy(trainData)\n",
"validationData2 = copy.deepcopy(validationData)\n",
"trainLabel2 = copy.deepcopy(trainLabel)\n",
"validationLabel2 = copy.deepcopy(validationLabel)\n",
"del trainData[\"label\"]\n",
"del validationData[\"label\"]\n",
"print(\"Simple decision tree\")\n",
"classifier = DecisionTree()\n",
"classifier.train(trainData, trainLabel)\n",
"predictions = np.array(classifier.predict(validationData))\n",
"print(\"Accuracy: \", evaluate_accuracy(predictions, validationLabel))\n",
"print(\"Random Forest\")\n",
"classifier = DecisionTree(method=\"random_forest\", num_trees=200, subset=0.1)\n",
"classifier.train(trainData2, trainLabel2)\n",
"predictions = np.array(classifier.predict(validationData2))\n",
"print(\"Accuracy: \", evaluate_accuracy(predictions, validationLabel2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In general, ensembling should increase validation accuracy because the model being built is more generalizable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 6. Multiway Splits and Gain Ratio\n",
"A decision tree does not have to be binary, it can alternatively also perform multi-way splits. Let us look at an example of constructing a decision tree to predict whether or not a person is likely to go golfing based on the weather. Here is a table of the training data:\n",
"\n",
"<img src=\"http://i.imgur.com/bC70gYw.png\", width=500, height=500>\n",
"\n",
"The attribute outlook has three attributes - sunny, overcast and rain. If the outlook attributed is chosen at a node then a natural way to split the data is a three-way split based on whether a sample has sunny, overcast, or rain as the outlook. Here is what a multiway split decision tree might look like.\n",
"\n",
"<img src=\"http://i.imgur.com/qJS7SN9.png\", width=500, height=500>\n",
"\n",
"However, notice a problem if the tree splits based on maximum information gain - it is biased to choose the attribute `Day` to split on, because each sample will become a leaf, resulting in zero entropy for the children. This corresponds to complete information gain. However, while splitting on `Day` may result in perfect accuracy on the training dataset, dates is not generalizable to data in the testset, and thus is a poor attribute to choose. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To prevent this bias, we could have the multiway decision tree split based on which attribute will result in the highest gain ratio. Before defining gain ratio, let us first define the intrinsic information for an attribute, which is:\n",
"\n",
"$$\\text{Intrinsic Information} = -\\sum_{i=1}^{n} p_{i}log_{2}\\:p_{i}$$\n",
"\n",
"Where $n$ is the number of distinct values for an attribute. The gain ratio is now defined as:\n",
"\n",
"$$\\text{Gain Ratio} = \\frac{\\text{Information Gain}}{\\text{Intrinsic Information}}$$\n",
"\n",
"We can view the gain ratio as the normalized version of information gain. Let us walk through an example calculation of the gain ratio for `Outlook` based on the table provided above:\n",
"\n",
"$$\\text{Intrinsic Information} = -\\big(\\frac{5}{14}log_{2}\\frac{5}{14} + \\frac{4}{14}log_{2}\\frac{4}{14} + \\frac{5}{14}log_{2}\\frac{5}{14}\\big) = 1.577$$\n",
"\n",
"$$\\text{Parent Entropy} = -\\big(\\frac{9}{14}log_{2}\\frac{9}{14} + \\frac{5}{14}log_{2}\\frac{5}{14}\\big) = 0.940$$\n",
"\n",
"$$\\text{Child Entropy} = -\\frac{5}{14}\\big(\\frac{2}{5}log_{2}\\frac{2}{5} + \\frac{3}{5}log_{2}\\frac{3}{5}\\big) - \\frac{4}{14}\\big(0\\big) - \\frac{5}{14}\\big(\\frac{2}{5}log_{2}\\frac{2}{5} + \\frac{3}{5}log_{2}\\frac{3}{5}\\big) = 0.693$$\n",
"\n",
"$$\\text{Information Gain} = \\text{Parent Entropy} - \\text{Child Entropy} = 0.247$$\n",
"\n",
"$$\\text{Gain Ratio} = \\frac{\\text{Information Gain}}{\\text{Intrinsic Information}} = \\frac{0.247}{1.577} = 0.157$$"
]
},
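{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a cross-check of the numbers above, here is a minimal sketch (added for illustration; `entropy_from_counts` is a hypothetical helper) that recomputes the intrinsic information, information gain, and gain ratio for `Outlook` from the counts in the table.\n",
"\n",
"```\n",
"import numpy as np\n",
"\n",
"def entropy_from_counts(counts):\n",
"    p = np.asarray(counts, dtype=float)\n",
"    p = p / p.sum()\n",
"    p = p[p > 0]\n",
"    return -np.sum(p * np.log2(p))\n",
"\n",
"# Outlook values: 5 sunny, 4 overcast, 5 rain; labels overall: 9 yes, 5 no\n",
"intrinsic = entropy_from_counts([5, 4, 5])   # 1.577\n",
"parent = entropy_from_counts([9, 5])         # 0.940\n",
"# per-value (yes, no) counts: sunny (2, 3), overcast (4, 0), rain (3, 2)\n",
"child = ((5/14) * entropy_from_counts([2, 3]) + (4/14) * entropy_from_counts([4, 0])\n",
"         + (5/14) * entropy_from_counts([3, 2]))\n",
"gain = parent - child                        # 0.247\n",
"print(round(gain / intrinsic, 3))            # about 0.156; the 0.157 above uses rounded intermediates\n",
"```"
]
},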
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 7. Gini Index\n",
"An alternative measure to minimizing entropy is to minimize the gini index. The gini index can be interpreted as the probability of misclassifying a randomly chosen sample if it were labelled according to the probability distribution of the labels. We derive the gini index formula based off this definition, where $p_{i}$ is the probability of choosing choosing a sample of class $i$ and $J$ is the number of distinct classes.\n",
"\n",
"$$\\text{Gini Index} = \\sum_{i=1}^{J} p_{i}(1-p_{i}) = \\sum_{i=1}^{J}(p_{i}-p_{i}^{2}) = \\sum_{i=1}^{J}p_{i} - \\sum_{i=1}^{J}p_{i}^{2} = 1 - \\sum_{i=1}^{J}p_{i}^{2}$$\n",
"\n",
"Let us plot the gini index values as $p_{1},...,p_{J}$ vary with $J = 2$. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYgAAAEZCAYAAACNebLAAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xu8XPO5x/HPkxDqXglRURESVSFUiPS47Uiw45aqW6RV\n1CVVFHXacI6e7KNaHA7aUkTVEUqKFCl1Z9OQSNwSJJEQjSRIJJJIIs31OX88k+xtd+1k9mSvWTOz\nv+/Xa17msvaax3pN5jvrd1vm7oiIiDTUKusCRESkNCkgREQkkQJCREQSKSBERCSRAkJERBIpIERE\nJJECQlocM7vFzP6zubctoI5VZrZzGvsWaQ6meRBSacysP3ARsAewCPgAGOrutzTz+5wGnOXuBxX4\n9yuBLu4+tTnrEmkuOoOQimJmlwA3ANcA7d19O+BHwL+Z2YYpvOX6/MKyZqtCJAUKCKkYZrYF8N/A\nue7+kLsvBnD3ce5+qrsvz213p5ldkbt/iJlNN7OfmtksM5tpZqfX2+eabfN4/w/M7BIzG2dm88zs\nPjNrU+/1n5nZR2Y2w8zOoF64mFkbM7vOzKaZ2ce5pq2Ncq/93MxGm1mr3ONzzeyt+vsWSYMCQirJ\nt4E2wIgm/t12wObA9sBZwM1mtmWBNZwIHA50AvYCTgcws2rgp0BvoAvQp8HfXQN0Brrl/rs98F+5\n164F/glcbmadgV8B33P3ZQXWKJIXBYRUknbAHHdftfoJM3sp92v+CzM7sJG/Wwb80t1XuvvjRL/F\nNwqs4TfuPsvd5wN/BfbOPX8icKe7T3T3JUANX25iOhu42N0X5M58rgZOAfDoKDwNuJAIv6vdfXyB\n9YnkbYOsCxBpRnOBdmbWanVIuPsBAGY2ncZ/EM2tHyrAF8BmBdYwq8F+vpa7vz3war3Xpq2+Y2bb\nAJsAr5mtyYxW1AsQd59mZs8DfYHfF1ibSJPoDEIqyShgKdAv60ISfAx8vd7jjtT1QcwhwqSru2+d\nu23l7muauczsKKIJ7VnguiLVLC2cAkIqhrsvAK4Afm9mx5vZZhb2Jn6hZ+l+4HQz+6aZbUJd/8Lq\nJqTbgRtzZxOYWQczOzx3v13u9R8SfRpHm1nfItcvLZACQiqKu19LdAb/HPgkd7sl9/jlfHdT6Nuv\npa4ngBuB54DJxJlAfYOA94DRZjYfeArYNffabcBD7v6ku39GdKTfbmZfLbBOkbykOlHOzO4AjgZm\nuXu3hNcHEP8wABYSwxPfSq0gERHJW9pnEHcCR6zl9anAwe6+F3AlcRotIiIlINVRTO4+0sw6ruX1\n0fUejgY6pFmPiIjkr5T6IM4CHs+6CBERCSUxD8LMegFnAI1NZBIRkSLLPCDMrBswBKh293lr2U7L\nzoqIFMDdC1oYshhNTEYjq1aa2Y7AcOBUd39/XTtyd93cGTx4cOY1lMpNx0LHQsdi7bf1keoZhJnd\nC1QBbc3sQ2AwsZiau/sQ4BfA1sTEJgOWu3uPNGsSEZH8pD2KacA6Xj+bWKRMRERKTCmNYpI8VVVV\nZV1CydCxqKNjUUfHonmUzSVHzczLpVYRkVJhZngJd1KLiEgZUkCIiEgiBYSIiCRSQIiISCIFhIiI\nJFJAiIhIIgWEiIgkUkCIiEgiBYSIiCRSQIiISCIFhIiIJFJAiIhIIgWEiIgkUkCIiEgiBYSIiCRS\nQIiISCIFhIiIJFJAiIhIIgWEiIgkUkCIiEgiBYSIiCRSQIiISCIFhIiIJFJAiIhIIgWEiIgkUkCI\niEiiVAPCzO4ws1lmNn4t2/zWzKaY2Ztmtnea9YiISP7SPoO4EziisRfNrC+wi7t3AQYCt6Zcj4iI\n5CnVgHD3kcC8tWzSDxia2/YVYEsza59mTSIikp+s+yA6ANPrPZ6Ze05ERDKWdUCIiEiJ2iDj958J\nfL3e4x1yzyWqqalZc7+qqoqqqqq06hIRKUu1tbXU1tY2y77M3ZtlR42+gdlOwF/dfc+E144EznP3\no8ysJ3Cju/dsZD+edq0iIpXGzHB3K+RvUz2DMLN7gSqgrZl9CAwG2gDu7kPc/W9mdqSZvQcsBs5I\nsx4REclf6mcQzUVnECIiTbc+ZxDqpBYRkUQKCBERSaSAEBGRRAoIERFJpIAQEZFECggREUmkgBAR\nkUQKCBERSaSAEBGRRAoIERFJpIAQEZFECggREUmkgBARkUQKCBERSaSAEBGRRAoIERFJpIAQEZFE\nCggREUmkgBARkUQKCBERSaSAEBGRRAoIERFJpIAQEZFECggREUmkgBARkUQKCBERSaSAEBGRRAoI\nERFJpIAQEZFEqQeEmVWb2SQzm2xmgxJe38LMRpjZm2b2lpmdnnZNIiKybubu6e3crBUwGegNfASM\nBfq7+6R621wGbOHul5lZO+BdoL27r2iwL0+zVhGRSmRmuLsV8rdpn0H0AKa4+zR3Xw4MA/o12MaB\nzXP3NwfmNgwHEREpvrQDogMwvd7jGbnn6rsJ2N3MPgLGARemXJOIiORhg6wLAI4A3nD3Q81sF+Bp\nM+vm7osablhTU7PmflVVFVVVVUUrUkSkHNTW1lJbW9ss+0q7D6InUOPu1bnHlwLu7tfU2+ZR4Cp3\nfyn3+FlgkLu/2mBf6oMQEWmiUu6DGAt0NrOOZtYG6A+MaLDNNKAPgJm1B3YFpqZcl4iIrEOqTUzu\nvtLMzgeeIsLoDnefaGYD42UfAlwJ/J+Zjc/92c/d/bM06xIRkXVLtYmpOamJSUSk6Uq5iUlERMqU\nAkJERBIpIEREJJECQkREEikgREQkkQJCREQSKSBERCSRAkJERBKtMyDMbOOE59qlU46IiJSKfM4g\nxuYW3QPAzI4HXk6vJBERKQX5rMU0APijmdUC2wNtgUPTLEpERLKX11pMZvYd4G5gIXCwu7+XdmEJ\nNWgtJhGRJlqftZjWeQZhZncAuwDdiKW4HzWz37n7zYW8oYiIlId8+iDeAnq5+wfu/iSwP7BPumWJ\niEjW8m1i+gqwo7u/m35JjdagJiYRkSZKdblvMzsGeBN4Ivd4bzNreFU4ERGpMPk0MdUAPYD5AO7+\nJrBzijWJiEgJyCcglrv7ggbPrUqjGBERKR35zIN4x8wGAK3NrAvwEzRRTkSk4uVzBnEB0BVYCtwH\nfA5clGZRIiKSvbxGMZUCjWISEWm6VCbKmdlfgUa/kd392ELeUEREysPa+iCuy/33u8B2wD25x6cA\ns9IsSkREsrfOJiYze9Xd913Xc2lTE5OISNOlOlEO2NTM1sx7MLNOwKaFvJmIiJSPfIa5XgzUmtlU\nwICOwMBUqxIRkczluxbTRsBuuYeT3H1pqlUl16AmJhGRJlqfJqZ8A+LfgJ2od8bh7kMLecNCKSBE\nRJou7cX67iZGNB0I7Je75d1BbWbVZjbJzCab2aBGtqkyszfM7G0zez7ffYuISHryGcU0Edi9kJ/v\nZtYKmAz0Bj4CxgL93X1SvW22JJbuONz
dZ5pZO3efk7AvnUGIiDRR2qOY3ibmQRSiBzDF3ae5+3Jg\nGNCvwTYDgOHuPhMgKRxERKT48hnF1A6YYGZjiPWYgLxnUncAptd7PIMIjfp2BTbMNS1tBvzW3e/O\nY98iIpKifAKipgg17AMcSsyvGGVmo9z9vX8ppKaulKqqKqqqqlIuTUSkvNTW1lJbW9ss+0p1sT4z\n6wnUuHt17vGlgLv7NfW2GQRs7O7/nXv8B+Bxdx/eYF/qgxARaaJU+iDMbKGZfZ5wW2hmn+e5/7FA\nZzPraGZtgP5Aw8uVPgIcaGatzWwTYH9gYiH/MyIi0nwabWJy983Xd+fuvtLMzgeeIsLoDnefaGYD\n42Uf4u6TzOxJYDywEhji7hPW971FRGT96HoQIiIVLO1hriIi0gIpIEREJJECQkREEq3tkqMj3f1A\nM1vIly89akQH8xapVyciIplRJ7WISAVbn07qfGZSY2atgfZ8ebnvDwt5QxERKQ/rDAgzuwAYDMwC\nVuWedqBbinWJiEjG8lnu+z1gf3efW5ySGq1DTUwiIk2U9jyI6cCCQnYuIiLlK58+iKlArZk9xpeX\n+74+tapERCRz+QTEh7lbm9xNRERaAA1zFRGpYKkMczWzG939IjP7K1+eKAfkfUU5EREpU2trYlp9\n2c/rilGISNa++AJmzYLZs2HOHJg3L24LFsDixbBoUWyzdCksXw7LlsGqVXV/bwYbbght2sRtk01g\n001hs81giy3gq1+N29ZbQ/v2sO228bwV9NtOJH1qYpIWwR0+/RTeew+mToUPPojbjBl1t2XL6r64\n27Wr+0Lfcsv4kt9007httFEEwIYbQuvWde+xalVdcCxbFmGyaFGEy4IFdYEzd26E0OzZsGIFdOgA\nO+wQt512gp13hk6doHPneK2VVkyT9bA+TUyNBoSZ9QN2cPebc49fAbbJvfxzd3+wkDcslAJC8jVr\nFrz1Vt1t4kR4990IiS5dYJdd4kt4p53g61+PW4cOEQTF/jW/eDHMnBkBNX06TJsWATZ1KkyZAgsX\nRs277QZ77AHdusGee0LHjjrzkPykFRAvAf3dfXru8ZtAb2BT4E53711gvQVRQEiSuXPhlVdgzBh4\n7bW4/fOf8SW6+rb77vCNb8A225Tfl+qCBTB5coTc6sAbPx6WLIF99oHu3WG//aBnzwg6kYbSCoix\n7r5fvcc3ufv5ufuj3b1nQdUWSAEh7vD++/Dii3EbNQo++SS+IHv0iC/L7t1bxq/rWbPg9dcjEMeO\njWPRpk0ExcEHx23PPb/cBCYtU1oB8Z67d27ktffdfZdC3rBQCoiWafp0ePZZeOYZeO65aI8/5BA4\n6CA44IA4O9CXYITnP/4BL70Ef/97BOgnn0RQ9O4NffrAN79Z+cEp/yqtgPgTUOvutzd4fiBQ5e6n\nFPKGhVJAtAxLl8LIkfC3v8Hjj0dHbu/edbedd9aXXL5mzYLnn49wfeaZ6BDv2zduffrECCqpfGkF\nxLbAw8TyGq/nnu4ObAR8x91nFfKGhVJAVK558yIQHnkEnnoqfumu/iLr3l2jeJqDe/RlPP543EaN\niuaofv3g2GPVf1HJUgmIejs/FOiae/iOuz9XyButLwVEZZkzBx5+GB54IL6sevWKL6ujj45hppKu\nRYvgyScjlB97LIbUnnBC3Dp1yro6aU6pBkSpUECUv4ULIxTuvRdefhmOOCK+kI48MuYZSDaWL4fa\nWnjwQXjooQiIAQPgpJPga1/LujpZXwoIKVkrVsDTT8Ndd0XTxiGHxJfPMcfEpDMpLStWxGCAe++N\ns4t994XTT4fjjouZ4VJ+FBBSciZPhjvugLvvjvbt006Dk0+Gtm2zrkzytWQJjBgR4T5qFBx/PJx1\nFuy/vwYKlBMFhJSEpUth+HC4/XaYMAF+8AM444wYiirl7aOPIuz/8AfYeGM4+2w49dRYikRKmwJC\nMvXhh3DrrXHG0K0bnHNOdDi30dVDKs6qVfDCCzBkCDzxRPQhnXce7L131pVJY9K+5KjIv3CPCVnH\nHw/f+lYsTPfii9HfcOKJCodK1apVjDi77z6YNCnWszrmmJi4OHx49GFI5Uj9DMLMqoEbiTC6w92v\naWS7/YCXgZPd/S8Jr+sMogQsXw733w/XXx9DJS+8MJqSNAqp5VqxIkan3XhjLDr4k59EX4Um4pWG\nkm1iMrNWwGRikb+PgLHEAoCTErZ7GlgC/FEBUXoWLYr25+uvjzHzl1wSE9k0iU3qGzMGbrghziTP\nPjvCQkNls1XKTUw9gCnuPs3dlwPDgH4J210APAjMTrkeaaLPPoOamhgb//LL0Yzw3HNw1FEKB/lX\nPXpE89PYsfGjomtX+NGP4tobUn7S/ifeAZhe7/GM3HNrmNn2xNIdtwAaPFciZs+GSy+NaxFMnx7h\ncP/9sXKqyLp06gS/+11ch6Ndu/jcnHZaPJbysbZLjhbLjcCgeo8bDYmampo196uqqqiqqkqtqJZq\nzhy49toYqtq/fywp3bFj1lVJudpmG7jySvj3f4/AOOggqK6GX/wifnxI86utraW2trZZ9pV2H0RP\noMbdq3OPLwW8fke1mU1dfRdoBywGznH3EQ32pT6IFM2fH8Fw660xoe2yy7SAmzS/zz+H3/wGfvvb\nWHdr8OAYCSXpKeU+iLFAZzPraGZtgP7Al7743X3n3K0T0Q/x44bhIOn54gu4+ur4Nbf6IjS//73C\nQdKxxRZx9jBlSnzGunePjuxZRV0bWvKVakC4+0rgfOAp4B1gmLtPNLOBZnZO0p+kWY/UWbEimpG6\ndIlQGDkyRimpOUmKYaut4Ior4lKqrVrFbPvBg2NBRykdmkndwrjDo4/CoEGxrPa116rjWbI3bRpc\nfnlc2Gjw4JhHsUEp9JBWgJKdB9GcFBDrb/x4uPhi+Phj+J//iaGqWnRNSsnrr8PPfhaf0euvjw5t\nWT8KCFmr2bOj3ffhh+PX2Tnn6NeZlK7VZ7mXXBKTMq+/HnbbLeuqylcpd1JLhlasiNEiXbvCV74S\na+f8+McKByltZrG+09tvx7WzDzoohsl+/nnWlbU8CogK9cILsYjeiBFx/8YbtTSzlJc2beCnP42g\n+OyzuFb5PffEGYYUh5qYKszs2fFrq7Y2Ts2PP179DFIZRo+OpcU33zyGYus6I/lRE5OwahXcdhvs\nsQe0bx8X7DnhBIWDVI6ePWMxwBNOiEvX/sd/xDweSY/OICrAhAmxcqZ7zITu1i3rikTS9fHHMSJv\n7Nj4YdSnT9YVlS6dQbRQS5fGqKRDDoHvfz8muykcpCX42tdg2LBY3+nMM2MhwLlzs66q8iggytQr\nr0Qn9JtvwhtvwLnnavltaXmOPBLeeScGYOyxBzzwQNYVVRY1MZWZJUtiTsM998SiZyedpH4GEYhO\n7DPOiGHdN98cfXGiJqYWY9SouDj89Onw1lux6qrCQST07Bln0126RFPrsGFZV1T+dAZRBpYujau6
\n3Xkn3HRTjOIQkcaNGRP9Et26xZDYtm2zrig7OoOoYOPHx2UcJ0yAceMUDiL56NEj1nXq0CFC4rHH\nsq6oPOkMokStWhWzn6+6KlZcPe00NSeJFOKFF+Lfz5FHwnXXwSabZF1RcekMosLMmAGHHQbDh8ep\n8umnKxxECnXIITHa7/PPYZ994LXXsq6ofCggSsxDD8VVtnr1il8+nTplXZFI+dtqqxj5N3gw9O0b\nZxKrVmVdVelTE1OJWLIkljd+4gm4994YkSEize8f/4ABA2JNp7vugu22y7qidKmJqcxNnBidavPm\nxTA9hYNIenbaCV58EfbfP5qcnnoq64pKlwIiY3fdBQcfDBddFGcOW26ZdUUilW+DDeKa2H/6U0yu\nu/zyuH6KfJmamDKyeDGcf37M/rz/fthzz6wrEmmZZs2KtcyWLoX77ouhsZVETUxlZvLkaEZasSJW\no1Q4iGSnffvo+zvsMNhvP3juuawrKh0KiCIbPhwOOCDOHoYOhc02y7oiEWndOtY4GzoUvvc9+PWv\nNcoJ1MRUNCtWwKWXRkA88ADsu2/WFYlIkhkzYhHMtm3h7rtjiGw5UxNTifv0Uzj88Li27quvKhxE\nStkOO8Qle3faKUYXvvNO1hVlRwGRstWB8O1vx3owLXnRMJFy0aZNXIzo8suhqqrlXmdCTUwpGjo0\nJr8NGQLHHZd1NSJSiNdfh+9+NybX/fKX0V9RTtaniUkBkYIVK2DQIHjkkbh17Zp1RSKyPj79FE48\nMQaV/OlP5TVfSX0QJWTePDjqqFime8wYhYNIJdhmG3j6aejYMYaoT5mSdUXFkXpAmFm1mU0ys8lm\nNijh9QFmNi53G2lmZTsrYMqU+PB885vw+OOw9dZZVyQizWXDDeNSphdfDAce2DLmS6QaEGbWCrgJ\nOALoCpxiZrs12GwqcLC77wVcCdyeZk1pee65+NBccklcx2GDDbKuSETScM45cTnTU06B227Lupp0\npX0G0QOY4u7T3H05MAzoV38Ddx/t7gtyD0cDZTfRfciQ+LAMGxYfHhGpbL16wciRcMMNcOGFsHJl\n1hWlI+2A6ABMr/d4BmsPgLOAx1OtqBmtWgU/+1msLT9yZHxoRKRl6NIl1lJ7+234zndg0aKsK2p+\nJdNJbWa9gDOAf+mnKEVffBGjGsaMgVGj4sMiIi3LVltFf+O228aqzDNnZl1R80q7pXwmsGO9xzvk\nnvsSM+sGDAGq3X1eYzurqalZc7+qqoqqqqrmqrNJZs+GY46BXXeNteQ32iiTMkSkBLRpA3/4A1xz\nTQxSeewx6NYtu3pqa2upra1tln2lOg/CzFoD7wK9gY+BMcAp7j6x3jY7As8Cp7r76LXsqyTmQUye\nHJcs/P73oaZG14oWkTp//jNccEEsG967d9bVhJKeKGdm1cBviOasO9z9ajMbCLi7DzGz24HvAtMA\nA5a7e4+E/WQeEC+/HDMqf/UrOPPMTEsRkRL14ovR/HzddXDqqVlXU+IB0VyyDohHHoGzz44rwPXt\nm1kZIlIGJkyAI4+EgQNjFecsWxoUECkbMiSak0aM0EqsIpKfjz6C6upY7O+GG7Jbw0kBkRL3uG7t\n0KHw5JPQuXNR315Eytz8+TEEdttt49oSWQxo0VpMKVi5Es47L5qWXnpJ4SAiTbfVVnE501Wroslp\n4cKsK2oaBUSCZcvisoMTJsSFQ7bbLuuKRKRcbbxxjG7q3BkOPTRWhi0XCogGFi+GY4+FJUsi+bfY\nIuuKRKTctW4Nt94Khx0WE+qmT1/335QCBUQ98+fHpUG32y6uHb3xxllXJCKVwgx+/Ws466xY2HPy\n5KwrWjetOZozezYccUSk+w03QCtFp4ik4JJL4KtfjdFNTzyR7azrddHXIDBjRgTDMcfEUt0KBxFJ\n0w9/GD9EDzsMXnkl62oa1+K/CqdOhYMOipnRV1yhpTNEpDhOPhn++Ec4+ugYDFOKWvQ8iMmToU8f\nuOwyOPfcZt21iEhenn8eTjoprnV9+OHNv3/NgyjAO+/E9RtqahQOIpKdXr3g4YdjAdBHH826mi9r\nkQExblycOVx7bbQFiohk6YADIhzOPBP+8pesq6nT4kYxvf56zGj83e9ixUURkVLQo0eMaurbN1Zy\nKIXvpxYVEK+9FuFwyy2xbLeISCn51rdi3bfq6lie4+STs62nxQTE2LExWuC222LxLBGRUrTXXnGl\nysMPjzOJAQOyq6VFBMSrr8JRR8VlAY89NutqRETWbs894ZlnYp6EGZxySjZ1VHxAvP56hMPttysc\nRKR8dO0aZxKHHRZrOZ10UvFrqOiAePPN6PC55Rbo1y/rakREmmaPPaJP4vDDIySOP76471+xAfH2\n2xEON9+sDmkRKV/dusXopurqCIli9qFWZEBMmhSJe8MNcMIJWVcjIrJ+9t4bHnssfvRutFH8txgq\nbqLc++9Hm91VV0H//llXIyLSPLp3jytcnnYaPPtscd6zogLiww+hd2+4/PI4iCIileTb34YHH4xR\nTX//e/rvVzEB8cknsXzGRRfBwIFZVyMiko6DD4Z7740O61dfTfe9KiIg5s6NZqVTT42AEBGpZH36\nxND9o4+OATlpKftO6s8/jw6bvn2jaUlEpCXo1w8WL47RTbW10Llz879HWQfEkiUx+a17d7jmGl3s\nR0RalgEDYNGiaEEZORI6dGje/ZdtQCxfHgtZbb99zHVQOIhIS3TOOTB/fgztf/FFaNu2+fZdlleU\nW7UKTj8d5syJYV8bbphtbSIiWRs0KJqannkGNt+87vmSvqKcmVWb2SQzm2xmgxrZ5rdmNsXM3jSz\nvde2P3e4+OK4lvSDDyocREQArr46Zl0fdxwsXdo8+0w1IMysFXATcATQFTjFzHZrsE1fYBd37wIM\nBG5d2z6XLYsziEcfhU02SanwEldbqlc4z4CORR0dizot8ViYwa23xnLhc+Y0zz7TPoPoAUxx92nu\nvhwYBjRcNq8fMBTA3V8BtjSz9o3tcKON4mpwW22VVsmlryV++BujY1FHx6JOSz0WrVvD//5v83VW\npx0QHYDp9R7PyD23tm1mJmwjIiJFVhET5UREpPmlOorJzHoCNe5enXt8KeDufk29bW4Fnnf3P+ce\nTwIOcfdZDfZVHsOtRERKTKGjmNKeBzEW6GxmHYGPgf5Aw4vnjQDOA/6cC5T5DcMBCv8fFBGRwqQa\nEO6+0szOB54imrPucPeJZjYwXvYh7v43MzvSzN4DFgNnpFmTiIjkp2wmyomISHGVXCd1c0+sK2fr\nOhZmNsDMxuVuI81szyzqLIZ8Phe57fYzs+VmVrEXms3z30iVmb1hZm+b2fPFrrFY8vg3soWZjch9\nV7xlZqdnUGbqzOwOM5tlZuPXsk3TvzfdvWRuRGC9B3QENgTeBHZrsE1f4LHc/f2B0VnXneGx6Als\nmbtf3ZKPRb3tngUeBb6bdd0Zfi62BN4BOuQet8u67gy
PxWXAVauPAzAX2CDr2lM4FgcCewPjG3m9\noO/NUjuDaPaJdWVsncfC3Ue7+4Lcw9FU7vyRfD4XABcADwKzi1lckeVzLAYAw919JoC7N9O82pKT\nz7FwYPXKRJsDc919RRFrLAp3HwnMW8smBX1vllpAaGJdnXyORX1nAY+nWlF21nkszGx74DvufgtQ\nySPe8vlc7ApsbWbPm9lYMzu1aNUVVz7H4iZgdzP7CBgHXFik2kpNQd+bZbvct9Qxs17E6K8Ds64l\nQzcC9dugKzkk1mUDYB/gUGBTYJSZjXL397ItKxNHAG+4+6FmtgvwtJl1c/dFWRdWDkotIGYCO9Z7\nvEPuuYbbfH0d21SCfI4FZtYNGAJUu/vaTjHLWT7HYl9gmJkZ0dbc18yWu/uIItVYLPkcixnAHHf/\nJ/BPM3sR2Itor68k+RyLM4CrANz9fTP7ANgNSPlqziWnoO/NUmtiWjOxzszaEBPrGv4DHwH8ANbM\n1E6cWFcB1nkszGxHYDhwqru/n0GNxbLOY+HuO+dunYh+iB9XYDhAfv9GHgEONLPWZrYJ0Sk5sch1\nFkM+x2Ia0Acg1+a+KzC1qFUWj9H4mXNB35sldQbhmli3Rj7HAvgFsDXw+9wv5+Xu3iO7qtOR57H4\n0p8UvcigbUOnAAABTElEQVQiyfPfyCQzexIYD6wEhrj7hAzLTkWen4srgf+rN/zz5+7+WUYlp8bM\n7gWqgLZm9iEwGGjDen5vaqKciIgkKrUmJhERKREKCBERSaSAEBGRRAoIERFJpIAQEZFECggREUmk\ngBBpBmZ2Xm4p5ZVmtnXW9Yg0BwWESPMYCfQmZu6KVISSmkktUupy11d/AniNWBDvbeAH7j4u93pL\nXiRQKozOIESa7hvATe6+O7AQ+HHG9YikQgEh0nQfuvvo3P17aNnLrEsFU0CIrD9v5L5IWVNAiDTd\njma2f+7+AKKDerW1LbksUlYUECJN9y5wnplNALYEbjGzC8xsOnEZx3Fm1nAJcpGyo+W+RZogN4rp\nUXffM+taRNKmMwiRptOvKmkRdAYhIiKJdAYhIiKJFBAiIpJIASEiIokUECIikkgBISIiiRQQIiKS\n6P8BGMvpDHIlxW0AAAAASUVORK5CYII=\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x1041049e8>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import math\n",
"p = np.linspace(0.01, 0.99, 100, endpoint=True)\n",
"\n",
"def entropy(p):\n",
" q = 1 - p\n",
" return (1 - p**2 - q**2)\n",
"\n",
"gini = entropy(p)\n",
"plt.plot(p, gini)\n",
"plt.title(\"Gini Index\")\n",
"plt.xlabel(\"p1\")\n",
"plt.ylabel(\"Gini Index\")\n",
"plt.ylim(0, 1.2)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Numerically, we can see that it makes sense to use the gini index as an alternative measure to minimize because this measure is maximum when a set has half the members belonging to one class and the other half belonging to another class. It is minimal when all members of a set belong to one class."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 8. Pruning\n",
"If a decision tre is allowed to grow as deeply as possible, then we run into the danger of overfitting. Hence, there are a few common stopping criteria that people use to prevent overfitting and make the tree generalizable:\n",
"\n",
"1. Maximum tree depth\n",
"2. Minimum number of samples at one node\n",
"3. Stop growing when the accuracy on the validation dataset does not improve\n",
"4. Stop growing when there is no statisticially significant association between any attribute and the class at a node based on the chi-square test of independence\n",
"\n",
"There are two types of pruning - prepruning and postpruning. The difference is that postpruning allows the tree to grow to maximum depth possible before removing certain nodes. "
]
},
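{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration (added for this write-up; the depth and size limits are hypothetical parameters, and this helper is not used by the implementation above), a prepruning check combining criteria 1 and 2 with the pure-node case might look like:\n",
"\n",
"```\n",
"def should_stop(labels, depth, max_depth=10, min_samples=20):\n",
"    # stop when the node is deep enough, small enough, or already pure\n",
"    if depth >= max_depth:\n",
"        return True\n",
"    if len(labels) < min_samples:\n",
"        return True\n",
"    return len(set(labels)) == 1\n",
"```"
]
},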
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 9. Regression Tree\n",
"Decision trees can easily be converted to do regression rather than classification with a few changes. The changes are:\n",
"\n",
"1. Prediction is the average value of all samples at a leaf\n",
"2. Splitting criteria is to choose the attribute value that minimize average variance in children, or achieves greatest standard deviation reduction (SDR). Suppose the parent node is denoted as $P$ and we split the data into subsets $S_{i}$. $|S_{i}|$ and $|P|$ denote the number of samples at child node $i$ and number of samples in the parent node respectively. Then SDR is defined as:\n",
"\n",
"$$SDR(P, S) = SD(P) - \\sum_{i} \\frac{|S_{i}|}{|P|}SD(S_{i})$$\n",
"\n",
"3. Prepruning criteria can be lower bound on standard deviation in a node and/or number of examples in a node.\n",
"\n",
"4. Postpruning criteria can remove nodes until the mean-square error on the validaiton dataset stops improving."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}