PBPatil/splitting_techniques_without_scikit.ipynb

## splitting_techniques_without_scikit.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Problem Statement\n",
    "\n",
    "- We have a sample of 50 people with three variables Gender (M/F), employment status( Student/ Working) and Age (years)\n",
    "- Some of these 50 are planning to watch the movie.\n",
    "- Now, we want to create a model to predict who will watch the movie? In this problem, we need to segregate the sample into who will watch the movie based on highly significant input variable among all three\n",
    "- For the sake of simplicity, age feature is conveted into bins of >28 and <28"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   # gender  is_28+ employment_status watching\n",
      "0  1      M       0           student      yes\n",
      "1  2      M       1           working      yes\n",
      "2  3      F       0           working      yes\n",
      "3  4      F       0           student       no\n",
      "4  5      M       1           working      yes\n",
      "(50, 5)\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "films = pd.read_excel('films.xlsx', sheetname='films')\n",
    "print(films.head())\n",
    "print(films.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### __I. Gini Index__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Viewers who watched the movie:26\n",
      "Viewers who did not watch the movie:24\n"
     ]
    }
   ],
   "source": [
    "print(\"Viewers who watched the movie:{}\".format(len(films[films['watching'] == 'yes'])))\n",
    "print(\"Viewers who did not watch the movie:{}\".format(len(films[films['watching'] == 'no'])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__SPLIT BASED ON GENDER__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>gender</th>\n",
       "      <th>F</th>\n",
       "      <th>M</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>watching</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>no</th>\n",
       "      <td>8</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>yes</th>\n",
       "      <td>14</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "gender     F   M\n",
       "watching        \n",
       "no         8  16\n",
       "yes       14  12"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "crosstab1 = pd.crosstab(index=films[\"watching\"], columns=films[\"gender\"])\n",
    "crosstab1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Probability of males that watched Dunkirk:0.429\n",
      "Probability of females that watched Dunkirk:0.636\n"
     ]
    }
   ],
   "source": [
    "male_watched_yes = (12/float(28))\n",
    "female_watched_yes = (14/float(22))\n",
    "\n",
    "print(\"Probability of males that watched Dunkirk:{:.3f}\".format(male_watched_yes))\n",
    "print(\"Probability of females that watched Dunkirk:{:.3f}\".format(female_watched_yes))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Gini(female):0.463\n",
      "Gini(male):0.490\n"
     ]
    }
   ],
   "source": [
    "subnode_male = 1 - ((male_watched_yes)**2 + (1-male_watched_yes)**2)\n",
    "subnode_female =1-( (female_watched_yes)**2 + (1-female_watched_yes)**2)\n",
    "\n",
    "print(\"Gini(female):{:.3f}\".format(subnode_female))\n",
    "print(\"Gini(male):{:.3f}\".format(subnode_male))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Weighted Gini for Gender:0.4779\n"
     ]
    }
   ],
   "source": [
    "# Weighted Gini Index Calculation for Gender Split\n",
    "calculated_wt_gender = (28/float(50))*subnode_male + (22/float(50))*subnode_female\n",
    "print(\"Weighted Gini for Gender:{:.4f}\".format(calculated_wt_gender))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__SPLIT BASED ON EMPLOYMENT__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>employment_status</th>\n",
       "      <th>student</th>\n",
       "      <th>working</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>watching</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>no</th>\n",
       "      <td>5</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>yes</th>\n",
       "      <td>4</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "employment_status  student  working\n",
       "watching                           \n",
       "no                       5       19\n",
       "yes                      4       22"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "crosstab2 = pd.crosstab(index=films[\"watching\"], columns=films[\"employment_status\"])\n",
    "crosstab2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Probability of students that watched:0.444\n",
      "Probability of working people that watched:0.537\n"
     ]
    }
   ],
   "source": [
    "student_watched_yes = (4/float(9))\n",
    "working_watched_yes = (22/float(41))\n",
    "print(\"Probability of students that watched:{:.3f}\".format(student_watched_yes))\n",
    "print(\"Probability of working people that watched:{:.3f}\".format(working_watched_yes))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Gini(student):0.494\n",
      "Gini(working):0.497\n"
     ]
    }
   ],
   "source": [
    "subnode_student =1-( (student_watched_yes)**2 + (1 - student_watched_yes)**2)\n",
    "subnode_working = 1-((working_watched_yes)**2 + (1 - working_watched_yes)**2)\n",
    "\n",
    "print(\"Gini(student):{:.3f}\".format(subnode_student))\n",
    "print(\"Gini(working):{:.3f}\".format(subnode_working))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Weighted Gini(employment):0.4967\n"
     ]
    }
   ],
   "source": [
    "#Weighted Gini Index for Employment Split\n",
    "calculated_wt_emp = (41/float(50))*subnode_working + (9/float(50))*subnode_student\n",
    "print(\"Weighted Gini(employment):{:.4f}\".format(calculated_wt_emp))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "__SPLIT BASED ON AGE__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>is_28+</th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>watching</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>no</th>\n",
       "      <td>11</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>yes</th>\n",
       "      <td>17</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "is_28+     0   1\n",
       "watching        \n",
       "no        11  13\n",
       "yes       17   9"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "crosstab3 = pd.crosstab(index=films[\"watching\"], columns=films[\"is_28+\"])\n",
    "crosstab3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Probability of people_younger_than_28_watched:0.607\n",
      "Probability of people_older_than_28_watched:0.409\n"
     ]
    }
   ],
   "source": [
    "people_younger_than_28_watched_yes = (17/float(28))\n",
    "people_older_than_28_watched_yes = (9/float(22))\n",
    "print(\"Probability of people_younger_than_28_watched:{:.3f}\".format(people_younger_than_28_watched_yes))\n",
    "print(\"Probability of people_older_than_28_watched:{:.3f}\".format(people_older_than_28_watched_yes))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Gini(people_younger_than_28):0.477\n",
      "Gini(people_older_than_28):0.483\n"
     ]
    }
   ],
   "source": [
    "subnode_less_than28_watched_yes =1-( (people_younger_than_28_watched_yes)**2 + (1 - people_younger_than_28_watched_yes)**2)\n",
    "subnode_more_than28_watched_yes = 1-((people_older_than_28_watched_yes)**2 + (1 - people_older_than_28_watched_yes)**2)\n",
    "\n",
    "print(\"Gini(people_younger_than_28):{:.3f}\".format(subnode_less_than28_watched_yes))\n",
    "print(\"Gini(people_older_than_28):{:.3f}\".format(subnode_more_than28_watched_yes))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Weighted Gini(age):0.4799\n"
     ]
    }
   ],
   "source": [
    "#Weighted Gini Index for Age Split\n",
    "calculated_wt_emp = (28/float(50))*subnode_less_than28_watched_yes + (22/float(50))*subnode_more_than28_watched_yes\n",
    "print(\"Weighted Gini(age):{:.4f}\".format(calculated_wt_emp))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- __Since weighted gini(gender)< weighted gini(age) < weighted gini(employment), the node split will take on Gender__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### II. Chi-Square"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Gender Node__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>watching</th>\n",
       "      <th>no</th>\n",
       "      <th>yes</th>\n",
       "      <th>Total</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gender</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>F</th>\n",
       "      <td>8</td>\n",
       "      <td>14</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>M</th>\n",
       "      <td>16</td>\n",
       "      <td>12</td>\n",
       "      <td>28</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "watching  no  yes  Total\n",
       "gender                  \n",
       "F          8   14     22\n",
       "M         16   12     28"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "crosstab1 = pd.crosstab(index=films[\"gender\"], columns=films[\"watching\"])\n",
    "crosstab1[\"Total\"] = crosstab1.no + crosstab1.yes\n",
    "crosstab1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>watching</th>\n",
       "      <th>no</th>\n",
       "      <th>yes</th>\n",
       "      <th>Total</th>\n",
       "      <th>Expected watch</th>\n",
       "      <th>Expected not watch</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gender</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>F</th>\n",
       "      <td>8</td>\n",
       "      <td>14</td>\n",
       "      <td>22</td>\n",
       "      <td>11.44</td>\n",
       "      <td>10.56</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>M</th>\n",
       "      <td>16</td>\n",
       "      <td>12</td>\n",
       "      <td>28</td>\n",
       "      <td>14.56</td>\n",
       "      <td>13.44</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "watching  no  yes  Total  Expected watch  Expected not watch\n",
       "gender                                                      \n",
       "F          8   14     22           11.44               10.56\n",
       "M         16   12     28           14.56               13.44"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# calculate the expected  who watch movie\n",
    "\n",
    "crosstab1[\"Expected watch\"] = crosstab1.Total * 26/50\n",
    "crosstab1[\"Expected not watch\"] = crosstab1.Total * 24/50\n",
    "crosstab1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>watching</th>\n",
       "      <th>no</th>\n",
       "      <th>yes</th>\n",
       "      <th>Total</th>\n",
       "      <th>Expected watch</th>\n",
       "      <th>Expected not watch</th>\n",
       "      <th>E - O (Watch)</th>\n",
       "      <th>E - O (Not Watch)</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gender</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>F</th>\n",
       "      <td>8</td>\n",
       "      <td>14</td>\n",
       "      <td>22</td>\n",
       "      <td>11.44</td>\n",
       "      <td>10.56</td>\n",
       "      <td>-2.56</td>\n",
       "      <td>2.56</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>M</th>\n",
       "      <td>16</td>\n",
       "      <td>12</td>\n",
       "      <td>28</td>\n",
       "      <td>14.56</td>\n",
       "      <td>13.44</td>\n",
       "      <td>2.56</td>\n",
       "      <td>-2.56</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "watching  no  yes  Total  Expected watch  Expected not watch  E - O (Watch)  \\\n",
       "gender                                                                        \n",
       "F          8   14     22           11.44               10.56          -2.56   \n",
       "M         16   12     28           14.56               13.44           2.56   \n",
       "\n",
       "watching  E - O (Not Watch)  \n",
       "gender                       \n",
       "F                      2.56  \n",
       "M                     -2.56  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Calculating Deviation\n",
    "crosstab1[\"E - O (Watch)\"] = crosstab1[\"Expected watch\"] - crosstab1.yes\n",
    "crosstab1[\"E - O (Not Watch)\"] = crosstab1[\"Expected not watch\"] - crosstab1.no\n",
    "crosstab1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>watching</th>\n",
       "      <th>no</th>\n",
       "      <th>yes</th>\n",
       "      <th>Total</th>\n",
       "      <th>Expected watch</th>\n",
       "      <th>Expected not watch</th>\n",
       "      <th>E - O (Watch)</th>\n",
       "      <th>E - O (Not Watch)</th>\n",
       "      <th>chi2_watch</th>\n",
       "      <th>chi2_not_watch</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gender</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>F</th>\n",
       "      <td>8</td>\n",
       "      <td>14</td>\n",
       "      <td>22</td>\n",
       "      <td>11.44</td>\n",
       "      <td>10.56</td>\n",
       "      <td>-2.56</td>\n",
       "      <td>2.56</td>\n",
       "      <td>0.756880</td>\n",
       "      <td>0.787786</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>M</th>\n",
       "      <td>16</td>\n",
       "      <td>12</td>\n",
       "      <td>28</td>\n",
       "      <td>14.56</td>\n",
       "      <td>13.44</td>\n",
       "      <td>2.56</td>\n",
       "      <td>-2.56</td>\n",
       "      <td>0.670902</td>\n",
       "      <td>0.698297</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "watching  no  yes  Total  Expected watch  Expected not watch  E - O (Watch)  \\\n",
       "gender                                                                        \n",
       "F          8   14     22           11.44               10.56          -2.56   \n",
       "M         16   12     28           14.56               13.44           2.56   \n",
       "\n",
       "watching  E - O (Not Watch)  chi2_watch  chi2_not_watch  \n",
       "gender                                                   \n",
       "F                      2.56    0.756880        0.787786  \n",
       "M                     -2.56    0.670902        0.698297  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "crosstab1[\"chi2_watch\"] = np.sqrt(crosstab1[\"E - O (Watch)\"]**2/crosstab1[\"Expected watch\"])\n",
    "crosstab1[\"chi2_not_watch\"] = np.sqrt(crosstab1[\"E - O (Not Watch)\"]**2/crosstab1[\"Expected not watch\"])\n",
    "crosstab1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2.9138649533909593"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chi2_gender = (crosstab1[\"chi2_watch\"] + crosstab1[\"chi2_not_watch\"]).sum()\n",
    "chi2_gender"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Employment node__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>watching</th>\n",
       "      <th>no</th>\n",
       "      <th>yes</th>\n",
       "      <th>Total</th>\n",
       "      <th>Expected watch</th>\n",
       "      <th>Expected not watch</th>\n",
       "      <th>E - O (Watch)</th>\n",
       "      <th>E - O (Not Watch)</th>\n",
       "      <th>chi2_watch</th>\n",
       "      <th>chi2_not_watch</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>employment_status</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>student</th>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>9</td>\n",
       "      <td>4.68</td>\n",
       "      <td>4.32</td>\n",
       "      <td>0.68</td>\n",
       "      <td>-0.68</td>\n",
       "      <td>0.31433</td>\n",
       "      <td>0.327165</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>working</th>\n",
       "      <td>19</td>\n",
       "      <td>22</td>\n",
       "      <td>41</td>\n",
       "      <td>21.32</td>\n",
       "      <td>19.68</td>\n",
       "      <td>-0.68</td>\n",
       "      <td>0.68</td>\n",
       "      <td>0.14727</td>\n",
       "      <td>0.153284</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "watching           no  yes  Total  Expected watch  Expected not watch  \\\n",
       "employment_status                                                       \n",
       "student             5    4      9            4.68                4.32   \n",
       "working            19   22     41           21.32               19.68   \n",
       "\n",
       "watching           E - O (Watch)  E - O (Not Watch)  chi2_watch  \\\n",
       "employment_status                                                 \n",
       "student                     0.68              -0.68     0.31433   \n",
       "working                    -0.68               0.68     0.14727   \n",
       "\n",
       "watching           chi2_not_watch  \n",
       "employment_status                  \n",
       "student                  0.327165  \n",
       "working                  0.153284  "
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "crosstab2 = pd.crosstab(index=films[\"employment_status\"], columns=films[\"watching\"])\n",
    "crosstab2[\"Total\"] = crosstab2.no + crosstab2.yes\n",
    "\n",
    "crosstab2[\"Expected watch\"] = crosstab2.Total * 26/50\n",
    "crosstab2[\"Expected not watch\"] = crosstab2.Total * 24/50\n",
    "\n",
    "crosstab2[\"E - O (Watch)\"] = crosstab2[\"Expected watch\"] - crosstab2.yes\n",
    "crosstab2[\"E - O (Not Watch)\"] = crosstab2[\"Expected not watch\"] - crosstab2.no\n",
    "\n",
    "crosstab2[\"chi2_watch\"] = np.sqrt(crosstab2[\"E - O (Watch)\"]**2/crosstab2[\"Expected watch\"])\n",
    "crosstab2[\"chi2_not_watch\"] = np.sqrt(crosstab2[\"E - O (Not Watch)\"]**2/crosstab2[\"Expected not watch\"])\n",
    "\n",
    "crosstab2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9420494494487789"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chi2_emp = (crosstab2[\"chi2_watch\"] + crosstab2[\"chi2_not_watch\"]).sum()\n",
    "chi2_emp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Age node__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>watching</th>\n",
       "      <th>no</th>\n",
       "      <th>yes</th>\n",
       "      <th>Total</th>\n",
       "      <th>Expected watch</th>\n",
       "      <th>Expected not watch</th>\n",
       "      <th>E - O (Watch)</th>\n",
       "      <th>E - O (Not Watch)</th>\n",
       "      <th>chi2_watch</th>\n",
       "      <th>chi2_not_watch</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>is_28+</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>11</td>\n",
       "      <td>17</td>\n",
       "      <td>28</td>\n",
       "      <td>14.56</td>\n",
       "      <td>13.44</td>\n",
       "      <td>-2.44</td>\n",
       "      <td>2.44</td>\n",
       "      <td>0.639454</td>\n",
       "      <td>0.665565</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>13</td>\n",
       "      <td>9</td>\n",
       "      <td>22</td>\n",
       "      <td>11.44</td>\n",
       "      <td>10.56</td>\n",
       "      <td>2.44</td>\n",
       "      <td>-2.44</td>\n",
       "      <td>0.721401</td>\n",
       "      <td>0.750858</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "watching  no  yes  Total  Expected watch  Expected not watch  E - O (Watch)  \\\n",
       "is_28+                                                                        \n",
       "0         11   17     28           14.56               13.44          -2.44   \n",
       "1         13    9     22           11.44               10.56           2.44   \n",
       "\n",
       "watching  E - O (Not Watch)  chi2_watch  chi2_not_watch  \n",
       "is_28+                                                   \n",
       "0                      2.44    0.639454        0.665565  \n",
       "1                     -2.44    0.721401        0.750858  "
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "crosstab3 = pd.crosstab(index=films[\"is_28+\"], columns=films[\"watching\"])\n",
    "crosstab3[\"Total\"] = crosstab3.no + crosstab3.yes\n",
    "\n",
    "crosstab3[\"Expected watch\"] = crosstab3.Total * 26/50\n",
    "crosstab3[\"Expected not watch\"] = crosstab3.Total * 24/50\n",
    "\n",
    "crosstab3[\"E - O (Watch)\"] = crosstab3[\"Expected watch\"] - crosstab3.yes\n",
    "crosstab3[\"E - O (Not Watch)\"] = crosstab3[\"Expected not watch\"] - crosstab3.no\n",
    "\n",
    "crosstab3[\"chi2_watch\"] = np.sqrt(crosstab3[\"E - O (Watch)\"]**2/crosstab3[\"Expected watch\"])\n",
    "crosstab3[\"chi2_not_watch\"] = np.sqrt(crosstab3[\"E - O (Not Watch)\"]**2/crosstab3[\"Expected not watch\"])\n",
    "\n",
    "crosstab3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2.777277533700757"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chi2_age = (crosstab3[\"chi2_watch\"] + crosstab3[\"chi2_not_watch\"]).sum()\n",
    "chi2_age"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- __Since chi2_gender < chi2_age < chi2_emp, the node split will take on Gender__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### III.Entropy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Gender node__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>watching</th>\n",
       "      <th>no</th>\n",
       "      <th>yes</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gender</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>F</th>\n",
       "      <td>8</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>M</th>\n",
       "      <td>16</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "watching  no  yes\n",
       "gender           \n",
       "F          8   14\n",
       "M         16   12"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "crosstab1.iloc[:,:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.94566030460064021"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Female node entropy\n",
    "p = 14/float(22)\n",
    "q = 8/float(22)\n",
    "female_entropy = -p*np.log2(p) - q*np.log2(q)\n",
    "female_entropy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.98522813603425163"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Male node entropy\n",
    "p = 12/float(28)\n",
    "q = 16/float(28)\n",
    "male_entropy = -p*np.log2(p) - q*np.log2(q)\n",
    "male_entropy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.96781829020346266"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Weighted entropy for gender\n",
    "total_entropy_gender = (28/float(50))*male_entropy + (22/float(50))*female_entropy\n",
    "total_entropy_gender"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Employment Node__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>watching</th>\n",
       "      <th>no</th>\n",
       "      <th>yes</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>employment_status</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>student</th>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>working</th>\n",
       "      <td>19</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "watching           no  yes\n",
       "employment_status         \n",
       "student             5    4\n",
       "working            19   22"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "crosstab2.iloc[:,:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.99107605983822222"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#entropy for students\n",
    "p = 4/float(9)\n",
    "q = 5/float(9)\n",
    "working_entropy = -p*np.log2(p) - q*np.log2(q)\n",
    "working_entropy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.99613448350957956"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# entropy for working people\n",
    "p = 22/float(41)\n",
    "q = 19/float(41)\n",
    "\n",
    "student_entropy = -p*np.log2(p) - q*np.log2(q)\n",
    "student_entropy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.99198657609906649"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "total_entropy_emp = (41/float(50))*working_entropy + (9/float(50))*student_entropy\n",
    "total_entropy_emp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Age Node__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>watching</th>\n",
       "      <th>no</th>\n",
       "      <th>yes</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>is_28+</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>11</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>13</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "watching  no  yes\n",
       "is_28+           \n",
       "0         11   17\n",
       "1         13    9"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "crosstab3.iloc[:,:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.96661863254810276"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#entropy for less than 28 years\n",
    "p = 17/float(28)\n",
    "q = 11/float(28)\n",
    "less_than_28_entropy = -p*np.log2(p) - q*np.log2(q)\n",
    "less_than_28_entropy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.97602064823661505"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#entropy for more than 28 years\n",
    "p = 9/float(22)\n",
    "q = 13/float(22)\n",
    "more_than_28_entropy = -p*np.log2(p) - q*np.log2(q)\n",
    "more_than_28_entropy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.97075551945104821"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "total_entropy_age = (28/float(50))*less_than_28_entropy + (22/float(50))*more_than_28_entropy\n",
    "total_entropy_age"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- __Since entropy_gender < entropy_age < entropy_emp, the node split will take on Gender__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}