{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem Statement\n",
"\n",
"- We have a sample of 50 people described by three variables: gender (M/F), employment status (student/working), and age (years).\n",
"- Some of these 50 people are planning to watch the movie.\n",
"- We want a model that predicts who will watch the movie. To build it, we split the sample on whichever of the three input variables separates watchers from non-watchers most strongly.\n",
"- For simplicity, the age feature is converted into two bins: under 28 and 28 or older (the `is_28+` column)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" # gender is_28+ employment_status watching\n",
"0 1 M 0 student yes\n",
"1 2 M 1 working yes\n",
"2 3 F 0 working yes\n",
"3 4 F 0 student no\n",
"4 5 M 1 working yes\n",
"(50, 5)\n"
]
}
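],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"# `sheet_name` replaces the older `sheetname` keyword, which newer pandas releases no longer accept\n",
"films = pd.read_excel('films.xlsx', sheet_name='films')\n",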
"print(films.head())\n",
"print(films.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### __I. Gini Index__"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Viewers who watched the movie:26\n",
"Viewers who did not watch the movie:24\n"
]
}
],
"source": [
"print(\"Viewers who watched the movie:{}\".format(len(films[films['watching'] == 'yes'])))\n",
"print(\"Viewers who did not watch the movie:{}\".format(len(films[films['watching'] == 'no'])))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__SPLIT BASED ON GENDER__"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>gender</th>\n",
" <th>F</th>\n",
" <th>M</th>\n",
" </tr>\n",
" <tr>\n",
" <th>watching</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>no</th>\n",
" <td>8</td>\n",
" <td>16</td>\n",
" </tr>\n",
" <tr>\n",
" <th>yes</th>\n",
" <td>14</td>\n",
" <td>12</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"gender F M\n",
"watching \n",
"no 8 16\n",
"yes 14 12"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crosstab1 = pd.crosstab(index=films[\"watching\"], columns=films[\"gender\"])\n",
"crosstab1"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Probability of males that watched Dunkirk:0.429\n",
"Probability of females that watched Dunkirk:0.636\n"
]
}
],
"source": [
"male_watched_yes = (12/float(28))\n",
"female_watched_yes = (14/float(22))\n",
"\n",
"print(\"Probability of males that watched Dunkirk:{:.3f}\".format(male_watched_yes))\n",
"print(\"Probability of females that watched Dunkirk:{:.3f}\".format(female_watched_yes))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Gini(female):0.463\n",
"Gini(male):0.490\n"
]
}
],
"source": [
"subnode_male = 1 - ((male_watched_yes)**2 + (1-male_watched_yes)**2)\n",
"subnode_female =1-( (female_watched_yes)**2 + (1-female_watched_yes)**2)\n",
"\n",
"print(\"Gini(female):{:.3f}\".format(subnode_female))\n",
"print(\"Gini(male):{:.3f}\".format(subnode_male))"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Weighted Gini for Gender:0.4779\n"
]
}
],
"source": [
"# Weighted Gini Index Calculation for Gender Split\n",
"calculated_wt_gender = (28/float(50))*subnode_male + (22/float(50))*subnode_female\n",
"print(\"Weighted Gini for Gender:{:.4f}\".format(calculated_wt_gender))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__SPLIT BASED ON EMPLOYMENT__"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>employment_status</th>\n",
" <th>student</th>\n",
" <th>working</th>\n",
" </tr>\n",
" <tr>\n",
" <th>watching</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>no</th>\n",
" <td>5</td>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>yes</th>\n",
" <td>4</td>\n",
" <td>22</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"employment_status student working\n",
"watching \n",
"no 5 19\n",
"yes 4 22"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crosstab2 = pd.crosstab(index=films[\"watching\"], columns=films[\"employment_status\"])\n",
"crosstab2"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Probability of students that watched:0.444\n",
"Probability of working people that watched:0.537\n"
]
}
],
"source": [
"student_watched_yes = (4/float(9))\n",
"working_watched_yes = (22/float(41))\n",
"print(\"Probability of students that watched:{:.3f}\".format(student_watched_yes))\n",
"print(\"Probability of working people that watched:{:.3f}\".format(working_watched_yes))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Gini(student):0.494\n",
"Gini(working):0.497\n"
]
}
],
"source": [
"subnode_student =1-( (student_watched_yes)**2 + (1 - student_watched_yes)**2)\n",
"subnode_working = 1-((working_watched_yes)**2 + (1 - working_watched_yes)**2)\n",
"\n",
"print(\"Gini(student):{:.3f}\".format(subnode_student))\n",
"print(\"Gini(working):{:.3f}\".format(subnode_working))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Weighted Gini(employment):0.4967\n"
]
}
],
"source": [
"#Weighted Gini Index for Employment Split\n",
"calculated_wt_emp = (41/float(50))*subnode_working + (9/float(50))*subnode_student\n",
"print(\"Weighted Gini(employment):{:.4f}\".format(calculated_wt_emp))"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"__SPLIT BASED ON AGE__"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>is_28+</th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" </tr>\n",
" <tr>\n",
" <th>watching</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>no</th>\n",
" <td>11</td>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>yes</th>\n",
" <td>17</td>\n",
" <td>9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"is_28+ 0 1\n",
"watching \n",
"no 11 13\n",
"yes 17 9"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crosstab3 = pd.crosstab(index=films[\"watching\"], columns=films[\"is_28+\"])\n",
"crosstab3"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Probability of people_younger_than_28_watched:0.607\n",
"Probability of people_older_than_28_watched:0.409\n"
]
}
],
"source": [
"people_younger_than_28_watched_yes = (17/float(28))\n",
"people_older_than_28_watched_yes = (9/float(22))\n",
"print(\"Probability of people_younger_than_28_watched:{:.3f}\".format(people_younger_than_28_watched_yes))\n",
"print(\"Probability of people_older_than_28_watched:{:.3f}\".format(people_older_than_28_watched_yes))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Gini(people_younger_than_28):0.477\n",
"Gini(people_older_than_28):0.483\n"
]
}
],
"source": [
"subnode_less_than28_watched_yes =1-( (people_younger_than_28_watched_yes)**2 + (1 - people_younger_than_28_watched_yes)**2)\n",
"subnode_more_than28_watched_yes = 1-((people_older_than_28_watched_yes)**2 + (1 - people_older_than_28_watched_yes)**2)\n",
"\n",
"print(\"Gini(people_younger_than_28):{:.3f}\".format(subnode_less_than28_watched_yes))\n",
"print(\"Gini(people_older_than_28):{:.3f}\".format(subnode_more_than28_watched_yes))"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Weighted Gini(age):0.4799\n"
]
}
],
"source": [
"#Weighted Gini Index for Age Split\n",
"calculated_wt_emp = (28/float(50))*subnode_less_than28_watched_yes + (22/float(50))*subnode_more_than28_watched_yes\n",
"print(\"Weighted Gini(age):{:.4f}\".format(calculated_wt_emp))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- __Since weighted Gini(gender) < weighted Gini(age) < weighted Gini(employment), and a lower weighted Gini means purer child nodes, the node will be split on Gender.__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### II. Chi-Square"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Gender Node__"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>watching</th>\n",
" <th>no</th>\n",
" <th>yes</th>\n",
" <th>Total</th>\n",
" </tr>\n",
" <tr>\n",
" <th>gender</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>F</th>\n",
" <td>8</td>\n",
" <td>14</td>\n",
" <td>22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>M</th>\n",
" <td>16</td>\n",
" <td>12</td>\n",
" <td>28</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"watching no yes Total\n",
"gender \n",
"F 8 14 22\n",
"M 16 12 28"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crosstab1 = pd.crosstab(index=films[\"gender\"], columns=films[\"watching\"])\n",
"crosstab1[\"Total\"] = crosstab1.no + crosstab1.yes\n",
"crosstab1"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>watching</th>\n",
" <th>no</th>\n",
" <th>yes</th>\n",
" <th>Total</th>\n",
" <th>Expected watch</th>\n",
" <th>Expected not watch</th>\n",
" </tr>\n",
" <tr>\n",
" <th>gender</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>F</th>\n",
" <td>8</td>\n",
" <td>14</td>\n",
" <td>22</td>\n",
" <td>11.44</td>\n",
" <td>10.56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>M</th>\n",
" <td>16</td>\n",
" <td>12</td>\n",
" <td>28</td>\n",
" <td>14.56</td>\n",
" <td>13.44</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"watching no yes Total Expected watch Expected not watch\n",
"gender \n",
"F 8 14 22 11.44 10.56\n",
"M 16 12 28 14.56 13.44"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Expected counts if watching were independent of gender: row total times the overall share of watchers (26/50) and non-watchers (24/50)\n",
"\n",
"crosstab1[\"Expected watch\"] = crosstab1.Total * 26/50\n",
"crosstab1[\"Expected not watch\"] = crosstab1.Total * 24/50\n",
"crosstab1"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>watching</th>\n",
" <th>no</th>\n",
" <th>yes</th>\n",
" <th>Total</th>\n",
" <th>Expected watch</th>\n",
" <th>Expected not watch</th>\n",
" <th>E - O (Watch)</th>\n",
" <th>E - O (Not Watch)</th>\n",
" </tr>\n",
" <tr>\n",
" <th>gender</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>F</th>\n",
" <td>8</td>\n",
" <td>14</td>\n",
" <td>22</td>\n",
" <td>11.44</td>\n",
" <td>10.56</td>\n",
" <td>-2.56</td>\n",
" <td>2.56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>M</th>\n",
" <td>16</td>\n",
" <td>12</td>\n",
" <td>28</td>\n",
" <td>14.56</td>\n",
" <td>13.44</td>\n",
" <td>2.56</td>\n",
" <td>-2.56</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"watching no yes Total Expected watch Expected not watch E - O (Watch) \\\n",
"gender \n",
"F 8 14 22 11.44 10.56 -2.56 \n",
"M 16 12 28 14.56 13.44 2.56 \n",
"\n",
"watching E - O (Not Watch) \n",
"gender \n",
"F 2.56 \n",
"M -2.56 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Calculating Deviation\n",
"crosstab1[\"E - O (Watch)\"] = crosstab1[\"Expected watch\"] - crosstab1.yes\n",
"crosstab1[\"E - O (Not Watch)\"] = crosstab1[\"Expected not watch\"] - crosstab1.no\n",
"crosstab1"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>watching</th>\n",
" <th>no</th>\n",
" <th>yes</th>\n",
" <th>Total</th>\n",
" <th>Expected watch</th>\n",
" <th>Expected not watch</th>\n",
" <th>E - O (Watch)</th>\n",
" <th>E - O (Not Watch)</th>\n",
" <th>chi2_watch</th>\n",
" <th>chi2_not_watch</th>\n",
" </tr>\n",
" <tr>\n",
" <th>gender</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>F</th>\n",
" <td>8</td>\n",
" <td>14</td>\n",
" <td>22</td>\n",
" <td>11.44</td>\n",
" <td>10.56</td>\n",
" <td>-2.56</td>\n",
" <td>2.56</td>\n",
" <td>0.756880</td>\n",
" <td>0.787786</td>\n",
" </tr>\n",
" <tr>\n",
" <th>M</th>\n",
" <td>16</td>\n",
" <td>12</td>\n",
" <td>28</td>\n",
" <td>14.56</td>\n",
" <td>13.44</td>\n",
" <td>2.56</td>\n",
" <td>-2.56</td>\n",
" <td>0.670902</td>\n",
" <td>0.698297</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"watching no yes Total Expected watch Expected not watch E - O (Watch) \\\n",
"gender \n",
"F 8 14 22 11.44 10.56 -2.56 \n",
"M 16 12 28 14.56 13.44 2.56 \n",
"\n",
"watching E - O (Not Watch) chi2_watch chi2_not_watch \n",
"gender \n",
"F 2.56 0.756880 0.787786 \n",
"M -2.56 0.670902 0.698297 "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
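],
"source": [
"# Per-cell contribution, computed here as sqrt((E - O)^2 / E).\n",
"# Note: the textbook Pearson chi-square statistic sums (O - E)^2 / E without the square root.\n",
"crosstab1[\"chi2_watch\"] = np.sqrt(crosstab1[\"E - O (Watch)\"]**2/crosstab1[\"Expected watch\"])\n",
"crosstab1[\"chi2_not_watch\"] = np.sqrt(crosstab1[\"E - O (Not Watch)\"]**2/crosstab1[\"Expected not watch\"])\n",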
"crosstab1"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2.9138649533909593"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chi2_gender = (crosstab1[\"chi2_watch\"] + crosstab1[\"chi2_not_watch\"]).sum()\n",
"chi2_gender"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Employment node__"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>watching</th>\n",
" <th>no</th>\n",
" <th>yes</th>\n",
" <th>Total</th>\n",
" <th>Expected watch</th>\n",
" <th>Expected not watch</th>\n",
" <th>E - O (Watch)</th>\n",
" <th>E - O (Not Watch)</th>\n",
" <th>chi2_watch</th>\n",
" <th>chi2_not_watch</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employment_status</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>student</th>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>9</td>\n",
" <td>4.68</td>\n",
" <td>4.32</td>\n",
" <td>0.68</td>\n",
" <td>-0.68</td>\n",
" <td>0.31433</td>\n",
" <td>0.327165</td>\n",
" </tr>\n",
" <tr>\n",
" <th>working</th>\n",
" <td>19</td>\n",
" <td>22</td>\n",
" <td>41</td>\n",
" <td>21.32</td>\n",
" <td>19.68</td>\n",
" <td>-0.68</td>\n",
" <td>0.68</td>\n",
" <td>0.14727</td>\n",
" <td>0.153284</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"watching no yes Total Expected watch Expected not watch \\\n",
"employment_status \n",
"student 5 4 9 4.68 4.32 \n",
"working 19 22 41 21.32 19.68 \n",
"\n",
"watching E - O (Watch) E - O (Not Watch) chi2_watch \\\n",
"employment_status \n",
"student 0.68 -0.68 0.31433 \n",
"working -0.68 0.68 0.14727 \n",
"\n",
"watching chi2_not_watch \n",
"employment_status \n",
"student 0.327165 \n",
"working 0.153284 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crosstab2 = pd.crosstab(index=films[\"employment_status\"], columns=films[\"watching\"])\n",
"crosstab2[\"Total\"] = crosstab2.no + crosstab2.yes\n",
"\n",
"crosstab2[\"Expected watch\"] = crosstab2.Total * 26/50\n",
"crosstab2[\"Expected not watch\"] = crosstab2.Total * 24/50\n",
"\n",
"crosstab2[\"E - O (Watch)\"] = crosstab2[\"Expected watch\"] - crosstab2.yes\n",
"crosstab2[\"E - O (Not Watch)\"] = crosstab2[\"Expected not watch\"] - crosstab2.no\n",
"\n",
"crosstab2[\"chi2_watch\"] = np.sqrt(crosstab2[\"E - O (Watch)\"]**2/crosstab2[\"Expected watch\"])\n",
"crosstab2[\"chi2_not_watch\"] = np.sqrt(crosstab2[\"E - O (Not Watch)\"]**2/crosstab2[\"Expected not watch\"])\n",
"\n",
"crosstab2"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9420494494487789"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chi2_emp = (crosstab2[\"chi2_watch\"] + crosstab2[\"chi2_not_watch\"]).sum()\n",
"chi2_emp"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Age node__"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>watching</th>\n",
" <th>no</th>\n",
" <th>yes</th>\n",
" <th>Total</th>\n",
" <th>Expected watch</th>\n",
" <th>Expected not watch</th>\n",
" <th>E - O (Watch)</th>\n",
" <th>E - O (Not Watch)</th>\n",
" <th>chi2_watch</th>\n",
" <th>chi2_not_watch</th>\n",
" </tr>\n",
" <tr>\n",
" <th>is_28+</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>11</td>\n",
" <td>17</td>\n",
" <td>28</td>\n",
" <td>14.56</td>\n",
" <td>13.44</td>\n",
" <td>-2.44</td>\n",
" <td>2.44</td>\n",
" <td>0.639454</td>\n",
" <td>0.665565</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>13</td>\n",
" <td>9</td>\n",
" <td>22</td>\n",
" <td>11.44</td>\n",
" <td>10.56</td>\n",
" <td>2.44</td>\n",
" <td>-2.44</td>\n",
" <td>0.721401</td>\n",
" <td>0.750858</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"watching no yes Total Expected watch Expected not watch E - O (Watch) \\\n",
"is_28+ \n",
"0 11 17 28 14.56 13.44 -2.44 \n",
"1 13 9 22 11.44 10.56 2.44 \n",
"\n",
"watching E - O (Not Watch) chi2_watch chi2_not_watch \n",
"is_28+ \n",
"0 2.44 0.639454 0.665565 \n",
"1 -2.44 0.721401 0.750858 "
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crosstab3 = pd.crosstab(index=films[\"is_28+\"], columns=films[\"watching\"])\n",
"crosstab3[\"Total\"] = crosstab3.no + crosstab3.yes\n",
"\n",
"crosstab3[\"Expected watch\"] = crosstab3.Total * 26/50\n",
"crosstab3[\"Expected not watch\"] = crosstab3.Total * 24/50\n",
"\n",
"crosstab3[\"E - O (Watch)\"] = crosstab3[\"Expected watch\"] - crosstab3.yes\n",
"crosstab3[\"E - O (Not Watch)\"] = crosstab3[\"Expected not watch\"] - crosstab3.no\n",
"\n",
"crosstab3[\"chi2_watch\"] = np.sqrt(crosstab3[\"E - O (Watch)\"]**2/crosstab3[\"Expected watch\"])\n",
"crosstab3[\"chi2_not_watch\"] = np.sqrt(crosstab3[\"E - O (Not Watch)\"]**2/crosstab3[\"Expected not watch\"])\n",
"\n",
"crosstab3"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2.777277533700757"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chi2_age = (crosstab3[\"chi2_watch\"] + crosstab3[\"chi2_not_watch\"]).sum()\n",
"chi2_age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- __Since chi2_gender > chi2_age > chi2_emp, and a larger chi-square value indicates a stronger association with the target, the node will be split on Gender.__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### III. Entropy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Gender node__"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>watching</th>\n",
" <th>no</th>\n",
" <th>yes</th>\n",
" </tr>\n",
" <tr>\n",
" <th>gender</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>F</th>\n",
" <td>8</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>M</th>\n",
" <td>16</td>\n",
" <td>12</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"watching no yes\n",
"gender \n",
"F 8 14\n",
"M 16 12"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crosstab1.iloc[:,:2]"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.94566030460064021"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Female node entropy\n",
"p = 14/float(22)\n",
"q = 8/float(22)\n",
"female_entropy = -p*np.log2(p) - q*np.log2(q)\n",
"female_entropy"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.98522813603425163"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Male node entropy\n",
"p = 12/float(28)\n",
"q = 16/float(28)\n",
"male_entropy = -p*np.log2(p) - q*np.log2(q)\n",
"male_entropy"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.96781829020346266"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Weighted entropy for gender\n",
"total_entropy_gender = (28/float(50))*male_entropy + (22/float(50))*female_entropy\n",
"total_entropy_gender"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Employment Node__"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>watching</th>\n",
" <th>no</th>\n",
" <th>yes</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employment_status</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>student</th>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>working</th>\n",
" <td>19</td>\n",
" <td>22</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"watching no yes\n",
"employment_status \n",
"student 5 4\n",
"working 19 22"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crosstab2.iloc[:,:2]"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.99107605983822222"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# entropy for students\n",
"p = 4/float(9)\n",
"q = 5/float(9)\n",
"student_entropy = -p*np.log2(p) - q*np.log2(q)\n",
"student_entropy"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.99613448350957956"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# entropy for working people\n",
"p = 22/float(41)\n",
"q = 19/float(41)\n",
"\n",
"working_entropy = -p*np.log2(p) - q*np.log2(q)\n",
"working_entropy"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Weighted entropy(employment):0.9952\n"
]
}
],
"source": [
"# Weighted entropy for employment\n",
"total_entropy_emp = (41/float(50))*working_entropy + (9/float(50))*student_entropy\n",
"print(\"Weighted entropy(employment):{:.4f}\".format(total_entropy_emp))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Age Node__"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>watching</th>\n",
" <th>no</th>\n",
" <th>yes</th>\n",
" </tr>\n",
" <tr>\n",
" <th>is_28+</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>11</td>\n",
" <td>17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>13</td>\n",
" <td>9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"watching no yes\n",
"is_28+ \n",
"0 11 17\n",
"1 13 9"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crosstab3.iloc[:,:2]"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.96661863254810276"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#entropy for less than 28 years\n",
"p = 17/float(28)\n",
"q = 11/float(28)\n",
"less_than_28_entropy = -p*np.log2(p) - q*np.log2(q)\n",
"less_than_28_entropy"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.97602064823661505"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#entropy for more than 28 years\n",
"p = 9/float(22)\n",
"q = 13/float(22)\n",
"more_than_28_entropy = -p*np.log2(p) - q*np.log2(q)\n",
"more_than_28_entropy"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.97075551945104821"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"total_entropy_age = (28/float(50))*less_than_28_entropy + (22/float(50))*more_than_28_entropy\n",
"total_entropy_age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- __Since entropy_gender < entropy_age < entropy_emp, and a lower weighted entropy means purer child nodes, the node will be split on Gender.__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.14"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
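
The three split criteria worked through in the notebook can also be computed from the crosstab counts with a few small helpers. The following is a minimal sketch, not the notebook author's code. It assumes Python 3 (the notebook itself runs on a Python 2 kernel), uses only the standard library, and hard-codes the (no, yes) counts shown in the crosstabs. The names `SPLITS`, `gini`, `entropy`, `weighted`, `chi_terms`, `chi_notebook`, and `chi_pearson` are hypothetical helpers introduced for illustration. `chi_notebook` mirrors the square-root form used in section II, while `chi_pearson` is the textbook statistic without the square root.

```python
from math import log2, sqrt

# Observed (no, yes) counts per branch, copied from the notebook's crosstabs.
SPLITS = {
    "gender": {"F": (8, 14), "M": (16, 12)},
    "employment": {"student": (5, 4), "working": (19, 22)},
    "age": {"under_28": (11, 17), "28_plus": (13, 9)},
}


def gini(counts):
    """Gini impurity of one node: 1 - sum(p_k ** 2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy of one node in bits; empty classes contribute 0."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def weighted(impurity, branches):
    """Average a node impurity over the branches, weighted by branch size."""
    n = sum(sum(counts) for counts in branches.values())
    return sum(sum(counts) / n * impurity(counts) for counts in branches.values())


def chi_terms(branches):
    """Yield (observed, expected) per cell, with expected counts under independence."""
    rows = list(branches.values())
    n = sum(sum(counts) for counts in rows)
    class_totals = [sum(col) for col in zip(*rows)]
    for counts in rows:
        row_total = sum(counts)
        for observed, class_total in zip(counts, class_totals):
            yield observed, row_total * class_total / n


def chi_notebook(branches):
    """Sum of sqrt((O - E)^2 / E), the form used in the notebook."""
    return sum(sqrt((o - e) ** 2 / e) for o, e in chi_terms(branches))


def chi_pearson(branches):
    """Textbook Pearson chi-square: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in chi_terms(branches))


# Print every criterion for every candidate split.
for name, branches in SPLITS.items():
    print("{:<11} gini={:.4f} entropy={:.4f} chi_notebook={:.4f} chi_pearson={:.4f}".format(
        name,
        weighted(gini, branches),
        weighted(entropy, branches),
        chi_notebook(branches),
        chi_pearson(branches),
    ))
```

Under these assumptions, all the criteria pick the same variable: gender has the lowest weighted Gini and weighted entropy, and the highest chi-square value in either form.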