@saeedaghabozorgi
Created November 29, 2018 19:06
Created on Cognitive Class Labs
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"<a href=\"https://www.bigdatauniversity.com\"><img src = \"https://ibm.box.com/shared/static/cw2c7r3o20w9zn8gkecaeyjhgw3xdgbj.png\" width = 400, align = \"center\"></a>\n",
"\n",
"<h1 align=center><font size = 5> Logistic Regression with Python</font></h1>"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"In this notebook, you will learn about Logistic Regression, and then you'll build a model on telecommunications data to predict which customers will leave for a competitor, so that you can take action to retain them.\n",
"\n",
"\n",
"<a id=\"ref1\"></a>\n",
"## What is the difference between Linear and Logistic Regression?\n",
"\n",
"While Linear Regression is suited for estimating continuous values (e.g. estimating house price), it isn’t the best tool for predicting the class of an observed data point. To classify a data point, we need some guidance on what the **most probable class** for that point would be. For this, we use **Logistic Regression**.\n",
"\n",
"<div class=\"alert alert-success alertsuccess\" style=\"margin-top: 20px\">\n",
"<font size = 3><strong>Recall linear regression:</strong></font>\n",
"<br>\n",
"<br>\n",
"Linear regression finds a function that relates a continuous dependent variable, _y_, to some predictors (independent variables _x1_, _x2_, etc.). Linear regression with several predictors assumes a function of the form:\n",
"<br><br>\n",
"$$\n",
"y = w_0 + w_1 x_1 + w_2 x_2 + \\cdots\n",
"$$\n",
"<br>\n",
"and finds the values of _w0_, _w1_, _w2_, etc. The term _w0_ is the \"intercept\" or \"constant term\". In vector notation, with a constant 1 prepended to _X_ to account for the intercept:\n",
"<br><br>\n",
"$$\n",
"Y = W^TX\n",
"$$\n",
"<p></p>\n",
"\n",
"</div>\n",
"\n",
"Logistic Regression is a variation of Linear Regression, useful when the observed dependent variable, _y_, is categorical. It produces a formula that predicts the probability of the class label as a function of the independent variables.\n",
"\n",
"Despite the name logistic _regression_, it is actually a __probabilistic classification__ model. Logistic regression fits a special s-shaped curve by taking the linear regression and transforming the numeric estimate into a probability with the following function:\n",
"\n",
"$$\n",
"h_w(x) = \\frac{e^{w_0 + w_1 x_1 + w_2 x_2 + \\cdots}}{1 + e^{w_0 + w_1 x_1 + w_2 x_2 + \\cdots}}\n",
"$$\n",
"Or:\n",
"$$\n",
"p(X) = P(Y=1 \\mid X) = \\sigma(W^TX) = \\frac{e^{W^TX}}{1+e^{W^TX}} = \\frac{\\exp(W^TX)}{1+\\exp(W^TX)}\n",
"$$\n",
"Or, equivalently, in terms of the log-odds:\n",
"$$\n",
"\\text{logit}(p(X)) = \\log(\\text{odds}) = \\log\\left(\\frac{p(X)}{1-p(X)}\\right) = w_0 + w_1 x_1 + w_2 x_2 + \\cdots\n",
"$$\n",
"The logistic function produces probabilities between 0 (as $W^TX$ approaches minus infinity) and 1 (as $W^TX$ approaches plus infinity). This makes logistic regression a special kind of non-linear regression.\n",
"\n",
"In these equations, $W^TX$ is the regression result (the sum of the variables weighted by the coefficients), `exp` is the exponential function and $\\sigma(W^TX)$ is the [logistic function](http://en.wikipedia.org/wiki/Logistic_function), also called the logistic curve. It has a common \"S\" shape (sigmoid curve), and was first developed for modelling population growth.\n",
"\n",
"\n",
"So, briefly, Logistic Regression passes the weighted sum of the inputs through the logistic (sigmoid) function and treats the result as a probability:\n",
"\n",
"<img\n",
"src=\"https://ibm.box.com/shared/static/kgv9alcghmjcv97op4d6onkyxevk23b1.png\" width = \"400\" align = \"center\">\n",
"\n"
]
},
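{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the formulas above, the next cell evaluates the logistic function for a few values of the linear score and checks that the logit maps each probability back to that score. The helper names `sigmoid` and `logit` and the sample scores are illustrative assumptions, not part of the churn model built later in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def sigmoid(z):\n",
"    # Logistic function e^z / (1 + e^z), written in the equivalent form 1 / (1 + e^-z)\n",
"    return 1.0 / (1.0 + math.exp(-z))\n",
"\n",
"def logit(p):\n",
"    # Log-odds: the inverse of the logistic function\n",
"    return math.log(p / (1.0 - p))\n",
"\n",
"# Hypothetical linear scores W^T X, chosen only for illustration\n",
"scores = [-4.0, -1.0, 0.0, 1.0, 4.0]\n",
"probs = [sigmoid(z) for z in scores]\n",
"\n",
"for z, p in zip(scores, probs):\n",
"    # logit(p) recovers the original score up to floating-point error\n",
"    print(f'score={z:+.1f}  probability={p:.3f}  logit={logit(p):+.3f}')"
]
},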
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Let's first import the required libraries:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/jupyterlab/conda/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.\n",
" from pandas.core import datetools\n",
"/home/jupyterlab/conda/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n",
" \"This module will be removed in 0.20.\", DeprecationWarning)\n"
]
}
],
"source": [
"import pandas as pd\n",
"import statsmodels.api as sm\n",
"import pylab as pl\n",
"import numpy as np\n",
"import scipy.optimize as opt\n",
"from sklearn import preprocessing\n",
"from sklearn.model_selection import train_test_split\n",
"%matplotlib inline \n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Customer churn with Logistic Regression\n",
"A telecommunications company is concerned about the number of customers leaving their landline business for cable competitors. They need to understand who is leaving. Imagine that you’re an analyst at this company and you have to find out who is leaving and why."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"### About dataset\n",
"We’ll use a telecommunications dataset to predict customer churn. It is historical customer data in which each row represents one customer. The data is relatively easy to understand, and you may uncover insights you can use immediately. It is typically less expensive to keep customers than to acquire new ones, so the focus of this analysis is to predict which customers will stay with the company.\n",
"\n",
"\n",
"This data set provides info to help you predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.\n",
"\n",
"\n",
"\n",
"The data set includes information about:\n",
"\n",
"- Customers who left within the last month – the column is called Churn\n",
"- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies\n",
"- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges\n",
"- Demographic info about customers – gender, age range, and if they have partners and dependents\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"### Load the Telco Churn data \n",
"Telco Churn is a hypothetical data file that concerns a telecommunications company's efforts to reduce turnover in its customer base. Each case corresponds to a separate customer and records various demographic and service usage information. Before you can work with the data, you must download ChurnData.csv from the URL below."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2018-01-31 21:55:42-- https://ibm.box.com/shared/static/8s8dn9gam7ipqb42cm4aehmbb26zkekl.csv\n",
"Resolving ibm.box.com (ibm.box.com)... 107.152.27.197\n",
"Connecting to ibm.box.com (ibm.box.com)|107.152.27.197|:443... connected.\n",
"HTTP request sent, awaiting response... 301 Moved Permanently\n",
"Location: https://ibm.ent.box.com/shared/static/8s8dn9gam7ipqb42cm4aehmbb26zkekl.csv [following]\n",
"--2018-01-31 21:55:43-- https://ibm.ent.box.com/shared/static/8s8dn9gam7ipqb42cm4aehmbb26zkekl.csv\n",
"Resolving ibm.ent.box.com (ibm.ent.box.com)... 107.152.27.211, 107.152.26.211\n",
"Connecting to ibm.ent.box.com (ibm.ent.box.com)|107.152.27.211|:443... connected.\n",
"HTTP request sent, awaiting response... 302 Found\n",
"Location: https://public.boxcloud.com/d/1/eFZfnPsXaNFGM-meuzBFT7mtYslMBeoRqFqZfGJ440MQCBqRTave1AlwufCjDIhsw5Z7ovKHjLq4B47zhoBCuOy7XDXxNkn8IMJbep5c_bCXv1wz_dDjFHojK41XFRDa76n6p6YCAxMbfjwpatQp_cw70M9j23MjIg8KIaBQtVSI0H831uePMR4q91gUGss8bujO4RhaL7-sMPIR7vORmW1eOx2GPnUK4FVHAbcVksnLoWTelylRYeJV6AD2OrQTuie6H3on7W1vwkmwc_I-CCZqUyfpjUh3DsdgMMMVUFvsoKwyUNpE8AHEAKIXHdx1HqI7Dj4HvzltdBab7adH1pqYHM_rTpXdtkoUJpRSqVTnd3ylIbKxPXAvuXque0zlW_m5kTJj8wbq9f7E4xucNsdPkOMo_RbT7OiG9S3h6ufPoN6xu5qmMoOl1EEVP0c7qdRDtATZMv8p2t99glzax7QARB8y9u9IaZNvEHkROa9ik4IEp0ihnZWQ2Y2f0mLjlG0Dt55aaYstFXSyVa6Do7S4EJvuSCJpb-xwHecPapcJV9ywg34OpZtOrOduQpEk29TUi_P-66MSnTyQ1sX9wZWOvuDHbiKORqA3yWwwvJnXANT0pk372GEQw2A83hXS0QGM_dERmXmSZPJepOkoe2LRODzHJh7R13QW8Xpa65tImM8AMJCG1dyMGm4djzFKWWc2ITE3CtzwUMk_NnJlAzctZyItIw23UwmSMy0_Vi3C_lDkJY4bj5wK13HbKnL-eD4oNEh1U7RyEzrggw_BjVFE__cetGPE6ybv57UQsqBkGGmrgfGzAVhs_3seswv_Wx8jPRCK0mT2lxaheotuVaANvy18_V8WYOJoM_UQ4QiIwk6MZRXUj0S2W82NSwGoI6_rwy1hiuPalNPHJbvmN0Bn2mqs-SrSoBjJJ8seUYr9tAoSEwuiaRbAk8mxM3XkQ2zoHCoCcNbewwqLxW7s56QaTPJz3HWRZ0KAI_ZXaMWN9-XdhtimIis1lZK1MX306Blcr1vrUvO4OeSWC3uBx-Ol9ZxGbo-YmuLz4wnfCjxovA2VnJARu4wqUxcr-WkWycEvsFLqe5d3ul4XGKZ96zPvFFT68t3o6DbV7_VnRhayxP3kO4IPGkKOKeJItZeZVLFhhD8T2p57RqgNTD4k37F3zIVv89Ls3hlftg8xJe20z9zY8g0eX0yEolAu8R5pKfxKIcYxj6QETtUG4AXHRwDPbeIdT5CxVKKmqyd6X5tOq_IBk7ENyYsskMRxZKScp-tReooOLJXnmCknj46qggl7vYGMHpCYQpUm_fAR7omIe5K3RGmyy_1XK1Q_S5l_Jz6-Ujo./download [following]\n",
"--2018-01-31 21:55:44-- https://public.boxcloud.com/d/1/eFZfnPsXaNFGM-meuzBFT7mtYslMBeoRqFqZfGJ440MQCBqRTave1AlwufCjDIhsw5Z7ovKHjLq4B47zhoBCuOy7XDXxNkn8IMJbep5c_bCXv1wz_dDjFHojK41XFRDa76n6p6YCAxMbfjwpatQp_cw70M9j23MjIg8KIaBQtVSI0H831uePMR4q91gUGss8bujO4RhaL7-sMPIR7vORmW1eOx2GPnUK4FVHAbcVksnLoWTelylRYeJV6AD2OrQTuie6H3on7W1vwkmwc_I-CCZqUyfpjUh3DsdgMMMVUFvsoKwyUNpE8AHEAKIXHdx1HqI7Dj4HvzltdBab7adH1pqYHM_rTpXdtkoUJpRSqVTnd3ylIbKxPXAvuXque0zlW_m5kTJj8wbq9f7E4xucNsdPkOMo_RbT7OiG9S3h6ufPoN6xu5qmMoOl1EEVP0c7qdRDtATZMv8p2t99glzax7QARB8y9u9IaZNvEHkROa9ik4IEp0ihnZWQ2Y2f0mLjlG0Dt55aaYstFXSyVa6Do7S4EJvuSCJpb-xwHecPapcJV9ywg34OpZtOrOduQpEk29TUi_P-66MSnTyQ1sX9wZWOvuDHbiKORqA3yWwwvJnXANT0pk372GEQw2A83hXS0QGM_dERmXmSZPJepOkoe2LRODzHJh7R13QW8Xpa65tImM8AMJCG1dyMGm4djzFKWWc2ITE3CtzwUMk_NnJlAzctZyItIw23UwmSMy0_Vi3C_lDkJY4bj5wK13HbKnL-eD4oNEh1U7RyEzrggw_BjVFE__cetGPE6ybv57UQsqBkGGmrgfGzAVhs_3seswv_Wx8jPRCK0mT2lxaheotuVaANvy18_V8WYOJoM_UQ4QiIwk6MZRXUj0S2W82NSwGoI6_rwy1hiuPalNPHJbvmN0Bn2mqs-SrSoBjJJ8seUYr9tAoSEwuiaRbAk8mxM3XkQ2zoHCoCcNbewwqLxW7s56QaTPJz3HWRZ0KAI_ZXaMWN9-XdhtimIis1lZK1MX306Blcr1vrUvO4OeSWC3uBx-Ol9ZxGbo-YmuLz4wnfCjxovA2VnJARu4wqUxcr-WkWycEvsFLqe5d3ul4XGKZ96zPvFFT68t3o6DbV7_VnRhayxP3kO4IPGkKOKeJItZeZVLFhhD8T2p57RqgNTD4k37F3zIVv89Ls3hlftg8xJe20z9zY8g0eX0yEolAu8R5pKfxKIcYxj6QETtUG4AXHRwDPbeIdT5CxVKKmqyd6X5tOq_IBk7ENyYsskMRxZKScp-tReooOLJXnmCknj46qggl7vYGMHpCYQpUm_fAR7omIe5K3RGmyy_1XK1Q_S5l_Jz6-Ujo./download\n",
"Resolving public.boxcloud.com (public.boxcloud.com)... 107.152.26.200, 107.152.27.200\n",
"Connecting to public.boxcloud.com (public.boxcloud.com)|107.152.26.200|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 36144 (35K) [text/csv]\n",
"Saving to: ‘ChurnData.csv’\n",
"\n",
"ChurnData.csv 100%[=====================>] 35.30K --.-KB/s in 0.04s \n",
"\n",
"2018-01-31 21:55:44 (948 KB/s) - ‘ChurnData.csv’ saved [36144/36144]\n",
"\n"
]
}
],
"source": [
"#Click here and press Shift+Enter\n",
"!wget -O ChurnData.csv https://ibm.box.com/shared/static/8s8dn9gam7ipqb42cm4aehmbb26zkekl.csv"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"### Load Data From CSV File "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>tenure</th>\n",
" <th>age</th>\n",
" <th>address</th>\n",
" <th>income</th>\n",
" <th>ed</th>\n",
" <th>employ</th>\n",
" <th>equip</th>\n",
" <th>callcard</th>\n",
" <th>wireless</th>\n",
" <th>longmon</th>\n",
" <th>...</th>\n",
" <th>pager</th>\n",
" <th>internet</th>\n",
" <th>callwait</th>\n",
" <th>confer</th>\n",
" <th>ebill</th>\n",
" <th>loglong</th>\n",
" <th>logtoll</th>\n",
" <th>lninc</th>\n",
" <th>custcat</th>\n",
" <th>churn</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>11.0</td>\n",
" <td>33.0</td>\n",
" <td>7.0</td>\n",
" <td>136.0</td>\n",
" <td>5.0</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>4.40</td>\n",
" <td>...</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.482</td>\n",
" <td>3.033</td>\n",
" <td>4.913</td>\n",
" <td>4.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>33.0</td>\n",
" <td>33.0</td>\n",
" <td>12.0</td>\n",
" <td>33.0</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>9.45</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2.246</td>\n",
" <td>3.240</td>\n",
" <td>3.497</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>23.0</td>\n",
" <td>30.0</td>\n",
" <td>9.0</td>\n",
" <td>30.0</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>6.30</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.841</td>\n",
" <td>3.240</td>\n",
" <td>3.401</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>38.0</td>\n",
" <td>35.0</td>\n",
" <td>5.0</td>\n",
" <td>76.0</td>\n",
" <td>2.0</td>\n",
" <td>10.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>6.05</td>\n",
" <td>...</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.800</td>\n",
" <td>3.807</td>\n",
" <td>4.331</td>\n",
" <td>4.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>7.0</td>\n",
" <td>35.0</td>\n",
" <td>14.0</td>\n",
" <td>80.0</td>\n",
" <td>2.0</td>\n",
" <td>15.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>7.10</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.960</td>\n",
" <td>3.091</td>\n",
" <td>4.382</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"</div>"
],
"text/plain": [
" tenure age address income ed employ equip callcard wireless \\\n",
"0 11.0 33.0 7.0 136.0 5.0 5.0 0.0 1.0 1.0 \n",
"1 33.0 33.0 12.0 33.0 2.0 0.0 0.0 0.0 0.0 \n",
"2 23.0 30.0 9.0 30.0 1.0 2.0 0.0 0.0 0.0 \n",
"3 38.0 35.0 5.0 76.0 2.0 10.0 1.0 1.0 1.0 \n",
"4 7.0 35.0 14.0 80.0 2.0 15.0 0.0 1.0 0.0 \n",
"\n",
" longmon ... pager internet callwait confer ebill loglong logtoll \\\n",
"0 4.40 ... 1.0 0.0 1.0 1.0 0.0 1.482 3.033 \n",
"1 9.45 ... 0.0 0.0 0.0 0.0 0.0 2.246 3.240 \n",
"2 6.30 ... 0.0 0.0 0.0 1.0 0.0 1.841 3.240 \n",
"3 6.05 ... 1.0 1.0 1.0 1.0 1.0 1.800 3.807 \n",
"4 7.10 ... 0.0 0.0 1.0 1.0 0.0 1.960 3.091 \n",
"\n",
" lninc custcat churn \n",
"0 4.913 4.0 1.0 \n",
"1 3.497 1.0 1.0 \n",
"2 3.401 3.0 0.0 \n",
"3 4.331 4.0 0.0 \n",
"4 4.382 3.0 0.0 \n",
"\n",
"[5 rows x 28 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"churn_df = pd.read_csv(\"ChurnData.csv\")\n",
"churn_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data pre-processing and selection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's select some features for modeling. We also change the target data type to integer, as required by the scikit-learn algorithm:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>tenure</th>\n",
" <th>age</th>\n",
" <th>address</th>\n",
" <th>income</th>\n",
" <th>ed</th>\n",
" <th>employ</th>\n",
" <th>equip</th>\n",
" <th>callcard</th>\n",
" <th>wireless</th>\n",
" <th>churn</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>11.0</td>\n",
" <td>33.0</td>\n",
" <td>7.0</td>\n",
" <td>136.0</td>\n",
" <td>5.0</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>33.0</td>\n",
" <td>33.0</td>\n",
" <td>12.0</td>\n",
" <td>33.0</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>23.0</td>\n",
" <td>30.0</td>\n",
" <td>9.0</td>\n",
" <td>30.0</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>38.0</td>\n",
" <td>35.0</td>\n",
" <td>5.0</td>\n",
" <td>76.0</td>\n",
" <td>2.0</td>\n",
" <td>10.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>7.0</td>\n",
" <td>35.0</td>\n",
" <td>14.0</td>\n",
" <td>80.0</td>\n",
" <td>2.0</td>\n",
" <td>15.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" tenure age address income ed employ equip callcard wireless \\\n",
"0 11.0 33.0 7.0 136.0 5.0 5.0 0.0 1.0 1.0 \n",
"1 33.0 33.0 12.0 33.0 2.0 0.0 0.0 0.0 0.0 \n",
"2 23.0 30.0 9.0 30.0 1.0 2.0 0.0 0.0 0.0 \n",
"3 38.0 35.0 5.0 76.0 2.0 10.0 1.0 1.0 1.0 \n",
"4 7.0 35.0 14.0 80.0 2.0 15.0 0.0 1.0 0.0 \n",
"\n",
" churn \n",
"0 1 \n",
"1 1 \n",
"2 0 \n",
"3 0 \n",
"4 0 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"churn_df = churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip', 'callcard', 'wireless','churn']]\n",
"churn_df['churn'] = churn_df['churn'].astype('int')\n",
"churn_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": true,
"new_sheet": true,
"run_control": {
"read_only": false
}
},
"source": [
"#### How many values (rows × columns) in total?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"2000"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"churn_df.size"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip',\n",
" 'callcard', 'wireless', 'churn'],\n",
" dtype='object')"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"churn_df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's define X and y for our dataset:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 11., 33., 7., 136., 5., 5., 0.],\n",
" [ 33., 33., 12., 33., 2., 0., 0.],\n",
" [ 23., 30., 9., 30., 1., 2., 0.],\n",
" [ 38., 35., 5., 76., 2., 10., 1.],\n",
" [ 7., 35., 14., 80., 2., 15., 0.]])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = np.asarray(churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])\n",
"X[0:5]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 1, 0, 0, 0])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y = np.asarray(churn_df['churn'])\n",
"y [0:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also standardize the features so that each column has zero mean and unit variance:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[-1.13518441, -0.62595491, -0.4588971 , 0.4751423 , 1.6961288 ,\n",
" -0.58477841, -0.85972695],\n",
" [-0.11604313, -0.62595491, 0.03454064, -0.32886061, -0.6433592 ,\n",
" -1.14437497, -0.85972695],\n",
" [-0.57928917, -0.85594447, -0.261522 , -0.35227817, -1.42318853,\n",
" -0.92053635, -0.85972695],\n",
" [ 0.11557989, -0.47262854, -0.65627219, 0.00679109, -0.6433592 ,\n",
" -0.02518185, 1.16316 ],\n",
" [-1.32048283, -0.47262854, 0.23191574, 0.03801451, -0.6433592 ,\n",
" 0.53441472, -0.85972695]])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn import preprocessing\n",
"X = preprocessing.StandardScaler().fit(X).transform(X)\n",
"\n",
"\n",
"X[0:5]"
]
},
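{
"cell_type": "markdown",
"metadata": {},
"source": [
"`StandardScaler` rescales each column by subtracting the column mean and dividing by the population standard deviation. The cell below is a rough sketch of that arithmetic on a single column. It uses only the first five `tenure` values shown earlier as a stand-alone toy sample, so its results differ from the scaled output above, which is based on the statistics of the full dataset. The variable names are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import statistics\n",
"\n",
"# Toy sample: the first five tenure values from the table above\n",
"tenure = [11.0, 33.0, 23.0, 38.0, 7.0]\n",
"\n",
"mean = statistics.mean(tenure)\n",
"# Population standard deviation (divides by n, not n - 1)\n",
"std = statistics.pstdev(tenure)\n",
"\n",
"# z-score: subtract the mean, then divide by the standard deviation\n",
"scaled = [(x - mean) / std for x in tenure]\n",
"print([round(v, 3) for v in scaled])"
]
},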
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train/Test dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we split our dataset into training and test sets:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train set: (160, 7) (160,)\n",
"Test set: (40, 7) (40,)\n"
]
}
],
"source": [
"X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)\n",
"print ('Train set:', X_train.shape, y_train.shape)\n",
"print ('Test set:', X_test.shape, y_test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modeling (Logistic Regression with Scikit-learn)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.metrics import confusion_matrix\n",
"#y_cat = churn_df['churn']\n",
"# X_train, X_test, y_train, y_test = train_test_split( X, y_cat, test_size=0.2, random_state=4)\n",
"# print ('Train set:', X_train.shape, y_train.shape)\n",
"# print ('Test set:', X_test.shape, y_test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's build our model:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,\n",
" intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
" penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
" verbose=0, warm_start=False)"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"LR = LogisticRegression(C=0.01).fit(X_train,y_train)\n",
"LR"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can predict using our test set:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 0, 0, 0, 0])"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"yhat = LR.predict(X_test)\n",
"yhat [0:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluation"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import classification_report, confusion_matrix\n",
"import itertools"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"def plot_confusion_matrix(cm, classes,\n",
" normalize=False,\n",
" title='Confusion matrix',\n",
" cmap=plt.cm.Blues):\n",
" \"\"\"\n",
" This function prints and plots the confusion matrix.\n",
" Normalization can be applied by setting `normalize=True`.\n",
" \"\"\"\n",
" if normalize:\n",
" cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n",
" print(\"Normalized confusion matrix\")\n",
" else:\n",
" print('Confusion matrix, without normalization')\n",
"\n",
" print(cm)\n",
"\n",
" plt.imshow(cm, interpolation='nearest', cmap=cmap)\n",
" plt.title(title)\n",
" plt.colorbar()\n",
" tick_marks = np.arange(len(classes))\n",
" plt.xticks(tick_marks, classes, rotation=45)\n",
" plt.yticks(tick_marks, classes)\n",
"\n",
" fmt = '.2f' if normalize else 'd'\n",
" thresh = cm.max() / 2.\n",
" for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n",
" plt.text(j, i, format(cm[i, j], fmt),\n",
" horizontalalignment=\"center\",\n",
" color=\"white\" if cm[i, j] > thresh else \"black\")\n",
"\n",
" plt.tight_layout()\n",
" plt.ylabel('True label')\n",
" plt.xlabel('Predicted label')"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"print(confusion_matrix(y_test, yhat, labels=[1,0]))"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 0.73 0.96 0.83 25\n",
" 1 0.86 0.40 0.55 15\n",
"\n",
"avg / total 0.78 0.75 0.72 40\n",
"\n",
"Confusion matrix, without normalization\n",
"[[ 6 9]\n",
" [ 1 24]]\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x7f210eb17198>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Compute confusion matrix\n",
"cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])\n",
"np.set_printoptions(precision=2)\n",
"\n",
"print (classification_report(y_test, yhat))\n",
"\n",
"# Plot non-normalized confusion matrix\n",
"plt.figure()\n",
"plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'],normalize= False, title='Confusion matrix')"
]
},
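{
"cell_type": "markdown",
"metadata": {},
"source": [
"The figures in the classification report can be traced back to the four counts of the confusion matrix printed above. The cell below is a small sketch of that arithmetic: the counts are copied by hand from that output, and the helper name `prf` is an assumption made for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Counts read from the confusion matrix above (labels=[1, 0]):\n",
"# rows are true classes, columns are predicted classes\n",
"tp, fn = 6, 9    # true churn=1: predicted 1, predicted 0\n",
"fp, tn = 1, 24   # true churn=0: predicted 1, predicted 0\n",
"\n",
"def prf(tp, fp, fn):\n",
"    # Precision, recall and F1 for the class treated as positive\n",
"    precision = tp / (tp + fp)\n",
"    recall = tp / (tp + fn)\n",
"    f1 = 2 * precision * recall / (precision + recall)\n",
"    return precision, recall, f1\n",
"\n",
"# Class 1 (churn) as the positive class\n",
"p1, r1, f1_1 = prf(tp, fp, fn)\n",
"# Class 0 (no churn) as the positive class: the roles of the counts swap\n",
"p0, r0, f1_0 = prf(tn, fn, fp)\n",
"\n",
"print(f'churn=1  precision={p1:.2f}  recall={r1:.2f}  f1={f1_1:.2f}')\n",
"print(f'churn=0  precision={p0:.2f}  recall={r0:.2f}  f1={f1_0:.2f}')"
]
},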
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try the Jaccard index as an accuracy measure:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.75"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics import jaccard_similarity_score\n",
"jaccard_similarity_score(y_test, yhat)"
]
},
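{
"cell_type": "markdown",
"metadata": {},
"source": [
"For single-label predictions like these, `jaccard_similarity_score` returns the same value as plain accuracy, which is why the result above is 0.75 = (6 + 24) / 40. The function was deprecated in scikit-learn 0.21 and dropped in later releases in favor of `jaccard_score`, which for binary targets reports the Jaccard index of the positive class instead. The sketch below works out both numbers by hand from the confusion-matrix counts shown earlier; the variable names are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Counts copied from the confusion matrix above (positive class: churn=1)\n",
"tp, fn, fp, tn = 6, 9, 1, 24\n",
"\n",
"# Accuracy: share of all test customers that were classified correctly\n",
"accuracy = (tp + tn) / (tp + fn + fp + tn)\n",
"\n",
"# Jaccard index of the churn class: |intersection| / |union| of the sets\n",
"# of customers labelled churn=1 by the truth and by the prediction\n",
"jaccard_churn = tp / (tp + fn + fp)\n",
"\n",
"print(f'accuracy={accuracy:.3f}  jaccard(churn=1)={jaccard_churn:.3f}')"
]
},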
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"## Want to learn more?\n",
"\n",
"You can take the free [Machine learning with Python](https://cocl.us/DX0108EN_ML0101EN) course.\n",
"\n",
"IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: [SPSS Modeler for Mac users](https://cocl.us/DX0108EN_SPSSMod_mac) and [SPSS Modeler for Windows users](https://cocl.us/DX0108EN_SPSSMod_win)\n",
"\n",
"Also, you can use Data Science Experience to run these notebooks faster with bigger datasets. Data Science Experience is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, DSX enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of DSX users today with a free account at [Data Science Experience](https://cocl.us/DX0108EN_DSX)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"### Thanks for completing this lesson!\n",
"\n",
"Notebook created by: <a href = \"https://ca.linkedin.com/in/saeedaghabozorgi\">Saeed Aghabozorgi</a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"<hr>\n",
"Copyright &copy; 2017 [Cognitive Class](https://cocl.us/DX0108EN_CC). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}