Skip to content

Instantly share code, notes, and snippets.

@harisonmg
Last active May 8, 2019 07:22
Show Gist options
  • Save harisonmg/b452289e14af1159a070e8da094202ca to your computer and use it in GitHub Desktop.
Save harisonmg/b452289e14af1159a070e8da094202ca to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"nbformat_minor": 1,
"cells": [
{
"source": "# Regression D-NA Course Notebook (Supervised Learning)",
"cell_type": "markdown",
"metadata": {
"collapsed": true
}
},
{
"source": "### Importing Data Files",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 1,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 1,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>PRODUCT CODE</th>\n <th>PRODUCT CATEGORY</th>\n <th>UNIT LIST PRICE</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>30001</td>\n <td>HEALTH &amp; BEAUTY</td>\n <td>$7.45</td>\n </tr>\n <tr>\n <th>1</th>\n <td>30002</td>\n <td>HEALTH &amp; BEAUTY</td>\n <td>$5.35</td>\n </tr>\n <tr>\n <th>2</th>\n <td>30003</td>\n <td>HEALTH &amp; BEAUTY</td>\n <td>$5.49</td>\n </tr>\n <tr>\n <th>3</th>\n <td>30004</td>\n <td>HEALTH &amp; BEAUTY</td>\n <td>$6.46</td>\n </tr>\n <tr>\n <th>4</th>\n <td>30005</td>\n <td>HEALTH &amp; BEAUTY</td>\n <td>$7.33</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " PRODUCT CODE PRODUCT CATEGORY UNIT LIST PRICE\n0 30001 HEALTH & BEAUTY $7.45 \n1 30002 HEALTH & BEAUTY $5.35 \n2 30003 HEALTH & BEAUTY $5.49 \n3 30004 HEALTH & BEAUTY $6.46 \n4 30005 HEALTH & BEAUTY $7.33 "
},
"output_type": "execute_result"
}
],
"source": "# The code was removed by Watson Studio for sharing."
},
{
"execution_count": 2,
"cell_type": "code",
"metadata": {
"scrolled": true
},
"outputs": [
{
"execution_count": 2,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>CUSTOMER NUM</th>\n <th>PRODUCT NUM</th>\n <th>QUANTITY PURCHASED</th>\n <th>DISCOUNT TAKEN</th>\n <th>TRANSACTION DATE</th>\n <th>STOCKOUT</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>10114</td>\n <td>30011</td>\n <td>4</td>\n <td>0.0</td>\n <td>1/2/2015</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>10217</td>\n <td>30016</td>\n <td>3</td>\n <td>0.0</td>\n <td>1/2/2015</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>10224</td>\n <td>30013</td>\n <td>4</td>\n <td>0.0</td>\n <td>1/2/2015</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>10103</td>\n <td>30012</td>\n <td>3</td>\n <td>0.2</td>\n <td>1/2/2015</td>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>10037</td>\n <td>30010</td>\n <td>8</td>\n <td>0.0</td>\n <td>1/2/2015</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " CUSTOMER NUM PRODUCT NUM QUANTITY PURCHASED DISCOUNT TAKEN \\\n0 10114 30011 4 0.0 \n1 10217 30016 3 0.0 \n2 10224 30013 4 0.0 \n3 10103 30012 3 0.2 \n4 10037 30010 8 0.0 \n\n TRANSACTION DATE STOCKOUT \n0 1/2/2015 0 \n1 1/2/2015 0 \n2 1/2/2015 0 \n3 1/2/2015 0 \n4 1/2/2015 0 "
},
"output_type": "execute_result"
}
],
"source": "#Import Transaction DataSet Here\nbody = client_0b7f25d6f4c743e1bbf81e3ecac0de47.get_object(Bucket='ibmdnadatasciencelearningproject-donotdelete-pr-foj79k0orfi5wk',Key='DNA- Transaction Data Set - Student 3 of 3.csv')['Body']\n# add missing __iter__ method, so pandas accepts body as file-like object\nif not hasattr(body, \"__iter__\"): body.__iter__ = types.MethodType( __iter__, body )\n\ntransactions_data = pd.read_csv(body, sep='|')\ntransactions_data.head()\n\n"
},
{
"execution_count": 3,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 3,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>CUSTOMERID</th>\n <th>GENDER</th>\n <th>AGE</th>\n <th>INCOME</th>\n <th>EXPERIENCE SCORE</th>\n <th>LOYALTY GROUP</th>\n <th>ENROLLMENT DATE</th>\n <th>HOUSEHOLD SIZE</th>\n <th>MARITAL STATUS</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>10001</td>\n <td>0</td>\n <td>64</td>\n <td>$133,498</td>\n <td>5</td>\n <td>enrolled</td>\n <td>06-03-2013</td>\n <td>4</td>\n <td>Single</td>\n </tr>\n <tr>\n <th>1</th>\n <td>10002</td>\n <td>0</td>\n <td>42</td>\n <td>$94,475</td>\n <td>9</td>\n <td>notenrolled</td>\n <td>NaN</td>\n <td>6</td>\n <td>Married</td>\n </tr>\n <tr>\n <th>2</th>\n <td>10003</td>\n <td>0</td>\n <td>40</td>\n <td>$88,610</td>\n <td>9</td>\n <td>enrolled</td>\n <td>02-09-2010</td>\n <td>5</td>\n <td>Married</td>\n </tr>\n <tr>\n <th>3</th>\n <td>10004</td>\n <td>0</td>\n <td>38</td>\n <td>$84,313</td>\n <td>8</td>\n <td>enrolled</td>\n <td>06-04-2015</td>\n <td>1</td>\n <td>Single</td>\n </tr>\n <tr>\n <th>4</th>\n <td>10005</td>\n <td>0</td>\n <td>30</td>\n <td>$51,498</td>\n <td>3</td>\n <td>notenrolled</td>\n <td>NaN</td>\n <td>1</td>\n <td>Single</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " CUSTOMERID GENDER AGE INCOME EXPERIENCE SCORE LOYALTY GROUP \\\n0 10001 0 64 $133,498 5 enrolled \n1 10002 0 42 $94,475 9 notenrolled \n2 10003 0 40 $88,610 9 enrolled \n3 10004 0 38 $84,313 8 enrolled \n4 10005 0 30 $51,498 3 notenrolled \n\n ENROLLMENT DATE HOUSEHOLD SIZE MARITAL STATUS \n0 06-03-2013 4 Single \n1 NaN 6 Married \n2 02-09-2010 5 Married \n3 06-04-2015 1 Single \n4 NaN 1 Single "
},
"output_type": "execute_result"
}
],
"source": "#Import Customer Dataset Here\nbody = client_0b7f25d6f4c743e1bbf81e3ecac0de47.get_object(Bucket='ibmdnadatasciencelearningproject-donotdelete-pr-foj79k0orfi5wk',Key='DNA - Customer Data Set - Student 1 of 3.csv')['Body']\n# add missing __iter__ method, so pandas accepts body as file-like object\nif not hasattr(body, \"__iter__\"): body.__iter__ = types.MethodType( __iter__, body )\n\ncustomer_data = pd.read_csv(body)\ncustomer_data.head()\n\n"
},
{
"source": "### Changing Data Types",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 4,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "customer_data['INCOME']=customer_data['INCOME'].map(lambda x : x.replace('$',''))"
},
{
"execution_count": 5,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "customer_data['INCOME']=customer_data['INCOME'].map(lambda x : int(x.replace(',','')))"
},
{
"source": "### Creating Customer View",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 6,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "trans_products=transactions_data.merge(product_data,how='inner', left_on='PRODUCT NUM', right_on='PRODUCT CODE')"
},
{
"execution_count": 7,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "trans_products['UNIT LIST PRICE']=trans_products['UNIT LIST PRICE'].map(lambda x : float(x.replace('$','')))"
},
{
"execution_count": 8,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "trans_products['Total_Price']=trans_products['QUANTITY PURCHASED'] * trans_products['UNIT LIST PRICE'] * (1- trans_products['DISCOUNT TAKEN'])"
},
{
"execution_count": 9,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "customer_prod_categ=trans_products.groupby(['CUSTOMER NUM','PRODUCT CATEGORY']).agg({'Total_Price':'sum'})"
},
{
"execution_count": 10,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "customer_prod_categ=customer_prod_categ.reset_index()"
},
{
"execution_count": 11,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "customer_pivot=customer_prod_categ.pivot(index='CUSTOMER NUM',columns='PRODUCT CATEGORY',values='Total_Price')"
},
{
"execution_count": 12,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "trans_total_spend=trans_products.groupby('CUSTOMER NUM').agg({'Total_Price':'sum'}).\\\nrename(columns={'Total_Price':'TOTAL SPENT'})"
},
{
"execution_count": 13,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "customer_KPIs=customer_pivot.merge(trans_total_spend,how='inner',left_index=True, right_index=True )"
},
{
"execution_count": 14,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "customer_KPIs=customer_KPIs.fillna(0)\n"
},
{
"execution_count": 15,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "customer_all_view=customer_data.merge(customer_KPIs,how='inner', left_on='CUSTOMERID', right_index=True)"
},
{
"execution_count": 16,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 16,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>CUSTOMERID</th>\n <th>GENDER</th>\n <th>AGE</th>\n <th>INCOME</th>\n <th>EXPERIENCE SCORE</th>\n <th>LOYALTY GROUP</th>\n <th>ENROLLMENT DATE</th>\n <th>HOUSEHOLD SIZE</th>\n <th>MARITAL STATUS</th>\n <th>APPAREL</th>\n <th>ELECTRONICS</th>\n <th>FOOD</th>\n <th>HEALTH &amp; BEAUTY</th>\n <th>TOTAL SPENT</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>10001</td>\n <td>0</td>\n <td>64</td>\n <td>133498</td>\n <td>5</td>\n <td>enrolled</td>\n <td>06-03-2013</td>\n <td>4</td>\n <td>Single</td>\n <td>4022.430</td>\n <td>1601.315</td>\n <td>68.688</td>\n <td>1134.337</td>\n <td>6826.770</td>\n </tr>\n <tr>\n <th>1</th>\n <td>10002</td>\n <td>0</td>\n <td>42</td>\n <td>94475</td>\n <td>9</td>\n <td>notenrolled</td>\n <td>NaN</td>\n <td>6</td>\n <td>Married</td>\n <td>2312.509</td>\n <td>2473.163</td>\n <td>276.779</td>\n <td>0.000</td>\n <td>5062.451</td>\n </tr>\n <tr>\n <th>2</th>\n <td>10003</td>\n <td>0</td>\n <td>40</td>\n <td>88610</td>\n <td>9</td>\n <td>enrolled</td>\n <td>02-09-2010</td>\n <td>5</td>\n <td>Married</td>\n <td>2887.382</td>\n <td>5414.418</td>\n <td>260.640</td>\n <td>0.000</td>\n <td>8562.440</td>\n </tr>\n <tr>\n <th>3</th>\n <td>10004</td>\n <td>0</td>\n <td>38</td>\n <td>84313</td>\n <td>8</td>\n <td>enrolled</td>\n <td>06-04-2015</td>\n <td>1</td>\n <td>Single</td>\n <td>3637.213</td>\n <td>1840.211</td>\n <td>45.270</td>\n <td>0.000</td>\n <td>5522.694</td>\n </tr>\n <tr>\n <th>4</th>\n <td>10005</td>\n <td>0</td>\n <td>30</td>\n <td>51498</td>\n <td>3</td>\n <td>notenrolled</td>\n <td>NaN</td>\n <td>1</td>\n <td>Single</td>\n <td>213.512</td>\n <td>0.000</td>\n <td>0.000</td>\n <td>0.000</td>\n <td>213.512</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " CUSTOMERID GENDER AGE INCOME EXPERIENCE SCORE LOYALTY GROUP \\\n0 10001 0 64 133498 5 enrolled \n1 10002 0 42 94475 9 notenrolled \n2 10003 0 40 88610 9 enrolled \n3 10004 0 38 84313 8 enrolled \n4 10005 0 30 51498 3 notenrolled \n\n ENROLLMENT DATE HOUSEHOLD SIZE MARITAL STATUS APPAREL ELECTRONICS \\\n0 06-03-2013 4 Single 4022.430 1601.315 \n1 NaN 6 Married 2312.509 2473.163 \n2 02-09-2010 5 Married 2887.382 5414.418 \n3 06-04-2015 1 Single 3637.213 1840.211 \n4 NaN 1 Single 213.512 0.000 \n\n FOOD HEALTH & BEAUTY TOTAL SPENT \n0 68.688 1134.337 6826.770 \n1 276.779 0.000 5062.451 \n2 260.640 0.000 8562.440 \n3 45.270 0.000 5522.694 \n4 0.000 0.000 213.512 "
},
"output_type": "execute_result"
}
],
"source": "customer_all_view.head()"
},
{
"source": "# Linear Regression ",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "1. Feed the following variables to the linear regression model to predict the total spending: gender, age, income, experience score and loyalty enrolment, household size, and marital status.",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 17,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "DF_input=customer_all_view[['GENDER','AGE','INCOME','EXPERIENCE SCORE','LOYALTY GROUP','HOUSEHOLD SIZE','MARITAL STATUS']]\n"
},
{
"source": "2. Run the head command to view the top rows of the input dataframe \u201cDF_input\u201d:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 19,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 19,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>GENDER</th>\n <th>AGE</th>\n <th>INCOME</th>\n <th>EXPERIENCE SCORE</th>\n <th>LOYALTY GROUP</th>\n <th>HOUSEHOLD SIZE</th>\n <th>MARITAL STATUS</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>64</td>\n <td>133498</td>\n <td>5</td>\n <td>enrolled</td>\n <td>4</td>\n <td>Single</td>\n </tr>\n <tr>\n <th>1</th>\n <td>0</td>\n <td>42</td>\n <td>94475</td>\n <td>9</td>\n <td>notenrolled</td>\n <td>6</td>\n <td>Married</td>\n </tr>\n <tr>\n <th>2</th>\n <td>0</td>\n <td>40</td>\n <td>88610</td>\n <td>9</td>\n <td>enrolled</td>\n <td>5</td>\n <td>Married</td>\n </tr>\n <tr>\n <th>3</th>\n <td>0</td>\n <td>38</td>\n <td>84313</td>\n <td>8</td>\n <td>enrolled</td>\n <td>1</td>\n <td>Single</td>\n </tr>\n <tr>\n <th>4</th>\n <td>0</td>\n <td>30</td>\n <td>51498</td>\n <td>3</td>\n <td>notenrolled</td>\n <td>1</td>\n <td>Single</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " GENDER AGE INCOME EXPERIENCE SCORE LOYALTY GROUP HOUSEHOLD SIZE \\\n0 0 64 133498 5 enrolled 4 \n1 0 42 94475 9 notenrolled 6 \n2 0 40 88610 9 enrolled 5 \n3 0 38 84313 8 enrolled 1 \n4 0 30 51498 3 notenrolled 1 \n\n MARITAL STATUS \n0 Single \n1 Married \n2 Married \n3 Single \n4 Single "
},
"output_type": "execute_result"
}
],
"source": "DF_input.head()"
},
{
"source": "As you can see, Loyalty group and Marital status are strings, so they must be converted to numeric format because the scikit-learn machine learning libraries accept features that are in numeric format. So, we must encode both variables in a numeric format.",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "3. The Marital status variable has four values: Single, Married, Divorced, and Widow/Widower. Check the marital status by using the following command:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 20,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 20,
"metadata": {},
"data": {
"text/plain": "array(['Single', 'Married', 'Divorced', 'Widow/Widower'], dtype=object)"
},
"output_type": "execute_result"
}
],
"source": "DF_input[\"MARITAL STATUS\"].unique()\n"
},
{
"source": "4. Create four columns that are named Single, Married, Divorced, and Widow/Widower. For each marital status, there is a binary representation.\n The four new columns have the encoding of the Marital Status column, which is called one hot encoding, in which a nominal categorical variable is converted to a numerical form so that it can be processed by machine learning algorithms.\n\n Pandas has a method that is called \u201cget_dummies\u201d that detects categorical variables in a Pandas data frame and automatically performs one hot encoding of it. To get an idea about Pandas one hot encoding, run the following command.",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 21,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 21,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Divorced</th>\n <th>Married</th>\n <th>Single</th>\n <th>Widow/Widower</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>5</th>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>6</th>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>7</th>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>8</th>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>9</th>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " Divorced Married Single Widow/Widower\n0 0 0 1 0\n1 0 1 0 0\n2 0 1 0 0\n3 0 0 1 0\n4 0 0 1 0\n5 0 1 0 0\n6 0 1 0 0\n7 0 0 1 0\n8 0 1 0 0\n9 0 1 0 0"
},
"output_type": "execute_result"
}
],
"source": "pd.get_dummies(DF_input[\"MARITAL STATUS\"]).head(10)\n"
},
{
"source": "The output shows four new columns that are created from the one hot encoding of the \u201cMarital Status\u201d column. This command prints only the encoded output and does not change the original DF_input data frame because this was only a demonstration of one hot encoding. Later in this step, you perform one hot encoding on the input Pandas data frame and replace the original Pandas data frame with the binary-hot encoded frame.",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "5. The Loyalty group has two values: \u201cenrolled\u201d and \u201cnotenrolled\u201d. The encoding of this variable is easy and can be accomplished by using a simple function instead of the Pandas \u201cpd.get_dummies\u201c function. Run this code snippet function in a new cell to binary encode the loyalty variable:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 22,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "def encode_loyalty(value):\n if value == \"enrolled\" :\n return 1 \n else : \n return 0\n"
},
{
"source": "The function carries out a binary encoding. When it finds the value \u201cenrolled\u201d, it returns 1, and if it finds otherwise, that is, the value \u201cnotenrolled\u201d, it returns 0. Apply this function to the \u201cloyalty group\u201d column to take effect by running the following command: ",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 23,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": "/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: \nA value is trying to be set on a copy of a slice from a DataFrame.\nTry using .loc[row_indexer,col_indexer] = value instead\n\nSee the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n if __name__ == '__main__':\n"
}
],
"source": "DF_input['LOYALTY GROUP']=DF_input['LOYALTY GROUP'].apply(encode_loyalty)\n"
},
{
"source": "Run the head command to make sure that the output is as expected:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 25,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 25,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>GENDER</th>\n <th>AGE</th>\n <th>INCOME</th>\n <th>EXPERIENCE SCORE</th>\n <th>LOYALTY GROUP</th>\n <th>HOUSEHOLD SIZE</th>\n <th>MARITAL STATUS</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>64</td>\n <td>133498</td>\n <td>5</td>\n <td>1</td>\n <td>4</td>\n <td>Single</td>\n </tr>\n <tr>\n <th>1</th>\n <td>0</td>\n <td>42</td>\n <td>94475</td>\n <td>9</td>\n <td>0</td>\n <td>6</td>\n <td>Married</td>\n </tr>\n <tr>\n <th>2</th>\n <td>0</td>\n <td>40</td>\n <td>88610</td>\n <td>9</td>\n <td>1</td>\n <td>5</td>\n <td>Married</td>\n </tr>\n <tr>\n <th>3</th>\n <td>0</td>\n <td>38</td>\n <td>84313</td>\n <td>8</td>\n <td>1</td>\n <td>1</td>\n <td>Single</td>\n </tr>\n <tr>\n <th>4</th>\n <td>0</td>\n <td>30</td>\n <td>51498</td>\n <td>3</td>\n <td>0</td>\n <td>1</td>\n <td>Single</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " GENDER AGE INCOME EXPERIENCE SCORE LOYALTY GROUP HOUSEHOLD SIZE \\\n0 0 64 133498 5 1 4 \n1 0 42 94475 9 0 6 \n2 0 40 88610 9 1 5 \n3 0 38 84313 8 1 1 \n4 0 30 51498 3 0 1 \n\n MARITAL STATUS \n0 Single \n1 Married \n2 Married \n3 Single \n4 Single "
},
"output_type": "execute_result"
}
],
"source": "DF_input.head()"
},
{
"source": "6. Now, in a new cell, replace the \u201cDF_input\u201d frame with the one-hot encoded Pandas data frame by running the following command:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 26,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "DF_input=pd.get_dummies(DF_input)"
},
{
"source": "Verify that the operation completed by running the head command:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 27,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 27,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>GENDER</th>\n <th>AGE</th>\n <th>INCOME</th>\n <th>EXPERIENCE SCORE</th>\n <th>LOYALTY GROUP</th>\n <th>HOUSEHOLD SIZE</th>\n <th>MARITAL STATUS_Divorced</th>\n <th>MARITAL STATUS_Married</th>\n <th>MARITAL STATUS_Single</th>\n <th>MARITAL STATUS_Widow/Widower</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>64</td>\n <td>133498</td>\n <td>5</td>\n <td>1</td>\n <td>4</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>0</td>\n <td>42</td>\n <td>94475</td>\n <td>9</td>\n <td>0</td>\n <td>6</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>0</td>\n <td>40</td>\n <td>88610</td>\n <td>9</td>\n <td>1</td>\n <td>5</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>0</td>\n <td>38</td>\n <td>84313</td>\n <td>8</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>0</td>\n <td>30</td>\n <td>51498</td>\n <td>3</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " GENDER AGE INCOME EXPERIENCE SCORE LOYALTY GROUP HOUSEHOLD SIZE \\\n0 0 64 133498 5 1 4 \n1 0 42 94475 9 0 6 \n2 0 40 88610 9 1 5 \n3 0 38 84313 8 1 1 \n4 0 30 51498 3 0 1 \n\n MARITAL STATUS_Divorced MARITAL STATUS_Married MARITAL STATUS_Single \\\n0 0 0 1 \n1 0 1 0 \n2 0 1 0 \n3 0 0 1 \n4 0 0 1 \n\n MARITAL STATUS_Widow/Widower \n0 0 \n1 0 \n2 0 \n3 0 \n4 0 "
},
"output_type": "execute_result"
}
],
"source": "DF_input.head()"
},
{
"source": "You now have the data frame \u201cDF_input\u201d, which has all the input variables of the regression model. Before feeding the data frame to the LinearRegession Scikit-learn library, scale the features to the 0 \u2013 1 range because every input variable has its own scale. For example, age has two digits (scale of tens) and income might have five or six digits (scale of thousands).\n\nThe linear regression model calculates the coefficient for each input variable without scaling. The coefficient for income is much lower than other coefficients (the coefficients are multiplied by their predictors, so a low coefficient is needed for a high-order predictor). You might conclude that the income variable is less relevant to total spending than other variables. To balance the comparison and select the top features that contribute to customer total spending, scale every feature to have a more balanced comparison.",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "## Min- max Scaling",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "1. Import the pre-processing library and save the Pandas columns to re-create the Pandas data frame for the NumPy format by running the following command (the minmax scale command outputs NumPy):",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 28,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "from sklearn import preprocessing\nDF_input_column_names = DF_input.columns.values\nDF_input_np=preprocessing.minmax_scale(DF_input)\nReg_input_scaled=pd.DataFrame(DF_input_np, columns=DF_input_column_names)\nReg_input_scaled=pd.DataFrame(DF_input_np, columns=DF_input_column_names)\n"
},
{
"source": "View the scaled Pandas by running the following command:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 30,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 30,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>GENDER</th>\n <th>AGE</th>\n <th>INCOME</th>\n <th>EXPERIENCE SCORE</th>\n <th>LOYALTY GROUP</th>\n <th>HOUSEHOLD SIZE</th>\n <th>MARITAL STATUS_Divorced</th>\n <th>MARITAL STATUS_Married</th>\n <th>MARITAL STATUS_Single</th>\n <th>MARITAL STATUS_Widow/Widower</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0.0</td>\n <td>0.638889</td>\n <td>0.872818</td>\n <td>0.444444</td>\n <td>1.0</td>\n <td>0.6</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>0.0</td>\n <td>0.333333</td>\n <td>0.572046</td>\n <td>0.888889</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>0.0</td>\n <td>0.305556</td>\n <td>0.526842</td>\n <td>0.888889</td>\n <td>1.0</td>\n <td>0.8</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>0.0</td>\n <td>0.277778</td>\n <td>0.493722</td>\n <td>0.777778</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>0.0</td>\n <td>0.166667</td>\n <td>0.240799</td>\n <td>0.222222</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " GENDER AGE INCOME EXPERIENCE SCORE LOYALTY GROUP \\\n0 0.0 0.638889 0.872818 0.444444 1.0 \n1 0.0 0.333333 0.572046 0.888889 0.0 \n2 0.0 0.305556 0.526842 0.888889 1.0 \n3 0.0 0.277778 0.493722 0.777778 1.0 \n4 0.0 0.166667 0.240799 0.222222 0.0 \n\n HOUSEHOLD SIZE MARITAL STATUS_Divorced MARITAL STATUS_Married \\\n0 0.6 0.0 0.0 \n1 1.0 0.0 1.0 \n2 0.8 0.0 1.0 \n3 0.0 0.0 0.0 \n4 0.0 0.0 0.0 \n\n MARITAL STATUS_Single MARITAL STATUS_Widow/Widower \n0 1.0 0.0 \n1 0.0 0.0 \n2 0.0 0.0 \n3 1.0 0.0 \n4 1.0 0.0 "
},
"output_type": "execute_result"
}
],
"source": "Reg_input_scaled.head()"
},
{
"source": "2. Create a train \u2013 test split of 80% training and 20% testing by running the following command:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 31,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "from sklearn.model_selection import train_test_split\nX_train,X_test,Y_train,Y_test = train_test_split(Reg_input_scaled,customer_all_view['TOTAL SPENT'], test_size=0.2,random_state=42)"
},
{
"source": "In this instruction, we split both the input variables data frame \u201cReg_input_scaled\u201d and the target variable data frame \u201ccustomer_all_view[\u2018TOTAL SPENT\u2019]\u201d.",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "3. Import the LinearRegression Library by running the following command:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 32,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "from sklearn.linear_model import LinearRegression"
},
{
"source": "4. Instantiate the model by running the following command:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 33,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "regr = LinearRegression()"
},
{
"source": "5. Train the model on the training data set by running the following command:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 34,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 34,
"metadata": {},
"data": {
"text/plain": "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"
},
"output_type": "execute_result"
}
],
"source": "regr.fit(X_train, Y_train)"
},
{
"source": "6. Now that training is complete, you can predict the values of total spending for the testing feature set by running the following command:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 35,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "Y_pred = regr.predict(X_test)"
},
{
"source": "7. Obtain the regression coefficients of the linear model by running the \u201cregr.coef_\u201d method:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 36,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "Coefficients: \n [ 35.58494877 -7076.18001534 12403.0952701 505.84578936\n 1566.48092344 1040.19923196 377.33573376 -82.02665324\n -34.57620088 -260.73287964]\n"
}
],
"source": "print('Coefficients: \\n', regr.coef_)"
},
{
"source": "8. The coefficients are in the same order that the variables were fed to the model. To get the order that the variables were fed to the model, run the following command:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 37,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 37,
"metadata": {},
"data": {
"text/plain": "array(['GENDER', 'AGE', 'INCOME', 'EXPERIENCE SCORE', 'LOYALTY GROUP',\n 'HOUSEHOLD SIZE', 'MARITAL STATUS_Divorced',\n 'MARITAL STATUS_Married', 'MARITAL STATUS_Single',\n 'MARITAL STATUS_Widow/Widower'], dtype=object)"
},
"output_type": "execute_result"
}
],
"source": "Reg_input_scaled.columns.values"
},
{
"source": "9. Create a table. In the table, write in the coefficients and the type of correlation between the inputs and the targets.\n\n The importance of a predictor can be measured by the magnitude of its regression coefficient. High-magnitude coefficients mean high importance and lower magnitudes means less importance regarding target variables. ",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "10. Arrange the predictors in the order of absolute magnitude of their regression coefficients (ignore sign because sign just means the direction of the correlation).\n\n High importance to low: The order is Income, Age, Loyalty Group, Household size, Experience score, Marital Status, and Gender.\n",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "\n11. Obtain the constant or intercept term of the equation by running the following command:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 38,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "intercept: \n -38.1694752187\n"
}
],
"source": "print('intercept: \\n', regr.intercept_)"
},
{
"source": "The linear regression equation to predict total spending is as follows:\n\nPredicted Total Spend = 35.58 * Gender \u2013 7076 * AGE + 12403 * Income + 505 * Experience score + 155 * Loyalty group + 1040 * House hold size + 377.3 * (Marital Status \u2013 Divorced) \u2013 82.02 * (Marital Status \u2013 Married) \u2013 35.57 * (Marital Status \u2013 Single) \u2013 260 * (Marital Status \u2013 Widow) \u2013 38.17\n",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "12. Evaluate the performance of the linear model by calculating the mean squared error. Import its library and then apply the function to Predicted and actual values of customer spending by running the following command:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 39,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "Mean squared error: 3164098.64\n"
}
],
"source": "from sklearn.metrics import mean_squared_error\nprint(\"Mean squared error: %.2f\" % mean_squared_error(Y_test, Y_pred))\n"
},
{
"source": "This is the mean squared of the error. To get the mean error, take the square root of this number. So, the mean error is 1779. Therefore, our linear model predicts total spending with an average error of $1779.",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "### The End",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "Notebook edited by Harison Mwangi\n\nFind me @ https://www.linkedin.com/in/harison-m-418641102",
"cell_type": "markdown",
"metadata": {}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.5",
"name": "python3",
"language": "python"
},
"language_info": {
"mimetype": "text/x-python",
"nbconvert_exporter": "python",
"version": "3.5.5",
"name": "python",
"file_extension": ".py",
"pygments_lexer": "ipython3",
"codemirror_mode": {
"version": 3,
"name": "ipython"
}
}
},
"nbformat": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment