harisonmg/DNA-Regression.ipynb

## DNA-Regression.ipynb
{
    "nbformat_minor": 1,
    "cells": [
        {
            "source": "# Regression D-NA Course Notebook (Supervised Learning)",
            "cell_type": "markdown",
            "metadata": {
                "collapsed": true
            }
        },
        {
            "source": "### Importing Data Files",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 1,
            "cell_type": "code",
            "metadata": {},
            "outputs": [
                {
                    "execution_count": 1,
                    "metadata": {},
                    "data": {
                        "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>PRODUCT CODE</th>\n      <th>PRODUCT CATEGORY</th>\n      <th>UNIT LIST PRICE</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>30001</td>\n      <td>HEALTH &amp; BEAUTY</td>\n      <td>$7.45</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>30002</td>\n      <td>HEALTH &amp; BEAUTY</td>\n      <td>$5.35</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>30003</td>\n      <td>HEALTH &amp; BEAUTY</td>\n      <td>$5.49</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>30004</td>\n      <td>HEALTH &amp; BEAUTY</td>\n      <td>$6.46</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>30005</td>\n      <td>HEALTH &amp; BEAUTY</td>\n      <td>$7.33</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
                        "text/plain": "   PRODUCT CODE PRODUCT CATEGORY UNIT LIST PRICE\n0         30001  HEALTH & BEAUTY          $7.45 \n1         30002  HEALTH & BEAUTY          $5.35 \n2         30003  HEALTH & BEAUTY          $5.49 \n3         30004  HEALTH & BEAUTY          $6.46 \n4         30005  HEALTH & BEAUTY          $7.33 "
                    },
                    "output_type": "execute_result"
                }
            ],
            "source": "# The code was removed by Watson Studio for sharing."
        },
        {
            "execution_count": 2,
            "cell_type": "code",
            "metadata": {
                "scrolled": true
            },
            "outputs": [
                {
                    "execution_count": 2,
                    "metadata": {},
                    "data": {
                        "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>CUSTOMER NUM</th>\n      <th>PRODUCT NUM</th>\n      <th>QUANTITY PURCHASED</th>\n      <th>DISCOUNT TAKEN</th>\n      <th>TRANSACTION DATE</th>\n      <th>STOCKOUT</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>10114</td>\n      <td>30011</td>\n      <td>4</td>\n      <td>0.0</td>\n      <td>1/2/2015</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>10217</td>\n      <td>30016</td>\n      <td>3</td>\n      <td>0.0</td>\n      <td>1/2/2015</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>10224</td>\n      <td>30013</td>\n      <td>4</td>\n      <td>0.0</td>\n      <td>1/2/2015</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>10103</td>\n      <td>30012</td>\n      <td>3</td>\n      <td>0.2</td>\n      <td>1/2/2015</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>10037</td>\n      <td>30010</td>\n      <td>8</td>\n      <td>0.0</td>\n      <td>1/2/2015</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
                        "text/plain": "   CUSTOMER NUM  PRODUCT NUM  QUANTITY PURCHASED  DISCOUNT TAKEN  \\\n0         10114        30011                   4             0.0   \n1         10217        30016                   3             0.0   \n2         10224        30013                   4             0.0   \n3         10103        30012                   3             0.2   \n4         10037        30010                   8             0.0   \n\n  TRANSACTION DATE  STOCKOUT  \n0         1/2/2015         0  \n1         1/2/2015         0  \n2         1/2/2015         0  \n3         1/2/2015         0  \n4         1/2/2015         0  "
                    },
                    "output_type": "execute_result"
                }
            ],
            "source": "#Import Transaction DataSet Here\nbody = client_0b7f25d6f4c743e1bbf81e3ecac0de47.get_object(Bucket='ibmdnadatasciencelearningproject-donotdelete-pr-foj79k0orfi5wk',Key='DNA- Transaction Data Set - Student 3 of 3.csv')['Body']\n# add missing __iter__ method, so pandas accepts body as file-like object\nif not hasattr(body, \"__iter__\"): body.__iter__ = types.MethodType( __iter__, body )\n\ntransactions_data = pd.read_csv(body, sep='|')\ntransactions_data.head()\n\n"
        },
        {
            "execution_count": 3,
            "cell_type": "code",
            "metadata": {},
            "outputs": [
                {
                    "execution_count": 3,
                    "metadata": {},
                    "data": {
                        "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>CUSTOMERID</th>\n      <th>GENDER</th>\n      <th>AGE</th>\n      <th>INCOME</th>\n      <th>EXPERIENCE SCORE</th>\n      <th>LOYALTY GROUP</th>\n      <th>ENROLLMENT DATE</th>\n      <th>HOUSEHOLD SIZE</th>\n      <th>MARITAL STATUS</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>10001</td>\n      <td>0</td>\n      <td>64</td>\n      <td>$133,498</td>\n      <td>5</td>\n      <td>enrolled</td>\n      <td>06-03-2013</td>\n      <td>4</td>\n      <td>Single</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>10002</td>\n      <td>0</td>\n      <td>42</td>\n      <td>$94,475</td>\n      <td>9</td>\n      <td>notenrolled</td>\n      <td>NaN</td>\n      <td>6</td>\n      <td>Married</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>10003</td>\n      <td>0</td>\n      <td>40</td>\n      <td>$88,610</td>\n      <td>9</td>\n      <td>enrolled</td>\n      <td>02-09-2010</td>\n      <td>5</td>\n      <td>Married</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>10004</td>\n      <td>0</td>\n      <td>38</td>\n      <td>$84,313</td>\n      <td>8</td>\n      <td>enrolled</td>\n      <td>06-04-2015</td>\n      <td>1</td>\n      <td>Single</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>10005</td>\n      <td>0</td>\n      <td>30</td>\n      <td>$51,498</td>\n      <td>3</td>\n      <td>notenrolled</td>\n      <td>NaN</td>\n      <td>1</td>\n      <td>Single</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
                        "text/plain": "   CUSTOMERID  GENDER  AGE    INCOME  EXPERIENCE SCORE LOYALTY GROUP  \\\n0       10001       0   64  $133,498                 5      enrolled   \n1       10002       0   42   $94,475                 9   notenrolled   \n2       10003       0   40   $88,610                 9      enrolled   \n3       10004       0   38   $84,313                 8      enrolled   \n4       10005       0   30   $51,498                 3   notenrolled   \n\n  ENROLLMENT DATE  HOUSEHOLD SIZE MARITAL STATUS  \n0      06-03-2013               4         Single  \n1             NaN               6        Married  \n2      02-09-2010               5        Married  \n3      06-04-2015               1         Single  \n4             NaN               1         Single  "
                    },
                    "output_type": "execute_result"
                }
            ],
            "source": "#Import Customer Dataset Here\nbody = client_0b7f25d6f4c743e1bbf81e3ecac0de47.get_object(Bucket='ibmdnadatasciencelearningproject-donotdelete-pr-foj79k0orfi5wk',Key='DNA - Customer Data Set - Student 1 of 3.csv')['Body']\n# add missing __iter__ method, so pandas accepts body as file-like object\nif not hasattr(body, \"__iter__\"): body.__iter__ = types.MethodType( __iter__, body )\n\ncustomer_data = pd.read_csv(body)\ncustomer_data.head()\n\n"
        },
        {
            "source": "### Changing Data Types",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 4,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "customer_data['INCOME']=customer_data['INCOME'].map(lambda x : x.replace('$',''))"
        },
        {
            "execution_count": 5,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "customer_data['INCOME']=customer_data['INCOME'].map(lambda x : int(x.replace(',','')))"
        },
        {
            "source": "### Creating Customer View",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 6,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "trans_products=transactions_data.merge(product_data,how='inner', left_on='PRODUCT NUM', right_on='PRODUCT CODE')"
        },
        {
            "execution_count": 7,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "trans_products['UNIT LIST PRICE']=trans_products['UNIT LIST PRICE'].map(lambda x : float(x.replace('$','')))"
        },
        {
            "execution_count": 8,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "trans_products['Total_Price']=trans_products['QUANTITY PURCHASED'] * trans_products['UNIT LIST PRICE'] * (1- trans_products['DISCOUNT TAKEN'])"
        },
        {
            "execution_count": 9,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "customer_prod_categ=trans_products.groupby(['CUSTOMER NUM','PRODUCT CATEGORY']).agg({'Total_Price':'sum'})"
        },
        {
            "execution_count": 10,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "customer_prod_categ=customer_prod_categ.reset_index()"
        },
        {
            "execution_count": 11,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "customer_pivot=customer_prod_categ.pivot(index='CUSTOMER NUM',columns='PRODUCT CATEGORY',values='Total_Price')"
        },
        {
            "execution_count": 12,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "trans_total_spend=trans_products.groupby('CUSTOMER NUM').agg({'Total_Price':'sum'}).\\\nrename(columns={'Total_Price':'TOTAL SPENT'})"
        },
        {
            "execution_count": 13,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "customer_KPIs=customer_pivot.merge(trans_total_spend,how='inner',left_index=True, right_index=True )"
        },
        {
            "execution_count": 14,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "customer_KPIs=customer_KPIs.fillna(0)\n"
        },
        {
            "execution_count": 15,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "customer_all_view=customer_data.merge(customer_KPIs,how='inner', left_on='CUSTOMERID', right_index=True)"
        },
        {
            "execution_count": 16,
            "cell_type": "code",
            "metadata": {},
            "outputs": [
                {
                    "execution_count": 16,
                    "metadata": {},
                    "data": {
                        "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>CUSTOMERID</th>\n      <th>GENDER</th>\n      <th>AGE</th>\n      <th>INCOME</th>\n      <th>EXPERIENCE SCORE</th>\n      <th>LOYALTY GROUP</th>\n      <th>ENROLLMENT DATE</th>\n      <th>HOUSEHOLD SIZE</th>\n      <th>MARITAL STATUS</th>\n      <th>APPAREL</th>\n      <th>ELECTRONICS</th>\n      <th>FOOD</th>\n      <th>HEALTH &amp; BEAUTY</th>\n      <th>TOTAL SPENT</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>10001</td>\n      <td>0</td>\n      <td>64</td>\n      <td>133498</td>\n      <td>5</td>\n      <td>enrolled</td>\n      <td>06-03-2013</td>\n      <td>4</td>\n      <td>Single</td>\n      <td>4022.430</td>\n      <td>1601.315</td>\n      <td>68.688</td>\n      <td>1134.337</td>\n      <td>6826.770</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>10002</td>\n      <td>0</td>\n      <td>42</td>\n      <td>94475</td>\n      <td>9</td>\n      <td>notenrolled</td>\n      <td>NaN</td>\n      <td>6</td>\n      <td>Married</td>\n      <td>2312.509</td>\n      <td>2473.163</td>\n      <td>276.779</td>\n      <td>0.000</td>\n      <td>5062.451</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>10003</td>\n      <td>0</td>\n      <td>40</td>\n      <td>88610</td>\n      <td>9</td>\n      <td>enrolled</td>\n      <td>02-09-2010</td>\n      <td>5</td>\n      <td>Married</td>\n      <td>2887.382</td>\n      <td>5414.418</td>\n      <td>260.640</td>\n      <td>0.000</td>\n      <td>8562.440</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>10004</td>\n      <td>0</td>\n      <td>38</td>\n      <td>84313</td>\n      <td>8</td>\n      <td>enrolled</td>\n      <td>06-04-2015</td>\n      <td>1</td>\n      <td>Single</td>\n      <td>3637.213</td>\n      <td>1840.211</td>\n      <td>45.270</td>\n      <td>0.000</td>\n      <td>5522.694</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>10005</td>\n      <td>0</td>\n      <td>30</td>\n      <td>51498</td>\n      <td>3</td>\n      <td>notenrolled</td>\n      <td>NaN</td>\n      <td>1</td>\n      <td>Single</td>\n      <td>213.512</td>\n      <td>0.000</td>\n      <td>0.000</td>\n      <td>0.000</td>\n      <td>213.512</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
                        "text/plain": "   CUSTOMERID  GENDER  AGE  INCOME  EXPERIENCE SCORE LOYALTY GROUP  \\\n0       10001       0   64  133498                 5      enrolled   \n1       10002       0   42   94475                 9   notenrolled   \n2       10003       0   40   88610                 9      enrolled   \n3       10004       0   38   84313                 8      enrolled   \n4       10005       0   30   51498                 3   notenrolled   \n\n  ENROLLMENT DATE  HOUSEHOLD SIZE MARITAL STATUS   APPAREL  ELECTRONICS  \\\n0      06-03-2013               4         Single  4022.430     1601.315   \n1             NaN               6        Married  2312.509     2473.163   \n2      02-09-2010               5        Married  2887.382     5414.418   \n3      06-04-2015               1         Single  3637.213     1840.211   \n4             NaN               1         Single   213.512        0.000   \n\n      FOOD  HEALTH & BEAUTY  TOTAL SPENT  \n0   68.688         1134.337     6826.770  \n1  276.779            0.000     5062.451  \n2  260.640            0.000     8562.440  \n3   45.270            0.000     5522.694  \n4    0.000            0.000      213.512  "
                    },
                    "output_type": "execute_result"
                }
            ],
            "source": "customer_all_view.head()"
        },
        {
            "source": "# Linear Regression ",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "source": "1. Feed the following variables to the linear regression model to predict the total spending: gender, age, income, experience score and loyalty enrolment, household size, and marital status.",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 17,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "DF_input=customer_all_view[['GENDER','AGE','INCOME','EXPERIENCE SCORE','LOYALTY GROUP','HOUSEHOLD SIZE','MARITAL STATUS']]\n"
        },
        {
            "source": "2. Run the head command to view the top rows of the input dataframe \u201cDF_input\u201d:",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 19,
            "cell_type": "code",
            "metadata": {},
            "outputs": [
                {
                    "execution_count": 19,
                    "metadata": {},
                    "data": {
                        "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>GENDER</th>\n      <th>AGE</th>\n      <th>INCOME</th>\n      <th>EXPERIENCE SCORE</th>\n      <th>LOYALTY GROUP</th>\n      <th>HOUSEHOLD SIZE</th>\n      <th>MARITAL STATUS</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>0</td>\n      <td>64</td>\n      <td>133498</td>\n      <td>5</td>\n      <td>enrolled</td>\n      <td>4</td>\n      <td>Single</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>0</td>\n      <td>42</td>\n      <td>94475</td>\n      <td>9</td>\n      <td>notenrolled</td>\n      <td>6</td>\n      <td>Married</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>0</td>\n      <td>40</td>\n      <td>88610</td>\n      <td>9</td>\n      <td>enrolled</td>\n      <td>5</td>\n      <td>Married</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>0</td>\n      <td>38</td>\n      <td>84313</td>\n      <td>8</td>\n      <td>enrolled</td>\n      <td>1</td>\n      <td>Single</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>0</td>\n      <td>30</td>\n      <td>51498</td>\n      <td>3</td>\n      <td>notenrolled</td>\n      <td>1</td>\n      <td>Single</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
                        "text/plain": "   GENDER  AGE  INCOME  EXPERIENCE SCORE LOYALTY GROUP  HOUSEHOLD SIZE  \\\n0       0   64  133498                 5      enrolled               4   \n1       0   42   94475                 9   notenrolled               6   \n2       0   40   88610                 9      enrolled               5   \n3       0   38   84313                 8      enrolled               1   \n4       0   30   51498                 3   notenrolled               1   \n\n  MARITAL STATUS  \n0         Single  \n1        Married  \n2        Married  \n3         Single  \n4         Single  "
                    },
                    "output_type": "execute_result"
                }
            ],
            "source": "DF_input.head()"
        },
        {
            "source": "As you can see, Loyalty group and Marital status are strings, so they must be converted to numeric format because the scikit-learn machine learning libraries accept features that are in numeric format. So, we must encode both variables in a numeric format.",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "source": "3. The Marital status variable has four values: Single, Married, Divorced, and Widow/Widower. Check the marital status by using the following command:",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 20,
            "cell_type": "code",
            "metadata": {},
            "outputs": [
                {
                    "execution_count": 20,
                    "metadata": {},
                    "data": {
                        "text/plain": "array(['Single', 'Married', 'Divorced', 'Widow/Widower'], dtype=object)"
                    },
                    "output_type": "execute_result"
                }
            ],
            "source": "DF_input[\"MARITAL STATUS\"].unique()\n"
        },
        {
            "source": "4. Create four columns that are named Single, Married, Divorced, and Widow/Widower. For each marital status, there is a binary representation.\n    The four new columns have the encoding of the Marital Status column, which is called one hot encoding, in which a nominal categorical variable is converted to a numerical form so that it can be processed by machine learning algorithms.\n\n    Pandas has a method that is called \u201cget_dummies\u201d that detects categorical variables in a Pandas data frame and automatically performs one hot encoding of it. To get an idea about Pandas one hot encoding, run the following command.",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 21,
            "cell_type": "code",
            "metadata": {},
            "outputs": [
                {
                    "execution_count": 21,
                    "metadata": {},
                    "data": {
                        "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Divorced</th>\n      <th>Married</th>\n      <th>Single</th>\n      <th>Widow/Widower</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>0</td>\n      <td>0</td>\n      <td>1</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>0</td>\n      <td>1</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>0</td>\n      <td>1</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>0</td>\n      <td>0</td>\n      <td>1</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>0</td>\n      <td>0</td>\n      <td>1</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>0</td>\n      <td>1</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>0</td>\n      <td>1</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>0</td>\n      <td>0</td>\n      <td>1</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>0</td>\n      <td>1</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>9</th>\n      <td>0</td>\n      <td>1</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
                        "text/plain": "   Divorced  Married  Single  Widow/Widower\n0         0        0       1              0\n1         0        1       0              0\n2         0        1       0              0\n3         0        0       1              0\n4         0        0       1              0\n5         0        1       0              0\n6         0        1       0              0\n7         0        0       1              0\n8         0        1       0              0\n9         0        1       0              0"
                    },
                    "output_type": "execute_result"
                }
            ],
            "source": "pd.get_dummies(DF_input[\"MARITAL STATUS\"]).head(10)\n"
        },
        {
            "source": "The output shows four new columns that are created from the one hot encoding of the \u201cMarital Status\u201d column. This command prints only the encoded output and does not change the original DF_input data frame because this was only a demonstration of one hot encoding. Later in this step, you perform one hot encoding on the input Pandas data frame and replace the original Pandas data frame with the binary-hot encoded frame.",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "source": "5. The Loyalty group has two values: \u201cenrolled\u201d and \u201cnotenrolled\u201d. The encoding of this variable is easy and can be accomplished by using a simple function instead of the Pandas \u201cpd.get_dummies\u201c function. Run this code snippet function in a new cell to binary encode the loyalty variable:",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 22,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "def encode_loyalty(value):\n    if value == \"enrolled\" :\n        return 1 \n    else : \n        return 0\n"
        },
        {
            "source": "The function carries out a binary encoding. When it finds the value \u201cenrolled\u201d, it returns 1, and if it finds otherwise, that is, the value \u201cnotenrolled\u201d, it returns 0. Apply this function to the \u201cloyalty group\u201d column to take effect by running the following command: ",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 23,
            "cell_type": "code",
            "metadata": {},
            "outputs": [
                {
                    "output_type": "stream",
                    "name": "stderr",
                    "text": "/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: \nA value is trying to be set on a copy of a slice from a DataFrame.\nTry using .loc[row_indexer,col_indexer] = value instead\n\nSee the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n  if __name__ == '__main__':\n"
                }
            ],
            "source": "DF_input['LOYALTY GROUP']=DF_input['LOYALTY GROUP'].apply(encode_loyalty)\n"
        },
        {
            "source": "Run the head command to make sure that the output is as expected:",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 25,
            "cell_type": "code",
            "metadata": {},
            "outputs": [
                {
                    "execution_count": 25,
                    "metadata": {},
                    "data": {
                        "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>GENDER</th>\n      <th>AGE</th>\n      <th>INCOME</th>\n      <th>EXPERIENCE SCORE</th>\n      <th>LOYALTY GROUP</th>\n      <th>HOUSEHOLD SIZE</th>\n      <th>MARITAL STATUS</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>0</td>\n      <td>64</td>\n      <td>133498</td>\n      <td>5</td>\n      <td>1</td>\n      <td>4</td>\n      <td>Single</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>0</td>\n      <td>42</td>\n      <td>94475</td>\n      <td>9</td>\n      <td>0</td>\n      <td>6</td>\n      <td>Married</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>0</td>\n      <td>40</td>\n      <td>88610</td>\n      <td>9</td>\n      <td>1</td>\n      <td>5</td>\n      <td>Married</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>0</td>\n      <td>38</td>\n      <td>84313</td>\n      <td>8</td>\n      <td>1</td>\n      <td>1</td>\n      <td>Single</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>0</td>\n      <td>30</td>\n      <td>51498</td>\n      <td>3</td>\n      <td>0</td>\n      <td>1</td>\n      <td>Single</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
                        "text/plain": "   GENDER  AGE  INCOME  EXPERIENCE SCORE  LOYALTY GROUP  HOUSEHOLD SIZE  \\\n0       0   64  133498                 5              1               4   \n1       0   42   94475                 9              0               6   \n2       0   40   88610                 9              1               5   \n3       0   38   84313                 8              1               1   \n4       0   30   51498                 3              0               1   \n\n  MARITAL STATUS  \n0         Single  \n1        Married  \n2        Married  \n3         Single  \n4         Single  "
                    },
                    "output_type": "execute_result"
                }
            ],
            "source": "DF_input.head()"
        },
        {
            "source": "6. Now, in a new cell, replace the \u201cDF_input\u201d frame with the one-hot encoded Pandas data frame by running the following command:",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 26,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "DF_input=pd.get_dummies(DF_input)"
        },
        {
            "source": "Verify that the operation completed by running the head command:",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 27,
            "cell_type": "code",
            "metadata": {},
            "outputs": [
                {
                    "execution_count": 27,
                    "metadata": {},
                    "data": {
                        "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>GENDER</th>\n      <th>AGE</th>\n      <th>INCOME</th>\n      <th>EXPERIENCE SCORE</th>\n      <th>LOYALTY GROUP</th>\n      <th>HOUSEHOLD SIZE</th>\n      <th>MARITAL STATUS_Divorced</th>\n      <th>MARITAL STATUS_Married</th>\n      <th>MARITAL STATUS_Single</th>\n      <th>MARITAL STATUS_Widow/Widower</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>0</td>\n      <td>64</td>\n      <td>133498</td>\n      <td>5</td>\n      <td>1</td>\n      <td>4</td>\n      <td>0</td>\n      <td>0</td>\n      <td>1</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>0</td>\n      <td>42</td>\n      <td>94475</td>\n      <td>9</td>\n      <td>0</td>\n      <td>6</td>\n      <td>0</td>\n      <td>1</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>0</td>\n      <td>40</td>\n      <td>88610</td>\n      <td>9</td>\n      <td>1</td>\n      <td>5</td>\n      <td>0</td>\n      <td>1</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>0</td>\n      <td>38</td>\n      <td>84313</td>\n      <td>8</td>\n      <td>1</td>\n      <td>1</td>\n      <td>0</td>\n      <td>0</td>\n      <td>1</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>0</td>\n      <td>30</td>\n      <td>51498</td>\n      <td>3</td>\n      <td>0</td>\n      <td>1</td>\n      <td>0</td>\n      <td>0</td>\n      <td>1</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
                        "text/plain": "   GENDER  AGE  INCOME  EXPERIENCE SCORE  LOYALTY GROUP  HOUSEHOLD SIZE  \\\n0       0   64  133498                 5              1               4   \n1       0   42   94475                 9              0               6   \n2       0   40   88610                 9              1               5   \n3       0   38   84313                 8              1               1   \n4       0   30   51498                 3              0               1   \n\n   MARITAL STATUS_Divorced  MARITAL STATUS_Married  MARITAL STATUS_Single  \\\n0                        0                       0                      1   \n1                        0                       1                      0   \n2                        0                       1                      0   \n3                        0                       0                      1   \n4                        0                       0                      1   \n\n   MARITAL STATUS_Widow/Widower  \n0                             0  \n1                             0  \n2                             0  \n3                             0  \n4                             0  "
                    },
                    "output_type": "execute_result"
                }
            ],
            "source": "DF_input.head()"
        },
        {
            "source": "You now have the data frame \u201cDF_input\u201d, which has all the input variables of the regression model. Before feeding the data frame to the LinearRegession Scikit-learn library, scale the features to the 0 \u2013 1 range because every input variable has its own scale. For example, age has two digits (scale of tens) and income might have five or six digits (scale of thousands).\n\nThe linear regression model calculates the coefficient for each input variable without scaling. The coefficient for income is much lower than other coefficients (the coefficients are multiplied by their predictors, so a low coefficient is needed for a high-order predictor). You might conclude that the income variable is less relevant to total spending than other variables. To balance the comparison and select the top features that contribute to customer total spending, scale every feature to have a more balanced comparison.",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "source": "## Min- max Scaling",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "source": "1. Import the pre-processing library and save the Pandas columns to re-create the Pandas data frame for the NumPy format by running the following command (the minmax scale command outputs NumPy):",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 28,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "from sklearn import preprocessing\nDF_input_column_names = DF_input.columns.values\nDF_input_np=preprocessing.minmax_scale(DF_input)\nReg_input_scaled=pd.DataFrame(DF_input_np, columns=DF_input_column_names)\nReg_input_scaled=pd.DataFrame(DF_input_np, columns=DF_input_column_names)\n"
        },
        {
            "source": "View the scaled Pandas by running the following command:",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 30,
            "cell_type": "code",
            "metadata": {},
            "outputs": [
                {
                    "execution_count": 30,
                    "metadata": {},
                    "data": {
                        "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>GENDER</th>\n      <th>AGE</th>\n      <th>INCOME</th>\n      <th>EXPERIENCE SCORE</th>\n      <th>LOYALTY GROUP</th>\n      <th>HOUSEHOLD SIZE</th>\n      <th>MARITAL STATUS_Divorced</th>\n      <th>MARITAL STATUS_Married</th>\n      <th>MARITAL STATUS_Single</th>\n      <th>MARITAL STATUS_Widow/Widower</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>0.0</td>\n      <td>0.638889</td>\n      <td>0.872818</td>\n      <td>0.444444</td>\n      <td>1.0</td>\n      <td>0.6</td>\n      <td>0.0</td>\n      <td>0.0</td>\n      <td>1.0</td>\n      <td>0.0</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>0.0</td>\n      <td>0.333333</td>\n      <td>0.572046</td>\n      <td>0.888889</td>\n      <td>0.0</td>\n      <td>1.0</td>\n      <td>0.0</td>\n      <td>1.0</td>\n      <td>0.0</td>\n      <td>0.0</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>0.0</td>\n      <td>0.305556</td>\n      <td>0.526842</td>\n      <td>0.888889</td>\n      <td>1.0</td>\n      <td>0.8</td>\n      <td>0.0</td>\n      <td>1.0</td>\n      <td>0.0</td>\n      <td>0.0</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>0.0</td>\n      <td>0.277778</td>\n      <td>0.493722</td>\n      <td>0.777778</td>\n      <td>1.0</td>\n      <td>0.0</td>\n      <td>0.0</td>\n      <td>0.0</td>\n      <td>1.0</td>\n      <td>0.0</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>0.0</td>\n      <td>0.166667</td>\n      <td>0.240799</td>\n      <td>0.222222</td>\n      <td>0.0</td>\n      <td>0.0</td>\n      <td>0.0</td>\n      <td>0.0</td>\n      <td>1.0</td>\n      <td>0.0</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
                        "text/plain": "   GENDER       AGE    INCOME  EXPERIENCE SCORE  LOYALTY GROUP  \\\n0     0.0  0.638889  0.872818          0.444444            1.0   \n1     0.0  0.333333  0.572046          0.888889            0.0   \n2     0.0  0.305556  0.526842          0.888889            1.0   \n3     0.0  0.277778  0.493722          0.777778            1.0   \n4     0.0  0.166667  0.240799          0.222222            0.0   \n\n   HOUSEHOLD SIZE  MARITAL STATUS_Divorced  MARITAL STATUS_Married  \\\n0             0.6                      0.0                     0.0   \n1             1.0                      0.0                     1.0   \n2             0.8                      0.0                     1.0   \n3             0.0                      0.0                     0.0   \n4             0.0                      0.0                     0.0   \n\n   MARITAL STATUS_Single  MARITAL STATUS_Widow/Widower  \n0                    1.0                           0.0  \n1                    0.0                           0.0  \n2                    0.0                           0.0  \n3                    1.0                           0.0  \n4                    1.0                           0.0  "
                    },
                    "output_type": "execute_result"
                }
            ],
            "source": "Reg_input_scaled.head()"
        },
        {
            "source": "2. Create a train \u2013 test split of 80% training and 20% testing by running the following command:",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 31,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "from sklearn.model_selection import train_test_split\nX_train,X_test,Y_train,Y_test = train_test_split(Reg_input_scaled,customer_all_view['TOTAL SPENT'], test_size=0.2,random_state=42)"
        },
        {
            "source": "In this instruction, we split both the input variables data frame \u201cReg_input_scaled\u201d and the target variable data frame \u201ccustomer_all_view[\u2018TOTAL SPENT\u2019]\u201d.",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "source": "3. Import the LinearRegression Library by running the following command:",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 32,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "from sklearn.linear_model import LinearRegression"
        },
        {
            "source": "4. Instantiate the model by running the following command:",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 33,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "regr = LinearRegression()"
        },
        {
            "source": "5. Train the model on the training data set by running the following command:",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 34,
            "cell_type": "code",
            "metadata": {},
            "outputs": [
                {
                    "execution_count": 34,
                    "metadata": {},
                    "data": {
                        "text/plain": "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"
                    },
                    "output_type": "execute_result"
                }
            ],
            "source": "regr.fit(X_train, Y_train)"
        },
        {
            "source": "6. Now that training is complete, you can predict the values of total spending for the testing feature set by running the following command:",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 35,
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": "Y_pred = regr.predict(X_test)"
        },
        {
            "source": "7. Obtain the regression coefficients of the linear model by running the \u201cregr.coef_\u201d method:",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 36,
            "cell_type": "code",
            "metadata": {},
            "outputs": [
                {
                    "output_type": "stream",
                    "name": "stdout",
                    "text": "Coefficients: \n [    35.58494877  -7076.18001534  12403.0952701     505.84578936\n   1566.48092344   1040.19923196    377.33573376    -82.02665324\n    -34.57620088   -260.73287964]\n"
                }
            ],
            "source": "print('Coefficients: \\n', regr.coef_)"
        },
        {
            "source": "8. The coefficients are in the same order that the variables were fed to the model. To get the order that the variables were fed to the model, run the following command:",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 37,
            "cell_type": "code",
            "metadata": {},
            "outputs": [
                {
                    "execution_count": 37,
                    "metadata": {},
                    "data": {
                        "text/plain": "array(['GENDER', 'AGE', 'INCOME', 'EXPERIENCE SCORE', 'LOYALTY GROUP',\n       'HOUSEHOLD SIZE', 'MARITAL STATUS_Divorced',\n       'MARITAL STATUS_Married', 'MARITAL STATUS_Single',\n       'MARITAL STATUS_Widow/Widower'], dtype=object)"
                    },
                    "output_type": "execute_result"
                }
            ],
            "source": "Reg_input_scaled.columns.values"
        },
        {
            "source": "9. Create a table. In the table, write in the coefficients and the type of correlation between the inputs and the targets.\n\n    The importance of a predictor can be measured by the magnitude of its regression coefficient. High-magnitude coefficients mean high importance and lower magnitudes means less importance regarding target variables. ",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "source": "10. Arrange the predictors in the order of absolute magnitude of their regression coefficients (ignore sign because sign just means the direction of the correlation).\n\n    High importance to low: The order is Income, Age, Loyalty Group, Household size, Experience score, Marital Status, and Gender.\n",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "source": "\n11. Obtain the constant or intercept term of the equation by running the following command:",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 38,
            "cell_type": "code",
            "metadata": {},
            "outputs": [
                {
                    "output_type": "stream",
                    "name": "stdout",
                    "text": "intercept: \n -38.1694752187\n"
                }
            ],
            "source": "print('intercept: \\n', regr.intercept_)"
        },
        {
            "source": "The linear regression equation to predict total spending is as follows:\n\nPredicted Total Spend = 35.58 * Gender \u2013 7076 * AGE + 12403 * Income + 505 * Experience score + 155 * Loyalty group + 1040 * House hold size + 377.3 * (Marital Status \u2013 Divorced) \u2013 82.02 * (Marital Status \u2013 Married) \u2013 35.57 * (Marital Status \u2013 Single) \u2013 260 * (Marital Status \u2013 Widow) \u2013 38.17\n",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "source": "12. Evaluate the performance of the linear model by calculating the mean squared error. Import its library and then apply the function to Predicted and actual values of customer spending by running the following command:",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "execution_count": 39,
            "cell_type": "code",
            "metadata": {},
            "outputs": [
                {
                    "output_type": "stream",
                    "name": "stdout",
                    "text": "Mean squared error: 3164098.64\n"
                }
            ],
            "source": "from sklearn.metrics import mean_squared_error\nprint(\"Mean squared error: %.2f\" % mean_squared_error(Y_test, Y_pred))\n"
        },
        {
            "source": "This is the mean squared of the error. To get the mean error, take the square root of this number. So, the mean error is 1779. Therefore, our linear model predicts total spending with an average error of $1779.",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "source": "### The  End",
            "cell_type": "markdown",
            "metadata": {}
        },
        {
            "source": "Notebook edited by Harison Mwangi\n\nFind me @ https://www.linkedin.com/in/harison-m-418641102",
            "cell_type": "markdown",
            "metadata": {}
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3.5",
            "name": "python3",
            "language": "python"
        },
        "language_info": {
            "mimetype": "text/x-python",
            "nbconvert_exporter": "python",
            "version": "3.5.5",
            "name": "python",
            "file_extension": ".py",
            "pygments_lexer": "ipython3",
            "codemirror_mode": {
                "version": 3,
                "name": "ipython"
            }
        }
    },
    "nbformat": 4
}