@harisonmg
Created April 25, 2019 11:38
{
"nbformat_minor": 1,
"cells": [
{
"source": "# Cluster Analysis Course Notebook",
"cell_type": "markdown",
"metadata": {
"collapsed": true
}
},
{
"source": "### Importing Data files",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 1,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 1,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>PRODUCT CODE</th>\n <th>PRODUCT CATEGORY</th>\n <th>UNIT LIST PRICE</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>30001</td>\n <td>HEALTH &amp; BEAUTY</td>\n <td>$7.45</td>\n </tr>\n <tr>\n <th>1</th>\n <td>30002</td>\n <td>HEALTH &amp; BEAUTY</td>\n <td>$5.35</td>\n </tr>\n <tr>\n <th>2</th>\n <td>30003</td>\n <td>HEALTH &amp; BEAUTY</td>\n <td>$5.49</td>\n </tr>\n <tr>\n <th>3</th>\n <td>30004</td>\n <td>HEALTH &amp; BEAUTY</td>\n <td>$6.46</td>\n </tr>\n <tr>\n <th>4</th>\n <td>30005</td>\n <td>HEALTH &amp; BEAUTY</td>\n <td>$7.33</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " PRODUCT CODE PRODUCT CATEGORY UNIT LIST PRICE\n0 30001 HEALTH & BEAUTY $7.45 \n1 30002 HEALTH & BEAUTY $5.35 \n2 30003 HEALTH & BEAUTY $5.49 \n3 30004 HEALTH & BEAUTY $6.46 \n4 30005 HEALTH & BEAUTY $7.33 "
},
"output_type": "execute_result"
}
],
"source": "# The code was removed by Watson Studio for sharing."
},
{
"execution_count": 2,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 2,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>CUSTOMER NUM</th>\n <th>PRODUCT NUM</th>\n <th>QUANTITY PURCHASED</th>\n <th>DISCOUNT TAKEN</th>\n <th>TRANSACTION DATE</th>\n <th>STOCKOUT</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>10114</td>\n <td>30011</td>\n <td>4</td>\n <td>0.0</td>\n <td>1/2/2015</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>10217</td>\n <td>30016</td>\n <td>3</td>\n <td>0.0</td>\n <td>1/2/2015</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>10224</td>\n <td>30013</td>\n <td>4</td>\n <td>0.0</td>\n <td>1/2/2015</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>10103</td>\n <td>30012</td>\n <td>3</td>\n <td>0.2</td>\n <td>1/2/2015</td>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>10037</td>\n <td>30010</td>\n <td>8</td>\n <td>0.0</td>\n <td>1/2/2015</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " CUSTOMER NUM PRODUCT NUM QUANTITY PURCHASED DISCOUNT TAKEN \\\n0 10114 30011 4 0.0 \n1 10217 30016 3 0.0 \n2 10224 30013 4 0.0 \n3 10103 30012 3 0.2 \n4 10037 30010 8 0.0 \n\n TRANSACTION DATE STOCKOUT \n0 1/2/2015 0 \n1 1/2/2015 0 \n2 1/2/2015 0 \n3 1/2/2015 0 \n4 1/2/2015 0 "
},
"output_type": "execute_result"
}
],
"source": "#Import Transaction DataSet here\nbody = client_0b7f25d6f4c743e1bbf81e3ecac0de47.get_object(Bucket='ibmdnadatasciencelearningproject-donotdelete-pr-foj79k0orfi5wk',Key='DNA- Transaction Data Set - Student 3 of 3.csv')['Body']\n# add missing __iter__ method, so pandas accepts body as file-like object\nif not hasattr(body, \"__iter__\"): body.__iter__ = types.MethodType( __iter__, body )\n\ntransactions_data = pd.read_csv(body,sep='|')\ntransactions_data.head()\n"
},
{
"execution_count": 3,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 3,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>CUSTOMERID</th>\n <th>GENDER</th>\n <th>AGE</th>\n <th>INCOME</th>\n <th>EXPERIENCE SCORE</th>\n <th>LOYALTY GROUP</th>\n <th>ENROLLMENT DATE</th>\n <th>HOUSEHOLD SIZE</th>\n <th>MARITAL STATUS</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>10001</td>\n <td>0</td>\n <td>64</td>\n <td>$133,498</td>\n <td>5</td>\n <td>enrolled</td>\n <td>06-03-2013</td>\n <td>4</td>\n <td>Single</td>\n </tr>\n <tr>\n <th>1</th>\n <td>10002</td>\n <td>0</td>\n <td>42</td>\n <td>$94,475</td>\n <td>9</td>\n <td>notenrolled</td>\n <td>NaN</td>\n <td>6</td>\n <td>Married</td>\n </tr>\n <tr>\n <th>2</th>\n <td>10003</td>\n <td>0</td>\n <td>40</td>\n <td>$88,610</td>\n <td>9</td>\n <td>enrolled</td>\n <td>02-09-2010</td>\n <td>5</td>\n <td>Married</td>\n </tr>\n <tr>\n <th>3</th>\n <td>10004</td>\n <td>0</td>\n <td>38</td>\n <td>$84,313</td>\n <td>8</td>\n <td>enrolled</td>\n <td>06-04-2015</td>\n <td>1</td>\n <td>Single</td>\n </tr>\n <tr>\n <th>4</th>\n <td>10005</td>\n <td>0</td>\n <td>30</td>\n <td>$51,498</td>\n <td>3</td>\n <td>notenrolled</td>\n <td>NaN</td>\n <td>1</td>\n <td>Single</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " CUSTOMERID GENDER AGE INCOME EXPERIENCE SCORE LOYALTY GROUP \\\n0 10001 0 64 $133,498 5 enrolled \n1 10002 0 42 $94,475 9 notenrolled \n2 10003 0 40 $88,610 9 enrolled \n3 10004 0 38 $84,313 8 enrolled \n4 10005 0 30 $51,498 3 notenrolled \n\n ENROLLMENT DATE HOUSEHOLD SIZE MARITAL STATUS \n0 06-03-2013 4 Single \n1 NaN 6 Married \n2 02-09-2010 5 Married \n3 06-04-2015 1 Single \n4 NaN 1 Single "
},
"output_type": "execute_result"
}
],
"source": "#Import Customer Dataset Here\nbody = client_0b7f25d6f4c743e1bbf81e3ecac0de47.get_object(Bucket='ibmdnadatasciencelearningproject-donotdelete-pr-foj79k0orfi5wk',Key='DNA - Customer Data Set - Student 1 of 3.csv')['Body']\n# add missing __iter__ method, so pandas accepts body as file-like object\nif not hasattr(body, \"__iter__\"): body.__iter__ = types.MethodType( __iter__, body )\n\ncustomer_data=pd.read_csv(body)\ncustomer_data.head()\n\n"
},
{
"source": "### Changing data types",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 4,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "customer_data['INCOME']=customer_data['INCOME'].map(lambda x : x.replace('$',''))"
},
{
"execution_count": 5,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "customer_data['INCOME']=customer_data['INCOME'].map(lambda x : int(x.replace(',','')))"
},
{
"source": "### Creating Customer View",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 6,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "trans_products=transactions_data.merge(product_data,how='inner', left_on='PRODUCT NUM', right_on='PRODUCT CODE')"
},
{
"execution_count": 7,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "trans_products['UNIT LIST PRICE']=trans_products['UNIT LIST PRICE'].map(lambda x : float(x.replace('$','')))"
},
{
"execution_count": 8,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "trans_products['Total_Price']=trans_products['QUANTITY PURCHASED'] * trans_products['UNIT LIST PRICE'] * (1- trans_products['DISCOUNT TAKEN'])"
},
{
"execution_count": 9,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "customer_prod_categ=trans_products.groupby(['CUSTOMER NUM','PRODUCT CATEGORY']).agg({'Total_Price':'sum'})"
},
{
"execution_count": 10,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "customer_prod_categ=customer_prod_categ.reset_index()"
},
{
"execution_count": 11,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "customer_pivot=customer_prod_categ.pivot(index='CUSTOMER NUM',columns='PRODUCT CATEGORY',values='Total_Price')"
},
{
"execution_count": 12,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "trans_total_spend=trans_products.groupby('CUSTOMER NUM').agg({'Total_Price':'sum'}).\\\nrename(columns={'Total_Price':'TOTAL SPENT'})"
},
{
"execution_count": 13,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "customer_KPIs=customer_pivot.merge(trans_total_spend,how='inner',left_index=True, right_index=True )"
},
{
"execution_count": 14,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "customer_KPIs=customer_KPIs.fillna(0)\n"
},
{
"execution_count": 15,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "customer_all_view=customer_data.merge(customer_KPIs,how='inner', left_on='CUSTOMERID', right_index=True)"
},
{
"execution_count": 16,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 16,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>CUSTOMERID</th>\n <th>GENDER</th>\n <th>AGE</th>\n <th>INCOME</th>\n <th>EXPERIENCE SCORE</th>\n <th>LOYALTY GROUP</th>\n <th>ENROLLMENT DATE</th>\n <th>HOUSEHOLD SIZE</th>\n <th>MARITAL STATUS</th>\n <th>APPAREL</th>\n <th>ELECTRONICS</th>\n <th>FOOD</th>\n <th>HEALTH &amp; BEAUTY</th>\n <th>TOTAL SPENT</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>10001</td>\n <td>0</td>\n <td>64</td>\n <td>133498</td>\n <td>5</td>\n <td>enrolled</td>\n <td>06-03-2013</td>\n <td>4</td>\n <td>Single</td>\n <td>4022.430</td>\n <td>1601.315</td>\n <td>68.688</td>\n <td>1134.337</td>\n <td>6826.770</td>\n </tr>\n <tr>\n <th>1</th>\n <td>10002</td>\n <td>0</td>\n <td>42</td>\n <td>94475</td>\n <td>9</td>\n <td>notenrolled</td>\n <td>NaN</td>\n <td>6</td>\n <td>Married</td>\n <td>2312.509</td>\n <td>2473.163</td>\n <td>276.779</td>\n <td>0.000</td>\n <td>5062.451</td>\n </tr>\n <tr>\n <th>2</th>\n <td>10003</td>\n <td>0</td>\n <td>40</td>\n <td>88610</td>\n <td>9</td>\n <td>enrolled</td>\n <td>02-09-2010</td>\n <td>5</td>\n <td>Married</td>\n <td>2887.382</td>\n <td>5414.418</td>\n <td>260.640</td>\n <td>0.000</td>\n <td>8562.440</td>\n </tr>\n <tr>\n <th>3</th>\n <td>10004</td>\n <td>0</td>\n <td>38</td>\n <td>84313</td>\n <td>8</td>\n <td>enrolled</td>\n <td>06-04-2015</td>\n <td>1</td>\n <td>Single</td>\n <td>3637.213</td>\n <td>1840.211</td>\n <td>45.270</td>\n <td>0.000</td>\n <td>5522.694</td>\n </tr>\n <tr>\n <th>4</th>\n <td>10005</td>\n <td>0</td>\n <td>30</td>\n <td>51498</td>\n <td>3</td>\n <td>notenrolled</td>\n <td>NaN</td>\n <td>1</td>\n <td>Single</td>\n <td>213.512</td>\n <td>0.000</td>\n <td>0.000</td>\n <td>0.000</td>\n 
<td>213.512</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " CUSTOMERID GENDER AGE INCOME EXPERIENCE SCORE LOYALTY GROUP \\\n0 10001 0 64 133498 5 enrolled \n1 10002 0 42 94475 9 notenrolled \n2 10003 0 40 88610 9 enrolled \n3 10004 0 38 84313 8 enrolled \n4 10005 0 30 51498 3 notenrolled \n\n ENROLLMENT DATE HOUSEHOLD SIZE MARITAL STATUS APPAREL ELECTRONICS \\\n0 06-03-2013 4 Single 4022.430 1601.315 \n1 NaN 6 Married 2312.509 2473.163 \n2 02-09-2010 5 Married 2887.382 5414.418 \n3 06-04-2015 1 Single 3637.213 1840.211 \n4 NaN 1 Single 213.512 0.000 \n\n FOOD HEALTH & BEAUTY TOTAL SPENT \n0 68.688 1134.337 6826.770 \n1 276.779 0.000 5062.451 \n2 260.640 0.000 8562.440 \n3 45.270 0.000 5522.694 \n4 0.000 0.000 213.512 "
},
"output_type": "execute_result"
}
],
"source": "customer_all_view.head()"
},
{
"source": "# Clustering ",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "_Step 1_",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "Import the necessary libraries by using the following code: ",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 17,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "from sklearn.cluster import KMeans\nfrom sklearn.cluster import AgglomerativeClustering"
},
{
"source": "_Step 2_",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "Select the features on which you are clustering. In this example, we cluster \u201cincome\u201d and \u201cTotal spent\u201d variables by using the following code: ",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 18,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 18,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>INCOME</th>\n <th>TOTAL SPENT</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>133498</td>\n <td>6826.770</td>\n </tr>\n <tr>\n <th>1</th>\n <td>94475</td>\n <td>5062.451</td>\n </tr>\n <tr>\n <th>2</th>\n <td>88610</td>\n <td>8562.440</td>\n </tr>\n <tr>\n <th>3</th>\n <td>84313</td>\n <td>5522.694</td>\n </tr>\n <tr>\n <th>4</th>\n <td>51498</td>\n <td>213.512</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " INCOME TOTAL SPENT\n0 133498 6826.770\n1 94475 5062.451\n2 88610 8562.440\n3 84313 5522.694\n4 51498 213.512"
},
"output_type": "execute_result"
}
],
"source": "cluster_input=customer_all_view[['INCOME','TOTAL SPENT']]\ncluster_input.head(5)"
},
{
"source": "The \u201ccluster_input\u201d variable is a Pandas data frame that contains only the columns \u201cincome\u201d and \u201ctotal spent\u201d. We use these two continuous variables because of the following reasons:\n\n Two variables can be easily visualized on a 2-dimensional plot\n Clustering algorithms rely on a distance function (like Euclidean distance) to compute similarity among data points. The sample space for categorical data is discrete and doesn\u2019t have a natural origin, so a Euclidean distance function on such a space isn\u2019t really meaningful.\n",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "_Step 3_",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "Initialize a K-means model with four clusters as follows: ",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 19,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "Kmeans_model=KMeans(n_clusters=4)"
},
{
"source": "Although you can use the elbow method or silhouette to determine the optimal number of clusters, you and Retailer X agreed to divide their customer base into only four clusters. ",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "_Step 4_",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "Look at the parameters of the model by running the following code:\n",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 20,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 20,
"metadata": {},
"data": {
"text/plain": "KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,\n n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',\n random_state=None, tol=0.0001, verbose=0)"
},
"output_type": "execute_result"
}
],
"source": "Kmeans_model"
},
{
"source": "The output shows the parameters that were passed to the model, which decide how the algorithm works because we passed only the number of cluster(k) parameter; every other parameter used its default value.\n\nThe K-means algorithm follows an iterative way for clustering data points. The number of iterations default value is 300, which is visible Out (23) output. This parameter defines the maximum iterative limit because the algorithm can reach convergence before reaching this maximum limit.",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "_Step 5_",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "Run the K-means cluster algorithm on the input by using \u201cfit_predict\u201d method:\n\n",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 21,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "cluster_output = Kmeans_model.fit_predict(cluster_input)"
},
{
"source": "_Step 6_",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "Get the output of the K-means algorithm by using the following code:\n\n",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 22,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 22,
"metadata": {},
"data": {
"text/plain": "array([1, 3, 0, 0, 2, 1, 0, 3, 3, 0, 2, 3, 0, 0, 0, 0, 0, 3, 1, 0, 3, 3, 3,\n 2, 3, 3, 0, 2, 2, 1, 0, 2, 2, 3, 2, 2, 0, 3, 3, 2, 2, 0, 0, 0, 1, 2,\n 2, 0, 1, 2, 3, 2, 0, 2, 3, 2, 3, 3, 3, 2, 0, 3, 1, 3, 1, 0, 0, 3, 2,\n 0, 0, 2, 0, 0, 1, 2, 0, 2, 0, 1, 0, 3, 1, 3, 0, 3, 1, 2, 0, 2, 3, 1,\n 3, 3, 3, 0, 3, 2, 1, 3, 2, 2, 2, 2, 0, 3, 3, 2, 0, 2, 2, 2, 0, 2, 3,\n 2, 1, 1, 0, 0, 2, 3, 3, 2, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 1, 3, 1, 1,\n 3, 0, 0, 2, 2, 2, 0, 3, 3, 2, 3, 0, 0, 3, 3, 3, 0, 3, 2, 2, 3, 2, 1,\n 2, 1, 3, 2, 0, 2, 1, 0, 2, 1, 2, 3, 3, 1, 2, 1, 1, 3, 2, 1, 2, 0, 2,\n 3, 1, 3, 3, 2, 1, 2, 3, 3, 1, 3, 1, 3, 1, 3, 0, 3, 2, 1, 1, 3, 2, 0,\n 1, 1, 1, 1, 2, 3, 0, 1, 3, 0, 3, 0, 2, 0, 2, 3, 2, 2, 0, 3, 2, 2, 0,\n 3, 0, 3, 0, 0, 2, 3, 1, 2, 2, 2, 0, 0, 0, 0, 2, 3, 3, 1, 2, 0, 3, 2,\n 3, 2, 0, 2, 1, 3, 3, 2, 1, 2, 0, 0, 1, 3, 0, 0, 3, 1, 0, 1, 2, 2, 2,\n 1, 0, 3, 3, 3, 0, 0, 0, 1, 0, 3, 0, 3, 1, 1, 2, 2, 1, 2, 2, 2, 2, 0,\n 0, 2, 0, 0, 1, 2, 1, 2, 2, 1, 2, 2, 3, 0, 0, 0, 0, 2, 1, 3, 0, 3, 0,\n 2, 0, 2, 0, 0, 2, 3, 2, 2, 3, 2, 0, 0, 2, 2, 3, 3, 3, 3, 1, 3, 0, 3,\n 3, 0, 1, 3, 0, 2, 0, 2, 3, 1, 1, 3, 0, 0, 3, 1, 3, 2, 3, 3, 2, 0, 3,\n 3, 0, 3, 0, 0, 1, 0, 1, 3, 0, 0, 3, 0, 2, 2, 2, 1, 0, 2, 0, 1, 2, 1,\n 2, 0, 1, 2, 3, 1, 0, 1, 0, 3, 1, 2, 0, 1, 0, 2, 1, 0, 0, 1, 3, 2, 0,\n 1, 2, 3, 3, 1, 2, 2, 3, 0, 0, 3, 0, 1, 3, 0, 2, 3, 0, 0, 2, 1, 0, 3,\n 3, 0, 3, 3, 2, 0, 1, 2, 0, 1, 2, 2, 0, 2, 2, 2, 2, 1, 0, 3, 3, 0, 2,\n 1, 0, 3, 0, 1, 3, 0, 2, 2, 3, 1, 0, 1, 3, 3, 0, 3, 0, 2, 2, 2, 1, 3,\n 1, 0, 2, 3, 2, 3, 1, 0, 0, 1, 0, 2, 3, 3, 3, 2, 0], dtype=int32)"
},
"output_type": "execute_result"
}
],
"source": "cluster_output"
},
{
"source": "_Step 7_",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "The output of step 6 is in a NumPy array type. Run the following command to confirm that the output is in NumPy format and not Pandas. ",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 23,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 23,
"metadata": {},
"data": {
"text/plain": "numpy.ndarray"
},
"output_type": "execute_result"
}
],
"source": "type(cluster_output)"
},
{
"source": "Recall that we used the Pandas data frame structure to store the tabular data that is in the CSV files (products, customers, and transactions). Let us look at the first few records of products by using the following code:\n\n",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 24,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 24,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>PRODUCT CODE</th>\n <th>PRODUCT CATEGORY</th>\n <th>UNIT LIST PRICE</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>30001</td>\n <td>HEALTH &amp; BEAUTY</td>\n <td>$7.45</td>\n </tr>\n <tr>\n <th>1</th>\n <td>30002</td>\n <td>HEALTH &amp; BEAUTY</td>\n <td>$5.35</td>\n </tr>\n <tr>\n <th>2</th>\n <td>30003</td>\n <td>HEALTH &amp; BEAUTY</td>\n <td>$5.49</td>\n </tr>\n <tr>\n <th>3</th>\n <td>30004</td>\n <td>HEALTH &amp; BEAUTY</td>\n <td>$6.46</td>\n </tr>\n <tr>\n <th>4</th>\n <td>30005</td>\n <td>HEALTH &amp; BEAUTY</td>\n <td>$7.33</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " PRODUCT CODE PRODUCT CATEGORY UNIT LIST PRICE\n0 30001 HEALTH & BEAUTY $7.45 \n1 30002 HEALTH & BEAUTY $5.35 \n2 30003 HEALTH & BEAUTY $5.49 \n3 30004 HEALTH & BEAUTY $6.46 \n4 30005 HEALTH & BEAUTY $7.33 "
},
"output_type": "execute_result"
}
],
"source": "product_data.head()"
},
{
"source": " If you run the values method on a Pandas dataframe, it will convert it to Numpy. Run it on Product Pandas and observe the output. ",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": null,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "product_data.head().values"
},
{
"source": "To access particular elements in that array, you use square brackets ([]) to index array values\n\nFor example, if you need to know the value represented by Row 1 (second row) and column 2 (third column), You should use the index [1,2]\n\nTo access the element in the second row and third column, use the following code:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 26,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 26,
"metadata": {},
"data": {
"text/plain": "' $5.35 '"
},
"output_type": "execute_result"
}
],
"source": "product_data.head().values[1,2]"
},
{
"source": "To view all elements presented in the first row, use \u201c:\u201d for the column index to call all columns\n\n",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 27,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 27,
"metadata": {},
"data": {
"text/plain": "array([30002, 'HEALTH & BEAUTY', ' $5.35 '], dtype=object)"
},
"output_type": "execute_result"
}
],
"source": "product_data.head().values[1,:]"
},
{
"source": "Values in the output are presented in a 1-Dimensional array, since we only called a single row of data\n\nTo view the values of the third column, use the following code:\n\n",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 28,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 28,
"metadata": {},
"data": {
"text/plain": "array([' $7.45 ', ' $5.35 ', ' $5.49 ', ' $6.46 ', ' $7.33 '], dtype=object)"
},
"output_type": "execute_result"
}
],
"source": "product_data.head().values[:,2]"
},
{
"source": "Values in the output are presented in a 1-dimensional array because we called only a single column of data.\n\nYou can convert the 1-dimensional NumPy array to a Pandas data frame by using the following code:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 29,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "cluster_output_pd=pd.DataFrame(cluster_output,columns=['segment'])"
},
{
"source": "The \u201ccluster_output\u201d is a 1-dimensional array because a single cluster index is assigned to every customer record.\n\nVerify that \u201ccluster_output_pd\u201d is a Pandas data frame with a single column called \u201csegment\u201d by using the following code:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 30,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 30,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>segment</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n </tr>\n <tr>\n <th>1</th>\n <td>3</td>\n </tr>\n <tr>\n <th>2</th>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>2</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " segment\n0 1\n1 3\n2 0\n3 0\n4 2"
},
"output_type": "execute_result"
}
],
"source": "cluster_output_pd.head()"
},
{
"source": "_Step 8_",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "Merge the cluster input containing the income and total spending for each customer and the cluster output, which contains the cluster index, by using the following code:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 31,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "segment_DF=pd.concat([cluster_input,cluster_output_pd],axis=1)"
},
{
"source": "_Step 9_",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "Verify output using the following command",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 32,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 32,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>INCOME</th>\n <th>TOTAL SPENT</th>\n <th>segment</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>133498</td>\n <td>6826.770</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1</th>\n <td>94475</td>\n <td>5062.451</td>\n <td>3</td>\n </tr>\n <tr>\n <th>2</th>\n <td>88610</td>\n <td>8562.440</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>84313</td>\n <td>5522.694</td>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>51498</td>\n <td>213.512</td>\n <td>2</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " INCOME TOTAL SPENT segment\n0 133498 6826.770 1\n1 94475 5062.451 3\n2 88610 8562.440 0\n3 84313 5522.694 0\n4 51498 213.512 2"
},
"output_type": "execute_result"
}
],
"source": "segment_DF.head()"
},
{
"source": "_Step 10_",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "The cluster centroids that are computed by the algorithm can be found by using a method that is called \u201ccluster_centers\u201d. Apply the method to the K-means model by using the following code:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 33,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 33,
"metadata": {},
"data": {
"text/plain": "array([[ 76337.14084507, 5260.48642958],\n [ 138471.625 , 6972.91513636],\n [ 38530.82608696, 2260.43836232],\n [ 110254.62121212, 7744.12999242]])"
},
"output_type": "execute_result"
}
],
"source": "Kmeans_model.cluster_centers_"
},
{
"source": "_Step 11_",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "The \u201csegment_DF\u201d Pandas data frame that is created in step 8 contains points that belong to all customer segments. To select only those segments that belong to the first cluster (cluster index=0), use the following code:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 34,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 34,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>INCOME</th>\n <th>TOTAL SPENT</th>\n <th>segment</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>2</th>\n <td>88610</td>\n <td>8562.440</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>84313</td>\n <td>5522.694</td>\n <td>0</td>\n </tr>\n <tr>\n <th>6</th>\n <td>65002</td>\n <td>5224.616</td>\n <td>0</td>\n </tr>\n <tr>\n <th>9</th>\n <td>76994</td>\n <td>6620.147</td>\n <td>0</td>\n </tr>\n <tr>\n <th>12</th>\n <td>88829</td>\n <td>4685.902</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " INCOME TOTAL SPENT segment\n2 88610 8562.440 0\n3 84313 5522.694 0\n6 65002 5224.616 0\n9 76994 6620.147 0\n12 88829 4685.902 0"
},
"output_type": "execute_result"
}
],
"source": "segment_DF[segment_DF.segment==0].head()"
},
{
"source": "Condition \u201csegment==0\u201d selects only those assigned to the first cluster. Similarly, you can do the same for other clusters ",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "_Step 12_",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "Plot the clustering results. First, import the plotting library by using the following code:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 35,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "import matplotlib.pyplot as plt"
},
{
"source": "_Step 13_",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "Use the filter condition that was applied in step 10 to select only those customers that belong to the first cluster. Plot their income and age in purple. Select the second cluster and plot their income and age in blue. Repeat for clusters 3 and 4. Plot the cluster centroids that you have computed in step 9. ",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 36,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": "<matplotlib.figure.Figure at 0x7fe7b4eeccc0>"
},
"metadata": {}
}
],
"source": "\nplt.scatter(segment_DF[segment_DF.segment==0]['INCOME'],segment_DF[segment_DF.segment==0]['TOTAL SPENT'],s=50, c='purple',label='Cluster1')\n\nplt.scatter(segment_DF[segment_DF.segment==1]['INCOME'],segment_DF[segment_DF.segment==1]['TOTAL SPENT'],s=50, c='blue',label='Cluster3')\n\nplt.scatter(segment_DF[segment_DF.segment==2]['INCOME'],segment_DF[segment_DF.segment==2]['TOTAL SPENT'],s=50, c='green',label='Cluster4')\n\nplt.scatter(segment_DF[segment_DF.segment==3]['INCOME'],segment_DF[segment_DF.segment==3]['TOTAL SPENT'],s=50, c='cyan',label='Cluster2')\n\nplt.scatter(Kmeans_model.cluster_centers_[:,0], Kmeans_model.cluster_centers_[:,1],s=200,marker='s', c='red', alpha=0.7, label='Centroids')\n\nplt.title('Customer segments using K-means (k=4)')\n\nplt.xlabel('Income')\n\nplt.ylabel('Total Spend')\n\nplt.legend()\n\nplt.show()\n"
},
{
"source": "_Step 14_",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "Using the graph that was produced in step 12, create a table to describe the four customer segments relative to their income and spend.",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "_Step 15_",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "Retailer X must know all about the different customer segments demographics. So, you must discover the characteristics that are associated with each segment, such as the age group, household size, and loyalty enrolment. To do this task, group by each customer segment and calculate group measures such as average age, percentage of loyalty enrolment, and median of house hold size.\n\nMerge the clustering output with the customer all view by using the following code:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 37,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 37,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>CUSTOMERID</th>\n <th>GENDER</th>\n <th>AGE</th>\n <th>INCOME</th>\n <th>EXPERIENCE SCORE</th>\n <th>LOYALTY GROUP</th>\n <th>ENROLLMENT DATE</th>\n <th>HOUSEHOLD SIZE</th>\n <th>MARITAL STATUS</th>\n <th>APPAREL</th>\n <th>ELECTRONICS</th>\n <th>FOOD</th>\n <th>HEALTH &amp; BEAUTY</th>\n <th>TOTAL SPENT</th>\n <th>segment</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>10001</td>\n <td>0</td>\n <td>64</td>\n <td>133498</td>\n <td>5</td>\n <td>enrolled</td>\n <td>06-03-2013</td>\n <td>4</td>\n <td>Single</td>\n <td>4022.430</td>\n <td>1601.315</td>\n <td>68.688</td>\n <td>1134.337</td>\n <td>6826.770</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1</th>\n <td>10002</td>\n <td>0</td>\n <td>42</td>\n <td>94475</td>\n <td>9</td>\n <td>notenrolled</td>\n <td>NaN</td>\n <td>6</td>\n <td>Married</td>\n <td>2312.509</td>\n <td>2473.163</td>\n <td>276.779</td>\n <td>0.000</td>\n <td>5062.451</td>\n <td>3</td>\n </tr>\n <tr>\n <th>2</th>\n <td>10003</td>\n <td>0</td>\n <td>40</td>\n <td>88610</td>\n <td>9</td>\n <td>enrolled</td>\n <td>02-09-2010</td>\n <td>5</td>\n <td>Married</td>\n <td>2887.382</td>\n <td>5414.418</td>\n <td>260.640</td>\n <td>0.000</td>\n <td>8562.440</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>10004</td>\n <td>0</td>\n <td>38</td>\n <td>84313</td>\n <td>8</td>\n <td>enrolled</td>\n <td>06-04-2015</td>\n <td>1</td>\n <td>Single</td>\n <td>3637.213</td>\n <td>1840.211</td>\n <td>45.270</td>\n <td>0.000</td>\n <td>5522.694</td>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>10005</td>\n <td>0</td>\n <td>30</td>\n <td>51498</td>\n <td>3</td>\n <td>notenrolled</td>\n <td>NaN</td>\n <td>1</td>\n <td>Single</td>\n 
<td>213.512</td>\n <td>0.000</td>\n <td>0.000</td>\n <td>0.000</td>\n <td>213.512</td>\n <td>2</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " CUSTOMERID GENDER AGE INCOME EXPERIENCE SCORE LOYALTY GROUP \\\n0 10001 0 64 133498 5 enrolled \n1 10002 0 42 94475 9 notenrolled \n2 10003 0 40 88610 9 enrolled \n3 10004 0 38 84313 8 enrolled \n4 10005 0 30 51498 3 notenrolled \n\n ENROLLMENT DATE HOUSEHOLD SIZE MARITAL STATUS APPAREL ELECTRONICS \\\n0 06-03-2013 4 Single 4022.430 1601.315 \n1 NaN 6 Married 2312.509 2473.163 \n2 02-09-2010 5 Married 2887.382 5414.418 \n3 06-04-2015 1 Single 3637.213 1840.211 \n4 NaN 1 Single 213.512 0.000 \n\n FOOD HEALTH & BEAUTY TOTAL SPENT segment \n0 68.688 1134.337 6826.770 1 \n1 276.779 0.000 5062.451 3 \n2 260.640 0.000 8562.440 0 \n3 45.270 0.000 5522.694 0 \n4 0.000 0.000 213.512 2 "
},
"output_type": "execute_result"
}
],
"source": "customer_demographics=pd.concat([customer_all_view,cluster_output_pd],axis=1)\ncustomer_demographics.head()\n"
},
{
"source": "This output shows the mean age and median household size for each cluster. Notice how age varies significantly across segments.\n\nFor loyalty enrolment, calculate the percentage of participation in each segment by using the following formula:\n\npercentage enrolled = 100 \u00d7 (number of enrolled customers) / (total number of customers)\n\nTo do this, create a function named \u201cpercent_loyalty\u201d by using the following code:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 38,
"cell_type": "code",
"metadata": {},
"outputs": [],
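"source": "def percent_loyalty(series):\n    # .get() returns 0 instead of raising KeyError when a segment has no enrolled customers\n    enrolled = series.value_counts().get('enrolled', 0)\n    # count() ignores missing values\n    return 100 * enrolled / series.count()"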
},
{
"source": "This function accepts a Pandas series as input; here we pass it the \u201cLOYALTY GROUP\u201d column (a single column of a Pandas data frame is a Pandas series). The function returns the enrolment percentage by dividing the number of enrolled customers (obtained with the value_counts method) by the total number of customers (obtained with the count method) and multiplying by 100.\n\nPass the function as an aggregate measure to the \u201cagg\u201d method as follows:",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 39,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 39,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>LOYALTY GROUP</th>\n <th>HOUSEHOLD SIZE</th>\n <th>AGE</th>\n </tr>\n <tr>\n <th>segment</th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>32.394366</td>\n <td>3</td>\n <td>35.661972</td>\n </tr>\n <tr>\n <th>1</th>\n <td>65.909091</td>\n <td>2</td>\n <td>73.420455</td>\n </tr>\n <tr>\n <th>2</th>\n <td>52.173913</td>\n <td>2</td>\n <td>24.449275</td>\n </tr>\n <tr>\n <th>3</th>\n <td>66.666667</td>\n <td>3</td>\n <td>47.416667</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " LOYALTY GROUP HOUSEHOLD SIZE AGE\nsegment \n0 32.394366 3 35.661972\n1 65.909091 2 73.420455\n2 52.173913 2 24.449275\n3 66.666667 3 47.416667"
},
"output_type": "execute_result"
}
],
"source": "customer_demographics.groupby('segment').agg({'AGE':'mean','HOUSEHOLD SIZE':'median','LOYALTY GROUP': percent_loyalty})"
},
{
"source": "Extend the tabular report that you constructed in step 15 to include the segment demographic data you produced.\n\nThe resulting table shows the demographic segmentation for Retailer X customer segments.\n",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "THE END!",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "#### Harison Mwangi",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "https://www.linkedin.com/in/harison-m-418641102",
"cell_type": "markdown",
"metadata": {}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.5",
"name": "python3",
"language": "python"
},
"language_info": {
"mimetype": "text/x-python",
"nbconvert_exporter": "python",
"version": "3.5.5",
"name": "python",
"file_extension": ".py",
"pygments_lexer": "ipython3",
"codemirror_mode": {
"version": 3,
"name": "ipython"
}
}
},
"nbformat": 4
}