Analysis for customer segmentation blog post
import pandas as pd
# http://blog.yhathq.com/static/misc/data/WineKMC.xlsx
df_offers = pd.read_excel("./WineKMC.xlsx", sheet_name=0)
df_offers.columns = ["offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak"]
df_offers.head()
df_transactions = pd.read_excel("./WineKMC.xlsx", sheet_name=1)
df_transactions.columns = ["customer_name", "offer_id"]
df_transactions['n'] = 1
df_transactions.head()
# join the offers and transactions table
df = pd.merge(df_offers, df_transactions)
# create a "pivot table" with one row per customer and one column per offer,
# marking which offers each customer responded to
matrix = df.pivot_table(index=['customer_name'], columns=['offer_id'], values='n')
# a little tidying up. fill NA values with 0 and make the index into a column
matrix = matrix.fillna(0).reset_index()
x_cols = matrix.columns[1:]
from sklearn.cluster import KMeans
cluster = KMeans(n_clusters=5)
# slice matrix so we only include the 0/1 indicator columns in the clustering
matrix['cluster'] = cluster.fit_predict(matrix[x_cols])
matrix.cluster.value_counts()
from ggplot import *
ggplot(matrix, aes(x='factor(cluster)')) + geom_bar() + xlab("Cluster") + ylab("Customers\n(# in cluster)")
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
# fit PCA once and reuse the projected coordinates for both axes
components = pca.fit_transform(matrix[x_cols])
matrix['x'] = components[:, 0]
matrix['y'] = components[:, 1]
matrix = matrix.reset_index()
customer_clusters = matrix[['customer_name', 'cluster', 'x', 'y']]
customer_clusters.head()
df = pd.merge(df_transactions, customer_clusters)
df = pd.merge(df_offers, df)
ggplot(df, aes(x='x', y='y', color='cluster')) + \
geom_point(size=75) + \
ggtitle("Customers Grouped by Cluster")
cluster_centers = pca.transform(cluster.cluster_centers_)
cluster_centers = pd.DataFrame(cluster_centers, columns=['x', 'y'])
cluster_centers['cluster'] = range(0, len(cluster_centers))
ggplot(df, aes(x='x', y='y', color='cluster')) + \
geom_point(size=75) + \
geom_point(cluster_centers, size=500) +\
ggtitle("Customers Grouped by Cluster")
df['is_4'] = df.cluster==4
df.groupby("is_4").varietal.value_counts()
df.groupby("is_4")[['min_qty', 'discount']].mean()

@PerryGrossman PerryGrossman commented Sep 9, 2015

Hi, Thanks for this! I have a few questions though as I get different clusters and different centers. My results differ from those on the blog.
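
One likely reason for differing clusters is that KMeans starts from randomly chosen centers, so labels and centers can change from run to run unless the seed is fixed. A minimal sketch, assuming a synthetic 0/1 matrix in place of the wine data (the array `X` and the seed value 42 are illustrative and not from the original post):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the customer-by-offer indicator matrix
# (100 customers, 32 offers); not the real WineKMC data.
rng = np.random.RandomState(0)
X = (rng.rand(100, 32) < 0.1).astype(float)

# Fixing random_state makes the initialization, and so the result, repeatable.
labels_a = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)

# True when both runs assign every row to the same cluster label.
same = bool(np.array_equal(labels_a, labels_b))
print(same)
```

Even with a fixed seed, results may still differ from the blog's figures across scikit-learn versions, and the cluster numbering itself is arbitrary.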

@gustavodemari gustavodemari commented Dec 17, 2015

@glamp Line 25 differs from the blog post on Yhat.

@bjhaveri bjhaveri commented Nov 8, 2016

When I try to visualize a similar dataset using ggplot, it just hangs. I have over a million records. Any ideas on how I can visualize the clusters?
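
One common workaround for very large frames is to plot a sample rather than every row. A minimal sketch, assuming a synthetic frame with `x`, `y`, and `cluster` columns like the one built in the gist (the frame `big`, the name `plot_df`, and the cap of 2,000 points per cluster are illustrative choices, not from the original post):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a large clustered dataset with 2-D coordinates.
rng = np.random.RandomState(0)
n_rows = 1000000
big = pd.DataFrame({
    "x": rng.randn(n_rows),
    "y": rng.randn(n_rows),
    "cluster": rng.randint(0, 5, size=n_rows),
})

# Draw up to 2,000 points from each cluster so every cluster stays visible.
per_cluster = 2000
parts = [
    group.sample(n=min(per_cluster, len(group)), random_state=0)
    for _, group in big.groupby("cluster")
]
plot_df = pd.concat(parts)
print(len(plot_df))
```

The sampled `plot_df` can then be passed to the same ggplot call in place of the full frame.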

@Awilonk Awilonk commented Nov 16, 2016

I find it hard to display the centers.
My ggplot seems to be able to display only one point size.
I tried an older ggplot, which cannot even plot two datasets.

Please let me know which version of ggplot you used.

@Awilonk Awilonk commented Nov 16, 2016

[image]
This is your code running on my PC.
Do you know how to fix it?

@alitrack alitrack commented May 3, 2017

When I execute

cluster_centers = pca.transform(cluster.cluster_centers_)

I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-989caa1671e5> in <module>()
----> 1 cluster_centers = pca.transform(cluster.cluster_centers_)

/home/ubuntu/tensorflow/lib/python3.4/site-packages/sklearn/decomposition/base.py in transform(self, X, y)
    130         X = check_array(X)
    131         if self.mean_ is not None:
--> 132             X = X - self.mean_
    133         X_transformed = fast_dot(X, self.components_.T)
    134         if self.whiten:

ValueError: operands could not be broadcast together with shapes (5,31) (32,) 

I use Python 3.
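
The shapes in the traceback, (5,31) and (32,), indicate that KMeans was fit on 31 columns while PCA was fit on 32; `pca.transform` only accepts arrays with the same number of features that PCA was fit on. A minimal sketch with synthetic data (the array `X` is illustrative, not the real wine matrix), fitting both models on the same columns and then reproducing the mismatch on purpose:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for the 0/1 customer-by-offer matrix (32 offer columns).
rng = np.random.RandomState(0)
X = (rng.rand(100, 32) < 0.1).astype(float)

# Fit both models on exactly the same 32 feature columns.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
pca = PCA(n_components=2).fit(X)

# The centers live in the same 32-dimensional space, so they project to 2-D.
centers_2d = pca.transform(kmeans.cluster_centers_)
print(centers_2d.shape)

# Dropping one column reproduces the mismatch: PCA expects 32 features, not 31.
try:
    pca.transform(kmeans.cluster_centers_[:, 1:])
    mismatch_raises = False
except ValueError:
    mismatch_raises = True
print(mismatch_raises)
```

Checking that the column selection passed to `cluster.fit_predict` and to `pca.fit_transform` is identical (`matrix[x_cols]` in the gist) is a reasonable first step when this error appears.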

@avinash-mishra avinash-mishra commented Jun 5, 2018

Where is the blog post?
