@Mashimo
Last active August 5, 2022 02:45
Clustering data
Clustering groups samples that are similar within the same cluster.
Unsupervised: no label provided in the data samples.
Use the K-means algorithm.

Clustering

Clustering groups samples that are similar within the same cluster. The more similar the samples belonging to a cluster group are (and conversely, the more dissimilar samples in separate groups), the better the clustering algorithm has performed. Since clustering is an unsupervised algorithm, this similarity metric must be measured automatically and based solely on your data.

The implementation details and definition of similarity are what differentiate the many clustering algorithms.

K-means

The K-Means way of doing this is to iteratively separate your samples into a user-specified number K of cluster groups of roughly equal variance.
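As a minimal sketch (using made-up 2-D points rather than any of the datasets below), fitting K-Means with scikit-learn and inspecting the result looks roughly like this:

import numpy as np
from sklearn.cluster import KMeans

# Two loose blobs of toy points; K=2 should place one centroid in each blob
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])
model = KMeans(n_clusters=2, n_init=10, init='random')
model.fit(X)
print(model.cluster_centers_)  # one (x, y) centroid per cluster
print(model.labels_)           # cluster index assigned to each sample
print(model.inertia_)          # within-cluster sum of squared distances

The same fit / cluster_centers_ / labels_ pattern is what the scripts below use on the CDR, crime, and wholesale-customer data.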

"""
After the September 11 attacks, a series of secret regulations, laws, and processes were enacted, perhaps to better protect the citizens of the United States. These processes continued through President Bush's term and were renewed and strengthened during the Obama administration. Then, on May 24, 2006, the United States Foreign Intelligence Surveillance Court (FISC) made a fundamental shift in its approach to Section 215 of the Patriot Act, permitting the FBI to compel production of "business records" relevant to terrorism investigations, which are shared with the NSA. The court now defined as business records the entirety of a telephone company's call database, also known as Call Detail Records (CDR or metadata).
News of this came to public light after an ex-NSA contractor leaked the information, and a few more questions were raised when it was further discovered that not just the call records of suspected terrorists were being collected in bulk... but perhaps those of Americans as a whole. After all, if you know someone who knows someone who knows someone, your private records are relevant to a terrorism investigation. The White House quickly reassured the public in a press release that "Nobody is listening to your telephone calls," since, "that's not what this program is about." The public was greatly relieved.
The questions explored here: exactly how useful is telephone metadata? It must have some use, otherwise the government wouldn't have invested the millions it did into secretly collecting it from phone carriers.
Also, what kind of intelligence can you extract from CDR metadata besides its face value?
It uses a sample CDR dataset generated for 10 people living in the Dallas, Texas metroplex area.
It attempts to do what many researchers have already successfully done - partly de-anonymize the CDR data. People generally behave in predictable manners, moving from home to work with a few errands in between. With enough call data, given a few K locations of interest, K-Means should be able to isolate rather easily the geolocations where a person spends most of their time.
CDRs are at least supposed to be protected by privacy laws, and are the basis for proprietary revenue calculations. In reality, there are quite a few public CDRs out there. Much information can be discerned from them, such as social networks, criminal acts, and, believe it or not, even the spread of diseases, as was demonstrated by the Flowminder Foundation's paper on Ebola.
"""
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot') # Look Pretty
def showandtell(title=None):
    if title is not None:
        plt.savefig(title + ".png", bbox_inches='tight', dpi=300)
    plt.show()
#
# INFO: This dataset has call records for 10 users tracked over the course of 3 years.
# We find out where each user likely lives and works.
def clusterInfo(model):
    print("Cluster Analysis Inertia: ", model.inertia_)
    print('------------------------------------------')
    for i in range(len(model.cluster_centers_)):
        print("\n  Cluster ", i)
        print("    Centroid ", model.cluster_centers_[i])
        print("    #Samples ", (model.labels_==i).sum())
# Find the cluster with the least # attached nodes
def clusterWithFewestSamples(model):
    # Ensure there's at least one cluster...
    minSamples = len(model.labels_)
    minCluster = 0
    for i in range(len(model.cluster_centers_)):
        if minSamples > (model.labels_==i).sum():
            minCluster = i
            minSamples = (model.labels_==i).sum()
    print("\n  Cluster With Fewest Samples: ", minCluster)
    return (model.labels_==minCluster)
# Find the cluster with the most # attached nodes
def clusterWithMostSamples(model):
    # Ensure there's at least one cluster...
    maxSamples = 0
    maxCluster = 0
    for i in range(len(model.cluster_centers_)):
        if (model.labels_==i).sum() > maxSamples:
            maxCluster = i
            maxSamples = (model.labels_==i).sum()
    print("\n  Cluster With Most Samples: ", maxCluster)
    print("\n  Centroid: ", model.cluster_centers_[maxCluster])
    return (model.labels_ == maxCluster)
def doKMeans(data, clusters=0):
    #
    # Only feed the Lat and Lon coordinates into the KMeans algo, since none of the other
    # data is suitable for this purpose. Since both Lat and Lon are (approximately) on the same scale,
    # no feature scaling is required. Print out the centroid locations and add them onto your scatter
    # plot. Use a distinguishable marker and color.
    #
    sliceK = data[['TowerLat', 'TowerLon']]
    model = KMeans(n_clusters=clusters, n_init=10, init='random')
    model.fit(sliceK)
    return model
#
# : Load up the dataset and take a peek at its head
# Convert the date using pd.to_datetime, and the time using pd.to_timedelta
#
df = pd.read_csv("Datasets/CDR.csv")
df.CallDate = pd.to_datetime(df.CallDate, errors='coerce')
df.CallTime = pd.to_timedelta(df.CallTime, errors='coerce')
print(df.dtypes)
#
# : Get a distinct list of "In" phone numbers (users) and store the values in a
# regular python list.
#
usersIn = df.In.unique().tolist()
#
# : Create a slice that filters to only include dataset records where the
# "In" feature (user phone number) is equal to the first number on your unique list above;
# that is, the very first number in the dataset
#
user1 = df[df.In == usersIn[0]]
# INFO: Plot all the call locations
user1.plot.scatter(x='TowerLon', y='TowerLat', c='gray', alpha=0.1, title='Call Locations')
showandtell()
#
# INFO: The locations map above should be too "busy" to really wrap your head around. This
# is where domain expertise comes into play. Your intuition tells you that people are likely
# to behave differently on weekends:
#
# On Weekends:
# 1. People probably don't go into work
# 2. They probably sleep in late on Saturday
# 3. They probably run a bunch of random errands, since they couldn't during the week
# 4. They should be home, at least during the very late hours, e.g. 1-4 AM
#
# On Weekdays:
# 1. People probably are at work during normal working hours
# 2. They probably are at home in the early morning and during the late night
# 3. They probably spend time commuting between work and home everyday
#
# : Add more filters to the user1 slice you created. Add bitwise logic so that you're
# only examining records that came in on weekends (sat/sun).
#
user1 = user1[(user1.DOW == "Sat") | (user1.DOW == "Sun")]
#
# : Further filter it down to calls that came in either before 6AM OR after 10pm (22:00:00).
# Since CallTime was converted to a timedelta above, you can still compare it against
# military-time strings such as "06:00:00" or "22:00:00"; pandas coerces them to timedeltas:
# https://en.wikipedia.org/wiki/24-hour_clock
#
# You might also want to review the Data Manipulation section for this. Once you have your filtered
# slice, print out its length:
#
user1 = user1[(user1.CallTime < "06:00:00") | (user1.CallTime > "22:00:00")]
#
# INFO: Visualize the dataframe with a scatter plot as a sanity check. Since you're familiar
# with maps, you know well that your X-Coordinate should be Longitude, and your Y coordinate
# should be the tower Latitude. Check the dataset headers for proper column feature names.
# https://en.wikipedia.org/wiki/Geographic_coordinate_system#Geographic_latitude_and_longitude
#
# At this point, you don't yet know exactly where the user is located just based off the cell
# phone tower position data; but considering the calls below arrived in the twilight hours of
# weekends, wherever they are bunched up is probably near the caller's residence:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(user1.TowerLon,user1.TowerLat, c='g', marker='o', alpha=0.2)
ax.set_title('Weekend Calls (<6am or >10p)')
showandtell()
#
# : Run K-Means with K=1. There really should only be a single area of concentration. If you
# notice multiple areas that are "hot" (multiple areas the user spends a lot of time at that are FAR
# apart from one another), then increase to K=2, with the goal being that one of the centroids will
# sweep up the annoying outliers, and the other will zero in on the user's approximate home location.
# Or rather, the location of the cell tower closest to their home.....
#
# Only feed in Lat and Lon coordinates to the KMeans algo, since none of the other
# data is suitable for your purposes. Since both Lat and Lon are (approximately) on the same scale,
# no feature scaling is required. Print out the centroid locations and add them onto your scatter
# plot. Use a distinguishable marker and color.
#
#
sliceK = user1[['TowerLon', 'TowerLat']]
model = KMeans(n_clusters=1, n_init=10, init='random')
model.fit(sliceK)
#
# INFO: Print and plot the centroids...
centroids = model.cluster_centers_
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(centroids[:,0], centroids[:,1], marker='x', c='red', alpha=0.5,
linewidths=3, s=169)
showandtell()
print (centroids)
#
# You can repeat the above steps for all 10 individuals, being sure to record their approximate home
# locations.
#
# : Create a unique list of the phone-number values (users) stored in the
# "In" column of the dataset, and save it to a variable called `unique_numbers`.
# Manually check through unique_numbers to ensure the numbers appear in the
# same order they (uniquely) appear in your dataset:
# since the order matters, we cannot use a set....
#
unique_numbers = []
for row in df.In:
    if row not in unique_numbers:
        unique_numbers.append(row)
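# (Hedged aside: pandas' Series.unique() also returns values in order of first
# appearance, so the loop above could arguably be replaced with the commented-out
# one-liner below.)
# unique_numbers = df.In.unique().tolist()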
for user in unique_numbers:
    print("\n\nExamining person: ", user)
    userData = df[df.In == user]
    #
    # : Alter your slice so that it includes only Weekday (Mon-Fri) values.
    #
    userData = userData[(userData.DOW != "Sat") & (userData.DOW != "Sun")]
    #
    # : The idea is to keep only calls placed before 5pm. From midnight to 7:30am, the user is
    # probably sleeping and won't call / wake up to take a call. There should be a brief time
    # in the morning during their commute to work, then they'll spend the entire day at work.
    # So the assumption is that most of the time is spent either at work or, secondly, at home.
    #
    userData = userData[(userData.CallTime < "17:00:00")]
    #
    # INFO: Run K-Means with K=3 or K=4. There really should only be two areas of concentration. If you
    # notice multiple areas that are "hot" (multiple areas the user spends a lot of time at that are FAR
    # apart from one another), then increase to K=5, with the goal being that all centroids except two will
    # sweep up the annoying outliers and not-home, not-work travel occasions. The other two will zero in
    # on the user's approximate home and work locations. Or rather, the location of the cell
    # tower closest to them.....
    model = doKMeans(userData, 3)
    midWayClusterIndices = clusterWithMostSamples(model)
# 32.7219411,-96.8993887 --> 2894365987
"""
Explore the City of Chicago's Crime data set, which is part of their Open Data initiative.
"""
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Look Pretty
plt.style.use('ggplot')
#
# To procure the dataset, follow these steps:
# 1. Navigate to: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
# 2. In the 'Primary Type' column, click on the 'Menu' button next to the info button,
# and select 'Filter This Column'. It might take a second for the filter option to
# show up, since it has to load the entire list first.
# 3. Scroll down to 'GAMBLING'
# 4. Click the light blue 'Export' button next to the 'Filter' button, and select 'Download As CSV'
def doKMeans(df):
    #
    # INFO: Plot the data with a '.' marker and 0.3 alpha at the Longitude
    # and Latitude locations in your dataset. Longitude = x, Latitude = y
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(df.Longitude, df.Latitude, marker='.', alpha=0.3)
    #
    # : Filter df so that you're only looking at Longitude and Latitude,
    # since the remaining columns aren't really applicable for this purpose.
    #
    sliceK = df[['Longitude', 'Latitude']]
    #
    # : Use K-Means to try and find seven cluster centers in this df.
    #
    model = KMeans(n_clusters=7, n_init=10, init='random')
    model.fit(sliceK)
    #
    # INFO: Print and plot the centroids...
    centroids = model.cluster_centers_
    ax.scatter(centroids[:,0], centroids[:,1], marker='x', c='red', alpha=0.5,
               linewidths=3, s=169)
    print(centroids)
#
# : Load dataset after importing Pandas
#
df = pd.read_csv("Datasets/Crimes_-_2001_to_present.csv", index_col="ID")
#
# : Drop any ROWs with nans in them
#
df.dropna(inplace=True)
#
# : Print out the dtypes of your dset
#
print (df.describe())
print(df.dtypes)
#
# Coerce the 'Date' feature (which is currently a string object) into real date,
# and confirm by re-printing the dtypes. NOTE: This is a slow process...
#
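# (Hedged aside: pd.to_datetime runs much faster when given an explicit format string.
# Assuming the export uses the portal's usual "03/18/2015 07:44:00 PM" style -- verify
# against your own CSV first -- the commented-out line below is one way to speed it up.)
# df.Date = pd.to_datetime(df.Date, format="%m/%d/%Y %I:%M:%S %p", errors='coerce')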
df.Date = pd.to_datetime(df.Date, errors='coerce')
print(df.dtypes)
# INFO: Print & Plot your data
doKMeans(df)
#
# : Filter out the data so that it only contains samples that have
# a Date > '2011-01-01', using indexing. Then, in a new figure, plot the
# crime incidents, as well as a new K-Means run's centroids.
#
df2 = df[df.Date > '2011-01-01']
# INFO: Print & Plot your data
doKMeans(df2)
plt.show()
import pandas as pd
from sklearn import preprocessing
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import matplotlib
#
# : Parameters to play around with
PLOT_TYPE_TEXT = False # If you'd like to see indices
PLOT_VECTORS = True # If you'd like to see your original features in P.C.-Space
matplotlib.style.use('ggplot') # Look Pretty
c = ['red', 'green', 'blue', 'orange', 'yellow', 'brown']
def drawVectors(transformed_features, components_, columns, plt):
    num_columns = len(columns)
    # This function will project your *original* features (columns)
    # onto your principal component feature-space, so that you can
    # visualize how "important" each one was in the
    # multi-dimensional scaling
    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:,0])
    yvector = components_[1] * max(transformed_features[:,1])
    ## Visualize projections
    # Sort each column by its length. These are your *original*
    # columns, not the principal components.
    import math
    important_features = {columns[i]: math.sqrt(xvector[i]**2 + yvector[i]**2) for i in range(num_columns)}
    important_features = sorted(zip(important_features.values(), important_features.keys()), reverse=True)
    print("Projected Features by importance:\n", important_features)
    ax = plt.axes()
    for i in range(num_columns):
        # Use an arrow to project each original feature as a
        # labeled vector on your principal component axes
        plt.arrow(0, 0, xvector[i], yvector[i], color='b', width=0.0005, head_width=0.02, alpha=0.75, zorder=600000)
        plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='b', alpha=0.75, zorder=600000)
    return ax
def doPCA(data, dimensions=2):
    from sklearn.decomposition import PCA
    model = PCA(n_components=dimensions, svd_solver='randomized', random_state=7)
    model.fit(data)
    return model
def doKMeans(data, clusters=0):
    #
    # KMeans clustering: pass in the number-of-clusters parameter
    # and fit it against the data. Returns a tuple containing the
    # cluster centers and the labels.
    #
    model = KMeans(n_clusters=clusters, n_init=10, init='random')
    model.fit(data)
    return model.cluster_centers_, model.labels_
#
# : Load up the dataset. It may or may not have NaNs in it. Make
# sure to catch and destroy them by setting them to 0. This is valid
# for this dataset, since if the value is missing, you can assume no $ was spent
# on it.
#
df = pd.read_csv("Datasets/Wholesale customers data.csv")
if df.isnull().values.any():
    print("substituted all NaN with Zero")
    df.fillna(0, inplace=True)
else:
    print("No NaN found!")
#
# get rid of the 'Channel' and 'Region' columns, since
# we'll be investigating as if this were a single location wholesaler, rather
# than a national / international one. Leaving these fields in here would cause
# KMeans to examine and give weight to them.
#
df.drop(['Channel', 'Region'], axis=1, inplace = True)
#
# : Before unitizing / standardizing / normalizing your data in preparation for
# K-Means, it's a good idea to get a quick peek at it. You can do this using the
# .describe() method, or even by using the built-in pandas df.plot.hist()
#
df.plot.hist()
#
# INFO: Having checked out your data, you may have noticed there's a pretty big gap
# between the top customers in each feature category and the rest. Some feature
# scaling algos won't get rid of outliers for you, so it's a good idea to handle that
# manually---particularly if your goal is NOT to determine the top customers. After
# all, you can do that with a simple Pandas .sort_values() and not a machine
# learning clustering algorithm. From a business perspective, you're probably more
# interested in clustering your +/- 2 standard deviation customers, rather than the
# creme de la creme, or bottom of the barrel'ers
#
# Remove top 5 and bottom 5 samples for each column:
drop = {}
for col in df.columns:
    # Bottom 5
    sort = df.sort_values(by=col, ascending=True)
    if len(sort) > 5: sort = sort[:5]
    for index in sort.index: drop[index] = True  # Just store the index once
    # Top 5
    sort = df.sort_values(by=col, ascending=False)
    if len(sort) > 5: sort = sort[:5]
    for index in sort.index: drop[index] = True  # Just store the index once
#
# INFO: Drop rows by index. We do this all at once in case there are
# collisions. This way, we don't end up dropping more rows than we have
# to, if a single row satisfies the drop for multiple columns.
# Since there are 6 columns, if we end up dropping < 5*2*6 = 60 rows, that means
# there indeed were collisions.
print ("Dropping {0} Outliers...".format(len(drop)))
df.drop(inplace=True, labels=drop.keys(), axis=0)
print (df.describe())
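#
# (Optional, commented-out sketch of the "+/- 2 standard deviation" idea mentioned in the
# INFO block above; it is an alternative to the top/bottom-5 trim, not part of this
# pipeline. It keeps only rows whose every feature lies within 2 standard deviations
# of that feature's mean.)
# zscores = (df - df.mean()) / df.std()
# df = df[(zscores.abs() <= 2).all(axis=1)]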
#
# INFO: What are you interested in?
#
# Depending on what you're interested in, you might take a different approach
# to normalizing/standardizing your data.
#
# You should note that all columns left in the dataset are of the same unit.
# You might ask yourself, do I even need to normalize / standardize the data?
# The answer depends on what you're trying to accomplish. For instance, although
# all the units are the same (generic money unit), the price per item in your
# store isn't. There may be some cheap items and some expensive ones. If your goal
# is to find out what items people tend to buy together but you didn't
# unitize properly before running KMeans, the contribution of the lesser-priced
# item would be dwarfed by the more expensive item.
#
# For a great overview on a few of the normalization methods supported in SKLearn,
# please check out: https://stackoverflow.com/questions/30918781/right-function-for-normalizing-input-of-sklearn-svm
#
# Suffice to say, at the end of the day, you're going to have to know what question
# you want answered and what data you have available in order to select the best
# method for your purpose. Luckily, SKLearn's interfaces are easy to switch out
# so in the mean time, you can experiment with all of them and see how they alter
# your results.
#
#
# 5-sec summary before you dive deeper online:
#
# NORMALIZATION: Let's say your users spend a LOT. Normalization divides each item by
# the sample's overall amount of spending. Stated differently, your
# new feature is the share of overall spending going into
# that particular item: $spent on feature / $overall spent by sample
#
# MINMAX: What % in the overall range of $spent by all users on THIS particular
# feature is the current sample's feature at? When you're dealing with
# all the same units, this will produce a near face-value amount. Be
# careful though: if you have even a single outlier, it can cause all
# your data to get squashed up in lower percentages.
# Imagine your buyers usually spend $100 on wholesale milk, but today
# only spent $20. This is the relationship you're trying to capture
# with MinMax. NOTE: MinMax doesn't standardize (std. dev.); it only
# normalizes / unitizes your feature, in the mathematical sense.
# MinMax can be used as an alternative to zero mean, unit variance scaling.
# (sampleFeatureValue - min) / (max - min), optionally rescaled to a new
# range via * (new_max - new_min) + new_min.
# Where min and max are the overall feature values across all samples.
#
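#
# A quick toy illustration of the difference, independent of df (the little array below is
# made up for demonstration only): StandardScaler centers a feature to zero mean / unit
# variance, while MinMaxScaler squashes it into [0, 1] via (x - min) / (max - min); note
# how the single outlier crowds the MinMax-scaled values toward 0.
import numpy as np
_toy = np.array([[1.0], [2.0], [3.0], [100.0]])  # one feature with one big outlier
print("StandardScaler:", preprocessing.StandardScaler().fit_transform(_toy).ravel())
print("MinMaxScaler:  ", preprocessing.MinMaxScaler().fit_transform(_toy).ravel())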
# : Un-comment just ***ONE*** of the lines at a time and see how it alters your results.
# Pay attention to the direction of the arrows, as well as their LENGTHS
#T = preprocessing.StandardScaler().fit_transform(df)
#T = preprocessing.MinMaxScaler().fit_transform(df)
#T = preprocessing.MaxAbsScaler().fit_transform(df)
#T = preprocessing.Normalizer().fit_transform(df)
T = df # No Change
#
# INFO: Sometimes people perform PCA before doing KMeans, so that KMeans only
# operates on the most meaningful features. In our case, there are so few features
# that doing PCA ahead of time isn't really necessary, and you can do KMeans in
# feature space. But keep in mind you have the option to transform your data to
# bring down its dimensionality. If you take that route, then your Clusters will
# already be in PCA-transformed feature space, and you won't have to project them
# again for visualization.
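# (Commented-out sketch of that optional PCA-first route, not part of this exercise's flow:)
# pre_pca = doPCA(T)                          # reduce to 2 principal components first...
# T_reduced = pre_pca.transform(T)
# centroids, labels = doKMeans(T_reduced, 3)  # ...then cluster directly in PCA space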
# Do KMeans
n_clusters = 3
centroids, labels = doKMeans(T, n_clusters)
#
# : Print out your centroids. They're currently in feature-space, which
# is good. Print them out before you transform them into PCA space for viewing
#
print ("Centroids: ")
for cen in range(n_clusters):
    print("  {0}: ".format(cen))
    print(centroids[cen])
# Do PCA *after* to visualize the results. Project the centroids as well as
# the samples into the new 2D feature space for visualization purposes.
display_pca = doPCA(T)
T = display_pca.transform(T)
CC = display_pca.transform(centroids)
# Visualize all the samples. Give them the color of their cluster label
fig = plt.figure()
ax = fig.add_subplot(111)
if PLOT_TYPE_TEXT:
    # Plot the index of the sample, so you can further investigate it in your dset
    for i in range(len(T)): ax.text(T[i,0], T[i,1], df.index[i], color=c[labels[i]], alpha=0.75, zorder=600000)
    ax.set_xlim(min(T[:,0])*1.2, max(T[:,0])*1.2)
    ax.set_ylim(min(T[:,1])*1.2, max(T[:,1])*1.2)
else:
    # Plot a regular scatter plot
    sample_colors = [c[labels[i]] for i in range(len(T))]
    ax.scatter(T[:, 0], T[:, 1], c=sample_colors, marker='o', alpha=0.2)
# Plot the Centroids as X's, and label them
ax.scatter(CC[:, 0], CC[:, 1], marker='x', s=169, linewidths=3, zorder=1000, c=c[:n_clusters])
for i in range(len(centroids)): ax.text(CC[i, 0], CC[i, 1], str(i), zorder=500010, fontsize=18, color=c[i])
# Display feature vectors for investigation:
if PLOT_VECTORS:
    drawVectors(T, display_pca.components_, df.columns, plt)
# Add the cluster label back into the dataframe and display it:
df['label'] = pd.Series(labels, index=df.index)
print (df)
plt.show()