Clustering supervised
Clustering groups samples that are similar within the same cluster.
Supervised: the data samples have associated labels.
Uses the K-Nearest Neighbours algorithm.

Clustering and classifying

Clustering groups samples that are similar within the same cluster. The more similar the samples belonging to a cluster group are (and conversely, the more dissimilar samples in separate groups), the better the clustering algorithm has performed. Since clustering is an unsupervised algorithm, this similarity metric must be measured automatically and based solely on your data.

The implementation details and definition of similarity are what differentiate the many clustering algorithms.
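
As a rough, hedged illustration of "similarity measured automatically from the data", scikit-learn's silhouette score condenses within-cluster versus between-cluster distances into a single number. The synthetic blob data and the choice of KMeans below are assumptions made purely for this sketch; they are not part of the assignments later in this page.

# A minimal sketch: judging clustering quality purely from the data,
# with no labels involved. KMeans and the blob data are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels_demo = KMeans(n_clusters=3, random_state=0).fit_predict(X_demo)

# Silhouette compares each sample's distance to its own cluster against
# its distance to the nearest other cluster; values closer to 1 mean
# tighter, better-separated clusters.
print(silhouette_score(X_demo, labels_demo))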

K-Nearest Neighbours

K-Neighbours is a supervised classification algorithm. If clustering is the process of separating your samples into groups, then classification is the process of assigning samples to those groups. Given a set of groups, take a set of samples and mark each sample as being a member of a group; each group is the correct answer, label, or classification of the sample.

The K-Nearest Neighbours (or K-Neighbours) classifier is one of the simplest machine learning algorithms.

K-Nearest Neighbours works by first simply storing all of your training data samples.

Then, when you attempt to classify a new, never-before-seen sample, it finds the nearest "K" samples to it from within your training data. You must have numeric features in order for 'nearest' to be meaningful. There are other methods you can use for categorical features; for example, you can use bag-of-words to vectorize your data. SciKit-Learn's K-Nearest Neighbours only supports numeric features, so you'll have to do whatever has to be done to get your data into that format before proceeding. By default, the distance is measured as a standard Euclidean distance.
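
The lookup step is simple enough to sketch by hand; the toy points below are made up purely for illustration and are not part of the assignments.

# A bare-bones sketch of the lookup step: store the training samples,
# then rank them by Euclidean distance to a new point. Toy data only.
import numpy as np

X_train_toy = np.array([[1.0, 1.0], [2.0, 1.5], [8.0, 8.0], [9.0, 7.5]])
y_train_toy = np.array([0, 0, 1, 1])
new_point = np.array([1.5, 1.2])

K = 3
distances = np.linalg.norm(X_train_toy - new_point, axis=1)  # Euclidean distance to each stored sample
nearest = np.argsort(distances)[:K]                          # indices of the K closest samples
print(nearest, y_train_toy[nearest])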

With the nearest neighbours found, K-Neighbours looks at their classes and takes a mode vote to assign a label to the new data point. Further extensions of K-Neighbours can take into account the distance to the samples to weigh their voting power. For each new prediction or classification made, the algorithm has to again find the nearest neighbours to that sample in order to call a vote for it. This is where a majority of the time is spent, so instead of brute-force searching the training data as if it were stored in a list, tree structures are used to optimize the search times. Due to this, the number of classes in the dataset has no bearing on its execution speed; only the number of records in your training data set does.
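
Both ideas are exposed directly as constructor arguments in scikit-learn's KNeighborsClassifier; the tiny made-up dataset below is only there to make the sketch runnable.

# Distance-weighted voting and a tree-backed neighbour search are both
# just constructor arguments in scikit-learn. Toy data for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_toy = np.array([[1.0, 1.0], [2.0, 1.5], [8.0, 8.0], [9.0, 7.5]])
y_toy = np.array([0, 0, 1, 1])

knn_toy = KNeighborsClassifier(n_neighbors=3,
                               weights='distance',     # closer neighbours get a larger vote
                               algorithm='ball_tree')  # tree search instead of brute force
knn_toy.fit(X_toy, y_toy)
print(knn_toy.predict([[1.5, 1.2]]))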

Decision Boundaries

A unique feature of supervised classification algorithms is their decision boundary, or more generally, their n-dimensional decision surface: a threshold or region which, if crossed, results in your sample being assigned that class.

The decision surface isn't always spherical. In fact, it can take many different types of shapes depending on the algorithm that generated it.

For K-Neighbours, generally the higher your "K" value, the smoother and less jittery your decision surface becomes. Higher K values also result in your model providing probabilistic information about the ratio of samples per class. There is a tradeoff, though: higher K values make the algorithm less sensitive to local fluctuations, since farther samples are taken into account. This causes it to model only the overall classification function without much attention to detail, and it increases the computational complexity of the classification.
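
One way to see the probabilistic side of this is predict_proba, which for K-Neighbours is simply the ratio of classes among the K neighbours. The synthetic two-moons data and the query point below are assumptions for this sketch only.

# Sketch: with larger K, predict_proba reports the ratio of each class
# among the K neighbours, which is the probabilistic information mentioned
# above. Synthetic data, illustrative only.
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_moons(n_samples=200, noise=0.3, random_state=0)

for k in (1, 9, 45):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_demo, y_demo)
    # With K=1 the probabilities are always 0 or 1; with larger K they
    # become finer-grained class ratios.
    print(k, model.predict_proba([[0.5, 0.25]]))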

K-Neighbours is particularly useful when no other model fits your data well, as it is a parameter-free approach to classification. For example, you don't have to worry about whether your data is linearly separable or not.
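
A quick, hedged way to see this: on data that is deliberately not linearly separable, K-Neighbours needs no extra assumptions, while a linear model struggles. The concentric-circles data and the comparison against LogisticRegression are illustrative choices, not part of the assignments.

# Sketch: non-linearly-separable data handled by K-Neighbours without
# any special treatment. Synthetic data only.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_c, y_c = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_c, y_c, random_state=0)

print("linear:", LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te))
print("knn:   ", KNeighborsClassifier(n_neighbors=9).fit(X_tr, y_tr).score(X_te, y_te))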

One caution to keep in mind while using K-Neighbours is that your data needs to be measurable: if there is no metric for discerning distance between your features, K-Neighbours cannot help you. As with all algorithms dependent on distance measures, it is also sensitive to feature scaling. K-Neighbours is likewise sensitive to perturbations and the local structure of your dataset, particularly at lower "K" values.
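
Feature scaling is easy to bolt on in front of the classifier. The Pipeline below is a generic, hedged sketch using the wine dataset bundled with scikit-learn; it is not the preprocessing used in the assignment code that follows (that code tries Normalizer and several other scalers explicitly).

# Sketch: putting a scaler in front of K-Neighbours so no single
# large-unit feature dominates the Euclidean distance. Generic example,
# separate from the assignment code below.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_w, y_w = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_w, y_w, random_state=0)

unscaled = KNeighborsClassifier(n_neighbors=9).fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=9)).fit(X_tr, y_tr)

# The wine features mix very different units, so scaling typically
# changes the accuracy noticeably.
print(unscaled.score(X_te, y_te), scaled.score(X_te, y_te))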

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
matplotlib.style.use('ggplot') # Look Pretty
def plotDecisionBoundary(model, X, y):
    fig = plt.figure()
    ax = fig.add_subplot(111)

    padding = 0.6
    resolution = 0.0025
    colors = ['royalblue', 'forestgreen', 'ghostwhite']

    # Calculate the boundaries
    x_min, x_max = X[:, 0].min(), X[:, 0].max()
    y_min, y_max = X[:, 1].min(), X[:, 1].max()
    x_range = x_max - x_min
    y_range = y_max - y_min
    x_min -= x_range * padding
    y_min -= y_range * padding
    x_max += x_range * padding
    y_max += y_range * padding

    # Create a 2D grid matrix. The values stored in the matrix
    # are the predictions of the class at said location
    xx, yy = np.meshgrid(np.arange(x_min, x_max, resolution),
                         np.arange(y_min, y_max, resolution))

    # What class does the classifier say?
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot the contour map
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.terrain)

    # Plot the original points as well...
    for label in range(len(np.unique(y))):
        indices = np.where(y == label)
        plt.scatter(X[indices, 0], X[indices, 1], c=colors[label],
                    label=str(label), alpha=0.8)

    p = model.get_params()
    plt.axis('tight')
    plt.title('K = ' + str(p['n_neighbors']))
#
# : Load up the dataset into a variable called X.
#
X = pd.read_csv("/Datasets/Wheat.data", index_col='id')
X.head()
#
# : Copy the 'wheat_type' series slice out of X, and into a series
# called 'y'. Then drop the original 'wheat_type' column from the X
#
y = X.wheat_type.copy()
X.drop(['wheat_type'], axis=1, inplace=True)
# : Do a quick, "ordinal" conversion of 'y'. In actuality our
# classification isn't ordinal, but just as an experiment...
#
y = y.astype("category").cat.codes
#
# : Basic nan munging. Fill each row's nans with the mean of the feature
#
X.fillna(X.mean(), inplace=True)
#
# : Split X into training and testing data sets
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                     random_state=1)
#
# : Create an instance of SKLearn's Normalizer class and then train it
# using its .fit() method against the *training* data.
#
#
normaliser = preprocessing.Normalizer().fit(X_train)
#
# : With the trained pre-processor, transform both training AND
# testing data.
#
# NOTE: Any testing data has to be transformed with the preprocessor
# that has been fit against the training data, so that it exists in the same
# feature-space as the original data used to train the models.
#
X_train_normalised = normaliser.transform(X_train)
X_train = pd.DataFrame(X_train_normalised)
X_test_normalised = normaliser.transform(X_test)
X_test = pd.DataFrame(X_test_normalised)
#
# : Just like the preprocessing transformation, create a PCA
# transformation as well. Fit it against the training data, and then
# project the training and testing features into PCA space using the
# PCA model's .transform() method.
#
# NOTE: This has to be done because the only way to visualize the decision
# boundary in 2D is if the KNN algorithm runs in 2D as well.
# Removing the PCA step would actually improve the accuracy,
# since KNeighbours would then be applied to the entire training data,
# not just the two principal components.
#
pca_reducer = PCA(n_components=2).fit(X_train_normalised)
X_train = pca_reducer.transform(X_train_normalised)
X_test = pca_reducer.transform(X_test_normalised)
#
# : Create and train a KNeighborsClassifier. Start with K=9 neighbors.
# NOTE: Be sure to train the classifier against the pre-processed, PCA-
# transformed training data above!
#
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train, y_train)
plotDecisionBoundary(knn, X_train, y_train)
#------------------------------------
#
# : Display the accuracy score of the test data/labels, computed by
# the KNeighbors model.
#
# NOTE: You do NOT have to run .predict before calling .score, since
# .score will take care of running the predictions for you automatically.
#
print(knn.score(X_test, y_test))
# Answer = 0.871428571429
plt.show()
"""
Breast cancer doesn't develop overnight and, like any other cancer, can be treated extremely effectively if detected in its earlier stages. Part of understanding cancer is knowing that not all irregular cell growths are malignant; some are benign, or non-dangerous, non-cancerous growths.
Being able to properly assess whether a tumor is actually benign and ignorable, or malignant and alarming, is therefore important, and it is also a problem that might be solvable through data and machine learning.
This uses the Breast Cancer Wisconsin (Original) data set, provided courtesy of UCI's Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)
"""
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn import manifold
from sklearn.neighbors import KNeighborsClassifier
# If you'd like to try with PCA instead of Isomap,
# as the dimensionality reduction technique:
Test_PCA = True
def plotDecisionBoundary(model, X, y):
    print("Plotting...")
    import matplotlib.pyplot as plt
    import matplotlib
    matplotlib.style.use('ggplot') # Look Pretty

    fig = plt.figure()
    ax = fig.add_subplot(111)

    padding = 0.1
    resolution = 0.1

    # (2 for benign, 4 for malignant)
    colors = {2: 'royalblue', 4: 'lightsalmon'}

    # Calculate the boundaries
    x_min, x_max = X[:, 0].min(), X[:, 0].max()
    y_min, y_max = X[:, 1].min(), X[:, 1].max()
    x_range = x_max - x_min
    y_range = y_max - y_min
    x_min -= x_range * padding
    y_min -= y_range * padding
    x_max += x_range * padding
    y_max += y_range * padding

    # Create a 2D grid matrix. The values stored in the matrix
    # are the predictions of the class at said location
    import numpy as np
    xx, yy = np.meshgrid(np.arange(x_min, x_max, resolution),
                         np.arange(y_min, y_max, resolution))

    # What class does the classifier say?
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot the contour map
    plt.contourf(xx, yy, Z, cmap=plt.cm.seismic)
    plt.axis('tight')

    # Plot the testing points as well...
    for label in np.unique(y):
        indices = np.where(y == label)
        plt.scatter(X[indices, 0], X[indices, 1], c=colors[label], alpha=0.8)

    p = model.get_params()
    plt.title('K = ' + str(p['n_neighbors']))
    plt.show()
#
# : Load in the dataset, identify nans, and set proper headers.
#
X = pd.read_csv("Datasets/breast-cancer-wisconsin.data", header=None,
names=['sample', 'thickness', 'size', 'shape', 'adhesion',
'epithelial', 'nuclei', 'chromatin', 'nucleoli',
'mitoses', 'status'], index_col='sample', na_values='?')
#
# : Copy out the status column into a slice, then drop it from the main
# dataframe.
#
#
y = X.status.copy()
X.drop(['status'], axis=1, inplace=True)
#
# : With the labels safely extracted from the dataset, replace any nan values
# with the mean feature / column value
#
if X.isnull().values.any():
    print("Preprocessing data: substituted all NaN with mean value")
    X.fillna(X.mean(), inplace=True)
else:
    print("Preprocessing data: No NaN found!")
#
# : Do train_test_split. Set random_state=7 for reproducibility, and keep
# the test_size at 0.5 (50%).
#
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size=0.5, random_state=7)
print("*** Starting K-neighbours classifier")
# automate the tuning of hyper-parameters using for-loops to traverse your
# search space.
scalers = [preprocessing.Normalizer, preprocessing.StandardScaler,
           preprocessing.MinMaxScaler, preprocessing.RobustScaler]
reducers = [False, True]
weights = ['uniform', 'distance']
#
# : Experiment with the basic SKLearn preprocessing scalers. We know that
# the features consist of different units mixed in together, so it might be
# reasonable to assume feature scaling is necessary. Print out a description
# of the dataset, post transformation. Recall: when you do pre-processing,
# which portion of the dataset is your model trained upon? Also which portion(s)
# of your dataset actually get transformed?
#
for scaler in scalers:
    print("* Scaler = ", scaler)
    scalerTrained = scaler().fit(X_train)
    X_train_scaled = scalerTrained.transform(X_train)
    X_test_scaled = scalerTrained.transform(X_test)

    print("PCA? | K | Weight | Score")
    print("--------------------------------------")
    #
    reducer = None
    for isPCA in reducers:
        if isPCA:
            #
            # : Implement PCA here.
            # You should reduce down to two dimensions.
            #
            reducer = PCA(n_components=2).fit(X_train_scaled)
        else:
            #
            # : Implement Isomap here. K values from 5-10.
            # You should reduce down to two dimensions.
            #
            reducer = manifold.Isomap(n_neighbors=10, n_components=2).fit(X_train_scaled)

        #
        # : Train the reducer against the training data, then transform both
        # the training and testing data using it. You can save the results right
        # back into the variables themselves.
        #
        X_train_reduced = reducer.transform(X_train_scaled)
        X_test_reduced = reducer.transform(X_test_scaled)

        #
        # : Implement and train KNeighborsClassifier on your projected 2D
        # training data here. You can use any K value from 1-15, so play around
        # with it and see what results you can come up with. Your goal is to find a
        # good balance where you aren't too specific (low-K), nor too
        # general (high-K). You should also experiment with how changing the weights
        # parameter affects the results.
        #
        for k in range(1, 16):
            for weight in weights:
                knmodel = KNeighborsClassifier(n_neighbors=k, weights=weight)
                knmodel.fit(X_train_reduced, y_train)

                #
                # INFO: Be sure to always keep the domain of the problem in mind! It's
                # WAY less harmful to errantly classify a benign tumor as malignant,
                # and have it removed, than to incorrectly leave a malignant tumor, believing
                # it to be benign, and then have the patient's cancer progress. Since the UDF
                # weights don't give you any class information, the only way to introduce this
                # knowledge into SKLearn's KNN classifier is by "baking" it into your data, for
                # example by randomly reducing the ratio of benign samples compared to malignant
                # samples in the training set (see the sketch after this script).
                #
                # : Calculate + Print the accuracy of the testing set
                #
                print("{0} | {1} | {2} | {3}".format(isPCA, k, weight,
                      knmodel.score(X_test_reduced, y_test)))

plotDecisionBoundary(knmodel, X_test_reduced, y_test)
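
The INFO note in the loop above suggests biasing the training data itself. Below is one minimal, hedged sketch of what that could look like, reusing the X_train / y_train split from the script above; the 50% keep-ratio and the use of numpy's RandomState are arbitrary illustrative choices, not part of the assignment. You would then rerun the scaling, reduction and fitting steps on the biased split.

# Sketch: "baking" the domain knowledge into the data by undersampling the
# benign class (label 2) in the *training* split only, so that missing a
# malignant tumor becomes relatively more costly to the classifier.
import numpy as np

rng = np.random.RandomState(7)
benign_idx = y_train[y_train == 2].index
drop_idx = rng.choice(benign_idx, size=len(benign_idx) // 2, replace=False)  # drop half, arbitrarily

X_train_biased = X_train.drop(drop_idx)
y_train_biased = y_train.drop(drop_idx)
print(y_train.value_counts(), y_train_biased.value_counts(), sep='\n')
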
import math
import pandas as pd
import numpy as np
import scipy.io
import matplotlib.pyplot as plt
import matplotlib
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn import manifold
from sklearn.neighbors import KNeighborsClassifier
# set the dimensionality reduction technique: PCA or Isomap
Test_PCA = True
matplotlib.style.use('ggplot') # Look Pretty
def Plot2DBoundary(model, DTrain, LTrain, DTest, LTest):
    # The dots are training samples (img not drawn), and the pics are testing samples (images drawn)
    # Play around with the K values. This is a very controlled dataset, so it
    # should be possible to get perfect classification on the testing entries.
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.set_title('Transformed Boundary, Image Space -> 2D')

    padding = 0.1   # Zoom out
    resolution = 1  # Don't get too detailed; smaller values (finer rez) will take longer to compute
    colors = ['blue', 'green', 'orange', 'red']

    # ------
    # Calculate the boundaries of the mesh grid. The mesh grid is
    # a standard grid (think graph paper), where each point will be
    # sent to the classifier (KNeighbors) to predict what class it
    # belongs to. This is why KNeighbors has to be trained against
    # 2D data, so we can produce this contour. Once we have the
    # label for each point on the grid, we can color it appropriately
    # and plot it.
    x_min, x_max = DTrain[:, 0].min(), DTrain[:, 0].max()
    y_min, y_max = DTrain[:, 1].min(), DTrain[:, 1].max()
    x_range = x_max - x_min
    y_range = y_max - y_min
    x_min -= x_range * padding
    y_min -= y_range * padding
    x_max += x_range * padding
    y_max += y_range * padding

    # Using the boundaries, actually make the 2D grid matrix:
    xx, yy = np.meshgrid(np.arange(x_min, x_max, resolution),
                         np.arange(y_min, y_max, resolution))

    # What class does the classifier say about each spot on the chart?
    # The values stored in the matrix are the predictions of the model
    # at said location:
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot the mesh grid as a filled contour plot, drawn beneath the images:
    plt.contourf(xx, yy, Z, cmap=plt.cm.terrain, zorder=-100)

    # ------
    # When plotting the testing images, used to validate that the algorithm
    # is functioning correctly, size them as 5% of the overall chart size
    x_size = x_range * 0.05
    y_size = y_range * 0.05

    # First, plot the images in your TEST dataset
    img_num = 0
    for index in LTest.index:
        # DTest is a regular NDArray, so you'll iterate over it one row at a time.
        x0, y0 = DTest[img_num, 0] - x_size / 2., DTest[img_num, 1] - y_size / 2.
        x1, y1 = DTest[img_num, 0] + x_size / 2., DTest[img_num, 1] + y_size / 2.

        # DTest holds our images isomap-transformed into 2D. But we still want
        # to plot the original image, so we look to the original, untouched
        # dataset (at index) to get the pixels:
        img = df.iloc[index, :].values.reshape(num_pixels, num_pixels)
        ax.imshow(img, aspect='auto', cmap=plt.cm.gray, interpolation='nearest',
                  zorder=100000, extent=(x0, x1, y0, y1), alpha=0.8)
        img_num += 1

    # Plot your TRAINING points as well... as points rather than as images
    for label in range(len(np.unique(LTrain))):
        indices = np.where(LTrain == label)
        ax.scatter(DTrain[indices, 0], DTrain[indices, 1], c=colors[label], alpha=0.8, marker='o')

    # Plot
    plt.show()
# Same dataset as in the PCA example!
# Load up face_data.mat, calculate the
# num_pixels value, and rotate the images so they are right-side-up
# instead of sideways.
#
mat = scipy.io.loadmat('/Datasets/face_data.mat')
df = pd.DataFrame(mat['images']).T
num_images, num_pixels = df.shape
num_pixels = int(math.sqrt(num_pixels))
# Rotate the pictures, so we don't have to crane our necks:
for i in range(num_images):
    df.loc[i, :] = df.loc[i, :].values.reshape(num_pixels, num_pixels).T.reshape(-1)
#
# : Load up your face_labels dataset. It only has a single column, and
# you're only interested in that single column. You have to slice the
# column out so that you have access to it as a "Series" rather than as a
# "Dataframe".
#
labelsDF = pd.read_csv("Datasets/face_labels.csv", header=None)
labels = labelsDF.iloc[:,0]
#
# : Do train_test_split. The labels are actually passed in as a series
# (instead of as an NDArray) to access their underlying indices
# later on. This is necessary to find the samples in the original
# dataframe, which is used to plot the testing data as images rather
# than as points:
#
data_train, data_test, label_train, label_test = train_test_split(df, labels,
                                                                   test_size=0.15, random_state=7)
if Test_PCA:
    # INFO: PCA is used *before* KNeighbors to simplify the high-dimensionality
    # image samples down to just 2 principal components! A lot of information
    # (variance) is lost during the process, as I'm sure you can imagine. But
    # you have to drop the dimension down to two, otherwise you wouldn't be able
    # to visualize a 2D decision surface / boundary. In the wild, you'd probably
    # leave in a lot more dimensions, but wouldn't need to plot the boundary;
    # simply checking the results would suffice.
    #
    # The model should only be trained (fit) against the training data (data_train).
    # Once you've done this, use the model to transform both data_train
    # and data_test from their original high-D image feature space, down to 2D
    #
    #
    # : Implement PCA. ONLY train against your training data, but
    # transform both training + test data, storing the results back into
    # data_train and data_test.
    #
    pca_reducer = PCA(n_components=2).fit(data_train)
    data_train = pca_reducer.transform(data_train)
    data_test = pca_reducer.transform(data_test)
else:
    # INFO: Isomap is used *before* KNeighbors to simplify the high-dimensionality
    # image samples down to just 2 components! A lot of information is
    # lost during the process, as I'm sure you can imagine. But if you have
    # non-linear data that can be represented on a 2D manifold, you probably will
    # be left with a far superior dataset to use for classification. Plus, by
    # having the images in 2D space, you can plot them as well as visualize a 2D
    # decision surface / boundary. In the wild, you'd probably leave in a lot
    # more dimensions, but wouldn't need to plot the boundary; simply checking
    # the results would suffice.
    #
    # The model should only be trained (fit) against the training data (data_train).
    # Once you've done this, use the model to transform both data_train
    # and data_test from their original high-D image feature space, down to 2D
    #
    # : Implement Isomap. ONLY train against your training data, but
    # transform both your training + test data, storing the results back into
    # data_train and data_test.
    #
    iso_reducer = manifold.Isomap(n_neighbors=5, n_components=2).fit(data_train)
    data_train = iso_reducer.transform(data_train)
    data_test = iso_reducer.transform(data_test)
#
# : Implement KNeighborsClassifier.
#
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(data_train, label_train)
#
# : Calculate + Print the accuracy of the testing set (data_test and
# label_test).
#
#
print(knn.score(data_test, label_test))
# isomap: 0.961904761905
# pca: 0.571428571429
# Chart the combined decision boundary, the training data as 2D plots, and
# the testing data as small images so we can visually validate performance.
Plot2DBoundary(knn, data_train, label_train, data_test, label_test)