K-Means implementation
# attach the predicted cluster labels and count the points in each cluster
frame = pd.DataFrame(data_scaled)
frame['cluster'] = pred
frame['cluster'].value_counts()
# plot each cluster in its own color, with the centroids in red
color=['blue','green','cyan']
for k in range(K):
    # a separate name avoids overwriting the `data` DataFrame
    cluster_points=X[X["Cluster"]==k+1]
    plt.scatter(cluster_points["ApplicantIncome"],cluster_points["LoanAmount"],c=color[k])
plt.scatter(Centroids["ApplicantIncome"],Centroids["LoanAmount"],c='red')
plt.xlabel('Income')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()
# reading the data and looking at the first five rows
data = pd.read_csv('clustering.csv')
data.head()
# statistics of the data
data.describe()
# fitting multiple k-means algorithms and storing the values in an empty list
SSE = []
for cluster in range(1,20):
    # n_jobs is omitted: KMeans no longer accepts it in scikit-learn 1.0 and later
    kmeans = KMeans(n_clusters = cluster, init='k-means++')
    kmeans.fit(data_scaled)
    SSE.append(kmeans.inertia_)

# converting the results into a dataframe and plotting them
frame = pd.DataFrame({'Cluster':range(1,20), 'SSE':SSE})
plt.figure(figsize=(12,6))
plt.plot(frame['Cluster'], frame['SSE'], marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
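The inertia plotted above is the sum of squared distances from each sample to its closest cluster center. A minimal sketch of that quantity in plain Python follows; the `inertia` helper and the toy points are hypothetical and only illustrate why the curve drops as clusters are added.

```python
def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        # squared Euclidean distance to every centroid, keep the smallest
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c))
            for c in centroids
        )
    return total

# Toy data (made up for illustration): two groups far apart on the x axis.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

sse_one_cluster = inertia(points, [(5.0, 1.0)])
sse_two_clusters = inertia(points, [(0.0, 1.0), (10.0, 1.0)])
print(sse_one_cluster, sse_two_clusters)
```

Going from one centroid to two removes most of the squared distance here, which is the kind of sharp drop the elbow plot is meant to reveal.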
# k means using 5 clusters and k-means++ initialization
# (n_jobs is omitted: KMeans no longer accepts it in scikit-learn 1.0 and later)
kmeans = KMeans(n_clusters = 5, init='k-means++')
kmeans.fit(data_scaled)
pred = kmeans.predict(data_scaled)
# inertia on the fitted data
kmeans.inertia_
# defining the kmeans function with initialization as k-means++
kmeans = KMeans(n_clusters=2, init='k-means++')

# fitting the k means algorithm on scaled data
kmeans.fit(data_scaled)
# Step 3 - Assign all the points to the closest cluster centroid
# Step 4 - Recompute centroids of newly formed clusters
# Step 5 - Repeat step 3 and 4
diff = 1
j=0

while(diff!=0):
    XD=X
    i=1
    # distance from every point to each centroid, stored in columns 1..K of X
    for index1,row_c in Centroids.iterrows():
        ED=[]
        for index2,row_d in XD.iterrows():
            d1=(row_c["ApplicantIncome"]-row_d["ApplicantIncome"])**2
            d2=(row_c["LoanAmount"]-row_d["LoanAmount"])**2
            d=np.sqrt(d1+d2)
            ED.append(d)
        X[i]=ED
        i=i+1

    # assign each point to the centroid with the smallest distance
    C=[]
    for index,row in X.iterrows():
        min_dist=row[1]
        pos=1
        for i in range(K):
            if row[i+1] < min_dist:
                min_dist = row[i+1]
                pos=i+1
        C.append(pos)
    X["Cluster"]=C
    Centroids_new = X.groupby(["Cluster"]).mean()[["LoanAmount","ApplicantIncome"]]
    if j == 0:
        # the initial centroids are indexed by row, not by cluster, so skip the first comparison
        diff=1
        j=j+1
    else:
        # absolute differences, so that positive and negative shifts cannot cancel out
        diff = (Centroids_new['LoanAmount'] - Centroids['LoanAmount']).abs().sum() + (Centroids_new['ApplicantIncome'] - Centroids['ApplicantIncome']).abs().sum()
        print(diff)
    Centroids = X.groupby(["Cluster"]).mean()[["LoanAmount","ApplicantIncome"]]
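Steps 3 to 5 can also be sketched without pandas. The version below is a minimal illustration in plain Python, not the gist's own implementation: the helper names (`assign`, `recompute`, `kmeans`), the `max_iter` cap, and the toy points are assumptions made for the example, and it stops when the assignments stop changing rather than when the centroid shift reaches zero.

```python
import math

def assign(points, centroids):
    """Step 3: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]

def recompute(points, labels, centroids):
    """Step 4: mean of each cluster; an empty cluster keeps its old centroid."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            new_centroids.append(
                tuple(sum(coord) / len(members) for coord in zip(*members))
            )
        else:
            new_centroids.append(old)
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Step 5: repeat steps 3 and 4 until the assignments stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = recompute(points, labels, centroids)
    return labels, centroids

# Toy data (made up for illustration): two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

Comparing assignments avoids testing floating-point centroid differences for exact equality, and the iteration cap guards against a loop that never settles.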
# import libraries
import pandas as pd
import numpy as np
import random as rd
import matplotlib.pyplot as plt
# importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans
# Step 1 and 2 - Choose the number of clusters (k) and select random centroid for each cluster

# number of clusters
K=3

# Select random observation as centroids
Centroids = (X.sample(n=K))
plt.scatter(X["ApplicantIncome"],X["LoanAmount"],c='black')
plt.scatter(Centroids["ApplicantIncome"],Centroids["LoanAmount"],c='red')
plt.xlabel('AnnualIncome')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()
# standardizing the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# statistics of scaled data
pd.DataFrame(data_scaled).describe()
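With its default settings, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. A minimal sketch of that computation for a single column follows; the `standardize` helper and the sample incomes are hypothetical and only illustrate the formula.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Return (x - mean) / std for one column, using the population std."""
    mu = fmean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # a constant column carries no spread; map every value to 0
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Toy column (made up for illustration).
incomes = [2000.0, 4000.0, 6000.0]
scaled = standardize(incomes)
print(scaled)
```

After scaling, the column has mean 0 and standard deviation 1, so a feature measured in large units no longer dominates the Euclidean distances that k-means relies on.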
# keep the two features used for clustering; copy so that columns added later do not trigger a SettingWithCopyWarning
X = data[["LoanAmount","ApplicantIncome"]].copy()

# Visualise data points
plt.scatter(X["ApplicantIncome"],X["LoanAmount"],c='black')
plt.xlabel('AnnualIncome')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()
# reading the data and looking at the first five rows of the data
data=pd.read_csv("Wholesale customers data.csv")
data.head()