@shantanuo
Created December 11, 2019 07:53
Find duplicate strings
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
df = pd.read_excel('final_dupes_all.xlsx', sheet_name = 'all_records')
df.columns = [' xyz', ... ' flg_univ ', ]
# Drop rows whose college_name is missing or empty.
df['mylen'] = df.college_name.str.len().fillna(0).astype(int)
df = df[df['mylen'] != 0]
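# Drop missing values before casting; astype(str) would otherwise turn them into the string 'nan'.
messages = df.iloc[:, 1].dropna().astype(str).values
nelbow = len(messages) // 2  # number of clusters: half the number of strings
messages1 = messages  # missing values were already dropped above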
tf = TfidfVectorizer()
tfidf_matrix = tf.fit_transform(messages1)
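# pd.SparseDataFrame was removed in pandas 1.0; DataFrame.sparse.from_spmatrix replaces it.
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
ndf = pd.DataFrame.sparse.from_spmatrix(tfidf_matrix, columns=tf.get_feature_names_out())
# Absent terms are already stored as 0, so no fillna is needed; KMeans accepts the sparse matrix directly.
X = tfidf_matrix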
kmeans = KMeans(n_clusters=nelbow)
pred = kmeans.fit_predict(X)
ml_df = pd.DataFrame()
ml_df["messages"] = messages1  # same array the vectorizer saw, so it lines up with pred
ml_df["template_types"] = pred
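# One row per cluster, one column per member string.
ndf = ml_df.groupby("template_types")["messages"].apply(list)
final = ndf.apply(pd.Series)
final.to_csv('report21.csv')
# Clusters with at least two members (column 1 is filled) are the possible duplicates.
final[~final[1].isnull()]
final[~final[1].isnull()].to_csv('possible_dupes.csv')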
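
The script above depends on a private spreadsheet, so here is a minimal self-contained sketch of the same idea (TF-IDF vectors clustered with KMeans, keeping only clusters with two or more members) on a made-up list of names. The function name `group_possible_dupes`, the sample strings, and the `n_init` / `random_state` settings are assumptions for illustration and are not part of the original gist.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def group_possible_dupes(strings, n_clusters):
    """Cluster strings on their TF-IDF vectors; return clusters with 2+ members."""
    # One sparse row per input string; the vectorizer lowercases by default.
    vectors = TfidfVectorizer().fit_transform(strings)
    # Fixed seed (an assumption) so repeated runs give the same labels.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)
    clusters = defaultdict(list)
    for text, label in zip(strings, labels):
        clusters[int(label)].append(text)
    # Singleton clusters are not duplicate candidates.
    return [members for members in clusters.values() if len(members) > 1]


# Hypothetical sample data: the first two strings differ only in case.
sample = ["Stanford University", "stanford university", "MIT", "Harvard College"]
groups = group_possible_dupes(sample, n_clusters=3)
print(groups)
```

Unlike the script, which sets the cluster count to half the number of rows, this sketch takes it as an explicit argument. Either way the choice decides how aggressively strings are merged, so the output is a list of candidates to review by hand, not confirmed duplicates.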