@joshua-taylor
Last active November 24, 2019 13:48
Cluster labels
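import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumes df is a pandas DataFrame with questionText and cluster columns.
# Fit TF-IDF on every question so the IDF weights reflect the whole corpus.
vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(df.questionText.values)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
feature_array = np.array(vectorizer.get_feature_names_out())
cluster_counts = df.cluster.value_counts()
n = 10  # number of top terms to print per cluster
totals = 0  # running count of items across the clusters shown
for cluster in cluster_counts.index[:10]:  # the ten largest clusters
    # Treat all questions in the cluster as one document and score it.
    stg = " ".join(df.loc[df.cluster == cluster].questionText.values)
    response = vectorizer.transform([stg])
    count = cluster_counts.loc[cluster]
    totals += count
    # Term indices sorted by descending TF-IDF score.
    tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]
    print("Cluster Label: {}, Items in Cluster: {}".format(cluster, count))
    print(feature_array[tfidf_sorting][:n])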