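from gensim.models import TfidfModel
# Fit TF-IDF weights on the bag-of-words corpus (default weighting scheme).
tfidf = TfidfModel(bow_corpus)  # , smartirs='npu')
tfidf_corpus = tfidf[bow_corpus]
# Apply the model to the first three documents so the weights themselves are printed.
print([tfidf[doc] for doc in bow_corpus[:3]])
# Map token ids back to words, paired with their TF-IDF weights.
id_words = [[(gensim_dictionary[token_id], weight) for token_id, weight in line] for line in tfidf_corpus]
print(id_words)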
from gensim.corpora import Dictionary
gensim_dictionary = Dictionary()
# Build the bag-of-words corpus; allow_update=True grows the dictionary as documents are seen.
bow_corpus = [gensim_dictionary.doc2bow(doc, allow_update=True) for doc in text_tokenized]
print(bow_corpus[:3])
# Map token ids back to words, paired with their counts.
id_words = [[(gensim_dictionary[token_id], count) for token_id, count in line] for line in bow_corpus]
print(id_words)
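To make the `(token_id, count)` pairs above easier to read, here is a minimal plain-Python sketch of the same idea. It is not gensim's implementation: `build_bow` and the toy documents are hypothetical, and the ids it assigns (first-seen order) will generally differ from the ids that `Dictionary` assigns.

```python
from collections import Counter


def build_bow(tokenized_docs):
    """Return (token_to_id, corpus), where corpus holds sorted (token_id, count) pairs.

    Illustrative only: ids are handed out in first-seen order, which is an
    assumption of this sketch and not how gensim numbers its tokens.
    """
    token_to_id = {}
    corpus = []
    for tokens in tokenized_docs:
        # Register unseen tokens, mirroring the effect of allow_update=True.
        for token in tokens:
            token_to_id.setdefault(token, len(token_to_id))
        # Count how often each id occurs in this document.
        counts = Counter(token_to_id[token] for token in tokens)
        corpus.append(sorted(counts.items()))
    return token_to_id, corpus


# Hypothetical toy input, already tokenized.
toy_docs = [["new", "york", "new", "york"], ["york", "news"]]
token_to_id, toy_corpus = build_bow(toy_docs)
print(token_to_id)
print(toy_corpus)
```

Each inner list is one document, and each pair says how many times a given token id occurs in it, which is the same shape that `doc2bow` returns.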
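from gensim.parsing.preprocessing import preprocess_string
# Tokenize and normalize each description with gensim's default preprocessing filters.
text_tokenized = []
for doc in train['Description']:
    k = preprocess_string(doc)
    text_tokenized.append(k)
# Inspect the first three tokenized documents (displayed as notebook cell output).
text_tokenized[0:3]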
import pandas as pd
import numpy as np
# Show full column contents; None replaces the deprecated -1 value.
pd.set_option('display.max_colwidth', None)
train = pd.read_csv("/content/gdrive/My Drive/data/gensim/ag_news_train.csv")
import pyLDAvis
pyLDAvis.save_html(p, "/content/gdrive/My Drive/data/gensim/gensim_LDA_AGnews.html")  # p: prepared pyLDAvis data from an earlier step
from sklearn.metrics.pairwise import cosine_similarity
df = pd.DataFrame(cv_fit.toarray())
print(cosine_similarity(df, df))
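As a rough illustration of what the pairwise matrix above contains, the sketch below computes cosine similarity by hand on count vectors using only the standard library. The helper names (`count_vector`, `cosine`), the toy sentences, and the whitespace tokenizer are assumptions made for this example; `CountVectorizer` tokenizes differently, so the numbers will not match its output in general.

```python
import math
from collections import Counter


def count_vector(text, vocabulary):
    """Return term counts for `text`, ordered like `vocabulary` (whitespace tokens)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|), defined as 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents, chosen so that some pairs share no words.
docs = ["new york new york", "start spreading the news", "the heart of new york"]
vocabulary = sorted({word for doc in docs for word in doc.split()})
vectors = [count_vector(doc, vocabulary) for doc in docs]

# Pairwise similarity matrix, analogous in shape to cosine_similarity(df, df).
similarity = [[cosine(u, v) for v in vectors] for u in vectors]
for row in similarity:
    print([round(value, 3) for value in row])
```

The diagonal is 1 because every document is identical to itself, and documents with no words in common score 0.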
cv_fit.toarray()
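from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
cv = CountVectorizer()
cv_fit = cv.fit_transform(doc_list)  # sparse document-term count matrix
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
word_list = cv.get_feature_names_out()
count_list = cv_fit.toarray().sum(axis=0)  # total count of each word across all documents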
doc_list = [
    "Start spreading the news",
    "You're leaving today (tell him friend)",
    "I want to be a part of it, New York, New York",
    "Your vagabond shoes, they are longing to stray",
    "And steps around the heart of it, New York, New York"
]
for i in [55, 16, 0]:
    print("Topic", i, "is:", lda_bow.print_topic(i))