Yuktha-Majella's GitHub Gists
@Yuktha-Majella
Yuktha-Majella / LDA_topicdist_article5_word
Last active October 21, 2021 21:34
Topic distribution for each word in the fifth article's corpus
print("Total words in the corpus: ", len(data_corpus[4]))
for i in range(len(data_corpus[4])):
print("\t ",i, ": ", doc_lda[1][i], doc_lda[2][i])
@Yuktha-Majella
Yuktha-Majella / LDA_topicdist_article5
Last active October 21, 2021 21:20
Topic distribution of the fifth article in the dataset
#Topic distribution for the fifth article
print("Category of Article 5: ", df["Category"][4])
print("Article 5: ", df["Text"][4])
#Transform the article's bag-of-words directly rather than transforming the
#whole corpus and indexing into the result
doc_lda = lda_model[data_corpus[4]]
print("\nTopic Distribution in the fifth article: ", doc_lda[0])
@Yuktha-Majella
Yuktha-Majella / LDA_topic-keyword_dist
Created October 21, 2021 20:46
The list of topics is produced along with the keywords and weights associated with each topic
pprint(lda_model.print_topics())
@Yuktha-Majella
Yuktha-Majella / LDA_ldamodel
Created October 21, 2021 20:43
Setting up the LDA model for topic modeling
#Construct the LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=data_corpus,
                                            id2word=data_dict,
                                            num_topics=5,
                                            chunksize=100,
                                            alpha='auto',
                                            per_word_topics=True)

#More readable representation of the corpus: map each token id back to its word
data_corpus_word = [[(data_dict[id], freq) for id, freq in cp] for cp in data_corpus[:1]]
print("Corpus: \n", data_corpus_word)
@Yuktha-Majella
Yuktha-Majella / LDA_dict_corpus
Created October 21, 2021 20:33
The dictionary and corpus are constructed from the processed data
#Create Dictionary
data_dict = corpora.Dictionary(data_tokens_lem)
print("Dictionary: ", data_dict)
#Create Corpus
data_corpus = [data_dict.doc2bow(text) for text in data_tokens_lem]
print("Corpus: \n", data_corpus[:1])
@Yuktha-Majella
Yuktha-Majella / LDA_bigrams_lemmatize
Created October 21, 2021 20:31
The stop words are eliminated from the tokenized data, bigrams are constructed, and the data is lemmatized
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

#Eliminate stop words, extending NLTK's list with corpus-specific filler words
stop_words = stopwords.words('english')
stop_words.extend(['say', 'may', 'also', 'get', 'go', 'know', 'need', 'like', 'make',
                   'see', 'want', 'come', 'take', 'use', 'would', 'tell', 'could',
                   'include', 'can', 'bbc', 'mr', 'mrs'])
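The preview ends before the bigram construction the description mentions; below is a minimal sketch of that step with gensim's Phrases/Phraser (imported in LDA_import_packages further down), assuming the stop-word-filtered tokens live in a list named data_tokens, which is a guessed name.

#Hypothetical sketch of the bigram step; data_tokens is an assumed name for
#the tokenized, stop-word-filtered documents
bigram = Phrases(data_tokens, min_count=5, threshold=100)  #detect frequent word pairs
bigram_phraser = Phraser(bigram)                           #frozen, faster phrase model
data_tokens_bigrams = [bigram_phraser[doc] for doc in data_tokens]
data_tokens_lem = lemmatization(data_tokens_bigrams)       #feeds LDA_dict_corpus above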
@Yuktha-Majella
Yuktha-Majella / LDA_textcleaning_tokenize
Created October 21, 2021 20:28
The text data is cleaned and tokenized
#Function to clean the text and remove punctuation
def normalized_text(text, stem_words=True):
    if pd.isnull(text):
        return ''
    if not isinstance(text, str) or text == '':
        return ''
    text = re.sub(r"\s+", " ", text)        #collapse runs of whitespace
    text = re.sub(r"\'s", " ", text)        #drop possessive 's
    text = re.sub(r"\'ve", " have ", text)  #expand contractions
@Yuktha-Majella
Yuktha-Majella / LDA_import_data
Created October 21, 2021 20:25
Importing the BBC News Dataset and loading it into a pandas DataFrame.
df = pd.read_csv('/content/BBC News Data.csv')
df = df.dropna().reset_index(drop=True)
print("Shape: ", df.shape)
print("Unique Categories: ", df.Category.unique())
print(df.head())
@Yuktha-Majella
Yuktha-Majella / LDA_import_packages
Created October 21, 2021 20:17
Importing necessary libraries and packages for constructing the LDA model
import pandas as pd
import re
import numpy as np
from string import punctuation
from pprint import pprint
import gensim
from gensim import corpora
from gensim.models import Phrases
from gensim.models.phrases import Phraser
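The preview stops here, while the gists above also rely on NLTK's stop words and a spaCy pipeline; the import list likely continues along these lines, which is an assumption.

#Assumed continuation: imports the other gists depend on but the preview omits
import nltk
from nltk.corpus import stopwords
import spacy

nltk.download('stopwords')  #needed once before stopwords.words('english')
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])  #used by lemmatization()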