Created
August 7, 2021 16:03
Create dictionary from text file in Gensim
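from gensim.utils import simple_preprocess
from gensim import corpora

# Read the sample text; the context manager closes the file afterwards
with open('sample_text.txt', encoding='utf-8') as text2:
    raw_text = text2.read()

# Split on periods into rough sentences, then tokenize each one
# (deacc=True strips accent marks from the tokens)
tokens2 = []
for line in raw_text.split('.'):
    tokens2.append(simple_preprocess(line, deacc=True))

# Build the dictionary, which maps each unique token to an integer id
g_dict2 = corpora.Dictionary(tokens2)
print("The dictionary has: " + str(len(g_dict2)) + " tokens\n")
print(g_dict2.token2id)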
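To make the token-to-id idea concrete without installing Gensim, here is a minimal plain-Python sketch of the same kind of mapping. It is an illustration only, not Gensim's implementation: `build_token2id` is a hypothetical helper, the regex tokenizer is a rough stand-in for `simple_preprocess`, and ids are assigned in first-seen order, which may differ from the ids `corpora.Dictionary` assigns.

```python
import re


def build_token2id(text):
    """Map each unique token to an integer id, in first-seen order.

    Hypothetical helper for illustration; not part of Gensim.
    """
    token2id = {}
    # Split on periods into rough sentences, as the snippet above does
    for sentence in text.split('.'):
        # Stand-in tokenizer: lowercase alphabetic runs of 2+ characters
        for token in re.findall(r'[a-z]{2,}', sentence.lower()):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


sample = "NLP is a branch of data science. NLP helps organize text data."
mapping = build_token2id(sample)
print("The dictionary has: " + str(len(mapping)) + " tokens")
print(mapping)
```

Repeated tokens such as "nlp" and "data" keep the id they received the first time they were seen, so the mapping only grows when a new token appears.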
NLP is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from text data in a smart and efficient manner. By utilizing NLP and its components, one can organize massive chunks of text data, perform numerous automated tasks, and solve a wide range of problems such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.