# Pre-processing for Content
List_Content = DF['Content_nGrams'].to_list()
Final_Article = []
Complete_Content = []
for article in List_Content:
    # Cleaned text of the Content attribute after pre-processing
    Processed_Content = text_preprocessing(article)
    Final_Article.append(Processed_Content)
Complete_Content.extend(Final_Article)
DF['Updated_content'] = Complete_Content
#print(Complete_Content)
# Main function that merges all the preprocessing steps.
from nltk.corpus import stopwords

def text_preprocessing(text, punctuations=True, token=True,
                       stop_words=True, apostrophe=False, verbs=False):
    """
    Preprocess the input text and return the cleaned text.
    """
    stoplist = set(stopwords.words('english'))
from nltk.tokenize import word_tokenize

def tokenize_text(Updated_content):
    """
    Tokenize one article whose stopwords and punctuation have already been
    removed, and return its list of word tokens. Applied to every article,
    this yields a list of token lists.
    """
    return word_tokenize(Updated_content)
def removing_special_characters(text):
    """Remove all special characters except those explicitly kept by the
    regex, because they carry important meaning in the given text.
    arguments:
        text: input text of type str.
    return:
        value: text with the unneeded special characters removed.
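The body of this function is not shown here. As a minimal sketch of the idea in the docstring, the version below removes every character that is not on a whitelist. The function name `remove_special_characters_sketch` and the particular set of kept symbols are assumptions chosen for illustration, not the original pattern.

```python
import re

# Assumption: keep letters, digits, whitespace and a few symbols that often
# carry meaning (. % $ _ -). The original regex is not shown in the source.
_KEEP_PATTERN = re.compile(r"[^a-zA-Z0-9\s.%$_-]")

def remove_special_characters_sketch(text):
    """Drop every character that is not whitelisted in _KEEP_PATTERN."""
    return _KEEP_PATTERN.sub("", text)

cleaned = remove_special_characters_sketch("Price: $5 (approx.) #deal!")
print(cleaned)
```

Which characters belong on the whitelist depends on the corpus; for financial text, for example, `$` and `%` are usually worth keeping.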
def removing_stopwords(text):
    """Remove stopwords, which add little meaning to a sentence and can be
    dropped safely without compromising its meaning.
    arguments:
        text: input text of type str.
    return:
        value: text with all stopwords omitted.
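The body of this function is not shown here either. The sketch below illustrates one plausible way to drop stopwords, assuming whitespace-separated words and a case-insensitive match. To stay runnable without NLTK data, it uses a tiny hard-coded set as a stand-in for the NLTK English stopword list; `STOPLIST` and `remove_stopwords_sketch` are hypothetical names.

```python
# Assumption: a small hard-coded set stands in for
# nltk.corpus.stopwords.words('english') so the sketch needs no downloads.
STOPLIST = {"a", "an", "the", "is", "of", "and", "to", "in"}

def remove_stopwords_sketch(text, stoplist=STOPLIST):
    """Return text with every whitespace-separated stopword dropped."""
    kept = [word for word in text.split() if word.lower() not in stoplist]
    return " ".join(kept)

print(remove_stopwords_sketch("The price of oil is rising"))
```

Using a `set` for the stoplist keeps each membership check constant-time, which matters when the function runs over every word of every article.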
def add_ngrams_to_input(Processed_content, Mapping):
    """
    Replace each occurrence of an original n_Gram in the text with its
    combined (underscore-joined) form. The input list is modified in place.
    """
    for i in range(len(Processed_content)):
        for key, value in Mapping.items():
            Processed_content[i] = Processed_content[i].replace(key, value)
    return Processed_content

content_nGrams = add_ngrams_to_input(Processed_Content, Mapping)
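One thing to watch with plain `str.replace` in dictionary order: if a bigram is a prefix of a trigram and gets replaced first, the trigram no longer matches. A possible variation, sketched below under the assumption that longer n-grams should win, sorts the keys by length before replacing and returns a new list instead of mutating the input. The name `add_ngrams_longest_first` and the sample data are made up for illustration.

```python
def add_ngrams_longest_first(processed_content, mapping):
    """Return a new list with n-grams replaced, trying the longest keys first."""
    # Longest keys first, so "new york city" is handled before "new york".
    ordered = sorted(mapping.items(), key=lambda kv: len(kv[0]), reverse=True)
    result = []
    for text in processed_content:
        for key, value in ordered:
            text = text.replace(key, value)
        result.append(text)
    return result

# Hypothetical sample data, not taken from the original corpus.
example_mapping = {"new york": "new_york", "new york city": "new_york_city"}
example_docs = ["flights to new york city", "new york weather"]
replaced = add_ngrams_longest_first(example_docs, example_mapping)
print(replaced)
```

With the unsorted order above, the first document would come out as `flights to new_york city`, because the bigram replacement breaks the trigram match.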
def mapping(n_grams_to_use, Combined_nGrams):
    """
    Map each original n_Gram to its combined form and return the dictionary.
    """
    return dict(zip(n_grams_to_use, Combined_nGrams))

Mapping = mapping(n_grams_to_use, Combined_nGrams)
Mapping
# Combine each n_Gram using '_'
def combined_n_Grams(n_grams_to_use):
    """
    Take the list of n_Grams and return a list in which the words of each
    n_Gram are joined with '_'.
    """
    return [n_gram.replace(' ', '_') for n_gram in n_grams_to_use]

Combined_nGrams = combined_n_Grams(n_grams_to_use)
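To make the data flow concrete, here is a small worked example of the combine and map steps on made-up input. The sample n-grams and the variable names `sample_n_grams`, `sample_combined`, and `sample_mapping` are assumptions for illustration only.

```python
# Hypothetical n-grams standing in for the contents of n_grams_to_use.
sample_n_grams = ["machine learning", "natural language processing"]

# Step 1: join the words of each n-gram with '_'.
sample_combined = [n_gram.replace(" ", "_") for n_gram in sample_n_grams]

# Step 2: map each original n-gram to its combined form.
sample_mapping = dict(zip(sample_n_grams, sample_combined))

print(sample_combined)
print(sample_mapping)
```

Joining the words with an underscore means a later tokenizer that splits on whitespace treats each n-gram as a single token.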
def read_nGrams():
    """
    Read the bigrams and trigrams and return the list of n_Grams.
    """
    # read bigrams
    original_bigram = readFile("bigram.txt")
    # read trigrams
    original_trigram = readFile("trigram.txt")
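The `readFile` helper is not defined in this excerpt, and the rest of `read_nGrams` is cut off. Purely as an illustration, a helper of this kind might look like the sketch below, which assumes the files are UTF-8 text with one n-gram per line. The name `read_ngram_file` is hypothetical and is not the original helper.

```python
from pathlib import Path

def read_ngram_file(path):
    """Return the non-empty, stripped lines of a UTF-8 text file.

    Assumption: the file holds one n-gram per line.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Under that assumption, the bigram and trigram lists could be read with two calls and concatenated into a single list of n-grams.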