vocabulary = len(tokenizer.word_index) + 1
print('Vocabulary Size=>', vocabulary)
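The `+ 1` matters: `Tokenizer` assigns word indices starting at 1, reserving index 0 for padding, so an `Embedding` layer's `input_dim` must be one larger than the number of distinct words. A tiny illustration with a made-up `word_index`:

```python
# Hypothetical word_index, shaped like Tokenizer.word_index (indices start at 1)
word_index = {'the': 1, 'cat': 2, 'sat': 3}

# Index 0 is reserved for padding, so the usable vocabulary size is len + 1
vocabulary = len(word_index) + 1
print(vocabulary)  # -> 4
```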
from keras.preprocessing.sequence import pad_sequences

# Padding with zero
train_seq = pad_sequences(train_seq, maxlen=100, padding='post')
test_seq = pad_sequences(test_seq, maxlen=100, padding='post')
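One subtlety worth knowing: `padding='post'` appends zeros on the right, but Keras still *truncates* overlong sequences from the front by default (`truncating='pre'`), keeping the last `maxlen` tokens. A minimal pure-Python sketch of that behavior:

```python
def pad_post(seqs, maxlen, value=0):
    """Mimic pad_sequences(..., padding='post') with the default
    truncating='pre': keep the last maxlen tokens, pad zeros on the right."""
    out = []
    for s in seqs:
        s = list(s)[-maxlen:]                      # truncate from the front
        out.append(s + [value] * (maxlen - len(s)))  # pad on the right
    return out

print(pad_post([[1, 2], [3, 4, 5, 6]], maxlen=3))  # -> [[1, 2, 0], [4, 5, 6]]
```

Pass `truncating='post'` explicitly if the start of each comment should be kept instead.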
import pandas as pd
import matplotlib.pyplot as plt

# Populate the list with the length of each comment
comment_word_count = []
for i in df_train['cleaned']:
    comment_word_count.append(len(i.split()))

# Create a dataframe with the comment lengths
length_df = pd.DataFrame({'Comment Length': comment_word_count})
# Converting word sequences to integer sequences
train_seq = tokenizer.texts_to_sequences(df_train['cleaned'])
test_seq = tokenizer.texts_to_sequences(df_test['cleaned'])
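Because the tokenizer was fit only on the training set and no `oov_token` was given, test-set words missing from `word_index` are silently dropped by `texts_to_sequences`. A small sketch of that lookup (toy `word_index`, hypothetical values):

```python
# Toy index standing in for tokenizer.word_index
word_index = {'the': 1, 'cat': 2, 'sat': 3}

def to_seq(text):
    # Unknown words (e.g. 'dog') are skipped, not mapped to a placeholder
    return [word_index[w] for w in text.split() if w in word_index]

print(to_seq('the dog sat'))  # -> [1, 3]
```

Instantiating the tokenizer as `Tokenizer(oov_token='<OOV>')` would map unseen words to a dedicated index instead.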
print('Vocabulary Size=>', len(tokenizer.word_index))
from keras.preprocessing.text import Tokenizer

# Instantiating Tokenizer
tokenizer = Tokenizer()
# Creating index for words
tokenizer.fit_on_texts(df_train['cleaned'])
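`fit_on_texts` counts word frequencies across the corpus and assigns indices in descending frequency order, starting at 1. A rough pure-Python sketch of that indexing (whitespace splitting only; the real `Tokenizer` also strips punctuation and applies its `filters`):

```python
from collections import Counter

def build_word_index(texts):
    # Count words, rank by frequency, assign indices from 1 (0 is reserved)
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

idx = build_word_index(['the cat sat', 'the dog'])
# 'the' occurs most often, so it gets index 1
```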
# For working with regular expressions
import re

# Function for cleaning text
def cleaner(text):
    # Lowercasing text
    text = text.lower()
    # Keeping only words; runs of other characters become a single space
    text = re.sub("[^a-z]+", " ", text)
    # Removing extra spaces at the edges
    text = text.strip()
    return text
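As a quick sanity check, the cleaner reduces a raw comment to lowercase letters separated by single spaces (repeated here as a self-contained snippet with a made-up sample):

```python
import re

def cleaner(text):
    # Same logic as above: lowercase, keep only letters, trim the edges
    text = text.lower()
    text = re.sub("[^a-z]+", " ", text)
    return text.strip()

print(cleaner("WOW!!  Visit http://spam.example NOW, 100% free..."))
# -> 'wow visit http spam example now free'
```

Note that digits are removed entirely by this regex, which may or may not be desirable depending on the task.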
# Printing sample comments
for i, v in enumerate(df_train['comment_text'].sample(5).values):
    print('Comment ', i + 1, '=>', repr(v))
# Printing class distribution in percentage
for i in ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']:
    print(df_train[i].value_counts(normalize=True) * 100)
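`value_counts(normalize=True) * 100` reports each label value's share of the column as a percentage, which is how the class imbalance shows up here. The equivalent computation on a toy 0/1 column (made-up values, standing in for one label such as `toxic`):

```python
from collections import Counter

# Toy binary label column: three negatives, one positive
labels = [0, 0, 0, 1]

# Share of each class in percent, like value_counts(normalize=True) * 100
pct = {k: 100 * v / len(labels) for k, v in Counter(labels).items()}
print(pct)  # -> {0: 75.0, 1: 25.0}
```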
# Loading Test set
df_test = pd.read_csv('./test.csv')
print('Shape=>', df_test.shape)
df_test.head()