You can find the notebook here
The goal of this introduction is to show how to use Keras to preprocess text at the character level. The rest of the article is organized as follows.
- Load data
- Preprocess
- Tokenizer
- Change vocabulary
- Character to index
- Padding
- Get Labels
First, we use pandas to load the training data, combining column 1 and column 2 into one text.
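A minimal sketch of this step, assuming a headerless CSV in the AG News style (column 0 = label, column 1 = title, column 2 = description); the sample rows below are made up for illustration:

```python
import io
import pandas as pd

# Tiny in-memory sample standing in for the real CSV file; the dataset
# is assumed to be headerless with columns (label, title, description).
csv_data = io.StringIO(
    '3,"Wall St. Bears Claw Back","Short-sellers are seeing green again."\n'
    '4,"New iPod Released","Apple unveils a new music player."\n'
)
train_df = pd.read_csv(csv_data, header=None)

# Combine the title (column 1) and description (column 2) into one text.
train_df[1] = train_df[1] + ' ' + train_df[2]
```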
## Preprocess
We save column 1 to `texts` and convert every sentence to lower case.
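Continuing the sketch (with a tiny made-up frame so it runs standalone), saving the text column and lower-casing it might look like:

```python
import pandas as pd

# Tiny stand-in for the training frame; column 1 holds the combined text.
train_df = pd.DataFrame({1: ['Wall St. Bears Claw Back', 'New iPod Released']})

# Save column 1 to `texts` and convert every sentence to lower case.
texts = train_df[1].values
texts = [s.lower() for s in texts]
```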
When initializing the `Tokenizer`, only two parameters matter.

- `char_level=True`: this tells `tk.texts_to_sequences()` to process sentences at the character level.
- `oov_token='UNK'`: this adds a `UNK` token to the vocabulary. We can access it via `tk.oov_token`.
After calling `tk.fit_on_texts(texts)`, the `tk` object contains the necessary information about the training data. We can inspect `tk.word_index` to see the dictionary. Here the index of `UNK` is `word_count + 1`.
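A runnable sketch of these two parameters (the sample sentences are placeholders):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ['hello world', 'keras preprocessing']

# char_level=True: tokenize per character; oov_token='UNK': unseen
# characters map to a dedicated UNK entry in the vocabulary.
tk = Tokenizer(char_level=True, oov_token='UNK')
tk.fit_on_texts(texts)

print(tk.word_index)  # character -> index dictionary learned from the data
print(tk.oov_token)   # 'UNK'
```

Note that exactly where `UNK` lands in `tk.word_index` can differ between Keras versions, which is one more reason to override the vocabulary explicitly, as the next section does.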
This is the character dictionary learned from the training data. But if we already have a character list, we have to replace `tk.word_index`.
Suppose we already have a character list called `alphabet`. We build a `char_dict` based on `alphabet`, then assign a new index to `UNK`.
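One way to sketch this override; the exact `alphabet` below follows the common character-CNN convention and is an assumption, not necessarily the article's list:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Assumed fixed character list used instead of the learned vocabulary.
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"

# Map each character to a 1-based index.
char_dict = {}
for i, char in enumerate(alphabet):
    char_dict[char] = i + 1

tk = Tokenizer(char_level=True, oov_token='UNK')
tk.fit_on_texts(['hello world'])

# Overwrite the learned vocabulary and give UNK the next free index.
tk.word_index = char_dict.copy()
tk.word_index[tk.oov_token] = max(char_dict.values()) + 1
```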
After we get the right vocabulary, we can represent all texts by character indices.
This step is very simple: `tk.texts_to_sequences()` does the conversion automatically for us.
We can see that the string representation is replaced by an index representation. We also list the top 5 sentence lengths.
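A small runnable example of the conversion (the sentences are placeholders):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ['hello', 'hi there']
tk = Tokenizer(char_level=True, oov_token='UNK')
tk.fit_on_texts(texts)

# Each string becomes a list of character indices (spaces included).
sequences = tk.texts_to_sequences(texts)
print(sequences)

# The top 5 sentence lengths, longest first.
print(sorted((len(s) for s in sequences), reverse=True)[:5])
```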
Because texts have different lengths, we must make them all the same length so the CNN can process the data in batches.
Here we set the maximum sentence length to 1014. If a text is shorter than 1014 characters, the rest is padded with 0; if it is longer, the part beyond 1014 is truncated. All texts then have the same length.
Finally, we convert the list to a NumPy array.
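The padding and truncation described above can be sketched with `pad_sequences` (the short sequences here are placeholders):

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[1, 2, 3], [4, 5]]

# Pad or truncate every sequence to length 1014; padding='post' and
# truncating='post' keep the original text at the start.
data = pad_sequences(sequences, maxlen=1014, padding='post', truncating='post')

# pad_sequences already returns a NumPy array; cast the dtype if needed.
data = np.array(data, dtype='float32')
print(data.shape)
```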
First, we assign column 0 of `train_df`
to a `class_list`
; this 1-dimensional list contains the label for each text. Since our task is multiclass classification, we have to convert it to a 2-dimensional array. Here we can use the `to_categorical`
method in Keras.
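A sketch of the label conversion, assuming 1-based integer labels in column 0 (as in AG News); the sample labels are made up:

```python
from tensorflow.keras.utils import to_categorical

# Stand-in for train_df[0]: one integer label per text, assumed 1-based.
class_list = [3, 4, 1, 2]

# Shift to 0-based indices, then one-hot encode into a 2-D array.
class_list = [x - 1 for x in class_list]
labels = to_categorical(class_list)
print(labels.shape)
```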
As for the test dataset, we simply repeat the same steps.