@BrambleXu
Created July 5, 2018 07:55

You can find the notebook here

The goal of this introduction is to show you how to use Keras to preprocess text at the character level. The rest of the article is organized as follows.

  • Load data
  • Preprocess
    • Tokenizer
    • Change vocabulary
    • Character to index
    • Padding
    • Get Labels

Load data

First, we use pandas to load the training data.

We combine column 1 and column 2 into one text.
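The two steps above can be sketched as follows. Toy data stands in for the real CSV here; with the actual file you would load it with something like pd.read_csv('train.csv', header=None) (the file name and the header-less three-column layout are assumptions).

```python
import pandas as pd

# Toy stand-in for the training CSV: column 0 is the class label,
# column 1 the title, column 2 the description.
train_df = pd.DataFrame({
    0: [3, 1],
    1: ['wall st. bears', 'oil prices up'],
    2: ['claw back into the black', 'as supply worries grow'],
})

# Combine column 1 and column 2 into one text per row.
texts = (train_df[1] + ' ' + train_df[2]).tolist()
```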

Preprocess

Tokenizer

We save column 1 to texts and convert all sentences to lower case.

When initializing the Tokenizer, only two parameters are important.

  • char_level=True: this tells tk.texts_to_sequences() to process sentences at the character level.
  • oov_token='UNK': this adds a UNK token to the vocabulary. We can access it via tk.oov_token.

After calling tk.fit_on_texts(texts), the tk object will contain the necessary information about the training data. We can inspect tk.word_index to see the dictionary. Here the index of UNK is word_count+1.
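A minimal sketch of initializing and fitting the tokenizer, using two toy sentences in place of the real texts. Note that the index assigned to the OOV token depends on the Keras version: older releases append it at word_count+1 as described above, while newer releases assign it index 1.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# The two important parameters: character-level mode and an OOV token.
tk = Tokenizer(char_level=True, oov_token='UNK')

texts = ['all work and no play', 'makes jack a dull boy']
tk.fit_on_texts(texts)

# tk.word_index now maps each character (and 'UNK') to an integer index.
vocab = tk.word_index
```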

This is the character dictionary learned from the training data. But if we already have a character list, we have to change tk.word_index.

Change vocabulary

Say we already have a character list called alphabet; we build a char_dict based on alphabet.


We will assign a new index to UNK.
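The vocabulary swap can be sketched like this. The particular alphabet (lowercase letters, digits, and punctuation) is an assumption; use your own character list in practice. Indices start at 1 because 0 is reserved for padding, and UNK gets the next free index after the alphabet.

```python
import string
from tensorflow.keras.preprocessing.text import Tokenizer

# Assumed character list; replace with your own alphabet.
alphabet = string.ascii_lowercase + string.digits + string.punctuation

# Build char_dict with indices starting at 1 (0 is reserved for padding).
char_dict = {char: idx + 1 for idx, char in enumerate(alphabet)}

tk = Tokenizer(char_level=True, oov_token='UNK')
tk.fit_on_texts(['some training text'])

# Overwrite the learned vocabulary and give UNK the next free index.
tk.word_index = char_dict.copy()
tk.word_index[tk.oov_token] = len(char_dict) + 1
```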

Character to index

After we get the right vocabulary, we can represent all texts by using character index.

This step is very simple, tk.texts_to_sequences() will do this conversion automatically for us.

We can see the string representation is replaced by the index representation. We list the top 5 sentence lengths.
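The conversion step can be sketched as follows, again with toy sentences standing in for the dataset. In char_level mode, every character (including spaces) becomes one index in the output sequence.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tk = Tokenizer(char_level=True, oov_token='UNK')
texts = ['hello world', 'hi']
tk.fit_on_texts(texts)

# Each sentence becomes a list of character indices.
sequences = tk.texts_to_sequences(texts)

# Inspect the longest sentence lengths (top 5 here).
lengths = sorted((len(s) for s in sequences), reverse=True)[:5]
```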

Padding

Because texts have different lengths, we have to make all texts the same length so the CNN can handle batched data.

Here we set the maximum sentence length to 1014. If a text is shorter than 1014, the rest is padded with 0; if it is longer, the part beyond 1014 is truncated. So all texts end up with the same length.

Finally, we convert the list to a NumPy array.
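Padding can be done with Keras's pad_sequences, sketched below on toy sequences. padding='post' and truncating='post' append zeros (and cut) at the end of each text, matching the description above; the defaults are 'pre', so these options are worth setting explicitly.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Toy sequences of character indices with different lengths.
sequences = [[1, 2, 3], [4, 5]]

# Pad/truncate every sequence to length 1014; zeros go after the text.
data = pad_sequences(sequences, maxlen=1014, padding='post', truncating='post')

# pad_sequences already returns a NumPy array of shape (n_texts, 1014).
train_data = np.array(data, dtype='float32')
```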

Get Labels

First, we assign column 0 of train_df to class_list; this 1-dimensional list contains the label for each text. But our task is a multiclass task, so we have to convert it to a 2-dimensional array. Here we can use the to_categorical method in Keras.
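A sketch of the label conversion. The labels 1 through 4 are an assumption (an AG News style dataset); since to_categorical expects 0-based class indices, we shift them down by one first.

```python
from tensorflow.keras.utils import to_categorical

# Assumed 1-based class labels from column 0 of the training data.
class_list = [3, 1, 4, 2]

# Shift to 0-based indices before one-hot encoding.
class_list = [c - 1 for c in class_list]

# 2-dimensional one-hot label array of shape (n_texts, n_classes).
labels = to_categorical(class_list, num_classes=4)
```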

As for the test dataset, we just need to do the same process again.
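The key point when processing the test set, sketched below with toy strings, is that the tokenizer fitted on the training data is reused as-is; it is never refitted, so unseen test characters fall back to the UNK index.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Fit on training texts only.
tk = Tokenizer(char_level=True, oov_token='UNK')
tk.fit_on_texts(['training text'])

# Reuse the same tokenizer for the test texts.
test_texts = ['unseen test text']
test_sequences = tk.texts_to_sequences(test_texts)
test_data = pad_sequences(test_sequences, maxlen=1014, padding='post')
```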
