@BrambleXu
Created July 5, 2018 07:55

You can find the notebook here

The goal of this introduction is to show you how to use Keras to preprocess text at the character level. The rest of the article is organized as follows.

  • Load data
  • Preprocess
    • Tokenizer
    • Change vocabulary
    • Character to index
    • Padding
    • Get Labels

Load data

First, we use pandas to load the training data.

We combine column 1 and column 2 into one text.
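The two steps above can be sketched as follows. Toy data stands in for the real CSV here; with the actual file you would load it with something like pd.read_csv('train.csv', header=None) (the file name and the header-less three-column layout are assumptions).

```python
import pandas as pd

# Toy stand-in for the training CSV: column 0 is the class label,
# column 1 the title, column 2 the description.
train_df = pd.DataFrame({
    0: [3, 1],
    1: ['wall st. bears', 'oil prices up'],
    2: ['claw back into the black', 'as supply worries grow'],
})

# Combine column 1 and column 2 into one text per row.
texts = (train_df[1] + ' ' + train_df[2]).tolist()
```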

Preprocess

Tokenizer

We save column 1 to texts and convert all sentences to lower case.

When initializing the Tokenizer, only two parameters are important.

  • char_level=True: this tells tk.texts_to_sequences() to process sentences at the character level.
  • oov_token='UNK': this adds a UNK token to the vocabulary. We can access it via tk.oov_token.

After calling tk.fit_on_texts(texts), the tk object will contain the necessary information about the training data. We can inspect tk.word_index to see the dictionary. Here the index of UNK is word_count+1.
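A minimal sketch of initializing and fitting the tokenizer, using two toy sentences in place of the real texts. Note that the index assigned to the OOV token depends on the Keras version: older releases append it at word_count+1 as described above, while newer releases assign it index 1.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# The two important parameters: character-level mode and an OOV token.
tk = Tokenizer(char_level=True, oov_token='UNK')

texts = ['all work and no play', 'makes jack a dull boy']
tk.fit_on_texts(texts)

# tk.word_index now maps each character (and 'UNK') to an integer index.
vocab = tk.word_index
```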

This is the character dictionary learned from the training data. But if we already have a character list, we have to change tk.word_index.

Change vocabulary

Say we already have a character list called alphabet; we build a char_dict based on alphabet.


We will assign a new index to UNK.
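The vocabulary swap can be sketched like this. The particular alphabet (lowercase letters, digits, and punctuation) is an assumption; use your own character list in practice. Indices start at 1 because 0 is reserved for padding, and UNK gets the next free index after the alphabet.

```python
import string
from tensorflow.keras.preprocessing.text import Tokenizer

# Assumed character list; replace with your own alphabet.
alphabet = string.ascii_lowercase + string.digits + string.punctuation

# Build char_dict with indices starting at 1 (0 is reserved for padding).
char_dict = {char: idx + 1 for idx, char in enumerate(alphabet)}

tk = Tokenizer(char_level=True, oov_token='UNK')
tk.fit_on_texts(['some training text'])

# Overwrite the learned vocabulary and give UNK the next free index.
tk.word_index = char_dict.copy()
tk.word_index[tk.oov_token] = len(char_dict) + 1
```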

Character to index

After we get the right vocabulary, we can represent all texts by using character index.

This step is very simple, tk.texts_to_sequences() will do this conversion automatically for us.

We can see the string representation is replaced by the index representation. We list the top 5 sentence lengths.
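The conversion step can be sketched as follows, again with toy sentences standing in for the dataset. In char_level mode, every character (including spaces) becomes one index in the output sequence.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tk = Tokenizer(char_level=True, oov_token='UNK')
texts = ['hello world', 'hi']
tk.fit_on_texts(texts)

# Each sentence becomes a list of character indices.
sequences = tk.texts_to_sequences(texts)

# Inspect the longest sentence lengths (top 5 here).
lengths = sorted((len(s) for s in sequences), reverse=True)[:5]
```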

Padding

Because texts have different lengths, we have to make all texts the same length so the CNN can handle batched data.

Here we set the maximum sentence length to 1014. If a text is shorter than 1014, the rest is padded with 0; if it is longer, the part beyond 1014 is truncated. So all texts end up with the same length.

Finally, we convert the list to a NumPy array.
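Padding can be done with Keras's pad_sequences, sketched below on toy sequences. padding='post' and truncating='post' append zeros (and cut) at the end of each text, matching the description above; the defaults are 'pre', so these options are worth setting explicitly.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Toy sequences of character indices with different lengths.
sequences = [[1, 2, 3], [4, 5]]

# Pad/truncate every sequence to length 1014; zeros go after the text.
data = pad_sequences(sequences, maxlen=1014, padding='post', truncating='post')

# pad_sequences already returns a NumPy array of shape (n_texts, 1014).
train_data = np.array(data, dtype='float32')
```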

Get Labels

First, we assign column 0 of train_df to class_list; this 1-dimensional list contains the label for each text. But our task is a multiclass task, so we have to convert it to a 2-dimensional array. Here we can use the to_categorical method in Keras.
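A sketch of the label conversion. The labels 1 through 4 are an assumption (an AG News style dataset); since to_categorical expects 0-based class indices, we shift them down by one first.

```python
from tensorflow.keras.utils import to_categorical

# Assumed 1-based class labels from column 0 of the training data.
class_list = [3, 1, 4, 2]

# Shift to 0-based indices before one-hot encoding.
class_list = [c - 1 for c in class_list]

# 2-dimensional one-hot label array of shape (n_texts, n_classes).
labels = to_categorical(class_list, num_classes=4)
```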

As for the test dataset, we just need to do the same process again.
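The key point when processing the test set, sketched below with toy strings, is that the tokenizer fitted on the training data is reused as-is; it is never refitted, so unseen test characters fall back to the UNK index.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Fit on training texts only.
tk = Tokenizer(char_level=True, oov_token='UNK')
tk.fit_on_texts(['training text'])

# Reuse the same tokenizer for the test texts.
test_texts = ['unseen test text']
test_sequences = tk.texts_to_sequences(test_texts)
test_data = pad_sequences(test_sequences, maxlen=1014, padding='post')
```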
