{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/johnleung8888/cd1d66123e68fef87a6bc63186f88f22/c3_w2_lab_3_imdb_subwords.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "X0gCExSWaAqE"
},
"source": [
"<a href=\"https://colab.research.google.com/github/https-deeplearning-ai/tensorflow-1-public/blob/master/C3/W2/ungraded_labs/C3_W2_Lab_3_imdb_subwords.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cLKIel77CJPi"
},
"source": [
"## Ungraded Lab: Subword Tokenization with the IMDB Reviews Dataset\n",
"\n",
"In this lab, you will look at a pre-tokenized dataset that is using subword text encoding. This is an alternative to word-based tokenization which you have been using in the previous labs. You will see how it works and its implications on preparing your data and training your model.\n",
"\n",
"Let's begin!\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qrzOn9quZ0Sv"
},
"source": [
"## Download the IMDB reviews plain text and tokenized datasets\n",
"\n",
"First, you will download the [IMDB Reviews](https://www.tensorflow.org/datasets/catalog/imdb_reviews) dataset from Tensorflow Datasets. You will get two configurations:\n",
"\n",
"* `plain_text` - this is the default and the one you used in Lab 1 of this week\n",
"* `subwords8k` - a pre-tokenized dataset (i.e. instead of sentences of type string, it will already give you the tokenized sequences). You will see how this looks in later sections."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_IoM4VFxWpMR"
},
"outputs": [],
"source": [
"import tensorflow_datasets as tfds\n",
"\n",
"# Download the plain text default config\n",
"imdb_plaintext, info_plaintext = tfds.load(\"imdb_reviews\", with_info=True, as_supervised=True)\n",
"\n",
"# Download the subword encoded pretokenized dataset\n",
"imdb_subwords, info_subwords = tfds.load(\"imdb_reviews/subwords8k\", with_info=True, as_supervised=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JggMZRCEcdlN"
},
"source": [
"## Compare the two datasets\n",
"\n",
"As mentioned, the data types returned by the two datasets will be different. For the default, it will be strings as you also saw in Lab 1. Notice the description of the `text` key below and the sample sentences:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3J7IAJMGH-VN"
},
"outputs": [],
"source": [
"# Print description of features\n",
"info_plaintext.features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jTO45ghxc4js"
},
"outputs": [],
"source": [
"# Take 2 training examples and print the text feature\n",
"for example in imdb_plaintext['train'].take(2):\n",
" print(example[0].numpy())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "f87JvGD9dId5"
},
"source": [
"For `subwords8k`, the dataset is already tokenized so the data type will be integers. Notice that the `text` features also include an `encoder` field and has a `vocab_size` of around 8k, hence the name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3wp_a7292mxk"
},
"outputs": [],
"source": [
"# Print description of features\n",
"info_subwords.features"
]
},
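{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check, the optional cell below prints the encoder's vocabulary size directly. It uses the same `encoder` attribute of the `text` feature that you will work with more in a later section."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional check: print the subword encoder's vocabulary size (around 8k)\n",
"print(info_subwords.features['text'].encoder.vocab_size)"
]
},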
{
"cell_type": "markdown",
"metadata": {
"id": "9ssDU_TddyLF"
},
"source": [
"If you print the results, you will not see string sentences but a sequence of tokens:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "35oQQIUG21cG"
},
"outputs": [],
"source": [
"# Take 2 training examples and print its contents\n",
"for example in imdb_subwords['train'].take(2):\n",
" print(example)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rWOrkYGug--B"
},
"source": [
"You can get the `encoder` object included in the download and use it to decode the sequences above. You'll see that you will arrive at the same sentences provided in the `plain_text` config:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "4kNEGgEgfO6x"
},
"outputs": [],
"source": [
"# Get the encoder\n",
"tokenizer_subwords = info_subwords.features['text'].encoder\n",
"\n",
"# Take 2 training examples and decode the text feature\n",
"for example in imdb_subwords['train'].take(2):\n",
" print(tokenizer_subwords.decode(example[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "20_XNWbXiwcE"
},
"source": [
"*Note: The documentation for the encoder can be found [here](https://www.tensorflow.org/datasets/api_docs/python/tfds/deprecated/text/SubwordTextEncoder) but don't worry if it's marked as deprecated. As mentioned, the objective of this exercise is just to show the characteristics of subword encoding.*"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YKrbY2fjjFHM"
},
"source": [
"## Subword Text Encoding\n",
"\n",
"From previous labs, the number of tokens in the sequence is the same as the number of words in the text (i.e. word tokenization). The following cells shows a review of this process."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "O6ly_yOIkM-K"
},
"outputs": [],
"source": [
"# Get the train set\n",
"train_data = imdb_plaintext['train']\n",
"\n",
"# Initialize sentences list\n",
"training_sentences = []\n",
"\n",
"# Loop over all training examples and save to the list\n",
"for s,_ in train_data:\n",
" training_sentences.append(s.numpy().decode('utf8'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-N6Yd_TE3gZ5"
},
"outputs": [],
"source": [
"from tensorflow.keras.preprocessing.text import Tokenizer\n",
"from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
"\n",
"vocab_size = 10000\n",
"oov_tok = '<OOV>'\n",
"\n",
"# Initialize the Tokenizer class\n",
"tokenizer_plaintext = Tokenizer(num_words = 10000, oov_token=oov_tok)\n",
"\n",
"# Generate the word index dictionary for the training sentences\n",
"tokenizer_plaintext.fit_on_texts(training_sentences)\n",
"\n",
"# Generate the training sequences\n",
"sequences = tokenizer_plaintext.texts_to_sequences(training_sentences)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nNUlDp76lf94"
},
"source": [
"The cell above uses a `vocab_size` of 10000 but you'll find that it's easy to find OOV tokens when decoding using the lookup dictionary it created. See the result below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "YmsECyVr4OPE"
},
"outputs": [],
"source": [
"# Decode the first sequence using the Tokenizer class\n",
"tokenizer_plaintext.sequences_to_texts(sequences[0:1])"
]
},
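{
"cell_type": "markdown",
"metadata": {},
"source": [
"To quantify this, the optional sketch below counts how many tokens in the first sequence map to the OOV token. It only relies on the `word_index` dictionary and the `sequences` list created above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: count how many tokens in the first sequence map to the OOV token\n",
"oov_index = tokenizer_plaintext.word_index[oov_tok]\n",
"num_oov = sequences[0].count(oov_index)\n",
"print('{} out of {} tokens are OOV'.format(num_oov, len(sequences[0])))"
]
},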
{
"cell_type": "markdown",
"metadata": {
"id": "O0HQqkBmpujb"
},
"source": [
"For binary classifiers, this might not have a big impact but you may have other applications that will benefit from avoiding OOV tokens when training the model (e.g. text generation). If you want the tokenizer above to not have OOVs, then the `vocab_size` will increase to more than 88k. This can slow down training and bloat the model size. The encoder also won't be robust when used on other datasets which may contain new words, thus resulting in OOVs again. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "u7m-Ds9lpUQc"
},
"outputs": [],
"source": [
"# Total number of words in the word index dictionary\n",
"len(tokenizer_plaintext.word_index)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "McxNKhHIsNvl"
},
"source": [
"*Subword text encoding* gets around this problem by using parts of the word to compose whole words. This makes it more flexible when it encounters uncommon words. See how these subwords look like for this particular encoder:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "SqyMSZbnwFBo"
},
"outputs": [],
"source": [
"# Print the subwords\n",
"print(tokenizer_subwords.subwords)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kaRA9LBUwfHM"
},
"source": [
"If you use it on the previous plain text sentence, you'll see that it won't have any OOVs even if it has a smaller vocab size (only 8k compared to 10k above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tn_eLaS5mR7H"
},
"outputs": [],
"source": [
"# Encode the first plaintext sentence using the subword text encoder\n",
"tokenized_string = tokenizer_subwords.encode(training_sentences[0])\n",
"print(tokenized_string)\n",
"\n",
"# Decode the sequence\n",
"original_string = tokenizer_subwords.decode(tokenized_string)\n",
"\n",
"# Print the result\n",
"print (original_string)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iL9O3hEqw4Bl"
},
"source": [
"Subword encoding can even perform well on words that are not commonly found on movie reviews. See first the result when using the plain text tokenizer. As expected, it will show many OOVs:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MHRj1J0j8ApE"
},
"outputs": [],
"source": [
"# Define sample sentence\n",
"sample_string = 'TensorFlow, from basics to mastery'\n",
"\n",
"# Encode using the plain text tokenizer\n",
"tokenized_string = tokenizer_plaintext.texts_to_sequences([sample_string])\n",
"print ('Tokenized string is {}'.format(tokenized_string))\n",
"\n",
"# Decode and print the result\n",
"original_string = tokenizer_plaintext.sequences_to_texts(tokenized_string)\n",
"print ('The original string: {}'.format(original_string))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZhQ-4O-uxdbJ"
},
"source": [
"Then compare to the subword text encoder:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "fPl2BXhYEHRP"
},
"outputs": [],
"source": [
"# Encode using the subword text encoder\n",
"tokenized_string = tokenizer_subwords.encode(sample_string)\n",
"print ('Tokenized string is {}'.format(tokenized_string))\n",
"\n",
"# Decode and print the results\n",
"original_string = tokenizer_subwords.decode(tokenized_string)\n",
"print ('The original string: {}'.format(original_string))\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "89sbfXjz0MSW"
},
"source": [
"As you may notice, the sentence is correctly decoded. The downside is the token sequence is much longer. Instead of only 5 when using word-encoding, you ended up with 11 tokens instead. The mapping for this sentence is shown below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_3t7vvNLEZml"
},
"outputs": [],
"source": [
"# Show token to subword mapping:\n",
"for ts in tokenized_string:\n",
" print ('{} ----> {}'.format(ts, tokenizer_subwords.decode([ts])))"
]
},
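{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the length trade-off concrete, the optional sketch below re-encodes the sample sentence with both tokenizers and compares the number of tokens produced."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: compare sequence lengths for the sample sentence\n",
"word_tokens = tokenizer_plaintext.texts_to_sequences([sample_string])[0]\n",
"subword_tokens = tokenizer_subwords.encode(sample_string)\n",
"print('word tokens: {}, subword tokens: {}'.format(len(word_tokens), len(subword_tokens)))"
]
},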
{
"cell_type": "markdown",
"metadata": {
"id": "aZ22ugch1TFy"
},
"source": [
"## Training the model\n",
"\n",
"You will now train your model using this pre-tokenized dataset. Since these are already saved as sequences, you can jump straight to making uniform sized arrays for the train and test sets. These are also saved as `tf.data.Dataset` type so you can use the [`padded_batch()`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#padded_batch) method to create batches and pad the arrays into a uniform size for training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LVSTLBe_SOUr"
},
"outputs": [],
"source": [
"BUFFER_SIZE = 10000\n",
"BATCH_SIZE = 64\n",
"\n",
"# Get the train and test splits\n",
"train_data, test_data = imdb_subwords['train'], imdb_subwords['test'], \n",
"\n",
"# Shuffle the training data\n",
"train_dataset = train_data.shuffle(BUFFER_SIZE)\n",
"\n",
"# Batch and pad the datasets to the maximum length of the sequences\n",
"train_dataset = train_dataset.padded_batch(BATCH_SIZE)\n",
"test_dataset = test_data.padded_batch(BATCH_SIZE)"
]
},
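{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, the cell below takes one batch from the training set and prints its shape. Within a batch, all sequences are padded to the length of the longest one in that batch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect the shape of one padded batch and its labels\n",
"for batch_sequences, batch_labels in train_dataset.take(1):\n",
" print(batch_sequences.shape, batch_labels.shape)"
]
},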
{
"cell_type": "markdown",
"metadata": {
"id": "HCjHCG7s2sAR"
},
"source": [
"Next, you will build the model. You can just use the architecture from the previous lab. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5NEpdhb8AxID"
},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"\n",
"# Define dimensionality of the embedding\n",
"embedding_dim = 64\n",
"\n",
"# Build the model\n",
"model = tf.keras.Sequential([\n",
" tf.keras.layers.Embedding(tokenizer_subwords.vocab_size, embedding_dim),\n",
" tf.keras.layers.GlobalAveragePooling1D(),\n",
" tf.keras.layers.Dense(6, activation='relu'),\n",
" tf.keras.layers.Dense(1, activation='sigmoid')\n",
"])\n",
"\n",
"# Print the model summary\n",
"model.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2aOn2bAc3AUj"
},
"source": [
"Similarly, you can use the same parameters for training. In Colab, it will take around 20 seconds per epoch (without an accelerator) and you will reach around 94% training accuracy and 88% validation accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "fkt8c5dNuUlT"
},
"outputs": [],
"source": [
"num_epochs = 10\n",
"\n",
"# Set the training parameters\n",
"model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\n",
"\n",
"# Start training\n",
"history = model.fit(train_dataset, epochs=num_epochs, validation_data=test_dataset)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3ygYaD6H3qGX"
},
"source": [
"## Visualize the results\n",
"\n",
"You can use the cell below to plot the training results. See if you can improve it by tweaking the parameters such as the size of the embedding and number of epochs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-_rMnm7WxQGT"
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# Plot utility\n",
"def plot_graphs(history, string):\n",
" plt.plot(history.history[string])\n",
" plt.plot(history.history['val_'+string])\n",
" plt.xlabel(\"Epochs\")\n",
" plt.ylabel(string)\n",
" plt.legend([string, 'val_'+string])\n",
" plt.show()\n",
"\n",
"# Plot the accuracy and results \n",
"plot_graphs(history, \"accuracy\")\n",
"plot_graphs(history, \"loss\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "R0TRE-Lb4C5b"
},
"source": [
"## Wrap Up\n",
"\n",
"In this lab, you saw how subword text encoding can be a robust technique to avoid out-of-vocabulary tokens. It can decode uncommon words it hasn't seen before even with a relatively small vocab size. Consequently, it results in longer token sequences when compared to full word tokenization. Next week, you will look at other architectures that you can use when building your classifier. These will be recurrent neural networks and convolutional neural networks."
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "C3_W2_Lab_3_imdb_subwords.ipynb",
"provenance": [],
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 0
}