C3_W2_Lab_1_imdb.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/johnleung8888/2338ffb00d4be70d25a7ae9df5bdfb57/c3_w2_lab_1_imdb.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_BaMCnwCKSxE"
},
"source": [
"<a href=\"https://colab.research.google.com/github/https-deeplearning-ai/tensorflow-1-public/blob/master/C3/W2/ungraded_labs/C3_W2_Lab_1_imdb.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dGATMXZE4oAD"
},
"source": [
"# Ungraded Lab: Training a binary classifier with the IMDB Reviews Dataset\n",
"\n",
"In this lab, you will be building a sentiment classification model to distinguish between positive and negative movie reviews. You will train it on the [IMDB Reviews](http://ai.stanford.edu/~amaas/data/sentiment/) dataset and visualize the word embeddings generated after training.\n",
"\n",
"Let's get started!\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sWt8QlOGpy1j"
},
"source": [
"## Download the Dataset\n",
"\n",
"First, you will need to fetch the dataset you will be working on. This is hosted via [Tensorflow Datasets](https://www.tensorflow.org/datasets), a collection of prepared datasets for machine learning. If you're running this notebook on your local machine, make sure to have the [`tensorflow-datasets`](https://pypi.org/project/tensorflow-datasets/) package installed before importing it. You can install it via `pip` as shown in the commented cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vqB3GzBorwBh"
},
"outputs": [],
"source": [
"# Install this package if running on your local machine\n",
"# !pip install -q tensorflow-datasets"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SpLsMxO2wDrn"
},
"source": [
"The [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load) method downloads the dataset into your working directory. You can set the `with_info` parameter to `True` if you want to see the description of the dataset. The `as_supervised` parameter, on the other hand, is set to load the data as `(input, label)` pairs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_IoM4VFxWpMR"
},
"outputs": [],
"source": [
"import tensorflow_datasets as tfds\n",
"\n",
"# Load the IMDB Reviews dataset\n",
"imdb, info = tfds.load(\"imdb_reviews\", with_info=True, as_supervised=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "J3PEarpKw9_j"
},
"outputs": [],
"source": [
"# Print information about the dataset\n",
"print(info)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kLRAoHil5poj"
},
"source": [
"As you can see in the output above, there are 100,000 examples in total, and the dataset is split into `train`, `test`, and `unsupervised` sets. For this lab, you will only use the `train` and `test` sets because you will need labeled examples to train your model."
]
},
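{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, you can read the split sizes programmatically instead of scanning the printed description. The quick sketch below uses the `info.splits` dictionary that TFDS returns alongside the data and simply prints the number of examples in each split."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional check: list each split and its number of examples\n",
"for split_name, split_info in info.splits.items():\n",
"  print(f'{split_name}: {split_info.num_examples} examples')"
]
},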
{
"cell_type": "markdown",
"metadata": {
"id": "5EzNDkdkpvrv"
},
"source": [
"## Split the dataset\n",
"\n",
"If you try printing the `imdb` dataset that you downloaded earlier, you will see that it is a dictionary that points to [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) objects. You will explore this class and its API further in Course 4 of this specialization. For now, you can just think of each one as a collection of examples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tA5397cs-EwN"
},
"outputs": [],
"source": [
"# Print the contents of the dataset you downloaded\n",
"print(imdb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "L4oiQ0waBduJ"
},
"source": [
"You can preview the raw format of a few examples by using the [`take()`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#take) method and iterating over it as shown below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2NgUwTDu7Q1O"
},
"outputs": [],
"source": [
"# Take 2 training examples and print their contents\n",
"for example in imdb['train'].take(2):\n",
"  print(example)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hOtXX2gxB8pe"
},
"source": [
"You can see that each example is a 2-element tuple of tensors containing the text first, then the label (accessible through the `numpy()` method). The next cell will gather all the `train` and `test` sentences and labels into separate lists so you can preprocess the text and feed it to the model later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wHQ2Ko0zl7M4"
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Get the train and test sets\n",
"train_data, test_data = imdb['train'], imdb['test']\n",
"\n",
"# Initialize sentences and labels lists\n",
"training_sentences = []\n",
"training_labels = []\n",
"\n",
"testing_sentences = []\n",
"testing_labels = []\n",
"\n",
"# Loop over all training examples and save the sentences and labels\n",
"for s,l in train_data:\n",
"  training_sentences.append(s.numpy().decode('utf8'))\n",
"  training_labels.append(l.numpy())\n",
"\n",
"# Loop over all test examples and save the sentences and labels\n",
"for s,l in test_data:\n",
"  testing_sentences.append(s.numpy().decode('utf8'))\n",
"  testing_labels.append(l.numpy())\n",
"\n",
"# Convert labels lists to numpy array\n",
"training_labels_final = np.array(training_labels)\n",
"testing_labels_final = np.array(testing_labels)\n"
]
},
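{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, each list should now contain 25,000 entries, matching the split sizes reported by `info` earlier. The optional cell below just prints the list lengths and label array shapes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: the train and test splits have 25,000 examples each\n",
"print(len(training_sentences), training_labels_final.shape)\n",
"print(len(testing_sentences), testing_labels_final.shape)"
]
},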
{
"cell_type": "markdown",
"metadata": {
"id": "ePTIgXj3q8Sg"
},
"source": [
"## Generate Padded Sequences\n",
"\n",
"Now you can do the text preprocessing steps you learned last week. You will tokenize the sentences and pad them to a uniform length. The parameters are separated into their own code cell below so they will be easy to tweak later if you want."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lggoZqYUGYgX"
},
"outputs": [],
"source": [
"# Parameters\n",
"\n",
"vocab_size = 10000\n",
"max_length = 120\n",
"embedding_dim = 16\n",
"trunc_type='post'\n",
"oov_tok = \"<OOV>\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7n15yyMdmoH1"
},
"outputs": [],
"source": [
"from tensorflow.keras.preprocessing.text import Tokenizer\n",
"from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
"\n",
"# Initialize the Tokenizer class\n",
"tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)\n",
"\n",
"# Generate the word index dictionary for the training sentences\n",
"tokenizer.fit_on_texts(training_sentences)\n",
"word_index = tokenizer.word_index\n",
"\n",
"# Generate and pad the training sequences\n",
"sequences = tokenizer.texts_to_sequences(training_sentences)\n",
"padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)\n",
"\n",
"# Generate and pad the test sequences\n",
"testing_sequences = tokenizer.texts_to_sequences(testing_sentences)\n",
"testing_padded = pad_sequences(testing_sequences, maxlen=max_length)\n"
]
},
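{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before building the model, it can help to confirm that the tokenization round-trips sensibly. The optional sketch below decodes the first padded training sequence back into words using the tokenizer's `index_word` mapping (the same mapping you will use again later when exporting the embeddings) and prints it next to the original sentence. Words outside the top `vocab_size` words show up as the OOV token, and padding indices (`0`) are skipped."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: decode the first padded sequence back to words.\n",
"# `index_word` maps an index back to its word; index 0 is reserved for padding.\n",
"decoded = ' '.join(tokenizer.index_word.get(i, '?') for i in padded[0] if i != 0)\n",
"print(decoded)\n",
"print(training_sentences[0])"
]
},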
{
"cell_type": "markdown",
"metadata": {
"id": "N2rCmp7ArGL_"
},
"source": [
"## Build and Compile the Model\n",
"\n",
"With the data already preprocessed, you can proceed to building your sentiment classification model. The input will be an [`Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer. The main idea here is to represent each word in your vocabulary with vectors. These vectors have trainable weights, so as your neural network learns, words that are most likely to appear in a positive review will converge towards similar weights. Similarly, words in negative reviews will be clustered more closely together. You can read more about word embeddings [here](https://www.tensorflow.org/text/guide/word_embeddings).\n",
"\n",
"After the `Embedding` layer, you will flatten its output and feed it into a `Dense` layer. You will explore other architectures for these hidden layers in the next labs.\n",
"\n",
"The output layer is a single neuron with a sigmoid activation to distinguish between the 2 classes. As is typical with binary classifiers, you will use `binary_crossentropy` as your loss function while training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5NEpdhb8AxID"
},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"\n",
"# Build the model\n",
"model = tf.keras.Sequential([\n",
"  tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),\n",
"  tf.keras.layers.Flatten(),\n",
"  tf.keras.layers.Dense(6, activation='relu'),\n",
"  tf.keras.layers.Dense(1, activation='sigmoid')\n",
"])\n",
"\n",
"# Setup the training parameters\n",
"model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
"\n",
"# Print the model summary\n",
"model.summary()"
]
},
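{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check on the summary above (assuming the default parameters defined earlier): the `Embedding` layer has `vocab_size * embedding_dim = 10000 * 16 = 160,000` trainable weights, `Flatten` produces a `max_length * embedding_dim = 120 * 16 = 1,920`-element vector, the first `Dense` layer adds `1920 * 6 + 6 = 11,526` parameters, and the output layer adds `6 + 1 = 7`, for a total of 171,533 trainable parameters."
]
},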
{
"cell_type": "markdown",
"metadata": {
"id": "e8gbnoRdqp8O"
},
"source": [
"## Train the Model\n",
"\n",
"Next, of course, is to train your model. With the current settings, you will get near perfect training accuracy after just 5 epochs but the validation accuracy will plateau at around 83%. See if you can still improve this by adjusting some of the parameters earlier (e.g. the `vocab_size`, number of `Dense` neurons, number of epochs, etc.)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "V5LLrXC-uNX6"
},
"outputs": [],
"source": [
"num_epochs = 10\n",
"\n",
"# Train the model\n",
"model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))"
]
},
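{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to see the plateau described above rather than just read the epoch logs, one option is to capture the return value of `model.fit()` (e.g. `history = model.fit(...)`) and plot the per-epoch metrics. The cell below is a minimal sketch that assumes such a `history` object exists; it is not required for the rest of the lab."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: plot training vs. validation accuracy per epoch.\n",
"# Assumes the fit call above was captured as `history = model.fit(...)`.\n",
"import matplotlib.pyplot as plt\n",
"\n",
"def plot_metric(history, metric):\n",
"  plt.plot(history.history[metric], label=metric)\n",
"  plt.plot(history.history['val_' + metric], label='val_' + metric)\n",
"  plt.xlabel('epoch')\n",
"  plt.legend()\n",
"  plt.show()\n",
"\n",
"plot_metric(history, 'accuracy')"
]
},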
{
"cell_type": "markdown",
"metadata": {
"id": "mroDvjEJqwm4"
},
"source": [
"## Visualize Word Embeddings\n",
"\n",
"After training, you can visualize the trained weights in the `Embedding` layer to see words that are clustered together. The [Tensorflow Embedding Projector](https://projector.tensorflow.org/) is able to reduce the 16-dimensional vectors you defined earlier into fewer components so they can be plotted in the projector. First, you will need to get these weights, and you can do that with the cell below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "yAmjJqEyCOF_"
},
"outputs": [],
"source": [
"# Get the embedding layer from the model (i.e. first layer)\n",
"embedding_layer = model.layers[0]\n",
"\n",
"# Get the weights of the embedding layer\n",
"embedding_weights = embedding_layer.get_weights()[0]\n",
"\n",
"# Print the shape. Expected is (vocab_size, embedding_dim)\n",
"print(embedding_weights.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DEuG9AqIuF6i"
},
"source": [
"You will need to generate two files:\n",
"\n",
"* `vecs.tsv` - contains the vector weights of each word in the vocabulary\n",
"* `meta.tsv` - contains the words in the vocabulary"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1u4Ty097uRYP"
},
"source": [
"For this, it is useful to have a `reverse_word_index` dictionary so you can quickly look up a word based on a given index. For example, `reverse_word_index[1]` will return your OOV token because it is always at index 1. Fortunately, the `Tokenizer` class already provides this dictionary through its `index_word` property. As the name implies, it is the reverse of the `word_index` property which you used earlier!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "pPhhHqvxvS8f"
},
"outputs": [],
"source": [
"# Get the index-word dictionary\n",
"reverse_word_index = tokenizer.index_word"
]
},
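{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick, optional way to confirm the claim above: looking up index `1` in `reverse_word_index` should return the OOV token, and looking up the OOV token in `word_index` should return `1`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional check: index 1 should map to the OOV token, and the OOV token\n",
"# should map back to index 1\n",
"print(reverse_word_index[1])\n",
"print(word_index[oov_tok])"
]
},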
{
"cell_type": "markdown",
"metadata": {
"id": "ykM0Q9ThvszB"
},
"source": [
"Now you can start the loop to generate the files. You will loop `vocab_size-1` times, skipping index `0` because it is reserved for padding."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jmB0Uxk0ycP6"
},
"outputs": [],
"source": [
"import io\n",
"\n",
"# Open writeable files\n",
"out_v = io.open('vecs.tsv', 'w', encoding='utf-8')\n",
"out_m = io.open('meta.tsv', 'w', encoding='utf-8')\n",
"\n",
"# Initialize the loop. Start counting at `1` because `0` is just for the padding\n",
"for word_num in range(1, vocab_size):\n",
"\n",
"  # Get the word associated at the current index\n",
"  word_name = reverse_word_index[word_num]\n",
"\n",
"  # Get the embedding weights associated with the current index\n",
"  word_embedding = embedding_weights[word_num]\n",
"\n",
"  # Write the word name\n",
"  out_m.write(word_name + \"\\n\")\n",
"\n",
"  # Write the word embedding\n",
"  out_v.write('\\t'.join([str(x) for x in word_embedding]) + \"\\n\")\n",
"\n",
"# Close the files\n",
"out_v.close()\n",
"out_m.close()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3t92Osu3u8Qh"
},
"source": [
"When running this on Colab, you can run the code below to download the files. Otherwise, you will find the files in your current working directory and can download them manually.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "VDeqpOCVydtq"
},
"outputs": [],
"source": [
"# Import the Colab files utility if it is available\n",
"try:\n",
"  from google.colab import files\n",
"except ImportError:\n",
"  pass\n",
"else:\n",
"  # Download the files (this branch only runs when the import succeeded, i.e. in Colab)\n",
"  files.download('vecs.tsv')\n",
"  files.download('meta.tsv')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TRV8ag3nyAOb"
},
"source": [
"Now you can go to the [Tensorflow Embedding Projector](https://projector.tensorflow.org/) and load the two files you downloaded to see the visualization. You can search for words like `worst` and `fantastic` and see which other words are located close to them."
]
},
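{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer a quick local check before (or instead of) uploading the files, the sketch below computes cosine similarities directly on `embedding_weights` to list the nearest neighbors of a given word. It assumes the query word is common enough to fall within the top `vocab_size` entries of `word_index`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: nearest neighbors in the trained embedding space\n",
"import numpy as np\n",
"\n",
"def nearest_words(word, top_k=5):\n",
"  idx = word_index[word]  # assumes the word's index is below vocab_size\n",
"  vec = embedding_weights[idx]\n",
"\n",
"  # Cosine similarity between `vec` and every embedding vector\n",
"  sims = embedding_weights @ vec / (\n",
"      np.linalg.norm(embedding_weights, axis=1) * np.linalg.norm(vec) + 1e-9)\n",
"\n",
"  # Highest similarity first; skip padding (index 0) and the query word itself\n",
"  ranked = np.argsort(-sims)\n",
"  return [reverse_word_index[i] for i in ranked if i not in (0, idx)][:top_k]\n",
"\n",
"print('fantastic ->', nearest_words('fantastic'))\n",
"print('worst ->', nearest_words('worst'))"
]
},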
{
"cell_type": "markdown",
"metadata": {
"id": "4GOiu0WHzMzk"
},
"source": [
"## Wrap Up\n",
"\n",
"In this lab, you were able to build a simple sentiment classification model and train it on preprocessed text data. In the next lessons, you will revisit the Sarcasm Dataset you used in Week 1 and build a model to train on it."
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "C3_W2_Lab_1_imdb.ipynb",
"private_outputs": true,
"provenance": [],
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 0
}