@patternproject
Created March 22, 2020 19:42
Wk3_Submission.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Wk3_Submission.ipynb",
"provenance": [],
"collapsed_sections": [
"I_LJ_VmN20qo",
"y43HipKZ275r",
"sPY9Z38C3dEQ",
"q_Fv00Lf3gby",
"ZJbgtg-S3rYL",
"l_zz9r6Y322G",
"PECSzyLI37LZ",
"MbY0wZMBN36X",
"4VNdFGvGN9Jm"
],
"toc_visible": true,
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/patternproject/0ae55288e9d07d52d58ad93b8d381d43/wk3_submission.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QazQ6ZyQR03W",
"colab_type": "text"
},
"source": [
"Manning LP \n",
"\"Classifying Customer Feedback with Imbalanced Text Data\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3o9qTS480De0",
"colab_type": "text"
},
"source": [
"Wk3 - Generate New Corpus"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YoaKf9LpPJ1k",
"colab_type": "text"
},
"source": [
"# 1.Environment Setup"
]
},
{
"cell_type": "code",
"metadata": {
"id": "iD-DW3dEpB4v",
"colab_type": "code",
"colab": {}
},
"source": [
"from __future__ import absolute_import, division, print_function, unicode_literals\n",
"import os\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import pandas as pd\n",
"\n",
"\n",
"import collections\n",
"\n",
"\n",
"# for file upload/download\n",
"from google.colab import files\n",
"\n",
"\n",
"#import the TfidfVectorizer from Scikit-Learn.\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"# for removing unicode like über' 'émigré' and 'æsthetic' from pos review samples\n",
"import unicodedata\n",
"import re\n",
"\n",
"# for pickling\n",
"import pickle"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "iXKC0oveMysp",
"colab_type": "code",
"outputId": "0afa0946-98ab-4c66-96ef-c27a23841cd0",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 70
}
},
"source": [
"!pip install -q tensorflow-gpu==2.0.0-rc0"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"\u001b[K |████████████████████████████████| 380.5MB 40kB/s \n",
"\u001b[K |████████████████████████████████| 4.3MB 59.2MB/s \n",
"\u001b[K |████████████████████████████████| 501kB 44.8MB/s \n",
"\u001b[?25h"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "53wFNGq8PPE7",
"colab_type": "code",
"outputId": "de707a3e-7e53-4492-f3ec-5cafd8e181a0",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 63
}
},
"source": [
"\n",
"import tensorflow as tf\n",
"from tensorflow.keras import layers"
],
"execution_count": 0,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/html": [
"<p style=\"color: red;\">\n",
"The default version of TensorFlow in Colab will soon switch to TensorFlow 2.x.<br>\n",
"We recommend you <a href=\"https://www.tensorflow.org/guide/migrate\" target=\"_blank\">upgrade</a> now \n",
"or ensure your notebook will continue to use TensorFlow 1.x via the <code>%tensorflow_version 1.x</code> magic:\n",
"<a href=\"https://colab.research.google.com/notebooks/tensorflow_version.ipynb\" target=\"_blank\">more info</a>.</p>\n"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {
"tags": []
}
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "uagJZ53OPbQL",
"colab_type": "code",
"colab": {}
},
"source": [
"keras = tf.keras"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "IaVtTJitQDdK",
"colab_type": "code",
"colab": {}
},
"source": [
"# enable eager execution\n",
"tf.enable_eager_execution()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "mRqbc7hYQ4Tm",
"colab_type": "code",
"outputId": "0a45b798-b05b-4541-c5db-c7a49bad4db8",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"# Check if eager execution is enabled\n",
"tf.executing_eagerly()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"True"
]
},
"metadata": {
"tags": []
},
"execution_count": 7
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tjiY9q1kPWRD",
"colab_type": "text"
},
"source": [
"# 2.Loading Data"
]
},
{
"cell_type": "code",
"metadata": {
"id": "ODDTw8vrPSEs",
"colab_type": "code",
"outputId": "846b721a-ef07-4359-9c2b-943c8baedf18",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"imdb = keras.datasets.imdb\n",
"print(tf.__version__)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"1.15.0\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wwCXc334Ptva",
"colab_type": "text"
},
"source": [
"Load IMDB movie review dataset"
]
},
{
"cell_type": "code",
"metadata": {
"id": "drW6gA4zm1Z9",
"colab_type": "code",
"colab": {}
},
"source": [
"# parameter detaisl from:\n",
"# https://marcinbogdanski.github.io/ai-sketchpad/KerasNN/0150_MLP_IMDB.html "
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "rv1sg93tPhXy",
"colab_type": "code",
"outputId": "8b88329e-8329-4770-85c7-83f8a3d7c70a",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 52
}
},
"source": [
"(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(\n",
" path='imdb.npz', # download to '~/.keras/datasets/' + path\n",
" num_words=None, # top most frequent words to consider\n",
" skip_top=0, # top most frequent words to ignore ('the', 'a', 'at', ...)\n",
" maxlen=None, # truncate reviews longer than this\n",
" seed=113, # data shuffling seed\n",
" start_char=1, # start-of-sequence token\n",
" oov_char=2, # if skip_top used, then dropped words replaced with this token\n",
" index_from=3 # actual word tokens start here\n",
")"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz\n",
"17465344/17464789 [==============================] - 0s 0us/step\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BivwwTP3NGXT",
"colab_type": "text"
},
"source": [
"#### EDA"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dHJRsiUQlYFt",
"colab_type": "text"
},
"source": [
"Returns:\n",
"\n",
"2 tuples:\n",
"* x_train, x_test: list of sequences, which are lists of indexes (integers). If the num_words argument was specific, the maximum possible index value is num_words-1. If the maxlen argument was specified, the largest possible sequence length is maxlen.\n",
"* y_train, y_test: list of integer labels (1 or 0)."
]
},
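{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustrative check (not in the original notebook), the stock IMDB split ships 25,000 training and 25,000 test reviews:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Illustrative sanity check on the split sizes (expected: 25000 each)\n",
"print(len(x_train), len(x_test), len(y_train), len(y_test))"
],
"execution_count": 0,
"outputs": []
},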
{
"cell_type": "code",
"metadata": {
"id": "N91K-YYFnWAL",
"colab_type": "code",
"outputId": "fb6bbf48-b7cc-427a-cfcc-3c5682892edc",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"#Example data sample\n",
"\n",
"print('Label:', y_train[0])"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Label: 1\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "Fkm0zW38nbJJ",
"colab_type": "code",
"outputId": "eae21126-fd20-484a-960c-49ca354cf771",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 54
}
},
"source": [
"print('Review:', x_train[0])"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Review: [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "S85sBsDAU5Af",
"colab_type": "code",
"colab": {}
},
"source": [
"unique, counts = np.unique(y_train, return_counts=True)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "kDaKqel_VCb3",
"colab_type": "code",
"outputId": "68d1b0a4-9846-43a5-a1bd-a5afa0a1324d",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 52
}
},
"source": [
"print(np.asarray((unique, counts)).T)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"[[ 0 12500]\n",
" [ 1 12500]]\n"
],
"name": "stdout"
}
]
},
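{
"cell_type": "markdown",
"metadata": {},
"source": [
"`collections` was imported during setup but never used; an equivalent, illustrative way to get the same class counts:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Same class-balance check via collections.Counter (imported above)\n",
"print(collections.Counter(y_train))"
],
"execution_count": 0,
"outputs": []
},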
{
"cell_type": "markdown",
"metadata": {
"id": "z5-RUK-FRtCA",
"colab_type": "text"
},
"source": [
"### Dictionary Setup for Word Lookup"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1jE6uzV_Pw0G",
"colab_type": "text"
},
"source": [
"Load IMDB word index"
]
},
{
"cell_type": "code",
"metadata": {
"id": "CUdPePuMPqb4",
"colab_type": "code",
"outputId": "0dee68be-077a-418e-a860-626d72624c5b",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 52
}
},
"source": [
"word_index = tf.keras.datasets.imdb.get_word_index(path='imdb_word_index.json')"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json\n",
"1646592/1641221 [==============================] - 0s 0us/step\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aR_kHUVKUGA5",
"colab_type": "text"
},
"source": [
"Modify the word index by updating it with four extra tokens to indicate <PAD>, <START>, <UNK>, and <UNUSED>. These are extra tokens to mark the start of each review and fill with the <PAD> token in short reviews, so that all reviews are equal in length. <UNK> and <UNUSED> are extra tokens for handling words that didn’t appear in the word index during inference time."
]
},
{
"cell_type": "code",
"metadata": {
"id": "8Jbz1BH5VTCj",
"colab_type": "code",
"outputId": "6ccb60aa-ad5c-443c-fdc8-f24b1b6c72bb",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"len(word_index) "
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"88584"
]
},
"metadata": {
"tags": []
},
"execution_count": 9
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "TAICB1-SSqOc",
"colab_type": "code",
"colab": {}
},
"source": [
"# The first indices are reserved\n",
"word_index = {k:(v+3) for k,v in word_index.items()} \n",
"word_index[\"<PAD>\"] = 0\n",
"word_index[\"<START>\"] = 1\n",
"word_index[\"<UNK>\"] = 2 # unknown\n",
"word_index[\"<UNUSED>\"] = 3"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "jKSz2pENUSA2",
"colab_type": "text"
},
"source": [
"### Swapping Key and Value"
]
},
{
"cell_type": "code",
"metadata": {
"id": "MQKqng13UaAz",
"colab_type": "code",
"colab": {}
},
"source": [
"reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])"
],
"execution_count": 0,
"outputs": []
},
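{
"cell_type": "markdown",
"metadata": {},
"source": [
"An illustrative round trip through the two lookup tables ('movie' is just an arbitrary example word):"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Round trip: word -> id -> word (example word is arbitrary)\n",
"wid = word_index['movie']\n",
"print(wid, reverse_word_index[wid])"
],
"execution_count": 0,
"outputs": []
},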
{
"cell_type": "markdown",
"metadata": {
"id": "nma6Ao_rUVQf",
"colab_type": "text"
},
"source": [
"#### EDA"
]
},
{
"cell_type": "code",
"metadata": {
"id": "DXtIrj64VLcH",
"colab_type": "code",
"outputId": "c7dd4fc1-2ee1-4785-c8ea-98ce4ec1c9c8",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"type(word_index)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"dict"
]
},
"metadata": {
"tags": []
},
"execution_count": 11
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "3_Nq_h1bVN1c",
"colab_type": "code",
"outputId": "6a127923-f0be-4437-ee54-46c1dfd585aa",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"len(word_index) # with addition of 4 more, compare to above"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"88584"
]
},
"metadata": {
"tags": []
},
"execution_count": 12
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "d9IwGgybVaRV",
"colab_type": "code",
"outputId": "456a704a-546d-4f96-f9b6-da0a6df3cdad",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
}
},
"source": [
"list(word_index)[0:10]"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['fawn',\n",
" 'tsukino',\n",
" 'nunnery',\n",
" 'sonja',\n",
" 'vani',\n",
" 'woods',\n",
" 'spiders',\n",
" 'hanging',\n",
" 'woody',\n",
" 'trawling']"
]
},
"metadata": {
"tags": []
},
"execution_count": 13
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "riGLvhsAVeo_",
"colab_type": "code",
"outputId": "ddd56a49-7219-40f3-fb0d-842857cb05c8",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 70
}
},
"source": [
"# Getting first items of a Python 3 dict\n",
"for x in list(word_index)[0:3]:\n",
" print (\"key {}, value {} \".format(x, word_index[x]))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"key fawn, value 34704 \n",
"key tsukino, value 52009 \n",
"key nunnery, value 52010 \n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "6epNbPNqVoNM",
"colab_type": "code",
"outputId": "0ab67972-df8f-4ba5-98a3-919a741cc308",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 70
}
},
"source": [
"# Getting first items of a Python 3 dict\n",
"for x in list(reverse_word_index)[0:3]:\n",
" print (\"key {}, value {} \".format(x, reverse_word_index[x]))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"key 34704, value fawn \n",
"key 52009, value tsukino \n",
"key 52010, value nunnery \n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "xusEptqIVrxE",
"colab_type": "code",
"colab": {}
},
"source": [
"# Parameters for dict.get()\n",
"## key − This is the Key to be searched in the dictionary.\n",
"## default − This is the Value to be returned in case key does not exist.\n",
"\n",
"def decode_review(text_indexes):\n",
" # text_indexes means int mapping\n",
" return ' '.join([reverse_word_index.get(i, '?') for i in text_indexes])"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "WJI0Vm9RVupP",
"colab_type": "code",
"outputId": "dc7251c8-1092-4dfe-9359-68968111b660",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"decode_review([32, 10])"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'all br'"
]
},
"metadata": {
"tags": []
},
"execution_count": 63
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "Avt0ytL6W93n",
"colab_type": "code",
"outputId": "58292632-ac63-4630-8cf7-9a983cc2a003",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"word_index.get('all', '?')"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"32"
]
},
"metadata": {
"tags": []
},
"execution_count": 64
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UNWo0wjaCoej",
"colab_type": "text"
},
"source": [
"# 3.Follow Along TF Example"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "w64ygI0dCx1U",
"colab_type": "text"
},
"source": [
"https://colab.research.google.com/drive/1_qMWdtADKBjZQE1RuyZnODtHi8wWsD0s"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZrNCq9vGDBMB",
"colab_type": "text"
},
"source": [
"## 3.1 Setup"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-1LkWHQEDDqq",
"colab_type": "text"
},
"source": [
"### Import TensorFlow and other libraries\n",
"done above\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gLW4NKIsDOzG",
"colab_type": "text"
},
"source": [
"### Download Data\n",
"done above, small changes required to match our WK3 requirements"
]
},
{
"cell_type": "code",
"metadata": {
"id": "eXu6OfxuDt1U",
"colab_type": "code",
"colab": {}
},
"source": [
"# our dataset is the sampled x_train\n",
"\n",
"idx = np.argwhere(y_train>0) # Select positive comment's index in training data\n",
"np.random.shuffle(idx) #Shuffle it at random\n",
"\n",
"# Lets randomly select a fraction of the positive reviews while keeping all negative reviews. \n",
"# We are going to use these subset of positive reviews as our positive base, and oversample \n",
"# these reviews in the base at random.\n",
"\n",
"FRAC = 0.1\n",
"idxs = idx[:round(len(idx)*FRAC)] # Select random fraction FRAC\n",
"y_trains = y_train[idxs]\n",
"x_trains = x_train[idxs]"
],
"execution_count": 0,
"outputs": []
},
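{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick illustrative check of the resulting imbalance: all negatives against the sampled positives should give roughly a 10:1 ratio when FRAC = 0.1:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Illustrative imbalance check: negatives vs. sampled positives\n",
"n_neg = int((y_train == 0).sum())\n",
"n_pos_sampled = len(idxs)\n",
"print(n_neg, n_pos_sampled, round(n_neg / n_pos_sampled, 1))"
],
"execution_count": 0,
"outputs": []
},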
{
"cell_type": "code",
"metadata": {
"id": "jcX44zEeEaY6",
"colab_type": "code",
"colab": {}
},
"source": [
"# our dataset is the sampled x_train\n",
"text = x_trains"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "H9Cq2zyBMBWD",
"colab_type": "text"
},
"source": [
"Not Quite ! The original TF example takes in raw text and process it on character basis first. Meaning we need to combine all our lists into one big list"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Q7T-lE4RMNJy",
"colab_type": "text"
},
"source": [
"### From Array of List to One Big List"
]
},
{
"cell_type": "code",
"metadata": {
"id": "K9-zHbM_Efkx",
"colab_type": "code",
"outputId": "005c197b-b61e-4239-cd7f-dd79bf898c86",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"np.shape(text)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(1250, 1)"
]
},
"metadata": {
"tags": []
},
"execution_count": 19
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "0h4R88pSEmRw",
"colab_type": "code",
"outputId": "0841a840-26f8-4805-98ac-7f2b809b7e80",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 72
}
},
"source": [
"text[0]"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([list([1, 165, 14, 20, 16, 24, 38, 78, 12, 1367, 206, 212, 5, 2318, 50, 26, 52, 156, 11, 14, 22, 18, 1825, 7517, 39973, 18746, 39, 4, 1419, 3693, 37, 299, 18063, 160, 73, 573, 284, 9, 3546, 4224, 39, 2039, 5, 289, 8822, 4, 293, 105, 26, 256, 34, 3546, 5788, 17, 6184, 37, 16, 184, 52, 5, 82, 163, 21, 4, 31, 37, 91, 770, 72, 16, 628, 8335, 17, 4500, 39520, 29, 299, 6, 275, 109, 74, 29, 633, 127, 88, 11, 85, 108, 40, 4, 1419, 3693, 1395, 5808, 4, 31025, 42, 4, 43737, 29, 299, 6, 55, 2259, 415, 5, 11, 7242, 4, 299, 220, 4, 1961, 6, 132, 209, 101, 1438, 63, 16, 327, 8, 67, 4, 64, 66, 1566, 155, 44, 14, 22, 26, 4, 450, 1268, 7, 4, 182, 3331, 2216, 63, 166, 14, 22, 382, 168, 6, 117, 1967, 444, 13, 197, 14, 16, 6, 184, 52, 117, 22])],\n",
" dtype=object)"
]
},
"metadata": {
"tags": []
},
"execution_count": 20
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "grQ7wz2MFMml",
"colab_type": "code",
"outputId": "41a0a42f-7a08-463a-f501-ece39c955eff",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"type(text[0])"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"numpy.ndarray"
]
},
"metadata": {
"tags": []
},
"execution_count": 21
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "YBlu1RSfFSaY",
"colab_type": "code",
"outputId": "fa253e7d-a4df-4ce0-8507-73fde3bcd980",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"type(text[0][0])"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"list"
]
},
"metadata": {
"tags": []
},
"execution_count": 33
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "7VQyg-3SE86x",
"colab_type": "code",
"outputId": "cf77dc63-2b90-48c6-f914-b93f82626cbf",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 54
}
},
"source": [
"decode_review(text[0][0])"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'<START> actually this movie was not so bad it contains action comedy and excitement there are good actors in this film for instance doug hutchison percy from the green mile who plays bristol another well known actor is jamie kennedy from scream and three kings the main characters are played by jamie foxx as alvin who was pretty good and also funny but the one who most surprised me was david morse as edgar clenteen he plays a different character than he usually does because in other films like the green mile indian runner the negotiator or the langoliers he plays a very sympathetic person and in bait the plays almost the opposite a man without any emotions which was nice to see the only really negative thing about this film are the several pictures of the world trade center which makes this film perhaps look a little dated overall i thought this was a pretty good little film'"
]
},
"metadata": {
"tags": []
},
"execution_count": 34
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "sw0aMI8TKxcu",
"colab_type": "code",
"colab": {}
},
"source": [
"l_strings = (list(map(decode_review, text[:,0])))"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "xf6KhjEYLVkv",
"colab_type": "code",
"outputId": "4e708141-fd0a-41e5-8359-eb3ee0337d4a",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 90
}
},
"source": [
"l_strings[0:3]"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['<START> actually this movie was not so bad it contains action comedy and excitement there are good actors in this film for instance doug hutchison percy from the green mile who plays bristol another well known actor is jamie kennedy from scream and three kings the main characters are played by jamie foxx as alvin who was pretty good and also funny but the one who most surprised me was david morse as edgar clenteen he plays a different character than he usually does because in other films like the green mile indian runner the negotiator or the langoliers he plays a very sympathetic person and in bait the plays almost the opposite a man without any emotions which was nice to see the only really negative thing about this film are the several pictures of the world trade center which makes this film perhaps look a little dated overall i thought this was a pretty good little film',\n",
" \"<START> the matador is a strange film its main character julian played with an unusual mix of charm and unbalance by brosnan is not your typical hero julian is a hit man who is experiencing a late mid life crises having spent 22 years in the profession of cold blooded murder he now finds himself stressed out and desperately lonely and so after a chance meeting at a bar with danny greg kinnear he latches on and begins a halting awkward friendship danny the quintessential nice guy is dealing with some stuff in his own life and truth be told could use a friend as well the two make an unexpected connection and danny sticks around to hear julian's story even after learning the unsavory truth about julian's work br br matador approaches a subject not completely unheard of in cinema the anti hero assassin films like 'assassins' and 'grosse pointe blank' come to mind but matador differs in several key ways first of all the killing and gore is implied but never really shown in any detail meaning that if you are an action movie buff looking for an adrenaline rush this movie will probably disappoint you and second unlike most anti hero films matador makes no attempt to show remorse and redemption from its main character julian's job is simply presented as an 'it is what it is' kind of thing this is unusual given that 99 99 of us would consider killing for money horrific and yet this unorthodox approach is perhaps what makes the film feel authentic although we don't like to admit it almost anything could become mundane after we did it long enough maybe even murder did julian's victims deserve to die who is paying to have people killed who knows the movie never deals with these questions the focus is on julian and his stumbling shuffle into a genuine friendship if you read about someone like julian in the paper you would have a passing thought that people like him should be ripped out of society like a cancer but forced to watch his life you are drawn in by his intense 
humanity sympathy for the devil i guess br br brosnan's take on julian is well done and deeply unsettling he doesn't completely divorce himself from his james bond good looks and smooth charm but rather just adds disturbing quirks into the mix weird or crude remarks in the middle polite conversations and sudden shifts from suave charm to childish tantrums and sad desperate pleas for acceptance it keeps you guessing about his grasp on his sanity and how it will affect those around him it's a bit like listening to a piano player that occasionally and unexpectedly hits a wrong note while he plays but it works the films only other major role that of danny is not nearly as meaty kinnear turns in a solid if unspectacular performance as a regular joe with a regular joe life and problems br br the film doesn't really have any huge shocks or m night shyamalan twists but i wasn't able to guess the ending and it felt satisfying it doesn't have any deep philosophical or spiritual insights and yet it felt very human and it didn't have any heart pounding car chases or gun battles and yet i thought the pacing was well done and i was never bored maybe the only real message here is about the human need to reach out and make connections with one another and how those needs have no moral prerequisites even a murderer needs friends and even good people can be friends with bad people it's a comment on the strange random world we live in a good film worth seeing\",\n",
" \"<START> i think this movie had to be fun to make it for us it was fun to watch it the actors look like they have a fun time my girlfriends like the boy actors and my boyfriends like the girl actors not very much do we get to have crazy fun with a movie that is horror make i see a lot of scary movies and i would watch this one all together once more or more because we laugh together if this actors make other scary movies i will watch them the grander mad man thats chase to kill the actors is very much a good bad man he make us laugh together the most i would give this movie a high score if you ask me br br i don't know if the market has any more of the movies with the actors but the main boy is cute the actor with the grand chest has to be not real they doesn't look to real\"]"
]
},
"metadata": {
"tags": []
},
"execution_count": 23
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "0c9skMQGLMau",
"colab_type": "code",
"colab": {}
},
"source": [
"text = ' '.join(l_strings)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "XwS0UG69MVCn",
"colab_type": "code",
"outputId": "48cc6586-3a0f-4afd-d777-0dc9e6854dca",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 54
}
},
"source": [
"text[0:2000]"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"\"<START> actually this movie was not so bad it contains action comedy and excitement there are good actors in this film for instance doug hutchison percy from the green mile who plays bristol another well known actor is jamie kennedy from scream and three kings the main characters are played by jamie foxx as alvin who was pretty good and also funny but the one who most surprised me was david morse as edgar clenteen he plays a different character than he usually does because in other films like the green mile indian runner the negotiator or the langoliers he plays a very sympathetic person and in bait the plays almost the opposite a man without any emotions which was nice to see the only really negative thing about this film are the several pictures of the world trade center which makes this film perhaps look a little dated overall i thought this was a pretty good little film <START> the matador is a strange film its main character julian played with an unusual mix of charm and unbalance by brosnan is not your typical hero julian is a hit man who is experiencing a late mid life crises having spent 22 years in the profession of cold blooded murder he now finds himself stressed out and desperately lonely and so after a chance meeting at a bar with danny greg kinnear he latches on and begins a halting awkward friendship danny the quintessential nice guy is dealing with some stuff in his own life and truth be told could use a friend as well the two make an unexpected connection and danny sticks around to hear julian's story even after learning the unsavory truth about julian's work br br matador approaches a subject not completely unheard of in cinema the anti hero assassin films like 'assassins' and 'grosse pointe blank' come to mind but matador differs in several key ways first of all the killing and gore is implied but never really shown in any detail meaning that if you are an action movie buff looking for an adrenaline rush this movie will probably disappoint you 
and\""
]
},
"metadata": {
"tags": []
},
"execution_count": 25
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7nSNiZlwC-8X",
"colab_type": "text"
},
"source": [
"## 3.2 Process the text"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "47AGoOQ6DRTG",
"colab_type": "text"
},
"source": [
"### Vectorize the text\n",
"Before training, we need to map strings to a numerical representation. Create two lookup tables: one mapping characters to numbers, and another for numbers to characters."
]
},
{
"cell_type": "code",
"metadata": {
"id": "TWoBUNmMDhJU",
"colab_type": "code",
"colab": {}
},
"source": [
"\n",
"\n",
"# The unique characters in the file\n",
"vocab = sorted(set(text))"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "tevRVNs0dN--",
"colab_type": "code",
"colab": {}
},
"source": [
"# Creating a mapping from unique characters to indices\n",
"char2idx = {u:i for i, u in enumerate(vocab)}\n",
"idx2char = np.array(vocab)\n",
"idx2char_dict = {i:u for i, u in enumerate(vocab)}\n",
"\n",
"text_as_int = np.array([char2idx[c] for c in text])"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZPqv0huUOGsK",
"colab_type": "text"
},
"source": [
"#### EDA"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Zvrq9_u2Nd6n",
"colab_type": "code",
"outputId": "a3040052-30f0-42c6-8e03-9d71b6b2c901",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
}
},
"source": [
"# Getting first items of a Python 3 dict\n",
"for x in list(char2idx)[0:10]:\n",
" print (\"key {}, value {} \".format(x, char2idx[x]))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"key , value 0 \n",
"key ', value 1 \n",
"key 0, value 2 \n",
"key 1, value 3 \n",
"key 2, value 4 \n",
"key 3, value 5 \n",
"key 4, value 6 \n",
"key 5, value 7 \n",
"key 6, value 8 \n",
"key 7, value 9 \n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "vAFCrt4yNgLH",
"colab_type": "code",
"outputId": "d3adad5b-2059-4bf6-e1b7-db91543922c5",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"type(idx2char)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"numpy.ndarray"
]
},
"metadata": {
"tags": []
},
"execution_count": 51
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "LrrgLM2vN4DK",
"colab_type": "code",
"outputId": "92344cf8-f1be-4a64-8de2-ba197603cf8c",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"idx2char[0:10]"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([' ', \"'\", '0', '1', '2', '3', '4', '5', '6', '7'], dtype='<U1')"
]
},
"metadata": {
"tags": []
},
"execution_count": 52
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "4yfJi5zJN8rt",
"colab_type": "code",
"outputId": "a2383d05-214b-44a4-d35c-fe2740f41aed",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
}
},
"source": [
"# Peek at the first 10 integer codes for the text\n",
"for i, x in enumerate(text_as_int[:10]):\n",
"  print (\"index {}, value {} \".format(i, x))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"index 0, value 0 \n",
"index 1, value 12 \n",
"index 2, value 14 \n",
"index 3, value 31 \n",
"index 4, value 32 \n",
"index 5, value 12 \n",
"index 6, value 23 \n",
"index 7, value 23 \n",
"index 8, value 36 \n",
"index 9, value 0 \n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bL59EAA3OcQR",
"colab_type": "text"
},
"source": [
"Now we have an integer representation for each character. Notice that each character is mapped to an index from 0 to len(vocab) - 1."
]
},
{
"cell_type": "code",
"metadata": {
"id": "QYaJI7Z-OUcw",
"colab_type": "code",
"outputId": "96d65aa1-4e5d-48a4-f7e3-6afa15bd00c7",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 425
}
},
"source": [
"print('{')\n",
"for char,_ in zip(char2idx, range(20)):\n",
" print(' {:4s}: {:3d},'.format(repr(char), char2idx[char]))\n",
"print(' ...\\n}')"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"{\n",
" ' ' : 0,\n",
" \"'\" : 1,\n",
" '0' : 2,\n",
" '1' : 3,\n",
" '2' : 4,\n",
" '3' : 5,\n",
" '4' : 6,\n",
" '5' : 7,\n",
" '6' : 8,\n",
" '7' : 9,\n",
" '8' : 10,\n",
" '9' : 11,\n",
" 'a' : 12,\n",
" 'b' : 13,\n",
" 'c' : 14,\n",
" 'd' : 15,\n",
" 'e' : 16,\n",
" 'f' : 17,\n",
" 'g' : 18,\n",
" 'h' : 19,\n",
" ...\n",
"}\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "q8gS0kTxOeNm",
"colab_type": "code",
"outputId": "c575bf12-ba75-468f-8954-5aaba2fefd9d",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"# Show how the first 13 characters from the text are mapped to integers\n",
"print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"' actually thi' ---- characters mapped to int ---- > [ 0 12 14 31 32 12 23 23 36 0 31 19 20]\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "I3HffWbD6hd6",
"colab_type": "code",
"outputId": "9daa8ec7-3b6a-46f6-aa50-3334c614a137",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
}
},
"source": [
"# Getting first items of a Python 3 dict\n",
"for x in list(idx2char_dict)[0:10]:\n",
" print (\"key {}, value {} \".format(x, idx2char_dict[x]))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"key 0, value \n",
"key 1, value ' \n",
"key 2, value 0 \n",
"key 3, value 1 \n",
"key 4, value 2 \n",
"key 5, value 3 \n",
"key 6, value 4 \n",
"key 7, value 5 \n",
"key 8, value 6 \n",
"key 9, value 7 \n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DmeFqR_SOq13",
"colab_type": "text"
},
"source": [
"## 3.3 The prediction task"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NvlAErS1Owzb",
"colab_type": "text"
},
"source": [
"### Create training examples and targets\n",
"\n",
"Next divide the text into example sequences. Each input sequence will contain `seq_length` characters from the text.\n",
"\n",
"For each input sequence, the corresponding targets contain the same length of text, except shifted one character to the right.\n",
"\n",
"So break the text into chunks of `seq_length+1`. For example, say `seq_length` is 4 and our text is \"Hello\". The input sequence would be \"Hell\", and the target sequence \"ello\".\n",
"\n",
"To do this first use the `tf.data.Dataset.from_tensor_slices` function to convert the text vector into a stream of character indices."
]
},
{
"cell_type": "code",
"metadata": {
"id": "0u5cdjyvOh5r",
"colab_type": "code",
"outputId": "5f143b62-321b-413a-fcda-f5a9d607beda",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 105
}
},
"source": [
"# The maximum length sentence we want for a single input in characters\n",
"seq_length = 100\n",
"examples_per_epoch = len(text)//(seq_length+1)\n",
"\n",
"# Create training examples / targets\n",
"char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)\n",
"\n",
"\n",
"for i in char_dataset.take(5):\n",
" print(idx2char[i.numpy()])"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"<\n",
"S\n",
"T\n",
"A\n",
"R\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "LGwGPV8LO4BS",
"colab_type": "code",
"outputId": "ea6293bf-6a0d-45c4-c2c4-b62d2f0af915",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"# DEBUG\n",
"type(char_dataset)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"tensorflow.python.data.ops.dataset_ops.DatasetV1Adapter"
]
},
"metadata": {
"tags": []
},
"execution_count": 30
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LC_eXWV1Pexi",
"colab_type": "text"
},
"source": [
"The `batch` method lets us easily convert these individual characters to sequences of the desired size."
]
},
{
"cell_type": "code",
"metadata": {
"id": "RjKDe2rzPUeG",
"colab_type": "code",
"outputId": "7ce78561-a668-4033-c72b-00ca0fd0909b",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 105
}
},
"source": [
"sequences = char_dataset.batch(seq_length+1, drop_remainder=True)\n",
"\n",
"for item in sequences.take(5):\n",
" print(repr(''.join(idx2char[item.numpy()])))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"'<START> actually this movie was not so bad it contains action comedy and excitement there are good ac'\n",
"'tors in this film for instance doug hutchison percy from the green mile who plays bristol another wel'\n",
"'l known actor is jamie kennedy from scream and three kings the main characters are played by jamie fo'\n",
"'xx as alvin who was pretty good and also funny but the one who most surprised me was david morse as e'\n",
"'dgar clenteen he plays a different character than he usually does because in other films like the gre'\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jBOK5lVCR_lS",
"colab_type": "text"
},
"source": [
"For each sequence, duplicate and shift it to form the input and target text by using the `map` method to apply a simple function to each batch:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "cgnfFIADPp4U",
"colab_type": "code",
"colab": {}
},
"source": [
"def split_input_target(chunk):\n",
" input_text = chunk[:-1]\n",
" target_text = chunk[1:]\n",
" return input_text, target_text\n",
"\n",
"dataset = sequences.map(split_input_target)"
],
"execution_count": 0,
"outputs": []
},
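{
"cell_type": "markdown",
"metadata": {
"id": "split-demo-md",
"colab_type": "text"
},
"source": [
"A toy illustration (a minimal sketch; it assumes `tf` and `split_input_target` from above): for a 5-element chunk, the input is everything but the last element and the target is the same chunk shifted one position to the right."
]
},
{
"cell_type": "code",
"metadata": {
"id": "split-demo-code",
"colab_type": "code",
"colab": {}
},
"source": [
"# Apply split_input_target to a toy 5-element chunk\n",
"demo_chunk = tf.constant([7, 8, 9, 10, 11])\n",
"demo_input, demo_target = split_input_target(demo_chunk)\n",
"print(demo_input.numpy())   # first four elements\n",
"print(demo_target.numpy())  # last four elements"
],
"execution_count": 0,
"outputs": []
},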
{
"cell_type": "code",
"metadata": {
"id": "4VNUrQ0MSCDu",
"colab_type": "code",
"outputId": "26ce1bb1-a647-447d-b27a-431a9117f6e1",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"# debug\n",
"\n",
"type(dataset)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"tensorflow.python.data.ops.dataset_ops.DatasetV1Adapter"
]
},
"metadata": {
"tags": []
},
"execution_count": 60
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ocVNs7z4SKMH",
"colab_type": "text"
},
"source": [
"Print the first example's input and target values:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "xkA7JdUVSD-T",
"colab_type": "code",
"outputId": "0f67539b-b7c9-47f5-8c41-c80b12f13631",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 52
}
},
"source": [
"for input_example, target_example in dataset.take(1):\n",
" print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))\n",
" print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Input data: '<START> actually this movie was not so bad it contains action comedy and excitement there are good a'\n",
"Target data: 'START> actually this movie was not so bad it contains action comedy and excitement there are good ac'\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OfhInEVYScw0",
"colab_type": "text"
},
"source": [
"Each index of these vectors is processed as one time step. For the input at time step 0, the model receives the index for \"<\" and tries to predict the index for \"S\" as the next character. At the next timestep it does the same thing, but the RNN also considers the previous step's context in addition to the current input character."
]
},
{
"cell_type": "code",
"metadata": {
"id": "TQgGGM0sSQHW",
"colab_type": "code",
"outputId": "0ff992b3-c317-46f0-ea0d-5155d8c042df",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 283
}
},
"source": [
"for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):\n",
" print(\"Step {:4d}\".format(i))\n",
" print(\" input: {} ({:s})\".format(input_idx, repr(idx2char[input_idx])))\n",
" print(\" expected output: {} ({:s})\".format(target_idx, repr(idx2char[target_idx])))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Step 0\n",
" input: 12 ('<')\n",
" expected output: 16 ('S')\n",
"Step 1\n",
" input: 16 ('S')\n",
" expected output: 17 ('T')\n",
"Step 2\n",
" input: 17 ('T')\n",
" expected output: 14 ('A')\n",
"Step 3\n",
" input: 14 ('A')\n",
" expected output: 15 ('R')\n",
"Step 4\n",
" input: 15 ('R')\n",
" expected output: 17 ('T')\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mB67dH_NSi5d",
"colab_type": "text"
},
"source": [
"### Create training batches\n",
"\n",
"We used `tf.data` to split the text into manageable sequences. But before feeding this data into the model, we need to shuffle the data and pack it into batches."
]
},
{
"cell_type": "code",
"metadata": {
"id": "XkJsuY6LSfCb",
"colab_type": "code",
"outputId": "9a49fa6d-3206-421b-d8c4-bf0f4b085601",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"# Batch size\n",
"BATCH_SIZE = 64\n",
"\n",
"# Buffer size to shuffle the dataset\n",
"# (TF data is designed to work with possibly infinite sequences,\n",
"# so it doesn't attempt to shuffle the entire sequence in memory. Instead,\n",
"# it maintains a buffer in which it shuffles elements).\n",
"BUFFER_SIZE = 10000\n",
"\n",
"dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)\n",
"\n",
"dataset"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<DatasetV1Adapter shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>"
]
},
"metadata": {
"tags": []
},
"execution_count": 35
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Lf1oCX0BSp4S",
"colab_type": "text"
},
"source": [
"## 3.4 Build The Model"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wOMdBpncSrq-",
"colab_type": "text"
},
"source": [
"Use `tf.keras.Sequential` to define the model. For this simple example three layers are used to define our model:\n",
"\n",
"* `tf.keras.layers.Embedding`: The input layer. A trainable lookup table that will map the numbers of each character to a vector with `embedding_dim` dimensions;\n",
"* `tf.keras.layers.GRU`: A type of RNN with size `units=rnn_units` (you can also use an LSTM layer here);\n",
"* `tf.keras.layers.Dense`: The output layer, with `vocab_size` outputs."
]
},
{
"cell_type": "code",
"metadata": {
"id": "dCJ5czfOSmQo",
"colab_type": "code",
"colab": {}
},
"source": [
"# Length of the vocabulary in chars\n",
"vocab_size = len(vocab)\n",
"\n",
"# The embedding dimension\n",
"embedding_dim = 256\n",
"\n",
"# Number of RNN units\n",
"rnn_units = 1024"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "YPpI7YF-7sKh",
"colab_type": "text"
},
"source": [
"`return_sequences=True` parameter:\n",
"We want the GRU to output a full sequence (one vector per time step), not just a single final vector as with an ordinary feed-forward network, so `return_sequences` must be set to True.\n",
"\n",
"SRC: https://chunml.github.io/ChunML.github.io/project/Creating-Text-Generator-Using-Recurrent-Neural-Network/"
]
},
{
"cell_type": "code",
"metadata": {
"id": "H125tLwNSufD",
"colab_type": "code",
"colab": {}
},
"source": [
"def build_model(vocab_size, embedding_dim, rnn_units, batch_size):\n",
" model = tf.keras.Sequential([\n",
" tf.keras.layers.Embedding(vocab_size, embedding_dim,\n",
" batch_input_shape=[batch_size, None]),\n",
" tf.keras.layers.GRU(rnn_units,\n",
" return_sequences=True,\n",
" stateful=True,\n",
" recurrent_initializer='glorot_uniform'),\n",
" tf.keras.layers.Dense(vocab_size)\n",
" ])\n",
" return model"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "xgezx5MpSxN0",
"colab_type": "code",
"colab": {}
},
"source": [
"model = build_model(\n",
" vocab_size = len(vocab),\n",
" embedding_dim=embedding_dim,\n",
" rnn_units=rnn_units,\n",
" batch_size=BATCH_SIZE)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "MSKHPTIpS3V0",
"colab_type": "text"
},
"source": [
"For each character the model looks up the embedding, runs the GRU one timestep with the embedding as input, and applies the dense layer to generate logits predicting the log-likelihood of the next character:"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gYNmA9OKS-S_",
"colab_type": "text"
},
"source": [
"### Try the model\n",
"\n",
"Now run the model to see that it behaves as expected.\n",
"\n",
"First check the shape of the output:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "-vniLs5RSzxw",
"colab_type": "code",
"outputId": "24ea7fa6-cec8-42ad-81ca-3702212b296c",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"for input_example_batch, target_example_batch in dataset.take(1):\n",
" example_batch_predictions = model(input_example_batch)\n",
" print(example_batch_predictions.shape, \"# (batch_size, sequence_length, vocab_size)\")"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"(64, 100, 82) # (batch_size, sequence_length, vocab_size)\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "JuPmQhr_TdUG",
"colab_type": "code",
"outputId": "86566a74-497a-4924-db24-deb82011a902",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 265
}
},
"source": [
"model.summary()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Model: \"sequential\"\n",
"_________________________________________________________________\n",
"Layer (type) Output Shape Param # \n",
"=================================================================\n",
"embedding (Embedding) (64, None, 256) 20992 \n",
"_________________________________________________________________\n",
"gru (GRU) (64, None, 1024) 3935232 \n",
"_________________________________________________________________\n",
"dense (Dense) (64, None, 82) 84050 \n",
"=================================================================\n",
"Total params: 4,040,274\n",
"Trainable params: 4,040,274\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XWBhqkKyTn2Q",
"colab_type": "text"
},
"source": [
"To get actual predictions from the model, we need to sample from the output distribution to obtain concrete character indices. This distribution is defined by the logits over the character vocabulary.\n",
"\n",
"Note: It is important to sample from this distribution as taking the argmax of the distribution can easily get the model stuck in a loop.\n",
"\n",
"Try it for the first example in the batch:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "LNIbY0kiTjcF",
"colab_type": "code",
"colab": {}
},
"source": [
"sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)\n",
"sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "UhhO7VoiTt2o",
"colab_type": "text"
},
"source": [
"This gives us, at each timestep, a prediction of the next character index:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "nzFbjD5RTqCg",
"colab_type": "code",
"outputId": "956d1a8c-66d5-44c5-c47c-8624fc155b6c",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 123
}
},
"source": [
"sampled_indices"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([78, 43, 45, 28, 68, 73, 49, 13, 41, 9, 40, 37, 66, 18, 68, 46, 76,\n",
" 36, 64, 43, 28, 51, 43, 60, 75, 40, 6, 58, 79, 31, 16, 37, 21, 23,\n",
" 47, 53, 72, 48, 80, 30, 0, 23, 77, 22, 27, 49, 3, 48, 30, 57, 80,\n",
" 14, 24, 24, 11, 18, 17, 78, 63, 65, 24, 25, 39, 30, 23, 70, 5, 54,\n",
" 31, 69, 70, 33, 70, 33, 71, 41, 72, 39, 34, 4, 42, 51, 38, 61, 45,\n",
" 32, 73, 33, 55, 30, 62, 23, 57, 11, 7, 19, 71, 67, 40, 14])"
]
},
"metadata": {
"tags": []
},
"execution_count": 42
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "btiE1aDqTzDf",
"colab_type": "text"
},
"source": [
"Decode these to see the text predicted by this untrained model:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "GcJj0LP_Tv5z",
"colab_type": "code",
"outputId": "8d4ecb34-fde6-49e6-edd9-b33a841fe1bd",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 105
}
},
"source": [
"print(\"Input: \\n\", repr(\"\".join(idx2char[input_example_batch[0]])))\n",
"print()\n",
"print(\"Next Char Predictions: \\n\", repr(\"\".join(idx2char[sampled_indices ])))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Input: \n",
" \" the abysmal elvira's haunted hills which was meant to be a take off of the old roger corman movies \"\n",
"\n",
"Next Char Predictions: \n",
" '’z\\x91kíö£>x7wtêaí\\x96–sèzk¨zäüw4â“nStdf\\x97´ô¡”m f‘ej£1¡má”Agg9aT’çéghvmfñ3»nïñpñpóxôvq2y¨uå\\x91oöpÁmæfá95bóëwA'\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3SdEY3RgT5gx",
"colab_type": "text"
},
"source": [
"### Train the model"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "v7S5Ebp7T-FY",
"colab_type": "text"
},
"source": [
"At this point the problem can be treated as a standard classification problem: given the previous RNN state and the input at this time step, predict the class of the next character."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Th93sjCuUBEr",
"colab_type": "text"
},
"source": [
"#### Attach an optimizer, and a loss function"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3fC-_uzAUJS8",
"colab_type": "text"
},
"source": [
"The standard `tf.keras.losses.sparse_categorical_crossentropy` loss function works in this case because it is applied across the last dimension of the predictions.\n",
"\n",
"Because our model returns logits, we need to set the `from_logits` flag."
]
},
{
"cell_type": "code",
"metadata": {
"id": "0W3WiGYDT1qu",
"colab_type": "code",
"outputId": "180d3eb1-4669-407b-8ff0-21fc0eed2651",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 52
}
},
"source": [
"def loss(labels, logits):\n",
" return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)\n",
"\n",
"example_batch_loss = loss(target_example_batch, example_batch_predictions)\n",
"print(\"Prediction shape: \", example_batch_predictions.shape, \" # (batch_size, sequence_length, vocab_size)\")\n",
"print(\"scalar_loss: \", example_batch_loss.numpy().mean())"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Prediction shape: (64, 100, 82) # (batch_size, sequence_length, vocab_size)\n",
"scalar_loss: 4.408759\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BMc8aqSmUdEI",
"colab_type": "text"
},
"source": [
"Configure the training procedure using the `tf.keras.Model.compile` method. We'll use `tf.keras.optimizers.Adam` with default arguments and the loss function."
]
},
{
"cell_type": "code",
"metadata": {
"id": "qt93_E8UUK7b",
"colab_type": "code",
"colab": {}
},
"source": [
"model.compile(optimizer='adam', loss=loss)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "uuKHksqYUiOn",
"colab_type": "text"
},
"source": [
"#### Configure checkpoints"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oXmVYgWOUnBG",
"colab_type": "text"
},
"source": [
"Use a `tf.keras.callbacks.ModelCheckpoint` to ensure that checkpoints are saved during training:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "mOQ_MLJbUfQ9",
"colab_type": "code",
"colab": {}
},
"source": [
"# Directory where the checkpoints will be saved\n",
"checkpoint_dir = './training_checkpoints'\n",
"# Name of the checkpoint files\n",
"checkpoint_prefix = os.path.join(checkpoint_dir, \"ckpt_{epoch}\")\n",
"\n",
"checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(\n",
" filepath=checkpoint_prefix,\n",
" save_weights_only=True)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Mc-JWlWrUwRM",
"colab_type": "text"
},
"source": [
"#### Execute the training"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GzFOWOeeU0OQ",
"colab_type": "text"
},
"source": [
"To keep training time reasonable, train for just a few epochs (2 here rather than the usual 10). In Colab, set the runtime to GPU for faster training."
]
},
{
"cell_type": "code",
"metadata": {
"id": "akrVbRJOUr9w",
"colab_type": "code",
"colab": {}
},
"source": [
"EPOCHS = 2  # set to 2 (rather than 10) to save training time"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "rWO_my8TU35l",
"colab_type": "code",
"outputId": "577956e2-c848-45b2-acb3-f614fb806a65",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 161
}
},
"source": [
"# use checkpointed model instead of running again\n",
"#history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Epoch 1/2\n",
"WARNING:tensorflow:From /tensorflow-1.15.0/python3.6/tensorflow_core/python/ops/math_grad.py:1424: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.\n",
"Instructions for updating:\n",
"Use tf.where in 2.0, which has the same broadcast rule as np.where\n",
"251/251 [==============================] - 2253s 9s/step - loss: 2.3385\n",
"Epoch 2/2\n",
"251/251 [==============================] - 2245s 9s/step - loss: 1.7205\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "p1rkAtXsVBCA",
"colab_type": "text"
},
"source": [
"#### Downloading Checkpoints"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Ve8a8IrRVF6P",
"colab_type": "text"
},
"source": [
"Download the weights and checkpoint index files (corresponding to the 2 training epochs)."
]
},
{
"cell_type": "code",
"metadata": {
"id": "hfboc9zKVfPM",
"colab_type": "code",
"colab": {}
},
"source": [
"files.download('/content/training_checkpoints/checkpoint')"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "oFGwCdcBVINW",
"colab_type": "text"
},
"source": [
"Index File for Checkpoint 1"
]
},
{
"cell_type": "code",
"metadata": {
"id": "42a1zyeEVKer",
"colab_type": "code",
"colab": {}
},
"source": [
"files.download('./training_checkpoints/ckpt_1.index')"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "tf5XB79BVNkb",
"colab_type": "text"
},
"source": [
"Index File for Checkpoint 2"
]
},
{
"cell_type": "code",
"metadata": {
"id": "7lXch-rlVM3o",
"colab_type": "code",
"colab": {}
},
"source": [
"files.download('./training_checkpoints/ckpt_2.index')"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "owm9Ja49VVt4",
"colab_type": "text"
},
"source": [
"Weights for Checkpoint 1"
]
},
{
"cell_type": "code",
"metadata": {
"id": "5xE96QvQVX7i",
"colab_type": "code",
"colab": {}
},
"source": [
"files.download('/content/training_checkpoints/ckpt_1.data-00000-of-00001')"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "3fNP0itcVadq",
"colab_type": "text"
},
"source": [
"Weights for Checkpoint 2"
]
},
{
"cell_type": "code",
"metadata": {
"id": "yXhNlsuPVc3L",
"colab_type": "code",
"colab": {}
},
"source": [
"files.download('./training_checkpoints/ckpt_2.data-00000-of-00001')"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "GkmaXK4RUH4G",
"colab_type": "text"
},
"source": [
"## 3.5 Use the Model"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Yaj1HrPvIPGC",
"colab_type": "text"
},
"source": [
"### Upload Checkpoint Files"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WaUtkW7DISSF",
"colab_type": "text"
},
"source": [
"Upload the index and weights files from a previous run of the model."
]
},
{
"cell_type": "code",
"metadata": {
"id": "Li-aZHv5IW12",
"colab_type": "code",
"outputId": "5e1eb465-beef-4b04-9341-a327ea01b0ff",
"colab": {
"resources": {
"http://localhost:8080/nbextensions/google.colab/files.js": {
"data": "Ly8gQ29weXJpZ2h0IDIwMTcgR29vZ2xlIExMQwovLwovLyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKLy8geW91IG1heSBub3QgdXNlIHRoaXMgZmlsZSBleGNlcHQgaW4gY29tcGxpYW5jZSB3aXRoIHRoZSBMaWNlbnNlLgovLyBZb3UgbWF5IG9idGFpbiBhIGNvcHkgb2YgdGhlIExpY2Vuc2UgYXQKLy8KLy8gICAgICBodHRwOi8vd3d3LmFwYWNoZS5vcmcvbGljZW5zZXMvTElDRU5TRS0yLjAKLy8KLy8gVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQovLyBkaXN0cmlidXRlZCB1bmRlciB0aGUgTGljZW5zZSBpcyBkaXN0cmlidXRlZCBvbiBhbiAiQVMgSVMiIEJBU0lTLAovLyBXSVRIT1VUIFdBUlJBTlRJRVMgT1IgQ09ORElUSU9OUyBPRiBBTlkgS0lORCwgZWl0aGVyIGV4cHJlc3Mgb3IgaW1wbGllZC4KLy8gU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAovLyBsaW1pdGF0aW9ucyB1bmRlciB0aGUgTGljZW5zZS4KCi8qKgogKiBAZmlsZW92ZXJ2aWV3IEhlbHBlcnMgZm9yIGdvb2dsZS5jb2xhYiBQeXRob24gbW9kdWxlLgogKi8KKGZ1bmN0aW9uKHNjb3BlKSB7CmZ1bmN0aW9uIHNwYW4odGV4dCwgc3R5bGVBdHRyaWJ1dGVzID0ge30pIHsKICBjb25zdCBlbGVtZW50ID0gZG9jdW1lbnQuY3JlYXRlRWxlbWVudCgnc3BhbicpOwogIGVsZW1lbnQudGV4dENvbnRlbnQgPSB0ZXh0OwogIGZvciAoY29uc3Qga2V5IG9mIE9iamVjdC5rZXlzKHN0eWxlQXR0cmlidXRlcykpIHsKICAgIGVsZW1lbnQuc3R5bGVba2V5XSA9IHN0eWxlQXR0cmlidXRlc1trZXldOwogIH0KICByZXR1cm4gZWxlbWVudDsKfQoKLy8gTWF4IG51bWJlciBvZiBieXRlcyB3aGljaCB3aWxsIGJlIHVwbG9hZGVkIGF0IGEgdGltZS4KY29uc3QgTUFYX1BBWUxPQURfU0laRSA9IDEwMCAqIDEwMjQ7Ci8vIE1heCBhbW91bnQgb2YgdGltZSB0byBibG9jayB3YWl0aW5nIGZvciB0aGUgdXNlci4KY29uc3QgRklMRV9DSEFOR0VfVElNRU9VVF9NUyA9IDMwICogMTAwMDsKCmZ1bmN0aW9uIF91cGxvYWRGaWxlcyhpbnB1dElkLCBvdXRwdXRJZCkgewogIGNvbnN0IHN0ZXBzID0gdXBsb2FkRmlsZXNTdGVwKGlucHV0SWQsIG91dHB1dElkKTsKICBjb25zdCBvdXRwdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQob3V0cHV0SWQpOwogIC8vIENhY2hlIHN0ZXBzIG9uIHRoZSBvdXRwdXRFbGVtZW50IHRvIG1ha2UgaXQgYXZhaWxhYmxlIGZvciB0aGUgbmV4dCBjYWxsCiAgLy8gdG8gdXBsb2FkRmlsZXNDb250aW51ZSBmcm9tIFB5dGhvbi4KICBvdXRwdXRFbGVtZW50LnN0ZXBzID0gc3RlcHM7CgogIHJldHVybiBfdXBsb2FkRmlsZXNDb250aW51ZShvdXRwdXRJZCk7Cn0KCi8vIFRoaXMgaXMgcm91Z2hseSBhbiBhc3luYyBnZW5lcmF
0b3IgKG5vdCBzdXBwb3J0ZWQgaW4gdGhlIGJyb3dzZXIgeWV0KSwKLy8gd2hlcmUgdGhlcmUgYXJlIG11bHRpcGxlIGFzeW5jaHJvbm91cyBzdGVwcyBhbmQgdGhlIFB5dGhvbiBzaWRlIGlzIGdvaW5nCi8vIHRvIHBvbGwgZm9yIGNvbXBsZXRpb24gb2YgZWFjaCBzdGVwLgovLyBUaGlzIHVzZXMgYSBQcm9taXNlIHRvIGJsb2NrIHRoZSBweXRob24gc2lkZSBvbiBjb21wbGV0aW9uIG9mIGVhY2ggc3RlcCwKLy8gdGhlbiBwYXNzZXMgdGhlIHJlc3VsdCBvZiB0aGUgcHJldmlvdXMgc3RlcCBhcyB0aGUgaW5wdXQgdG8gdGhlIG5leHQgc3RlcC4KZnVuY3Rpb24gX3VwbG9hZEZpbGVzQ29udGludWUob3V0cHV0SWQpIHsKICBjb25zdCBvdXRwdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQob3V0cHV0SWQpOwogIGNvbnN0IHN0ZXBzID0gb3V0cHV0RWxlbWVudC5zdGVwczsKCiAgY29uc3QgbmV4dCA9IHN0ZXBzLm5leHQob3V0cHV0RWxlbWVudC5sYXN0UHJvbWlzZVZhbHVlKTsKICByZXR1cm4gUHJvbWlzZS5yZXNvbHZlKG5leHQudmFsdWUucHJvbWlzZSkudGhlbigodmFsdWUpID0+IHsKICAgIC8vIENhY2hlIHRoZSBsYXN0IHByb21pc2UgdmFsdWUgdG8gbWFrZSBpdCBhdmFpbGFibGUgdG8gdGhlIG5leHQKICAgIC8vIHN0ZXAgb2YgdGhlIGdlbmVyYXRvci4KICAgIG91dHB1dEVsZW1lbnQubGFzdFByb21pc2VWYWx1ZSA9IHZhbHVlOwogICAgcmV0dXJuIG5leHQudmFsdWUucmVzcG9uc2U7CiAgfSk7Cn0KCi8qKgogKiBHZW5lcmF0b3IgZnVuY3Rpb24gd2hpY2ggaXMgY2FsbGVkIGJldHdlZW4gZWFjaCBhc3luYyBzdGVwIG9mIHRoZSB1cGxvYWQKICogcHJvY2Vzcy4KICogQHBhcmFtIHtzdHJpbmd9IGlucHV0SWQgRWxlbWVudCBJRCBvZiB0aGUgaW5wdXQgZmlsZSBwaWNrZXIgZWxlbWVudC4KICogQHBhcmFtIHtzdHJpbmd9IG91dHB1dElkIEVsZW1lbnQgSUQgb2YgdGhlIG91dHB1dCBkaXNwbGF5LgogKiBAcmV0dXJuIHshSXRlcmFibGU8IU9iamVjdD59IEl0ZXJhYmxlIG9mIG5leHQgc3RlcHMuCiAqLwpmdW5jdGlvbiogdXBsb2FkRmlsZXNTdGVwKGlucHV0SWQsIG91dHB1dElkKSB7CiAgY29uc3QgaW5wdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQoaW5wdXRJZCk7CiAgaW5wdXRFbGVtZW50LmRpc2FibGVkID0gZmFsc2U7CgogIGNvbnN0IG91dHB1dEVsZW1lbnQgPSBkb2N1bWVudC5nZXRFbGVtZW50QnlJZChvdXRwdXRJZCk7CiAgb3V0cHV0RWxlbWVudC5pbm5lckhUTUwgPSAnJzsKCiAgY29uc3QgcGlja2VkUHJvbWlzZSA9IG5ldyBQcm9taXNlKChyZXNvbHZlKSA9PiB7CiAgICBpbnB1dEVsZW1lbnQuYWRkRXZlbnRMaXN0ZW5lcignY2hhbmdlJywgKGUpID0+IHsKICAgICAgcmVzb2x2ZShlLnRhcmdldC5maWxlcyk7CiAgICB9KTsKICB9KTsKCiAgY29uc3QgY2FuY2VsID0gZG9jdW1lbnQuY3JlYXRlRWxlbWVudCgnYnV0dG9uJyk7CiAgaW5wdXRFbGVtZW50LnBhcmVudEVsZW1lbnQ
uYXBwZW5kQ2hpbGQoY2FuY2VsKTsKICBjYW5jZWwudGV4dENvbnRlbnQgPSAnQ2FuY2VsIHVwbG9hZCc7CiAgY29uc3QgY2FuY2VsUHJvbWlzZSA9IG5ldyBQcm9taXNlKChyZXNvbHZlKSA9PiB7CiAgICBjYW5jZWwub25jbGljayA9ICgpID0+IHsKICAgICAgcmVzb2x2ZShudWxsKTsKICAgIH07CiAgfSk7CgogIC8vIENhbmNlbCB1cGxvYWQgaWYgdXNlciBoYXNuJ3QgcGlja2VkIGFueXRoaW5nIGluIHRpbWVvdXQuCiAgY29uc3QgdGltZW91dFByb21pc2UgPSBuZXcgUHJvbWlzZSgocmVzb2x2ZSkgPT4gewogICAgc2V0VGltZW91dCgoKSA9PiB7CiAgICAgIHJlc29sdmUobnVsbCk7CiAgICB9LCBGSUxFX0NIQU5HRV9USU1FT1VUX01TKTsKICB9KTsKCiAgLy8gV2FpdCBmb3IgdGhlIHVzZXIgdG8gcGljayB0aGUgZmlsZXMuCiAgY29uc3QgZmlsZXMgPSB5aWVsZCB7CiAgICBwcm9taXNlOiBQcm9taXNlLnJhY2UoW3BpY2tlZFByb21pc2UsIHRpbWVvdXRQcm9taXNlLCBjYW5jZWxQcm9taXNlXSksCiAgICByZXNwb25zZTogewogICAgICBhY3Rpb246ICdzdGFydGluZycsCiAgICB9CiAgfTsKCiAgaWYgKCFmaWxlcykgewogICAgcmV0dXJuIHsKICAgICAgcmVzcG9uc2U6IHsKICAgICAgICBhY3Rpb246ICdjb21wbGV0ZScsCiAgICAgIH0KICAgIH07CiAgfQoKICBjYW5jZWwucmVtb3ZlKCk7CgogIC8vIERpc2FibGUgdGhlIGlucHV0IGVsZW1lbnQgc2luY2UgZnVydGhlciBwaWNrcyBhcmUgbm90IGFsbG93ZWQuCiAgaW5wdXRFbGVtZW50LmRpc2FibGVkID0gdHJ1ZTsKCiAgZm9yIChjb25zdCBmaWxlIG9mIGZpbGVzKSB7CiAgICBjb25zdCBsaSA9IGRvY3VtZW50LmNyZWF0ZUVsZW1lbnQoJ2xpJyk7CiAgICBsaS5hcHBlbmQoc3BhbihmaWxlLm5hbWUsIHtmb250V2VpZ2h0OiAnYm9sZCd9KSk7CiAgICBsaS5hcHBlbmQoc3BhbigKICAgICAgICBgKCR7ZmlsZS50eXBlIHx8ICduL2EnfSkgLSAke2ZpbGUuc2l6ZX0gYnl0ZXMsIGAgKwogICAgICAgIGBsYXN0IG1vZGlmaWVkOiAkewogICAgICAgICAgICBmaWxlLmxhc3RNb2RpZmllZERhdGUgPyBmaWxlLmxhc3RNb2RpZmllZERhdGUudG9Mb2NhbGVEYXRlU3RyaW5nKCkgOgogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAnbi9hJ30gLSBgKSk7CiAgICBjb25zdCBwZXJjZW50ID0gc3BhbignMCUgZG9uZScpOwogICAgbGkuYXBwZW5kQ2hpbGQocGVyY2VudCk7CgogICAgb3V0cHV0RWxlbWVudC5hcHBlbmRDaGlsZChsaSk7CgogICAgY29uc3QgZmlsZURhdGFQcm9taXNlID0gbmV3IFByb21pc2UoKHJlc29sdmUpID0+IHsKICAgICAgY29uc3QgcmVhZGVyID0gbmV3IEZpbGVSZWFkZXIoKTsKICAgICAgcmVhZGVyLm9ubG9hZCA9IChlKSA9PiB7CiAgICAgICAgcmVzb2x2ZShlLnRhcmdldC5yZXN1bHQpOwogICAgICB9OwogICAgICByZWFkZXIucmVhZEFzQXJyYXlCdWZmZXIoZmlsZSk7CiAgICB9KTsKICAgIC8vIFdhaXQgZm9yIHRoZSBkYXRhIHRvIGJlIHJ
lYWR5LgogICAgbGV0IGZpbGVEYXRhID0geWllbGQgewogICAgICBwcm9taXNlOiBmaWxlRGF0YVByb21pc2UsCiAgICAgIHJlc3BvbnNlOiB7CiAgICAgICAgYWN0aW9uOiAnY29udGludWUnLAogICAgICB9CiAgICB9OwoKICAgIC8vIFVzZSBhIGNodW5rZWQgc2VuZGluZyB0byBhdm9pZCBtZXNzYWdlIHNpemUgbGltaXRzLiBTZWUgYi82MjExNTY2MC4KICAgIGxldCBwb3NpdGlvbiA9IDA7CiAgICB3aGlsZSAocG9zaXRpb24gPCBmaWxlRGF0YS5ieXRlTGVuZ3RoKSB7CiAgICAgIGNvbnN0IGxlbmd0aCA9IE1hdGgubWluKGZpbGVEYXRhLmJ5dGVMZW5ndGggLSBwb3NpdGlvbiwgTUFYX1BBWUxPQURfU0laRSk7CiAgICAgIGNvbnN0IGNodW5rID0gbmV3IFVpbnQ4QXJyYXkoZmlsZURhdGEsIHBvc2l0aW9uLCBsZW5ndGgpOwogICAgICBwb3NpdGlvbiArPSBsZW5ndGg7CgogICAgICBjb25zdCBiYXNlNjQgPSBidG9hKFN0cmluZy5mcm9tQ2hhckNvZGUuYXBwbHkobnVsbCwgY2h1bmspKTsKICAgICAgeWllbGQgewogICAgICAgIHJlc3BvbnNlOiB7CiAgICAgICAgICBhY3Rpb246ICdhcHBlbmQnLAogICAgICAgICAgZmlsZTogZmlsZS5uYW1lLAogICAgICAgICAgZGF0YTogYmFzZTY0LAogICAgICAgIH0sCiAgICAgIH07CiAgICAgIHBlcmNlbnQudGV4dENvbnRlbnQgPQogICAgICAgICAgYCR7TWF0aC5yb3VuZCgocG9zaXRpb24gLyBmaWxlRGF0YS5ieXRlTGVuZ3RoKSAqIDEwMCl9JSBkb25lYDsKICAgIH0KICB9CgogIC8vIEFsbCBkb25lLgogIHlpZWxkIHsKICAgIHJlc3BvbnNlOiB7CiAgICAgIGFjdGlvbjogJ2NvbXBsZXRlJywKICAgIH0KICB9Owp9CgpzY29wZS5nb29nbGUgPSBzY29wZS5nb29nbGUgfHwge307CnNjb3BlLmdvb2dsZS5jb2xhYiA9IHNjb3BlLmdvb2dsZS5jb2xhYiB8fCB7fTsKc2NvcGUuZ29vZ2xlLmNvbGFiLl9maWxlcyA9IHsKICBfdXBsb2FkRmlsZXMsCiAgX3VwbG9hZEZpbGVzQ29udGludWUsCn07Cn0pKHNlbGYpOwo=",
"ok": true,
"headers": [
[
"content-type",
"application/javascript"
]
],
"status": 200,
"status_text": ""
}
},
"base_uri": "https://localhost:8080/",
"height": 57
}
},
"source": [
"#files.upload()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/html": [
"\n",
" <input type=\"file\" id=\"files-7c5e31c6-aab6-4375-a844-f3bfa1453738\" name=\"files[]\" multiple disabled />\n",
" <output id=\"result-7c5e31c6-aab6-4375-a844-f3bfa1453738\">\n",
" Upload widget is only available when the cell has been executed in the\n",
" current browser session. Please rerun this cell to enable.\n",
" </output>\n",
" <script src=\"/nbextensions/google.colab/files.js\"></script> "
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {
"tags": []
}
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"{}"
]
},
"metadata": {
"tags": []
},
"execution_count": 48
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4UoCMDw1uMv6",
"colab_type": "text"
},
"source": [
"### Restore the latest checkpoint"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2Sv8IkPUuPp1",
"colab_type": "text"
},
"source": [
"To keep this prediction step simple, use a batch size of 1.\n",
"\n",
"Because of the way the RNN state is passed from timestep to timestep, the model only accepts a fixed batch size once built.\n",
"\n",
"To run the model with a different `batch_size`, we need to rebuild the model and restore the weights from the checkpoint.\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "nVofydOIuS4Q",
"colab_type": "code",
"outputId": "75e2e083-dec7-4822-a3ff-55a1013bbaf9",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"tf.train.latest_checkpoint(checkpoint_dir)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'./training_checkpoints/ckpt_2'"
]
},
"metadata": {
"tags": []
},
"execution_count": 52
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "snQ_XSox9Oco",
"colab_type": "text"
},
"source": [
"we define the network in exactly the same way, except the network weights are loaded from a checkpoint file and the network does not need to be trained."
]
},
{
"cell_type": "code",
"metadata": {
"id": "0Gkufyb3uTPL",
"colab_type": "code",
"colab": {}
},
"source": [
"model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)\n",
"\n",
"model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))\n",
"\n",
"model.build(tf.TensorShape([1, None]))"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "O330SNLruWor",
"colab_type": "code",
"outputId": "eaed1289-93a8-4ecd-f7b7-fb971d957e84",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 265
}
},
"source": [
"model.summary()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Model: \"sequential_4\"\n",
"_________________________________________________________________\n",
"Layer (type) Output Shape Param # \n",
"=================================================================\n",
"embedding_4 (Embedding) (1, None, 256) 20992 \n",
"_________________________________________________________________\n",
"gru_4 (GRU) (1, None, 1024) 3935232 \n",
"_________________________________________________________________\n",
"dense_4 (Dense) (1, None, 82) 84050 \n",
"=================================================================\n",
"Total params: 4,040,274\n",
"Trainable params: 4,040,274\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iqcTslpf-upI",
"colab_type": "text"
},
"source": [
"### Use TF-IDF to built list of seeds"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xCwabuNbBR64",
"colab_type": "text"
},
"source": [
"remove < START > and space from the start of review strings"
]
},
{
"cell_type": "code",
"metadata": {
"id": "EZmy-NdlAvaV",
"colab_type": "code",
"colab": {}
},
"source": [
"l_temp = [x.replace(\"<START>\", '') for x in l_strings] # remove <START> from the beginning\n",
"l_temp2 = [x.lstrip() for x in l_temp] # remove space from the start of string\n",
"#corpus = ' '.join(l_temp2)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "bdwbr_-wBJfl",
"colab_type": "code",
"outputId": "7ef3ad48-241f-40e1-b416-57b0caddeee2",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 90
}
},
"source": [
"l_temp2[:3]"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['actually this movie was not so bad it contains action comedy and excitement there are good actors in this film for instance doug hutchison percy from the green mile who plays bristol another well known actor is jamie kennedy from scream and three kings the main characters are played by jamie foxx as alvin who was pretty good and also funny but the one who most surprised me was david morse as edgar clenteen he plays a different character than he usually does because in other films like the green mile indian runner the negotiator or the langoliers he plays a very sympathetic person and in bait the plays almost the opposite a man without any emotions which was nice to see the only really negative thing about this film are the several pictures of the world trade center which makes this film perhaps look a little dated overall i thought this was a pretty good little film',\n",
" \"the matador is a strange film its main character julian played with an unusual mix of charm and unbalance by brosnan is not your typical hero julian is a hit man who is experiencing a late mid life crises having spent 22 years in the profession of cold blooded murder he now finds himself stressed out and desperately lonely and so after a chance meeting at a bar with danny greg kinnear he latches on and begins a halting awkward friendship danny the quintessential nice guy is dealing with some stuff in his own life and truth be told could use a friend as well the two make an unexpected connection and danny sticks around to hear julian's story even after learning the unsavory truth about julian's work br br matador approaches a subject not completely unheard of in cinema the anti hero assassin films like 'assassins' and 'grosse pointe blank' come to mind but matador differs in several key ways first of all the killing and gore is implied but never really shown in any detail meaning that if you are an action movie buff looking for an adrenaline rush this movie will probably disappoint you and second unlike most anti hero films matador makes no attempt to show remorse and redemption from its main character julian's job is simply presented as an 'it is what it is' kind of thing this is unusual given that 99 99 of us would consider killing for money horrific and yet this unorthodox approach is perhaps what makes the film feel authentic although we don't like to admit it almost anything could become mundane after we did it long enough maybe even murder did julian's victims deserve to die who is paying to have people killed who knows the movie never deals with these questions the focus is on julian and his stumbling shuffle into a genuine friendship if you read about someone like julian in the paper you would have a passing thought that people like him should be ripped out of society like a cancer but forced to watch his life you are drawn in by his intense humanity 
sympathy for the devil i guess br br brosnan's take on julian is well done and deeply unsettling he doesn't completely divorce himself from his james bond good looks and smooth charm but rather just adds disturbing quirks into the mix weird or crude remarks in the middle polite conversations and sudden shifts from suave charm to childish tantrums and sad desperate pleas for acceptance it keeps you guessing about his grasp on his sanity and how it will affect those around him it's a bit like listening to a piano player that occasionally and unexpectedly hits a wrong note while he plays but it works the films only other major role that of danny is not nearly as meaty kinnear turns in a solid if unspectacular performance as a regular joe with a regular joe life and problems br br the film doesn't really have any huge shocks or m night shyamalan twists but i wasn't able to guess the ending and it felt satisfying it doesn't have any deep philosophical or spiritual insights and yet it felt very human and it didn't have any heart pounding car chases or gun battles and yet i thought the pacing was well done and i was never bored maybe the only real message here is about the human need to reach out and make connections with one another and how those needs have no moral prerequisites even a murderer needs friends and even good people can be friends with bad people it's a comment on the strange random world we live in a good film worth seeing\",\n",
" \"i think this movie had to be fun to make it for us it was fun to watch it the actors look like they have a fun time my girlfriends like the boy actors and my boyfriends like the girl actors not very much do we get to have crazy fun with a movie that is horror make i see a lot of scary movies and i would watch this one all together once more or more because we laugh together if this actors make other scary movies i will watch them the grander mad man thats chase to kill the actors is very much a good bad man he make us laugh together the most i would give this movie a high score if you ask me br br i don't know if the market has any more of the movies with the actors but the main boy is cute the actor with the grand chest has to be not real they doesn't look to real\"]"
]
},
"metadata": {
"tags": []
},
"execution_count": 105
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iVrQ2It8vB5C",
"colab_type": "text"
},
"source": [
"Instead of picking random indexes from the int to character dictionary, let's use the data from reduced/sampled pos review"
]
},
{
"cell_type": "code",
"metadata": {
"id": "KAZnNPnlukBI",
"colab_type": "code",
"colab": {}
},
"source": [
"pos_rev_vectorizer = TfidfVectorizer()\n"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "SNk6aWAwAxtn",
"colab_type": "code",
"colab": {}
},
"source": [
"corpus = l_temp2\n",
"\n",
"# do some cleaning up\n",
"import re\n",
"def pre_process(text):\n",
" \n",
" # lowercase\n",
" text=text.lower()\n",
" \n",
" #remove tags\n",
" text=re.sub(\"<!--?.*?-->\",\"\",text)\n",
" \n",
" # remove special characters and digits\n",
" text=re.sub(\"(\\\\d|\\\\W)+\",\" \",text)\n",
"\n",
" # there are words with accents appearing - like über','émigré', 'æsthetic\n",
" text = unicodedata.normalize(\"NFD\", text)\n",
" text = re.sub(\"[\\u0300-\\u036f]\", \"\", text)\n",
" \n",
" return text"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Tv8UXUULLRry",
"colab_type": "code",
"colab": {}
},
"source": [
"corpus = (list(map(pre_process, corpus)))"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "yhydqOmfL-v2",
"colab_type": "code",
"outputId": "39adf17b-76fe-4182-d329-21b2ca7cdd2d",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 90
}
},
"source": [
"corpus[0:3]"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['actually this movie was not so bad it contains action comedy and excitement there are good actors in this film for instance doug hutchison percy from the green mile who plays bristol another well known actor is jamie kennedy from scream and three kings the main characters are played by jamie foxx as alvin who was pretty good and also funny but the one who most surprised me was david morse as edgar clenteen he plays a different character than he usually does because in other films like the green mile indian runner the negotiator or the langoliers he plays a very sympathetic person and in bait the plays almost the opposite a man without any emotions which was nice to see the only really negative thing about this film are the several pictures of the world trade center which makes this film perhaps look a little dated overall i thought this was a pretty good little film',\n",
" 'the matador is a strange film its main character julian played with an unusual mix of charm and unbalance by brosnan is not your typical hero julian is a hit man who is experiencing a late mid life crises having spent years in the profession of cold blooded murder he now finds himself stressed out and desperately lonely and so after a chance meeting at a bar with danny greg kinnear he latches on and begins a halting awkward friendship danny the quintessential nice guy is dealing with some stuff in his own life and truth be told could use a friend as well the two make an unexpected connection and danny sticks around to hear julian s story even after learning the unsavory truth about julian s work br br matador approaches a subject not completely unheard of in cinema the anti hero assassin films like assassins and grosse pointe blank come to mind but matador differs in several key ways first of all the killing and gore is implied but never really shown in any detail meaning that if you are an action movie buff looking for an adrenaline rush this movie will probably disappoint you and second unlike most anti hero films matador makes no attempt to show remorse and redemption from its main character julian s job is simply presented as an it is what it is kind of thing this is unusual given that of us would consider killing for money horrific and yet this unorthodox approach is perhaps what makes the film feel authentic although we don t like to admit it almost anything could become mundane after we did it long enough maybe even murder did julian s victims deserve to die who is paying to have people killed who knows the movie never deals with these questions the focus is on julian and his stumbling shuffle into a genuine friendship if you read about someone like julian in the paper you would have a passing thought that people like him should be ripped out of society like a cancer but forced to watch his life you are drawn in by his intense humanity sympathy for the 
devil i guess br br brosnan s take on julian is well done and deeply unsettling he doesn t completely divorce himself from his james bond good looks and smooth charm but rather just adds disturbing quirks into the mix weird or crude remarks in the middle polite conversations and sudden shifts from suave charm to childish tantrums and sad desperate pleas for acceptance it keeps you guessing about his grasp on his sanity and how it will affect those around him it s a bit like listening to a piano player that occasionally and unexpectedly hits a wrong note while he plays but it works the films only other major role that of danny is not nearly as meaty kinnear turns in a solid if unspectacular performance as a regular joe with a regular joe life and problems br br the film doesn t really have any huge shocks or m night shyamalan twists but i wasn t able to guess the ending and it felt satisfying it doesn t have any deep philosophical or spiritual insights and yet it felt very human and it didn t have any heart pounding car chases or gun battles and yet i thought the pacing was well done and i was never bored maybe the only real message here is about the human need to reach out and make connections with one another and how those needs have no moral prerequisites even a murderer needs friends and even good people can be friends with bad people it s a comment on the strange random world we live in a good film worth seeing',\n",
" 'i think this movie had to be fun to make it for us it was fun to watch it the actors look like they have a fun time my girlfriends like the boy actors and my boyfriends like the girl actors not very much do we get to have crazy fun with a movie that is horror make i see a lot of scary movies and i would watch this one all together once more or more because we laugh together if this actors make other scary movies i will watch them the grander mad man thats chase to kill the actors is very much a good bad man he make us laugh together the most i would give this movie a high score if you ask me br br i don t know if the market has any more of the movies with the actors but the main boy is cute the actor with the grand chest has to be not real they doesn t look to real']"
]
},
"metadata": {
"tags": []
},
"execution_count": 60
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "O3wVwi99Kg2C",
"colab_type": "code",
"outputId": "d8e870e1-f99e-4f1a-b8cd-1100443c1f84",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"X = pos_rev_vectorizer.fit_transform(corpus)\n",
"print(pos_rev_vectorizer.get_feature_names()[0:3])"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"['aage', 'aaker', 'aames']\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "UhZCDkhYEHr5",
"colab_type": "code",
"outputId": "ae5d5183-97b3-4d00-cacb-56d16dde1aaf",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
}
},
"source": [
"# summarize\n",
"\n",
"# Getting first items of a Python 3 dict\n",
"for x in list(pos_rev_vectorizer.vocabulary_)[0:10]:\n",
" print (\"key {}, value {} \".format(x, pos_rev_vectorizer.vocabulary_[x]))\n"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"key actually, value 166 \n",
"key this, value 18094 \n",
"key movie, value 11816 \n",
"key was, value 19555 \n",
"key not, value 12282 \n",
"key so, value 16613 \n",
"key bad, value 1269 \n",
"key it, value 9516 \n",
"key contains, value 3764 \n",
"key action, value 150 \n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FHfPdRDRId7O",
"colab_type": "text"
},
"source": [
"sort the tf-idf vocublary "
]
},
{
"cell_type": "code",
"metadata": {
"id": "bEABzrphOcAq",
"colab_type": "code",
"colab": {}
},
"source": [
"popular_terms = sorted(pos_rev_vectorizer.vocabulary_, key=lambda x: (-pos_rev_vectorizer.vocabulary_[x], x))"
],
"execution_count": 0,
"outputs": []
},
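    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TfidfRankNote",
        "colab_type": "text"
      },
      "source": [
        "An alternative way to pick seeds: since `vocabulary_` only stores column indices, we can instead rank terms by how characteristic they are by averaging their TF-IDF scores over the corpus (a sketch; it reuses `X`, `np`, and `pos_rev_vectorizer` from above)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "TfidfRankCode",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# mean TF-IDF score of each term across all reviews\n",
        "mean_tfidf = np.asarray(X.mean(axis=0)).ravel()\n",
        "terms = np.array(pos_rev_vectorizer.get_feature_names())\n",
        "\n",
        "# terms sorted from most to least characteristic\n",
        "top_terms = terms[np.argsort(-mean_tfidf)]\n",
        "top_terms[:10]"
      ],
      "execution_count": 0,
      "outputs": []
    },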
{
"cell_type": "code",
"metadata": {
"id": "lsXpsSUTJuYr",
"colab_type": "code",
"outputId": "d0362bff-773f-431d-8a47-4c8854472a35",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
}
},
"source": [
"popular_terms[0:10]"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['æsthetic',\n",
" 'zues',\n",
" 'zucker',\n",
" 'zorro',\n",
" 'zooms',\n",
" 'zoom',\n",
" 'zoolander',\n",
" 'zooey',\n",
" 'zoo',\n",
" 'zone']"
]
},
"metadata": {
"tags": []
},
"execution_count": 64
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "u7ZEux0zPAOr",
"colab_type": "text"
},
"source": [
"### Generate Text "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eYEt94_lVut5",
"colab_type": "text"
},
"source": [
"\n",
"The following code block generates the text:\n",
"\n",
"* It Starts by choosing a start string, initializing the RNN state and setting the number of characters to generate.\n",
"\n",
"* Get the prediction distribution of the next character using the start string and the RNN state.\n",
"\n",
"* Then, use a categorical distribution to calculate the index of the predicted character. Use this predicted character as our next input to the model.\n",
"\n",
"* The RNN state returned by the model is fed back into the model so that it now has more context, instead than only one character. After predicting the next character, the modified RNN states are again fed back into the model, which is how it learns as it gets more context from the previously predicted characters.\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "avtqIQpyV0qX",
"colab_type": "code",
"colab": {}
},
"source": [
"def generate_text(model, start_string):\n",
" # Evaluation step (generating text using the learned model)\n",
"\n",
" # Number of characters to generate\n",
" num_generate = 1000\n",
"\n",
" # Converting our start string to numbers (vectorizing)\n",
" input_eval = [char2idx[s] for s in start_string]\n",
" input_eval = tf.expand_dims(input_eval, 0)\n",
"\n",
" # Empty string to store our results\n",
" text_generated = []\n",
"\n",
" # Low temperatures results in more predictable text.\n",
" # Higher temperatures results in more surprising text.\n",
" # Experiment to find the best setting.\n",
" temperature = 1.0\n",
"\n",
" # Here batch size == 1\n",
" model.reset_states()\n",
" for i in range(num_generate):\n",
" predictions = model(input_eval)\n",
" # remove the batch dimension\n",
" predictions = tf.squeeze(predictions, 0)\n",
"\n",
" # using a categorical distribution to predict the character returned by the model\n",
" predictions = predictions / temperature\n",
" predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()\n",
"\n",
" # We pass the predicted character as the next input to the model\n",
" # along with the previous hidden state\n",
" input_eval = tf.expand_dims([predicted_id], 0)\n",
"\n",
" text_generated.append(idx2char[predicted_id])\n",
"\n",
" return (start_string + ''.join(text_generated))"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "dHm8jZy6V7lD",
"colab_type": "text"
},
"source": [
"Driver Code"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3h1_XH-NWBVn",
"colab_type": "text"
},
"source": [
"ensure your seeding text only has small alphabets with no punctuation "
]
},
{
"cell_type": "code",
"metadata": {
"id": "-Kmo2j_vWGRB",
"colab_type": "code",
"colab": {}
},
"source": [
"print(generate_text(model, start_string=u\"romeo \"))"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "BLffjouRUkyy",
"colab_type": "text"
},
"source": [
"### Collect the Pos Review Samples"
]
},
{
"cell_type": "code",
"metadata": {
"id": "vbm-UnooPe2C",
"colab_type": "code",
"outputId": "f180db7f-9440-4c62-c532-914b7c4b9275",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 54
}
},
"source": [
"generate_text(model, start_string=popular_terms[0])"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"\"æsthetic little christ out <START> to bad br br just garn commegle movie you mase for jon first abst you gur to only twempts of the rolanke plated by not moen the trigums played carioct if was thound proaled is really now nichélox pitauscery eres sexial find to the mid the scenes this many enight and this who most flombs like this give with creasting up time 10 1 this cuneses very cheason' or sunutasy think look and even when tree have a good watch br br some wroug knows junt the cast in that his sine us it was film time timen comedy the trapt some seames whem a most mimes what hy loveliard of a but this fiam is gret attennian by fart of a prubory that his out attents pey i faund the tallis amotion hiloted but its isings menrage wessuph clau is merie with me see sex looking ith jacked fonder knowably i cand the rescrepted between jon sulurdsigetam a doucto that genng us a extrysedra frightay bad in0 get mone jaly now i've few misic deabs ghoss that want ford the will gelise bbant a part i g a \""
]
},
"metadata": {
"tags": []
},
"execution_count": 66
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "OWeHesecXHr7",
"colab_type": "code",
"colab": {}
},
"source": [
"# no of review to generate\n",
"i_reviews = 10 # to reduce time, reducing it to 10 from 5000"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "XfED4q-IQamR",
"colab_type": "code",
"colab": {}
},
"source": [
"l_reviews = [generate_text(model,seed) for seed in popular_terms[0:i_reviews]]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "NHEviWFfacqY",
"colab_type": "text"
},
"source": [
"Convert words to integers"
]
},
{
"cell_type": "code",
"metadata": {
"id": "H5Ag8UGnafh_",
"colab_type": "code",
"outputId": "77f66a41-366b-4bf5-ece8-ace4040b4612",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 54
}
},
"source": [
"l_reviews[0]"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"\"æsthetic does dall taking ara exnal humbing corminia hisways and flems day jim best play into the thoulark and really earknitely br br many first one was just of it is a vee the smandacty who like really stopotes to sele seers noth fork guttand this film tranully this chase betins of 'view well time it wear him limes of ctarsons anl thein and the evinghous great cleatis of 19a5 in the laft me find white sexiet of the does revardung friend gigle is really up but it to he sleess i have been some of justualege it would i sudgling the swormen many on the film kli't of revery fanti's was chorable scenes and preety i sthe natbeoge is sight evenyone this going of mind of every past liket but eves age i had wly as a vives out to be the one histeãy fank content out of 10 a you charafter by i wint on evential phents anso ors so randwext better rone who belie and a 7 1 1'' his see made to bloqueress atamus with nittle clide have bedinated comes masterpial blacks look mole beet aboub little at in the ulte\""
]
},
"metadata": {
"tags": []
},
"execution_count": 71
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "uXIPzBgFrfCP",
"colab_type": "code",
"colab": {}
},
"source": [
"# integer encode review\n",
"encoded_seq = [[char2idx[char] for char in review] for review in l_reviews]"
],
"execution_count": 0,
"outputs": []
},
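    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EncodeCheckNote",
        "colab_type": "text"
      },
      "source": [
        "Quick sanity check that the encoding is reversible (a sketch; it uses the `idx2char` mapping defined earlier in the notebook)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "EncodeCheckCode",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# decoding the integer sequence should reproduce the original review text\n",
        "decoded = ''.join(idx2char[i] for i in encoded_seq[0])\n",
        "assert decoded == l_reviews[0]"
      ],
      "execution_count": 0,
      "outputs": []
    },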
{
"cell_type": "markdown",
"metadata": {
"id": "IRiUbmYwa1Jw",
"colab_type": "text"
},
"source": [
"Fix the shape (np.array of list of int)"
]
},
{
"cell_type": "code",
"metadata": {
"id": "x_mskLWJtJ_j",
"colab_type": "code",
"outputId": "8c28b538-e7d5-404c-a292-e02f78762a9b",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 54
}
},
"source": [
"print(encoded_seq[0])"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"[62, 36, 37, 25, 22, 37, 26, 20, 0, 21, 32, 22, 36, 0, 21, 18, 29, 29, 0, 37, 18, 28, 26, 31, 24, 0, 18, 35, 18, 0, 22, 41, 31, 18, 29, 0, 25, 38, 30, 19, 26, 31, 24, 0, 20, 32, 35, 30, 26, 31, 26, 18, 0, 25, 26, 36, 40, 18, 42, 36, 0, 18, 31, 21, 0, 23, 29, 22, 30, 36, 0, 21, 18, 42, 0, 27, 26, 30, 0, 19, 22, 36, 37, 0, 33, 29, 18, 42, 0, 26, 31, 37, 32, 0, 37, 25, 22, 0, 37, 25, 32, 38, 29, 18, 35, 28, 0, 18, 31, 21, 0, 35, 22, 18, 29, 29, 42, 0, 22, 18, 35, 28, 31, 26, 37, 22, 29, 42, 0, 19, 35, 0, 19, 35, 0, 30, 18, 31, 42, 0, 23, 26, 35, 36, 37, 0, 32, 31, 22, 0, 40, 18, 36, 0, 27, 38, 36, 37, 0, 32, 23, 0, 26, 37, 0, 26, 36, 0, 18, 0, 39, 22, 22, 0, 37, 25, 22, 0, 36, 30, 18, 31, 21, 18, 20, 37, 42, 0, 40, 25, 32, 0, 29, 26, 28, 22, 0, 35, 22, 18, 29, 29, 42, 0, 36, 37, 32, 33, 32, 37, 22, 36, 0, 37, 32, 0, 36, 22, 29, 22, 0, 36, 22, 22, 35, 36, 0, 31, 32, 37, 25, 0, 23, 32, 35, 28, 0, 24, 38, 37, 37, 18, 31, 21, 0, 37, 25, 26, 36, 0, 23, 26, 29, 30, 0, 37, 35, 18, 31, 38, 29, 29, 42, 0, 37, 25, 26, 36, 0, 20, 25, 18, 36, 22, 0, 19, 22, 37, 26, 31, 36, 0, 32, 23, 0, 1, 39, 26, 22, 40, 0, 40, 22, 29, 29, 0, 37, 26, 30, 22, 0, 26, 37, 0, 40, 22, 18, 35, 0, 25, 26, 30, 0, 29, 26, 30, 22, 36, 0, 32, 23, 0, 20, 37, 18, 35, 36, 32, 31, 36, 0, 18, 31, 29, 0, 37, 25, 22, 26, 31, 0, 18, 31, 21, 0, 37, 25, 22, 0, 22, 39, 26, 31, 24, 25, 32, 38, 36, 0, 24, 35, 22, 18, 37, 0, 20, 29, 22, 18, 37, 26, 36, 0, 32, 23, 0, 3, 11, 18, 7, 0, 26, 31, 0, 37, 25, 22, 0, 29, 18, 23, 37, 0, 30, 22, 0, 23, 26, 31, 21, 0, 40, 25, 26, 37, 22, 0, 36, 22, 41, 26, 22, 37, 0, 32, 23, 0, 37, 25, 22, 0, 21, 32, 22, 36, 0, 35, 22, 39, 18, 35, 21, 38, 31, 24, 0, 23, 35, 26, 22, 31, 21, 0, 24, 26, 24, 29, 22, 0, 26, 36, 0, 35, 22, 18, 29, 29, 42, 0, 38, 33, 0, 19, 38, 37, 0, 26, 37, 0, 37, 32, 0, 25, 22, 0, 36, 29, 22, 22, 36, 36, 0, 26, 0, 25, 18, 39, 22, 0, 19, 22, 22, 31, 0, 36, 32, 30, 22, 0, 32, 23, 0, 27, 38, 36, 37, 38, 18, 29, 22, 24, 22, 0, 26, 37, 0, 40, 32, 38, 29, 21, 0, 26, 0, 36, 
38, 21, 24, 29, 26, 31, 24, 0, 37, 25, 22, 0, 36, 40, 32, 35, 30, 22, 31, 0, 30, 18, 31, 42, 0, 32, 31, 0, 37, 25, 22, 0, 23, 26, 29, 30, 0, 28, 29, 26, 1, 37, 0, 32, 23, 0, 35, 22, 39, 22, 35, 42, 0, 23, 18, 31, 37, 26, 1, 36, 0, 40, 18, 36, 0, 20, 25, 32, 35, 18, 19, 29, 22, 0, 36, 20, 22, 31, 22, 36, 0, 18, 31, 21, 0, 33, 35, 22, 22, 37, 42, 0, 26, 0, 36, 37, 25, 22, 0, 31, 18, 37, 19, 22, 32, 24, 22, 0, 26, 36, 0, 36, 26, 24, 25, 37, 0, 22, 39, 22, 31, 42, 32, 31, 22, 0, 37, 25, 26, 36, 0, 24, 32, 26, 31, 24, 0, 32, 23, 0, 30, 26, 31, 21, 0, 32, 23, 0, 22, 39, 22, 35, 42, 0, 33, 18, 36, 37, 0, 29, 26, 28, 22, 37, 0, 19, 38, 37, 0, 22, 39, 22, 36, 0, 18, 24, 22, 0, 26, 0, 25, 18, 21, 0, 40, 29, 42, 0, 18, 36, 0, 18, 0, 39, 26, 39, 22, 36, 0, 32, 38, 37, 0, 37, 32, 0, 19, 22, 0, 37, 25, 22, 0, 32, 31, 22, 0, 25, 26, 36, 37, 22, 59, 42, 0, 23, 18, 31, 28, 0, 20, 32, 31, 37, 22, 31, 37, 0, 32, 38, 37, 0, 32, 23, 0, 3, 2, 0, 18, 0, 42, 32, 38, 0, 20, 25, 18, 35, 18, 23, 37, 22, 35, 0, 19, 42, 0, 26, 0, 40, 26, 31, 37, 0, 32, 31, 0, 22, 39, 22, 31, 37, 26, 18, 29, 0, 33, 25, 22, 31, 37, 36, 0, 18, 31, 36, 32, 0, 32, 35, 36, 0, 36, 32, 0, 35, 18, 31, 21, 40, 22, 41, 37, 0, 19, 22, 37, 37, 22, 35, 0, 35, 32, 31, 22, 0, 40, 25, 32, 0, 19, 22, 29, 26, 22, 0, 18, 31, 21, 0, 18, 0, 9, 0, 3, 0, 3, 1, 1, 0, 25, 26, 36, 0, 36, 22, 22, 0, 30, 18, 21, 22, 0, 37, 32, 0, 19, 29, 32, 34, 38, 22, 35, 22, 36, 36, 0, 18, 37, 18, 30, 38, 36, 0, 40, 26, 37, 25, 0, 31, 26, 37, 37, 29, 22, 0, 20, 29, 26, 21, 22, 0, 25, 18, 39, 22, 0, 19, 22, 21, 26, 31, 18, 37, 22, 21, 0, 20, 32, 30, 22, 36, 0, 30, 18, 36, 37, 22, 35, 33, 26, 18, 29, 0, 19, 29, 18, 20, 28, 36, 0, 29, 32, 32, 28, 0, 30, 32, 29, 22, 0, 19, 22, 22, 37, 0, 18, 19, 32, 38, 19, 0, 29, 26, 37, 37, 29, 22, 0, 18, 37, 0, 26, 31, 0, 37, 25, 22, 0, 38, 29, 37, 22]\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "Pq9u0g9mtUbI",
"colab_type": "code",
"outputId": "54cd70e0-53ff-4ef7-b04f-638c8170b6ba",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"np.shape(encoded_seq)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(10,)"
]
},
"metadata": {
"tags": []
},
"execution_count": 77
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "3VR8AEuhteNX",
"colab_type": "code",
"outputId": "9b0d22f5-a12b-4d6e-ba0a-eac1fcb630c9",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"type(encoded_seq)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"list"
]
},
"metadata": {
"tags": []
},
"execution_count": 78
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "r4BDW-RhtiIo",
"colab_type": "code",
"colab": {}
},
"source": [
"positive_reviews = np.array(encoded_seq)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "9vQXL12OYSPn",
"colab_type": "text"
},
"source": [
" ## 3.6 Generate Pickle Files\n",
" \n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "o9y30LkLYf6i",
"colab_type": "text"
},
"source": [
"Dump Pos Rev in Pickle File"
]
},
{
"cell_type": "code",
"metadata": {
"id": "qRuqPs3JYbhD",
"colab_type": "code",
"colab": {}
},
"source": [
"filename = '/content/sample_data/pos_reviews.pkl'\n",
"outfile = open(filename,'wb')\n",
"pickle.dump(positive_reviews,outfile)\n",
"outfile.close()"
],
"execution_count": 0,
"outputs": []
},
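    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PklCheckNote",
        "colab_type": "text"
      },
      "source": [
        "Optional sanity check: reload the pickle and confirm the data round-trips (a sketch; assumes the dump cell above has run)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "PklCheckCode",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "with open('/content/sample_data/pos_reviews.pkl', 'rb') as infile:\n",
        "    restored = pickle.load(infile)\n",
        "\n",
        "# same number of generated reviews, and the first review matches\n",
        "assert len(restored) == len(positive_reviews)\n",
        "assert list(restored[0]) == list(positive_reviews[0])"
      ],
      "execution_count": 0,
      "outputs": []
    },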
{
"cell_type": "markdown",
"metadata": {
"id": "W3XFN2FmYiJX",
"colab_type": "text"
},
"source": [
"Dump Neg Rev in Pickle File"
]
},
{
"cell_type": "code",
"metadata": {
"id": "6vnYxTK5ZtGu",
"colab_type": "code",
"outputId": "b928ee51-2fa3-4bd9-d84e-4842a1ce5fb6",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"negative_index = np.where(y_train == 0)\n",
"negative_reviews = x_train[negative_index]\n",
"np.shape(negative_reviews)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(12500,)"
]
},
"metadata": {
"tags": []
},
"execution_count": 82
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "2eVvhS3-auaR",
"colab_type": "code",
"colab": {}
},
"source": [
"filename = '/content/sample_data/neg_reviews.pkl'\n",
"outfile = open(filename,'wb')\n",
"pickle.dump(negative_reviews,outfile)\n",
"outfile.close()"
],
"execution_count": 0,
"outputs": []
    }
]
}