@georgehc
Last active December 7, 2022 04:55
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 95-865: Sentiment Analysis with IMDb Reviews\n",
"\n",
"Author: George H. Chen (georgechen [at symbol] cmu.edu)\n",
"\n",
"This demo shows how to train an LSTM model for sentiment analysis with IMDb reviews. This is a binary classification task: for each review, we classify it as having positive or negative sentiment."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import random\n",
"import os\n",
"\n",
"os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8' # to help make code deterministic\n",
"\n",
"from glob import glob\n",
"\n",
"import torch\n",
"torch.use_deterministic_algorithms(True) # to help make code deterministic\n",
"torch.backends.cudnn.benchmark = False # to help make code deterministic\n",
"import torch.nn as nn\n",
"from torchinfo import summary\n",
"\n",
"np.random.seed(0) # to help make code deterministic\n",
"torch.manual_seed(0) # to help make code deterministic\n",
"random.seed(0) # to help make code deterministic\n",
"\n",
"from UDA_pytorch_utils import UDA_pytorch_classifier_fit, \\\n",
" UDA_plot_train_val_accuracy_vs_epoch, UDA_pytorch_classifier_predict, \\\n",
" UDA_compute_accuracy, UDA_get_rnn_last_time_step_outputs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the dataset\n",
"\n",
"Here, we downloaded the IMDb dataset from: http://ai.stanford.edu/~amaas/data/sentiment/\n",
"\n",
"Place the file `aclImdb_v1.tar.gz` in `./data/` and uncompress it within that directory. Afterward, you should have access to the directories `./data/aclImdb/train` and `./data/aclImdb/test`, as well as other files such as the \"README\" file `./data/aclImdb/README`."
]
},
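{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the uncompressing step, one option is Python's standard `tarfile` module. The helper name `extract_archive` is hypothetical and not part of the course code, and the paths in the usage comment assume the directory layout described above. The snippet is shown for reference only and is not run by this notebook.\n",
"\n",
"```\n",
"import tarfile\n",
"\n",
"\n",
"def extract_archive(archive_path, dest_dir):\n",
"    # hypothetical helper: unpack a gzip-compressed tar archive into dest_dir\n",
"    with tarfile.open(archive_path, 'r:gz') as archive:\n",
"        archive.extractall(path=dest_dir)\n",
"\n",
"\n",
"# example usage, assuming the archive was saved to ./data/:\n",
"# extract_archive('./data/aclImdb_v1.tar.gz', './data')\n",
"```"
]
},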
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"train_dataset = []\n",
"\n",
"for filename in sorted(glob('./data/aclImdb/train/pos/*.txt')):\n",
"    with open(filename, 'r', encoding='utf-8') as f:\n",
"        train_dataset.append((f.read(), 1)) # 1 means `positive` sentiment\n",
"\n",
"for filename in sorted(glob('./data/aclImdb/train/neg/*.txt')):\n",
"    with open(filename, 'r', encoding='utf-8') as f:\n",
"        train_dataset.append((f.read(), 0)) # 0 means `negative` sentiment"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# proper training data points: 20000\n",
"# validation data points: 5000\n"
]
}
],
"source": [
"proper_train_size = int(len(train_dataset) * 0.8)\n",
"val_size = len(train_dataset) - proper_train_size\n",
"print('# proper training data points:', proper_train_size)\n",
"print('# validation data points:', val_size)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"proper_train_dataset, val_dataset = torch.utils.data.random_split(train_dataset,\n",
" [proper_train_size,\n",
" val_size])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(\"Master cinéaste Alain Resnais likes to work with those actors who are a part of his family.In this film too we see Resnais' family members like Pierre Arditi, Sabine Azema, André Dussolier and Fanny Ardant dealing with serious themes like death,religion,suicide,love and their overall implications on our daily lives.The formal nature of relationship shared by these people is evident as even friends, they address each other using a formal you.In 1984,while making L'amour à mort,Resnais dealt with time,memory and space to unravel the mysteries of a fundamental question of human existence :Is love stronger than death ? It was 16 years ago in 1968 that Resnais made a somewhat similar film Je t'aime Je t'aime which was also about love and memories.Message of this film is loud and clear :true and deep love can even put science to shame as dead lovers regain their lost lives leaving doctors to care for their reputation.L'amour à mort is like a game which is not at all didactic.It is a film in which the musical score is in perfect tandem with its images.This is one of the reasons why this film can easily be grasped.\",\n",
" 1)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"proper_train_dataset[0] # this is a tuple of the format (text, label)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from torchtext.data import get_tokenizer"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/george/anaconda3_UDA/lib/python3.9/site-packages/torch/cuda/__init__.py:497: UserWarning: Can't initialize NVML\n",
" warnings.warn(\"Can't initialize NVML\")\n"
]
}
],
"source": [
"tokenizer = get_tokenizer('spacy', language='en_core_web_sm')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Master',\n",
" 'cinéaste',\n",
" 'Alain',\n",
" 'Resnais',\n",
" 'likes',\n",
" 'to',\n",
" 'work',\n",
" 'with',\n",
" 'those',\n",
" 'actors',\n",
" 'who',\n",
" 'are',\n",
" 'a',\n",
" 'part',\n",
" 'of',\n",
" 'his',\n",
" 'family',\n",
" '.',\n",
" 'In',\n",
" 'this',\n",
" 'film',\n",
" 'too',\n",
" 'we',\n",
" 'see',\n",
" 'Resnais',\n",
" \"'\",\n",
" 'family',\n",
" 'members',\n",
" 'like',\n",
" 'Pierre',\n",
" 'Arditi',\n",
" ',',\n",
" 'Sabine',\n",
" 'Azema',\n",
" ',',\n",
" 'André',\n",
" 'Dussolier',\n",
" 'and',\n",
" 'Fanny',\n",
" 'Ardant',\n",
" 'dealing',\n",
" 'with',\n",
" 'serious',\n",
" 'themes',\n",
" 'like',\n",
" 'death',\n",
" ',',\n",
" 'religion',\n",
" ',',\n",
" 'suicide',\n",
" ',',\n",
" 'love',\n",
" 'and',\n",
" 'their',\n",
" 'overall',\n",
" 'implications',\n",
" 'on',\n",
" 'our',\n",
" 'daily',\n",
" 'lives',\n",
" '.',\n",
" 'The',\n",
" 'formal',\n",
" 'nature',\n",
" 'of',\n",
" 'relationship',\n",
" 'shared',\n",
" 'by',\n",
" 'these',\n",
" 'people',\n",
" 'is',\n",
" 'evident',\n",
" 'as',\n",
" 'even',\n",
" 'friends',\n",
" ',',\n",
" 'they',\n",
" 'address',\n",
" 'each',\n",
" 'other',\n",
" 'using',\n",
" 'a',\n",
" 'formal',\n",
" 'you',\n",
" '.',\n",
" 'In',\n",
" '1984,while',\n",
" 'making',\n",
" \"L'amour\",\n",
" 'à',\n",
" 'mort',\n",
" ',',\n",
" 'Resnais',\n",
" 'dealt',\n",
" 'with',\n",
" 'time',\n",
" ',',\n",
" 'memory',\n",
" 'and',\n",
" 'space',\n",
" 'to',\n",
" 'unravel',\n",
" 'the',\n",
" 'mysteries',\n",
" 'of',\n",
" 'a',\n",
" 'fundamental',\n",
" 'question',\n",
" 'of',\n",
" 'human',\n",
" 'existence',\n",
" ':',\n",
" 'Is',\n",
" 'love',\n",
" 'stronger',\n",
" 'than',\n",
" 'death',\n",
" '?',\n",
" 'It',\n",
" 'was',\n",
" '16',\n",
" 'years',\n",
" 'ago',\n",
" 'in',\n",
" '1968',\n",
" 'that',\n",
" 'Resnais',\n",
" 'made',\n",
" 'a',\n",
" 'somewhat',\n",
" 'similar',\n",
" 'film',\n",
" 'Je',\n",
" \"t'aime\",\n",
" 'Je',\n",
" \"t'aime\",\n",
" 'which',\n",
" 'was',\n",
" 'also',\n",
" 'about',\n",
" 'love',\n",
" 'and',\n",
" 'memories',\n",
" '.',\n",
" 'Message',\n",
" 'of',\n",
" 'this',\n",
" 'film',\n",
" 'is',\n",
" 'loud',\n",
" 'and',\n",
" 'clear',\n",
" ':',\n",
" 'true',\n",
" 'and',\n",
" 'deep',\n",
" 'love',\n",
" 'can',\n",
" 'even',\n",
" 'put',\n",
" 'science',\n",
" 'to',\n",
" 'shame',\n",
" 'as',\n",
" 'dead',\n",
" 'lovers',\n",
" 'regain',\n",
" 'their',\n",
" 'lost',\n",
" 'lives',\n",
" 'leaving',\n",
" 'doctors',\n",
" 'to',\n",
" 'care',\n",
" 'for',\n",
" 'their',\n",
" 'reputation',\n",
" '.',\n",
" \"L'amour\",\n",
" 'à',\n",
" 'mort',\n",
" 'is',\n",
" 'like',\n",
" 'a',\n",
" 'game',\n",
" 'which',\n",
" 'is',\n",
" 'not',\n",
" 'at',\n",
" 'all',\n",
" 'didactic',\n",
" '.',\n",
" 'It',\n",
" 'is',\n",
" 'a',\n",
" 'film',\n",
" 'in',\n",
" 'which',\n",
" 'the',\n",
" 'musical',\n",
" 'score',\n",
" 'is',\n",
" 'in',\n",
" 'perfect',\n",
" 'tandem',\n",
" 'with',\n",
" 'its',\n",
" 'images',\n",
" '.',\n",
" 'This',\n",
" 'is',\n",
" 'one',\n",
" 'of',\n",
" 'the',\n",
" 'reasons',\n",
" 'why',\n",
" 'this',\n",
" 'film',\n",
" 'can',\n",
" 'easily',\n",
" 'be',\n",
" 'grasped',\n",
" '.']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenizer(proper_train_dataset[0][0])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"proper_train_dataset_as_tokens_without_labels = [tokenizer(text) for text, _ in proper_train_dataset]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"from torchtext.vocab import build_vocab_from_iterator"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"vocab = build_vocab_from_iterator(proper_train_dataset_as_tokens_without_labels,\n",
" specials=[\"<unk>\"])"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"vocab.set_default_index(vocab[\"<unk>\"])"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[3752,\n",
" 87615,\n",
" 12966,\n",
" 17640,\n",
" 1240,\n",
" 7,\n",
" 178,\n",
" 20,\n",
" 172,\n",
" 173,\n",
" 43,\n",
" 31,\n",
" 5,\n",
" 194,\n",
" 6,\n",
" 34,\n",
" 250,\n",
" 3,\n",
" 158,\n",
" 15,\n",
" 23,\n",
" 115,\n",
" 98,\n",
" 78,\n",
" 17640,\n",
" 49,\n",
" 250,\n",
" 1093,\n",
" 46,\n",
" 6444,\n",
" 63116,\n",
" 2,\n",
" 39058,\n",
" 63412,\n",
" 2,\n",
" 16573,\n",
" 32452,\n",
" 4,\n",
" 13479,\n",
" 36868,\n",
" 2001,\n",
" 20,\n",
" 642,\n",
" 1339,\n",
" 46,\n",
" 419,\n",
" 2,\n",
" 2416,\n",
" 2,\n",
" 1804,\n",
" 2,\n",
" 142,\n",
" 4,\n",
" 79,\n",
" 872,\n",
" 11585,\n",
" 27,\n",
" 304,\n",
" 3595,\n",
" 494,\n",
" 3,\n",
" 24,\n",
" 12327,\n",
" 917,\n",
" 6,\n",
" 653,\n",
" 5512,\n",
" 40,\n",
" 168,\n",
" 99,\n",
" 8,\n",
" 3863,\n",
" 19,\n",
" 80,\n",
" 399,\n",
" 2,\n",
" 44,\n",
" 5998,\n",
" 287,\n",
" 96,\n",
" 835,\n",
" 5,\n",
" 12327,\n",
" 30,\n",
" 3,\n",
" 158,\n",
" 61332,\n",
" 255,\n",
" 47326,\n",
" 15138,\n",
" 54119,\n",
" 2,\n",
" 17640,\n",
" 3446,\n",
" 20,\n",
" 67,\n",
" 2,\n",
" 1964,\n",
" 4,\n",
" 1060,\n",
" 7,\n",
" 9490,\n",
" 1,\n",
" 4811,\n",
" 6,\n",
" 5,\n",
" 9632,\n",
" 923,\n",
" 6,\n",
" 448,\n",
" 2108,\n",
" 90,\n",
" 863,\n",
" 142,\n",
" 3669,\n",
" 86,\n",
" 419,\n",
" 57,\n",
" 53,\n",
" 18,\n",
" 3290,\n",
" 177,\n",
" 636,\n",
" 9,\n",
" 5389,\n",
" 12,\n",
" 17640,\n",
" 103,\n",
" 5,\n",
" 648,\n",
" 747,\n",
" 23,\n",
" 14617,\n",
" 14452,\n",
" 14617,\n",
" 14452,\n",
" 69,\n",
" 18,\n",
" 107,\n",
" 50,\n",
" 142,\n",
" 4,\n",
" 1941,\n",
" 3,\n",
" 29251,\n",
" 6,\n",
" 15,\n",
" 23,\n",
" 8,\n",
" 1337,\n",
" 4,\n",
" 799,\n",
" 90,\n",
" 331,\n",
" 4,\n",
" 1013,\n",
" 142,\n",
" 68,\n",
" 80,\n",
" 291,\n",
" 1289,\n",
" 7,\n",
" 1041,\n",
" 19,\n",
" 504,\n",
" 1882,\n",
" 10141,\n",
" 79,\n",
" 501,\n",
" 494,\n",
" 1257,\n",
" 6468,\n",
" 7,\n",
" 491,\n",
" 22,\n",
" 79,\n",
" 2764,\n",
" 3,\n",
" 47326,\n",
" 15138,\n",
" 54119,\n",
" 8,\n",
" 46,\n",
" 5,\n",
" 549,\n",
" 69,\n",
" 8,\n",
" 32,\n",
" 39,\n",
" 41,\n",
" 18811,\n",
" 3,\n",
" 53,\n",
" 8,\n",
" 5,\n",
" 23,\n",
" 9,\n",
" 69,\n",
" 1,\n",
" 668,\n",
" 656,\n",
" 8,\n",
" 9,\n",
" 454,\n",
" 23528,\n",
" 20,\n",
" 112,\n",
" 1276,\n",
" 3,\n",
" 65,\n",
" 8,\n",
" 37,\n",
" 6,\n",
" 1,\n",
" 1051,\n",
" 190,\n",
" 15,\n",
" 23,\n",
" 68,\n",
" 754,\n",
" 35,\n",
" 21483,\n",
" 3]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# we can now convert any piece of text first into tokens and then from tokens into indices into the vocabulary\n",
"vocab(tokenizer(proper_train_dataset[0][0]))"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Master'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# we can also look up what any specific word index refers to\n",
"vocab.lookup_token(3752)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"proper_train_encoded = [vocab(tokens) for tokens in proper_train_dataset_as_tokens_without_labels]\n",
"# note that another way to have written the above line is to instead write the line below (but this would repeat the work of tokenization which we already did):\n",
"# proper_train_encoded = [vocab(tokenizer(text)) for text, label in proper_train_dataset]"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[3752,\n",
" 87615,\n",
" 12966,\n",
" 17640,\n",
" 1240,\n",
" 7,\n",
" 178,\n",
" 20,\n",
" 172,\n",
" 173,\n",
" 43,\n",
" 31,\n",
" 5,\n",
" 194,\n",
" 6,\n",
" 34,\n",
" 250,\n",
" 3,\n",
" 158,\n",
" 15,\n",
" 23,\n",
" 115,\n",
" 98,\n",
" 78,\n",
" 17640,\n",
" 49,\n",
" 250,\n",
" 1093,\n",
" 46,\n",
" 6444,\n",
" 63116,\n",
" 2,\n",
" 39058,\n",
" 63412,\n",
" 2,\n",
" 16573,\n",
" 32452,\n",
" 4,\n",
" 13479,\n",
" 36868,\n",
" 2001,\n",
" 20,\n",
" 642,\n",
" 1339,\n",
" 46,\n",
" 419,\n",
" 2,\n",
" 2416,\n",
" 2,\n",
" 1804,\n",
" 2,\n",
" 142,\n",
" 4,\n",
" 79,\n",
" 872,\n",
" 11585,\n",
" 27,\n",
" 304,\n",
" 3595,\n",
" 494,\n",
" 3,\n",
" 24,\n",
" 12327,\n",
" 917,\n",
" 6,\n",
" 653,\n",
" 5512,\n",
" 40,\n",
" 168,\n",
" 99,\n",
" 8,\n",
" 3863,\n",
" 19,\n",
" 80,\n",
" 399,\n",
" 2,\n",
" 44,\n",
" 5998,\n",
" 287,\n",
" 96,\n",
" 835,\n",
" 5,\n",
" 12327,\n",
" 30,\n",
" 3,\n",
" 158,\n",
" 61332,\n",
" 255,\n",
" 47326,\n",
" 15138,\n",
" 54119,\n",
" 2,\n",
" 17640,\n",
" 3446,\n",
" 20,\n",
" 67,\n",
" 2,\n",
" 1964,\n",
" 4,\n",
" 1060,\n",
" 7,\n",
" 9490,\n",
" 1,\n",
" 4811,\n",
" 6,\n",
" 5,\n",
" 9632,\n",
" 923,\n",
" 6,\n",
" 448,\n",
" 2108,\n",
" 90,\n",
" 863,\n",
" 142,\n",
" 3669,\n",
" 86,\n",
" 419,\n",
" 57,\n",
" 53,\n",
" 18,\n",
" 3290,\n",
" 177,\n",
" 636,\n",
" 9,\n",
" 5389,\n",
" 12,\n",
" 17640,\n",
" 103,\n",
" 5,\n",
" 648,\n",
" 747,\n",
" 23,\n",
" 14617,\n",
" 14452,\n",
" 14617,\n",
" 14452,\n",
" 69,\n",
" 18,\n",
" 107,\n",
" 50,\n",
" 142,\n",
" 4,\n",
" 1941,\n",
" 3,\n",
" 29251,\n",
" 6,\n",
" 15,\n",
" 23,\n",
" 8,\n",
" 1337,\n",
" 4,\n",
" 799,\n",
" 90,\n",
" 331,\n",
" 4,\n",
" 1013,\n",
" 142,\n",
" 68,\n",
" 80,\n",
" 291,\n",
" 1289,\n",
" 7,\n",
" 1041,\n",
" 19,\n",
" 504,\n",
" 1882,\n",
" 10141,\n",
" 79,\n",
" 501,\n",
" 494,\n",
" 1257,\n",
" 6468,\n",
" 7,\n",
" 491,\n",
" 22,\n",
" 79,\n",
" 2764,\n",
" 3,\n",
" 47326,\n",
" 15138,\n",
" 54119,\n",
" 8,\n",
" 46,\n",
" 5,\n",
" 549,\n",
" 69,\n",
" 8,\n",
" 32,\n",
" 39,\n",
" 41,\n",
" 18811,\n",
" 3,\n",
" 53,\n",
" 8,\n",
" 5,\n",
" 23,\n",
" 9,\n",
" 69,\n",
" 1,\n",
" 668,\n",
" 656,\n",
" 8,\n",
" 9,\n",
" 454,\n",
" 23528,\n",
" 20,\n",
" 112,\n",
" 1276,\n",
" 3,\n",
" 65,\n",
" 8,\n",
" 37,\n",
" 6,\n",
" 1,\n",
" 1051,\n",
" 190,\n",
" 15,\n",
" 23,\n",
" 68,\n",
" 754,\n",
" 35,\n",
" 21483,\n",
" 3]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"proper_train_encoded[0]"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Master', 'cinéaste', 'Alain', 'Resnais', 'likes', 'to', 'work', 'with', 'those', 'actors', 'who', 'are', 'a', 'part', 'of', 'his', 'family', '.', 'In', 'this', 'film', 'too', 'we', 'see', 'Resnais', \"'\", 'family', 'members', 'like', 'Pierre', 'Arditi', ',', 'Sabine', 'Azema', ',', 'André', 'Dussolier', 'and', 'Fanny', 'Ardant', 'dealing', 'with', 'serious', 'themes', 'like', 'death', ',', 'religion', ',', 'suicide', ',', 'love', 'and', 'their', 'overall', 'implications', 'on', 'our', 'daily', 'lives', '.', 'The', 'formal', 'nature', 'of', 'relationship', 'shared', 'by', 'these', 'people', 'is', 'evident', 'as', 'even', 'friends', ',', 'they', 'address', 'each', 'other', 'using', 'a', 'formal', 'you', '.', 'In', '1984,while', 'making', \"L'amour\", 'à', 'mort', ',', 'Resnais', 'dealt', 'with', 'time', ',', 'memory', 'and', 'space', 'to', 'unravel', 'the', 'mysteries', 'of', 'a', 'fundamental', 'question', 'of', 'human', 'existence', ':', 'Is', 'love', 'stronger', 'than', 'death', '?', 'It', 'was', '16', 'years', 'ago', 'in', '1968', 'that', 'Resnais', 'made', 'a', 'somewhat', 'similar', 'film', 'Je', \"t'aime\", 'Je', \"t'aime\", 'which', 'was', 'also', 'about', 'love', 'and', 'memories', '.', 'Message', 'of', 'this', 'film', 'is', 'loud', 'and', 'clear', ':', 'true', 'and', 'deep', 'love', 'can', 'even', 'put', 'science', 'to', 'shame', 'as', 'dead', 'lovers', 'regain', 'their', 'lost', 'lives', 'leaving', 'doctors', 'to', 'care', 'for', 'their', 'reputation', '.', \"L'amour\", 'à', 'mort', 'is', 'like', 'a', 'game', 'which', 'is', 'not', 'at', 'all', 'didactic', '.', 'It', 'is', 'a', 'film', 'in', 'which', 'the', 'musical', 'score', 'is', 'in', 'perfect', 'tandem', 'with', 'its', 'images', '.', 'This', 'is', 'one', 'of', 'the', 'reasons', 'why', 'this', 'film', 'can', 'easily', 'be', 'grasped', '.']\n"
]
}
],
"source": [
"# we can reconstruct any original review from the encoded version of the review\n",
"print([vocab.lookup_token(word_idx) for word_idx in proper_train_encoded[0]])"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"proper_train_labels = [label for text, label in proper_train_dataset]"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"val_encoded = [vocab(tokenizer(text)) for text, label in val_dataset]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"val_labels = [label for text, label in val_dataset]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"proper_train_dataset_encoded = list(zip(proper_train_encoded, proper_train_labels))\n",
"val_dataset_encoded = list(zip(val_encoded, val_labels))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting up a recurrent neural net for sentiment analysis that uses pre-trained word embeddings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We first load pre-trained GloVe embeddings; below, we keep only the vectors for tokens that we encountered in the proper training data."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"from torchtext.vocab import GloVe\n",
"pretrained_embedding = GloVe(name='6B', dim=100)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"torchtext.vocab.vectors.GloVe"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(pretrained_embedding)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tensor([ 0.2309, 0.2828, 0.6318, -0.5941, -0.5860, 0.6326, 0.2440, -0.1411,\n",
" 0.0608, -0.7898, -0.2910, 0.1429, 0.7227, 0.2043, 0.1407, 0.9876,\n",
" 0.5253, 0.0975, 0.8822, 0.5122, 0.4020, 0.2117, -0.0131, -0.7162,\n",
" 0.5539, 1.1452, -0.8804, -0.5022, -0.2281, 0.0239, 0.1072, 0.0837,\n",
" 0.5501, 0.5848, 0.7582, 0.4571, -0.2800, 0.2522, 0.6896, -0.6097,\n",
" 0.1958, 0.0442, -0.3114, -0.6883, -0.2272, 0.4618, -0.7716, 0.1021,\n",
" 0.5564, 0.0674, -0.5721, 0.2374, 0.4717, 0.8277, -0.2926, -1.3422,\n",
" -0.0993, 0.2814, 0.4160, 0.1058, 0.6220, 0.8950, -0.2345, 0.5135,\n",
" 0.9938, 1.1846, -0.1636, 0.2065, 0.7385, 0.2406, -0.9647, 0.1348,\n",
" -0.0072, 0.3302, -0.1236, 0.2719, -0.4095, 0.0219, -0.6069, 0.4076,\n",
" 0.1957, -0.4180, 0.1864, -0.0327, -0.7857, -0.1385, 0.0440, -0.0844,\n",
" 0.0491, 0.2410, 0.4527, -0.1868, 0.4618, 0.0891, -0.1819, -0.0152,\n",
" -0.7368, -0.1453, 0.1510, -0.7149])"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pretrained_embedding['cat']\n",
"# note that if you ask for a word embedding for a word that GloVe does not keep\n",
"# track of, you'll get all zeros"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"embedding_matrix = torch.zeros(len(vocab), pretrained_embedding.dim)\n",
"for i, token in enumerate(vocab.lookup_tokens(range(len(vocab)))):\n",
"    embedding_matrix[i] = pretrained_embedding[token]"
]
},
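{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since a token that GloVe does not keep track of ends up as an all-zero row, it can be useful to count how many rows of the embedding matrix are all zeros. Below is a minimal sketch on a toy matrix. The name `toy_matrix` and its values are made up for illustration; with the real data, one would apply the same check to `embedding_matrix`.\n",
"\n",
"```\n",
"import torch\n",
"\n",
"# made-up 3-token embedding matrix; rows 0 and 2 mimic tokens without a pre-trained vector\n",
"toy_matrix = torch.tensor([[0.0, 0.0, 0.0],\n",
"                           [0.2, -0.1, 0.5],\n",
"                           [0.0, 0.0, 0.0]])\n",
"\n",
"# a row is all zeros exactly when the sum of its absolute values is 0\n",
"num_zero_rows = int((toy_matrix.abs().sum(dim=1) == 0).sum())\n",
"print('# all-zero rows:', num_zero_rows, 'out of', len(toy_matrix))\n",
"```\n",
"\n",
"A large count would suggest that many vocabulary tokens (for example, rare names or tokens that differ from GloVe's entries only in capitalization) carry no pre-trained information into the model."
]
},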
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next code cell constructs a PyTorch recurrent neural net model for sentiment analysis. Unfortunately, this model doesn't readily work with the ``nn.Sequential`` approach that we have been using. In short, an input batch of time series can contain time series of varying lengths, and accounting for varying lengths properly is not easy to do with ``nn.Sequential``. We will instead build the PyTorch model using another standard approach: creating a Python class that inherits from the `nn.Module` class.\n",
"\n",
"To illustrate how this works, here is how we created the multilayer perceptron model for MNIST digits in the previous demo:\n",
"\n",
"```\n",
"deeper_model = nn.Sequential(nn.Flatten(),\n",
"                             nn.Linear(in_features=784, out_features=512),\n",
"                             nn.ReLU(),\n",
"                             nn.Linear(in_features=512, out_features=10))\n",
"```\n",
"\n",
"An alternative way to code the same model is as follows:\n",
"\n",
"```\n",
"class DeeperModel(nn.Module):\n",
"    def __init__(self, num_in_features, num_intermediate_features, num_out_features):\n",
"        super().__init__()\n",
"        self.flatten = nn.Flatten()\n",
"        self.linear1 = nn.Linear(num_in_features, num_intermediate_features)\n",
"        self.relu = nn.ReLU()\n",
"        self.linear2 = nn.Linear(num_intermediate_features, num_out_features)\n",
"\n",
"    def forward(self, inputs):\n",
"        flatten_output = self.flatten(inputs)\n",
"        linear1_output = self.linear1(flatten_output)\n",
"        relu_output = self.relu(linear1_output)\n",
"        linear2_output = self.linear2(relu_output)\n",
"        return linear2_output\n",
"\n",
"deeper_model = DeeperModel(784, 512, 10)\n",
"```\n",
"\n",
"**Importantly, in the above code, the `forward` function specifies how the neural net processes input data. Note that the only argument it takes (aside from `self`) is the input data (`inputs`), which for MNIST digits we saw will be of the format (batch size, 1, 28, 28). Consequently, the example input data batch supplied to the `summary` function only needs this 4D table.**\n",
"\n",
"The code below creates the PyTorch neural net model corresponding to the architecture:\n",
"\n",
"1. `Embedding` layer (for every time series: convert the word index at every time step into a 100-dimensional GloVe word embedding)\n",
"2. `LSTM` layer with 64 output nodes (for every time series: put the time series through the LSTM's `for` loop and grab only the last time step's output, which has 64 numbers)\n",
"3. `Linear` layer with 2 output nodes (now every input data point to the linear layer is just a 1D table of 64 numbers, which this linear layer converts to 2 output numbers corresponding to the two classes)"
]
},
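{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the three steps above concrete, here is a minimal shape walk-through for a single made-up review of 5 tokens, using plain PyTorch layers. The toy vocabulary of 10 tokens, the random embedding matrix, and all variable names are assumptions for illustration only. Because there is just one time series, no padding is needed, so the last time step can be read off directly rather than through the course helper function.\n",
"\n",
"```\n",
"import torch\n",
"import torch.nn as nn\n",
"\n",
"torch.manual_seed(0)\n",
"\n",
"toy_embedding_matrix = torch.randn(10, 100)  # made-up vocabulary: 10 tokens, 100 dimensions each\n",
"embedding_layer = nn.Embedding.from_pretrained(toy_embedding_matrix)\n",
"lstm_layer = nn.LSTM(100, 64)\n",
"linear_layer = nn.Linear(64, 2)\n",
"\n",
"# one time series with 5 time steps, laid out as (number of time steps, batch size)\n",
"word_indices = torch.tensor([[3], [1], [4], [1], [5]])\n",
"\n",
"embeddings = embedding_layer(word_indices)  # shape: (5, 1, 100)\n",
"lstm_outputs, _ = lstm_layer(embeddings)    # shape: (5, 1, 64)\n",
"last_output = lstm_outputs[-1]              # shape: (1, 64), the last time step only\n",
"scores = linear_layer(last_output)          # shape: (1, 2), one score per class\n",
"\n",
"print(embeddings.shape, lstm_outputs.shape, last_output.shape, scores.shape)\n",
"```"
]
},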
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"class EmbeddingLSTMLinearModel(nn.Module):\n",
"    def __init__(self, embedding_matrix, num_lstm_output_nodes, num_final_output_nodes):\n",
"        super().__init__()\n",
"        self.embedding_layer = nn.Embedding.from_pretrained(embedding_matrix)\n",
"        self.lstm_layer = nn.LSTM(embedding_matrix.shape[1], num_lstm_output_nodes)\n",
"        self.linear_layer = nn.Linear(num_lstm_output_nodes, num_final_output_nodes)\n",
"\n",
"    def forward(self, text_encodings, lengths):\n",
"        embeddings = self.embedding_layer(text_encodings)\n",
"\n",
"        rnn_last_time_step_outputs = \\\n",
"            UDA_get_rnn_last_time_step_outputs(embeddings, lengths, self.lstm_layer)\n",
"\n",
"        return self.linear_layer(rnn_last_time_step_outputs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Notice that the `forward` function takes _two_ inputs (aside from `self`): `text_encodings` and `lengths`.** Keep in mind that `forward` takes in a batch of input time series, and these input time series could have different lengths. These different lengths are precisely what is stored in `lengths` as a 1D table of integers. Meanwhile, `text_encodings` is a 2D table where the number of rows is the maximum number of time steps in the input batch of time series, and the number of columns is the number of time series in the input batch. See the lecture slides for how `text_encodings` gets filled in (basically we pad all the time series in the batch to be of the same length as the longest time series; the padded entries will of course get ignored by the LSTM layer since it will know the correct length to use per time series). **When we give this neural net model an example data batch using the `summary` function, we need to specify two inputs corresponding to `text_encodings` and `lengths`.**"
]
},
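{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the padded 2D table and the 1D table of lengths described above, the sketch below builds both from three made-up encoded reviews using standard PyTorch utilities. The toy data and variable names are assumptions for illustration, and this is not necessarily how the course helper code prepares batches internally.\n",
"\n",
"```\n",
"import torch\n",
"from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence\n",
"\n",
"# three made-up encoded reviews of lengths 3, 2, and 5\n",
"toy_encoded = [torch.tensor([4, 9, 2]),\n",
"               torch.tensor([7, 1]),\n",
"               torch.tensor([5, 3, 8, 6, 2])]\n",
"\n",
"lengths = torch.tensor([len(seq) for seq in toy_encoded])\n",
"\n",
"# rows = maximum number of time steps, columns = number of time series;\n",
"# shorter time series are padded with 0 at the end\n",
"text_encodings = pad_sequence(toy_encoded)\n",
"print(text_encodings.shape)  # torch.Size([5, 3])\n",
"print(lengths)\n",
"\n",
"# packing drops the padded entries, leaving 3 + 2 + 5 = 10 real time steps\n",
"packed = pack_padded_sequence(text_encodings, lengths, enforce_sorted=False)\n",
"print(packed.data.shape)  # torch.Size([10])\n",
"```\n",
"\n",
"The packed form is what lets a recurrent layer skip the padded entries, since each time series is processed only up to its own length."
]
},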
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"simple_lstm_model = EmbeddingLSTMLinearModel(embedding_matrix, 64, 2)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"==========================================================================================\n",
"Layer (type:depth-idx) Output Shape Param #\n",
"==========================================================================================\n",
"EmbeddingLSTMLinearModel [5, 2] --\n",
"├─Embedding: 1-1 [7, 5, 100] (10,814,200)\n",
"├─LSTM: 1-2 [18, 64] 42,496\n",
"├─Linear: 1-3 [5, 2] 130\n",
"==========================================================================================\n",
"Total params: 10,856,826\n",
"Trainable params: 42,626\n",
"Non-trainable params: 10,814,200\n",
"Total mult-adds (M): 124.66\n",
"==========================================================================================\n",
"Input size (MB): 0.00\n",
"Forward/backward pass size (MB): 0.04\n",
"Params size (MB): 43.43\n",
"Estimated Total Size (MB): 43.47\n",
"=========================================================================================="
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# example where there are 5 input time series of lengths 3, 2, 5, 1, 7;\n",
"# we specify these time series using a 2D table that is padded and a\n",
"# 1D table of lengths (see lecture slides for details)\n",
"summary(simple_lstm_model,\n",
" input_data=[torch.zeros((7, 5), dtype=torch.long),\n",
" torch.tensor([3, 2, 5, 1, 7], dtype=torch.long)])\n",
"\n",
"# note: the LSTM's output is in a compressed format (called a \"packed sequence\")\n",
"# that appears to put all the outputs of all the time series together (the output\n",
"# shape in this case appears to be 18 by 64) but it actually does keep track of\n",
"# which of the time steps correspond to which input time series (i.e., it knows\n",
"# that 3 of the rows correspond to the 0-th time series, 2 of the rows correspond\n",
"# to the 1st time series, 5 of the rows correspond to the 2nd time series, 1 row\n",
"# corresponds to the 3rd time series, and 7 rows correspond to the 4th time\n",
"# series); my helper code automatically maps these rows back to the correct\n",
"# format so that the final linear layer recognizes that there are 5 input data\n",
"# points"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"os.makedirs('./saved_model_checkpoints', exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" 0%| | 0/10 [00:00<?, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Train accuracy: 0.7784\n",
" Validation accuracy: 0.7780\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" 10%|████████████████▊ | 1/10 [00:14<02:08, 14.27s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Train accuracy: 0.8530\n",
" Validation accuracy: 0.8456\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" 20%|█████████████████████████████████▌ | 2/10 [00:28<01:53, 14.25s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Train accuracy: 0.8858\n",
" Validation accuracy: 0.8654\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" 30%|██████████████████████████████████████████████████▍ | 3/10 [00:42<01:39, 14.21s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Train accuracy: 0.9025\n",
" Validation accuracy: 0.8704\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" 40%|███████████████████████████████████████████████████████████████████▏ | 4/10 [00:56<01:25, 14.22s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Train accuracy: 0.9155\n",
" Validation accuracy: 0.8716\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" 50%|████████████████████████████████████████████████████████████████████████████████████ | 5/10 [01:11<01:11, 14.27s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Train accuracy: 0.9301\n",
" Validation accuracy: 0.8770\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" 60%|████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 6/10 [01:25<00:56, 14.24s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Train accuracy: 0.9284\n",
" Validation accuracy: 0.8666\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" 70%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 7/10 [01:39<00:42, 14.30s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Train accuracy: 0.9562\n",
" Validation accuracy: 0.8776\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" 80%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 8/10 [01:53<00:28, 14.24s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Train accuracy: 0.9637\n",
" Validation accuracy: 0.8744\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" 90%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 9/10 [02:08<00:14, 14.20s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Train accuracy: 0.9705\n",
" Validation accuracy: 0.8728\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [02:22<00:00, 14.23s/it]\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"num_epochs = 10 # during optimization, how many times we look at training data\n",
"batch_size = 64 # during optimization, how many training data points to use at each step\n",
"learning_rate = 0.005 # during optimization, how much we nudge our solution at each step\n",
"\n",
"train_accuracies, val_accuracies = \\\n",
"    UDA_pytorch_classifier_fit(simple_lstm_model,\n",
"                               torch.optim.Adam(simple_lstm_model.parameters(),\n",
"                                                lr=learning_rate),\n",
"                               nn.CrossEntropyLoss(), # includes softmax\n",
"                               proper_train_dataset_encoded, val_dataset_encoded,\n",
"                               num_epochs, batch_size,\n",
"                               rnn=True,\n",
"                               save_epoch_checkpoint_prefix='./saved_model_checkpoints/imdb_lstm')\n",
"\n",
"UDA_plot_train_val_accuracy_vs_epoch(train_accuracies, val_accuracies)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The model at the end of epoch 8 achieved the highest validation accuracy: 0.877600\n"
]
},
{
"data": {
"text/plain": [
"<All keys matched successfully>"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"best_epoch_idx = np.argmax(val_accuracies)\n",
"print('The model at the end of epoch %d achieved the highest validation accuracy: %f'\n",
" % (best_epoch_idx + 1, val_accuracies[best_epoch_idx]))\n",
"simple_lstm_model.load_state_dict(torch.load('./saved_model_checkpoints/imdb_lstm_epoch%d.pt' % (best_epoch_idx + 1)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Finally, evaluate on test data"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"test_dataset = []\n",
"\n",
"for filename in sorted(glob('./data/aclImdb/test/pos/*.txt')):\n",
"    with open(filename, 'r', encoding='utf-8') as f:\n",
"        test_dataset.append((f.read(), 1)) # 1 means `positive` sentiment\n",
"\n",
"for filename in sorted(glob('./data/aclImdb/test/neg/*.txt')):\n",
"    with open(filename, 'r', encoding='utf-8') as f:\n",
"        test_dataset.append((f.read(), 0)) # 0 means `negative` sentiment"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"test_encoded = [vocab(tokenizer(text)) for text, label in test_dataset]"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"test_labels = [label for text, label in test_dataset]"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"predicted_test_labels = UDA_pytorch_classifier_predict(simple_lstm_model,\n",
" test_encoded,\n",
" rnn=True)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Test accuracy: 0.87848\n"
]
}
],
"source": [
"print('Test accuracy:', UDA_compute_accuracy(predicted_test_labels, test_labels))"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tensor([1], device='cuda:0')"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"UDA_pytorch_classifier_predict(simple_lstm_model,\n",
" [vocab(tokenizer('this movie rocks'))],\n",
" rnn=True)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tensor([0], device='cuda:0')"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"UDA_pytorch_classifier_predict(simple_lstm_model,\n",
" [vocab(tokenizer('this movie sucks'))],\n",
" rnn=True)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tensor([0], device='cuda:0')"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"UDA_pytorch_classifier_predict(simple_lstm_model,\n",
" [vocab(tokenizer('this sucks'))],\n",
" rnn=True)"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 1
}