@ritakurban
Created November 25, 2019 19:01
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "rZ9VsnQv4wPm"
},
"source": [
"# Convolutional Sentiment Analysis\n",
"The notebook is based on this [tutorial](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/4%20-%20Convolutional%20Sentiment%20Analysis.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "rf-Rkgdr4wPq"
},
"source": [
"### Preparing Data"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# Import packages\n",
"import torch\n",
"from torchtext import data\n",
"from torchtext import datasets\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"import torch.optim as optim\n",
"import random\n",
"import numpy as np\n",
"import spacy\n",
"import time\n",
"import matplotlib.pyplot as plt\n",
"\n",
"#Load English language\n",
"nlp = spacy.load('en')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "UmHvoEZS4wPr"
},
"outputs": [],
"source": [
"# Initiate class instances with tokenizers\n",
"# (the fields must exist before they are passed to IMDB.splits)\n",
"TEXT = data.Field(tokenize = 'spacy', batch_first = True)\n",
"LABEL = data.LabelField(dtype = torch.float)\n",
"\n",
"# Load data from torchtext (identical to what we have in Kaggle)\n",
"train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)\n",
"train_data, valid_data = train_data.split()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "uiRr2TaX4wPz"
},
"outputs": [],
"source": [
"# Keep only the 30,000 most frequent words\n",
"MAX_VOCAB_SIZE = 30_000\n",
"\n",
"# Build vocabularies\n",
"TEXT.build_vocab(train_data,\n",
"                 max_size = MAX_VOCAB_SIZE,\n",
"                 # Load pretrained embeddings\n",
"                 vectors = \"glove.6B.100d\",\n",
"                 # Initialize vectors of words missing from GloVe\n",
"                 unk_init = torch.Tensor.normal_)\n",
"\n",
"LABEL.build_vocab(train_data)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "prLYAyhI4wP4"
},
"outputs": [],
"source": [
"BATCH_SIZE = 64\n",
"\n",
"device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
"\n",
"# Create PyTorch iterators to use in training/evaluation/testing\n",
"train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(\n",
"    (train_data, valid_data, test_data),\n",
"    batch_size = BATCH_SIZE,\n",
"    device = device)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "v_87VK9S4wP7"
},
"source": [
"### Build Model\n",
"\n",
"We can view a review as a 2-dimensional grid: the words run along one axis and the elements of each word's embedding run along the other.\n",
"\n",
"In the model, we use filters of five different heights (1, 2, 3, 4 and 5 words), with 100 filters of each height. The intuition is that the filters look for the occurrence of different uni-grams, bi-grams, tri-grams, 4-grams and 5-grams that are relevant for analysing the sentiment of movie reviews.\n",
"\n",
"The idea behind max-pooling is that the maximum value of each filter's output is its \"most important\" feature for determining the sentiment of the review. It corresponds to the n-gram within the review that the filter responds to most strongly; which n-grams matter is learned through backprop.\n",
"\n",
"After max-pooling we have 500 features (5 filter sizes x 100 filters), one per learned n-gram detector. We concatenate them into a single vector and pass it through a linear layer to predict the sentiment. We can think of the weights of this linear layer as \"weighing up the evidence\" from each of the 500 n-grams and making a final decision. "
]
},
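{
"cell_type": "markdown",
"metadata": {},
"source": [
"The shapes described above can be traced with a small sketch. It is an illustration only: the batch size and sentence length are arbitrary values picked for the example and the input is random noise, so it checks tensor shapes and says nothing about the trained model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"\n",
"# Arbitrary sizes for illustration: 2 reviews of 7 words, 100-d embeddings\n",
"batch_size, sent_len, emb_dim, n_filters = 2, 7, 100, 100\n",
"\n",
"# Random stand-in for the embedded batch: [batch, channel, words, emb_dim]\n",
"embedded = torch.randn(batch_size, 1, sent_len, emb_dim)\n",
"\n",
"pooled = []\n",
"for fs in [1, 2, 3, 4, 5]:\n",
"    conv = nn.Conv2d(in_channels = 1, out_channels = n_filters,\n",
"                     kernel_size = (fs, emb_dim))\n",
"    # A filter of height fs fits in sent_len - fs + 1 positions\n",
"    conved = F.relu(conv(embedded)).squeeze(3)\n",
"    assert conved.shape == (batch_size, n_filters, sent_len - fs + 1)\n",
"    # Max over all positions keeps one value per filter\n",
"    pooled.append(F.max_pool1d(conved, conved.shape[2]).squeeze(2))\n",
"\n",
"# 5 filter sizes x 100 filters = 500 features per review\n",
"cat = torch.cat(pooled, dim = 1)\n",
"print(cat.shape)"
]
},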
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "TLvnIV5s4wQC"
},
"outputs": [],
"source": [
"class CNN_Text(nn.Module):\n",
"    ''' Define network architecture and forward pass. '''\n",
"    def __init__(self, vocab_size,\n",
"                 vector_size, n_filters,\n",
"                 filter_sizes, output_dim,\n",
"                 dropout, pad_idx):\n",
"\n",
"        super().__init__()\n",
"        # Create word embeddings from the input words\n",
"        self.embedding = nn.Embedding(vocab_size, vector_size,\n",
"                                      padding_idx = pad_idx)\n",
"\n",
"        # Specify convolutions with filters of different sizes (fs)\n",
"        self.convs = nn.ModuleList([nn.Conv2d(in_channels = 1,\n",
"                                              out_channels = n_filters,\n",
"                                              kernel_size = (fs, vector_size))\n",
"                                    for fs in filter_sizes])\n",
"\n",
"        # Add a fully connected layer for final predictions\n",
"        self.linear = nn.Linear(len(filter_sizes) * n_filters, output_dim)\n",
"\n",
"        # Drop some of the nodes to increase robustness in training\n",
"        self.dropout = nn.Dropout(dropout)\n",
"\n",
"    def forward(self, text):\n",
"        '''Forward pass of the network.'''\n",
"        # Get word embeddings and add a channel dimension for the convolutions\n",
"        embedded = self.embedding(text).unsqueeze(1)\n",
"\n",
"        # Perform convolutions and apply activation functions\n",
"        conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]\n",
"\n",
"        # Max-pool over all positions to keep one value per filter\n",
"        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]\n",
"\n",
"        # Concatenate the pooled features and apply dropout\n",
"        cat = self.dropout(torch.cat(pooled, dim = 1))\n",
"        return self.linear(cat)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Model with Pre-Trained Embeddings"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "SRJUKMJl4wQI"
},
"outputs": [],
"source": [
"# Vocabulary size\n",
"INPUT_DIM = len(TEXT.vocab)\n",
"\n",
"# Vector size (lower-dimensional repr. of each word)\n",
"EMBEDDING_DIM = 100\n",
"\n",
"# Number of filters per filter size\n",
"N_FILTERS = 100\n",
"\n",
"# N-gram sizes that we want to analyze using filters\n",
"FILTER_SIZES = [1, 2, 3, 4, 5]\n",
"\n",
"# Output of the linear layer (a single logit; the sigmoid is applied later)\n",
"OUTPUT_DIM = 1\n",
"\n",
"# Proportion of units to drop\n",
"DROPOUT = 0.5"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Indices of the unknown and padding tokens (the padding token defaults to '<pad>').\n",
"# They are defined first because the model constructor needs PAD_IDX.\n",
"UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]\n",
"PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]\n",
"\n",
"# Initialize model and load pre-trained embeddings\n",
"model = CNN_Text(INPUT_DIM, EMBEDDING_DIM,\n",
"                 N_FILTERS, FILTER_SIZES,\n",
"                 OUTPUT_DIM, DROPOUT, PAD_IDX)\n",
"\n",
"model.embedding.weight.data.copy_(TEXT.vocab.vectors)\n",
"\n",
"# Zero the initial weights of the unknown and padding tokens\n",
"model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)\n",
"model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)\n",
"model = model.to(device)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "-qmSokzB4wQc"
},
"source": [
"## Train & Evaluate Model"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "1F6-f5aJ4wQh"
},
"outputs": [],
"source": [
"# Helper functions\n",
"def accuracy(preds, y):\n",
"    \"\"\" Return accuracy per batch. \"\"\"\n",
"    correct = (torch.round(torch.sigmoid(preds)) == y).float()\n",
"    return correct.sum() / len(correct)\n",
"\n",
"def epoch_time(start_time, end_time):\n",
"    '''Track training time. '''\n",
"    elapsed_time = end_time - start_time\n",
"    elapsed_mins = int(elapsed_time / 60)\n",
"    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))\n",
"    return elapsed_mins, elapsed_secs"
]
},
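{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of the accuracy computation, the sketch below repeats the same steps inline on hand-picked logits and labels. The values are made up for illustration and are not model output."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"\n",
"# Made-up logits and labels, for illustration only\n",
"logits = torch.tensor([2.0, -1.0, 0.5, -3.0])\n",
"labels = torch.tensor([1.0, 0.0, 0.0, 0.0])\n",
"\n",
"# The sigmoid maps positive logits above 0.5, so rounding gives [1, 0, 1, 0]\n",
"rounded = torch.round(torch.sigmoid(logits))\n",
"\n",
"# Three of the four rounded predictions match the labels\n",
"correct = (rounded == labels).float()\n",
"print((correct.sum() / len(correct)).item())"
]
},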
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "SobuEXbx4wQp"
},
"outputs": [],
"source": [
"def train(model, iterator, optimizer, criterion):\n",
"    '''Train the model with specified data, optimizer, and loss function. '''\n",
"    epoch_loss = 0\n",
"    epoch_acc = 0\n",
"\n",
"    model.train()\n",
"\n",
"    for batch in iterator:\n",
"\n",
"        # Reset the gradients so they do not accumulate across batches\n",
"        optimizer.zero_grad()\n",
"\n",
"        predictions = model(batch.text).squeeze(1)\n",
"\n",
"        loss = criterion(predictions, batch.label)\n",
"\n",
"        acc = accuracy(predictions, batch.label)\n",
"\n",
"        # Backprop\n",
"        loss.backward()\n",
"\n",
"        # Optimize the weights\n",
"        optimizer.step()\n",
"\n",
"        # Record accuracy and loss\n",
"        epoch_loss += loss.item()\n",
"        epoch_acc += acc.item()\n",
"\n",
"    return epoch_loss / len(iterator), epoch_acc / len(iterator)\n",
"\n",
"\n",
"def evaluate(model, iterator, criterion):\n",
"    '''Evaluate model performance. '''\n",
"    epoch_loss = 0\n",
"    epoch_acc = 0\n",
"\n",
"    # Turn off dropout while evaluating\n",
"    model.eval()\n",
"\n",
"    # No need to backprop in eval\n",
"    with torch.no_grad():\n",
"\n",
"        for batch in iterator:\n",
"\n",
"            predictions = model(batch.text).squeeze(1)\n",
"\n",
"            loss = criterion(predictions, batch.label)\n",
"\n",
"            acc = accuracy(predictions, batch.label)\n",
"\n",
"            epoch_loss += loss.item()\n",
"            epoch_acc += acc.item()\n",
"\n",
"    return epoch_loss / len(iterator), epoch_acc / len(iterator)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Network optimizer\n",
"optimizer = optim.Adam(model.parameters())\n",
"\n",
"# Loss function\n",
"criterion = nn.BCEWithLogitsLoss()\n",
"\n",
"model = model.to(device)\n",
"criterion = criterion.to(device)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch: 1 | Epoch Time: 17m 41s\n",
"\tTrain Loss: 0.432 | Train Acc: 80.36%\n",
"\t Val. Loss: 0.370 | Val. Acc: 83.82%\n",
"Epoch: 2 | Epoch Time: 17m 45s\n",
"\tTrain Loss: 0.313 | Train Acc: 86.76%\n",
"\t Val. Loss: 0.322 | Val. Acc: 86.13%\n",
"Epoch: 3 | Epoch Time: 18m 48s\n",
"\tTrain Loss: 0.227 | Train Acc: 90.81%\n",
"\t Val. Loss: 0.306 | Val. Acc: 87.00%\n",
"Epoch: 4 | Epoch Time: 19m 41s\n",
"\tTrain Loss: 0.161 | Train Acc: 93.89%\n",
"\t Val. Loss: 0.312 | Val. Acc: 87.49%\n",
"Epoch: 5 | Epoch Time: 18m 59s\n",
"\tTrain Loss: 0.117 | Train Acc: 95.79%\n",
"\t Val. Loss: 0.333 | Val. Acc: 87.10%\n",
"Epoch: 6 | Epoch Time: 78m 44s\n",
"\tTrain Loss: 0.081 | Train Acc: 97.20%\n",
"\t Val. Loss: 0.349 | Val. Acc: 87.66%\n",
"Epoch: 7 | Epoch Time: 46m 50s\n",
"\tTrain Loss: 0.052 | Train Acc: 98.36%\n",
"\t Val. Loss: 0.379 | Val. Acc: 87.49%\n",
"Epoch: 8 | Epoch Time: 55m 27s\n",
"\tTrain Loss: 0.039 | Train Acc: 98.76%\n",
"\t Val. Loss: 0.414 | Val. Acc: 87.31%\n",
"Epoch: 9 | Epoch Time: 21m 37s\n",
"\tTrain Loss: 0.030 | Train Acc: 99.18%\n",
"\t Val. Loss: 0.440 | Val. Acc: 87.17%\n",
"Epoch: 10 | Epoch Time: 18m 9s\n",
"\tTrain Loss: 0.023 | Train Acc: 99.27%\n",
"\t Val. Loss: 0.468 | Val. Acc: 87.30%\n"
]
}
],
"source": [
"# Training loop\n",
"N_EPOCHS = 10\n",
"\n",
"best_valid_loss = float('inf')\n",
"val_loss = []\n",
"val_acc = []\n",
"tr_loss = []\n",
"tr_acc = []\n",
"\n",
"for epoch in range(N_EPOCHS):\n",
"\n",
"    # Calculate training time\n",
"    start_time = time.time()\n",
"\n",
"    # Get epoch losses and accuracies\n",
"    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n",
"    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n",
"\n",
"    end_time = time.time()\n",
"    epoch_mins, epoch_secs = epoch_time(start_time, end_time)\n",
"\n",
"    # Save training metrics\n",
"    val_loss.append(valid_loss)\n",
"    val_acc.append(valid_acc)\n",
"    tr_loss.append(train_loss)\n",
"    tr_acc.append(train_acc)\n",
"\n",
"    # Keep the weights with the lowest validation loss so far\n",
"    if valid_loss < best_valid_loss:\n",
"        best_valid_loss = valid_loss\n",
"        torch.save(model.state_dict(), 'CNN-model.pt')\n",
"\n",
"    print(f'Epoch: {epoch+1:2} | Epoch Time: {epoch_mins}m {epoch_secs}s')\n",
"    print(f'\\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')\n",
"    print(f'\\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1080x360 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Plot accuracy and loss\n",
"fig, ax = plt.subplots(1, 2, figsize=(15,5))\n",
"ax[0].plot(val_loss, label='Validation loss')\n",
"ax[0].plot(tr_loss, label='Training loss')\n",
"ax[0].set_title('Losses')\n",
"ax[0].set_xlabel('Epoch')\n",
"ax[0].set_ylabel('Loss')\n",
"ax[0].legend()\n",
"ax[1].plot(val_acc, label='Validation accuracy')\n",
"ax[1].plot(tr_acc, label='Training accuracy')\n",
"ax[1].set_title('Accuracies')\n",
"ax[1].set_xlabel('Epoch')\n",
"ax[1].set_ylabel('Accuracy')\n",
"ax[1].legend()\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "yPFKQRti4wRD",
"outputId": "01d1694f-a0a5-462c-a486-a9eb27934622"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Test Loss: 0.336 | Test Acc: 85.43%\n"
]
}
],
"source": [
"# Evaluate model on test data\n",
"model.load_state_dict(torch.load('CNN-model.pt'))\n",
"\n",
"test_loss, test_acc = evaluate(model, test_iterator, criterion)\n",
"\n",
"print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "GyU2AeeV4wRH"
},
"source": [
"### Check performance with arbitrary sentences"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "sjP-gqt54wRI"
},
"outputs": [],
"source": [
"def sentiment(model, sentence, min_len = 5):\n",
"    '''Predict user-defined review sentiment.'''\n",
"    model.eval()\n",
"    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]\n",
"    # Pad short inputs so that the largest filter (5 words) still fits\n",
"    if len(tokenized) < min_len:\n",
"        tokenized += ['<pad>'] * (min_len - len(tokenized))\n",
"    # Map words to their vocabulary indices\n",
"    indexed = [TEXT.vocab.stoi[t] for t in tokenized]\n",
"    tensor = torch.LongTensor(indexed).to(device)\n",
"    # Add a batch dimension\n",
"    tensor = tensor.unsqueeze(0)\n",
"    # Get predictions\n",
"    prediction = torch.sigmoid(model(tensor))\n",
"    return prediction.item()"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "yG4QmntM4wRK",
"outputId": "11b1195b-9887-4de9-e16a-46293d566bdd"
},
"outputs": [
{
"data": {
"text/plain": [
"[0.0067734261974692345, 0.49260634183883667, 0.9711275100708008]"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reviews = ['This is the best movie I have ever watched!',\n",
"           'This is an okay movie',\n",
"           'This was a waste of time! I hated this movie.']\n",
"scores = [sentiment(model, review) for review in reviews]\n",
"scores"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.0500052347779274, 0.47646334767341614, 0.9872849583625793]"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tricky_reviews = ['This is not the best movie I have ever watched!',\n",
"                  'Some would say it is an okay movie, but I found it terrific.',\n",
"                  'This was a waste of time! I did not like this movie.']\n",
"scores = [sentiment(model, review) for review in tricky_reviews]\n",
"scores"
]
}
],
"metadata": {
"colab": {
"name": "4 - Convolutional Sentiment Analysis.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "evamodels",
"language": "python",
"name": "evamodels"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 1
}