Mrpatekful/language_modelling.ipynb

## language_modelling.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "language_modelling.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/Mrpatekful/0200a43b7340d177154c280df95fa8f7/language_modelling.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AV33Xf9Z3u2_",
        "colab_type": "text"
      },
      "source": [
        "# Language modelling techniques\n",
        "\n",
        "---\n",
        "\n",
        "In this notebook I attempt to implement several techniques, which improve the performance of language models and neural language processing algorithms in general.  I will be using the pytorch and torchtext libraries for my experiments, since these frameworks provide excelent tools to rapidly prototype neural network models and input pipelines. This notebook is mainly aimed at measuring the effects of applying *mixture of softmaxes* from  reference *1* and *2*. \n",
        "\n",
        "---\n",
        "\n",
        "## References\n",
        "\n",
        "\n",
        "1.   **[Breaking the Softmax Bottleneck: A High-Rank RNN Language Model](https://arxiv.org/pdf/1711.03953.pdf)**\n",
        "2.   **[Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling](https://arxiv.org/pdf/1611.01462.pdf)**\n",
        "3.   **[Using the Output Embedding to Improve Language Models](https://arxiv.org/pdf/1608.05859.pdf)**\n",
        "4.   **[Regularizing and Optimizing LSTM Language Models](https://arxiv.org/pdf/1708.02182.pdf)**\n",
        "5.  **[Regularization of Neural Networks using DropConnect](https://cs.nyu.edu/~wanli/dropc/dropc.pdf)**\n",
        "6.  **[An Analysis of Neural Language Modeling at Multiple Scales](https://arxiv.org/pdf/1803.08240.pdf)**\n",
        "7.  **[Rethinking the Inception Architecture for Computer Vision](https://arxiv.org/pdf/1512.00567.pdf)**\n",
        "8.  **[Improving Neural Language Models with Weight Norm Initialization and\n",
        "Regularization](https://www.aclweb.org/anthology/W18-6310)**\n",
        "9.  **[A Theoretically Grounded Application of Dropout in\n",
        "Recurrent Neural Networks](https://arxiv.org/pdf/1512.05287.pdf)**\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "VXQIMqchUjsy",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import torch\n",
        "import math\n",
        "import numpy as np\n",
        "import random"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "NN0QaXxMUlgh",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# setting seed to make the results reproducible\n",
        "torch.manual_seed(0)\n",
        "torch.cuda.manual_seed(0)\n",
        "np.random.seed(0)\n",
        "random.seed(0)\n",
        "\n",
        "torch.backends.cudnn.deterministic = True\n",
        "torch.backends.cudnn.benchmark = False"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "H5_02NzL--si",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fc8LuifXCMZ9",
        "colab_type": "text"
      },
      "source": [
        "## Input pipeline\n",
        "\n",
        "The techniques are tested on word level lanugage modelling on Penn Treebank dataset from torchtext library. The hyperparameters like BPTT length are taken from **[An Analysis of Neural Language Modeling at Multiple Scales](https://arxiv.org/pdf/1803.08240.pdf)**."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "JuZ-7OEGsm4t",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from torchtext.datasets import PennTreebank"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "RZ9MJERvZ0UF",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "BPTT_LEN = 70\n",
        "BATCH_SIZE = 12"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "hG9O9_sYESsw",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "train, valid, test = PennTreebank.iters(\n",
        "  bptt_len=BPTT_LEN, batch_size=BATCH_SIZE, device=device)\n",
        "# saving vocab reference, since it will be needed later\n",
        "vocab = train.dataset.fields['text'].vocab"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "E46VF75OGVzO",
        "colab_type": "text"
      },
      "source": [
        "## Training\n",
        "\n",
        "The metrics, which are tracked are the token level loss, perplexity and f1-score.  ***TODO*** Activation regularization *(AR)* and temporal activation reguralization *(TAR)* are also to be employed from **[Regularizing and Optimizing LSTM Language Models](https://arxiv.org/pdf/1708.02182.pdf)**."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "akXLBWxqHUOI",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import copy\n",
        "\n",
        "import seaborn as sns\n",
        "import matplotlib.pyplot as plt\n",
        "%matplotlib inline\n",
        "\n",
        "sns.set()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "GrYU8d8fFQQh",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from torch.nn.functional import one_hot\n",
        "from tqdm import tqdm_notebook as tqdm\n",
        "from collections import OrderedDict"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "9l-GokNU17ce",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def compute_f1_score(preds, targets):\n",
        "  \"\"\"\n",
        "  Calculates the f1-score.\n",
        "  \"\"\"\n",
        "  true_pred_indices = torch.nonzero(preds).squeeze()\n",
        "  eps = 1e-6 # for numerical stability\n",
        "  \n",
        "  true_positives = targets[true_pred_indices].sum().float()\n",
        "  precision = true_positives / (true_pred_indices.size(0) + eps)\n",
        "  \n",
        "  recall = true_positives / targets.sum().float()\n",
        "  f1_score = 2 * (precision * recall) / (precision + recall + eps)\n",
        "  \n",
        "  return f1_score.item()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "kW1hMfXSKnHI",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def count_parameters(model):\n",
        "  \"\"\"\n",
        "  Counts the number of parameres.\n",
        "  \"\"\"\n",
        "  return sum(p.numel() for p in model.parameters() if p.requires_grad)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "CBaGGmtZGF8_",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def plot_history(histories, ignore_keys=None):\n",
        "  \"\"\"\n",
        "  Visualizes the history of training.\n",
        "  \"\"\"\n",
        "  if ignore_keys is None:\n",
        "    ignore_keys = []\n",
        "  num_metrics = len(list(histories.values())[0])\n",
        "  fig, axes = plt.subplots(\n",
        "      num_metrics, 1, sharex=True, \n",
        "      figsize=(10, 5 * num_metrics))\n",
        "     \n",
        "  for name, history in histories.items():\n",
        "    for ax, (metric, splits) in zip(axes, history.items()):\n",
        "      for (key, values), style in zip(splits.items(), ['--', '-', ':']):\n",
        "        if key not in ignore_keys:\n",
        "          ax.plot(np.array(values), label='{}_{}'.format(name, key), \n",
        "                  linestyle=style)\n",
        "      ax.set_title(metric)\n",
        "      ax.legend()\n",
        "  \n",
        "  plt.show()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ZmZVfPzlGVMm",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def train_and_evaluate(model, criterion, optimizer, max_patience):\n",
        "  \"\"\"\n",
        "  Runs the model through the train and eval data.\n",
        "  \"\"\"\n",
        "  model = model.to(device)\n",
        "  \n",
        "  # tracking metrics for each epoch to visualize progress\n",
        "  train_losses, train_f1_scores, train_perplexities = [], [], []\n",
        "  valid_losses, valid_f1_scores, valid_perplexities = [], [], []\n",
        "  \n",
        "  # perplexity is used as early stopping metric\n",
        "  best_valid_perplexity = np.inf\n",
        "  patience = max_patience\n",
        "  best_model = copy.deepcopy(model)\n",
        "  \n",
        "  reduce_lr = torch.optim.lr_scheduler.ReduceLROnPlateau(\n",
        "    optimizer, 'min')\n",
        "  \n",
        "  while True:\n",
        "    losses, f1_scores, perplexities = [], [], []\n",
        "    \n",
        "    # performing train loop\n",
        "    model.train()\n",
        "    loop = tqdm(train)\n",
        "    hidden = None\n",
        "    \n",
        "    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(\n",
        "      optimizer, len(train), eta_min=1e-6)\n",
        "\n",
        "    for batch in loop:\n",
        "      optimizer.zero_grad()\n",
        "      # NOTE data is processed and provided in format\n",
        "      # seq_len x batch_size x features\n",
        "      text = batch.text\n",
        "      # targets are simply the inputs shifted by one step\n",
        "      inputs, targets = text[:-1], text[1:]\n",
        "      outputs = model(inputs, hidden)\n",
        "      log_probs, hidden, *_ = outputs\n",
        "      \n",
        "      loss, ppl, f1_score = compute_loss(\n",
        "        log_probs, targets, criterion)\n",
        "            \n",
        "      losses.append(loss.item())\n",
        "      perplexities.append(ppl)\n",
        "      f1_scores.append(f1_score)\n",
        "        \n",
        "      loss.backward()\n",
        "      \n",
        "      # gradient clipping to prevent exploding gradient\n",
        "      torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)\n",
        "      \n",
        "      optimizer.step()\n",
        "      scheduler.step()\n",
        "      \n",
        "      loop.set_postfix(ordered_dict=OrderedDict(\n",
        "        loss=loss.item(), ppl=ppl, f1=f1_score))\n",
        "    \n",
        "    # saving train metrics\n",
        "    train_losses.append(sum(losses) / len(losses))\n",
        "    train_perplexities.append(sum(perplexities) / len(perplexities))\n",
        "    train_f1_scores.append(sum(f1_scores) / len(f1_scores))\n",
        "    \n",
        "    losses, f1_scores, perplexities = [], [], []\n",
        "    \n",
        "    # performing validation loop\n",
        "    model.eval()\n",
        "    loop = tqdm(valid)\n",
        "    hidden = None\n",
        "\n",
        "    with torch.no_grad():\n",
        "      for batch in loop:\n",
        "        text = batch.text\n",
        "        inputs, targets = text[:-1], text[1:]\n",
        "        \n",
        "        log_probs, hidden = model(inputs, hidden)\n",
        "        loss, ppl, f1_score = compute_loss(\n",
        "          log_probs, targets, criterion)\n",
        "        \n",
        "        losses.append(loss.item())\n",
        "        perplexities.append(ppl)\n",
        "        f1_scores.append(f1_score)\n",
        "        \n",
        "        loop.set_postfix(ordered_dict=OrderedDict(\n",
        "          loss=loss.item(), ppl=ppl, f1=f1_score))\n",
        "        \n",
        "    valid_perplexity = sum(perplexities) / len(perplexities)\n",
        "    reduce_lr.step(valid_perplexity)\n",
        "    \n",
        "    # saving valid metrics\n",
        "    valid_losses.append(sum(losses) / len(losses))\n",
        "    valid_perplexities.append(valid_perplexity)\n",
        "    valid_f1_scores.append(sum(f1_scores) / len(f1_scores))\n",
        "    \n",
        "    print('avg val ppl: {}'.format(valid_perplexity))\n",
        "    \n",
        "    # performing early stopping if necessary\n",
        "    if valid_perplexity < best_valid_perplexity:\n",
        "      best_valid_perplexity = valid_perplexity\n",
        "      best_model = copy.deepcopy(model)\n",
        "      patience = max_patience\n",
        "    else:\n",
        "      patience -= 1\n",
        "      if patience == 0:\n",
        "        break\n",
        "        \n",
        "  # performing testing with the best model\n",
        "  losses, f1_scores, perplexities = [], [], []\n",
        "  hidden = None\n",
        "\n",
        "  with torch.no_grad():    \n",
        "    for batch in test:\n",
        "      text = batch.text\n",
        "      inputs, targets = text[:-1], text[1:]\n",
        "      log_probs, hidden = best_model(inputs, hidden)\n",
        "      loss, ppl, f1_score = compute_loss(\n",
        "        log_probs, targets, criterion)\n",
        "      \n",
        "      losses.append(loss.item())\n",
        "      perplexities.append(ppl)\n",
        "      f1_scores.append(f1_score)\n",
        "  \n",
        "  # saving test metrics\n",
        "  test_loss = sum(losses) / len(losses)\n",
        "  test_ppl = sum(perplexities) / len(perplexities)\n",
        "  test_f1_score = sum(f1_scores) / len(f1_scores)\n",
        "  \n",
        "  print('avg test ppl: {}'.format(test_ppl))\n",
        "  \n",
        "  # preparing metrics for plotting\n",
        "  history = {\n",
        "    'loss': {\n",
        "      'train': train_losses,\n",
        "      'valid': valid_losses\n",
        "    },\n",
        "    'f1': {\n",
        "      'train': train_f1_scores,\n",
        "      'valid': valid_f1_scores\n",
        "    },\n",
        "    'ppl': {\n",
        "      'train': train_perplexities,\n",
        "      'valid': valid_perplexities\n",
        "    }\n",
        "  }\n",
        "      \n",
        "  return test_loss, test_ppl, test_f1_score, history"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "troyqPJeHSM6",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def compute_loss(log_probs, targets, criterion):\n",
        "  \"\"\"\n",
        "  Computes the crossentropy loss and f1-score.\n",
        "  \"\"\"\n",
        "  log_probs_view = log_probs.contiguous().view(\n",
        "    -1, log_probs.size(-1))\n",
        "  targets_view = targets.contiguous().view(-1)\n",
        "  \n",
        "  notpad = targets.ne(vocab.stoi['<pad>'])\n",
        "  target_tokens = notpad.long().sum().item()\n",
        "  \n",
        "  ppl = torch.exp(torch.nn.functional.nll_loss(\n",
        "    log_probs_view, \n",
        "    targets_view, \n",
        "    ignore_index=vocab.stoi['<pad>'], \n",
        "    reduction='sum') / target_tokens).item()\n",
        "  \n",
        "  loss = criterion(log_probs_view, targets_view)\n",
        "  loss = loss / target_tokens\n",
        "  \n",
        "  _, preds = log_probs.max(-1)\n",
        "    \n",
        "  # projecting predictions to a binary decision task\n",
        "  # over the vocabulary for calculating f1-score\n",
        "  targets_binary = one_hot(targets, log_probs.size(-1))\n",
        "  preds_binary = one_hot(preds, log_probs.size(-1))\n",
        "  \n",
        "  targets_binary_view_np = targets_binary.view(-1)\n",
        "  preds_binary_view_np = preds_binary.view(-1)\n",
        "  \n",
        "  f1_score = compute_f1_score(\n",
        "      preds_binary_view_np, targets_binary_view_np)\n",
        "  \n",
        "  return loss, ppl, f1_score"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "540PKGtxEYjo",
        "colab_type": "text"
      },
      "source": [
        "---\n",
        "\n",
        "## Baseline\n",
        "\n",
        "The implementation is partially based on \n",
        "pytorch word modelling [example](https://github.com/pytorch/examples/tree/master/word_language_model) and [salesforce/awd-lstm-lm](https://github.com/salesforce/awd-lstm-lm), while basic hyperparameters are from **[An Analysis of Neural Language Modeling at Multiple Scales](https://arxiv.org/pdf/1803.08240.pdf)**."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "RDadjQ4qS087",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# setting some hyper parameters\n",
        "EMBEDDING_DIM = 480\n",
        "\n",
        "HIDDEN_DIM = 620\n",
        "LAST_HIDDEN_DIM = 960\n",
        "\n",
        "PATIENCE = 3\n",
        "\n",
        "histories = {}"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "FeaItcEL29sb",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "class BaselineModel(torch.nn.Module):\n",
        "  \"\"\"\n",
        "  Baseline language model.\n",
        "  \"\"\"\n",
        "  \n",
        "  def __init__(self, hidden_dim, \n",
        "               last_hidden_dim, embedding_dim, vocab_size):\n",
        "    super().__init__()\n",
        "    \n",
        "    self.embedding = torch.nn.Embedding(\n",
        "      num_embeddings=vocab_size,\n",
        "      embedding_dim=embedding_dim)\n",
        "    self.embedding.weight.data.uniform_(-0.1, 0.1)\n",
        "    \n",
        "    self.i_dropout = torch.nn.Dropout(p=0.4)\n",
        "    self.h_dropout = torch.nn.Dropout(p=0.4)\n",
        "    \n",
        "    # creating rnn as module list so custom\n",
        "    # dropout can be applied between layers\n",
        "    self.rnn = torch.nn.ModuleList([\n",
        "      torch.nn.GRU(\n",
        "        input_size=embedding_dim,\n",
        "        hidden_size=hidden_dim),\n",
        "      torch.nn.GRU(\n",
        "        input_size=hidden_dim,\n",
        "        hidden_size=hidden_dim),\n",
        "      torch.nn.GRU(\n",
        "        input_size=hidden_dim,\n",
        "        hidden_size=last_hidden_dim)\n",
        "    ])\n",
        "        \n",
        "    self.project = torch.nn.Linear(\n",
        "      in_features=last_hidden_dim,\n",
        "      out_features=embedding_dim)\n",
        "    \n",
        "    # initializing output layer like this\n",
        "    # instead of torch.nn.Linear so\n",
        "    # shared embedding can be implemented easily\n",
        "    self.vocab_size = vocab_size\n",
        "    \n",
        "    self.output_bias = torch.nn.Parameter(\n",
        "      torch.zeros((vocab_size, )))\n",
        "    \n",
        "    self.output_weights = torch.nn.Parameter(\n",
        "      torch.empty((vocab_size, embedding_dim)).uniform_(\n",
        "        -0.1, 0.1))\n",
        "    \n",
        "  def forward(self, inputs, prev_hidden):\n",
        "    embeds = self.i_dropout(self.embedding(inputs))\n",
        "    \n",
        "    hidden_states = []\n",
        "    outputs = embeds\n",
        "    for idx, layer in enumerate(self.rnn):\n",
        "      outputs, hidden_state = layer(\n",
        "        outputs, (prev_hidden[idx]\n",
        "          if prev_hidden is not None else None))\n",
        "      hidden_states.append(hidden_state.detach())\n",
        "\n",
        "      outputs = self.h_dropout(outputs)\n",
        "    \n",
        "    projected = self.project(outputs)\n",
        "    logits = torch.nn.functional.linear(\n",
        "      projected, self.output_weights, self.output_bias)\n",
        "    \n",
        "    log_probs = torch.nn.functional.log_softmax(\n",
        "      logits, dim=-1)\n",
        "    \n",
        "    return log_probs, hidden_states"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "QIiYIbYN3rlQ",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "baseline_model = BaselineModel(\n",
        "  hidden_dim=HIDDEN_DIM,\n",
        "  last_hidden_dim=LAST_HIDDEN_DIM,\n",
        "  embedding_dim=EMBEDDING_DIM,\n",
        "  vocab_size=len(vocab))\n",
        "\n",
        "print('Num. params: {}'.format(count_parameters(baseline_model)))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "U5DusalkntIS",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "cross_entropy = torch.nn.NLLLoss(\n",
        "  ignore_index=vocab.stoi['<pad>'], reduction='sum')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "9rvl9wrBrExf",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "sgd = torch.optim.SGD(\n",
        "  baseline_model.parameters(), lr=1, weight_decay=1.2e-6, \n",
        "  momentum=0.9, nesterov=True)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "2V-lc-Sn3rui",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "loss, ppl, f1_score, baseline_history = train_and_evaluate(\n",
        "  model=baseline_model, \n",
        "  criterion=cross_entropy,\n",
        "  optimizer=sgd,\n",
        "  max_patience=PATIENCE)\n",
        "\n",
        "histories.update({'baseline': baseline_history})"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VX3E2z2sLGDP",
        "colab_type": "text"
      },
      "source": [
        "`avg test ppl: 112.64192423194346`\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7SS5NN8w2UrO",
        "colab_type": "text"
      },
      "source": [
        "---\n",
        "\n",
        "## Locked dropout\n",
        "\n",
        "This implementation is a slightly modified version from [pytorch-nlp](https://github.com/PetrochukM/PyTorch-NLP/blob/master/torchnlp/nn/lock_dropout.py). In standard dropout, a new binary dropout mask is sampled\n",
        "each and every time the dropout function is called. Variational dropout or locked dropout from **[A Theoretically Grounded Application of Dropout in\n",
        "Recurrent Neural Networks](https://arxiv.org/pdf/1512.05287.pdf)** is a special version of this, which can be applied to sequential data.  It calculates and applies the same dropout mask to every time step."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Sf8gLPCp_RRv",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "class LockedDropout(torch.nn.Module):\n",
        "  \"\"\"\n",
        "  Applies the same dropout mask to every time step.\n",
        "  \"\"\"\n",
        "  \n",
        "  def __init__(self, p):\n",
        "    super().__init__()\n",
        "    self.p = p\n",
        "      \n",
        "  def forward(self, x):\n",
        "    \"\"\"\n",
        "    Applies dropout, assumes the inputs are provided\n",
        "    in sequence-first format.\n",
        "    \"\"\"\n",
        "    if self.training and self.p:\n",
        "      mask = x.new_empty(\n",
        "        1, x.size(1), x.size(2), requires_grad=False).bernoulli_(\n",
        "          1 - self.p)\n",
        "      mask = mask.div_(1 - self.p)\n",
        "      mask = mask.expand_as(x)\n",
        "      x = x * mask\n",
        "      \n",
        "    return x"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "g77jaxAy3qs_",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "class LockedDropoutModel(BaselineModel):\n",
        "  \n",
        "  def __init__(self, *args, **kwargs):\n",
        "    super().__init__(*args, **kwargs)\n",
        "    \n",
        "    self.i_dropout = LockedDropout(0.4)\n",
        "    self.h_dropout = LockedDropout(0.4)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1jUpnqrs8gZk",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "locked_dropout_model = LockedDropoutModel(\n",
        "  hidden_dim=HIDDEN_DIM,\n",
        "  last_hidden_dim=LAST_HIDDEN_DIM,\n",
        "  embedding_dim=EMBEDDING_DIM,\n",
        "  vocab_size=len(vocab))\n",
        "\n",
        "print('Num. params: {}'.format(count_parameters(\n",
        "  locked_dropout_model)))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "u3n8qFgMS909",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "cross_entropy = torch.nn.NLLLoss(\n",
        "  ignore_index=vocab.stoi['<pad>'], reduction='sum').to(device)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "N2or8w-3S6uT",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "sgd = torch.optim.SGD(\n",
        "  locked_dropout_model.parameters(), lr=1, weight_decay=1.2e-6,\n",
        "  momentum=0.9, nesterov=True)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8E7MtMIDTCdP",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "loss, ppl, f1_score, locked_dropout_history = train_and_evaluate(\n",
        "  model=locked_dropout_model, \n",
        "  criterion=cross_entropy,\n",
        "  optimizer=sgd,\n",
        "  max_patience=PATIENCE)\n",
        "\n",
        "histories.update(\n",
        "  {'locked_dropout': locked_dropout_history})"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eVEw2MNnLbjH",
        "colab_type": "text"
      },
      "source": [
        "`avg test ppl: 114.5805522648975`"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mDj2YIK_OFis",
        "colab_type": "text"
      },
      "source": [
        "---\n",
        "\n",
        "## Weight Dropout\n",
        "\n",
        "DropConnect from **[Regularization of Neural Networks using DropConnect](https://cs.nyu.edu/~wanli/dropc/dropc.pdf)** is the generalization of Dropout in which\n",
        "each connection, rather than each output unit, can\n",
        "be dropped with probability 1 − p. The implementation is taken from [salesforce/awd-lstm-lm](https://github.com/salesforce/awd-lstm-lm) and partly from [pytorch-nlp](https://github.com/PetrochukM/PyTorch-NLP/blob/master/torchnlp/nn/weight_drop.py)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "QzJR73zHOF-Q",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "class WeightDropout(torch.nn.Module):\n",
        "  \"\"\"\n",
        "  Implements the weight dropping dropout from\n",
        "  http://yann.lecun.com/exdb/publis/pdf/wan-icml-13.pdf.\n",
        "  \"\"\"\n",
        "\n",
        "  def __init__(self, module, weights, prob):\n",
        "    super().__init__()\n",
        "\n",
        "    self.module = module\n",
        "    self.dropout = torch.nn.Dropout(p=prob)\n",
        "    self.weights = weights\n",
        "    self.module.flatten_parameters = self.flatten_parameters\n",
        "    \n",
        "    for name in self.weights:\n",
        "      weight = getattr(self.module, name)\n",
        "      del self.module._parameters[name]\n",
        "      self.module.register_parameter(\n",
        "        name + '_raw', torch.nn.Parameter(weight))\n",
        "\n",
        "  def flatten_parameters(self, *args, **kwargs):\n",
        "    \"\"\"Do nothing.\"\"\"\n",
        "\n",
        "  def forward(self, *args, **kwargs):\n",
        "    \"\"\"\n",
        "    Drops the weights of the wrapped module and calls its\n",
        "    `forward` on the provided inputs.\n",
        "    \"\"\"\n",
        "    for name in self.weights:\n",
        "      weight = getattr(self.module, name + '_raw')\n",
        "      dropped_weight = self.dropout(weight)\n",
        "      setattr(self.module, name, \n",
        "        torch.nn.Parameter(dropped_weight))\n",
        "\n",
        "    outputs = self.module(*args, **kwargs)\n",
        "\n",
        "    return outputs"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "qMIZeEHQb_Ny",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "class WeightDropoutModel(BaselineModel):\n",
        "  \n",
        "  def __init__(self, *args, **kwargs):\n",
        "    super().__init__(*args, **kwargs)\n",
        "    \n",
        "    self.rnn = torch.nn.ModuleList([\n",
        "        WeightDropout(rnn, weights=['weight_hh_l0'], prob=0.5)\n",
        "        for rnn in self.rnn\n",
        "    ])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "175Sg6R2c7Y4",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "weight_dropout_model = WeightDropoutModel(\n",
        "  hidden_dim=HIDDEN_DIM,\n",
        "  last_hidden_dim=LAST_HIDDEN_DIM,\n",
        "  embedding_dim=EMBEDDING_DIM,\n",
        "  vocab_size=len(vocab))\n",
        "\n",
        "print('Num. params: {}'.format(\n",
        "  count_parameters(weight_dropout_model)))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Bp1Z351yVTID",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "cross_entropy = torch.nn.NLLLoss(\n",
        "  ignore_index=vocab.stoi['<pad>'], reduction='sum').to(device)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "yeZvjsv2Va9J",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "sgd = torch.optim.SGD(\n",
        "  weight_dropout_model.parameters(), lr=1, weight_decay=1.2e-6, \n",
        "  momentum=0.9, nesterov=True)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "hP9GuW2HVdq6",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "loss, ppl, f1_score, weight_dropout_history = train_and_evaluate(\n",
        "  model=weight_dropout_model, \n",
        "  criterion=cross_entropy,\n",
        "  optimizer=sgd,\n",
        "  max_patience=PATIENCE)\n",
        "\n",
        "histories.update(\n",
        "  {'weight_drop': weight_dropout_history})"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vtSZjbA4PZf8",
        "colab_type": "text"
      },
      "source": [
        "`avg test ppl: 113.677391822892`"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "k8So7EeihEDA",
        "colab_type": "text"
      },
      "source": [
        "---\n",
        "\n",
        "## Label smoothing\n",
        "\n",
        "Label smoothing from \n",
        "**[Rethinking the Inception Architecture for Computer Vision](https://arxiv.org/pdf/1512.00567.pdf)** is implemented using the KL div loss. Instead of using a one-hot target distribution, distribution is created that has confidence of the correct word and the rest of the smoothing mass distributed throughout the vocabulary. Label smoothing decreases perplexity, which is not desirable for language modelling, but it is useful for most nlp tasks with large vocabulary, since it improves accuracy at test time.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "DVEInnCLIexa",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def create_criterion(pad_idx, trg_vocab_size, device, \n",
        "                     smoothing=0.1):\n",
        "  \"\"\"\n",
        "  Creates the loss for the seq2seq model.\n",
        "  \"\"\"\n",
        "  confidence = 1.0 - smoothing\n",
        "  smoothed = torch.full((trg_vocab_size, ), \n",
        "    smoothing / (trg_vocab_size - 2)).to(device)\n",
        "  smoothed[pad_idx] = 0\n",
        "\n",
        "  def label_smoothing(outputs, targets):\n",
        "    \"\"\"\n",
        "    Applies label smoothing.\n",
        "    \"\"\"\n",
        "    smoothed_targets = smoothed.repeat(targets.size(0), 1)\n",
        "    smoothed_targets.scatter_(\n",
        "        1, targets.unsqueeze(1), confidence)\n",
        "    smoothed_targets.masked_fill_(\n",
        "        (targets == pad_idx).unsqueeze(1), 0)\n",
        "\n",
        "    return torch.nn.functional.kl_div(\n",
        "        outputs, smoothed_targets, reduction='sum')\n",
        "\n",
        "  return label_smoothing"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "MPOHgxciwxvZ",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "label_smoothing_model = BaselineModel(\n",
        "  hidden_dim=HIDDEN_DIM,\n",
        "  embedding_dim=EMBEDDING_DIM,\n",
        "  last_hidden_dim=LAST_HIDDEN_DIM,\n",
        "  vocab_size=len(vocab))\n",
        "\n",
        "print('Num. params: {}'.format(count_parameters(\n",
        "  label_smoothing_model)))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ZK-LTPY-2aFW",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "label_smoothed_loss = create_criterion(\n",
        "  vocab.stoi['<pad>'], len(vocab), device)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8TgB3Na3sQzi",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "sgd = torch.optim.SGD(\n",
        "  label_smoothing_model.parameters(), \n",
        "  weight_decay=1.2e-6, momentum=0.9, lr=1, nesterov=True)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "YhOrXvz2cSBL",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "loss, ppl, f1_score, label_smoothing_history = train_and_evaluate(\n",
        "  model=label_smoothing_model, \n",
        "  criterion=label_smoothed_loss,\n",
        "  optimizer=sgd,\n",
        "  max_patience=PATIENCE)\n",
        "\n",
        "histories.update(\n",
        "  {'label_smoothing': label_smoothing_history})"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "W9s3AVr73hvk",
        "colab_type": "text"
      },
      "source": [
        "`avg test ppl: 113.3859038690124`"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "92fpEXP6hO9S",
        "colab_type": "text"
      },
      "source": [
        "---\n",
        "\n",
        "## Shared embedding\n",
        "\n",
        "According to **[Using the Output Embedding to Improve Language Models](https://arxiv.org/pdf/1608.05859.pdf)** \"*Tying the\n",
        "input and output embeddings leads to an improvement in the perplexity of various language models. This is true both when using dropout or when\n",
        "not using it.*\" Other advante of the technique is that it can also drastically reduce the number of parameters in the model."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "gYeiRzgq3J8L",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "class SharedEmbeddingModel(BaselineModel):\n",
        "  \"\"\"\n",
        "  Shared embedding language model.\n",
        "  \"\"\"\n",
        "  \n",
        "  def __init__(self, *args, **kwargs):\n",
        "    super().__init__(*args, **kwargs)\n",
        "    \n",
        "    # output layer is the embedding weight layer\n",
        "    self.out_weights = self.embedding.weight"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3m-rDpXuCBIc",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "shared_embedding_model = SharedEmbeddingModel(\n",
        "  hidden_dim=HIDDEN_DIM,\n",
        "  embedding_dim=EMBEDDING_DIM,\n",
        "  last_hidden_dim=LAST_HIDDEN_DIM,\n",
        "  vocab_size=len(vocab))\n",
        "\n",
        "print('Num. params: {}'.format(count_parameters(\n",
        "  shared_embedding_model)))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "12Lng7J-m9hB",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "cross_entropy = torch.nn.NLLLoss(\n",
        "  ignore_index=vocab.stoi['<pad>'], reduction='sum')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "YYOCTdofm_0w",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "sgd = torch.optim.SGD(\n",
        "  shared_embedding_model.parameters(), lr=1, weight_decay=1.2e-6,\n",
        "  momentum=0.9, nesterov=True)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "QxVKX9tonBXQ",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "loss, ppl, f1_score, shared_embedding_history = train_and_evaluate(\n",
        "  model=shared_embedding_model, \n",
        "  criterion=cross_entropy, \n",
        "  optimizer=sgd,\n",
        "  max_patience=PATIENCE)\n",
        "\n",
        "histories.update(\n",
        "  {'shared_embedding': shared_embedding_history})"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0OreuA3SiLUK",
        "colab_type": "text"
      },
      "source": [
        "`avg test ppl: 111.9372638740925`"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BDkukKB2uE4O",
        "colab_type": "text"
      },
      "source": [
        "---\n",
        "\n",
        "## Embedding dropout"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "on95TmQQuHih",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def embedding_dropout(embed, inputs, training, p=0.1):\n",
        "  \"\"\"\n",
        "  Applies dropout to the embedding layer based on\n",
        "  https://arxiv.org/pdf/1512.05287.pdf. The code is\n",
        "  based on salesforce/awd-lstm-lm.\n",
        "  \"\"\"\n",
        "  if not training:\n",
        "    masked_embed_weight = embed.weight\n",
        "  elif p:\n",
        "    mask = embed.weight.new_empty((embed.weight.size(0), 1))\n",
        "    mask.bernoulli_(1 - p).expand_as(embed.weight) / (1 - p)\n",
        "    masked_embed_weight = mask * embed.weight\n",
        "  else:\n",
        "    masked_embed_weight = embed.weight\n",
        "\n",
        "  return torch.nn.functional.embedding(\n",
        "    inputs, masked_embed_weight,\n",
        "    embed.padding_idx, embed.max_norm, embed.norm_type,\n",
        "    embed.scale_grad_by_freq, embed.sparse)\n",
        "\n",
        "\n",
        "class EmbeddedDropout(torch.nn.Embedding):\n",
        "\n",
        "  def __init__(self, *args, p=0.1, **kwargs):\n",
        "    super().__init__(*args, **kwargs)\n",
        "    self.p = p\n",
        "\n",
        "  def forward(self, input):\n",
        "    return embedding_dropout(\n",
        "      self, input, self.training, self.p)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Thrvy4I7ftrH",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "class EmbeddedDropoutModel(BaselineModel):\n",
        "  \"\"\"\n",
        "  Shared embedding language model.\n",
        "  \"\"\"\n",
        "  \n",
        "  def __init__(self, hidden_dim, \n",
        "    last_hidden_dim, embedding_dim, vocab_size):\n",
        "    super().__init__(hidden_dim, \n",
        "      last_hidden_dim, embedding_dim, vocab_size)\n",
        "    \n",
        "    self.embedding = EmbeddedDropout(\n",
        "      num_embeddings=vocab_size,\n",
        "      embedding_dim=embedding_dim)\n",
        "    self.embedding.weight.data.uniform_(-0.1, 0.1)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "5kzzOn_nhxmq",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "embedded_dropout_model = EmbeddedDropoutModel(\n",
        "  hidden_dim=HIDDEN_DIM,\n",
        "  last_hidden_dim=LAST_HIDDEN_DIM,\n",
        "  embedding_dim=EMBEDDING_DIM,\n",
        "  vocab_size=len(vocab))\n",
        "\n",
        "print('Num. params: {}'.format(count_parameters(\n",
        "  embedded_dropout_model)))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "KTn10wMofte-",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "cross_entropy = torch.nn.NLLLoss(\n",
        "  ignore_index=vocab.stoi['<pad>'], reduction='sum')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "6O3nNwTBh0Wg",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "sgd = torch.optim.SGD(\n",
        "  embedded_dropout_model.parameters(), lr=1, weight_decay=1.2e-6,\n",
        "  momentum=0.9, nesterov=True)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "dfX_KO0Lh4N9",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "loss, ppl, f1_score, embedded_dropout_history = train_and_evaluate(\n",
        "  model=embedded_dropout_model, \n",
        "  criterion=cross_entropy, \n",
        "  optimizer=sgd,\n",
        "  max_patience=PATIENCE)\n",
        "\n",
        "histories.update(\n",
        "  {'embedded_dropout': embedded_dropout_history})"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kUw-NXpDKHbB",
        "colab_type": "text"
      },
      "source": [
        "`avg test ppl: 109.95696790290602`"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NsAhwaY_dHal",
        "colab_type": "text"
      },
      "source": [
        "---\n",
        "\n",
        "## Mixture of softmaxes\n",
        "\n",
        "The mixture of softmaxes from **[Breaking the Softmax Bottleneck: A High-Rank RNN Language Model](https://arxiv.org/pdf/1711.03953.pdf)** can be seen as jointly training an ensemble of \n",
        "K\n",
        " different models with minimal overhead. An ensemble is just where you train multiple models and average their predictions, a practice which usually outperforms any single model. Jointly training is when all of these ensembled models are trained against a single loss, allowing the models to work together to avoid deficiencies."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "yVeAIjT6NPjy",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "NUM_SOFTMAXES = 15"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "sOB_gfnehxbb",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def create_mos_cls(base):\n",
        "  \"\"\"\n",
        "  Function to conveninelty create a subclass with\n",
        "  MoS functionality.\n",
        "  \"\"\"\n",
        "  class MixtureOfSoftmaxesModel(base):\n",
        "    \"\"\"\n",
        "    Mixture of softmaxes language model.\n",
        "    \"\"\"\n",
        "    def __init__(self, hidden_dim, \n",
        "               last_hidden_dim, embedding_dim, vocab_size, \n",
        "                 num_softmax):\n",
        "      super().__init__(hidden_dim, \n",
        "               last_hidden_dim, embedding_dim, vocab_size)\n",
        "\n",
        "      self.num_softmax = num_softmax\n",
        "      self.embedding_dim = embedding_dim\n",
        "\n",
        "      self.prior = torch.nn.Linear(\n",
        "          in_features=last_hidden_dim, \n",
        "          out_features=num_softmax, \n",
        "          bias=False)\n",
        "\n",
        "      self.latent = torch.nn.Linear(\n",
        "          in_features=last_hidden_dim, \n",
        "          out_features=num_softmax * embedding_dim)\n",
        "\n",
        "    def forward(self, inputs, prev_hidden=None):\n",
        "      embeds = self.i_dropout(self.embedding(inputs))\n",
        "      outputs = embeds\n",
        "      hidden_states = []\n",
        "      for idx, layer in enumerate(self.rnn):\n",
        "        outputs, hidden_state = layer(\n",
        "          outputs, (prev_hidden[idx]\n",
        "            if prev_hidden is not None else None))\n",
        "        hidden_states.append(hidden_state.detach())\n",
        "\n",
        "        outputs = self.h_dropout(outputs)\n",
        "\n",
        "      # comuting different softmaxes\n",
        "      latents = self.h_dropout(torch.tanh(self.latent(outputs)))\n",
        "      latents = latents.view(-1, self.embedding_dim)\n",
        "\n",
        "      contexts = torch.nn.functional.linear(\n",
        "        latents, self.output_weights, self.output_bias)\n",
        "\n",
        "      priors = torch.nn.functional.softmax(\n",
        "        self.prior(outputs).view(-1, self.num_softmax), dim=-1)\n",
        "      # adjusting dims to correct size for multiplication\n",
        "      priors = priors.unsqueeze(2)\n",
        "\n",
        "      softmaxes = torch.nn.functional.softmax(\n",
        "        contexts, dim=-1)\n",
        "      softmaxes = softmaxes.view(\n",
        "        -1, self.num_softmax, self.vocab_size)\n",
        "\n",
        "      log_probs = (softmaxes * priors).sum(1).view(\n",
        "        inputs.size(0), -1, self.vocab_size).add_(1e-8).log()\n",
        "\n",
        "      return log_probs, hidden_states\n",
        "    \n",
        "  return MixtureOfSoftmaxesModel"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8Fo9xwzth0B7",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "mixture_of_softmaxes_model = create_mos_cls(base=BaselineModel)(\n",
        "  hidden_dim=HIDDEN_DIM,\n",
        "  last_hidden_dim=LAST_HIDDEN_DIM,\n",
        "  embedding_dim=EMBEDDING_DIM,\n",
        "  vocab_size=len(vocab),\n",
        "  num_softmax=NUM_SOFTMAXES)\n",
        "\n",
        "print('Num. params: {}'.format(count_parameters(\n",
        "  mixture_of_softmaxes_model)))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "oU94oat8xAFt",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "cross_entropy = torch.nn.NLLLoss(\n",
        "  ignore_index=vocab.stoi['<pad>'], reduction='sum')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pYnTWpOlsdK4",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "sgd = torch.optim.SGD(\n",
        "  mixture_of_softmaxes_model.parameters(), lr=1, weight_decay=1.2e-6,\n",
        "  momentum=0.9, nesterov=True)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "xAbqiqs2xPSN",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "loss, ppl, f1_score, mixture_of_softmaxes_history = train_and_evaluate(\n",
        "  model=mixture_of_softmaxes_model, \n",
        "  criterion=cross_entropy, \n",
        "  optimizer=sgd,\n",
        "  max_patience=PATIENCE)\n",
        "\n",
        "histories.update(\n",
        "  {'mixture_of_softmaxes': mixture_of_softmaxes_history})"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "i83DmqyhhSqB",
        "colab_type": "text"
      },
      "source": [
        "---\n",
        "\n",
        "## Mixture of contexts\n",
        "\n",
        "Another possible approach is to directly mix the context vectors (or logits) before taking the Softmax, rather than mixing the probabilities afterwards as in MoS. Mixtrure of contexts however is actually identical to vanilla softmax and is similarly a low-rank solution."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-1RuGs5T3bkQ",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "class MixtureOfContextsModel(BaselineModel):\n",
        "  \"\"\"\n",
        "  Mixture of contexts language model.\n",
        "  \"\"\"\n",
        "  \n",
        "  def __init__(self, hidden_dim, embedding_dim, vocab_size, \n",
        "                 num_softmax):\n",
        "    super().__init__(hidden_dim, embedding_dim, vocab_size)\n",
        "\n",
        "    self.num_softmax = num_softmax\n",
        "    self.embedding_dim = embedding_dim\n",
        "\n",
        "    self.prior = torch.nn.Linear(\n",
        "        in_features=hidden_dim, \n",
        "        out_features=num_softmax, \n",
        "        bias=False)\n",
        "\n",
        "    self.latent = torch.nn.Linear(\n",
        "        in_features=hidden_dim, \n",
        "          out_features=num_softmax * embedding_dim)\n",
        "    \n",
        "  def forward(self, inputs, hidden=None):\n",
        "    emb = self.inp_dropout(self.embedding(inputs))\n",
        "    outputs, hidden = self.gru(emb, hidden)\n",
        "    \n",
        "    # creating different softmaxes from contexts\n",
        "    latents = self.lat_dropout(torch.tanh(self.latent(outputs)))\n",
        "    latents = latents.view(-1, self.embedding_dim)\n",
        "    \n",
        "    contexts = torch.nn.functional.linear(\n",
        "      latents, self.out_weights, self.out_bias)\n",
        "    \n",
        "    contexts = contexts.view(\n",
        "      -1, self.num_softmax, self.vocab_size)\n",
        "   \n",
        "    priors = torch.nn.functional.softmax(\n",
        "      self.prior(outputs).view(-1, self.num_softmax), dim=-1)\n",
        "    # adjusting dims to correct size for multiplication\n",
        "    priors = priors.unsqueeze(2)\n",
        "    \n",
        "    # calculating mixture of contexts\n",
        "    logits = (contexts * priors).sum(1).view(\n",
        "      emb.size(0), -1, self.vocab_size)\n",
        "    \n",
        "    log_probs = torch.nn.functional.log_softmax(logits, dim=-1)\n",
        "    \n",
        "    return log_probs, hidden"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "2vC6OZ3I3U-7",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "moc_model = MixtureOfContextsLanguageModel(\n",
        "  hidden_dim=HIDDEN_DIM,\n",
        "  embedding_dim=EMBEDDING_DIM,\n",
        "  vocab_size=len(vocab),\n",
        "  num_softmax=NUM_SOFTMAXES)\n",
        "\n",
        "print('Num. params: {}'.format(count_parameters(\n",
        "  moc_model)))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "DIWHQ6quAcVX",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "cross_entropy = torch.nn.NLLLoss(\n",
        "  ignore_index=vocab.stoi['<pad>'], reduction='sum').to(device)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "MX77XjpUsY52",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "sgd = torch.optim.SGD(\n",
        "  moc_model.parameters(), lr=20, weight_decay=1.2e-6,\n",
        "  nesterov=True, momentum=0.9)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "5bT4dwDOAeWR",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "loss, ppl, f1_score, moc_history = train_and_evaluate(\n",
        "  model=moc_model,\n",
        "  criterion=cross_entropy, \n",
        "  optimizer=sgd,\n",
        "  max_patience=PATIENCE)\n",
        "\n",
        "histories.update({'moc': moc_history})"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EvL8odyJ72u7",
        "colab_type": "text"
      },
      "source": [
        "## Comparison"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "29s0gcjh3Av-",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "plot_history(histories, ignore_keys=['train'])"
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}