{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Colbert AI v2.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true,
"machine_shape": "hm",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/iam-abbas/b93961bc9468e375f1b75a1a6e47610c/colbert-ai-v2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VtqZti9Dgy-t",
"colab_type": "text"
},
"source": [
"# Colbert-AI v2.0\n",
"\n",
"##### ***Using Pytorch, Transformers and Open-AI's GPT-2***"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4AwTP3LnhKvI",
"colab_type": "text"
},
"source": [
"*Installing Transformers*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "rVa_ehwbL0Ie",
"colab_type": "code",
"colab": {}
},
"source": [
"!pip install -q transformers"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "vyjKiHPShbnu",
"colab_type": "text"
},
"source": [
"*importing all the required modules*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "TXfcA0yoL3zX",
"colab_type": "code",
"colab": {}
},
"source": [
"import torch\n",
"import numpy as np\n",
"from transformers import GPT2Tokenizer, GPT2LMHeadModel\n",
"from transformers import AdamW, get_linear_schedule_with_warmup\n",
"from torch.utils.data import Dataset\n",
"from torch.utils.data import Dataset, DataLoader\n",
"import os"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "s4O01xrJhiV4",
"colab_type": "text"
},
"source": [
"#### **Choosing a Model**\n",
"##### Transformes has 4 models\n",
"![Image by Jay Alammar from post The Illustrated GPT-2](https://i.imgur.com/yrIxPVX.png)\n",
"\n",
"**Model Names**:\n",
"- `gpt2-small` (124M Model)\n",
"- `gpt2-medium` (345M Model)\n",
"- `gpt2-large` (774M Model)\n",
"- `gpt2-xl` (1558M Model)\n",
"\n",
"*In our case we focused on making a lighter model so we used medium*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "8xxWwXYIL-Md",
"colab_type": "code",
"colab": {}
},
"source": [
"device = 'cpu'\n",
"if torch.cuda.is_available():\n",
" device = 'cuda'\n",
" \n",
"tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')\n",
"model = GPT2LMHeadModel.from_pretrained('gpt2-medium')\n",
"model = model.to(device)"
],
"execution_count": 0,
"outputs": []
},
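{
"cell_type": "markdown",
"metadata": {
"id": "sanity-check-md",
"colab_type": "text"
},
"source": [
"*A quick sanity check (added sketch, not part of the original run): print the device in use and the parameter count of the loaded model.*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "sanity-check-code",
"colab_type": "code",
"colab": {}
},
"source": [
"# Added sketch: confirm the device and the size of the loaded model\n",
"n_params = sum(p.numel() for p in model.parameters())\n",
"print(device, f\"{n_params / 1e6:.0f}M parameters\")"
],
"execution_count": 0,
"outputs": []
},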
{
"cell_type": "markdown",
"metadata": {
"id": "MvTgsO8tioua",
"colab_type": "text"
},
"source": [
"`choose_from_top`:\n",
"- Function to first select topN tokens from the probability list and then based on the selected N word distribution\n",
"\n",
"`generate_text`:\n",
"- At each prediction step, GPT2 model needs to know all of the previous sequence elements to predict the next one. Below is a function that will tokenize the starting input text, and then in a loop, one new token is predicted at each step and is added to the sequence, which will be fed into the model in the next step. In the end, the token list is decoded back into a text."
]
},
{
"cell_type": "code",
"metadata": {
"id": "zmSba37LMA_D",
"colab_type": "code",
"colab": {}
},
"source": [
"def choose_from_top(probs, n=10):\n",
" ind = np.argpartition(probs, -n)[-n:]\n",
" top_prob = probs[ind]\n",
" top_prob = top_prob / np.sum(top_prob)\n",
" choice = np.random.choice(n, 1, p = top_prob)\n",
" token_id = ind[choice][0]\n",
" return int(token_id)\n",
"\n",
"def generate_text(input_str, text_len = 100):\n",
" cur_ids = torch.tensor(tokenizer.encode(input_str)).unsqueeze(0).long().to(device)\n",
" model.eval()\n",
" with torch.no_grad():\n",
" for i in range(text_len):\n",
" outputs = model(cur_ids, labels=cur_ids)\n",
" loss, logits = outputs[:2]\n",
" softmax_logits = torch.softmax(logits[0,-1], dim=0) \n",
" next_token_id = choose_from_top(softmax_logits.to('cpu').numpy(), n=10)\n",
" cur_ids = torch.cat([cur_ids, torch.ones((1,1)).long().to(device) * next_token_id], dim = 1) # Add the last word\n",
" output_list = list(cur_ids.squeeze().to('cpu').numpy())\n",
" output_text = tokenizer.decode(output_list)\n",
" print(output_text)"
],
"execution_count": 0,
"outputs": []
},
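{
"cell_type": "markdown",
"metadata": {
"id": "top-n-demo-md",
"colab_type": "text"
},
"source": [
"*A small demo of `choose_from_top` (added sketch, toy numbers not from the notebook): with `n=3`, every sampled index should come from the three highest-probability positions.*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "top-n-demo-code",
"colab_type": "code",
"colab": {}
},
"source": [
"# Added sketch: toy probability vector (sums to 1.0), not real model output\n",
"toy_probs = np.array([0.5, 0.2, 0.1, 0.08, 0.05, 0.04, 0.02, 0.007, 0.002, 0.001])\n",
"samples = [choose_from_top(toy_probs, n=3) for _ in range(1000)]\n",
"print(set(samples))  # expected: a subset of {0, 1, 2}"
],
"execution_count": 0,
"outputs": []
},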
{
"cell_type": "markdown",
"metadata": {
"id": "9ygC5umoi_jl",
"colab_type": "text"
},
"source": [
"### Generating The Text"
]
},
{
"cell_type": "code",
"metadata": {
"id": "8UzA5kg9MJaP",
"colab_type": "code",
"colab": {}
},
"source": [
"generate_text(\"Donald Trump visits India and \")"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "nvw8SI31jDMU",
"colab_type": "text"
},
"source": [
"## **Fine-tuning GPT-2 on Captions Dataset from YouTube**\n"
]
},
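{
"cell_type": "markdown",
"metadata": {
"id": "drive-mount-md",
"colab_type": "text"
},
"source": [
"*Added sketch (not in the original notebook): the dataset below is read from Google Drive, so the Drive filesystem has to be mounted first. This assumes `captions.txt` sits at the top level of My Drive.*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "drive-mount-code",
"colab_type": "code",
"colab": {}
},
"source": [
"# Added sketch: mount Google Drive so captions.txt is readable below\n",
"from google.colab import drive\n",
"drive.mount('/content/drive')"
],
"execution_count": 0,
"outputs": []
},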
{
"cell_type": "code",
"metadata": {
"id": "c67acXwnOCGw",
"colab_type": "code",
"colab": {}
},
"source": [
"class Text_Corpus(Dataset):\n",
" def __init__(self, dataset_path = '/content/drive/My Drive'):\n",
" super().__init__()\n",
" corpus_path = os.path.join(dataset_path, 'captions.txt')\n",
" self.token_list = []\n",
" self.end_of_text_token = \"<|endoftext|>\"\n",
"\n",
" with open(corpus_path) as f:\n",
" data = f.read()\n",
" self.token_list = data.split(\"<|endoftext|>\")\n",
"\n",
" for i in range(len(self.token_list)):\n",
" self.token_list[i] = self.end_of_text_token+self.token_list[i]+self.end_of_text_token\n",
"\n",
" def __len__(self):\n",
" return len(self.token_list)\n",
"\n",
" def __getitem__(self, item):\n",
" return self.token_list[item]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "hzvrSS3oji0f",
"colab_type": "text"
},
"source": [
"*Loading the dataset from `Text_Corpus`*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "zZenXS-uQ2f9",
"colab_type": "code",
"colab": {}
},
"source": [
"dataset = Text_Corpus()\n",
"print(\"Number of Tokens Found:\", len(dataset))\n",
"data_loader = DataLoader(dataset, batch_size=1, shuffle=True)"
],
"execution_count": 0,
"outputs": []
},
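{
"cell_type": "markdown",
"metadata": {
"id": "dataset-peek-md",
"colab_type": "text"
},
"source": [
"*Added sketch: peek at the first item the `Dataset` returns to confirm the corpus was split and re-wrapped as expected.*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "dataset-peek-code",
"colab_type": "code",
"colab": {}
},
"source": [
"# Added sketch: each item is a caption text wrapped in <|endoftext|> markers\n",
"print(dataset[0][:200])"
],
"execution_count": 0,
"outputs": []
},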
{
"cell_type": "markdown",
"metadata": {
"id": "vEsGh26OjpJr",
"colab_type": "text"
},
"source": [
"*Assigning Parameterts (EPOCH, Batch Size, etc)*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "sORYPpeyQ8K3",
"colab_type": "code",
"colab": {}
},
"source": [
"BATCH_SIZE = 1\n",
"EPOCHS = 30\n",
"LEARNING_RATE = 1e-5\n",
"WARMUP_STEPS = 10000\n",
"MAX_SEQ_LEN = 550"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "xwrLLywrkLK3",
"colab_type": "text"
},
"source": [
"### *Training the Model for 30 Epochs*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "PN_gIOiNQ8qb",
"colab_type": "code",
"colab": {}
},
"source": [
"model = model.to(device)\n",
"model.train()\n",
"optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)\n",
"scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps = -1)\n",
"text_count = 0\n",
"sum_loss = 0.0\n",
"batch_count = 0\n",
"\n",
"tmp_text_tens = None\n",
"\n",
"for epoch in range(EPOCHS):\n",
"\n",
" print(f\"EPOCH {epoch} started \" + '=' * 30)\n",
" for idx,text in enumerate(data_loader):\n",
" \n",
" text_tens = torch.tensor(tokenizer.encode(text[0])).unsqueeze(0).to(device)\n",
"\n",
" if text_tens.size()[1] > MAX_SEQ_LEN:\n",
" continue\n",
" if not torch.is_tensor(tmp_text_tens):\n",
" tmp_text_tens = text_tens\n",
" continue\n",
" else:\n",
" if tmp_text_tens.size()[1] + text_tens.size()[1] > MAX_SEQ_LEN:\n",
" work_text_tens = tmp_text_tens\n",
" tmp_text_tens = text_tens\n",
" else:\n",
" tmp_text_tens = torch.cat([tmp_text_tens, text_tens[:,1:]], dim=1) \n",
" continue\n",
" \n",
" outputs = model(work_text_tens, labels=work_text_tens)\n",
" loss, logits = outputs[:2] \n",
" loss.backward()\n",
" sum_loss = sum_loss + loss.detach().data \n",
" text_count = text_count + 1\n",
"\n",
" if text_count == BATCH_SIZE:\n",
" text_count = 0 \n",
" batch_count += 1\n",
" optimizer.step()\n",
" scheduler.step() \n",
" optimizer.zero_grad()\n",
" model.zero_grad()\n",
" \n",
" if batch_count == 1000:\n",
" print(f\"sum loss {sum_loss}\")\n",
" batch_count = 0\n",
" sum_loss = 0.0"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Y3sBnNo4kTpR",
"colab_type": "text"
},
"source": [
"### **Generating 20 Samples**"
]
},
{
"cell_type": "code",
"metadata": {
"id": "_c98C7nacGLa",
"colab_type": "code",
"colab": {}
},
"source": [
"model.eval()\n",
"with torch.no_grad():\n",
" \n",
" for text_idx in range(20):\n",
"\n",
" cur_ids = torch.tensor(tokenizer.encode(\"<|startoftext|>START:\")).unsqueeze(0).to(device)\n",
" \n",
" for i in range(250):\n",
" outputs = model(cur_ids, labels=cur_ids)\n",
" loss, logits = outputs[:2]\n",
" softmax_logits = torch.softmax(logits[0,-1], dim=0)\n",
" if i < 2:\n",
" n = 15\n",
" else:\n",
" n = 3\n",
" next_token_id = choose_from_top(softmax_logits.to('cpu').numpy(), n=n)\n",
" cur_ids = torch.cat([cur_ids, torch.ones((1,1)).long().to(device) * next_token_id], dim = 1)\n",
" if next_token_id in tokenizer.encode('<|endoftext|>'):\n",
" break\n",
" \n",
" output_list = list(cur_ids.squeeze().to('cpu').numpy())\n",
" output_text = tokenizer.decode(output_list)\n",
" print(f\"SAMPLE {text_idx}: {output_text.capitalize()} \\n\")"
],
"execution_count": 0,
"outputs": []
},
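{
"cell_type": "markdown",
"metadata": {
"id": "save-model-md",
"colab_type": "text"
},
"source": [
"*Optional (added sketch): persist the fine-tuned weights so they can be reloaded later with `GPT2LMHeadModel.from_pretrained`. The Drive path below is illustrative, not from the original notebook.*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "save-model-code",
"colab_type": "code",
"colab": {}
},
"source": [
"# Added sketch: save the fine-tuned model and tokenizer; the path is illustrative\n",
"save_dir = '/content/drive/My Drive/colbert-ai-gpt2-medium'\n",
"os.makedirs(save_dir, exist_ok=True)\n",
"model.save_pretrained(save_dir)\n",
"tokenizer.save_pretrained(save_dir)"
],
"execution_count": 0,
"outputs": []
},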
{
"cell_type": "markdown",
"metadata": {
"id": "JZUi_hmZkhg9",
"colab_type": "text"
},
"source": [
"## **Contributors**\n",
"- [Abbas Mohammed](https://github.com/iam-abbas) *(iam-abbas on github)*\n",
"- [Shubham Rao](https://github.com/cshubhamrao) *(cshubhamrao on github)*\n",
"\n",
"**Mentions:-**\n",
"- [Martins Frolovs](https://towardsdatascience.com/teaching-gpt-2-a-sense-of-humor-fine-tuning-large-transformer-models-on-a-single-gpu-in-pytorch-59e8cec40912)"
]
}
]
}