{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Colbert AI v2.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true,
"machine_shape": "hm",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/iam-abbas/b93961bc9468e375f1b75a1a6e47610c/colbert-ai-v2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VtqZti9Dgy-t",
"colab_type": "text"
},
"source": [
"# Colbert-AI v2.0\n",
"\n",
"##### ***Using Pytorch, Transformers and Open-AI's GPT-2***"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4AwTP3LnhKvI",
"colab_type": "text"
},
"source": [
"*Installing Transformers*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "rVa_ehwbL0Ie",
"colab_type": "code",
"colab": {}
},
"source": [
"!pip install -q transformers"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "vyjKiHPShbnu",
"colab_type": "text"
},
"source": [
"*importing all the required modules*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "TXfcA0yoL3zX",
"colab_type": "code",
"colab": {}
},
"source": [
"import torch\n",
"import numpy as np\n",
"from transformers import GPT2Tokenizer, GPT2LMHeadModel\n",
"from transformers import AdamW, get_linear_schedule_with_warmup\n",
"from torch.utils.data import Dataset\n",
"from torch.utils.data import Dataset, DataLoader\n",
"import os"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "s4O01xrJhiV4",
"colab_type": "text"
},
"source": [
"#### **Choosing a Model**\n",
"##### Transformes has 4 models\n",
"![Image by Jay Alammar from post The Illustrated GPT-2](https://i.imgur.com/yrIxPVX.png)\n",
"\n",
"**Model Names**:\n",
"- `gpt2-small` (124M Model)\n",
"- `gpt2-medium` (345M Model)\n",
"- `gpt2-large` (774M Model)\n",
"- `gpt2-xl` (1558M Model)\n",
"\n",
"*In our case we focused on making a lighter model so we used medium*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "8xxWwXYIL-Md",
"colab_type": "code",
"colab": {}
},
"source": [
"device = 'cpu'\n",
"if torch.cuda.is_available():\n",
" device = 'cuda'\n",
" \n",
"tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')\n",
"model = GPT2LMHeadModel.from_pretrained('gpt2-medium')\n",
"model = model.to(device)"
],
"execution_count": 0,
"outputs": []
},
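{
"cell_type": "markdown",
"metadata": {
"id": "sanity-check-md",
"colab_type": "text"
},
"source": [
"*A quick sanity check (added sketch, not part of the original run): print the device in use and the parameter count of the loaded model.*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "sanity-check-code",
"colab_type": "code",
"colab": {}
},
"source": [
"# Added sketch: confirm the device and the size of the loaded model\n",
"n_params = sum(p.numel() for p in model.parameters())\n",
"print(device, f\"{n_params / 1e6:.0f}M parameters\")"
],
"execution_count": 0,
"outputs": []
},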
{
"cell_type": "markdown",
"metadata": {
"id": "MvTgsO8tioua",
"colab_type": "text"
},
"source": [
"`choose_from_top`:\n",
"- Function to first select topN tokens from the probability list and then based on the selected N word distribution\n",
"\n",
"`generate_text`:\n",
"- At each prediction step, GPT2 model needs to know all of the previous sequence elements to predict the next one. Below is a function that will tokenize the starting input text, and then in a loop, one new token is predicted at each step and is added to the sequence, which will be fed into the model in the next step. In the end, the token list is decoded back into a text."
]
},
{
"cell_type": "code",
"metadata": {
"id": "zmSba37LMA_D",
"colab_type": "code",
"colab": {}
},
"source": [
"def choose_from_top(probs, n=10):\n",
" ind = np.argpartition(probs, -n)[-n:]\n",
" top_prob = probs[ind]\n",
" top_prob = top_prob / np.sum(top_prob)\n",
" choice = np.random.choice(n, 1, p = top_prob)\n",
" token_id = ind[choice][0]\n",
" return int(token_id)\n",
"\n",
"def generate_text(input_str, text_len = 100):\n",
" cur_ids = torch.tensor(tokenizer.encode(input_str)).unsqueeze(0).long().to(device)\n",
" model.eval()\n",
" with torch.no_grad():\n",
" for i in range(text_len):\n",
" outputs = model(cur_ids, labels=cur_ids)\n",
" loss, logits = outputs[:2]\n",
" softmax_logits = torch.softmax(logits[0,-1], dim=0) \n",
" next_token_id = choose_from_top(softmax_logits.to('cpu').numpy(), n=10)\n",
" cur_ids = torch.cat([cur_ids, torch.ones((1,1)).long().to(device) * next_token_id], dim = 1) # Add the last word\n",
" output_list = list(cur_ids.squeeze().to('cpu').numpy())\n",
" output_text = tokenizer.decode(output_list)\n",
" print(output_text)"
],
"execution_count": 0,
"outputs": []
},
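{
"cell_type": "markdown",
"metadata": {
"id": "top-n-demo-md",
"colab_type": "text"
},
"source": [
"*A small demo of `choose_from_top` (added sketch, toy numbers not from the notebook): with `n=3`, every sampled index should come from the three highest-probability positions.*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "top-n-demo-code",
"colab_type": "code",
"colab": {}
},
"source": [
"# Added sketch: toy probability vector (sums to 1.0), not real model output\n",
"toy_probs = np.array([0.5, 0.2, 0.1, 0.08, 0.05, 0.04, 0.02, 0.007, 0.002, 0.001])\n",
"samples = [choose_from_top(toy_probs, n=3) for _ in range(1000)]\n",
"print(set(samples))  # expected: a subset of {0, 1, 2}"
],
"execution_count": 0,
"outputs": []
},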
{
"cell_type": "markdown",
"metadata": {
"id": "9ygC5umoi_jl",
"colab_type": "text"
},
"source": [
"### Generating The Text"
]
},
{
"cell_type": "code",
"metadata": {
"id": "8UzA5kg9MJaP",
"colab_type": "code",
"colab": {}
},
"source": [
"generate_text(\"Donald Trump visits India and \")"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "nvw8SI31jDMU",
"colab_type": "text"
},
"source": [
"## **Fine-tuning GPT-2 on Captions Dataset from YouTube**\n"
]
},
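{
"cell_type": "markdown",
"metadata": {
"id": "drive-mount-md",
"colab_type": "text"
},
"source": [
"*Added sketch (not in the original notebook): the dataset below is read from Google Drive, so the Drive filesystem has to be mounted first. This assumes `captions.txt` sits at the top level of My Drive.*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "drive-mount-code",
"colab_type": "code",
"colab": {}
},
"source": [
"# Added sketch: mount Google Drive so captions.txt is readable below\n",
"from google.colab import drive\n",
"drive.mount('/content/drive')"
],
"execution_count": 0,
"outputs": []
},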
{
"cell_type": "code",
"metadata": {
"id": "c67acXwnOCGw",
"colab_type": "code",
"colab": {}
},
"source": [
"class Text_Corpus(Dataset):\n",
" def __init__(self, dataset_path = '/content/drive/My Drive'):\n",
" super().__init__()\n",
" corpus_path = os.path.join(dataset_path, 'captions.txt')\n",
" self.token_list = []\n",
" self.end_of_text_token = \"<|endoftext|>\"\n",
"\n",
" with open(corpus_path) as f:\n",
" data = f.read()\n",
" self.token_list = data.split(\"<|endoftext|>\")\n",
"\n",
" for i in range(len(self.token_list)):\n",
" self.token_list[i] = self.end_of_text_token+self.token_list[i]+self.end_of_text_token\n",
"\n",
" def __len__(self):\n",
" return len(self.token_list)\n",
"\n",
" def __getitem__(self, item):\n",
" return self.token_list[item]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "hzvrSS3oji0f",
"colab_type": "text"
},
"source": [
"*Loading the dataset from `Text_Corpus`*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "zZenXS-uQ2f9",
"colab_type": "code",
"colab": {}
},
"source": [
"dataset = Text_Corpus()\n",
"print(\"Number of Tokens Found:\", len(dataset))\n",
"data_loader = DataLoader(dataset, batch_size=1, shuffle=True)"
],
"execution_count": 0,
"outputs": []
},
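{
"cell_type": "markdown",
"metadata": {
"id": "dataset-peek-md",
"colab_type": "text"
},
"source": [
"*Added sketch: peek at the first item the `Dataset` returns to confirm the corpus was split and re-wrapped as expected.*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "dataset-peek-code",
"colab_type": "code",
"colab": {}
},
"source": [
"# Added sketch: each item is a caption text wrapped in <|endoftext|> markers\n",
"print(dataset[0][:200])"
],
"execution_count": 0,
"outputs": []
},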
{
"cell_type": "markdown",
"metadata": {
"id": "vEsGh26OjpJr",
"colab_type": "text"
},
"source": [
"*Assigning Parameterts (EPOCH, Batch Size, etc)*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "sORYPpeyQ8K3",
"colab_type": "code",
"colab": {}
},
"source": [
"BATCH_SIZE = 1\n",
"EPOCHS = 30\n",
"LEARNING_RATE = 1e-5\n",
"WARMUP_STEPS = 10000\n",
"MAX_SEQ_LEN = 550"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "xwrLLywrkLK3",
"colab_type": "text"
},
"source": [
"### *Training the Model for 30 Epochs*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "PN_gIOiNQ8qb",
"colab_type": "code",
"colab": {}
},
"source": [
"model = model.to(device)\n",
"model.train()\n",
"optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)\n",
"scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps = -1)\n",
"text_count = 0\n",
"sum_loss = 0.0\n",
"batch_count = 0\n",
"\n",
"tmp_text_tens = None\n",
"\n",
"for epoch in range(EPOCHS):\n",
"\n",
" print(f\"EPOCH {epoch} started \" + '=' * 30)\n",
" for idx,text in enumerate(data_loader):\n",
" \n",
" text_tens = torch.tensor(tokenizer.encode(text[0])).unsqueeze(0).to(device)\n",
"\n",
" if text_tens.size()[1] > MAX_SEQ_LEN:\n",
" continue\n",
" if not torch.is_tensor(tmp_text_tens):\n",
" tmp_text_tens = text_tens\n",
" continue\n",
" else:\n",
" if tmp_text_tens.size()[1] + text_tens.size()[1] > MAX_SEQ_LEN:\n",
" work_text_tens = tmp_text_tens\n",
" tmp_text_tens = text_tens\n",
" else:\n",
" tmp_text_tens = torch.cat([tmp_text_tens, text_tens[:,1:]], dim=1) \n",
" continue\n",
" \n",
" outputs = model(work_text_tens, labels=work_text_tens)\n",
" loss, logits = outputs[:2] \n",
" loss.backward()\n",
" sum_loss = sum_loss + loss.detach().data \n",
" text_count = text_count + 1\n",
"\n",
" if text_count == BATCH_SIZE:\n",
" text_count = 0 \n",
" batch_count += 1\n",
" optimizer.step()\n",
" scheduler.step() \n",
" optimizer.zero_grad()\n",
" model.zero_grad()\n",
" \n",
" if batch_count == 1000:\n",
" print(f\"sum loss {sum_loss}\")\n",
" batch_count = 0\n",
" sum_loss = 0.0"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Y3sBnNo4kTpR",
"colab_type": "text"
},
"source": [
"### **Generating 20 Samples**"
]
},
{
"cell_type": "code",
"metadata": {
"id": "_c98C7nacGLa",
"colab_type": "code",
"colab": {}
},
"source": [
"model.eval()\n",
"with torch.no_grad():\n",
" \n",
" for text_idx in range(20):\n",
"\n",
" cur_ids = torch.tensor(tokenizer.encode(\"<|startoftext|>START:\")).unsqueeze(0).to(device)\n",
" \n",
" for i in range(250):\n",
" outputs = model(cur_ids, labels=cur_ids)\n",
" loss, logits = outputs[:2]\n",
" softmax_logits = torch.softmax(logits[0,-1], dim=0)\n",
" if i < 2:\n",
" n = 15\n",
" else:\n",
" n = 3\n",
" next_token_id = choose_from_top(softmax_logits.to('cpu').numpy(), n=n)\n",
" cur_ids = torch.cat([cur_ids, torch.ones((1,1)).long().to(device) * next_token_id], dim = 1)\n",
" if next_token_id in tokenizer.encode('<|endoftext|>'):\n",
" break\n",
" \n",
" output_list = list(cur_ids.squeeze().to('cpu').numpy())\n",
" output_text = tokenizer.decode(output_list)\n",
" print(f\"SAMPLE {text_idx}: {output_text.capitalize()} \\n\")"
],
"execution_count": 0,
"outputs": []
},
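{
"cell_type": "markdown",
"metadata": {
"id": "save-model-md",
"colab_type": "text"
},
"source": [
"*Optional (added sketch): persist the fine-tuned weights so they can be reloaded later with `GPT2LMHeadModel.from_pretrained`. The Drive path below is illustrative, not from the original notebook.*"
]
},
{
"cell_type": "code",
"metadata": {
"id": "save-model-code",
"colab_type": "code",
"colab": {}
},
"source": [
"# Added sketch: save the fine-tuned model and tokenizer; the path is illustrative\n",
"save_dir = '/content/drive/My Drive/colbert-ai-gpt2-medium'\n",
"os.makedirs(save_dir, exist_ok=True)\n",
"model.save_pretrained(save_dir)\n",
"tokenizer.save_pretrained(save_dir)"
],
"execution_count": 0,
"outputs": []
},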
{
"cell_type": "markdown",
"metadata": {
"id": "JZUi_hmZkhg9",
"colab_type": "text"
},
"source": [
"## **Contributors**\n",
"- [Abbas Mohammed](https://github.com/iam-abbas) *(iam-abbas on github)*\n",
"- [Shubham Rao](https://github.com/cshubhamrao) *(cshubhamrao on github)*\n",
"\n",
"**Mentions:-**\n",
"- [Martins Frolovs](https://towardsdatascience.com/teaching-gpt-2-a-sense-of-humor-fine-tuning-large-transformer-models-on-a-single-gpu-in-pytorch-59e8cec40912)"
]
}
]
}