@mf1024
Last active July 8, 2024 19:29
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Generating text with a pre-trained GPT2 in PyTorch\n",
"\n",
"This notebook was created as a part of a blog post - [Fine-tuning large Transformer models on a single GPU in PyTorch - Teaching GPT-2 a sense of humor](https://mf1024.github.io/2019/11/12/Fun-With-GPT-2/).\n",
"\n",
"In this notebook, I will use a pre-trained medium-sized GPT2 model from the [huggingface](https://github.com/huggingface/transformers) to generate some text.\n",
"\n",
"The easiest way to use huggingface transformer libraries is to install their pip package *transformers*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!pip install transformers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import logging\n",
"logging.getLogger().setLevel(logging.CRITICAL)\n",
"\n",
"import torch\n",
"import numpy as np\n",
"\n",
"from transformers import GPT2Tokenizer, GPT2LMHeadModel\n",
"\n",
"device = 'cpu'\n",
"if torch.cuda.is_available():\n",
"    device = 'cuda'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Models and classes\n",
"\n",
"I use the [GPT2LMHeadModel](https://github.com/huggingface/transformers/blob/master/transformers/modeling_gpt2.py#L491) module as the language model. It wraps [GPT2Model](https://github.com/huggingface/transformers/blob/master/transformers/modeling_gpt2.py#L326) and adds a linear layer that reuses the input embedding weights to perform the inverse of the embedding operation: it maps GPT2's output vectors to a logits vector over the vocabulary.\n",
"\n",
"[GPT2Tokenizer](https://github.com/huggingface/transformers/blob/master/transformers/tokenization_gpt2.py#L106) is a byte pair encoder that transforms the input text into the token IDs that the huggingface transformer models were trained on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')\n",
"model = GPT2LMHeadModel.from_pretrained('gpt2-medium')\n",
"model = model.to(device)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Select the top-n tokens from the predicted probability distribution,\n",
"# then sample a token ID from the renormalized top-n distribution.\n",
"def choose_from_top(probs, n=5):\n",
"    ind = np.argpartition(probs, -n)[-n:]\n",
"    top_prob = probs[ind]\n",
"    top_prob = top_prob / np.sum(top_prob) # Normalize\n",
"    choice = np.random.choice(n, 1, p = top_prob)\n",
"    token_id = ind[choice][0]\n",
"    return int(token_id)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Text generation\n",
"\n",
"At each prediction step, the GPT2 model needs all of the previous sequence elements to predict the next one. The function below tokenizes the starting input text; then, in a loop, it predicts one new token at each step and appends it to the sequence, which is fed back into the model at the next step. Finally, the token list is decoded back into text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def generate_some_text(input_str, text_len = 250):\n",
"\n",
"    cur_ids = torch.tensor(tokenizer.encode(input_str)).unsqueeze(0).long().to(device)\n",
"\n",
"    model.eval()\n",
"    with torch.no_grad():\n",
"\n",
"        for i in range(text_len):\n",
"            outputs = model(cur_ids, labels=cur_ids)\n",
"            loss, logits = outputs[:2]\n",
"            softmax_logits = torch.softmax(logits[0,-1], dim=0) # Take the first (and only) batch and the logits at the last position\n",
"            next_token_id = choose_from_top(softmax_logits.to('cpu').numpy(), n=10) # Sample the next token from the top-n distribution\n",
"            cur_ids = torch.cat([cur_ids, torch.ones((1,1)).long().to(device) * next_token_id], dim = 1) # Append the sampled token to the sequence\n",
"\n",
"    output_list = list(cur_ids.squeeze().to('cpu').numpy())\n",
"    output_text = tokenizer.decode(output_list)\n",
"    print(output_text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generating the text\n",
"\n",
"I will give three different sentence beginnings to GPT2 and let it generate the rest:\n",
"\n",
"\n",
"***1. The Matrix is everywhere. It is all around us. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work… when you go to church… when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth…***\n",
"\n",
"***2. Artificial general intelligence is…***\n",
"\n",
"***3. The Godfather: “I’m going to make him an offer he can’t refuse.”…***"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The Matrix is everywhere. It is all around us. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work... when you go to church... when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth. It is the world of technology and information that is now being used to enslave you for a lifetime and to destroy you when you are young, as we know is going to happen. This technology, the Matrix, was created for this purpose. It is now being used by the Illuminati. And it's the world we live in that's being used to enslave us. You see, this world is not a dream. It is real, and if we want to know why, we need to look inside our own souls. It's called \"love.\" It is the most powerful force in the world, and it's the only force that can save us from the Matrix. Love is the power that allows you to love someone in your life. Love is also your power of resistance. If you don't love someone, you can't be with someone. But love is also your power of love. The power of love is what makes you love. And it is what gives you strength, and it is what can help you survive the matrix. Love is the only force that has the power to save the Matrix. If we don't love one another, no matter how much pain that may cause us, we will die, and the world will\n"
]
}
],
"source": [
"generate_some_text(\" The Matrix is everywhere. It is all around us. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work... when you go to church... when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth. \")"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Artificial general intelligence is the most likely future of the human race; it's a science which is not just possible but inevitable.\"\n",
"\n",
"The AI industry, like most in technology, is based on a belief system of \"the future is here\". We've seen that with Google and Apple.\n",
"\n",
"But this time round, we're not going to be able to see this future in the near future; we're going to have artificial intelligence as we know it today in a very short period of time. And that's not going to be good news for anyone in the industry, and certainly not the people who are making the AI.\n",
"\n",
"The AI industry is currently in a very precarious place; it's very fragile, and very vulnerable to the kind of changes that are happening in the world today.\n",
"\n",
"The biggest threat is going to be the rise of artificial intelligent technologies. AI is a very powerful concept, but it's not going to be able to completely replace the people who are making it, or the people that are using it, or the technology that it can use.\n",
"\n",
"I think what's going to happen is that as more and more AI becomes available, the people who are developing these systems are going to have to start asking themselves, are\n"
]
}
],
"source": [
"generate_some_text(\" Artificial general intelligence is \")"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The Godfather: \"I'm going to make him an offer he can't refuse.\"\n",
"\n",
"The Godfather: \"What? What is it? He has to be a good boy? A good boy that doesn't want to be killed? Is the offer good?\"\n",
"\n",
"The Godfather: \"He's a bad boy, isn't he.\"\n",
"\n",
"The Godfather: \"You're a good boy!\"\n",
"\n",
"The Godfather: \"He's an idiot. He won't be able to understand what's going on!\"\n",
"\n",
"The Godfather: \"You know, I never said you would be able to understand what's going on! I said you would be able to take him to a friend's house.\"\n",
"\n",
"The Godfather: \"I don't understand! You mean you'll never understand what's going on? What's happening to me?\"\n",
"\n",
"The Godfather: \"That's the only way I can explain it to him. He's not going to be able to understand it either if I tell him what I know. He won't be able even to comprehend a thing if I tell him what it is.\"\n",
"\n",
"The Godfather: \"Well, you know, I've seen it all. I don't know what he will do. And, if he does, what's\n"
]
}
],
"source": [
"generate_some_text(\" The Godfather: \\\"I'm going to make him an offer he can't refuse.\\\" \")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.14"
}
},
"nbformat": 4,
"nbformat_minor": 2
}