{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"# Journée Recherche - Octobre 2022 - Language Models And Text Generation"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"In this lab, we will learn how to create a simple bigram language model and how to\n",
"generate text from it.\n",
"\n",
"We will also have a quick look at GPT2, a neural language model.\n",
"\n",
"Before we start, we need to install some libraries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": true,
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"!pip3 install datasets nltk transformers torch ipywidgets\n",
"import os\n",
"os._exit(00)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Data Loading\n",
"\n",
"This section loads a dataset of children books. You have nothing in particular to do, just\n",
"have a look at the code.\n",
"\n",
"We use a library called [Pandas](https://pandas.pydata.org/). It is a very popular\n",
"library to manipulate data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datasets import load_dataset\n",
"\n",
"dataset = load_dataset(\"Aunsiels/InfantBooks\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"df = dataset[\"train\"].to_pandas()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# A sample of the data\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Data Preprocessing\n",
"\n",
"Now that the data is loaded, we need to preprocess it to create our bigram language model.\n",
"We first need to split our sentences into words. Why can't we simply cut using the space\n",
"as a delimiter?\n",
"\n",
"Splitting a sentence into a list (of words here) is called tokenization. It is a crucial\n",
"step as the goal of our language model is to predict the next token.\n",
"\n",
"We will use a library called NLTK (Natural Language ToolKit). It is a common library for\n",
"natural language processing. It contains a function that splits a sentence into words:\n",
"**word\\_tokenize**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"import nltk\n",
"nltk.download('punkt')\n",
"\n",
"from nltk.tokenize import word_tokenize"
]
},
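{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"To see why splitting on spaces is not enough, here is a quick check on a made-up\n",
"sentence (the sentence itself is only an illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"example = \"Don't stop, it's fun!\"\n",
"# Splitting on spaces keeps punctuation glued to the words.\n",
"print(example.split(\" \"))\n",
"# word_tokenize separates punctuation and contractions into their own tokens.\n",
"print(word_tokenize(example))"
]
},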
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# all_next_words is a dictionary that, for an input w_in gives all the\n",
"# possible following words [w0, ..., w_n] (with repetition).\n",
"all_next_words = dict()\n",
"\n",
"# Using the following skeleton, fill the all_next_words dictionary.\n",
"for _, row in df.iterrows():\n",
" words = word_tokenize(row[\"content\"].lower())\n",
" for i in range(len(words) - 1):\n",
" prev_word = words[i]\n",
" next_word = words[i+1]\n",
" if prev_word not in all_next_words:\n",
" all_next_words[prev_word] = []\n",
" all_next_words[???].append(???)"
]
},
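{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"As a quick sanity check (assuming the cell above ran over the whole dataset), we can\n",
"look at a few recorded successors of a common word; the exact output depends on the\n",
"corpus:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# A frequent word such as \"the\" should have many recorded successors.\n",
"print(len(all_next_words.get(\"the\", [])))\n",
"print(all_next_words.get(\"the\", [])[:10])"
]
},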
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# We do not a raw list for each input, but rather a frequency (so we can select\n",
"# the best one). Here, we use the Counter library to do it for us.\n",
"from collections import Counter\n",
"\n",
"conditional_probabilities = dict()\n",
"\n",
"for key, value in all_next_words.items():\n",
" conditional_probabilities[key] = Counter(value)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Observe how we obtain the most common words with their frequency.\n",
"conditional_probabilities[\"father\"].most_common(5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Text Generation Algorithms\n",
"\n",
"Now that we have a simple language model, we can use it to generate text. We will implement\n",
"two algorithms here: the greedy search and the sampling search."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Given an initial word, generate a sentence that starts with this word using the\n",
"# language model we created previously and a greedy search (i.e. also take the most\n",
"# probable word).\n",
"# To prevent infinite generation, we want to stop the generation once we generation a\n",
"# dot (.) or once the generate sentence contains 20 tokens.\n",
"def generate_greedy(initial, temperature=1):\n",
" # cur contains the current word\n",
" cur = initial\n",
" # res is the list containing all the words in our final sentence.\n",
" res = [cur]\n",
" while cur != \".\" and len(res) < 20:\n",
" words = [x[0] for x in conditional_probabilities[cur].most_common(100)]\n",
" proba = [x[1] for x in conditional_probabilities[cur].most_common(100)]\n",
" # TO COMPLETE: choose the word with highest probability\n",
" return \" \".join(res)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Try to play with it. What are the limitations that you can observe?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": []
},
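{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"For instance, a call could look like this (assuming \"the\" and \"once\" appear in the\n",
"corpus; any word from the vocabulary works):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Greedy search is deterministic: the same start word always yields the same sentence.\n",
"print(generate_greedy(\"the\"))\n",
"print(generate_greedy(\"once\"))"
]
},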
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"import random"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Write the sampling generation approach, i.e. the next word is sampled from all possible\n",
"# next words with a probability proportional to their weight. To do so, we can use:\n",
"# random.choices(words, proba, k=1)[0]\n",
"def generate_proba(initial, temperature=1):\n",
" cur = initial\n",
" res = [cur]\n",
" while cur != \".\" and len(res) < 20:\n",
" words = [x[0] for x in conditional_probabilities[cur].most_common(100)]\n",
" proba = [x[1] ** (1 / temperature) for x in conditional_probabilities[cur].most_common(100)]\n",
" # TO COMPLETE : Sample a word following the probability distribution\n",
" return \" \".join(res)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Observe what this algorithm generate. If you have time, try to add an additional parameter\n",
"called the temperature. It modifies the weights such that a high temperature create a uniform\n",
"distribution, a temperature of 1 is the distribution given by the language model, and\n",
"a temperature close to zero converge toward a probability of 1 for the most frequent word."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": []
},
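{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"For example, we can compare a few temperatures. This is only a sketch: the outputs\n",
"are random and will differ between runs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Low temperatures stay close to greedy search; high temperatures look more random.\n",
"for t in (0.2, 1, 5):\n",
"    print(t, \":\", generate_proba(\"the\", temperature=t))"
]
},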
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## Pretrained Neural Network Model\n",
"\n",
"The language model we generated previously is very limiting for two reasons: It\n",
"only models bigrams and the corpus is too simple. Solving the second problem is \"easy\":\n",
"We could use Wikipedia, all the books ever written, or even the entire web to learn the\n",
"probabilities.\n",
"\n",
"For the second problem, we cannot keep using a N-gram approach for a small N. We need to\n",
"use a neural network that will be able to predict the next word using more context.\n",
"Besides, it will generalize much better.\n",
"\n",
"We do not have time to explain how these neural networks work. We want to play a bit with\n",
"them to generate text. For your own culture, there are two language models that\n",
"are widely used: **GPT1/2/3** that can generate text (what we want here) and **BERT**\n",
"that can (almost) do everything else."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from transformers import GPT2LMHeadModel, GPT2Tokenizer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
"model = GPT2LMHeadModel.from_pretrained(\"gpt2\", pad_token_id=tokenizer.eos_token_id)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Here is the function for the greedy search. We can notice that it also tokenizes the\n",
"# input sequence (that can be composed of several words now). Besides, it turns each\n",
"# token into a number that identifies it uniquely. The reason is that neural networks\n",
"# work on numbers, not raw text.\n",
"def generate_greedy_gpt(sentence):\n",
" input_ids = tokenizer.encode(sentence, return_tensors=\"pt\")\n",
" greedy_output = model.generate(input_ids, max_length=50)\n",
" return tokenizer.decode(greedy_output[0], skip_special_tokens=True)"
]
},
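{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"To see what the tokenizer does, here is a small check (the exact ids come from the\n",
"GPT2 vocabulary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"ids = tokenizer.encode(\"I enjoy walking with my dog\")\n",
"# Each token is mapped to a unique integer id...\n",
"print(ids)\n",
"# ...and the ids can be mapped back to the tokens they represent.\n",
"print(tokenizer.convert_ids_to_tokens(ids))"
]
},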
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Try this function and remember that you are not limited to a single word now.\n",
"Is it better that before? You can try to modify the maximum input sequence."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": []
},
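{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"For instance (the prompt is only an illustration; try your own sentences):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Greedy decoding tends to repeat itself on long outputs; see if you observe it.\n",
"print(generate_greedy_gpt(\"Once upon a time, there was\"))"
]
},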
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Here is the function that implements the sampling approach.\n",
"def generate_proba_gpt(sentence):\n",
" input_ids = tokenizer.encode(sentence, return_tensors=\"pt\")\n",
" sample_output = model.generate(\n",
" input_ids,\n",
" do_sample=True,\n",
" max_length=50,\n",
" top_k=0\n",
" )\n",
" return tokenizer.decode(sample_output[0], skip_special_tokens=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Try to play with this function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": []
},
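{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"To connect this to the temperature we saw earlier: model.generate also accepts a\n",
"temperature argument when sampling. A minimal sketch (the helper name\n",
"generate_proba_gpt_t is ours, not part of the lab):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Sampling with an explicit temperature: lower values make the output more\n",
"# conservative, higher values make it more surprising.\n",
"def generate_proba_gpt_t(sentence, temperature=1.0):\n",
"    input_ids = tokenizer.encode(sentence, return_tensors=\"pt\")\n",
"    sample_output = model.generate(\n",
"        input_ids,\n",
"        do_sample=True,\n",
"        max_length=50,\n",
"        top_k=0,\n",
"        temperature=temperature\n",
"    )\n",
"    return tokenizer.decode(sample_output[0], skip_special_tokens=True)\n",
"\n",
"print(generate_proba_gpt_t(\"Once upon a time, there was\", temperature=0.7))"
]
},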
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# We are finally reaching one of the most used algorithm to generate text: The beam search!\n",
"def generate_beam_gpt(sentence):\n",
" input_ids = tokenizer.encode(sentence, return_tensors=\"pt\")\n",
" beam_output = model.generate(\n",
" input_ids,\n",
" max_length=50,\n",
" num_beams=5,\n",
" early_stopping=True\n",
" )\n",
" return tokenizer.decode(beam_output[0], skip_special_tokens=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Try to play with it. What do you observe if you increase the number of beams?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": []
},
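{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"For example, we can compare different beam sizes with a small variant of the function\n",
"above (generate_beam_gpt_n is our own helper, not part of the lab):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# The same prompt decoded with an increasing number of beams.\n",
"def generate_beam_gpt_n(sentence, num_beams):\n",
"    input_ids = tokenizer.encode(sentence, return_tensors=\"pt\")\n",
"    beam_output = model.generate(\n",
"        input_ids,\n",
"        max_length=50,\n",
"        num_beams=num_beams,\n",
"        early_stopping=True\n",
"    )\n",
"    return tokenizer.decode(beam_output[0], skip_special_tokens=True)\n",
"\n",
"for n in (1, 3, 10):\n",
"    print(n, \":\", generate_beam_gpt_n(\"Once upon a time, there was\", n))"
]
},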
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"You are now reaching the end of this lab. If you are interested in natural language\n",
"processing, do not hesitate to contact me for some advice, for a project, or for\n",
"an internship.\n",
"\n",
"\n",
"As a last note, you can try GPT3, one of the most powerful language model,\n",
"online: [https://beta.openai.com/playground](https://beta.openai.com/playground). It is\n",
"quite astonishing!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}