{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "LM Huggingface finetune GPT-2.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true,
"mount_file_id": "1lkuYC06chnKHyMAeD3TpZwV7DrAPNPj0",
"authorship_tag": "ABX9TyOAoMuCXHpR4iX+/S6GUoKH",
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/GeorgeDittmar/5c57a35332b2b5818e51618af7953351/lm-huggingface-finetune-gpt-2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MZMbGCw0Qxfg"
},
"source": [
"# **Finetuning GPT2 using HuggingFace and Tensorflow**\n",
"\n",
"In this colab notebook we set up a simple outline of how you can use Huggingface to fine tune a gpt2 model on finance titles to generate new possible headlines. This notebook uses the hugginface finefuning scripts and then uses the TensorFlow version of the genreated models."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PzB7m0GI5fSt"
},
"source": [
"First begin setup by cloning transformers repo. We need to store the training script locally since there isnt an easier way to train tf based gpt2 models as far as I can see."
]
},
{
"cell_type": "code",
"metadata": {
"id": "91rmSAUQVIUP",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "13e10b71-6fa8-4e32-8b09-1d6c53d27680"
},
"source": [
"#Clone the transformers repo into the notebook\n",
"!git clone https://github.com/huggingface/transformers"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Cloning into 'transformers'...\n",
"remote: Enumerating objects: 92, done.\u001b[K\n",
"remote: Counting objects: 100% (92/92), done.\u001b[K\n",
"remote: Compressing objects: 100% (49/49), done.\u001b[K\n",
"remote: Total 57660 (delta 47), reused 59 (delta 33), pack-reused 57568\u001b[K\n",
"Receiving objects: 100% (57660/57660), 42.96 MiB | 5.55 MiB/s, done.\n",
"Resolving deltas: 100% (40454/40454), done.\n",
"Cloning into 'reddit-scan-datasets'...\n",
"remote: Enumerating objects: 21, done.\u001b[K\n",
"remote: Counting objects: 100% (21/21), done.\u001b[K\n",
"remote: Compressing objects: 100% (18/18), done.\u001b[K\n",
"remote: Total 141 (delta 6), reused 12 (delta 3), pack-reused 120\u001b[K\n",
"Receiving objects: 100% (141/141), 51.18 MiB | 5.98 MiB/s, done.\n",
"Resolving deltas: 100% (62/62), done.\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "RFzEr4Y1VJKP",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "94b9a4c6-e068-46bd-d600-37c365ded8f9"
},
"source": [
"# Clone should now be in the machine\n",
"!ls"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"reddit-scan-datasets sample_data transformers\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "M-ke4920tDqv"
},
"source": [
"Check to see what gpu we were granted. For Colab Pro it will vary between a Tesla V100 or P100. For normal colab it should be a k80"
]
},
{
"cell_type": "code",
"metadata": {
"id": "BPIgByLqmI81",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "d68ded87-37bb-4156-d5f4-0df01133f3e8"
},
"source": [
"!nvidia-smi"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Mon Dec 28 20:38:50 2020 \n",
"+-----------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 460.27.04 Driver Version: 418.67 CUDA Version: 10.1 |\n",
"|-------------------------------+----------------------+----------------------+\n",
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
"| | | MIG M. |\n",
"|===============================+======================+======================|\n",
"| 0 Tesla V100-SXM2... Off | 00000000:00:04.0 Off | 0 |\n",
"| N/A 35C P0 25W / 300W | 0MiB / 16130MiB | 0% Default |\n",
"| | | ERR! |\n",
"+-------------------------------+----------------------+----------------------+\n",
" \n",
"+-----------------------------------------------------------------------------+\n",
"| Processes: |\n",
"| GPU GI CI PID Type Process name GPU Memory |\n",
"| ID ID Usage |\n",
"|=============================================================================|\n",
"| No running processes found |\n",
"+-----------------------------------------------------------------------------+\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XOTdb4rWv8YN"
},
"source": [
"Change directory location to be in the examples folder and then install any requirements"
]
},
{
"cell_type": "code",
"metadata": {
"id": "K2M2Oz9CYB4P",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "68fa2b00-20c9-497c-8ba6-9ed85843650a"
},
"source": [
"import os\n",
"os.chdir(\"transformers\")\n",
"os.chdir(\"./examples/language-modeling\")\n",
"!ls"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"README.md\t run_clm.py\t run_mlm.py\t run_plm.py\n",
"requirements.txt run_mlm_flax.py run_mlm_wwm.py\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "04rBGxwiYnep",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "3902c408-0ab6-49b9-8ed5-5d7f5bd6cb54"
},
"source": [
"!pip install -r requirements.txt"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Collecting datasets>=1.1.3\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/1a/38/0c24dce24767386123d528d27109024220db0e7a04467b658d587695241a/datasets-1.1.3-py3-none-any.whl (153kB)\n",
"\u001b[K |████████████████████████████████| 163kB 14.1MB/s \n",
"\u001b[?25hCollecting sentencepiece!=0.1.92\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/e5/2d/6d4ca4bef9a67070fa1cac508606328329152b1df10bdf31fb6e4e727894/sentencepiece-0.1.94-cp36-cp36m-manylinux2014_x86_64.whl (1.1MB)\n",
"\u001b[K |████████████████████████████████| 1.1MB 22.0MB/s \n",
"\u001b[?25hRequirement already satisfied: protobuf in /usr/local/lib/python3.6/dist-packages (from -r requirements.txt (line 3)) (3.12.4)\n",
"Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.6/dist-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (2.23.0)\n",
"Requirement already satisfied: tqdm<4.50.0,>=4.27 in /usr/local/lib/python3.6/dist-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (4.41.1)\n",
"Requirement already satisfied: dill in /usr/local/lib/python3.6/dist-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.3.3)\n",
"Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.6/dist-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (1.19.4)\n",
"Requirement already satisfied: dataclasses; python_version < \"3.7\" in /usr/local/lib/python3.6/dist-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.8)\n",
"Collecting xxhash\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/f7/73/826b19f3594756cb1c6c23d2fbd8ca6a77a9cd3b650c9dec5acc85004c38/xxhash-2.0.0-cp36-cp36m-manylinux2010_x86_64.whl (242kB)\n",
"\u001b[K |████████████████████████████████| 245kB 49.6MB/s \n",
"\u001b[?25hRequirement already satisfied: multiprocess in /usr/local/lib/python3.6/dist-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.70.11.1)\n",
"Collecting pyarrow>=0.17.1\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/d7/e1/27958a70848f8f7089bff8d6ebe42519daf01f976d28b481e1bfd52c8097/pyarrow-2.0.0-cp36-cp36m-manylinux2014_x86_64.whl (17.7MB)\n",
"\u001b[K |████████████████████████████████| 17.7MB 367kB/s \n",
"\u001b[?25hRequirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (1.1.5)\n",
"Requirement already satisfied: six>=1.9 in /usr/local/lib/python3.6/dist-packages (from protobuf->-r requirements.txt (line 3)) (1.15.0)\n",
"Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from protobuf->-r requirements.txt (line 3)) (51.0.0)\n",
"Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (1.24.3)\n",
"Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (2.10)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (2020.12.5)\n",
"Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (3.0.4)\n",
"Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->datasets>=1.1.3->-r requirements.txt (line 1)) (2018.9)\n",
"Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.6/dist-packages (from pandas->datasets>=1.1.3->-r requirements.txt (line 1)) (2.8.1)\n",
"Installing collected packages: xxhash, pyarrow, datasets, sentencepiece\n",
" Found existing installation: pyarrow 0.14.1\n",
" Uninstalling pyarrow-0.14.1:\n",
" Successfully uninstalled pyarrow-0.14.1\n",
"Successfully installed datasets-1.1.3 pyarrow-2.0.0 sentencepiece-0.1.94 xxhash-2.0.0\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "iB29mAKjaNIQ",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "cf082656-d9de-44b4-c812-44757a196c32"
},
"source": [
"!ls"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"README.md\t run_clm.py\t run_mlm.py\t run_plm.py\n",
"requirements.txt run_mlm_flax.py run_mlm_wwm.py\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "eo5gRmXaWx0m",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "4413970a-159b-484f-b090-106f50be1434"
},
"source": [
"!pip install pyarrow --upgrade"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Requirement already up-to-date: pyarrow in /usr/local/lib/python3.6/dist-packages (2.0.0)\n",
"Requirement already satisfied, skipping upgrade: numpy>=1.14 in /usr/local/lib/python3.6/dist-packages (from pyarrow) (1.19.4)\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "y5sdYSpAWY1S"
},
"source": [
"import os\n",
"os.chdir(\"/content/transformers/examples/\")\n",
"os.chdir(\"./language-modeling\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "a6G6WINaYmwx",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "589aa5af-1149-40f3-a069-7893d55bf77a"
},
"source": [
"# Need to install latest transformer packages from github so the scripts will run correctly\n",
"! pip install git+git://github.com/huggingface/transformers/"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Collecting git+git://github.com/huggingface/transformers/\n",
" Cloning git://github.com/huggingface/transformers/ to /tmp/pip-req-build-fii4adi2\n",
" Running command git clone -q git://github.com/huggingface/transformers/ /tmp/pip-req-build-fii4adi2\n",
" Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
" Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
" Preparing wheel metadata ... \u001b[?25l\u001b[?25hdone\n",
"Requirement already satisfied: dataclasses; python_version < \"3.7\" in /usr/local/lib/python3.6/dist-packages (from transformers==4.2.0.dev0) (0.8)\n",
"Requirement already satisfied: filelock in /usr/local/lib/python3.6/dist-packages (from transformers==4.2.0.dev0) (3.0.12)\n",
"Collecting sacremoses\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)\n",
"\u001b[K |████████████████████████████████| 890kB 12.8MB/s \n",
"\u001b[?25hCollecting tokenizers==0.9.4\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)\n",
"\u001b[K |████████████████████████████████| 2.9MB 31.9MB/s \n",
"\u001b[?25hRequirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.6/dist-packages (from transformers==4.2.0.dev0) (2019.12.20)\n",
"Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from transformers==4.2.0.dev0) (2.23.0)\n",
"Requirement already satisfied: packaging in /usr/local/lib/python3.6/dist-packages (from transformers==4.2.0.dev0) (20.8)\n",
"Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.6/dist-packages (from transformers==4.2.0.dev0) (4.41.1)\n",
"Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from transformers==4.2.0.dev0) (1.19.4)\n",
"Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers==4.2.0.dev0) (1.15.0)\n",
"Requirement already satisfied: click in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers==4.2.0.dev0) (7.1.2)\n",
"Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers==4.2.0.dev0) (1.0.0)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->transformers==4.2.0.dev0) (2020.12.5)\n",
"Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->transformers==4.2.0.dev0) (2.10)\n",
"Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->transformers==4.2.0.dev0) (1.24.3)\n",
"Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->transformers==4.2.0.dev0) (3.0.4)\n",
"Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.6/dist-packages (from packaging->transformers==4.2.0.dev0) (2.4.7)\n",
"Building wheels for collected packages: transformers\n",
" Building wheel for transformers (PEP 517) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for transformers: filename=transformers-4.2.0.dev0-cp36-none-any.whl size=1527284 sha256=286ed09408e39f89cf6a92cbdc8de24217038dec6efc94593139f9befc937614\n",
" Stored in directory: /tmp/pip-ephem-wheel-cache-__qy55tc/wheels/dc/e5/1e/3a2977a646558fca07585cadcf56aa4a910e995ba945961c4e\n",
"Successfully built transformers\n",
"Building wheels for collected packages: sacremoses\n",
" Building wheel for sacremoses (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=684546f54ac2b31fd17b4109c1898cac578b4dbaef839e7a0af6d6a40840bade\n",
" Stored in directory: /root/.cache/pip/wheels/29/3c/fd/7ce5c3f0666dab31a50123635e6fb5e19ceb42ce38d4e58f45\n",
"Successfully built sacremoses\n",
"Installing collected packages: sacremoses, tokenizers, transformers\n",
"Successfully installed sacremoses-0.0.43 tokenizers-0.9.4 transformers-4.2.0.dev0\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oDuJnVmrY_Fb"
},
"source": [
"Set up data from a text file in the format <|title|> some data <|endoftext|> and split into training and eval sets."
]
},
{
"cell_type": "code",
"metadata": {
"id": "uU4ckPTf9T-2",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "7bcdab3a-f95d-4e99-b7fc-993422fad68f"
},
"source": [
"\"\"\"\n",
"Now load the data line by line\n",
"\"\"\"\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"with open('<path to text file>', 'r') as data:\n",
" dataset = [\"<|title|>\" + x.strip() for x in data.readlines()]\n",
"\n",
"train, eval = train_test_split(dataset, train_size=.9, random_state=2020)\n",
"print(\"training size:\" + str(len(train)))\n",
"print(\"Evaluation size: \" + str(len(eval))\n",
"\n",
"with open('train_tmp.txt', 'w') as file_handle:\n",
" file_handle.write(\"<|endoftext|>\".join(train))\n",
"\n",
"with open('eval_tmp.txt', 'w') as file_handle:\n",
" file_handle.write(\"<|endoftext|>\".join(eval))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"training size:86114\n",
"Evaluation size: 9569\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1TYi8Oqs6Npo"
},
"source": [
"The script below will fine tune GPT2 on your text data that you setup above. This training step will take anywhre from tens of minutes to hours depending on how large your training set is, how many epochs you intend to train on, and if you are using colab or colab pro. We utilize mixed precision in this model to shave off some training time. For a large data set I was using for another experiment it saved us over 30 mins in training time."
]
},
{
"cell_type": "code",
"metadata": {
"id": "yyV_rTL3ZE_H",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "a1e1bccb-a8bf-4616-dcc8-d6eb893cf7d2"
},
"source": [
"!python run_clm.py \\\n",
"--model_type gpt2-medium \\\n",
"--model_name_or_path gpt2-medium \\\n",
"--train_file \"train_tmp.txt\" \\\n",
"--do_train \\\n",
"--validation_file \"eval_tmp.txt\" \\\n",
"--do_eval \\\n",
"--per_gpu_train_batch_size 1 \\\n",
"--save_steps -1 \\\n",
"--num_train_epochs 5 \\\n",
"--fp16 \\\n",
"--output_dir=\"<Path to your output dir>\""
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"2020-12-28 20:44:54.505215: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1\n",
"12/28/2020 20:44:56 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: True\n",
"12/28/2020 20:44:56 - INFO - __main__ - Training/evaluation parameters TrainingArguments(output_dir=<Path to your output dir>, overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, model_parallel=False, evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0, logging_dir=runs/Dec28_20-44-56_016c2c103498, logging_first_step=False, logging_steps=500, save_steps=-1, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=<Path to your output dir>, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend=auto, sharded_ddp=False, label_smoothing_factor=0.0, adafactor=False)\n",
"Using custom data configuration default\n",
"Reusing dataset text (/root/.cache/huggingface/datasets/text/default-baa6195a2cf10815/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab)\n",
"[INFO|configuration_utils.py:431] 2020-12-28 20:44:57,584 >> loading configuration file https://huggingface.co/gpt2-medium/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3a7a4b7235202f93d14a4a5e8200709184c5b25a29d9cfa6b0ede5166adf0768.cf0ec4a33a38dc96108560e01338af4bd3360dd859385d451c35b41987ae73ff\n",
"[INFO|configuration_utils.py:467] 2020-12-28 20:44:57,584 >> Model config GPT2Config {\n",
" \"activation_function\": \"gelu_new\",\n",
" \"architectures\": [\n",
" \"GPT2LMHeadModel\"\n",
" ],\n",
" \"attn_pdrop\": 0.1,\n",
" \"bos_token_id\": 50256,\n",
" \"embd_pdrop\": 0.1,\n",
" \"eos_token_id\": 50256,\n",
" \"gradient_checkpointing\": false,\n",
" \"initializer_range\": 0.02,\n",
" \"layer_norm_epsilon\": 1e-05,\n",
" \"model_type\": \"gpt2\",\n",
" \"n_ctx\": 1024,\n",
" \"n_embd\": 1024,\n",
" \"n_head\": 16,\n",
" \"n_inner\": null,\n",
" \"n_layer\": 24,\n",
" \"n_positions\": 1024,\n",
" \"n_special\": 0,\n",
" \"predict_special_tokens\": true,\n",
" \"resid_pdrop\": 0.1,\n",
" \"summary_activation\": null,\n",
" \"summary_first_dropout\": 0.1,\n",
" \"summary_proj_to_labels\": true,\n",
" \"summary_type\": \"cls_index\",\n",
" \"summary_use_proj\": true,\n",
" \"task_specific_params\": {\n",
" \"text-generation\": {\n",
" \"do_sample\": true,\n",
" \"max_length\": 50\n",
" }\n",
" },\n",
" \"use_cache\": true,\n",
" \"vocab_size\": 50257\n",
"}\n",
"\n",
"[INFO|configuration_utils.py:431] 2020-12-28 20:44:57,869 >> loading configuration file https://huggingface.co/gpt2-medium/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3a7a4b7235202f93d14a4a5e8200709184c5b25a29d9cfa6b0ede5166adf0768.cf0ec4a33a38dc96108560e01338af4bd3360dd859385d451c35b41987ae73ff\n",
"[INFO|configuration_utils.py:467] 2020-12-28 20:44:57,869 >> Model config GPT2Config {\n",
" \"activation_function\": \"gelu_new\",\n",
" \"architectures\": [\n",
" \"GPT2LMHeadModel\"\n",
" ],\n",
" \"attn_pdrop\": 0.1,\n",
" \"bos_token_id\": 50256,\n",
" \"embd_pdrop\": 0.1,\n",
" \"eos_token_id\": 50256,\n",
" \"gradient_checkpointing\": false,\n",
" \"initializer_range\": 0.02,\n",
" \"layer_norm_epsilon\": 1e-05,\n",
" \"model_type\": \"gpt2\",\n",
" \"n_ctx\": 1024,\n",
" \"n_embd\": 1024,\n",
" \"n_head\": 16,\n",
" \"n_inner\": null,\n",
" \"n_layer\": 24,\n",
" \"n_positions\": 1024,\n",
" \"n_special\": 0,\n",
" \"predict_special_tokens\": true,\n",
" \"resid_pdrop\": 0.1,\n",
" \"summary_activation\": null,\n",
" \"summary_first_dropout\": 0.1,\n",
" \"summary_proj_to_labels\": true,\n",
" \"summary_type\": \"cls_index\",\n",
" \"summary_use_proj\": true,\n",
" \"task_specific_params\": {\n",
" \"text-generation\": {\n",
" \"do_sample\": true,\n",
" \"max_length\": 50\n",
" }\n",
" },\n",
" \"use_cache\": true,\n",
" \"vocab_size\": 50257\n",
"}\n",
"\n",
"[INFO|tokenization_utils_base.py:1802] 2020-12-28 20:44:58,700 >> loading file https://huggingface.co/gpt2-medium/resolve/main/vocab.json from cache at /root/.cache/huggingface/transformers/fee58641d7a73348d842afaa337d5a7763dad32beff8d9008bb3c3c847749d6b.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f\n",
"[INFO|tokenization_utils_base.py:1802] 2020-12-28 20:44:58,700 >> loading file https://huggingface.co/gpt2-medium/resolve/main/merges.txt from cache at /root/.cache/huggingface/transformers/23c853a0fcfc12c7d72ad4e922068b6982665b673f6de30b4c5cbe5bd70a2236.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b\n",
"[INFO|tokenization_utils_base.py:1802] 2020-12-28 20:44:58,700 >> loading file https://huggingface.co/gpt2-medium/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/8e4f9a65085b1b4ae69ffac9a953a44249c9ea1e72e4a7816ee87b70081df038.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0\n",
"[INFO|modeling_utils.py:1024] 2020-12-28 20:44:59,045 >> loading weights file https://huggingface.co/gpt2-medium/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/6249eef5c8c1fcfccf9f36fc2e59301b109ac4036d8ebbee9c2b7f7e47f440bd.2538e2565f9e439a3668b981faf959c8b490b36dd631f3c4cd992519b2dd36f1\n",
"[INFO|modeling_utils.py:1140] 2020-12-28 20:45:13,169 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.\n",
"\n",
"[INFO|modeling_utils.py:1149] 2020-12-28 20:45:13,170 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2-medium.\n",
"If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.\n",
"[WARNING|tokenization_utils_base.py:3233] 2020-12-28 20:45:18,316 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1645690 > 1024). Running this sequence through the model will result in indexing errors\n",
"Loading cached processed dataset at /root/.cache/huggingface/datasets/text/default-baa6195a2cf10815/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-2f16e55684bde3f7.arrow\n",
"Loading cached processed dataset at /root/.cache/huggingface/datasets/text/default-baa6195a2cf10815/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-7b35839b03132ea8.arrow\n",
"Loading cached processed dataset at /root/.cache/huggingface/datasets/text/default-baa6195a2cf10815/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-fc4f97785ff59f90.arrow\n",
"Loading cached processed dataset at /root/.cache/huggingface/datasets/text/default-baa6195a2cf10815/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-a79f1dcbcc03ea5e.arrow\n",
"[INFO|trainer.py:395] 2020-12-28 20:45:26,558 >> The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: .\n",
"[INFO|trainer.py:395] 2020-12-28 20:45:26,559 >> The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: .\n",
"[INFO|trainer.py:312] 2020-12-28 20:45:26,559 >> Using amp fp16 backend\n",
"[WARNING|training_args.py:450] 2020-12-28 20:45:26,560 >> Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.\n",
"[WARNING|training_args.py:450] 2020-12-28 20:45:26,565 >> Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.\n",
"[INFO|trainer.py:718] 2020-12-28 20:45:26,565 >> ***** Running training *****\n",
"[INFO|trainer.py:719] 2020-12-28 20:45:26,565 >> Num examples = 1607\n",
"[INFO|trainer.py:720] 2020-12-28 20:45:26,565 >> Num Epochs = 5\n",
"[INFO|trainer.py:721] 2020-12-28 20:45:26,565 >> Instantaneous batch size per device = 8\n",
"[INFO|trainer.py:722] 2020-12-28 20:45:26,565 >> Total train batch size (w. parallel, distributed & accumulation) = 1\n",
"[INFO|trainer.py:723] 2020-12-28 20:45:26,566 >> Gradient Accumulation steps = 1\n",
"[INFO|trainer.py:724] 2020-12-28 20:45:26,566 >> Total optimization steps = 8035\n",
"[WARNING|training_args.py:450] 2020-12-28 20:45:26,575 >> Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.\n",
" 0% 0/8035 [00:00<?, ?it/s]/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate\n",
" \"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate\", UserWarning)\n",
"{'loss': 2.9204345703125, 'learning_rate': 4.688861232109521e-05, 'epoch': 0.3111387678904792}\n",
"{'loss': 2.78691552734375, 'learning_rate': 4.377722464219042e-05, 'epoch': 0.6222775357809583}\n",
"{'loss': 2.756384521484375, 'learning_rate': 4.066583696328562e-05, 'epoch': 0.9334163036714375}\n",
"{'loss': 2.565302734375, 'learning_rate': 3.755444928438084e-05, 'epoch': 1.2445550715619167}\n",
"{'loss': 2.526352294921875, 'learning_rate': 3.4443061605476046e-05, 'epoch': 1.5556938394523958}\n",
"{'loss': 2.51713671875, 'learning_rate': 3.1331673926571255e-05, 'epoch': 1.866832607342875}\n",
"{'loss': 2.410624267578125, 'learning_rate': 2.822028624766646e-05, 'epoch': 2.177971375233354}\n",
"{'loss': 2.3454755859375, 'learning_rate': 2.510889856876167e-05, 'epoch': 2.4891101431238334}\n",
"{'loss': 2.360398193359375, 'learning_rate': 2.1997510889856875e-05, 'epoch': 2.8002489110143123}\n",
"{'loss': 2.30828369140625, 'learning_rate': 1.8886123210952087e-05, 'epoch': 3.1113876789047916}\n",
"{'loss': 2.229231201171875, 'learning_rate': 1.5774735532047295e-05, 'epoch': 3.4225264467952705}\n",
"{'loss': 2.239233154296875, 'learning_rate': 1.26633478531425e-05, 'epoch': 3.73366521468575}\n",
"{'loss': 2.2248154296875, 'learning_rate': 9.551960174237711e-06, 'epoch': 4.044803982576229}\n",
"{'loss': 2.15593310546875, 'learning_rate': 6.440572495332919e-06, 'epoch': 4.355942750466708}\n",
"{'loss': 2.15936572265625, 'learning_rate': 3.329184816428127e-06, 'epoch': 4.667081518357187}\n",
"{'loss': 2.155581298828125, 'learning_rate': 2.1779713752333544e-07, 'epoch': 4.978220286247667}\n",
"100% 8035/8035 [33:06<00:00, 4.05it/s][INFO|trainer.py:878] 2020-12-28 21:18:32,591 >> \n",
"\n",
"Training completed. Do not forget to share your model on huggingface.co/models =)\n",
"\n",
"\n",
"{'train_runtime': 1986.0255, 'train_samples_per_second': 4.046, 'epoch': 5.0}\n",
"100% 8035/8035 [33:06<00:00, 4.05it/s]\n",
"[INFO|trainer.py:1248] 2020-12-28 21:18:32,593 >> Saving model checkpoint to <Path to your output dir>\n",
"[INFO|configuration_utils.py:289] 2020-12-28 21:18:32,595 >> Configuration saved in <Path to your output dir>/config.json\n",
"[INFO|modeling_utils.py:814] 2020-12-28 21:18:38,661 >> Model weights saved in <Path to your output dir>/pytorch_model.bin\n",
"12/28/2020 21:18:38 - INFO - __main__ - ***** Train results *****\n",
"12/28/2020 21:18:38 - INFO - __main__ - epoch = 5.0\n",
"12/28/2020 21:18:38 - INFO - __main__ - train_runtime = 1986.0255\n",
"12/28/2020 21:18:38 - INFO - __main__ - train_samples_per_second = 4.046\n",
"12/28/2020 21:18:38 - INFO - __main__ - *** Evaluate ***\n",
"[INFO|trainer.py:1440] 2020-12-28 21:18:38,760 >> ***** Running Evaluation *****\n",
"[INFO|trainer.py:1441] 2020-12-28 21:18:38,760 >> Num examples = 180\n",
"[INFO|trainer.py:1442] 2020-12-28 21:18:38,760 >> Batch size = 8\n",
"100% 23/23 [00:08<00:00, 2.58it/s]\n",
"12/28/2020 21:18:47 - INFO - __main__ - ***** Eval results *****\n",
"12/28/2020 21:18:47 - INFO - __main__ - perplexity = 15.386910338321965\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3CpzI5mU1jPl"
},
"source": [
"# **Using the model**\n",
"Next lets take our model we just trained and use it to generate some text! We will import the Tensorflow version of the gpt2 language model and set the from_pt flag to True. Then we load a pretrained tokenizer from huggingface. This may take some time to download the tokenizer data."
]
},
{
"cell_type": "code",
"metadata": {
"id": "kFOx9AUa1tnk"
},
"source": [
"# setup imports to use the model\n",
"from transformers import TFGPT2LMHeadModel\n",
"from transformers import GPT2Tokenizer\n",
"\n",
"model = TFGPT2LMHeadModel.from_pretrained(\"<path to model directory>\", from_pt=True)\n",
"tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "3oWWo4HJ4wd-"
},
"source": [
"Encoding sample text is now extremely simple using the pretrained tokenizer."
]
},
{
"cell_type": "code",
"metadata": {
"id": "6xT0tc07_-SL"
},
"source": [
"input_ids = tokenizer.encode(\"Some text to encode\", return_tensors='tf')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 183
},
"id": "lxtUSIAc_-1G",
"outputId": "23eb4da1-04d8-4711-b716-980831ef8c39"
},
"source": [
"# the tf tensor object\n",
"input_ids[0]"
],
"execution_count": null,
"outputs": [
{
"output_type": "error",
"ename": "NameError",
"evalue": "ignored",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-7-05c6c1a3123b>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# the tf tensor object\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0minput_ids\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mNameError\u001b[0m: name 'input_ids' is not defined"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BZ3YNTNlBsh8"
},
"source": [
"Next we will use the model to generate the text from our input sample. The parameters I used are based on trail and error from playing around with the huggingface tutorial, https://huggingface.co/blog/how-to-generate, which really goes into great detail on how to go about finding the best parameters for generating text. As well they dive into really good information on what each parameter does and how they play into one another."
]
},
{
"cell_type": "code",
"metadata": {
"id": "NbzHNvvaAPns"
},
"source": [
"generated_text_samples = model.generate(\n",
" input_ids, \n",
" max_length=150, \n",
" num_return_sequences=5,\n",
" no_repeat_ngram_size=2,\n",
" repetition_penalty=1.5,\n",
" top_p=0.92,\n",
" temperature=.85,\n",
" do_sample=True,\n",
" top_k=125,\n",
" early_stopping=True\n",
")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "XPdteSR_B3w1"
},
"source": [
"#Print output for each sequence generated above\n",
"for i, beam in enumerate(beam_output):\n",
" print(\"{}: {}\".format(i,tokenizer.decode(beam, skip_special_tokens=True)))\n",
" print()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "GekD8zzvCq3b"
},
"source": [
"# **Conclusion**\n",
"And there you have it, a simple end to end outline on how you can use Colab, Huggingface, and Tensorflow to train and generate new text data using GPT-2. There is a lot of playing around with hyperparameters in the generate phase but given enough tweaking and time you can usually find something that works well with your data and task. I found that even with the larger GPT-2 model and more examples, it could still repeat itself a bit so something you have to generate a large number of sequences before you get a set that you like. Even OpenAI made note of this in their initial results for GPT-2 so if at first it doesnt generate what you want keep trying and playing with the parameters!\n",
"\n",
"One tip I did notice was that if you do not setup your examples with a start token, then you run into the issue of repeated phrases more easily. Given more data that might be less of a problem but I ran into that a lot before putting in the start token of <|title|> in my exmaples. This start token also has the added benefit of giving you a generic starting point in the text generation so that each run is mostly unique from the last run if you do not care about having a specific prompt."
]
},
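{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of the start-token tip above, here is a minimal sketch (not part of the original run) that prompts the fine-tuned model with just `<|title|>`. It assumes `model` and `tokenizer` are still loaded from the cells above, and the generation parameters are only suggestions."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Hedged example: generate headlines from the generic <|title|> start token\n",
"prompt_ids = tokenizer.encode(\"<|title|>\", return_tensors='tf')\n",
"\n",
"headlines = model.generate(\n",
"    prompt_ids,\n",
"    max_length=50,\n",
"    num_return_sequences=3,\n",
"    do_sample=True,\n",
"    top_k=125,\n",
"    top_p=0.92,\n",
"    no_repeat_ngram_size=2\n",
")\n",
"\n",
"for i, headline in enumerate(headlines):\n",
"    print(\"{}: {}\".format(i, tokenizer.decode(headline, skip_special_tokens=True)))"
],
"execution_count": null,
"outputs": []
}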
]
}