{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Converting DialoGPT & GPT-2 to TensorFlow Lite",
"provenance": [],
"collapsed_sections": [],
"authorship_tag": "ABX9TyPwV3E5xkn12jevZN0uMQ9O",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"0ee2cfbc694840e5a0f7d111b8e352ef": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ButtonModel",
"model_module_version": "1.5.0",
"state": {
"_view_name": "ButtonView",
"style": "IPY_MODEL_e1ede4bddbcf4b8b859189f5caf2b4a9",
"_dom_classes": [],
"description": "Download model",
"_model_name": "ButtonModel",
"button_style": "",
"_view_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"tooltip": "",
"_view_count": null,
"disabled": false,
"_view_module_version": "1.5.0",
"layout": "IPY_MODEL_111a2ccf148d47e0bab303f1b747c432",
"_model_module": "@jupyter-widgets/controls",
"icon": ""
}
},
"e1ede4bddbcf4b8b859189f5caf2b4a9": {
"model_module": "@jupyter-widgets/controls",
"model_name": "ButtonStyleModel",
"model_module_version": "1.5.0",
"state": {
"_view_name": "StyleView",
"_model_name": "ButtonStyleModel",
"_view_module": "@jupyter-widgets/base",
"_model_module_version": "1.5.0",
"_view_count": null,
"button_color": null,
"font_weight": "",
"_view_module_version": "1.2.0",
"_model_module": "@jupyter-widgets/controls"
}
},
"111a2ccf148d47e0bab303f1b747c432": {
"model_module": "@jupyter-widgets/base",
"model_name": "LayoutModel",
"model_module_version": "1.2.0",
"state": {
"_view_name": "LayoutView",
"grid_template_rows": null,
"right": null,
"justify_content": null,
"_view_module": "@jupyter-widgets/base",
"overflow": null,
"_model_module_version": "1.2.0",
"_view_count": null,
"flex_flow": null,
"width": null,
"min_width": null,
"border": null,
"align_items": null,
"bottom": null,
"_model_module": "@jupyter-widgets/base",
"top": null,
"grid_column": null,
"overflow_y": null,
"overflow_x": null,
"grid_auto_flow": null,
"grid_area": null,
"grid_template_columns": null,
"flex": null,
"_model_name": "LayoutModel",
"justify_items": null,
"grid_row": null,
"max_height": null,
"align_content": null,
"visibility": null,
"align_self": null,
"height": null,
"min_height": null,
"padding": null,
"grid_auto_rows": null,
"grid_gap": null,
"max_width": null,
"order": null,
"_view_module_version": "1.2.0",
"grid_template_areas": null,
"object_position": null,
"object_fit": null,
"grid_auto_columns": null,
"margin": null,
"display": null,
"left": null
}
}
}
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/sudo-carson/158d9b9e7208e42977b08d966f3f4989/converting-dialogpt-gpt-2-to-tensorflow-lite.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SmpUiIaaDBdy"
},
"source": [
"# DialoGPT & GPT-2 conversion to TensorFlow Lite\n",
"I haven't seen anyone else do a simple Colab on this so I decided to do it myself.\n",
"\n",
"I'm a total noob when it comes to AI so the way I've done it is most likely not the most efficient way or I'm committing some sin of some kind in the ML world, but it doesn't give me any errors, works decently, and doesn't take hours to run, so I'm ok with it.\n",
"\n",
"I'd recommend against using GPU acceleration here because A. it's just not necessary, and B. it always crashes on me for some reason.\n",
"\n",
"## Pre-converted models in comments!\n",
"## Comparison to TensorFlow models\n",
"I haven't found too big of an accuracy difference between the generated TensorFlow Lite models and TensorFlow models. I think the benefits far outweigh the *one* con of there being a little less accuracy though.\n",
"\n",
"Only small GPT models will work on Colab though without Colab Pro (see notes below) so the TensorFlow models aren't really even a good match, but you'll get very respectable results using Top-K and Top-P sampling.\n",
"\n",
"The TensorFlow Lite models are (from my experience) around 2x faster, use 2x less memory, and are 2x smaller in size. TensorFlow Lite itself is also around 1 MB, and additionally, I've hacked these models to **not** require TF Select Ops (at the cost of a small accuracy hit), so you really *only need TFLite*!\n",
"\n",
"Using TensorFlow Lite's C API and compressing a model, you can probably make a &lt;150 MB DistilGPT-2 demo app and *probably* run it on a Raspberry Pi! Not bad considering DistilGPT-2 has 82M parameters *and it's a freaking rock we tricked into thinking using lightning, taped to a piece of resin that can speak English and fit in our pocket*.\n",
"\n",
"## Notes \n",
"- Small GPT models use a decent chunk of memory (&lt;6 GB) and are easily managable. Medium models kill every Colab runtime I've ever touched within seconds, including my own server w/ 32 GB RAM. Probably best to stick with the smaller models unless your machine can handle it 😅\n",
"- Only float16 quantization works (there's probably a way to do it with int8 but I have no idea how)\n",
"- GELU does not work with pure TFLite, so I had to use RELU instead, resulting in a small accuracy hit\n",
"\n",
"<hr />"
]
},
{
"cell_type": "code",
"metadata": {
"id": "qyBe04wu8EYh"
},
"source": [
"!pip install -q transformers==4.11.0 tensorflow==2.6.0\n",
"\n",
"# I want clean outputs, damnit!\n",
"import os\n",
"os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")\n",
"\n",
"# Important imports\n",
"import tensorflow as tf\n",
"import numpy as np\n",
"from transformers import TFGPT2LMHeadModel, GPT2TokenizerFast\n",
"\n",
"# Colab & Jupyter stuff, not important\n",
"from google.colab import files\n",
"from IPython.display import display\n",
"from ipywidgets import Button"
],
"execution_count": 19,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"cellView": "form",
"id": "5oRSnuJxtZF9"
},
"source": [
"MODEL_TYPE = 'distilgpt2' #@param ['microsoft/DialoGPT-small','distilgpt2','gpt2']\n",
"#@markdown \"Sequence length\" is how far back the AI can see (in tokens). Setting it too high can lead to large performance hits, but increased contextual accuracy. Make sure you take this into account in your own code.\n",
"#@markdown \n",
"#@markdown I've found 64 to be a good balance between speed and contextual accuracy.\n",
"SEQUENCE_LENGTH = 64 #@param {type:\"slider\",min:32,max:128,step:4}"
],
"execution_count": 18,
"outputs": []
},
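{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tiny illustration (not from the original gist) of what \"taking the sequence length into account\" can look like in your own code: the running token history is always clipped to the last `SEQUENCE_LENGTH` tokens before it goes anywhere near the model. The `token_stack` name is just made up for the example."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Illustrative only: the model never sees more than SEQUENCE_LENGTH tokens of context.\n",
"# `token_stack` stands in for whatever list of token ids your app accumulates.\n",
"token_stack = list(range(200))                # pretend conversation history (token ids)\n",
"token_stack = token_stack[-SEQUENCE_LENGTH:]  # keep only the newest SEQUENCE_LENGTH tokens\n",
"assert len(token_stack) <= SEQUENCE_LENGTH\n",
"print(len(token_stack), token_stack[:5])"
],
"execution_count": null,
"outputs": []
},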
{
"cell_type": "markdown",
"metadata": {
"id": "fP92MMQRE1Yh"
},
"source": [
"First we're downloading the model and storing it somewhere easier to access. The second parameter (I believe) saves it as a TensorFlow saved model instead of a Keras model.\n"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "q3NBl5hG8le2",
"outputId": "cfa85ce8-1200-4c9c-dc1d-5715707f9ec7"
},
"source": [
"model = TFGPT2LMHeadModel.from_pretrained(MODEL_TYPE)\n",
"model.save_pretrained('./model', True)"
],
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"All model checkpoint layers were used when initializing TFGPT2LMHeadModel.\n",
"\n",
"All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at distilgpt2.\n",
"If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.\n",
"WARNING:absl:Found untraced functions such as wte_layer_call_and_return_conditional_losses, wte_layer_call_fn, dropout_layer_call_and_return_conditional_losses, dropout_layer_call_fn, ln_f_layer_call_and_return_conditional_losses while saving (showing 5 of 375). These functions will not be directly callable after loading.\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"INFO:tensorflow:Assets written to: ./model/saved_model/1/assets\n"
]
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"INFO:tensorflow:Assets written to: ./model/saved_model/1/assets\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MSmASgmIFIXm"
},
"source": [
"Now we replace the GELU activation function with RELU in the model's config..."
]
},
{
"cell_type": "code",
"metadata": {
"id": "amblrKVT9cNN"
},
"source": [
"!sed -i 's/gelu-new/relu/' model/config.json"
],
"execution_count": 4,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "u-7EzsRWFONU"
},
"source": [
"Re-load the model with the new activation function changed and adjust the input size to our `SEQUENCE_LENGTH`. Then, we create a Keras -> TensorFlow Lite model converter which we will use to convert and save the model in the next step."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "LrInNM169n-s",
"outputId": "428d8766-37c4-4a6b-a64d-7f52bdf9c134"
},
"source": [
"model = TFGPT2LMHeadModel.from_pretrained('./model/')\n",
"\n",
"keras_input = tf.keras.Input([ SEQUENCE_LENGTH ], batch_size=1, dtype=tf.int32)\n",
"keras_output = model(keras_input, training=False)\n",
"model = tf.keras.Model(keras_input, keras_output)\n",
"\n",
"converter = tf.lite.TFLiteConverter.from_keras_model(model)\n",
"converter.optimizations = [ tf.lite.Optimize.DEFAULT ]\n",
"converter.target_spec.supported_types = [ tf.float16 ]"
],
"execution_count": 17,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"All model checkpoint layers were used when initializing TFGPT2LMHeadModel.\n",
"\n",
"All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at ./model/.\n",
"If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Gy6xf6SVF5bD"
},
"source": [
"And finally, we convert the model and save it."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 173,
"referenced_widgets": [
"0ee2cfbc694840e5a0f7d111b8e352ef",
"e1ede4bddbcf4b8b859189f5caf2b4a9",
"111a2ccf148d47e0bab303f1b747c432"
]
},
"id": "VkToNwik-pj-",
"outputId": "2b5539e9-7dd8-473e-eb1e-d3f468f3e8ca"
},
"source": [
"tflite_model = converter.convert()\n",
"open('model.tflite', 'wb').write(tflite_model)\n",
"\n",
"def download(a):\n",
" files.download('model.tflite')\n",
"button = Button(description='Download model')\n",
"button.on_click(download)\n",
"display(button)"
],
"execution_count": 16,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.\n"
]
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.\n",
"WARNING:absl:Found untraced functions such as wte_layer_call_and_return_conditional_losses, wte_layer_call_fn, dropout_19_layer_call_and_return_conditional_losses, dropout_19_layer_call_fn, ln_f_layer_call_and_return_conditional_losses while saving (showing 5 of 375). These functions will not be directly callable after loading.\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"INFO:tensorflow:Assets written to: /tmp/tmp1ra7tmyc/assets\n"
]
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"INFO:tensorflow:Assets written to: /tmp/tmp1ra7tmyc/assets\n",
"INFO:absl:Using new converter: If you encounter a problem please file a bug. You can opt-out by setting experimental_new_converter=False\n"
]
},
{
"output_type": "display_data",
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0ee2cfbc694840e5a0f7d111b8e352ef",
"version_minor": 0,
"version_major": 2
},
"text/plain": [
"Button(description='Download model', style=ButtonStyle())"
]
},
"metadata": {}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LL9HFjpomvSQ"
},
"source": [
"Now it's time to load and test our model! This will only output one word though... A more advanced implementation that outputs full sentences would work something like: \n",
"- Take our starting sentence and encode it into our \"token stack\"\n",
"- Create a new array of tokens called \"output tokens\"\n",
"- Cap off token stack at `SEQUENCE_LENGTH` (if necessary)\n",
"- Until we reach the desired length (or until we reach EOF if we're super super fancy): \n",
" * Invoke the interpreter and get the next token in the sequence\n",
" * Add it to the token stack **and** our output tokens\n",
" * If the token stack is > `SEQUENCE_LENGTH`, slice token stack to be the last `SEQUENCE_LENGTH` tokens. Do **NOT** slice output tokens\n",
" * Repeat until desired length reached or until EOF reached\n",
"- Decode output tokens and return them"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "1Aso49vtOx3P",
"outputId": "ac64f074-fa80-4f04-86ac-6084512f4ca8"
},
"source": [
"def pad_up_to(t, max_in_dims, constant_values):\n",
" s = tf.shape(t)\n",
" paddings = [ [ 0, m - s[i] ] for (i, m) in enumerate(max_in_dims) ]\n",
" return tf.pad(t, paddings, 'CONSTANT', constant_values=constant_values)\n",
" \n",
"tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_TYPE)\n",
"\n",
"interp = tf.lite.Interpreter('model.tflite')\n",
"interp.allocate_tensors()\n",
"\n",
"input_details = interp.get_input_details()\n",
"output_details = interp.get_output_details()\n",
"\n",
"input_string = 'Scientists at the CERN laboratory announce they have discovered a new' #@param {type:\"string\"}\n",
"input_data = tokenizer.encode(input_string + tokenizer.eos_token if MODEL_TYPE == 'microsoft/DialoGPT-small' else input_string, return_tensors='tf')\n",
"interp.set_tensor(input_details[0]['index'], pad_up_to(input_data, [ 1, SEQUENCE_LENGTH ], 0))\n",
"interp.invoke()\n",
"\n",
"print(tokenizer.decode([ np.argmax(interp.get_tensor(output_details[0]['index'])[0][input_data.shape[1] - 1]) ]))"
],
"execution_count": 15,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" type\n"
]
}
]
},
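{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's a rough sketch of that full-sentence loop (this cell isn't from the original gist, and the helper names, sampling defaults, and temperature are made up for illustration). It reuses `interp`, `tokenizer`, `pad_up_to`, `input_details`, `output_details`, and `SEQUENCE_LENGTH` from the cells above, and picks each next token with simple top-k / top-p sampling instead of plain argmax. Treat it as a starting point, not a reference implementation."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Sketch only: full-sentence generation with the TFLite interpreter loaded above.\n",
"def top_k_top_p_sample(logits, k=40, p=0.9, temperature=0.8):\n",
"    # Keep the k most likely tokens, then the smallest subset whose probabilities\n",
"    # sum to at least p, and sample the next token from what's left.\n",
"    logits = logits.astype('float64') / temperature\n",
"    top_idx = np.argsort(logits)[-k:]\n",
"    top_logits = logits[top_idx]\n",
"    probs = np.exp(top_logits - np.max(top_logits))\n",
"    probs /= probs.sum()\n",
"    order = np.argsort(probs)[::-1]\n",
"    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1\n",
"    keep = order[:cutoff]\n",
"    return int(np.random.choice(top_idx[keep], p=probs[keep] / probs[keep].sum()))\n",
"\n",
"def generate(prompt, max_new_tokens=20):\n",
"    token_stack = tokenizer.encode(prompt)  # running context fed to the model\n",
"    output_tokens = []                      # everything we generate; never sliced\n",
"    for _ in range(max_new_tokens):\n",
"        token_stack = token_stack[-SEQUENCE_LENGTH:]  # cap the context window\n",
"        ids = tf.constant([ token_stack ], dtype=tf.int32)\n",
"        interp.set_tensor(input_details[0]['index'], pad_up_to(ids, [ 1, SEQUENCE_LENGTH ], 0))\n",
"        interp.invoke()\n",
"        # Logits at the last real token position decide the next token\n",
"        logits = interp.get_tensor(output_details[0]['index'])[0][len(token_stack) - 1]\n",
"        next_token = top_k_top_p_sample(logits)\n",
"        if next_token == tokenizer.eos_token_id:  # stop at the end-of-sequence token\n",
"            break\n",
"        token_stack.append(next_token)\n",
"        output_tokens.append(next_token)\n",
"    return tokenizer.decode(output_tokens)\n",
"\n",
"print(generate(input_string))"
],
"execution_count": null,
"outputs": []
},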
{
"cell_type": "markdown",
"metadata": {
"id": "IU2hg4YNogWA"
},
"source": [
"What a time to be alive!\n",
"\n",
"Now go make something cool with this! 😄"
]
}
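,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One last, hedged note on the \"you only need TFLite\" point from the intro: on a target device you can skip full TensorFlow entirely and load `model.tflite` with the small `tflite_runtime` package (plus numpy and whatever tokenizer you ship). The token ids below are hard-coded placeholders, so this is a sketch of the shape of the code rather than a drop-in script."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Sketch: runtime-only inference, no full TensorFlow install required.\n",
"# Meant for the target device (not this Colab); assumes `pip install tflite-runtime numpy`\n",
"# and that the model.tflite converted above sits next to this script.\n",
"import numpy as np\n",
"from tflite_runtime.interpreter import Interpreter\n",
"\n",
"SEQUENCE_LENGTH = 64  # must match the length the model was converted with\n",
"\n",
"interp = Interpreter(model_path='model.tflite')\n",
"interp.allocate_tensors()\n",
"input_details = interp.get_input_details()\n",
"output_details = interp.get_output_details()\n",
"\n",
"# Placeholder token ids; a real app would run its own tokenizer here.\n",
"token_ids = [ 464, 3797, 3332 ]\n",
"padded = np.zeros((1, SEQUENCE_LENGTH), dtype=np.int32)\n",
"padded[0, :len(token_ids)] = token_ids\n",
"\n",
"interp.set_tensor(input_details[0]['index'], padded)\n",
"interp.invoke()\n",
"logits = interp.get_tensor(output_details[0]['index'])[0][len(token_ids) - 1]\n",
"print('next token id:', int(np.argmax(logits)))"
],
"execution_count": null,
"outputs": []
}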
]
}