{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"accelerator": "GPU",
"colab": {
"name": "fastai+HF_week2_Tokenizer_from_scratch.ipynb",
"provenance": [],
"machine_shape": "hm",
"authorship_tag": "ABX9TyPJy961c3SciPJkthjDuIlt",
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/RaviChandraVeeramachaneni/fd5b2c1397626f3b44074f5a53008b56/fastai-hf_week2_tokenizer_from_scratch.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "shIkAiW5e523"
},
"source": [
"Install the Transformers and Datasets libraries to run this notebook."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "lDlDVVKgeXb2",
"outputId": "32b05f62-617b-4e6a-d815-a25245ece1ad"
},
"source": [
"!pip install -qq transformers[sentencepiece]\n",
"!pip install -qq datasets\n",
"\n",
"from transformers import pipeline"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"\u001b[K |████████████████████████████████| 2.6 MB 8.4 MB/s \n",
"\u001b[K |████████████████████████████████| 895 kB 59.1 MB/s \n",
"\u001b[K |████████████████████████████████| 636 kB 59.1 MB/s \n",
"\u001b[K |████████████████████████████████| 3.3 MB 62.0 MB/s \n",
"\u001b[K |████████████████████████████████| 1.1 MB 71.0 MB/s \n",
"\u001b[K |████████████████████████████████| 542 kB 8.5 MB/s \n",
"\u001b[K |████████████████████████████████| 76 kB 6.0 MB/s \n",
"\u001b[K |████████████████████████████████| 243 kB 75.2 MB/s \n",
"\u001b[K |████████████████████████████████| 118 kB 70.6 MB/s \n",
"\u001b[?25h"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9WTz-sxEfNqH"
},
"source": [
"Build a tokenizer from scratch"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aubneieAfWXC"
},
"source": [
"Step 1: Download the wikitext-103 dataset (516M of text) and extract it"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "sb-1KlfFfPEQ",
"outputId": "9dff9c2d-d4be-46ef-cb93-d528d92d5a65"
},
"source": [
"!wget \"https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip\"\n",
"!unzip wikitext-103-raw-v1.zip"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"--2021-07-23 23:39:59-- https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip\n",
"Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.129.240\n",
"Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.129.240|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 191984949 (183M) [application/zip]\n",
"Saving to: ‘wikitext-103-raw-v1.zip’\n",
"\n",
"wikitext-103-raw-v1 100%[===================>] 183.09M 44.3MB/s in 4.6s \n",
"\n",
"2021-07-23 23:40:04 (40.0 MB/s) - ‘wikitext-103-raw-v1.zip’ saved [191984949/191984949]\n",
"\n",
"Archive: wikitext-103-raw-v1.zip\n",
" creating: wikitext-103-raw/\n",
" inflating: wikitext-103-raw/wiki.test.raw \n",
" inflating: wikitext-103-raw/wiki.valid.raw \n",
" inflating: wikitext-103-raw/wiki.train.raw \n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1G6Vs_XWgOoV"
},
"source": [
"### Task: Let's build and train a Byte-Pair Encoding (BPE) tokenizer.\n",
" - Start with all the characters present in the training corpus as tokens.\n",
" - Identify the most common pair of tokens and merge it into one token.\n",
" - Repeat until the vocabulary (i.e., the number of tokens) has reached the size we want.\n",
" - A toy sketch of this merge loop is shown in the next cell."
]
},
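{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added illustration (not part of the original walkthrough):* a minimal, pure-Python sketch of the merge loop described above, run on a tiny made-up corpus. The corpus, the number of merges, and the helper names are arbitrary choices for illustration only."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from collections import Counter\n",
"\n",
"# Toy corpus: every word starts out as a sequence of single-character tokens.\n",
"words = [tuple(w) for w in [\"low\", \"lower\", \"lowest\", \"newest\", \"widest\"]]\n",
"\n",
"def most_common_pair(words):\n",
"    # Count adjacent token pairs across the whole corpus.\n",
"    pairs = Counter()\n",
"    for w in words:\n",
"        for a, b in zip(w, w[1:]):\n",
"            pairs[(a, b)] += 1\n",
"    return pairs.most_common(1)[0][0] if pairs else None\n",
"\n",
"def merge_pair(words, pair):\n",
"    # Replace every occurrence of the chosen pair with one merged token.\n",
"    merged = []\n",
"    for w in words:\n",
"        out, i = [], 0\n",
"        while i < len(w):\n",
"            if i < len(w) - 1 and (w[i], w[i + 1]) == pair:\n",
"                out.append(w[i] + w[i + 1])\n",
"                i += 2\n",
"            else:\n",
"                out.append(w[i])\n",
"                i += 1\n",
"        merged.append(tuple(out))\n",
"    return merged\n",
"\n",
"# A real trainer repeats until the vocabulary reaches a target size; 5 merges is enough to see the idea.\n",
"for step in range(5):\n",
"    pair = most_common_pair(words)\n",
"    if pair is None:\n",
"        break\n",
"    words = merge_pair(words, pair)\n",
"    print(\"merge\", step + 1, \":\", pair, \"->\", \"\".join(pair))\n",
"\n",
"print(words)"
],
"execution_count": null,
"outputs": []
},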
{
"cell_type": "markdown",
"metadata": {
"id": "XIoFwqekgki2"
},
"source": [
"Step 2: Import the Tokenizer class and the BPE model, then instantiate a tokenizer with an [UNK] token"
]
},
{
"cell_type": "code",
"metadata": {
"id": "erFbaG1tgjyl"
},
"source": [
"from tokenizers import Tokenizer\n",
"from tokenizers.models import BPE\n",
"\n",
"tokenizer = Tokenizer(BPE(unk_token=\"[UNK]\"))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "wgmbnMvXgwpf"
},
"source": [
"Step 3: To train the tokenizer on the wikitext files, we need to instantiate a trainer, in this case a BpeTrainer"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Xqf_0vy8gRPe"
},
"source": [
"from tokenizers.trainers import BpeTrainer\n",
"\n",
"trainer = BpeTrainer(special_tokens=[\"[UNK]\", \"[CLS]\", \"[SEP]\", \"[PAD]\", \"[MASK]\"])"
],
"execution_count": null,
"outputs": []
},
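{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added note (not in the original):* BpeTrainer also accepts optional arguments such as vocab_size and min_frequency. The values below are placeholders for illustration, not the settings used in this notebook."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Illustrative only: an alternative trainer with an explicit vocabulary-size target.\n",
"custom_trainer = BpeTrainer(\n",
"    vocab_size=30000,     # stop merging once the vocabulary reaches this many tokens\n",
"    min_frequency=2,      # ignore pairs that occur fewer than 2 times\n",
"    special_tokens=[\"[UNK]\", \"[CLS]\", \"[SEP]\", \"[PAD]\", \"[MASK]\"],\n",
")"
],
"execution_count": null,
"outputs": []
},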
{
"cell_type": "markdown",
"metadata": {
"id": "Tdmv90WghECe"
},
"source": [
"Step 4: Use a pre-tokenizer so that tokens are cleanly separated and never overlap; here we split on whitespace and punctuation"
]
},
{
"cell_type": "code",
"metadata": {
"id": "JzaVG-j6hTHp"
},
"source": [
"from tokenizers.pre_tokenizers import Whitespace\n",
"\n",
"tokenizer.pre_tokenizer = Whitespace()"
],
"execution_count": null,
"outputs": []
},
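{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added check (not in the original):* the pre-tokenizer can be called directly on a string to see how text is split into word-level pieces, each with its character offsets, before BPE is applied."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Each item is a (piece, (start, end)) pair produced by the Whitespace pre-tokenizer.\n",
"print(tokenizer.pre_tokenizer.pre_tokenize_str(\"Hello, y'all! How are you?\"))"
],
"execution_count": null,
"outputs": []
},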
{
"cell_type": "markdown",
"metadata": {
"id": "qmeSBiAFhZHU"
},
"source": [
"Step 5: Train the tokenizer on the wikitext files"
]
},
{
"cell_type": "code",
"metadata": {
"id": "uZv_L-JjhbB3"
},
"source": [
"files = [f\"wikitext-103-raw/wiki.{split}.raw\" for split in [\"test\", \"train\", \"valid\"]]\n",
"tokenizer.train(files, trainer)"
],
"execution_count": null,
"outputs": []
},
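{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added check (not in the original):* after training, the vocabulary can be inspected with get_vocab_size() and token_to_id()."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Size of the learned vocabulary (by default BpeTrainer targets a 30000-token vocabulary).\n",
"print(tokenizer.get_vocab_size())\n",
"\n",
"# Special tokens were registered with the trainer, so they have ids in the vocabulary too.\n",
"print(tokenizer.token_to_id(\"[UNK]\"), tokenizer.token_to_id(\"[SEP]\"))"
],
"execution_count": null,
"outputs": []
},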
{
"cell_type": "markdown",
"metadata": {
"id": "n3zj9IRpmp34"
},
"source": [
"Step 6: Save the tokenizer to a single file that contains its full configuration and vocabulary"
]
},
{
"cell_type": "code",
"metadata": {
"id": "_Nlky_0cmz22"
},
"source": [
"tokenizer.save(\"tokenizer-wiki.json\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "3QpyQnINnY1B"
},
"source": [
"Step 7: Reload the tokenizer from the saved file"
]
},
{
"cell_type": "code",
"metadata": {
"id": "b2DBuokVpabU"
},
"source": [
"tokenizer = Tokenizer.from_file(\"tokenizer-wiki.json\")"
],
"execution_count": null,
"outputs": []
},
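{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added sanity check (not in the original):* the reloaded tokenizer should behave exactly like the one that was saved."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Encode a short string with the reloaded tokenizer to confirm it works end to end.\n",
"print(tokenizer.encode(\"Hello world\").tokens)"
],
"execution_count": null,
"outputs": []
},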
{
"cell_type": "markdown",
"metadata": {
"id": "_t_vZy_zpgJD"
},
"source": [
"Step 8: Use the tokenizer we just built; encode returns an Encoding object"
]
},
{
"cell_type": "code",
"metadata": {
"id": "TlKIHYpKpoIH"
},
"source": [
"output = tokenizer.encode(\"This is my week-2 learning from fastAI and hf study group\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "FPFSIpMPqM1Y"
},
"source": [
"Checking the tokens attribute"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "xJApfbjJqkcw",
"outputId": "f9ea5215-e0ea-466c-95d1-066edc30ffe9"
},
"source": [
"print(output.tokens)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"['This', 'is', 'my', 'week', '-', '2', 'learning', 'from', 'fast', 'AI', 'and', 'h', 'f', 'study', 'group']\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "i9MrN2npquZT"
},
"source": [
"The ids attribute contains the index of each of those tokens in the tokenizer’s vocabulary"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "apuC90vwq143",
"outputId": "495cb907-aeda-46ce-8d82-7ff6e3511c73"
},
"source": [
"print(output.ids)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"[5521, 5031, 5454, 5830, 17, 22, 12018, 5108, 7930, 11571, 5025, 76, 74, 7506, 5733]\n"
],
"name": "stdout"
}
]
},
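{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added illustration (not in the original):* ids can be mapped back to text with decode(). Since no decoder was configured on this tokenizer, the result is the tokens joined with spaces, so the original spacing around \"week-2\" and \"fastAI\" is not recovered exactly."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Round-trip the ids back to a string (special tokens are skipped by default).\n",
"print(tokenizer.decode(output.ids))"
],
"execution_count": null,
"outputs": []
},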
{
"cell_type": "markdown",
"metadata": {
"id": "RJQruptPrDe7"
},
"source": [
"Checking the offsets: each token stores its (start, end) character span in the original text"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "kgoTyWw7rGvs",
"outputId": "672c6b60-8c6f-40cf-d618-01a3d7e9589c"
},
"source": [
"print(output.offsets[6])"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"(18, 26)\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "V3U2wjGxrK5N"
},
"source": [
"Match the offsets back to the original text to check that the encoding is right\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "nIh5GWeJrU_n",
"outputId": "6854cf20-27ce-40e4-a283-259f9e45527b"
},
"source": [
"sentence = \"This is my week-2 learning from fastAI and hf study group\"\n",
"sentence[18:26]"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
},
"text/plain": [
"'learning'"
]
},
"metadata": {
"tags": []
},
"execution_count": 13
}
]
},
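{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added illustration (not in the original):* the same offset check can be run for every token at once by zipping the tokens with their offsets and slicing the sentence."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Show each token next to the exact span of the original sentence it came from.\n",
"for token, (start, end) in zip(output.tokens, output.offsets):\n",
"    print(token, \"->\", repr(sentence[start:end]))"
],
"execution_count": null,
"outputs": []
},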
{
"cell_type": "markdown",
"metadata": {
"id": "wFHVohlqsj7I"
},
"source": [
"### Step 9: Post-processing\n",
" - Adds special tokens such as \"[CLS]\" and \"[SEP]\"\n",
" - TemplateProcessing is the most commonly used post-processor"
]
},
{
"cell_type": "code",
"metadata": {
"id": "-cOV6Z8Wtjrb"
},
"source": [
"from tokenizers.processors import TemplateProcessing\n",
"\n",
"tokenizer.post_processor = TemplateProcessing(\n",
"    single=\"[CLS] $A [SEP]\",\n",
"    pair=\"[CLS] $A [SEP] $B:1 [SEP]:1\",\n",
"    special_tokens=[\n",
"        (\"[CLS]\", tokenizer.token_to_id(\"[CLS]\")),\n",
"        (\"[SEP]\", tokenizer.token_to_id(\"[SEP]\")),\n",
"    ],\n",
")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "BEzJmlfcuJ4J"
},
"source": [
"Step 10: Encode the same sentence as before, then a pair of sentences, and check that the special tokens are added correctly"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "w-eR80ujuOI_",
"outputId": "294ca31f-03f2-41e4-f9b9-a27d93925ae0"
},
"source": [
"output = tokenizer.encode(\"This is my week-2 learning from fastAI and hf study group\")\n",
"print(output.tokens)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"['[CLS]', 'This', 'is', 'my', 'week', '-', '2', 'learning', 'from', 'fast', 'AI', 'and', 'h', 'f', 'study', 'group', '[SEP]']\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "rMtHKTXTu4NA",
"outputId": "922b4197-6a3b-48a1-ff1d-c04ed7444192"
},
"source": [
"output = tokenizer.encode(\"This is my week-2 learning\", \"from fastAI and hf study group\")\n",
"print(output.tokens)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"['[CLS]', 'This', 'is', 'my', 'week', '-', '2', 'learning', '[SEP]', 'from', 'fast', 'AI', 'and', 'h', 'f', 'study', 'group', '[SEP]']\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BoAy787FueNV"
},
"source": [
"Check the type_ids, which indicate which of the two sentences each token belongs to"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "CJ9ZIdoNugSo",
"outputId": "6d18baad-0c2e-455a-ee6f-c594774f6940"
},
"source": [
"print(output.type_ids)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]\n"
],
"name": "stdout"
}
]
},
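{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added as a possible next step (not in the original):* the tokenizers library can also pad a batch of sentences to a common length. enable_padding() and encode_batch() are standard Tokenizer methods; the pad id is looked up from the vocabulary rather than hard-coded."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Pad shorter sequences up to the longest one in the batch, using the [PAD] token declared earlier.\n",
"tokenizer.enable_padding(pad_id=tokenizer.token_to_id(\"[PAD]\"), pad_token=\"[PAD]\")\n",
"\n",
"batch = tokenizer.encode_batch([\"This is my week-2 learning\", \"from fastAI and hf study group\"])\n",
"for encoding in batch:\n",
"    print(encoding.tokens)\n",
"    print(encoding.attention_mask)"
],
"execution_count": null,
"outputs": []
}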
]
}