Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
fastai+HF_week2_Tokenizer_from_scratch.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"accelerator": "GPU",
"colab": {
"name": "fastai+HF_week2_Tokenizer_from_scratch.ipynb",
"provenance": [],
"machine_shape": "hm",
"authorship_tag": "ABX9TyPJy961c3SciPJkthjDuIlt",
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/RaviChandraVeeramachaneni/fd5b2c1397626f3b44074f5a53008b56/fastai-hf_week2_tokenizer_from_scratch.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "shIkAiW5e523"
},
"source": [
"Install the Transformers and Datasets libraries to run this notebook."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "lDlDVVKgeXb2",
"outputId": "32b05f62-617b-4e6a-d815-a25245ece1ad"
},
"source": [
"!pip install -qq transformers[sentencepiece]\n",
"!pip install -qq datasets\n",
"\n",
"from transformers import pipeline"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"\u001b[K |████████████████████████████████| 2.6 MB 8.4 MB/s \n",
"\u001b[K |████████████████████████████████| 895 kB 59.1 MB/s \n",
"\u001b[K |████████████████████████████████| 636 kB 59.1 MB/s \n",
"\u001b[K |████████████████████████████████| 3.3 MB 62.0 MB/s \n",
"\u001b[K |████████████████████████████████| 1.1 MB 71.0 MB/s \n",
"\u001b[K |████████████████████████████████| 542 kB 8.5 MB/s \n",
"\u001b[K |████████████████████████████████| 76 kB 6.0 MB/s \n",
"\u001b[K |████████████████████████████████| 243 kB 75.2 MB/s \n",
"\u001b[K |████████████████████████████████| 118 kB 70.6 MB/s \n",
"\u001b[?25h"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9WTz-sxEfNqH"
},
"source": [
"Build a tokenizer from scratch"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aubneieAfWXC"
},
"source": [
"Step1: Download the wikitext-103(516M of text) dataset and extract it"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "sb-1KlfFfPEQ",
"outputId": "9dff9c2d-d4be-46ef-cb93-d528d92d5a65"
},
"source": [
"!wget \"https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip\"\n",
"!unzip wikitext-103-raw-v1.zip"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"--2021-07-23 23:39:59-- https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip\n",
"Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.129.240\n",
"Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.129.240|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 191984949 (183M) [application/zip]\n",
"Saving to: ‘wikitext-103-raw-v1.zip’\n",
"\n",
"wikitext-103-raw-v1 100%[===================>] 183.09M 44.3MB/s in 4.6s \n",
"\n",
"2021-07-23 23:40:04 (40.0 MB/s) - ‘wikitext-103-raw-v1.zip’ saved [191984949/191984949]\n",
"\n",
"Archive: wikitext-103-raw-v1.zip\n",
" creating: wikitext-103-raw/\n",
" inflating: wikitext-103-raw/wiki.test.raw \n",
" inflating: wikitext-103-raw/wiki.valid.raw \n",
" inflating: wikitext-103-raw/wiki.train.raw \n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1G6Vs_XWgOoV"
},
"source": [
"###Task: Let's build and train a Byte-Pair Encoding (BPE) tokenizer.\n",
" - Start with all the characters present in the training corpus as tokens.\n",
" - Identify the most common pair of tokens and merge it into one token.\n",
" - Repeat until the vocabulary (e.g., the number of tokens) has reached the size we want."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XIoFwqekgki2"
},
"source": [
"Step2: Import the Tokenizer & BPE"
]
},
{
"cell_type": "code",
"metadata": {
"id": "erFbaG1tgjyl"
},
"source": [
"from tokenizers import Tokenizer\n",
"from tokenizers.models import BPE\n",
"\n",
"tokenizer = Tokenizer(BPE(unk_token=\"[UNK]\"))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "wgmbnMvXgwpf"
},
"source": [
"Step3: To train our tokenizer on the wikitext files, we will need to instantiate a trainer, in this case a BpeTrainer"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Xqf_0vy8gRPe"
},
"source": [
"from tokenizers.trainers import BpeTrainer\n",
"\n",
"trainer = BpeTrainer(special_tokens=[\"[UNK]\", \"[CLS]\", \"[SEP]\", \"[PAD]\", \"[MASK]\"])"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Tdmv90WghECe"
},
"source": [
"Step4: Utilizing pre-tokenization to make sure we have clear seperation of tokens, they do not overlap"
]
},
{
"cell_type": "code",
"metadata": {
"id": "JzaVG-j6hTHp"
},
"source": [
"from tokenizers.pre_tokenizers import Whitespace\n",
"\n",
"tokenizer.pre_tokenizer = Whitespace()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "qmeSBiAFhZHU"
},
"source": [
"Step5: Training"
]
},
{
"cell_type": "code",
"metadata": {
"id": "uZv_L-JjhbB3"
},
"source": [
"files = [f\"wikitext-103-raw/wiki.{split}.raw\" for split in [\"test\", \"train\", \"valid\"]]\n",
"tokenizer.train(files, trainer)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "n3zj9IRpmp34"
},
"source": [
"Step6: Save Tokenizer to a file that contains full configuration and vocab"
]
},
{
"cell_type": "code",
"metadata": {
"id": "_Nlky_0cmz22"
},
"source": [
"tokenizer.save(\"tokenizer-wiki.json\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "3QpyQnINnY1B"
},
"source": [
"step7: Reloading the tokenizer from the above file"
]
},
{
"cell_type": "code",
"metadata": {
"id": "b2DBuokVpabU"
},
"source": [
"tokenizer = Tokenizer.from_file(\"tokenizer-wiki.json\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "_t_vZy_zpgJD"
},
"source": [
"step8: Using the tokenizer we just created & the output would be a encoded object"
]
},
{
"cell_type": "code",
"metadata": {
"id": "TlKIHYpKpoIH"
},
"source": [
"output = tokenizer.encode(\"This is my week-2 learning from fastAI and hf study group\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "FPFSIpMPqM1Y"
},
"source": [
"Checking the Tokens"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "xJApfbjJqkcw",
"outputId": "f9ea5215-e0ea-466c-95d1-066edc30ffe9"
},
"source": [
"print(output.tokens)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"['This', 'is', 'my', 'week', '-', '2', 'learning', 'from', 'fast', 'AI', 'and', 'h', 'f', 'study', 'group']\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "i9MrN2npquZT"
},
"source": [
"Checking the id's atribute will contain the index of each of those tokens in the tokenizer’s vocabulary"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "apuC90vwq143",
"outputId": "495cb907-aeda-46ce-8d82-7ff6e3511c73"
},
"source": [
"print(output.ids)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"[5521, 5031, 5454, 5830, 17, 22, 12018, 5108, 7930, 11571, 5025, 76, 74, 7506, 5733]\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RJQruptPrDe7"
},
"source": [
"Checking the offsets"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "kgoTyWw7rGvs",
"outputId": "672c6b60-8c6f-40cf-d618-01a3d7e9589c"
},
"source": [
"print(output.offsets[6])"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"(18, 26)\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "V3U2wjGxrK5N"
},
"source": [
"Matching the offsets back to text and see if the encodings are right\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "nIh5GWeJrU_n",
"outputId": "6854cf20-27ce-40e4-a283-259f9e45527b"
},
"source": [
"sentence = \"This is my week-2 learning from fastAI and hf study group\"\n",
"sentence[18:26]"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
},
"text/plain": [
"'learning'"
]
},
"metadata": {
"tags": []
},
"execution_count": 13
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wFHVohlqsj7I"
},
"source": [
"### Step9: Post-Processing Steps\n",
" - To add special tokens like \"[CLS]\" or \"[SEP]\"\n",
" - TemplateProcessing is the most commonly used Post-Processor"
]
},
{
"cell_type": "code",
"metadata": {
"id": "-cOV6Z8Wtjrb"
},
"source": [
"from tokenizers.processors import TemplateProcessing\n",
"\n",
"tokenizer.post_processor = TemplateProcessing(\n",
" single=\"[CLS] $A [SEP]\",\n",
" pair=\"[CLS] $A [SEP] $B:1 [SEP]:1\",\n",
" special_tokens=[\n",
" (\"[CLS]\", tokenizer.token_to_id(\"[CLS]\")),\n",
" (\"[SEP]\", tokenizer.token_to_id(\"[SEP]\")),\n",
" ],\n",
")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "BEzJmlfcuJ4J"
},
"source": [
"Step10: Let’s try to encode the same sentence as before and see if thats works"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "w-eR80ujuOI_",
"outputId": "294ca31f-03f2-41e4-f9b9-a27d93925ae0"
},
"source": [
"output = tokenizer.encode(\"This is my week-2 learning from fastAI and hf study group\")\n",
"print(output.tokens)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"['[CLS]', 'This', 'is', 'my', 'week', '-', '2', 'learning', 'from', 'fast', 'AI', 'and', 'h', 'f', 'study', 'group', '[SEP]']\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "rMtHKTXTu4NA",
"outputId": "922b4197-6a3b-48a1-ff1d-c04ed7444192"
},
"source": [
"output = tokenizer.encode(\"This is my week-2 learning\", \"from fastAI and hf study group\")\n",
"print(output.tokens)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"['[CLS]', 'This', 'is', 'my', 'week', '-', '2', 'learning', '[SEP]', 'from', 'fast', 'AI', 'and', 'h', 'f', 'study', 'group', '[SEP]']\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BoAy787FueNV"
},
"source": [
"Check the ID's"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "CJ9ZIdoNugSo",
"outputId": "6d18baad-0c2e-455a-ee6f-c594774f6940"
},
"source": [
"print(output.type_ids)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]\n"
],
"name": "stdout"
}
]
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment