{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"accelerator": "GPU",
"colab": {
"name": "fastai+HF_week2_Tokenizer_from_scratch.ipynb",
"provenance": [],
"machine_shape": "hm",
"authorship_tag": "ABX9TyPJy961c3SciPJkthjDuIlt",
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/RaviChandraVeeramachaneni/fd5b2c1397626f3b44074f5a53008b56/fastai-hf_week2_tokenizer_from_scratch.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "shIkAiW5e523"
},
"source": [
"Install the Transformers and Datasets libraries to run this notebook."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "lDlDVVKgeXb2",
"outputId": "32b05f62-617b-4e6a-d815-a25245ece1ad"
},
"source": [
"!pip install -qq transformers[sentencepiece]\n",
"!pip install -qq datasets\n",
"\n",
"from transformers import pipeline"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"\u001b[K |████████████████████████████████| 2.6 MB 8.4 MB/s \n",
"\u001b[K |████████████████████████████████| 895 kB 59.1 MB/s \n",
"\u001b[K |████████████████████████████████| 636 kB 59.1 MB/s \n",
"\u001b[K |████████████████████████████████| 3.3 MB 62.0 MB/s \n",
"\u001b[K |████████████████████████████████| 1.1 MB 71.0 MB/s \n",
"\u001b[K |████████████████████████████████| 542 kB 8.5 MB/s \n",
"\u001b[K |████████████████████████████████| 76 kB 6.0 MB/s \n",
"\u001b[K |████████████████████████████████| 243 kB 75.2 MB/s \n",
"\u001b[K |████████████████████████████████| 118 kB 70.6 MB/s \n",
"\u001b[?25h"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9WTz-sxEfNqH"
},
"source": [
"Build a tokenizer from scratch"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aubneieAfWXC"
},
"source": [
"Step 1: Download the wikitext-103 dataset (516M of text) and extract it"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "sb-1KlfFfPEQ",
"outputId": "9dff9c2d-d4be-46ef-cb93-d528d92d5a65"
},
"source": [
"!wget \"https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip\"\n",
"!unzip wikitext-103-raw-v1.zip"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"--2021-07-23 23:39:59-- https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip\n",
"Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.129.240\n",
"Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.129.240|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 191984949 (183M) [application/zip]\n",
"Saving to: ‘wikitext-103-raw-v1.zip’\n",
"\n",
"wikitext-103-raw-v1 100%[===================>] 183.09M 44.3MB/s in 4.6s \n",
"\n",
"2021-07-23 23:40:04 (40.0 MB/s) - ‘wikitext-103-raw-v1.zip’ saved [191984949/191984949]\n",
"\n",
"Archive: wikitext-103-raw-v1.zip\n",
" creating: wikitext-103-raw/\n",
" inflating: wikitext-103-raw/wiki.test.raw \n",
" inflating: wikitext-103-raw/wiki.valid.raw \n",
" inflating: wikitext-103-raw/wiki.train.raw \n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1G6Vs_XWgOoV"
},
"source": [
"### Task: Let's build and train a Byte-Pair Encoding (BPE) tokenizer.\n",
" - Start with all the characters present in the training corpus as tokens.\n",
" - Identify the most common pair of tokens and merge it into one token.\n",
" - Repeat until the vocabulary (i.e., the number of tokens) has reached the size we want.\n",
" - A toy sketch of this merge loop is shown in the next cell."
]
},
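{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added illustration (not part of the original walkthrough):* a minimal, pure-Python sketch of the merge loop described above, run on a tiny made-up corpus. The corpus, the number of merges, and the helper names are arbitrary choices for illustration only."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from collections import Counter\n",
"\n",
"# Toy corpus: every word starts out as a sequence of single-character tokens.\n",
"words = [tuple(w) for w in [\"low\", \"lower\", \"lowest\", \"newest\", \"widest\"]]\n",
"\n",
"def most_common_pair(words):\n",
"    # Count adjacent token pairs across the whole corpus.\n",
"    pairs = Counter()\n",
"    for w in words:\n",
"        for a, b in zip(w, w[1:]):\n",
"            pairs[(a, b)] += 1\n",
"    return pairs.most_common(1)[0][0] if pairs else None\n",
"\n",
"def merge_pair(words, pair):\n",
"    # Replace every occurrence of the chosen pair with one merged token.\n",
"    merged = []\n",
"    for w in words:\n",
"        out, i = [], 0\n",
"        while i < len(w):\n",
"            if i < len(w) - 1 and (w[i], w[i + 1]) == pair:\n",
"                out.append(w[i] + w[i + 1])\n",
"                i += 2\n",
"            else:\n",
"                out.append(w[i])\n",
"                i += 1\n",
"        merged.append(tuple(out))\n",
"    return merged\n",
"\n",
"# A real trainer repeats until the vocabulary reaches a target size; 5 merges is enough to see the idea.\n",
"for step in range(5):\n",
"    pair = most_common_pair(words)\n",
"    if pair is None:\n",
"        break\n",
"    words = merge_pair(words, pair)\n",
"    print(\"merge\", step + 1, \":\", pair, \"->\", \"\".join(pair))\n",
"\n",
"print(words)"
],
"execution_count": null,
"outputs": []
},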
{
"cell_type": "markdown",
"metadata": {
"id": "XIoFwqekgki2"
},
"source": [
"Step 2: Import the Tokenizer class and the BPE model, then instantiate a tokenizer with an [UNK] token"
]
},
{
"cell_type": "code",
"metadata": {
"id": "erFbaG1tgjyl"
},
"source": [
"from tokenizers import Tokenizer\n",
"from tokenizers.models import BPE\n",
"\n",
"tokenizer = Tokenizer(BPE(unk_token=\"[UNK]\"))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "wgmbnMvXgwpf"
},
"source": [
"Step 3: To train the tokenizer on the wikitext files, we need to instantiate a trainer, in this case a BpeTrainer"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Xqf_0vy8gRPe"
},
"source": [
"from tokenizers.trainers import BpeTrainer\n",
"\n",
"trainer = BpeTrainer(special_tokens=[\"[UNK]\", \"[CLS]\", \"[SEP]\", \"[PAD]\", \"[MASK]\"])"
],
"execution_count": null,
"outputs": []
},
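{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added note (not in the original):* BpeTrainer also accepts optional arguments such as vocab_size and min_frequency. The values below are placeholders for illustration, not the settings used in this notebook."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Illustrative only: an alternative trainer with an explicit vocabulary-size target.\n",
"custom_trainer = BpeTrainer(\n",
"    vocab_size=30000,     # stop merging once the vocabulary reaches this many tokens\n",
"    min_frequency=2,      # ignore pairs that occur fewer than 2 times\n",
"    special_tokens=[\"[UNK]\", \"[CLS]\", \"[SEP]\", \"[PAD]\", \"[MASK]\"],\n",
")"
],
"execution_count": null,
"outputs": []
},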
{
"cell_type": "markdown",
"metadata": {
"id": "Tdmv90WghECe"
},
"source": [
"Step 4: Use a pre-tokenizer so that tokens are cleanly separated and never overlap; here we split on whitespace and punctuation"
]
},
{
"cell_type": "code",
"metadata": {
"id": "JzaVG-j6hTHp"
},
"source": [
"from tokenizers.pre_tokenizers import Whitespace\n",
"\n",
"tokenizer.pre_tokenizer = Whitespace()"
],
"execution_count": null,
"outputs": []
},
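{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added check (not in the original):* the pre-tokenizer can be called directly on a string to see how text is split into word-level pieces, each with its character offsets, before BPE is applied."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Each item is a (piece, (start, end)) pair produced by the Whitespace pre-tokenizer.\n",
"print(tokenizer.pre_tokenizer.pre_tokenize_str(\"Hello, y'all! How are you?\"))"
],
"execution_count": null,
"outputs": []
},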
{
"cell_type": "markdown",
"metadata": {
"id": "qmeSBiAFhZHU"
},
"source": [
"Step 5: Train the tokenizer on the wikitext files"
]
},
{
"cell_type": "code",
"metadata": {
"id": "uZv_L-JjhbB3"
},
"source": [
"files = [f\"wikitext-103-raw/wiki.{split}.raw\" for split in [\"test\", \"train\", \"valid\"]]\n",
"tokenizer.train(files, trainer)"
],
"execution_count": null,
"outputs": []
},
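{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added check (not in the original):* after training, the vocabulary can be inspected with get_vocab_size() and token_to_id()."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Size of the learned vocabulary (by default BpeTrainer targets a 30000-token vocabulary).\n",
"print(tokenizer.get_vocab_size())\n",
"\n",
"# Special tokens were registered with the trainer, so they have ids in the vocabulary too.\n",
"print(tokenizer.token_to_id(\"[UNK]\"), tokenizer.token_to_id(\"[SEP]\"))"
],
"execution_count": null,
"outputs": []
},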
{
"cell_type": "markdown",
"metadata": {
"id": "n3zj9IRpmp34"
},
"source": [
"Step 6: Save the tokenizer to a single file that contains its full configuration and vocabulary"
]
},
{
"cell_type": "code",
"metadata": {
"id": "_Nlky_0cmz22"
},
"source": [
"tokenizer.save(\"tokenizer-wiki.json\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "3QpyQnINnY1B"
},
"source": [
"Step 7: Reload the tokenizer from the saved file"
]
},
{
"cell_type": "code",
"metadata": {
"id": "b2DBuokVpabU"
},
"source": [
"tokenizer = Tokenizer.from_file(\"tokenizer-wiki.json\")"
],
"execution_count": null,
"outputs": []
},
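{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added sanity check (not in the original):* the reloaded tokenizer should behave exactly like the one that was saved."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Encode a short string with the reloaded tokenizer to confirm it works end to end.\n",
"print(tokenizer.encode(\"Hello world\").tokens)"
],
"execution_count": null,
"outputs": []
},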
{
"cell_type": "markdown",
"metadata": {
"id": "_t_vZy_zpgJD"
},
"source": [
"Step 8: Use the tokenizer we just built; encode returns an Encoding object"
]
},
{
"cell_type": "code",
"metadata": {
"id": "TlKIHYpKpoIH"
},
"source": [
"output = tokenizer.encode(\"This is my week-2 learning from fastAI and hf study group\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "FPFSIpMPqM1Y"
},
"source": [
"Checking the tokens attribute"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "xJApfbjJqkcw",
"outputId": "f9ea5215-e0ea-466c-95d1-066edc30ffe9"
},
"source": [
"print(output.tokens)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"['This', 'is', 'my', 'week', '-', '2', 'learning', 'from', 'fast', 'AI', 'and', 'h', 'f', 'study', 'group']\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "i9MrN2npquZT"
},
"source": [
"The ids attribute contains the index of each of those tokens in the tokenizer’s vocabulary"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "apuC90vwq143",
"outputId": "495cb907-aeda-46ce-8d82-7ff6e3511c73"
},
"source": [
"print(output.ids)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"[5521, 5031, 5454, 5830, 17, 22, 12018, 5108, 7930, 11571, 5025, 76, 74, 7506, 5733]\n"
],
"name": "stdout"
}
]
},
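{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added illustration (not in the original):* ids can be mapped back to text with decode(). Since no decoder was configured on this tokenizer, the result is the tokens joined with spaces, so the original spacing around \"week-2\" and \"fastAI\" is not recovered exactly."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Round-trip the ids back to a string (special tokens are skipped by default).\n",
"print(tokenizer.decode(output.ids))"
],
"execution_count": null,
"outputs": []
},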
{
"cell_type": "markdown",
"metadata": {
"id": "RJQruptPrDe7"
},
"source": [
"Checking the offsets: each token stores its (start, end) character span in the original text"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "kgoTyWw7rGvs",
"outputId": "672c6b60-8c6f-40cf-d618-01a3d7e9589c"
},
"source": [
"print(output.offsets[6])"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"(18, 26)\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "V3U2wjGxrK5N"
},
"source": [
"Match the offsets back to the original text to check that the encoding is right\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "nIh5GWeJrU_n",
"outputId": "6854cf20-27ce-40e4-a283-259f9e45527b"
},
"source": [
"sentence = \"This is my week-2 learning from fastAI and hf study group\"\n",
"sentence[18:26]"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
},
"text/plain": [
"'learning'"
]
},
"metadata": {
"tags": []
},
"execution_count": 13
}
]
},
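{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added illustration (not in the original):* the same offset check can be run for every token at once by zipping the tokens with their offsets and slicing the sentence."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Show each token next to the exact span of the original sentence it came from.\n",
"for token, (start, end) in zip(output.tokens, output.offsets):\n",
"    print(token, \"->\", repr(sentence[start:end]))"
],
"execution_count": null,
"outputs": []
},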
{
"cell_type": "markdown",
"metadata": {
"id": "wFHVohlqsj7I"
},
"source": [
"### Step 9: Post-processing\n",
" - Adds special tokens such as \"[CLS]\" and \"[SEP]\"\n",
" - TemplateProcessing is the most commonly used post-processor"
]
},
{
"cell_type": "code",
"metadata": {
"id": "-cOV6Z8Wtjrb"
},
"source": [
"from tokenizers.processors import TemplateProcessing\n",
"\n",
"tokenizer.post_processor = TemplateProcessing(\n",
"    single=\"[CLS] $A [SEP]\",\n",
"    pair=\"[CLS] $A [SEP] $B:1 [SEP]:1\",\n",
"    special_tokens=[\n",
"        (\"[CLS]\", tokenizer.token_to_id(\"[CLS]\")),\n",
"        (\"[SEP]\", tokenizer.token_to_id(\"[SEP]\")),\n",
"    ],\n",
")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "BEzJmlfcuJ4J"
},
"source": [
"Step 10: Encode the same sentence as before, then a pair of sentences, and check that the special tokens are added correctly"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "w-eR80ujuOI_",
"outputId": "294ca31f-03f2-41e4-f9b9-a27d93925ae0"
},
"source": [
"output = tokenizer.encode(\"This is my week-2 learning from fastAI and hf study group\")\n",
"print(output.tokens)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"['[CLS]', 'This', 'is', 'my', 'week', '-', '2', 'learning', 'from', 'fast', 'AI', 'and', 'h', 'f', 'study', 'group', '[SEP]']\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "rMtHKTXTu4NA",
"outputId": "922b4197-6a3b-48a1-ff1d-c04ed7444192"
},
"source": [
"output = tokenizer.encode(\"This is my week-2 learning\", \"from fastAI and hf study group\")\n",
"print(output.tokens)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"['[CLS]', 'This', 'is', 'my', 'week', '-', '2', 'learning', '[SEP]', 'from', 'fast', 'AI', 'and', 'h', 'f', 'study', 'group', '[SEP]']\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BoAy787FueNV"
},
"source": [
"Check the type_ids, which indicate which of the two sentences each token belongs to"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "CJ9ZIdoNugSo",
"outputId": "6d18baad-0c2e-455a-ee6f-c594774f6940"
},
"source": [
"print(output.type_ids)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]\n"
],
"name": "stdout"
}
]
},
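{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added as a possible next step (not in the original):* the tokenizers library can also pad a batch of sentences to a common length. enable_padding() and encode_batch() are standard Tokenizer methods; the pad id is looked up from the vocabulary rather than hard-coded."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Pad shorter sequences up to the longest one in the batch, using the [PAD] token declared earlier.\n",
"tokenizer.enable_padding(pad_id=tokenizer.token_to_id(\"[PAD]\"), pad_token=\"[PAD]\")\n",
"\n",
"batch = tokenizer.encode_batch([\"This is my week-2 learning\", \"from fastAI and hf study group\"])\n",
"for encoding in batch:\n",
"    print(encoding.tokens)\n",
"    print(encoding.attention_mask)"
],
"execution_count": null,
"outputs": []
}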
]
}