Created
August 3, 2023 16:09
OCR captcha.ipynb
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "JEAmyTyePBdF" | |
}, | |
"source": [ | |
"## OCR of Captcha Images\n", | |
"\n", | |
"Dataset Source: https://www.kaggle.com/datasets/alizahidraja/captcha-data" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "qOSTkrnkPBdH" | |
}, | |
"source": [ | |
"##### Install Necessary Libraries" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:38:00.419279Z", | |
"iopub.status.busy": "2023-07-02T16:38:00.418892Z", | |
"iopub.status.idle": "2023-07-02T16:38:10.993604Z", | |
"shell.execute_reply": "2023-07-02T16:38:10.992638Z", | |
"shell.execute_reply.started": "2023-07-02T16:38:00.419255Z" | |
}, | |
"id": "iAmNUfbePBdI", | |
"outputId": "16bd7b12-9d8c-4d1a-adb2-29aed21b2d39", | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Requirement already satisfied: torch in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (1.11.0)\n", | |
"Requirement already satisfied: torchvision in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (0.12.0)\n", | |
"Requirement already satisfied: torchaudio in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (0.11.0)\n", | |
"Requirement already satisfied: typing_extensions in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from torch) (4.7.0)\n", | |
"Requirement already satisfied: numpy in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from torchvision) (1.21.6)\n", | |
"Requirement already satisfied: requests in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from torchvision) (2.31.0)\n", | |
"Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from torchvision) (9.1.1)\n", | |
"Requirement already satisfied: charset-normalizer<4,>=2 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->torchvision) (2.0.12)\n", | |
"Requirement already satisfied: idna<4,>=2.5 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->torchvision) (3.4)\n", | |
"Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->torchvision) (1.26.15)\n", | |
"Requirement already satisfied: certifi>=2017.4.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->torchvision) (2023.5.7)\n", | |
"Note: you may need to restart the kernel to use updated packages.\n", | |
"\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", | |
"dask-cuda 22.4.0 requires click==8.0.4, but you have click 8.1.3 which is incompatible.\u001b[0m\u001b[31m\n", | |
"\u001b[0mNote: you may need to restart the kernel to use updated packages.\n" | |
] | |
} | |
], | |
"source": [ | |
"%pip install torch torchvision torchaudio\n", | |
"%pip install -q datasets jiwer" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:38:23.527710Z", | |
"iopub.status.busy": "2023-07-02T16:38:23.527353Z", | |
"iopub.status.idle": "2023-07-02T16:38:30.048267Z", | |
"shell.execute_reply": "2023-07-02T16:38:30.047483Z", | |
"shell.execute_reply.started": "2023-07-02T16:38:23.527669Z" | |
}, | |
"id": "6NXldfxQPPyf", | |
"outputId": "156a5d1d-b3c9-4cd0-914b-efa8f8f92924", | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Requirement already satisfied: transformers in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (4.30.2)\n", | |
"Requirement already satisfied: filelock in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (3.12.2)\n", | |
"Requirement already satisfied: huggingface-hub<1.0,>=0.14.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (0.15.1)\n", | |
"Requirement already satisfied: numpy>=1.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (1.21.6)\n", | |
"Requirement already satisfied: packaging>=20.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (23.1)\n", | |
"Requirement already satisfied: pyyaml>=5.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (6.0)\n", | |
"Requirement already satisfied: regex!=2019.12.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (2023.6.3)\n", | |
"Requirement already satisfied: requests in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (2.31.0)\n", | |
"Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (0.13.3)\n", | |
"Requirement already satisfied: safetensors>=0.3.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (0.3.1)\n", | |
"Requirement already satisfied: tqdm>=4.27 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (4.65.0)\n", | |
"Requirement already satisfied: fsspec in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.14.1->transformers) (2022.3.0)\n", | |
"Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.14.1->transformers) (4.7.0)\n", | |
"Requirement already satisfied: charset-normalizer<4,>=2 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->transformers) (2.0.12)\n", | |
"Requirement already satisfied: idna<4,>=2.5 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->transformers) (3.4)\n", | |
"Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->transformers) (1.26.15)\n", | |
"Requirement already satisfied: certifi>=2017.4.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->transformers) (2023.5.7)\n", | |
"Requirement already satisfied: accelerate in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (0.20.3)\n", | |
"Requirement already satisfied: numpy>=1.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from accelerate) (1.21.6)\n", | |
"Requirement already satisfied: packaging>=20.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from accelerate) (23.1)\n", | |
"Requirement already satisfied: psutil in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from accelerate) (5.9.5)\n", | |
"Requirement already satisfied: pyyaml in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from accelerate) (6.0)\n", | |
"Requirement already satisfied: torch>=1.6.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from accelerate) (1.11.0)\n", | |
"Requirement already satisfied: typing_extensions in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from torch>=1.6.0->accelerate) (4.7.0)\n" | |
] | |
} | |
], | |
"source": [ | |
"! pip install transformers\n", | |
"! pip install accelerate -U" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:38:30.050265Z", | |
"iopub.status.busy": "2023-07-02T16:38:30.049932Z", | |
"iopub.status.idle": "2023-07-02T16:38:33.677374Z", | |
"shell.execute_reply": "2023-07-02T16:38:33.676484Z", | |
"shell.execute_reply.started": "2023-07-02T16:38:30.050235Z" | |
}, | |
"id": "l_I_Z3R-PfBA", | |
"outputId": "872df2f1-8599-4913-c792-0bbb80a57b4b", | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Collecting evaluate\n", | |
" Using cached evaluate-0.4.0-py3-none-any.whl (81 kB)\n", | |
"Requirement already satisfied: datasets>=2.0.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (2.13.1)\n", | |
"Requirement already satisfied: numpy>=1.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (1.21.6)\n", | |
"Requirement already satisfied: dill in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (0.3.6)\n", | |
"Requirement already satisfied: pandas in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (1.4.2)\n", | |
"Requirement already satisfied: requests>=2.19.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (2.31.0)\n", | |
"Requirement already satisfied: tqdm>=4.62.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (4.65.0)\n", | |
"Requirement already satisfied: xxhash in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (3.2.0)\n", | |
"Requirement already satisfied: multiprocess in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (0.70.14)\n", | |
"Requirement already satisfied: fsspec[http]>=2021.05.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (2022.3.0)\n", | |
"Requirement already satisfied: huggingface-hub>=0.7.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (0.15.1)\n", | |
"Requirement already satisfied: packaging in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (23.1)\n", | |
"Collecting responses<0.19 (from evaluate)\n", | |
" Using cached responses-0.18.0-py3-none-any.whl (38 kB)\n", | |
"Requirement already satisfied: pyarrow>=8.0.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from datasets>=2.0.0->evaluate) (12.0.1)\n", | |
"Requirement already satisfied: aiohttp in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from datasets>=2.0.0->evaluate) (3.8.4)\n", | |
"Requirement already satisfied: pyyaml>=5.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from datasets>=2.0.0->evaluate) (6.0)\n", | |
"Requirement already satisfied: filelock in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from huggingface-hub>=0.7.0->evaluate) (3.12.2)\n", | |
"Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from huggingface-hub>=0.7.0->evaluate) (4.7.0)\n", | |
"Requirement already satisfied: charset-normalizer<4,>=2 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests>=2.19.0->evaluate) (2.0.12)\n", | |
"Requirement already satisfied: idna<4,>=2.5 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests>=2.19.0->evaluate) (3.4)\n", | |
"Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests>=2.19.0->evaluate) (1.26.15)\n", | |
"Requirement already satisfied: certifi>=2017.4.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests>=2.19.0->evaluate) (2023.5.7)\n", | |
"Requirement already satisfied: python-dateutil>=2.8.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from pandas->evaluate) (2.8.2)\n", | |
"Requirement already satisfied: pytz>=2020.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from pandas->evaluate) (2023.3)\n", | |
"Requirement already satisfied: six>=1.5 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from python-dateutil>=2.8.1->pandas->evaluate) (1.16.0)\n", | |
"Requirement already satisfied: attrs>=17.3.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (23.1.0)\n", | |
"Requirement already satisfied: multidict<7.0,>=4.5 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (6.0.4)\n", | |
"Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (4.0.2)\n", | |
"Requirement already satisfied: yarl<2.0,>=1.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (1.9.2)\n", | |
"Requirement already satisfied: frozenlist>=1.1.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (1.3.3)\n", | |
"Requirement already satisfied: aiosignal>=1.1.2 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (1.3.1)\n", | |
"Installing collected packages: responses, evaluate\n", | |
"Successfully installed evaluate-0.4.0 responses-0.18.0\n" | |
] | |
} | |
], | |
"source": [ | |
"! pip install evaluate" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "E2WGyBdhPBdK" | |
}, | |
"source": [ | |
"##### Import Necessary Libraries" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:38:56.612960Z", | |
"iopub.status.busy": "2023-07-02T16:38:56.612617Z", | |
"iopub.status.idle": "2023-07-02T16:39:00.587841Z", | |
"shell.execute_reply": "2023-07-02T16:39:00.587056Z", | |
"shell.execute_reply.started": "2023-07-02T16:38:56.612928Z" | |
}, | |
"id": "e7_XKGqgPBdL", | |
"tags": [] | |
}, | |
"outputs": [], | |
"source": [ | |
"import os, sys, itertools\n", | |
"os.environ['TOKENIZERS_PARALLELISM']='false'\n", | |
"\n", | |
"import pandas as pd\n", | |
"\n", | |
"from PIL import Image\n", | |
"\n", | |
"import torch\n", | |
"from torch.utils.data import Dataset\n", | |
"\n", | |
"import datasets\n", | |
"from datasets import load_dataset\n", | |
"\n", | |
"import transformers\n", | |
"from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer\n", | |
"from transformers import VisionEncoderDecoderModel, TrOCRProcessor, default_data_collator\n", | |
"\n", | |
"import evaluate" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "iv2paOsWPBdM" | |
}, | |
"source": [ | |
"##### Display Versions of Relevant Software & Libraries" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:39:03.132896Z", | |
"iopub.status.busy": "2023-07-02T16:39:03.132523Z", | |
"iopub.status.idle": "2023-07-02T16:39:03.138346Z", | |
"shell.execute_reply": "2023-07-02T16:39:03.137525Z", | |
"shell.execute_reply.started": "2023-07-02T16:39:03.132871Z" | |
}, | |
"id": "z4c1UdR6PBdN", | |
"outputId": "3272a877-2a41-40fc-ba5a-68f9135aeb84", | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" Python: 3.9.15\n", | |
" Pandas: 1.4.2\n", | |
" Datasets: 2.13.1\n", | |
" Transformers: 4.30.2\n", | |
" Torch: 1.11.0\n" | |
] | |
} | |
], | |
"source": [ | |
"print(\"Python:\".rjust(15), sys.version[0:6])\n", | |
"print(\"Pandas:\".rjust(15), pd.__version__)\n", | |
"print(\"Datasets:\".rjust(15), datasets.__version__)\n", | |
"print(\"Transformers:\".rjust(15), transformers.__version__)\n", | |
"print(\"Torch:\".rjust(15), torch.__version__)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:39:05.051096Z", | |
"iopub.status.busy": "2023-07-02T16:39:05.050710Z", | |
"iopub.status.idle": "2023-07-02T16:39:05.840681Z", | |
"shell.execute_reply": "2023-07-02T16:39:05.839435Z", | |
"shell.execute_reply.started": "2023-07-02T16:39:05.051074Z" | |
}, | |
"id": "oTg4CqR5QD1n", | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"/home/jovyan/workspace\n" | |
] | |
} | |
], | |
"source": [ | |
"! pwd" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "y4VWEzsHPBdP" | |
}, | |
"source": [ | |
"##### Ingest & Preprocess Training DataFrame" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:39:53.671763Z", | |
"iopub.status.busy": "2023-07-02T16:39:53.671379Z", | |
"iopub.status.idle": "2023-07-02T16:39:53.826411Z", | |
"shell.execute_reply": "2023-07-02T16:39:53.825668Z", | |
"shell.execute_reply.started": "2023-07-02T16:39:53.671738Z" | |
}, | |
"id": "Atcq5U47Pw0m", | |
"outputId": "0a73f060-abd2-4e5c-d156-3090962328f3", | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"1153" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import zipfile\n", | |
"\n", | |
"with zipfile.ZipFile('dataset-imss.zip', 'r') as zip_ref:\n", | |
"    zip_ref.extractall('captchas')\n", | |
"\n", | |
"import glob\n", | |
"images = glob.glob('/home/jovyan/workspace/captchas/dataset4/**.png', recursive=True)\n", | |
"len(images)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 206 | |
}, | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:39:55.929613Z", | |
"iopub.status.busy": "2023-07-02T16:39:55.929242Z", | |
"iopub.status.idle": "2023-07-02T16:39:55.955770Z", | |
"shell.execute_reply": "2023-07-02T16:39:55.955138Z", | |
"shell.execute_reply.started": "2023-07-02T16:39:55.929587Z" | |
}, | |
"id": "HRbxg-trPBdR", | |
"outputId": "ead0b4ba-d387-48bb-b3ba-bfca324c6ca8", | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>file_name</th>\n", | |
" <th>text</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>/home/jovyan/workspace/captchas/dataset4/VMyk4...</td>\n", | |
" <td>VMyk4Dc</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>/home/jovyan/workspace/captchas/dataset4/WCYaD...</td>\n", | |
" <td>WCYaDHH</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>/home/jovyan/workspace/captchas/dataset4/FP8sR...</td>\n", | |
" <td>FP8sReS</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>/home/jovyan/workspace/captchas/dataset4/hM89J...</td>\n", | |
" <td>hM89JvG</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>/home/jovyan/workspace/captchas/dataset4/6VMeA...</td>\n", | |
" <td>6VMeACE</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" file_name text\n", | |
"0 /home/jovyan/workspace/captchas/dataset4/VMyk4... VMyk4Dc\n", | |
"1 /home/jovyan/workspace/captchas/dataset4/WCYaD... WCYaDHH\n", | |
"2 /home/jovyan/workspace/captchas/dataset4/FP8sR... FP8sReS\n", | |
"3 /home/jovyan/workspace/captchas/dataset4/hM89J... hM89JvG\n", | |
"4 /home/jovyan/workspace/captchas/dataset4/6VMeA... 6VMeACE" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import re\n", | |
"\n", | |
"df = pd.DataFrame(images, columns=['file_name'])\n", | |
"df['text'] = df['file_name'].map(lambda x: re.search(r'.*/(.*)\\.png$', x).group(1))\n", | |
"df.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:40:01.052879Z", | |
"iopub.status.busy": "2023-07-02T16:40:01.052507Z", | |
"iopub.status.idle": "2023-07-02T16:40:01.082474Z", | |
"shell.execute_reply": "2023-07-02T16:40:01.081807Z", | |
"shell.execute_reply.started": "2023-07-02T16:40:01.052854Z" | |
}, | |
"id": "pK3lbuuVRThP", | |
"tags": [] | |
}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.model_selection import train_test_split\n", | |
"\n", | |
"train, test = train_test_split(df, test_size=0.2)\n", | |
"train = train.reset_index()\n", | |
"test = test.reset_index()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 424 | |
}, | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:40:04.454583Z", | |
"iopub.status.busy": "2023-07-02T16:40:04.454171Z", | |
"iopub.status.idle": "2023-07-02T16:40:04.466088Z", | |
"shell.execute_reply": "2023-07-02T16:40:04.465430Z", | |
"shell.execute_reply.started": "2023-07-02T16:40:04.454558Z" | |
}, | |
"id": "PVwUn2rOZVlY", | |
"outputId": "91a4b500-a27c-4bf7-98e0-87e543c99bd3", | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>index</th>\n", | |
" <th>file_name</th>\n", | |
" <th>text</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>33</td>\n", | |
" <td>/home/jovyan/workspace/captchas/dataset4/TtErH...</td>\n", | |
" <td>TtErHy3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>790</td>\n", | |
" <td>/home/jovyan/workspace/captchas/dataset4/wyUEp...</td>\n", | |
" <td>wyUEpjM</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>63</td>\n", | |
" <td>/home/jovyan/workspace/captchas/dataset4/wsndT...</td>\n", | |
" <td>wsndTMe</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>362</td>\n", | |
" <td>/home/jovyan/workspace/captchas/dataset4/rjUU4...</td>\n", | |
" <td>rjUU4Ru</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>519</td>\n", | |
" <td>/home/jovyan/workspace/captchas/dataset4/GrVPa...</td>\n", | |
" <td>GrVPa9Y</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>...</th>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>917</th>\n", | |
" <td>851</td>\n", | |
" <td>/home/jovyan/workspace/captchas/dataset4/Hwbva...</td>\n", | |
" <td>Hwbvav4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>918</th>\n", | |
" <td>10</td>\n", | |
" <td>/home/jovyan/workspace/captchas/dataset4/TCesP...</td>\n", | |
" <td>TCesPS4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>919</th>\n", | |
" <td>754</td>\n", | |
" <td>/home/jovyan/workspace/captchas/dataset4/Estmq...</td>\n", | |
" <td>Estmq6S</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>920</th>\n", | |
" <td>477</td>\n", | |
" <td>/home/jovyan/workspace/captchas/dataset4/pUQyG...</td>\n", | |
" <td>pUQyGSe</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>921</th>\n", | |
" <td>40</td>\n", | |
" <td>/home/jovyan/workspace/captchas/dataset4/KCM3s...</td>\n", | |
" <td>KCM3s9h</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>922 rows × 3 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" index file_name text\n", | |
"0 33 /home/jovyan/workspace/captchas/dataset4/TtErH... TtErHy3\n", | |
"1 790 /home/jovyan/workspace/captchas/dataset4/wyUEp... wyUEpjM\n", | |
"2 63 /home/jovyan/workspace/captchas/dataset4/wsndT... wsndTMe\n", | |
"3 362 /home/jovyan/workspace/captchas/dataset4/rjUU4... rjUU4Ru\n", | |
"4 519 /home/jovyan/workspace/captchas/dataset4/GrVPa... GrVPa9Y\n", | |
".. ... ... ...\n", | |
"917 851 /home/jovyan/workspace/captchas/dataset4/Hwbva... Hwbvav4\n", | |
"918 10 /home/jovyan/workspace/captchas/dataset4/TCesP... TCesPS4\n", | |
"919 754 /home/jovyan/workspace/captchas/dataset4/Estmq... Estmq6S\n", | |
"920 477 /home/jovyan/workspace/captchas/dataset4/pUQyG... pUQyGSe\n", | |
"921 40 /home/jovyan/workspace/captchas/dataset4/KCM3s... KCM3s9h\n", | |
"\n", | |
"[922 rows x 3 columns]" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"train" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:40:09.619510Z", | |
"iopub.status.busy": "2023-07-02T16:40:09.619106Z", | |
"iopub.status.idle": "2023-07-02T16:40:09.624809Z", | |
"shell.execute_reply": "2023-07-02T16:40:09.623938Z", | |
"shell.execute_reply.started": "2023-07-02T16:40:09.619481Z" | |
}, | |
"tags": [] | |
}, | |
"outputs": [], | |
"source": [ | |
"def tokenize(file_path, text, processor, max_target_length=128):\n", | |
"    # standalone version of Captcha_Dataset.__getitem__; the processor is passed in explicitly\n", | |
"    # prepare image (i.e. resize + normalize)\n", | |
"    image = Image.open(file_path).convert(\"RGB\")\n", | |
"    pixel_values = processor(image, return_tensors=\"pt\").pixel_values\n", | |
"    # add labels (input_ids) by encoding the text\n", | |
"    labels = processor.tokenizer(text, padding=\"max_length\", max_length=max_target_length).input_ids\n", | |
"    # important: make sure that PAD tokens are ignored by the loss function\n", | |
"    labels = [label if label != processor.tokenizer.pad_token_id\n", | |
"              else -100 for label in labels]\n", | |
"\n", | |
"    encoding = {\"pixel_values\": pixel_values.squeeze(), \"labels\": torch.tensor(labels)}\n", | |
"    return encoding" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:40:36.770297Z", | |
"iopub.status.busy": "2023-07-02T16:40:36.769909Z", | |
"iopub.status.idle": "2023-07-02T16:40:36.776311Z", | |
"shell.execute_reply": "2023-07-02T16:40:36.775675Z", | |
"shell.execute_reply.started": "2023-07-02T16:40:36.770274Z" | |
}, | |
"id": "CS86kRsJPBdX", | |
"tags": [] | |
}, | |
"outputs": [], | |
"source": [ | |
"class Captcha_Dataset(Dataset):\n", | |
"\n", | |
"    def __init__(self, df, processor, max_target_length=128):\n", | |
"        self.df = df\n", | |
"        self.processor = processor\n", | |
"        self.max_target_length = max_target_length\n", | |
"\n", | |
"    def __len__(self):\n", | |
"        return len(self.df)\n", | |
"\n", | |
"    def __getitem__(self, idx):\n", | |
"        # get file name + text\n", | |
"        file_path = self.df['file_name'][idx]\n", | |
"        text = self.df['text'][idx]\n", | |
"        # prepare image (i.e. resize + normalize)\n", | |
"        image = Image.open(file_path).convert(\"RGB\")\n", | |
"        pixel_values = self.processor(image, return_tensors=\"pt\").pixel_values\n", | |
"        # add labels (input_ids) by encoding the text\n", | |
"        labels = self.processor.tokenizer(text, padding=\"max_length\", max_length=self.max_target_length).input_ids\n", | |
"        # important: make sure that PAD tokens are ignored by the loss function\n", | |
"        labels = [label if label != self.processor.tokenizer.pad_token_id\n", | |
"                  else -100 for label in labels]\n", | |
"\n", | |
"        encoding = {\"pixel_values\": pixel_values.squeeze(), \"labels\": torch.tensor(labels)}\n", | |
"        return encoding" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "qMuiii6UPBdY" | |
}, | |
"source": [ | |
"##### Basic Values/Constants" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:41:43.503277Z", | |
"iopub.status.busy": "2023-07-02T16:41:43.502894Z", | |
"iopub.status.idle": "2023-07-02T16:41:43.506967Z", | |
"shell.execute_reply": "2023-07-02T16:41:43.506238Z", | |
"shell.execute_reply.started": "2023-07-02T16:41:43.503252Z" | |
}, | |
"id": "Pz10NsdfPBdZ", | |
"tags": [] | |
}, | |
"outputs": [], | |
"source": [ | |
"MODEL_CKPT = \"microsoft/trocr-base-printed\"\n", | |
"MODEL_NAME = MODEL_CKPT.split(\"/\")[-1] + \"_captcha_ocr\"\n", | |
"NUM_OF_EPOCHS = 5" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "mHdN01enPBdZ" | |
}, | |
"source": [ | |
"##### Instantiate Processor, Create Training, & Testing Dataset Instances" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:41:45.762132Z", | |
"iopub.status.busy": "2023-07-02T16:41:45.761743Z", | |
"iopub.status.idle": "2023-07-02T16:41:45.951471Z", | |
"shell.execute_reply": "2023-07-02T16:41:45.950844Z", | |
"shell.execute_reply.started": "2023-07-02T16:41:45.762108Z" | |
}, | |
"id": "Vo6_X-rjPBdZ", | |
"outputId": "9b427a64-daa6-424a-e45b-e61dea91487a", | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.\n" | |
] | |
} | |
], | |
"source": [ | |
"processor = TrOCRProcessor.from_pretrained(MODEL_CKPT)\n", | |
"train_ds = Captcha_Dataset(df=train,processor=processor)\n", | |
"test_ds = Captcha_Dataset(df=test,processor=processor)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "lx7hfHVGPBda" | |
}, | |
"source": [ | |
"##### Print Length of Training & Testing Datasets" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:41:48.729007Z", | |
"iopub.status.busy": "2023-07-02T16:41:48.728622Z", | |
"iopub.status.idle": "2023-07-02T16:41:48.733380Z", | |
"shell.execute_reply": "2023-07-02T16:41:48.732602Z", | |
"shell.execute_reply.started": "2023-07-02T16:41:48.728985Z" | |
}, | |
"id": "wgHpCqKfPBdb", | |
"outputId": "106f80b7-f3a7-4efe-eec2-92df29061702", | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"The training dataset has 922 samples in it.\n", | |
"The testing dataset has 231 samples in it.\n" | |
] | |
} | |
], | |
"source": [ | |
"print(f\"The training dataset has {len(train_ds)} samples in it.\")\n", | |
"print(f\"The testing dataset has {len(test_ds)} samples in it.\")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "rUAIaTqwPBdb" | |
}, | |
"source": [ | |
"##### Example of Input Data Shapes" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:41:59.911117Z", | |
"iopub.status.busy": "2023-07-02T16:41:59.910734Z", | |
"iopub.status.idle": "2023-07-02T16:41:59.930886Z", | |
"shell.execute_reply": "2023-07-02T16:41:59.930111Z", | |
"shell.execute_reply.started": "2023-07-02T16:41:59.911090Z" | |
}, | |
"id": "FnEUP8PKPBdc", | |
"outputId": "cd817977-eeb4-47df-bdda-8c33fc4c18dd", | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"pixel_values : torch.Size([3, 384, 384])\n", | |
"labels : torch.Size([128])\n" | |
] | |
} | |
], | |
"source": [ | |
"encoding = train_ds[10]\n", | |
"\n", | |
"for k,v in encoding.items():\n", | |
" print(k, \" : \", v.shape)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "N38-0NPhPBdc" | |
}, | |
"source": [ | |
"##### Show Example" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 67 | |
}, | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:42:03.356471Z", | |
"iopub.status.busy": "2023-07-02T16:42:03.355908Z", | |
"iopub.status.idle": "2023-07-02T16:42:03.365782Z", | |
"shell.execute_reply": "2023-07-02T16:42:03.365173Z", | |
"shell.execute_reply.started": "2023-07-02T16:42:03.356433Z" | |
}, | |
"id": "KIWjkbV9PBdd", | |
"outputId": "54a13aeb-18c6-4248-8d42-4818886df3d7", | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAANwAAAAoCAIAAAAaOwPZAAAFO0lEQVR4nO1c25GjOhDV3toAyGCUgZUBZMBkQAiE4HIIDoEQyAAykDPQZNAZzP045S4tM8aSwI1mV+cLKEs03Uf9ksq/Pj8/VUFBTvjvaAEKCpYopCzIDoWUBdmhkLIgOxRSFmSHQsqC7FBIWZAdCikLskMhZUF2+C32JmstLrTWVVWJvbfgx0GIlNZaIhrHkYjqujbGZEVNIsJFPiI555xzSqmqqowxB0sjCzlP6ZwjIih6nue6rsHLQ9gJSZiLkKppGpUHLyGetXae577vrbUyvFyo5SjrCJFyoVN2A1prYXbCZzvn5nmGDEBd19Za8FIAnMw8YptzbhxHpdTlcjmfz6+Wx18GrBatddu2TdMI81LOUyJkG2OstfhsIrLWWmvByPCw/tSiKwOJaBgGn44MrXXUbMngZKZt22+9IBQiI4y6M3Kaptvt5j/nhSHMSzlSKk/XPjPwBNe3263runVePrXoOsZxZEYaY9q29WWTUT0RXS4XpZS1drsXTF6iC5FwUdd1XddKqev1yqEc3mSLkFEQJaW6Z2y8IrXWfd8rpeZ5xhOoeEUFC4tG8VJrfTqdpmlSd0buW0Zw0ixmwo1LFOAF2TQNfIS1tu/76/Wq7sXAnkI/gzQpOWQ758BIrEJjDCxqjFlxV5yJYzjyrXBj+O5wF/aAE3zrnIOzlyxNdnG6i2yhaZppmo6q+aRJ6VcYoAhYCBPCwCu6gNX5Wmu90UlsARgJTgBg/HZhmOhElMwMebe9F6R3dFhT6m5CX+mL22/x9vbm38LvJsuDUJ4G30vxw9PpBMefLAOCyTAMfliIDaBENE3TMAwL8cJFWpgpdoYtkPaUAIIFmkFRA2Hsj48P1poxpmmaWGcAGyP6MyfS8kuYnIj6vg/sHiBEYOBCBnUPJtyjwG0U0ZmRVVUNw9B1XfhAXgZIKNEVChy+F0RJ6S96aDlhCUJN1lqYv23bKGYzITgrxXPUPQmRF28HIwPHov/FbPYTAJ7Q10y4p8TP2EcSUbh+mIt4HaYiItSgkv5SlJRY9LhOcJMAq+brRQh8QuwCvD2qacL11iOqoS/jpzqBwO/RXFRKdV0X3mJ0zoHKLKS65xJEpLUW46UcKbnuVsftX/mv5iXBReuBu8x+4cy+nJsygYDvZ7easBnTdR2He9SjcJnwoIgG4bMlQ3Tvm91kDqcxIADYkLa76KeGsF/sR63IAPMHTggWDsPQti2UjAUWJQ8227B5gVukFsxLsW6lXPXtB6Pk2P0KJO93IxPA9WInfbsMgXxilw/ScPyNTbUBTvSZ0+fzOaQlsi+ESOkvtQNj977gD6mq6hWOBPOvOyp/YfCqiEolV3BUMiNEylfEbrZWWtXCw7kdg6x3mqZpmnashL5FLOEeeeLFCtda78VIde9WciUuBukDGWqP2O23dZCAq/hlDRtjOHjJu03J7aFw+E2AR23IqJCCaLuRkV93TbnS+jtbQl3XjeOIDY+Nn8cdXczDHZCoHXDOcf0+iBiqP3fhH7kiZHXYSv32B9xwJaJdGDmOox8lqjtwSiFt5ljIHfJFZaf2yFSqqkJlgHNoaKGFD+ew+NV+2jt0HCKnT5pXOJIVvZF3LFftl0cuoGPOue6FXz/3rwAXxVMU1zmVXLA5oQjbeOKYCZ0gP6fpe50PX4RvdVBV+oNJuRH07ESSANIIja1tjhJ1Xb+/v/8F3QzGMQcyckAOVkzOZLTWp9NJKYXYmsO37Ih/11P+aPg7ET/uuORTFFIWZIfyty0F2aGQsiA7FFIWZIdCyoLs8D9FlV0D7PxKyQAAAABJRU5ErkJggg==\n", | |
"text/plain": [ | |
"<PIL.Image.Image image mode=RGB size=220x40>" | |
] | |
}, | |
"execution_count": 29, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"Image.open(train['file_name'][0]).convert(\"RGB\")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "O7nOrwBpPBdd" | |
}, | |
"source": [ | |
"##### Show Label for Above Example" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:42:05.806458Z", | |
"iopub.status.busy": "2023-07-02T16:42:05.806065Z", | |
"iopub.status.idle": "2023-07-02T16:42:05.811038Z", | |
"shell.execute_reply": "2023-07-02T16:42:05.810402Z", | |
"shell.execute_reply.started": "2023-07-02T16:42:05.806432Z" | |
}, | |
"id": "tjOyc1ksPBde", | |
"outputId": "2771a745-dc3f-4058-f1fe-c27d28e57ffa", | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"vy26xDW\n" | |
] | |
} | |
], | |
"source": [ | |
"labels = encoding['labels']\n", | |
"labels[labels == -100] = processor.tokenizer.pad_token_id\n", | |
"label_str = processor.decode(labels, skip_special_tokens=True)\n", | |
"print(label_str)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "Sc3E_5KCPBde" | |
}, | |
"source": [ | |
"#### Instantiate Model" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:42:08.592330Z", | |
"iopub.status.busy": "2023-07-02T16:42:08.591936Z", | |
"iopub.status.idle": "2023-07-02T16:42:15.226651Z", | |
"shell.execute_reply": "2023-07-02T16:42:15.225927Z", | |
"shell.execute_reply.started": "2023-07-02T16:42:08.592307Z" | |
}, | |
"id": "giCycPYSPBdf", | |
"outputId": "0503d57a-9aaa-412c-afdc-60de94e1e7cb", | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"Some weights of VisionEncoderDecoderModel were not initialized from the model checkpoint at microsoft/trocr-base-printed and are newly initialized: ['encoder.pooler.dense.weight', 'encoder.pooler.dense.bias']\n", | |
"You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" | |
] | |
} | |
], | |
"source": [ | |
"model = VisionEncoderDecoderModel.from_pretrained(MODEL_CKPT)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "-rkOX7UEPBdf" | |
}, | |
"source": [ | |
"##### Model Configuration Modifications" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 32, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:42:15.228189Z", | |
"iopub.status.busy": "2023-07-02T16:42:15.227891Z", | |
"iopub.status.idle": "2023-07-02T16:42:15.233150Z", | |
"shell.execute_reply": "2023-07-02T16:42:15.232303Z", | |
"shell.execute_reply.started": "2023-07-02T16:42:15.228166Z" | |
}, | |
"id": "BCimiZB2PBdg", | |
"tags": [] | |
}, | |
"outputs": [], | |
"source": [ | |
"model.config.decoder_start_token_id = processor.tokenizer.cls_token_id\n", | |
"model.config.pad_token_id = processor.tokenizer.pad_token_id\n", | |
"\n", | |
"model.config.vocab_size = model.config.decoder.vocab_size\n", | |
"\n", | |
"model.config.eos_token_id = processor.tokenizer.sep_token_id\n", | |
"model.config.max_length = 64\n", | |
"model.config.early_stopping = True\n", | |
"model.config.no_repeat_ngram_size = 3\n", | |
"model.config.length_penalty = 2.0\n", | |
"model.config.num_beams = 4" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "TdujhGLEPBdh" | |
}, | |
"source": [ | |
"##### Define Metrics Evaluation" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:42:17.957734Z", | |
"iopub.status.busy": "2023-07-02T16:42:17.957346Z", | |
"iopub.status.idle": "2023-07-02T16:42:19.107223Z", | |
"shell.execute_reply": "2023-07-02T16:42:19.106630Z", | |
"shell.execute_reply.started": "2023-07-02T16:42:17.957710Z" | |
}, | |
"id": "cSRjcoBVPBdh", | |
"tags": [] | |
}, | |
"outputs": [], | |
"source": [ | |
"cer_metric = evaluate.load(\"cer\")\n", | |
"\n", | |
"def compute_metrics(pred):\n", | |
" label_ids = pred.label_ids\n", | |
" pred_ids = pred.predictions\n", | |
"\n", | |
" pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)\n", | |
" label_ids[label_ids == -100] = processor.tokenizer.pad_token_id\n", | |
" label_str = processor.batch_decode(label_ids, skip_special_tokens=True)\n", | |
"\n", | |
" cer = cer_metric.compute(predictions=pred_str, references=label_str)\n", | |
"\n", | |
" return {\"cer\" : cer}" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "96ooUEZBPBdh" | |
}, | |
"source": [ | |
"##### Define Training Arguments" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:42:21.819117Z", | |
"iopub.status.busy": "2023-07-02T16:42:21.818715Z", | |
"iopub.status.idle": "2023-07-02T16:42:21.825251Z", | |
"shell.execute_reply": "2023-07-02T16:42:21.824571Z", | |
"shell.execute_reply.started": "2023-07-02T16:42:21.819072Z" | |
}, | |
"id": "FEEailluPBdi", | |
"tags": [] | |
}, | |
"outputs": [], | |
"source": [ | |
"args = Seq2SeqTrainingArguments(\n", | |
" output_dir = MODEL_NAME,\n", | |
" num_train_epochs=NUM_OF_EPOCHS,\n", | |
" predict_with_generate=True,\n", | |
" evaluation_strategy=\"epoch\",\n", | |
" save_strategy=\"steps\", # Change here\n", | |
" save_steps=1e6, # Add this line, set to a large number\n", | |
" per_device_train_batch_size=8,\n", | |
" per_device_eval_batch_size=8,\n", | |
" logging_first_step=True,\n", | |
" hub_private_repo=False,\n", | |
" push_to_hub=False\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "OnJTMfK9PBdj" | |
}, | |
"source": [ | |
"##### Define Trainer" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:42:26.816215Z", | |
"iopub.status.busy": "2023-07-02T16:42:26.815829Z", | |
"iopub.status.idle": "2023-07-02T16:42:27.179177Z", | |
"shell.execute_reply": "2023-07-02T16:42:27.178432Z", | |
"shell.execute_reply.started": "2023-07-02T16:42:26.816192Z" | |
}, | |
"id": "ukKLfcRLPBdj", | |
"outputId": "0913bf3b-0e6c-4e78-e6d5-a87864dc5b42", | |
"tags": [] | |
}, | |
"outputs": [], | |
"source": [ | |
"trainer = Seq2SeqTrainer(\n", | |
" model=model,\n", | |
" tokenizer=processor.feature_extractor,\n", | |
" args=args,\n", | |
" compute_metrics=compute_metrics,\n", | |
" train_dataset=train_ds,\n", | |
" eval_dataset=test_ds,\n", | |
" data_collator=default_data_collator\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "pm6MBc4mPBdk" | |
}, | |
"source": [ | |
"##### Fit/Train Model" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 134 | |
}, | |
"execution": { | |
"iopub.execute_input": "2023-07-02T16:42:30.324513Z", | |
"iopub.status.busy": "2023-07-02T16:42:30.324129Z", | |
"iopub.status.idle": "2023-07-02T17:06:12.146679Z", | |
"shell.execute_reply": "2023-07-02T17:06:12.146045Z", | |
"shell.execute_reply.started": "2023-07-02T16:42:30.324486Z" | |
}, | |
"id": "RdwPyzCDPBdl", | |
"outputId": "a1406ff0-bb27-4322-9362-83dcf6af14e7", | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"/opt/saturncloud/envs/saturn/lib/python3.9/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n", | |
" warnings.warn(\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/html": [ | |
"\n", | |
" <div>\n", | |
" \n", | |
" <progress value='580' max='580' style='width:300px; height:20px; vertical-align: middle;'></progress>\n", | |
" [580/580 23:39, Epoch 5/5]\n", | |
" </div>\n", | |
" <table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: left;\">\n", | |
" <th>Epoch</th>\n", | |
" <th>Training Loss</th>\n", | |
" <th>Validation Loss</th>\n", | |
" <th>Cer</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <td>1</td>\n", | |
" <td>11.165900</td>\n", | |
" <td>1.063115</td>\n", | |
" <td>0.163884</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>2</td>\n", | |
" <td>11.165900</td>\n", | |
" <td>0.590734</td>\n", | |
" <td>0.097712</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>3</td>\n", | |
" <td>11.165900</td>\n", | |
" <td>0.485241</td>\n", | |
" <td>0.053803</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>4</td>\n", | |
" <td>11.165900</td>\n", | |
" <td>0.321299</td>\n", | |
" <td>0.032158</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>5</td>\n", | |
" <td>0.438800</td>\n", | |
" <td>0.294638</td>\n", | |
" <td>0.028448</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table><p>" | |
], | |
"text/plain": [ | |
"<IPython.core.display.HTML object>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"/opt/saturncloud/envs/saturn/lib/python3.9/site-packages/transformers/generation/utils.py:1259: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)\n", | |
" warnings.warn(\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"TrainOutput(global_step=580, training_loss=0.40041021108627317, metrics={'train_runtime': 1421.6668, 'train_samples_per_second': 3.243, 'train_steps_per_second': 0.408, 'total_flos': 3.4495947304914125e+18, 'train_loss': 0.40041021108627317, 'epoch': 5.0})" | |
] | |
}, | |
"execution_count": 36, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"trainer.train()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 46, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T17:10:16.834072Z", | |
"iopub.status.busy": "2023-07-02T17:10:16.833664Z", | |
"iopub.status.idle": "2023-07-02T17:10:17.821400Z", | |
"shell.execute_reply": "2023-07-02T17:10:17.820510Z", | |
"shell.execute_reply.started": "2023-07-02T17:10:16.834044Z" | |
}, | |
"tags": [] | |
}, | |
"outputs": [], | |
"source": [ | |
"! rm -rf xtrocr-base-printed_captcha_ocr2" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "jjAvL7blPBdl" | |
}, | |
"source": [ | |
"##### Save Model & Model State" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 47, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T17:10:20.149589Z", | |
"iopub.status.busy": "2023-07-02T17:10:20.149176Z", | |
"iopub.status.idle": "2023-07-02T17:10:22.081015Z", | |
"shell.execute_reply": "2023-07-02T17:10:22.080412Z", | |
"shell.execute_reply.started": "2023-07-02T17:10:20.149561Z" | |
}, | |
"id": "CAL-qtqePBdl", | |
"tags": [] | |
}, | |
"outputs": [], | |
"source": [ | |
"trainer.save_model(f'/home/jovyan/workspace/x{MODEL_NAME}')\n", | |
"trainer.save_state()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "gqcy34rdPBdn" | |
}, | |
"source": [ | |
"##### Evaluate Model" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 48, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T17:10:25.211492Z", | |
"iopub.status.busy": "2023-07-02T17:10:25.211105Z", | |
"iopub.status.idle": "2023-07-02T17:11:50.250586Z", | |
"shell.execute_reply": "2023-07-02T17:11:50.249820Z", | |
"shell.execute_reply.started": "2023-07-02T17:10:25.211461Z" | |
}, | |
"id": "Vh-Hi13zPBdn", | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"\n", | |
" <div>\n", | |
" \n", | |
" <progress value='29' max='29' style='width:300px; height:20px; vertical-align: middle;'></progress>\n", | |
" [29/29 01:21]\n", | |
" </div>\n", | |
" " | |
], | |
"text/plain": [ | |
"<IPython.core.display.HTML object>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"{'eval_loss': 0.29463836550712585,\n", | |
" 'eval_cer': 0.02844774273345702,\n", | |
" 'eval_runtime': 85.0328,\n", | |
" 'eval_samples_per_second': 2.717,\n", | |
" 'eval_steps_per_second': 0.341,\n", | |
" 'epoch': 5.0}" | |
] | |
}, | |
"execution_count": 48, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"trainer.evaluate()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 49, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T17:29:33.805483Z", | |
"iopub.status.busy": "2023-07-02T17:29:33.805113Z", | |
"iopub.status.idle": "2023-07-02T17:29:34.704793Z", | |
"shell.execute_reply": "2023-07-02T17:29:34.703925Z", | |
"shell.execute_reply.started": "2023-07-02T17:29:33.805459Z" | |
}, | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"/home/jovyan/workspace\n" | |
] | |
} | |
], | |
"source": [ | |
"! pwd" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 50, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T17:29:36.103773Z", | |
"iopub.status.busy": "2023-07-02T17:29:36.103347Z", | |
"iopub.status.idle": "2023-07-02T17:29:42.509355Z", | |
"shell.execute_reply": "2023-07-02T17:29:42.508568Z", | |
"shell.execute_reply.started": "2023-07-02T17:29:36.103745Z" | |
}, | |
"tags": [] | |
}, | |
"outputs": [], | |
"source": [ | |
"from transformers import VisionEncoderDecoderModel\n", | |
"\n", | |
"model = VisionEncoderDecoderModel.from_pretrained(f'/home/jovyan/workspace/x{MODEL_NAME}')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 51, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T17:29:42.511546Z", | |
"iopub.status.busy": "2023-07-02T17:29:42.511120Z", | |
"iopub.status.idle": "2023-07-02T17:29:42.517343Z", | |
"shell.execute_reply": "2023-07-02T17:29:42.516488Z", | |
"shell.execute_reply.started": "2023-07-02T17:29:42.511512Z" | |
}, | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"'TtErHy3'" | |
] | |
}, | |
"execution_count": 51, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"train['text'][0]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 52, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T17:29:44.621137Z", | |
"iopub.status.busy": "2023-07-02T17:29:44.620754Z", | |
"iopub.status.idle": "2023-07-02T17:29:44.950276Z", | |
"shell.execute_reply": "2023-07-02T17:29:44.949560Z", | |
"shell.execute_reply.started": "2023-07-02T17:29:44.621109Z" | |
}, | |
"tags": [] | |
}, | |
"outputs": [], | |
"source": [ | |
"import cv2\n", | |
"\n", | |
"def preprocessing(image_path):\n", | |
" image = cv2.imread(image_path, 0)\n", | |
" # _, image = cv2.threshold(image, 230, 255, cv2.THRESH_BINARY)\n", | |
" # ret, image = cv2.threshold(cv2.GaussianBlur(image, (3,3), 0), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)\n", | |
" # ret, image = cv2.threshold(cv2.GaussianBlur(image, (5,5), 0), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)\n", | |
" # image = cv2.dilate(cv2.erode(cv2.dilate(image, kernel, iterations=1), kernel), kernel)\n", | |
" return processor(cv2.merge([image, image, image]), return_tensors=\"pt\").pixel_values\n", | |
"\n", | |
"def solve_captcha(file_path):\n", | |
" generated_ids = model.generate(preprocessing(file_path))\n", | |
" generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]\n", | |
" return generated_text" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 56, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T17:46:35.049756Z", | |
"iopub.status.busy": "2023-07-02T17:46:35.049363Z", | |
"iopub.status.idle": "2023-07-02T19:29:16.001986Z", | |
"shell.execute_reply": "2023-07-02T19:29:16.001278Z", | |
"shell.execute_reply.started": "2023-07-02T17:46:35.049731Z" | |
}, | |
"tags": [] | |
}, | |
"outputs": [], | |
"source": [ | |
"predicted_values = [solve_captcha(img_file) for img_file in df['file_name']]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 61, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T19:53:42.196913Z", | |
"iopub.status.busy": "2023-07-02T19:53:42.196492Z", | |
"iopub.status.idle": "2023-07-02T19:53:42.216670Z", | |
"shell.execute_reply": "2023-07-02T19:53:42.215702Z", | |
"shell.execute_reply.started": "2023-07-02T19:53:42.196881Z" | |
}, | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[('utPHmUu', 'utPHUu'),\n", | |
" ('quWhtpt', 'quWhtPt'),\n", | |
" ('hvv2JJK', 'hVv2JK'),\n", | |
" ('vqpqpq4', 'VqPqpq4'),\n", | |
" ('GypHTKU', 'GyPHTKU'),\n", | |
" ('aXBKaaX', 'aXBKaX'),\n", | |
" ('C4hpRrp', 'G4hPRrP'),\n", | |
" ('pL4UEsv', 'PL4UEsv'),\n", | |
" ('f2wuxFu', 'f2WuxFu'),\n", | |
" ('QePx5cC', 'QePxcC'),\n", | |
" ('s6M2pWW', 's66M2pWW'),\n", | |
" ('exfCW4M', 'exfCWAM'),\n", | |
" ('CCrJjCf', 'CCrJJCf'),\n", | |
" ('UMWaxF8', 'UMWAxF8'),\n", | |
" ('t9tnnkr', 't9tnKr'),\n", | |
" ('VMHhjHK', 'VMHjHK'),\n", | |
" ('aeqrxnS', 'aeqrxS'),\n", | |
" ('Ecnmhyj', 'Ecnhnyj'),\n", | |
" ('K2WQY4W', 'k2wQY4w'),\n", | |
" ('sd6baaX', 'sd6baX'),\n", | |
" ('pje2kc2', 'Pje2kc2'),\n", | |
" ('QvaVv6k', 'QvaVvvv6k'),\n", | |
" ('fYrHLKe', 'fYrHLke'),\n", | |
" ('qsCGPWU', 'qsCGPwU'),\n", | |
" ('yb6wvjp', 'yb6wyjP'),\n", | |
" ('vQ7EKPv', 'vQ7EKpv'),\n", | |
" ('8eewnfL', '8ewnfL'),\n", | |
" ('e64wsQK', 'e64wSQK'),\n", | |
" ('yqdEkAp', 'yqdEkap'),\n", | |
" ('yRVCEmH', 'yRVCEmh'),\n", | |
" ('JpqkKHJ', 'JPqkKHJ'),\n", | |
" ('JYcknbj', 'JYcksnbj'),\n", | |
" ('SSHD6P2', 'sSHD6P2'),\n", | |
" ('d8yxwxF', 'd8yxF'),\n", | |
" ('mvTMBu7', 'nvTMBu7'),\n", | |
" ('pDVhkAa', 'pDVAkAa'),\n", | |
" ('8LM2EBH', '8LM2EEBH'),\n", | |
" ('qpexxeH', 'qpexceH'),\n", | |
" ('qJpVPSh', 'qJPVPSh'),\n", | |
" ('xGfxcHB', 'xGfxchB'),\n", | |
" ('TXsqhMG', 'TXsqMG'),\n", | |
" ('aVaUaeP', 'aVaUaep'),\n", | |
" ('wmsBbEb', 'WmsBbEb'),\n", | |
" ('E5Mqdun', 'ESMqdun'),\n", | |
" ('jf99937', 'jf9937'),\n", | |
" ('jwv7ywf', 'jWV7Ywf')]" | |
] | |
}, | |
"execution_count": 61, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import numpy as np\n", | |
"accuracy = list(filter(lambda x: x[1] == 0, [[i,int(df['text'][i]==predicted_values[i])] for i in range(0, len(predicted_values))]))\n", | |
"list(map(lambda x: (df['text'][x[0]],predicted_values[x[0]]), accuracy))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 58, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-07-02T19:52:08.024581Z", | |
"iopub.status.busy": "2023-07-02T19:52:08.024178Z", | |
"iopub.status.idle": "2023-07-02T19:52:08.029239Z", | |
"shell.execute_reply": "2023-07-02T19:52:08.028504Z", | |
"shell.execute_reply.started": "2023-07-02T19:52:08.024556Z" | |
}, | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"('jwv7ywf', 'jWV7Ywf')" | |
] | |
}, | |
"execution_count": 58, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df['text'][1117],predicted_values[1117]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 86, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-06-29T22:13:56.023823Z", | |
"iopub.status.busy": "2023-06-29T22:13:56.023442Z", | |
"iopub.status.idle": "2023-06-29T22:13:56.029904Z", | |
"shell.execute_reply": "2023-06-29T22:13:56.028956Z", | |
"shell.execute_reply.started": "2023-06-29T22:13:56.023798Z" | |
}, | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAIcAAAAmCAAAAADLzNedAAAEzUlEQVR4nO2XwXXrSA5Fb8/x/r8MhAyEDMgM1BmoM3AIPg7BGfTPwMyAzADOAMoAE8HMokiKkixLi5lZDTb2oQpVFyjUK9QfAPbX0YD8/XeymPVdb1xbjlNixs5N2885/V0FRptp/V6VwAl230y2tRerorKA5c88RchvRleessjE4AxSWVMgCXyvzWCK+GcCSV/8aC9/xVhEmKiI7eCMw83ojGoRKrxbUlIZpwwwE3tfOCpjArLF5g8weDnuGclBPTklADJBxR2Hee3I8EOv9v8pspDZL/UrcU6fuXHbP+IwwcgIWtC9MxjQN6OtH4meyqIiiwUkCrwzLZhUTp9ZoLZ/1rk/4EB9A+mGBcNNpd3pcEsi6+sAOUUWNUIvKqZI1B/MzwmoiiygWlSy27muOUBdBmMSAOS+RQ7fhNBqtzymiAYClcv6dd62YUnu60OExjFCDqiWgggfvjoT9u3GLDQGBDXKbEYIkadfDTNjimgDn8Tg5R1AFuuXQCHrLrNRuT2oIK8AKmLOIvkZuHpB5UfbFNQf7CmKti8YqdosGoh1WyoLcoLL0L4SID9e+ch5nMWxfYs1qgqP4hsluuHodwZrVKs1Fausdpwr8Y/X2+kqPzirX99fjqgQOSVv8RDk5SgRPlx+NfmeImPKRWW3CQPUtaBXnRHms6Yf02e2/Ayy+ve3Rxi8GOCjUeeUCJlBxhRrrDIdN25Vl1iYTIe2cR5vQ8xnqFWRHmK0+kDHYTd4tAVJioKVQiass+0uV06XIHJ1/byex2E3kOcI3h6DvMyex9Lv5vXnbkgqzreee2fXSpRTZoOey6POGOBZToKhmwh+5MBZwvOda0haSu5hVI4JqD9wGuctuDjWVgHC9PqElm44OKuWm1nB+4KhvLmj6nPIAvXHnmAk4bJeJApKeqV/AmLLsSyRJRzGt/cZA4YB6wyWqzhjGBuGgH1l89wczmp3d4O71sCnOBbrxxnjyJTQci91ttxyQE1dwBQA5HAYz+6/o4Cq97eRnKCF8RSH7EoiAPSGey2dCUxT1SIoYpYQoEn82apUQNq7BPDF17bZ+4nDuouZFutxam2ncqgoWjXpTHuzhFgLn47pYRt0mY97g9ZzV3S/CwzXmuth7ib7buORQ6jA7CgZjj3qDO/Vxz2iE2Due1tqLw8QgHdLCwAUB0aQdXK1G9P50e5yfFculeOYyG1na/KMXcwOG+cWvQmJp+wf935oJ3XzlKiqzKkKIq+Jr94cAG4Ap+cofsqHexY58KcaxojVV5OnSO9Mc7Cyr3tz5LMq9g1H5Xw9Wje3oG6iMgZ1WqYnQ2bsTKa5Tla/2SKvRfZZDpkK1TS3+C0hNaZ3Rk6RzmEfauexAgnJDv3svPpdBpXP9qfnf63LQOd3qB1qLCoyRCUU0pHQHGcVYO2pprrqqmVdxjdV8wyHHMNYW1v1MFbrZQTmr4bhQIzQhCpHCSylq4NhPlj+0PL/wIHjbE/aAtLmbe/Z3oG9gIwCA9gTSJetueRQ5hcfn+S40Rr1KBvG8qqejwgwKil6YVU7sPWFPYMfUNlDPV/sj3/9+POywbeiXzXCCTqnKoyNms4DYuKX/Mk6fcDxI2N7SAqyxDVG6zsuVPa/xPEftbu6/j+2/3Nc2r8BfnyIF4iSw0YAAAAASUVORK5CYII=\n", | |
"text/plain": [ | |
"<PIL.PngImagePlugin.PngImageFile image mode=L size=135x38>" | |
] | |
}, | |
"execution_count": 86, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"Image.open(df['file_name'][18])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": { | |
"execution": { | |
"iopub.execute_input": "2023-06-29T22:18:23.458106Z", | |
"iopub.status.busy": "2023-06-29T22:18:23.457704Z", | |
"iopub.status.idle": "2023-06-29T22:18:28.624912Z", | |
"shell.execute_reply": "2023-06-29T22:18:28.624253Z", | |
"shell.execute_reply.started": "2023-06-29T22:18:23.458083Z" | |
}, | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"/opt/saturncloud/envs/saturn/lib/python3.9/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (64) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.\n", | |
" warnings.warn(\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"'3Vn7kw'" | |
] | |
}, | |
"execution_count": 22, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"solve_captcha('3yvh7KW.png')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "JJbH8MtBPBdo" | |
}, | |
"source": [ | |
"##### Push Model to Hub (My Profile!!!)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"id": "63SWFV94PBdo" | |
}, | |
"outputs": [], | |
"source": [ | |
"kwargs = {\n", | |
" \"finetuned_from\" : model.config._name_or_path,\n", | |
" \"tasks\" : \"image-to-text\",\n", | |
" \"tags\" : [\"image-to-text\"],\n", | |
"}\n", | |
"\n", | |
"if args.push_to_hub:\n", | |
" trainer.push_to_hub(\"All Dunn!!!\")\n", | |
"else:\n", | |
" trainer.create_model_card(**kwargs)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "5aSPONY_PBdp" | |
}, | |
"source": [ | |
"### Notes & Other Takeaways From This Project\n", | |
"****\n", | |
"- The Character Error Rate (CER) was 0.0075. I am pleased with that result.\n", | |
"- Context about metric: Zero (0) is perfection. One the worst score (unless there is an insertion error).\n", | |
"****" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "SKuM2BqePBdp" | |
}, | |
"source": [ | |
"### Citations\n", | |
"\n", | |
"##### For Transformer Checkpoint\n", | |
"- @misc{li2021trocr,\n", | |
" title={TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models},\n", | |
" author={Minghao Li and Tengchao Lv and Lei Cui and Yijuan Lu and Dinei Florencio and Cha Zhang and Zhoujun Li and Furu Wei},\n", | |
" year={2021},\n", | |
" eprint={2109.10282},\n", | |
" archivePrefix={arXiv},\n", | |
" primaryClass={cs.CL}\n", | |
"}\n", | |
"\n", | |
"##### For CER Metric\n", | |
"- @inproceedings{morris2004,\n", | |
"author = {Morris, Andrew and Maier, Viktoria and Green, Phil},\n", | |
"year = {2004},\n", | |
"month = {01},\n", | |
"pages = {},\n", | |
"title = {From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition.}\n", | |
"}" | |
] | |
} | |
], | |
"metadata": { | |
"colab": { | |
"provenance": [] | |
}, | |
"kernelspec": { | |
"display_name": "saturn (Python 3)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.9.15" | |
}, | |
"vscode": { | |
"interpreter": { | |
"hash": "a52fe47989fdc78fafbb981021cec52a6b82df6453830b9ffbd04250493e6cab" | |
} | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |
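A note on the CER figures reported in the notebook above: the following is a minimal plain-Python sketch of how a character error rate can be computed, assuming the usual definition (total edit operations divided by total reference characters). It is not the implementation behind `evaluate.load("cer")`; the helper names `edit_distance` and `cer` are hypothetical and exist only for illustration. The sample pairs are taken from the mismatch list printed in the notebook.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of character edits turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # i deletions turn ref[:i] into the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference character missing in hyp
                curr[j - 1] + 1,     # insertion: extra character in hyp
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def cer(references, predictions):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, p) for r, p in zip(references, predictions))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# (ground truth, prediction) pairs copied from the notebook's mismatch output
pairs = [("utPHmUu", "utPHUu"), ("QvaVv6k", "QvaVvvv6k"), ("jwv7ywf", "jWV7Ywf")]
refs, preds = zip(*pairs)
print(cer(refs, preds))        # 6 edits over 21 reference characters

# Insertions can push the score above 1: 3 extra characters, 2 reference characters
print(cer(["ab"], ["abxyz"]))  # 1.5
```

Several of the listed mismatches differ only in letter case (for example `p` vs `P`). If case-insensitive matching is acceptable for the target captcha, comparing `truth.lower()` with `pred.lower()` before scoring would show how much of the error comes from case confusion alone.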