sanchezcarlosjr/Copy_of_OCR_captcha.ipynb

## Copy_of_OCR_captcha.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JEAmyTyePBdF"
   },
   "source": [
    "## OCR of Captcha Images\n",
    "\n",
    "Dataset Source: https://www.kaggle.com/datasets/alizahidraja/captcha-data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qOSTkrnkPBdH"
   },
   "source": [
    "##### Install Necessary Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2023-07-02T16:38:00.419279Z",
     "iopub.status.busy": "2023-07-02T16:38:00.418892Z",
     "iopub.status.idle": "2023-07-02T16:38:10.993604Z",
     "shell.execute_reply": "2023-07-02T16:38:10.992638Z",
     "shell.execute_reply.started": "2023-07-02T16:38:00.419255Z"
    },
    "id": "iAmNUfbePBdI",
    "outputId": "16bd7b12-9d8c-4d1a-adb2-29aed21b2d39",
    "tags": []
   },
   "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "Requirement already satisfied: torch in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (1.11.0)\n",
       "Requirement already satisfied: torchvision in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (0.12.0)\n",
       "Requirement already satisfied: torchaudio in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (0.11.0)\n",
       "Requirement already satisfied: typing_extensions in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from torch) (4.7.0)\n",
       "Requirement already satisfied: numpy in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from torchvision) (1.21.6)\n",
       "Requirement already satisfied: requests in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from torchvision) (2.31.0)\n",
       "Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from torchvision) (9.1.1)\n",
       "Requirement already satisfied: charset-normalizer<4,>=2 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->torchvision) (2.0.12)\n",
       "Requirement already satisfied: idna<4,>=2.5 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->torchvision) (3.4)\n",
       "Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->torchvision) (1.26.15)\n",
       "Requirement already satisfied: certifi>=2017.4.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->torchvision) (2023.5.7)\n",
       "Note: you may need to restart the kernel to use updated packages.\n",
       "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
       "dask-cuda 22.4.0 requires click==8.0.4, but you have click 8.1.3 which is incompatible.\n",
       "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "%pip install torch torchvision torchaudio\n",
    "%pip install -q datasets jiwer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2023-07-02T16:38:23.527710Z",
     "iopub.status.busy": "2023-07-02T16:38:23.527353Z",
     "iopub.status.idle": "2023-07-02T16:38:30.048267Z",
     "shell.execute_reply": "2023-07-02T16:38:30.047483Z",
     "shell.execute_reply.started": "2023-07-02T16:38:23.527669Z"
    },
    "id": "6NXldfxQPPyf",
    "outputId": "156a5d1d-b3c9-4cd0-914b-efa8f8f92924",
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: transformers in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (4.30.2)\n",
      "Requirement already satisfied: filelock in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (3.12.2)\n",
      "Requirement already satisfied: huggingface-hub<1.0,>=0.14.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (0.15.1)\n",
      "Requirement already satisfied: numpy>=1.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (1.21.6)\n",
      "Requirement already satisfied: packaging>=20.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (23.1)\n",
      "Requirement already satisfied: pyyaml>=5.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (6.0)\n",
      "Requirement already satisfied: regex!=2019.12.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (2023.6.3)\n",
      "Requirement already satisfied: requests in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (2.31.0)\n",
      "Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (0.13.3)\n",
      "Requirement already satisfied: safetensors>=0.3.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (0.3.1)\n",
      "Requirement already satisfied: tqdm>=4.27 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (4.65.0)\n",
      "Requirement already satisfied: fsspec in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.14.1->transformers) (2022.3.0)\n",
      "Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.14.1->transformers) (4.7.0)\n",
      "Requirement already satisfied: charset-normalizer<4,>=2 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->transformers) (2.0.12)\n",
      "Requirement already satisfied: idna<4,>=2.5 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->transformers) (3.4)\n",
      "Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->transformers) (1.26.15)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->transformers) (2023.5.7)\n",
      "Requirement already satisfied: accelerate in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (0.20.3)\n",
      "Requirement already satisfied: numpy>=1.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from accelerate) (1.21.6)\n",
      "Requirement already satisfied: packaging>=20.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from accelerate) (23.1)\n",
      "Requirement already satisfied: psutil in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from accelerate) (5.9.5)\n",
      "Requirement already satisfied: pyyaml in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from accelerate) (6.0)\n",
      "Requirement already satisfied: torch>=1.6.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from accelerate) (1.11.0)\n",
      "Requirement already satisfied: typing_extensions in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from torch>=1.6.0->accelerate) (4.7.0)\n"
     ]
    }
   ],
   "source": [
    "%pip install transformers\n",
    "%pip install accelerate -U"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2023-07-02T16:38:30.050265Z",
     "iopub.status.busy": "2023-07-02T16:38:30.049932Z",
     "iopub.status.idle": "2023-07-02T16:38:33.677374Z",
     "shell.execute_reply": "2023-07-02T16:38:33.676484Z",
     "shell.execute_reply.started": "2023-07-02T16:38:30.050235Z"
    },
    "id": "l_I_Z3R-PfBA",
    "outputId": "872df2f1-8599-4913-c792-0bbb80a57b4b",
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collecting evaluate\n",
      "  Using cached evaluate-0.4.0-py3-none-any.whl (81 kB)\n",
      "Requirement already satisfied: datasets>=2.0.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (2.13.1)\n",
      "Requirement already satisfied: numpy>=1.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (1.21.6)\n",
      "Requirement already satisfied: dill in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (0.3.6)\n",
      "Requirement already satisfied: pandas in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (1.4.2)\n",
      "Requirement already satisfied: requests>=2.19.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (2.31.0)\n",
      "Requirement already satisfied: tqdm>=4.62.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (4.65.0)\n",
      "Requirement already satisfied: xxhash in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (3.2.0)\n",
      "Requirement already satisfied: multiprocess in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (0.70.14)\n",
      "Requirement already satisfied: fsspec[http]>=2021.05.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (2022.3.0)\n",
      "Requirement already satisfied: huggingface-hub>=0.7.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (0.15.1)\n",
      "Requirement already satisfied: packaging in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (23.1)\n",
      "Collecting responses<0.19 (from evaluate)\n",
      "  Using cached responses-0.18.0-py3-none-any.whl (38 kB)\n",
      "Requirement already satisfied: pyarrow>=8.0.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from datasets>=2.0.0->evaluate) (12.0.1)\n",
      "Requirement already satisfied: aiohttp in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from datasets>=2.0.0->evaluate) (3.8.4)\n",
      "Requirement already satisfied: pyyaml>=5.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from datasets>=2.0.0->evaluate) (6.0)\n",
      "Requirement already satisfied: filelock in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from huggingface-hub>=0.7.0->evaluate) (3.12.2)\n",
      "Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from huggingface-hub>=0.7.0->evaluate) (4.7.0)\n",
      "Requirement already satisfied: charset-normalizer<4,>=2 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests>=2.19.0->evaluate) (2.0.12)\n",
      "Requirement already satisfied: idna<4,>=2.5 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests>=2.19.0->evaluate) (3.4)\n",
      "Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests>=2.19.0->evaluate) (1.26.15)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests>=2.19.0->evaluate) (2023.5.7)\n",
      "Requirement already satisfied: python-dateutil>=2.8.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from pandas->evaluate) (2.8.2)\n",
      "Requirement already satisfied: pytz>=2020.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from pandas->evaluate) (2023.3)\n",
      "Requirement already satisfied: six>=1.5 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from python-dateutil>=2.8.1->pandas->evaluate) (1.16.0)\n",
      "Requirement already satisfied: attrs>=17.3.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (23.1.0)\n",
      "Requirement already satisfied: multidict<7.0,>=4.5 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (6.0.4)\n",
      "Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (4.0.2)\n",
      "Requirement already satisfied: yarl<2.0,>=1.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (1.9.2)\n",
      "Requirement already satisfied: frozenlist>=1.1.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (1.3.3)\n",
      "Requirement already satisfied: aiosignal>=1.1.2 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (1.3.1)\n",
      "Installing collected packages: responses, evaluate\n",
      "Successfully installed evaluate-0.4.0 responses-0.18.0\n"
     ]
    }
   ],
   "source": [
    "%pip install evaluate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "E2WGyBdhPBdK"
   },
   "source": [
    "##### Import Necessary Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T16:38:56.612960Z",
     "iopub.status.busy": "2023-07-02T16:38:56.612617Z",
     "iopub.status.idle": "2023-07-02T16:39:00.587841Z",
     "shell.execute_reply": "2023-07-02T16:39:00.587056Z",
     "shell.execute_reply.started": "2023-07-02T16:38:56.612928Z"
    },
    "id": "e7_XKGqgPBdL",
    "tags": []
   },
   "outputs": [],
   "source": [
    "import os, sys, itertools\n",
    "os.environ['TOKENIZERS_PARALLELISM']='false'\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "from PIL import Image\n",
    "\n",
    "import torch\n",
    "from torch.utils.data import Dataset\n",
    "\n",
    "import datasets\n",
    "from datasets import load_dataset\n",
    "\n",
    "import transformers\n",
    "from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer\n",
    "from transformers import VisionEncoderDecoderModel, TrOCRProcessor, default_data_collator\n",
    "\n",
    "import evaluate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "iv2paOsWPBdM"
   },
   "source": [
    "##### Display Versions of Relevant Software & Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2023-07-02T16:39:03.132896Z",
     "iopub.status.busy": "2023-07-02T16:39:03.132523Z",
     "iopub.status.idle": "2023-07-02T16:39:03.138346Z",
     "shell.execute_reply": "2023-07-02T16:39:03.137525Z",
     "shell.execute_reply.started": "2023-07-02T16:39:03.132871Z"
    },
    "id": "z4c1UdR6PBdN",
    "outputId": "3272a877-2a41-40fc-ba5a-68f9135aeb84",
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        Python: 3.9.15\n",
      "        Pandas: 1.4.2\n",
      "      Datasets: 2.13.1\n",
      "  Transformers: 4.30.2\n",
      "         Torch: 1.11.0\n"
     ]
    }
   ],
   "source": [
    "print(\"Python:\".rjust(15), sys.version.split()[0])\n",
    "print(\"Pandas:\".rjust(15), pd.__version__)\n",
    "print(\"Datasets:\".rjust(15), datasets.__version__)\n",
    "print(\"Transformers:\".rjust(15), transformers.__version__)\n",
    "print(\"Torch:\".rjust(15), torch.__version__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T16:39:05.051096Z",
     "iopub.status.busy": "2023-07-02T16:39:05.050710Z",
     "iopub.status.idle": "2023-07-02T16:39:05.840681Z",
     "shell.execute_reply": "2023-07-02T16:39:05.839435Z",
     "shell.execute_reply.started": "2023-07-02T16:39:05.051074Z"
    },
    "id": "oTg4CqR5QD1n",
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/jovyan/workspace\n"
     ]
    }
   ],
   "source": [
    "! pwd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "y4VWEzsHPBdP"
   },
   "source": [
    "##### Ingest & Preprocess Training DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2023-07-02T16:39:53.671763Z",
     "iopub.status.busy": "2023-07-02T16:39:53.671379Z",
     "iopub.status.idle": "2023-07-02T16:39:53.826411Z",
     "shell.execute_reply": "2023-07-02T16:39:53.825668Z",
     "shell.execute_reply.started": "2023-07-02T16:39:53.671738Z"
    },
    "id": "Atcq5U47Pw0m",
    "outputId": "0a73f060-abd2-4e5c-d156-3090962328f3",
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1153"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
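   ],
   "source": [
    "import glob\n",
    "import zipfile\n",
    "\n",
    "# Unpack the archive into ./captchas\n",
    "with zipfile.ZipFile('dataset-imss.zip', 'r') as zip_ref:\n",
    "    zip_ref.extractall('captchas')\n",
    "\n",
    "# '**' only recurses when it is a whole path component; '**.png' behaves like '*.png'\n",
    "images = glob.glob('/home/jovyan/workspace/captchas/dataset4/**/*.png', recursive=True)\n",
    "len(images)"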
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "execution": {
     "iopub.execute_input": "2023-07-02T16:39:55.929613Z",
     "iopub.status.busy": "2023-07-02T16:39:55.929242Z",
     "iopub.status.idle": "2023-07-02T16:39:55.955770Z",
     "shell.execute_reply": "2023-07-02T16:39:55.955138Z",
     "shell.execute_reply.started": "2023-07-02T16:39:55.929587Z"
    },
    "id": "HRbxg-trPBdR",
    "outputId": "ead0b4ba-d387-48bb-b3ba-bfca324c6ca8",
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>file_name</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>/home/jovyan/workspace/captchas/dataset4/VMyk4...</td>\n",
       "      <td>VMyk4Dc</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>/home/jovyan/workspace/captchas/dataset4/WCYaD...</td>\n",
       "      <td>WCYaDHH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>/home/jovyan/workspace/captchas/dataset4/FP8sR...</td>\n",
       "      <td>FP8sReS</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>/home/jovyan/workspace/captchas/dataset4/hM89J...</td>\n",
       "      <td>hM89JvG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>/home/jovyan/workspace/captchas/dataset4/6VMeA...</td>\n",
       "      <td>6VMeACE</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                           file_name     text\n",
       "0  /home/jovyan/workspace/captchas/dataset4/VMyk4...  VMyk4Dc\n",
       "1  /home/jovyan/workspace/captchas/dataset4/WCYaD...  WCYaDHH\n",
       "2  /home/jovyan/workspace/captchas/dataset4/FP8sR...  FP8sReS\n",
       "3  /home/jovyan/workspace/captchas/dataset4/hM89J...  hM89JvG\n",
       "4  /home/jovyan/workspace/captchas/dataset4/6VMeA...  6VMeACE"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "df = pd.DataFrame(images, columns=['file_name'])\n",
    "# The label is the file name without its extension, e.g. 'dir/abc.png' -> 'abc'\n",
    "df['text'] = df['file_name'].map(lambda x: re.search(r'.*/(.*)\\.png$', x).group(1))\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T16:40:01.052879Z",
     "iopub.status.busy": "2023-07-02T16:40:01.052507Z",
     "iopub.status.idle": "2023-07-02T16:40:01.082474Z",
     "shell.execute_reply": "2023-07-02T16:40:01.081807Z",
     "shell.execute_reply.started": "2023-07-02T16:40:01.052854Z"
    },
    "id": "pK3lbuuVRThP",
    "tags": []
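   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Hold out 20% of the images for testing\n",
    "train, test = train_test_split(df, test_size=0.2)\n",
    "# reset_index() relabels each split 0..n-1 and keeps the old labels in an 'index' column\n",
    "train = train.reset_index()\n",
    "test = test.reset_index()"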
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 424
    },
    "execution": {
     "iopub.execute_input": "2023-07-02T16:40:04.454583Z",
     "iopub.status.busy": "2023-07-02T16:40:04.454171Z",
     "iopub.status.idle": "2023-07-02T16:40:04.466088Z",
     "shell.execute_reply": "2023-07-02T16:40:04.465430Z",
     "shell.execute_reply.started": "2023-07-02T16:40:04.454558Z"
    },
    "id": "PVwUn2rOZVlY",
    "outputId": "91a4b500-a27c-4bf7-98e0-87e543c99bd3",
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>file_name</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>33</td>\n",
       "      <td>/home/jovyan/workspace/captchas/dataset4/TtErH...</td>\n",
       "      <td>TtErHy3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>790</td>\n",
       "      <td>/home/jovyan/workspace/captchas/dataset4/wyUEp...</td>\n",
       "      <td>wyUEpjM</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>63</td>\n",
       "      <td>/home/jovyan/workspace/captchas/dataset4/wsndT...</td>\n",
       "      <td>wsndTMe</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>362</td>\n",
       "      <td>/home/jovyan/workspace/captchas/dataset4/rjUU4...</td>\n",
       "      <td>rjUU4Ru</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>519</td>\n",
       "      <td>/home/jovyan/workspace/captchas/dataset4/GrVPa...</td>\n",
       "      <td>GrVPa9Y</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>917</th>\n",
       "      <td>851</td>\n",
       "      <td>/home/jovyan/workspace/captchas/dataset4/Hwbva...</td>\n",
       "      <td>Hwbvav4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>918</th>\n",
       "      <td>10</td>\n",
       "      <td>/home/jovyan/workspace/captchas/dataset4/TCesP...</td>\n",
       "      <td>TCesPS4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>919</th>\n",
       "      <td>754</td>\n",
       "      <td>/home/jovyan/workspace/captchas/dataset4/Estmq...</td>\n",
       "      <td>Estmq6S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>920</th>\n",
       "      <td>477</td>\n",
       "      <td>/home/jovyan/workspace/captchas/dataset4/pUQyG...</td>\n",
       "      <td>pUQyGSe</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>921</th>\n",
       "      <td>40</td>\n",
       "      <td>/home/jovyan/workspace/captchas/dataset4/KCM3s...</td>\n",
       "      <td>KCM3s9h</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>922 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     index                                          file_name     text\n",
       "0       33  /home/jovyan/workspace/captchas/dataset4/TtErH...  TtErHy3\n",
       "1      790  /home/jovyan/workspace/captchas/dataset4/wyUEp...  wyUEpjM\n",
       "2       63  /home/jovyan/workspace/captchas/dataset4/wsndT...  wsndTMe\n",
       "3      362  /home/jovyan/workspace/captchas/dataset4/rjUU4...  rjUU4Ru\n",
       "4      519  /home/jovyan/workspace/captchas/dataset4/GrVPa...  GrVPa9Y\n",
       "..     ...                                                ...      ...\n",
       "917    851  /home/jovyan/workspace/captchas/dataset4/Hwbva...  Hwbvav4\n",
       "918     10  /home/jovyan/workspace/captchas/dataset4/TCesP...  TCesPS4\n",
       "919    754  /home/jovyan/workspace/captchas/dataset4/Estmq...  Estmq6S\n",
       "920    477  /home/jovyan/workspace/captchas/dataset4/pUQyG...  pUQyGSe\n",
       "921     40  /home/jovyan/workspace/captchas/dataset4/KCM3s...  KCM3s9h\n",
       "\n",
       "[922 rows x 3 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T16:40:09.619510Z",
     "iopub.status.busy": "2023-07-02T16:40:09.619106Z",
     "iopub.status.idle": "2023-07-02T16:40:09.624809Z",
     "shell.execute_reply": "2023-07-02T16:40:09.623938Z",
     "shell.execute_reply.started": "2023-07-02T16:40:09.619481Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "def tokenize(file_path, text, processor, max_target_length=128):\n",
    "    # prepare image (i.e. resize + normalize)\n",
    "    image = Image.open(file_path).convert(\"RGB\")\n",
    "    pixel_values = processor(image, return_tensors=\"pt\").pixel_values\n",
    "    # add labels (input_ids) by encoding the text\n",
    "    labels = processor.tokenizer(text, padding=\"max_length\", max_length=max_target_length).input_ids\n",
    "    # important: make sure that PAD tokens are ignored by the loss function\n",
    "    labels = [label if label != processor.tokenizer.pad_token_id\n",
    "              else -100 for label in labels]\n",
    "\n",
    "    encoding = {\"pixel_values\": pixel_values.squeeze(), \"labels\": torch.tensor(labels)}\n",
    "    return encoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T16:40:36.770297Z",
     "iopub.status.busy": "2023-07-02T16:40:36.769909Z",
     "iopub.status.idle": "2023-07-02T16:40:36.776311Z",
     "shell.execute_reply": "2023-07-02T16:40:36.775675Z",
     "shell.execute_reply.started": "2023-07-02T16:40:36.770274Z"
    },
    "id": "CS86kRsJPBdX",
    "tags": []
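   },
   "outputs": [],
   "source": [
    "class Captcha_Dataset(Dataset):\n",
    "    \"\"\"Returns a dict with 'pixel_values' and 'labels' for each row of a DataFrame with 'file_name' and 'text' columns.\"\"\"\n",
    "\n",
    "    def __init__(self, df, processor, max_target_length=128):\n",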
    "        self.df = df\n",
    "        self.processor = processor\n",
    "        self.max_target_length = max_target_length\n",
    "\n",
    "    def __len__(self):\n",
    "        return len(self.df)\n",
    "\n",
    "    def __getitem__(self, idx):\n",
    "        # get file name + text\n",
    "        file_path = self.df['file_name'][idx]\n",
    "        text = self.df['text'][idx]\n",
    "        # prepare image (i.e. resize + normalize)\n",
    "        image = Image.open(file_path).convert(\"RGB\")\n",
    "        pixel_values = self.processor(image, return_tensors=\"pt\").pixel_values\n",
    "        # add labels (input_ids) by encoding the text\n",
    "        labels = self.processor.tokenizer(text, padding=\"max_length\", max_length=self.max_target_length).input_ids\n",
    "        # important: make sure that PAD tokens are ignored by the loss function\n",
    "        labels = [label if label != self.processor.tokenizer.pad_token_id\n",
    "                  else -100 for label in labels]\n",
    "\n",
    "        encoding = {\"pixel_values\" : pixel_values.squeeze(), \"labels\" : torch.tensor(labels)}\n",
    "        return encoding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qMuiii6UPBdY"
   },
   "source": [
    "##### Basic Values/Constants"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T16:41:43.503277Z",
     "iopub.status.busy": "2023-07-02T16:41:43.502894Z",
     "iopub.status.idle": "2023-07-02T16:41:43.506967Z",
     "shell.execute_reply": "2023-07-02T16:41:43.506238Z",
     "shell.execute_reply.started": "2023-07-02T16:41:43.503252Z"
    },
    "id": "Pz10NsdfPBdZ",
    "tags": []
   },
   "outputs": [],
   "source": [
    "MODEL_CKPT = \"microsoft/trocr-base-printed\"\n",
    "MODEL_NAME =  MODEL_CKPT.split(\"/\")[-1] + \"_captcha_ocr\"\n",
    "NUM_OF_EPOCHS = 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mHdN01enPBdZ"
   },
   "source": [
    "##### Instantiate Processor & Create Training and Testing Dataset Instances"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2023-07-02T16:41:45.762132Z",
     "iopub.status.busy": "2023-07-02T16:41:45.761743Z",
     "iopub.status.idle": "2023-07-02T16:41:45.951471Z",
     "shell.execute_reply": "2023-07-02T16:41:45.950844Z",
     "shell.execute_reply.started": "2023-07-02T16:41:45.762108Z"
    },
    "id": "Vo6_X-rjPBdZ",
    "outputId": "9b427a64-daa6-424a-e45b-e61dea91487a",
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.\n"
     ]
    }
   ],
   "source": [
    "processor = TrOCRProcessor.from_pretrained(MODEL_CKPT)\n",
    "train_ds = Captcha_Dataset(df=train, processor=processor)\n",
    "test_ds = Captcha_Dataset(df=test, processor=processor)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lx7hfHVGPBda"
   },
   "source": [
    "##### Print Length of Training & Testing Datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2023-07-02T16:41:48.729007Z",
     "iopub.status.busy": "2023-07-02T16:41:48.728622Z",
     "iopub.status.idle": "2023-07-02T16:41:48.733380Z",
     "shell.execute_reply": "2023-07-02T16:41:48.732602Z",
     "shell.execute_reply.started": "2023-07-02T16:41:48.728985Z"
    },
    "id": "wgHpCqKfPBdb",
    "outputId": "106f80b7-f3a7-4efe-eec2-92df29061702",
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The training dataset has 922 samples in it.\n",
      "The testing dataset has 231 samples in it.\n"
     ]
    }
   ],
   "source": [
    "print(f\"The training dataset has {len(train_ds)} samples in it.\")\n",
    "print(f\"The testing dataset has {len(test_ds)} samples in it.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "rUAIaTqwPBdb"
   },
   "source": [
    "##### Example of Input Data Shapes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2023-07-02T16:41:59.911117Z",
     "iopub.status.busy": "2023-07-02T16:41:59.910734Z",
     "iopub.status.idle": "2023-07-02T16:41:59.930886Z",
     "shell.execute_reply": "2023-07-02T16:41:59.930111Z",
     "shell.execute_reply.started": "2023-07-02T16:41:59.911090Z"
    },
    "id": "FnEUP8PKPBdc",
    "outputId": "cd817977-eeb4-47df-bdda-8c33fc4c18dd",
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pixel_values  :  torch.Size([3, 384, 384])\n",
      "labels  :  torch.Size([128])\n"
     ]
    }
   ],
   "source": [
    "encoding = train_ds[10]\n",
    "\n",
    "for k,v in encoding.items():\n",
    "    print(k, \" : \", v.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "N38-0NPhPBdc"
   },
   "source": [
    "##### Show Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 67
    },
    "execution": {
     "iopub.execute_input": "2023-07-02T16:42:03.356471Z",
     "iopub.status.busy": "2023-07-02T16:42:03.355908Z",
     "iopub.status.idle": "2023-07-02T16:42:03.365782Z",
     "shell.execute_reply": "2023-07-02T16:42:03.365173Z",
     "shell.execute_reply.started": "2023-07-02T16:42:03.356433Z"
    },
    "id": "KIWjkbV9PBdd",
    "outputId": "54a13aeb-18c6-4248-8d42-4818886df3d7",
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAANwAAAAoCAIAAAAaOwPZAAAFO0lEQVR4nO1c25GjOhDV3toAyGCUgZUBZMBkQAiE4HIIDoEQyAAykDPQZNAZzP045S4tM8aSwI1mV+cLKEs03Uf9ksq/Pj8/VUFBTvjvaAEKCpYopCzIDoWUBdmhkLIgOxRSFmSHQsqC7FBIWZAdCikLskMhZUF2+C32JmstLrTWVVWJvbfgx0GIlNZaIhrHkYjqujbGZEVNIsJFPiI555xzSqmqqowxB0sjCzlP6ZwjIih6nue6rsHLQ9gJSZiLkKppGpUHLyGetXae577vrbUyvFyo5SjrCJFyoVN2A1prYXbCZzvn5nmGDEBd19Za8FIAnMw8YptzbhxHpdTlcjmfz6+Wx18GrBatddu2TdMI81LOUyJkG2OstfhsIrLWWmvByPCw/tSiKwOJaBgGn44MrXXUbMngZKZt22+9IBQiI4y6M3Kaptvt5j/nhSHMSzlSKk/XPjPwBNe3263runVePrXoOsZxZEYaY9q29WWTUT0RXS4XpZS1drsXTF6iC5FwUdd1XddKqev1yqEc3mSLkFEQJaW6Z2y8IrXWfd8rpeZ5xhOoeEUFC4tG8VJrfTqdpmlSd0buW0Zw0ixmwo1LFOAF2TQNfIS1tu/76/Wq7sXAnkI/gzQpOWQ758BIrEJjDCxqjFlxV5yJYzjyrXBj+O5wF/aAE3zrnIOzlyxNdnG6i2yhaZppmo6q+aRJ6VcYoAhYCBPCwCu6gNX5Wmu90UlsARgJTgBg/HZhmOhElMwMebe9F6R3dFhT6m5CX+mL22/x9vbm38LvJsuDUJ4G30vxw9PpBMefLAOCyTAMfliIDaBENE3TMAwL8cJFWpgpdoYtkPaUAIIFmkFRA2Hsj48P1poxpmmaWGcAGyP6MyfS8kuYnIj6vg/sHiBEYOBCBnUPJtyjwG0U0ZmRVVUNw9B1XfhAXgZIKNEVChy+F0RJ6S96aDlhCUJN1lqYv23bKGYzITgrxXPUPQmRF28HIwPHov/FbPYTAJ7Q10y4p8TP2EcSUbh+mIt4HaYiItSgkv5SlJRY9LhOcJMAq+brRQh8QuwCvD2qacL11iOqoS/jpzqBwO/RXFRKdV0X3mJ0zoHKLKS65xJEpLUW46UcKbnuVsftX/mv5iXBReuBu8x+4cy+nJsygYDvZ7easBnTdR2He9SjcJnwoIgG4bMlQ3Tvm91kDqcxIADYkLa76KeGsF/sR63IAPMHTggWDsPQti2UjAUWJQ8227B5gVukFsxLsW6lXPXtB6Pk2P0KJO93IxPA9WInfbsMgXxilw/ScPyNTbUBTvSZ0+fzOaQlsi+ESOkvtQNj977gD6mq6hWOBPOvOyp/YfCqiEolV3BUMiNEylfEbrZWWtXCw7kdg6x3mqZpmnashL5FLOEeeeLFCtda78VIde9WciUuBukDGWqP2O23dZCAq/hlDRtjOHjJu03J7aFw+E2AR23IqJCCaLuRkV93TbnS+jtbQl3XjeOIDY+Nn8cdXczDHZCoHXDOcf0+iBiqP3fhH7kiZHXYSv32B9xwJaJdGDmOox8lqjtwSiFt5ljIHfJFZaf2yFSqqkJlgHNoaKGFD+ew+NV+2jt0HCKnT5pXOJIVvZF3LFftl0cuoGPOue6FXz/3rwAXxVMU1zmVXLA5oQjbeOKYCZ0gP6fpe50PX4RvdVBV+oNJuRH07ESSANIIja1tjhJ1Xb+/v/8F3QzGMQcyckAOVkzOZLTWp9NJKYXYmsO37Ih/11P+aPg7ET/uuORTFFIWZIfyty0F2aGQsiA7FFIWZIdCyoLs8D9FlV0D7PxKyQAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<PIL.Image.Image image mode=RGB size=220x40>"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Image.open(train['file_name'][0]).convert(\"RGB\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "O7nOrwBpPBdd"
   },
   "source": [
    "##### Show Decoded Label for the Encoded Sample (`train_ds[10]`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2023-07-02T16:42:05.806458Z",
     "iopub.status.busy": "2023-07-02T16:42:05.806065Z",
     "iopub.status.idle": "2023-07-02T16:42:05.811038Z",
     "shell.execute_reply": "2023-07-02T16:42:05.810402Z",
     "shell.execute_reply.started": "2023-07-02T16:42:05.806432Z"
    },
    "id": "tjOyc1ksPBde",
    "outputId": "2771a745-dc3f-4058-f1fe-c27d28e57ffa",
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "vy26xDW\n"
     ]
    }
   ],
   "source": [
    "labels = encoding['labels']\n",
    "labels[labels == -100] = processor.tokenizer.pad_token_id\n",
    "label_str = processor.decode(labels, skip_special_tokens=True)\n",
    "print(label_str)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Sc3E_5KCPBde"
   },
   "source": [
    "##### Instantiate Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2023-07-02T16:42:08.592330Z",
     "iopub.status.busy": "2023-07-02T16:42:08.591936Z",
     "iopub.status.idle": "2023-07-02T16:42:15.226651Z",
     "shell.execute_reply": "2023-07-02T16:42:15.225927Z",
     "shell.execute_reply.started": "2023-07-02T16:42:08.592307Z"
    },
    "id": "giCycPYSPBdf",
    "outputId": "0503d57a-9aaa-412c-afdc-60de94e1e7cb",
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some weights of VisionEncoderDecoderModel were not initialized from the model checkpoint at microsoft/trocr-base-printed and are newly initialized: ['encoder.pooler.dense.weight', 'encoder.pooler.dense.bias']\n",
      "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
     ]
    }
   ],
   "source": [
    "model = VisionEncoderDecoderModel.from_pretrained(MODEL_CKPT)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-rkOX7UEPBdf"
   },
   "source": [
    "##### Model Configuration Modifications"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T16:42:15.228189Z",
     "iopub.status.busy": "2023-07-02T16:42:15.227891Z",
     "iopub.status.idle": "2023-07-02T16:42:15.233150Z",
     "shell.execute_reply": "2023-07-02T16:42:15.232303Z",
     "shell.execute_reply.started": "2023-07-02T16:42:15.228166Z"
    },
    "id": "BCimiZB2PBdg",
    "tags": []
   },
   "outputs": [],
   "source": [
    "model.config.decoder_start_token_id = processor.tokenizer.cls_token_id\n",
    "model.config.pad_token_id = processor.tokenizer.pad_token_id\n",
    "\n",
    "model.config.vocab_size = model.config.decoder.vocab_size\n",
    "\n",
    "model.config.eos_token_id = processor.tokenizer.sep_token_id\n",
    "model.config.max_length = 64\n",
    "model.config.early_stopping = True\n",
    "model.config.no_repeat_ngram_size = 3\n",
    "model.config.length_penalty = 2.0\n",
    "model.config.num_beams = 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TdujhGLEPBdh"
   },
   "source": [
    "##### Define Evaluation Metric"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T16:42:17.957734Z",
     "iopub.status.busy": "2023-07-02T16:42:17.957346Z",
     "iopub.status.idle": "2023-07-02T16:42:19.107223Z",
     "shell.execute_reply": "2023-07-02T16:42:19.106630Z",
     "shell.execute_reply.started": "2023-07-02T16:42:17.957710Z"
    },
    "id": "cSRjcoBVPBdh",
    "tags": []
   },
   "outputs": [],
   "source": [
    "cer_metric = evaluate.load(\"cer\")\n",
    "\n",
    "def compute_metrics(pred):\n",
    "    label_ids = pred.label_ids\n",
    "    pred_ids = pred.predictions\n",
    "\n",
    "    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)\n",
    "    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id\n",
    "    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)\n",
    "\n",
    "    cer = cer_metric.compute(predictions=pred_str, references=label_str)\n",
    "\n",
    "    return {\"cer\" : cer}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "96ooUEZBPBdh"
   },
   "source": [
    "##### Define Training Arguments"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T16:42:21.819117Z",
     "iopub.status.busy": "2023-07-02T16:42:21.818715Z",
     "iopub.status.idle": "2023-07-02T16:42:21.825251Z",
     "shell.execute_reply": "2023-07-02T16:42:21.824571Z",
     "shell.execute_reply.started": "2023-07-02T16:42:21.819072Z"
    },
    "id": "FEEailluPBdi",
    "tags": []
   },
   "outputs": [],
   "source": [
    "args = Seq2SeqTrainingArguments(\n",
    "    output_dir = MODEL_NAME,\n",
    "    num_train_epochs=NUM_OF_EPOCHS,\n",
    "    predict_with_generate=True,\n",
    "    evaluation_strategy=\"epoch\",\n",
    "    save_strategy=\"steps\",  # save by step count instead of per epoch\n",
    "    save_steps=1_000_000,  # far more steps than training runs, so no intermediate checkpoints are written\n",
    "    per_device_train_batch_size=8,\n",
    "    per_device_eval_batch_size=8,\n",
    "    logging_first_step=True,\n",
    "    hub_private_repo=False,\n",
    "    push_to_hub=False\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OnJTMfK9PBdj"
   },
   "source": [
    "##### Define Trainer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2023-07-02T16:42:26.816215Z",
     "iopub.status.busy": "2023-07-02T16:42:26.815829Z",
     "iopub.status.idle": "2023-07-02T16:42:27.179177Z",
     "shell.execute_reply": "2023-07-02T16:42:27.178432Z",
     "shell.execute_reply.started": "2023-07-02T16:42:26.816192Z"
    },
    "id": "ukKLfcRLPBdj",
    "outputId": "0913bf3b-0e6c-4e78-e6d5-a87864dc5b42",
    "tags": []
   },
   "outputs": [],
   "source": [
    "trainer = Seq2SeqTrainer(\n",
    "    model=model,\n",
    "    tokenizer=processor.feature_extractor,\n",
    "    args=args,\n",
    "    compute_metrics=compute_metrics,\n",
    "    train_dataset=train_ds,\n",
    "    eval_dataset=test_ds,\n",
    "    data_collator=default_data_collator\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pm6MBc4mPBdk"
   },
   "source": [
    "##### Fit/Train Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 134
    },
    "execution": {
     "iopub.execute_input": "2023-07-02T16:42:30.324513Z",
     "iopub.status.busy": "2023-07-02T16:42:30.324129Z",
     "iopub.status.idle": "2023-07-02T17:06:12.146679Z",
     "shell.execute_reply": "2023-07-02T17:06:12.146045Z",
     "shell.execute_reply.started": "2023-07-02T16:42:30.324486Z"
    },
    "id": "RdwPyzCDPBdl",
    "outputId": "a1406ff0-bb27-4322-9362-83dcf6af14e7",
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/saturncloud/envs/saturn/lib/python3.9/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "\n",
       "    <div>\n",
       "      \n",
       "      <progress value='580' max='580' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      [580/580 23:39, Epoch 5/5]\n",
       "    </div>\n",
       "    <table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       " <tr style=\"text-align: left;\">\n",
       "      <th>Epoch</th>\n",
       "      <th>Training Loss</th>\n",
       "      <th>Validation Loss</th>\n",
       "      <th>Cer</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>11.165900</td>\n",
       "      <td>1.063115</td>\n",
       "      <td>0.163884</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>11.165900</td>\n",
       "      <td>0.590734</td>\n",
       "      <td>0.097712</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>11.165900</td>\n",
       "      <td>0.485241</td>\n",
       "      <td>0.053803</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>11.165900</td>\n",
       "      <td>0.321299</td>\n",
       "      <td>0.032158</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>5</td>\n",
       "      <td>0.438800</td>\n",
       "      <td>0.294638</td>\n",
       "      <td>0.028448</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table><p>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/saturncloud/envs/saturn/lib/python3.9/site-packages/transformers/generation/utils.py:1259: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "TrainOutput(global_step=580, training_loss=0.40041021108627317, metrics={'train_runtime': 1421.6668, 'train_samples_per_second': 3.243, 'train_steps_per_second': 0.408, 'total_flos': 3.4495947304914125e+18, 'train_loss': 0.40041021108627317, 'epoch': 5.0})"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.train()"
   ]
  },
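  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Sanity Check: Step Counts\n",
    "\n",
    "As a quick check on the progress counters, the step counts can be reproduced from the dataset sizes and batch size shown in this notebook. This is a sketch that assumes a single device and no gradient accumulation (the `Seq2SeqTrainingArguments` defaults); $N_{\\text{train}}$, $N_{\\text{test}}$, $B$ and $E$ are only shorthand for the values set earlier.\n",
    "\n",
    "$$\\text{training steps} = \\left\\lceil \\frac{N_{\\text{train}}}{B} \\right\\rceil \\times E = \\left\\lceil \\frac{922}{8} \\right\\rceil \\times 5 = 116 \\times 5 = 580$$\n",
    "\n",
    "$$\\text{evaluation steps} = \\left\\lceil \\frac{N_{\\text{test}}}{B} \\right\\rceil = \\left\\lceil \\frac{231}{8} \\right\\rceil = 29$$\n",
    "\n",
    "Both values agree with the `580/580` and `29/29` counters reported by the trainer in this notebook."
   ]
  },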
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T17:10:16.834072Z",
     "iopub.status.busy": "2023-07-02T17:10:16.833664Z",
     "iopub.status.idle": "2023-07-02T17:10:17.821400Z",
     "shell.execute_reply": "2023-07-02T17:10:17.820510Z",
     "shell.execute_reply.started": "2023-07-02T17:10:16.834044Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "! rm -rf xtrocr-base-printed_captcha_ocr2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jjAvL7blPBdl"
   },
   "source": [
    "##### Save Model & Model State"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T17:10:20.149589Z",
     "iopub.status.busy": "2023-07-02T17:10:20.149176Z",
     "iopub.status.idle": "2023-07-02T17:10:22.081015Z",
     "shell.execute_reply": "2023-07-02T17:10:22.080412Z",
     "shell.execute_reply.started": "2023-07-02T17:10:20.149561Z"
    },
    "id": "CAL-qtqePBdl",
    "tags": []
   },
   "outputs": [],
   "source": [
    "trainer.save_model(f'/home/jovyan/workspace/x{MODEL_NAME}')\n",
    "trainer.save_state()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "gqcy34rdPBdn"
   },
   "source": [
    "##### Evaluate Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T17:10:25.211492Z",
     "iopub.status.busy": "2023-07-02T17:10:25.211105Z",
     "iopub.status.idle": "2023-07-02T17:11:50.250586Z",
     "shell.execute_reply": "2023-07-02T17:11:50.249820Z",
     "shell.execute_reply.started": "2023-07-02T17:10:25.211461Z"
    },
    "id": "Vh-Hi13zPBdn",
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "    <div>\n",
       "      \n",
       "      <progress value='29' max='29' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      [29/29 01:21]\n",
       "    </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "{'eval_loss': 0.29463836550712585,\n",
       " 'eval_cer': 0.02844774273345702,\n",
       " 'eval_runtime': 85.0328,\n",
       " 'eval_samples_per_second': 2.717,\n",
       " 'eval_steps_per_second': 0.341,\n",
       " 'epoch': 5.0}"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.evaluate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T17:29:33.805483Z",
     "iopub.status.busy": "2023-07-02T17:29:33.805113Z",
     "iopub.status.idle": "2023-07-02T17:29:34.704793Z",
     "shell.execute_reply": "2023-07-02T17:29:34.703925Z",
     "shell.execute_reply.started": "2023-07-02T17:29:33.805459Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/jovyan/workspace\n"
     ]
    }
   ],
   "source": [
    "! pwd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T17:29:36.103773Z",
     "iopub.status.busy": "2023-07-02T17:29:36.103347Z",
     "iopub.status.idle": "2023-07-02T17:29:42.509355Z",
     "shell.execute_reply": "2023-07-02T17:29:42.508568Z",
     "shell.execute_reply.started": "2023-07-02T17:29:36.103745Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from transformers import VisionEncoderDecoderModel\n",
    "\n",
    "model = VisionEncoderDecoderModel.from_pretrained(f'/home/jovyan/workspace/x{MODEL_NAME}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T17:29:42.511546Z",
     "iopub.status.busy": "2023-07-02T17:29:42.511120Z",
     "iopub.status.idle": "2023-07-02T17:29:42.517343Z",
     "shell.execute_reply": "2023-07-02T17:29:42.516488Z",
     "shell.execute_reply.started": "2023-07-02T17:29:42.511512Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'TtErHy3'"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train['text'][0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T17:29:44.621137Z",
     "iopub.status.busy": "2023-07-02T17:29:44.620754Z",
     "iopub.status.idle": "2023-07-02T17:29:44.950276Z",
     "shell.execute_reply": "2023-07-02T17:29:44.949560Z",
     "shell.execute_reply.started": "2023-07-02T17:29:44.621109Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import cv2\n",
    "\n",
    "def preprocessing(image_path):\n",
    "    # Load as single-channel grayscale, then stack to 3 channels for the processor.\n",
    "    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)\n",
    "    # Disabled thresholding / morphology variants, kept for reference\n",
    "    # (the last one would also need a `kernel` to be defined):\n",
    "    # _, image = cv2.threshold(image, 230, 255, cv2.THRESH_BINARY)\n",
    "    # ret, image = cv2.threshold(cv2.GaussianBlur(image, (3,3), 0), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)\n",
    "    # ret, image = cv2.threshold(cv2.GaussianBlur(image, (5,5), 0), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)\n",
    "    # image = cv2.dilate(cv2.erode(cv2.dilate(image, kernel, iterations=1), kernel), kernel)\n",
    "    return processor(cv2.merge([image, image, image]), return_tensors=\"pt\").pixel_values\n",
    "\n",
    "def solve_captcha(file_path):\n",
    "    generated_ids = model.generate(preprocessing(file_path))\n",
    "    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]\n",
    "    return generated_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T17:46:35.049756Z",
     "iopub.status.busy": "2023-07-02T17:46:35.049363Z",
     "iopub.status.idle": "2023-07-02T19:29:16.001986Z",
     "shell.execute_reply": "2023-07-02T19:29:16.001278Z",
     "shell.execute_reply.started": "2023-07-02T17:46:35.049731Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "predicted_values = [solve_captcha(img_file) for img_file in df['file_name']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T19:53:42.196913Z",
     "iopub.status.busy": "2023-07-02T19:53:42.196492Z",
     "iopub.status.idle": "2023-07-02T19:53:42.216670Z",
     "shell.execute_reply": "2023-07-02T19:53:42.215702Z",
     "shell.execute_reply.started": "2023-07-02T19:53:42.196881Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('utPHmUu', 'utPHUu'),\n",
       " ('quWhtpt', 'quWhtPt'),\n",
       " ('hvv2JJK', 'hVv2JK'),\n",
       " ('vqpqpq4', 'VqPqpq4'),\n",
       " ('GypHTKU', 'GyPHTKU'),\n",
       " ('aXBKaaX', 'aXBKaX'),\n",
       " ('C4hpRrp', 'G4hPRrP'),\n",
       " ('pL4UEsv', 'PL4UEsv'),\n",
       " ('f2wuxFu', 'f2WuxFu'),\n",
       " ('QePx5cC', 'QePxcC'),\n",
       " ('s6M2pWW', 's66M2pWW'),\n",
       " ('exfCW4M', 'exfCWAM'),\n",
       " ('CCrJjCf', 'CCrJJCf'),\n",
       " ('UMWaxF8', 'UMWAxF8'),\n",
       " ('t9tnnkr', 't9tnKr'),\n",
       " ('VMHhjHK', 'VMHjHK'),\n",
       " ('aeqrxnS', 'aeqrxS'),\n",
       " ('Ecnmhyj', 'Ecnhnyj'),\n",
       " ('K2WQY4W', 'k2wQY4w'),\n",
       " ('sd6baaX', 'sd6baX'),\n",
       " ('pje2kc2', 'Pje2kc2'),\n",
       " ('QvaVv6k', 'QvaVvvv6k'),\n",
       " ('fYrHLKe', 'fYrHLke'),\n",
       " ('qsCGPWU', 'qsCGPwU'),\n",
       " ('yb6wvjp', 'yb6wyjP'),\n",
       " ('vQ7EKPv', 'vQ7EKpv'),\n",
       " ('8eewnfL', '8ewnfL'),\n",
       " ('e64wsQK', 'e64wSQK'),\n",
       " ('yqdEkAp', 'yqdEkap'),\n",
       " ('yRVCEmH', 'yRVCEmh'),\n",
       " ('JpqkKHJ', 'JPqkKHJ'),\n",
       " ('JYcknbj', 'JYcksnbj'),\n",
       " ('SSHD6P2', 'sSHD6P2'),\n",
       " ('d8yxwxF', 'd8yxF'),\n",
       " ('mvTMBu7', 'nvTMBu7'),\n",
       " ('pDVhkAa', 'pDVAkAa'),\n",
       " ('8LM2EBH', '8LM2EEBH'),\n",
       " ('qpexxeH', 'qpexceH'),\n",
       " ('qJpVPSh', 'qJPVPSh'),\n",
       " ('xGfxcHB', 'xGfxchB'),\n",
       " ('TXsqhMG', 'TXsqMG'),\n",
       " ('aVaUaeP', 'aVaUaep'),\n",
       " ('wmsBbEb', 'WmsBbEb'),\n",
       " ('E5Mqdun', 'ESMqdun'),\n",
       " ('jf99937', 'jf9937'),\n",
       " ('jwv7ywf', 'jWV7Ywf')]"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Collect (ground truth, prediction) pairs where the prediction is not an exact match.\n",
    "mismatches = [(truth, pred) for truth, pred in zip(df['text'], predicted_values) if truth != pred]\n",
    "mismatches"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-02T19:52:08.024581Z",
     "iopub.status.busy": "2023-07-02T19:52:08.024178Z",
     "iopub.status.idle": "2023-07-02T19:52:08.029239Z",
     "shell.execute_reply": "2023-07-02T19:52:08.028504Z",
     "shell.execute_reply.started": "2023-07-02T19:52:08.024556Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('jwv7ywf', 'jWV7Ywf')"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['text'][1117],predicted_values[1117]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-29T22:13:56.023823Z",
     "iopub.status.busy": "2023-06-29T22:13:56.023442Z",
     "iopub.status.idle": "2023-06-29T22:13:56.029904Z",
     "shell.execute_reply": "2023-06-29T22:13:56.028956Z",
     "shell.execute_reply.started": "2023-06-29T22:13:56.023798Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAIcAAAAmCAAAAADLzNedAAAEzUlEQVR4nO2XwXXrSA5Fb8/x/r8MhAyEDMgM1BmoM3AIPg7BGfTPwMyAzADOAMoAE8HMokiKkixLi5lZDTb2oQpVFyjUK9QfAPbX0YD8/XeymPVdb1xbjlNixs5N2885/V0FRptp/V6VwAl230y2tRerorKA5c88RchvRleessjE4AxSWVMgCXyvzWCK+GcCSV/8aC9/xVhEmKiI7eCMw83ojGoRKrxbUlIZpwwwE3tfOCpjArLF5g8weDnuGclBPTklADJBxR2Hee3I8EOv9v8pspDZL/UrcU6fuXHbP+IwwcgIWtC9MxjQN6OtH4meyqIiiwUkCrwzLZhUTp9ZoLZ/1rk/4EB9A+mGBcNNpd3pcEsi6+sAOUUWNUIvKqZI1B/MzwmoiiygWlSy27muOUBdBmMSAOS+RQ7fhNBqtzymiAYClcv6dd62YUnu60OExjFCDqiWgggfvjoT9u3GLDQGBDXKbEYIkadfDTNjimgDn8Tg5R1AFuuXQCHrLrNRuT2oIK8AKmLOIvkZuHpB5UfbFNQf7CmKti8YqdosGoh1WyoLcoLL0L4SID9e+ch5nMWxfYs1qgqP4hsluuHodwZrVKs1Fausdpwr8Y/X2+kqPzirX99fjqgQOSVv8RDk5SgRPlx+NfmeImPKRWW3CQPUtaBXnRHms6Yf02e2/Ayy+ve3Rxi8GOCjUeeUCJlBxhRrrDIdN25Vl1iYTIe2cR5vQ8xnqFWRHmK0+kDHYTd4tAVJioKVQiass+0uV06XIHJ1/byex2E3kOcI3h6DvMyex9Lv5vXnbkgqzreee2fXSpRTZoOey6POGOBZToKhmwh+5MBZwvOda0haSu5hVI4JqD9wGuctuDjWVgHC9PqElm44OKuWm1nB+4KhvLmj6nPIAvXHnmAk4bJeJApKeqV/AmLLsSyRJRzGt/cZA4YB6wyWqzhjGBuGgH1l89wczmp3d4O71sCnOBbrxxnjyJTQci91ttxyQE1dwBQA5HAYz+6/o4Cq97eRnKCF8RSH7EoiAPSGey2dCUxT1SIoYpYQoEn82apUQNq7BPDF17bZ+4nDuouZFutxam2ncqgoWjXpTHuzhFgLn47pYRt0mY97g9ZzV3S/CwzXmuth7ib7buORQ6jA7CgZjj3qDO/Vxz2iE2Due1tqLw8QgHdLCwAUB0aQdXK1G9P50e5yfFculeOYyG1na/KMXcwOG+cWvQmJp+wf935oJ3XzlKiqzKkKIq+Jr94cAG4Ap+cofsqHexY58KcaxojVV5OnSO9Mc7Cyr3tz5LMq9g1H5Xw9Wje3oG6iMgZ1WqYnQ2bsTKa5Tla/2SKvRfZZDpkK1TS3+C0hNaZ3Rk6RzmEfauexAgnJDv3svPpdBpXP9qfnf63LQOd3qB1qLCoyRCUU0pHQHGcVYO2pprrqqmVdxjdV8wyHHMNYW1v1MFbrZQTmr4bhQIzQhCpHCSylq4NhPlj+0PL/wIHjbE/aAtLmbe/Z3oG9gIwCA9gTSJetueRQ5hcfn+S40Rr1KBvG8qqejwgwKil6YVU7sPWFPYMfUNlDPV/sj3/9+POywbeiXzXCCTqnKoyNms4DYuKX/Mk6fcDxI2N7SAqyxDVG6zsuVPa/xPEftbu6/j+2/3Nc2r8BfnyIF4iSw0YAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<PIL.PngImagePlugin.PngImageFile image mode=L size=135x38>"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Image.open(df['file_name'][18])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-29T22:18:23.458106Z",
     "iopub.status.busy": "2023-06-29T22:18:23.457704Z",
     "iopub.status.idle": "2023-06-29T22:18:28.624912Z",
     "shell.execute_reply": "2023-06-29T22:18:28.624253Z",
     "shell.execute_reply.started": "2023-06-29T22:18:23.458083Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/saturncloud/envs/saturn/lib/python3.9/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (64) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'3Vn7kw'"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "solve_captcha('3yvh7KW.png')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JJbH8MtBPBdo"
   },
   "source": [
    "##### Push Model to Hub (My Profile!!!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "63SWFV94PBdo"
   },
   "outputs": [],
   "source": [
    "kwargs = {\n",
    "    \"finetuned_from\" : model.config._name_or_path,\n",
    "    \"tasks\" : \"image-to-text\",\n",
    "    \"tags\" : [\"image-to-text\"],\n",
    "}\n",
    "\n",
    "if args.push_to_hub:\n",
    "    trainer.push_to_hub(\"All Dunn!!!\")\n",
    "else:\n",
    "    trainer.create_model_card(**kwargs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5aSPONY_PBdp"
   },
   "source": [
    "### Notes & Other Takeaways From This Project\n",
    "****\n",
    "- The Character Error Rate (CER) on the test set after 5 epochs was 0.0284. I am pleased with that result.\n",
    "- Context about the metric: zero (0) is a perfect score. Higher is worse, and the score can exceed one (1) when the prediction contains many insertion errors.\n",
    "****"
   ]
  },
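  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the usual definition of CER is sketched below. This is the standard edit-distance formulation rather than anything taken from this notebook's code, and the symbols $S$, $D$, $I$ and $N$ are introduced here only for illustration.\n",
    "\n",
    "$$\\mathrm{CER} = \\frac{S + D + I}{N}$$\n",
    "\n",
    "Here $S$, $D$ and $I$ count the character substitutions, deletions and insertions needed to turn the prediction into the reference, and $N$ is the number of characters in the reference. Because $I$ is not bounded by $N$, the ratio can be greater than 1. As an example, predicting `utPHUu` for the reference `utPHmUu` (one of the mismatches listed earlier) is one deletion over seven reference characters, so $\\mathrm{CER} = 1/7 \\approx 0.14$ for that sample."
   ]
  },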
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SKuM2BqePBdp"
   },
   "source": [
    "### Citations\n",
    "\n",
    "##### For Transformer Checkpoint\n",
    "- @misc{li2021trocr,\n",
    "      title={TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models},\n",
    "      author={Minghao Li and Tengchao Lv and Lei Cui and Yijuan Lu and Dinei Florencio and Cha Zhang and Zhoujun Li and Furu Wei},\n",
    "      year={2021},\n",
    "      eprint={2109.10282},\n",
    "      archivePrefix={arXiv},\n",
    "      primaryClass={cs.CL}\n",
    "}\n",
    "\n",
    "##### For CER Metric\n",
    "- @inproceedings{morris2004,\n",
    "author = {Morris, Andrew and Maier, Viktoria and Green, Phil},\n",
    "year = {2004},\n",
    "month = {01},\n",
    "pages = {},\n",
    "title = {From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition.}\n",
    "}"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "saturn (Python 3)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.15"
  },
  "vscode": {
   "interpreter": {
    "hash": "a52fe47989fdc78fafbb981021cec52a6b82df6453830b9ffbd04250493e6cab"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}