@sanchezcarlosjr
Created August 3, 2023 16:09
OCR captcha.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "JEAmyTyePBdF"
},
"source": [
"## OCR of Captcha Images\n",
"\n",
"Dataset Source: https://www.kaggle.com/datasets/alizahidraja/captcha-data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qOSTkrnkPBdH"
},
"source": [
"##### Install Necessary Libraries"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2023-07-02T16:38:00.419279Z",
"iopub.status.busy": "2023-07-02T16:38:00.418892Z",
"iopub.status.idle": "2023-07-02T16:38:10.993604Z",
"shell.execute_reply": "2023-07-02T16:38:10.992638Z",
"shell.execute_reply.started": "2023-07-02T16:38:00.419255Z"
},
"id": "iAmNUfbePBdI",
"outputId": "16bd7b12-9d8c-4d1a-adb2-29aed21b2d39",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: torch in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (1.11.0)\n",
"Requirement already satisfied: torchvision in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (0.12.0)\n",
"Requirement already satisfied: torchaudio in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (0.11.0)\n",
"Requirement already satisfied: typing_extensions in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from torch) (4.7.0)\n",
"Requirement already satisfied: numpy in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from torchvision) (1.21.6)\n",
"Requirement already satisfied: requests in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from torchvision) (2.31.0)\n",
"Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from torchvision) (9.1.1)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->torchvision) (2.0.12)\n",
"Requirement already satisfied: idna<4,>=2.5 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->torchvision) (3.4)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->torchvision) (1.26.15)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->torchvision) (2023.5.7)\n",
"Note: you may need to restart the kernel to use updated packages.\n",
"\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
"dask-cuda 22.4.0 requires click==8.0.4, but you have click 8.1.3 which is incompatible.\u001b[0m\u001b[31m\n",
"\u001b[0mNote: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install torch torchvision torchaudio\n",
"%pip install -q datasets jiwer"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2023-07-02T16:38:23.527710Z",
"iopub.status.busy": "2023-07-02T16:38:23.527353Z",
"iopub.status.idle": "2023-07-02T16:38:30.048267Z",
"shell.execute_reply": "2023-07-02T16:38:30.047483Z",
"shell.execute_reply.started": "2023-07-02T16:38:23.527669Z"
},
"id": "6NXldfxQPPyf",
"outputId": "156a5d1d-b3c9-4cd0-914b-efa8f8f92924",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: transformers in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (4.30.2)\n",
"Requirement already satisfied: filelock in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (3.12.2)\n",
"Requirement already satisfied: huggingface-hub<1.0,>=0.14.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (0.15.1)\n",
"Requirement already satisfied: numpy>=1.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (1.21.6)\n",
"Requirement already satisfied: packaging>=20.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (23.1)\n",
"Requirement already satisfied: pyyaml>=5.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (6.0)\n",
"Requirement already satisfied: regex!=2019.12.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (2023.6.3)\n",
"Requirement already satisfied: requests in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (2.31.0)\n",
"Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (0.13.3)\n",
"Requirement already satisfied: safetensors>=0.3.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (0.3.1)\n",
"Requirement already satisfied: tqdm>=4.27 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from transformers) (4.65.0)\n",
"Requirement already satisfied: fsspec in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.14.1->transformers) (2022.3.0)\n",
"Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.14.1->transformers) (4.7.0)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->transformers) (2.0.12)\n",
"Requirement already satisfied: idna<4,>=2.5 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->transformers) (3.4)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->transformers) (1.26.15)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests->transformers) (2023.5.7)\n",
"Requirement already satisfied: accelerate in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (0.20.3)\n",
"Requirement already satisfied: numpy>=1.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from accelerate) (1.21.6)\n",
"Requirement already satisfied: packaging>=20.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from accelerate) (23.1)\n",
"Requirement already satisfied: psutil in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from accelerate) (5.9.5)\n",
"Requirement already satisfied: pyyaml in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from accelerate) (6.0)\n",
"Requirement already satisfied: torch>=1.6.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from accelerate) (1.11.0)\n",
"Requirement already satisfied: typing_extensions in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from torch>=1.6.0->accelerate) (4.7.0)\n"
]
}
],
"source": [
"%pip install transformers\n",
"%pip install accelerate -U"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2023-07-02T16:38:30.050265Z",
"iopub.status.busy": "2023-07-02T16:38:30.049932Z",
"iopub.status.idle": "2023-07-02T16:38:33.677374Z",
"shell.execute_reply": "2023-07-02T16:38:33.676484Z",
"shell.execute_reply.started": "2023-07-02T16:38:30.050235Z"
},
"id": "l_I_Z3R-PfBA",
"outputId": "872df2f1-8599-4913-c792-0bbb80a57b4b",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting evaluate\n",
" Using cached evaluate-0.4.0-py3-none-any.whl (81 kB)\n",
"Requirement already satisfied: datasets>=2.0.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (2.13.1)\n",
"Requirement already satisfied: numpy>=1.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (1.21.6)\n",
"Requirement already satisfied: dill in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (0.3.6)\n",
"Requirement already satisfied: pandas in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (1.4.2)\n",
"Requirement already satisfied: requests>=2.19.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (2.31.0)\n",
"Requirement already satisfied: tqdm>=4.62.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (4.65.0)\n",
"Requirement already satisfied: xxhash in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (3.2.0)\n",
"Requirement already satisfied: multiprocess in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (0.70.14)\n",
"Requirement already satisfied: fsspec[http]>=2021.05.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (2022.3.0)\n",
"Requirement already satisfied: huggingface-hub>=0.7.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (0.15.1)\n",
"Requirement already satisfied: packaging in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from evaluate) (23.1)\n",
"Collecting responses<0.19 (from evaluate)\n",
" Using cached responses-0.18.0-py3-none-any.whl (38 kB)\n",
"Requirement already satisfied: pyarrow>=8.0.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from datasets>=2.0.0->evaluate) (12.0.1)\n",
"Requirement already satisfied: aiohttp in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from datasets>=2.0.0->evaluate) (3.8.4)\n",
"Requirement already satisfied: pyyaml>=5.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from datasets>=2.0.0->evaluate) (6.0)\n",
"Requirement already satisfied: filelock in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from huggingface-hub>=0.7.0->evaluate) (3.12.2)\n",
"Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from huggingface-hub>=0.7.0->evaluate) (4.7.0)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests>=2.19.0->evaluate) (2.0.12)\n",
"Requirement already satisfied: idna<4,>=2.5 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests>=2.19.0->evaluate) (3.4)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests>=2.19.0->evaluate) (1.26.15)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from requests>=2.19.0->evaluate) (2023.5.7)\n",
"Requirement already satisfied: python-dateutil>=2.8.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from pandas->evaluate) (2.8.2)\n",
"Requirement already satisfied: pytz>=2020.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from pandas->evaluate) (2023.3)\n",
"Requirement already satisfied: six>=1.5 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from python-dateutil>=2.8.1->pandas->evaluate) (1.16.0)\n",
"Requirement already satisfied: attrs>=17.3.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (23.1.0)\n",
"Requirement already satisfied: multidict<7.0,>=4.5 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (6.0.4)\n",
"Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (4.0.2)\n",
"Requirement already satisfied: yarl<2.0,>=1.0 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (1.9.2)\n",
"Requirement already satisfied: frozenlist>=1.1.1 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (1.3.3)\n",
"Requirement already satisfied: aiosignal>=1.1.2 in /opt/saturncloud/envs/saturn/lib/python3.9/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (1.3.1)\n",
"Installing collected packages: responses, evaluate\n",
"Successfully installed evaluate-0.4.0 responses-0.18.0\n"
]
}
],
"source": [
"%pip install evaluate"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "E2WGyBdhPBdK"
},
"source": [
"##### Import Necessary Libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T16:38:56.612960Z",
"iopub.status.busy": "2023-07-02T16:38:56.612617Z",
"iopub.status.idle": "2023-07-02T16:39:00.587841Z",
"shell.execute_reply": "2023-07-02T16:39:00.587056Z",
"shell.execute_reply.started": "2023-07-02T16:38:56.612928Z"
},
"id": "e7_XKGqgPBdL",
"tags": []
},
"outputs": [],
"source": [
"import os, sys, itertools\n",
"os.environ['TOKENIZERS_PARALLELISM']='false'\n",
"\n",
"import pandas as pd\n",
"\n",
"from PIL import Image\n",
"\n",
"import torch\n",
"from torch.utils.data import Dataset\n",
"\n",
"import datasets\n",
"from datasets import load_dataset\n",
"\n",
"import transformers\n",
"from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer\n",
"from transformers import VisionEncoderDecoderModel, TrOCRProcessor, default_data_collator\n",
"\n",
"import evaluate"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iv2paOsWPBdM"
},
"source": [
"##### Display Versions of Relevant Software & Libraries"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2023-07-02T16:39:03.132896Z",
"iopub.status.busy": "2023-07-02T16:39:03.132523Z",
"iopub.status.idle": "2023-07-02T16:39:03.138346Z",
"shell.execute_reply": "2023-07-02T16:39:03.137525Z",
"shell.execute_reply.started": "2023-07-02T16:39:03.132871Z"
},
"id": "z4c1UdR6PBdN",
"outputId": "3272a877-2a41-40fc-ba5a-68f9135aeb84",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Python: 3.9.15\n",
" Pandas: 1.4.2\n",
" Datasets: 2.13.1\n",
" Transformers: 4.30.2\n",
" Torch: 1.11.0\n"
]
}
],
"source": [
"print(\"Python:\".rjust(15), sys.version.split()[0])\n",
"print(\"Pandas:\".rjust(15), pd.__version__)\n",
"print(\"Datasets:\".rjust(15), datasets.__version__)\n",
"print(\"Transformers:\".rjust(15), transformers.__version__)\n",
"print(\"Torch:\".rjust(15), torch.__version__)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T16:39:05.051096Z",
"iopub.status.busy": "2023-07-02T16:39:05.050710Z",
"iopub.status.idle": "2023-07-02T16:39:05.840681Z",
"shell.execute_reply": "2023-07-02T16:39:05.839435Z",
"shell.execute_reply.started": "2023-07-02T16:39:05.051074Z"
},
"id": "oTg4CqR5QD1n",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/home/jovyan/workspace\n"
]
}
],
"source": [
"! pwd"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "y4VWEzsHPBdP"
},
"source": [
"##### Ingest & Preprocess Training DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2023-07-02T16:39:53.671763Z",
"iopub.status.busy": "2023-07-02T16:39:53.671379Z",
"iopub.status.idle": "2023-07-02T16:39:53.826411Z",
"shell.execute_reply": "2023-07-02T16:39:53.825668Z",
"shell.execute_reply.started": "2023-07-02T16:39:53.671738Z"
},
"id": "Atcq5U47Pw0m",
"outputId": "0a73f060-abd2-4e5c-d156-3090962328f3",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"1153"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import zipfile\n",
"\n",
"with zipfile.ZipFile('dataset-imss.zip', 'r') as zip_ref:\n",
"    zip_ref.extractall('captchas')\n",
"\n",
"import glob\n",
"images = glob.glob('/home/jovyan/workspace/captchas/dataset4/**/*.png', recursive=True)\n",
"len(images)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"execution": {
"iopub.execute_input": "2023-07-02T16:39:55.929613Z",
"iopub.status.busy": "2023-07-02T16:39:55.929242Z",
"iopub.status.idle": "2023-07-02T16:39:55.955770Z",
"shell.execute_reply": "2023-07-02T16:39:55.955138Z",
"shell.execute_reply.started": "2023-07-02T16:39:55.929587Z"
},
"id": "HRbxg-trPBdR",
"outputId": "ead0b4ba-d387-48bb-b3ba-bfca324c6ca8",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>file_name</th>\n",
" <th>text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>/home/jovyan/workspace/captchas/dataset4/VMyk4...</td>\n",
" <td>VMyk4Dc</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>/home/jovyan/workspace/captchas/dataset4/WCYaD...</td>\n",
" <td>WCYaDHH</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>/home/jovyan/workspace/captchas/dataset4/FP8sR...</td>\n",
" <td>FP8sReS</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>/home/jovyan/workspace/captchas/dataset4/hM89J...</td>\n",
" <td>hM89JvG</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>/home/jovyan/workspace/captchas/dataset4/6VMeA...</td>\n",
" <td>6VMeACE</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" file_name text\n",
"0 /home/jovyan/workspace/captchas/dataset4/VMyk4... VMyk4Dc\n",
"1 /home/jovyan/workspace/captchas/dataset4/WCYaD... WCYaDHH\n",
"2 /home/jovyan/workspace/captchas/dataset4/FP8sR... FP8sReS\n",
"3 /home/jovyan/workspace/captchas/dataset4/hM89J... hM89JvG\n",
"4 /home/jovyan/workspace/captchas/dataset4/6VMeA... 6VMeACE"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import re\n",
"\n",
"df = pd.DataFrame(images, columns=['file_name'])\n",
"df['text'] = df['file_name'].map(lambda x: re.search(r'.*/(.*)\\.png$', x).group(1))\n",
"df.head()"
]
},
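{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the label-extraction step above, assuming each image is named `<captcha text>.png` as the file names in this notebook suggest. It reads the label with `pathlib.Path.stem` rather than a regular expression; the helper name `label_from_path` and the sample path are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"def label_from_path(file_path):\n",
"    # stem = file name without its directory and final extension\n",
"    return Path(file_path).stem\n",
"\n",
"# hypothetical path, for illustration only\n",
"label_from_path('captchas/some_dir/AbC1234.png')"
]
},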
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T16:40:01.052879Z",
"iopub.status.busy": "2023-07-02T16:40:01.052507Z",
"iopub.status.idle": "2023-07-02T16:40:01.082474Z",
"shell.execute_reply": "2023-07-02T16:40:01.081807Z",
"shell.execute_reply.started": "2023-07-02T16:40:01.052854Z"
},
"id": "pK3lbuuVRThP",
"tags": []
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"train, test = train_test_split(df, test_size=0.2)\n",
"train = train.reset_index()\n",
"test = test.reset_index()"
]
},
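{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of a reproducible variant of the split above, assuming a fixed seed is wanted. The toy DataFrame is made up and `random_state=42` is an arbitrary choice. `reset_index(drop=True)` discards the old index instead of keeping it as an extra `index` column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# made-up rows standing in for the real file_name/text table\n",
"toy = pd.DataFrame({'file_name': [f'{i}.png' for i in range(10)],\n",
"                    'text': [str(i) for i in range(10)]})\n",
"\n",
"# a fixed seed makes the shuffle, and therefore the split, repeatable\n",
"toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)\n",
"toy_train = toy_train.reset_index(drop=True)\n",
"toy_test = toy_test.reset_index(drop=True)\n",
"len(toy_train), len(toy_test)"
]
},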
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
},
"execution": {
"iopub.execute_input": "2023-07-02T16:40:04.454583Z",
"iopub.status.busy": "2023-07-02T16:40:04.454171Z",
"iopub.status.idle": "2023-07-02T16:40:04.466088Z",
"shell.execute_reply": "2023-07-02T16:40:04.465430Z",
"shell.execute_reply.started": "2023-07-02T16:40:04.454558Z"
},
"id": "PVwUn2rOZVlY",
"outputId": "91a4b500-a27c-4bf7-98e0-87e543c99bd3",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>index</th>\n",
" <th>file_name</th>\n",
" <th>text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>33</td>\n",
" <td>/home/jovyan/workspace/captchas/dataset4/TtErH...</td>\n",
" <td>TtErHy3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>790</td>\n",
" <td>/home/jovyan/workspace/captchas/dataset4/wyUEp...</td>\n",
" <td>wyUEpjM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>63</td>\n",
" <td>/home/jovyan/workspace/captchas/dataset4/wsndT...</td>\n",
" <td>wsndTMe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>362</td>\n",
" <td>/home/jovyan/workspace/captchas/dataset4/rjUU4...</td>\n",
" <td>rjUU4Ru</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>519</td>\n",
" <td>/home/jovyan/workspace/captchas/dataset4/GrVPa...</td>\n",
" <td>GrVPa9Y</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>917</th>\n",
" <td>851</td>\n",
" <td>/home/jovyan/workspace/captchas/dataset4/Hwbva...</td>\n",
" <td>Hwbvav4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>918</th>\n",
" <td>10</td>\n",
" <td>/home/jovyan/workspace/captchas/dataset4/TCesP...</td>\n",
" <td>TCesPS4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>919</th>\n",
" <td>754</td>\n",
" <td>/home/jovyan/workspace/captchas/dataset4/Estmq...</td>\n",
" <td>Estmq6S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>920</th>\n",
" <td>477</td>\n",
" <td>/home/jovyan/workspace/captchas/dataset4/pUQyG...</td>\n",
" <td>pUQyGSe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>921</th>\n",
" <td>40</td>\n",
" <td>/home/jovyan/workspace/captchas/dataset4/KCM3s...</td>\n",
" <td>KCM3s9h</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>922 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" index file_name text\n",
"0 33 /home/jovyan/workspace/captchas/dataset4/TtErH... TtErHy3\n",
"1 790 /home/jovyan/workspace/captchas/dataset4/wyUEp... wyUEpjM\n",
"2 63 /home/jovyan/workspace/captchas/dataset4/wsndT... wsndTMe\n",
"3 362 /home/jovyan/workspace/captchas/dataset4/rjUU4... rjUU4Ru\n",
"4 519 /home/jovyan/workspace/captchas/dataset4/GrVPa... GrVPa9Y\n",
".. ... ... ...\n",
"917 851 /home/jovyan/workspace/captchas/dataset4/Hwbva... Hwbvav4\n",
"918 10 /home/jovyan/workspace/captchas/dataset4/TCesP... TCesPS4\n",
"919 754 /home/jovyan/workspace/captchas/dataset4/Estmq... Estmq6S\n",
"920 477 /home/jovyan/workspace/captchas/dataset4/pUQyG... pUQyGSe\n",
"921 40 /home/jovyan/workspace/captchas/dataset4/KCM3s... KCM3s9h\n",
"\n",
"[922 rows x 3 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T16:40:09.619510Z",
"iopub.status.busy": "2023-07-02T16:40:09.619106Z",
"iopub.status.idle": "2023-07-02T16:40:09.624809Z",
"shell.execute_reply": "2023-07-02T16:40:09.623938Z",
"shell.execute_reply.started": "2023-07-02T16:40:09.619481Z"
},
"tags": []
},
"outputs": [],
"source": [
"def tokenize(file_path, text, max_target_length=128):\n",
"    # uses the notebook-level `processor`, which is created in a later cell\n",
"    # prepare image (i.e. resize + normalize)\n",
"    image = Image.open(file_path).convert(\"RGB\")\n",
"    pixel_values = processor(image, return_tensors=\"pt\").pixel_values\n",
"    # add labels (input_ids) by encoding the text\n",
"    labels = processor.tokenizer(text, padding=\"max_length\", max_length=max_target_length).input_ids\n",
"    # important: make sure that PAD tokens are ignored by the loss function\n",
"    labels = [label if label != processor.tokenizer.pad_token_id\n",
"              else -100 for label in labels]\n",
"\n",
"    encoding = {\"pixel_values\": pixel_values.squeeze(), \"labels\": torch.tensor(labels)}\n",
"    return encoding"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T16:40:36.770297Z",
"iopub.status.busy": "2023-07-02T16:40:36.769909Z",
"iopub.status.idle": "2023-07-02T16:40:36.776311Z",
"shell.execute_reply": "2023-07-02T16:40:36.775675Z",
"shell.execute_reply.started": "2023-07-02T16:40:36.770274Z"
},
"id": "CS86kRsJPBdX",
"tags": []
},
"outputs": [],
"source": [
"class Captcha_Dataset(Dataset):\n",
"\n",
"    def __init__(self, df, processor, max_target_length=128):\n",
"        self.df = df\n",
"        self.processor = processor\n",
"        self.max_target_length = max_target_length\n",
"\n",
"    def __len__(self):\n",
"        return len(self.df)\n",
"\n",
"    def __getitem__(self, idx):\n",
"        # get file name + text\n",
"        file_path = self.df['file_name'][idx]\n",
"        text = self.df['text'][idx]\n",
"        # prepare image (i.e. resize + normalize)\n",
"        image = Image.open(file_path).convert(\"RGB\")\n",
"        pixel_values = self.processor(image, return_tensors=\"pt\").pixel_values\n",
"        # add labels (input_ids) by encoding the text\n",
"        labels = self.processor.tokenizer(text, padding=\"max_length\", max_length=self.max_target_length).input_ids\n",
"        # important: make sure that PAD tokens are ignored by the loss function\n",
"        labels = [label if label != self.processor.tokenizer.pad_token_id\n",
"                  else -100 for label in labels]\n",
"\n",
"        encoding = {\"pixel_values\": pixel_values.squeeze(), \"labels\": torch.tensor(labels)}\n",
"        return encoding"
]
},
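{
"cell_type": "markdown",
"metadata": {},
"source": [
"A rough sketch of the padding-and-masking step used in `__getitem__` above, written with plain Python lists so it runs without downloading a tokenizer. The token ids, `PAD_ID = 1` and `MAX_LEN = 8` are made-up values for illustration, not the real TrOCR vocabulary or settings. `-100` is the default `ignore_index` of `torch.nn.CrossEntropyLoss`, which is why padded positions are replaced with it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"PAD_ID = 1   # assumed pad token id (illustrative only)\n",
"MAX_LEN = 8  # assumed target length (the notebook uses 128)\n",
"\n",
"def pad_and_mask(token_ids, pad_id=PAD_ID, max_len=MAX_LEN):\n",
"    # pad on the right up to max_len, as padding=\"max_length\" does\n",
"    padded = token_ids + [pad_id] * (max_len - len(token_ids))\n",
"    # replace pad ids with -100 so the loss skips those positions\n",
"    return [tok if tok != pad_id else -100 for tok in padded]\n",
"\n",
"# hypothetical token ids for a short label\n",
"pad_and_mask([0, 57, 912, 2])"
]
},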
{
"cell_type": "markdown",
"metadata": {
"id": "qMuiii6UPBdY"
},
"source": [
"##### Basic Values/Constants"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T16:41:43.503277Z",
"iopub.status.busy": "2023-07-02T16:41:43.502894Z",
"iopub.status.idle": "2023-07-02T16:41:43.506967Z",
"shell.execute_reply": "2023-07-02T16:41:43.506238Z",
"shell.execute_reply.started": "2023-07-02T16:41:43.503252Z"
},
"id": "Pz10NsdfPBdZ",
"tags": []
},
"outputs": [],
"source": [
"MODEL_CKPT = \"microsoft/trocr-base-printed\"\n",
"MODEL_NAME = MODEL_CKPT.split(\"/\")[-1] + \"_captcha_ocr\"\n",
"NUM_OF_EPOCHS = 5"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mHdN01enPBdZ"
},
"source": [
"##### Instantiate Processor and Create Training & Testing Dataset Instances"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2023-07-02T16:41:45.762132Z",
"iopub.status.busy": "2023-07-02T16:41:45.761743Z",
"iopub.status.idle": "2023-07-02T16:41:45.951471Z",
"shell.execute_reply": "2023-07-02T16:41:45.950844Z",
"shell.execute_reply.started": "2023-07-02T16:41:45.762108Z"
},
"id": "Vo6_X-rjPBdZ",
"outputId": "9b427a64-daa6-424a-e45b-e61dea91487a",
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.\n"
]
}
],
"source": [
"processor = TrOCRProcessor.from_pretrained(MODEL_CKPT)\n",
"train_ds = Captcha_Dataset(df=train,processor=processor)\n",
"test_ds = Captcha_Dataset(df=test,processor=processor)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lx7hfHVGPBda"
},
"source": [
"##### Print Length of Training & Testing Datasets"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2023-07-02T16:41:48.729007Z",
"iopub.status.busy": "2023-07-02T16:41:48.728622Z",
"iopub.status.idle": "2023-07-02T16:41:48.733380Z",
"shell.execute_reply": "2023-07-02T16:41:48.732602Z",
"shell.execute_reply.started": "2023-07-02T16:41:48.728985Z"
},
"id": "wgHpCqKfPBdb",
"outputId": "106f80b7-f3a7-4efe-eec2-92df29061702",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The training dataset has 922 samples in it.\n",
"The testing dataset has 231 samples in it.\n"
]
}
],
"source": [
"print(f\"The training dataset has {len(train_ds)} samples in it.\")\n",
"print(f\"The testing dataset has {len(test_ds)} samples in it.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rUAIaTqwPBdb"
},
"source": [
"##### Example of Input Data Shapes"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2023-07-02T16:41:59.911117Z",
"iopub.status.busy": "2023-07-02T16:41:59.910734Z",
"iopub.status.idle": "2023-07-02T16:41:59.930886Z",
"shell.execute_reply": "2023-07-02T16:41:59.930111Z",
"shell.execute_reply.started": "2023-07-02T16:41:59.911090Z"
},
"id": "FnEUP8PKPBdc",
"outputId": "cd817977-eeb4-47df-bdda-8c33fc4c18dd",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"pixel_values : torch.Size([3, 384, 384])\n",
"labels : torch.Size([128])\n"
]
}
],
"source": [
"encoding = train_ds[10]\n",
"\n",
"for k, v in encoding.items():\n",
"    print(k, \" : \", v.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "N38-0NPhPBdc"
},
"source": [
"##### Show Example Image (Row 0 of `train`)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 67
},
"execution": {
"iopub.execute_input": "2023-07-02T16:42:03.356471Z",
"iopub.status.busy": "2023-07-02T16:42:03.355908Z",
"iopub.status.idle": "2023-07-02T16:42:03.365782Z",
"shell.execute_reply": "2023-07-02T16:42:03.365173Z",
"shell.execute_reply.started": "2023-07-02T16:42:03.356433Z"
},
"id": "KIWjkbV9PBdd",
"outputId": "54a13aeb-18c6-4248-8d42-4818886df3d7",
"tags": []
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAANwAAAAoCAIAAAAaOwPZAAAFO0lEQVR4nO1c25GjOhDV3toAyGCUgZUBZMBkQAiE4HIIDoEQyAAykDPQZNAZzP045S4tM8aSwI1mV+cLKEs03Uf9ksq/Pj8/VUFBTvjvaAEKCpYopCzIDoWUBdmhkLIgOxRSFmSHQsqC7FBIWZAdCikLskMhZUF2+C32JmstLrTWVVWJvbfgx0GIlNZaIhrHkYjqujbGZEVNIsJFPiI555xzSqmqqowxB0sjCzlP6ZwjIih6nue6rsHLQ9gJSZiLkKppGpUHLyGetXae577vrbUyvFyo5SjrCJFyoVN2A1prYXbCZzvn5nmGDEBd19Za8FIAnMw8YptzbhxHpdTlcjmfz6+Wx18GrBatddu2TdMI81LOUyJkG2OstfhsIrLWWmvByPCw/tSiKwOJaBgGn44MrXXUbMngZKZt22+9IBQiI4y6M3Kaptvt5j/nhSHMSzlSKk/XPjPwBNe3263runVePrXoOsZxZEYaY9q29WWTUT0RXS4XpZS1drsXTF6iC5FwUdd1XddKqev1yqEc3mSLkFEQJaW6Z2y8IrXWfd8rpeZ5xhOoeEUFC4tG8VJrfTqdpmlSd0buW0Zw0ixmwo1LFOAF2TQNfIS1tu/76/Wq7sXAnkI/gzQpOWQ758BIrEJjDCxqjFlxV5yJYzjyrXBj+O5wF/aAE3zrnIOzlyxNdnG6i2yhaZppmo6q+aRJ6VcYoAhYCBPCwCu6gNX5Wmu90UlsARgJTgBg/HZhmOhElMwMebe9F6R3dFhT6m5CX+mL22/x9vbm38LvJsuDUJ4G30vxw9PpBMefLAOCyTAMfliIDaBENE3TMAwL8cJFWpgpdoYtkPaUAIIFmkFRA2Hsj48P1poxpmmaWGcAGyP6MyfS8kuYnIj6vg/sHiBEYOBCBnUPJtyjwG0U0ZmRVVUNw9B1XfhAXgZIKNEVChy+F0RJ6S96aDlhCUJN1lqYv23bKGYzITgrxXPUPQmRF28HIwPHov/FbPYTAJ7Q10y4p8TP2EcSUbh+mIt4HaYiItSgkv5SlJRY9LhOcJMAq+brRQh8QuwCvD2qacL11iOqoS/jpzqBwO/RXFRKdV0X3mJ0zoHKLKS65xJEpLUW46UcKbnuVsftX/mv5iXBReuBu8x+4cy+nJsygYDvZ7easBnTdR2He9SjcJnwoIgG4bMlQ3Tvm91kDqcxIADYkLa76KeGsF/sR63IAPMHTggWDsPQti2UjAUWJQ8227B5gVukFsxLsW6lXPXtB6Pk2P0KJO93IxPA9WInfbsMgXxilw/ScPyNTbUBTvSZ0+fzOaQlsi+ESOkvtQNj977gD6mq6hWOBPOvOyp/YfCqiEolV3BUMiNEylfEbrZWWtXCw7kdg6x3mqZpmnashL5FLOEeeeLFCtda78VIde9WciUuBukDGWqP2O23dZCAq/hlDRtjOHjJu03J7aFw+E2AR23IqJCCaLuRkV93TbnS+jtbQl3XjeOIDY+Nn8cdXczDHZCoHXDOcf0+iBiqP3fhH7kiZHXYSv32B9xwJaJdGDmOox8lqjtwSiFt5ljIHfJFZaf2yFSqqkJlgHNoaKGFD+ew+NV+2jt0HCKnT5pXOJIVvZF3LFftl0cuoGPOue6FXz/3rwAXxVMU1zmVXLA5oQjbeOKYCZ0gP6fpe50PX4RvdVBV+oNJuRH07ESSANIIja1tjhJ1Xb+/v/8F3QzGMQcyckAOVkzOZLTWp9NJKYXYmsO37Ih/11P+aPg7ET/uuORTFFIWZIfyty0F2aGQsiA7FFIWZIdCyoLs8D9FlV0D7PxKyQAAAABJRU5ErkJggg==\n",
"text/plain": [
"<PIL.Image.Image image mode=RGB size=220x40>"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Image.open(train['file_name'][0]).convert(\"RGB\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "O7nOrwBpPBdd"
},
"source": [
"##### Show Label for the Encoded Sample (`train_ds[10]`)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2023-07-02T16:42:05.806458Z",
"iopub.status.busy": "2023-07-02T16:42:05.806065Z",
"iopub.status.idle": "2023-07-02T16:42:05.811038Z",
"shell.execute_reply": "2023-07-02T16:42:05.810402Z",
"shell.execute_reply.started": "2023-07-02T16:42:05.806432Z"
},
"id": "tjOyc1ksPBde",
"outputId": "2771a745-dc3f-4058-f1fe-c27d28e57ffa",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"vy26xDW\n"
]
}
],
"source": [
"labels = encoding['labels']\n",
"labels[labels == -100] = processor.tokenizer.pad_token_id\n",
"label_str = processor.decode(labels, skip_special_tokens=True)\n",
"print(label_str)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Sc3E_5KCPBde"
},
"source": [
"#### Instantiate Model"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2023-07-02T16:42:08.592330Z",
"iopub.status.busy": "2023-07-02T16:42:08.591936Z",
"iopub.status.idle": "2023-07-02T16:42:15.226651Z",
"shell.execute_reply": "2023-07-02T16:42:15.225927Z",
"shell.execute_reply.started": "2023-07-02T16:42:08.592307Z"
},
"id": "giCycPYSPBdf",
"outputId": "0503d57a-9aaa-412c-afdc-60de94e1e7cb",
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some weights of VisionEncoderDecoderModel were not initialized from the model checkpoint at microsoft/trocr-base-printed and are newly initialized: ['encoder.pooler.dense.weight', 'encoder.pooler.dense.bias']\n",
"You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
]
}
],
"source": [
"model = VisionEncoderDecoderModel.from_pretrained(MODEL_CKPT)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-rkOX7UEPBdf"
},
"source": [
"##### Model Configuration Modifications"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T16:42:15.228189Z",
"iopub.status.busy": "2023-07-02T16:42:15.227891Z",
"iopub.status.idle": "2023-07-02T16:42:15.233150Z",
"shell.execute_reply": "2023-07-02T16:42:15.232303Z",
"shell.execute_reply.started": "2023-07-02T16:42:15.228166Z"
},
"id": "BCimiZB2PBdg",
"tags": []
},
"outputs": [],
"source": [
"# Special tokens the decoder uses to start, pad, and end sequences\n",
"model.config.decoder_start_token_id = processor.tokenizer.cls_token_id\n",
"model.config.pad_token_id = processor.tokenizer.pad_token_id\n",
"model.config.eos_token_id = processor.tokenizer.sep_token_id\n",
"\n",
"# Keep the top-level vocabulary size in sync with the decoder\n",
"model.config.vocab_size = model.config.decoder.vocab_size\n",
"\n",
"# Beam-search settings used by generate() during evaluation\n",
"model.config.max_length = 64\n",
"model.config.early_stopping = True\n",
"model.config.no_repeat_ngram_size = 3\n",
"model.config.length_penalty = 2.0\n",
"model.config.num_beams = 4"
]
},
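{
"cell_type": "markdown",
"metadata": {},
"source": [
"The deprecation warning that `transformers` prints during training suggests keeping generation settings in a separate generation configuration rather than on `model.config`. The next cell is a minimal sketch of that alternative, assuming a `transformers` release that ships `GenerationConfig`. The values mirror the ones set above. Attaching the result with `model.generation_config = generation_config` is an assumption about the installed version and was not part of the original run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from transformers import GenerationConfig\n",
"\n",
"# Same decoding settings as the model.config assignments, gathered in one object\n",
"generation_config = GenerationConfig(\n",
"    max_length=64,\n",
"    early_stopping=True,\n",
"    no_repeat_ngram_size=3,\n",
"    length_penalty=2.0,\n",
"    num_beams=4,\n",
")\n",
"print(generation_config)"
]
},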
{
"cell_type": "markdown",
"metadata": {
"id": "TdujhGLEPBdh"
},
"source": [
"##### Define Evaluation Metric"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T16:42:17.957734Z",
"iopub.status.busy": "2023-07-02T16:42:17.957346Z",
"iopub.status.idle": "2023-07-02T16:42:19.107223Z",
"shell.execute_reply": "2023-07-02T16:42:19.106630Z",
"shell.execute_reply.started": "2023-07-02T16:42:17.957710Z"
},
"id": "cSRjcoBVPBdh",
"tags": []
},
"outputs": [],
"source": [
"cer_metric = evaluate.load(\"cer\")\n",
"\n",
"def compute_metrics(pred):\n",
"    label_ids = pred.label_ids\n",
"    pred_ids = pred.predictions\n",
"\n",
"    # Decode the generated token IDs into text\n",
"    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)\n",
"    # -100 marks label positions ignored by the loss; map them back to the pad token before decoding\n",
"    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id\n",
"    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)\n",
"\n",
"    cer = cer_metric.compute(predictions=pred_str, references=label_str)\n",
"\n",
"    return {\"cer\": cer}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "96ooUEZBPBdh"
},
"source": [
"##### Define Training Arguments"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T16:42:21.819117Z",
"iopub.status.busy": "2023-07-02T16:42:21.818715Z",
"iopub.status.idle": "2023-07-02T16:42:21.825251Z",
"shell.execute_reply": "2023-07-02T16:42:21.824571Z",
"shell.execute_reply.started": "2023-07-02T16:42:21.819072Z"
},
"id": "FEEailluPBdi",
"tags": []
},
"outputs": [],
"source": [
"args = Seq2SeqTrainingArguments(\n",
"    output_dir=MODEL_NAME,\n",
"    num_train_epochs=NUM_OF_EPOCHS,\n",
"    predict_with_generate=True,\n",
"    evaluation_strategy=\"epoch\",\n",
"    save_strategy=\"steps\",  # checkpoint by step count rather than every epoch\n",
"    save_steps=1_000_000,  # far more steps than the run takes, so no intermediate checkpoints are written\n",
"    per_device_train_batch_size=8,\n",
"    per_device_eval_batch_size=8,\n",
"    logging_first_step=True,\n",
"    hub_private_repo=False,\n",
"    push_to_hub=False\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OnJTMfK9PBdj"
},
"source": [
"##### Define Trainer"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"execution": {
"iopub.execute_input": "2023-07-02T16:42:26.816215Z",
"iopub.status.busy": "2023-07-02T16:42:26.815829Z",
"iopub.status.idle": "2023-07-02T16:42:27.179177Z",
"shell.execute_reply": "2023-07-02T16:42:27.178432Z",
"shell.execute_reply.started": "2023-07-02T16:42:26.816192Z"
},
"id": "ukKLfcRLPBdj",
"outputId": "0913bf3b-0e6c-4e78-e6d5-a87864dc5b42",
"tags": []
},
"outputs": [],
"source": [
"trainer = Seq2SeqTrainer(\n",
" model=model,\n",
" tokenizer=processor.feature_extractor,\n",
" args=args,\n",
" compute_metrics=compute_metrics,\n",
" train_dataset=train_ds,\n",
" eval_dataset=test_ds,\n",
" data_collator=default_data_collator\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pm6MBc4mPBdk"
},
"source": [
"##### Fit/Train Model"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 134
},
"execution": {
"iopub.execute_input": "2023-07-02T16:42:30.324513Z",
"iopub.status.busy": "2023-07-02T16:42:30.324129Z",
"iopub.status.idle": "2023-07-02T17:06:12.146679Z",
"shell.execute_reply": "2023-07-02T17:06:12.146045Z",
"shell.execute_reply.started": "2023-07-02T16:42:30.324486Z"
},
"id": "RdwPyzCDPBdl",
"outputId": "a1406ff0-bb27-4322-9362-83dcf6af14e7",
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/saturncloud/envs/saturn/lib/python3.9/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n",
" warnings.warn(\n"
]
},
{
"data": {
"text/html": [
"\n",
" <div>\n",
" \n",
" <progress value='580' max='580' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
" [580/580 23:39, Epoch 5/5]\n",
" </div>\n",
" <table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th>Epoch</th>\n",
" <th>Training Loss</th>\n",
" <th>Validation Loss</th>\n",
" <th>Cer</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>11.165900</td>\n",
" <td>1.063115</td>\n",
" <td>0.163884</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>11.165900</td>\n",
" <td>0.590734</td>\n",
" <td>0.097712</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>11.165900</td>\n",
" <td>0.485241</td>\n",
" <td>0.053803</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>11.165900</td>\n",
" <td>0.321299</td>\n",
" <td>0.032158</td>\n",
" </tr>\n",
" <tr>\n",
" <td>5</td>\n",
" <td>0.438800</td>\n",
" <td>0.294638</td>\n",
" <td>0.028448</td>\n",
" </tr>\n",
" </tbody>\n",
"</table><p>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/saturncloud/envs/saturn/lib/python3.9/site-packages/transformers/generation/utils.py:1259: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)\n",
" warnings.warn(\n"
]
},
{
"data": {
"text/plain": [
"TrainOutput(global_step=580, training_loss=0.40041021108627317, metrics={'train_runtime': 1421.6668, 'train_samples_per_second': 3.243, 'train_steps_per_second': 0.408, 'total_flos': 3.4495947304914125e+18, 'train_loss': 0.40041021108627317, 'epoch': 5.0})"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainer.train()"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T17:10:16.834072Z",
"iopub.status.busy": "2023-07-02T17:10:16.833664Z",
"iopub.status.idle": "2023-07-02T17:10:17.821400Z",
"shell.execute_reply": "2023-07-02T17:10:17.820510Z",
"shell.execute_reply.started": "2023-07-02T17:10:16.834044Z"
},
"tags": []
},
"outputs": [],
"source": [
"! rm -rf xtrocr-base-printed_captcha_ocr2"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jjAvL7blPBdl"
},
"source": [
"##### Save Model & Model State"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T17:10:20.149589Z",
"iopub.status.busy": "2023-07-02T17:10:20.149176Z",
"iopub.status.idle": "2023-07-02T17:10:22.081015Z",
"shell.execute_reply": "2023-07-02T17:10:22.080412Z",
"shell.execute_reply.started": "2023-07-02T17:10:20.149561Z"
},
"id": "CAL-qtqePBdl",
"tags": []
},
"outputs": [],
"source": [
"trainer.save_model(f'/home/jovyan/workspace/x{MODEL_NAME}')\n",
"trainer.save_state()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gqcy34rdPBdn"
},
"source": [
"##### Evaluate Model"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T17:10:25.211492Z",
"iopub.status.busy": "2023-07-02T17:10:25.211105Z",
"iopub.status.idle": "2023-07-02T17:11:50.250586Z",
"shell.execute_reply": "2023-07-02T17:11:50.249820Z",
"shell.execute_reply.started": "2023-07-02T17:10:25.211461Z"
},
"id": "Vh-Hi13zPBdn",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <div>\n",
" \n",
" <progress value='29' max='29' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
" [29/29 01:21]\n",
" </div>\n",
" "
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"{'eval_loss': 0.29463836550712585,\n",
" 'eval_cer': 0.02844774273345702,\n",
" 'eval_runtime': 85.0328,\n",
" 'eval_samples_per_second': 2.717,\n",
" 'eval_steps_per_second': 0.341,\n",
" 'epoch': 5.0}"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainer.evaluate()"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T17:29:33.805483Z",
"iopub.status.busy": "2023-07-02T17:29:33.805113Z",
"iopub.status.idle": "2023-07-02T17:29:34.704793Z",
"shell.execute_reply": "2023-07-02T17:29:34.703925Z",
"shell.execute_reply.started": "2023-07-02T17:29:33.805459Z"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/home/jovyan/workspace\n"
]
}
],
"source": [
"! pwd"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T17:29:36.103773Z",
"iopub.status.busy": "2023-07-02T17:29:36.103347Z",
"iopub.status.idle": "2023-07-02T17:29:42.509355Z",
"shell.execute_reply": "2023-07-02T17:29:42.508568Z",
"shell.execute_reply.started": "2023-07-02T17:29:36.103745Z"
},
"tags": []
},
"outputs": [],
"source": [
"from transformers import VisionEncoderDecoderModel\n",
"\n",
"model = VisionEncoderDecoderModel.from_pretrained(f'/home/jovyan/workspace/x{MODEL_NAME}')"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T17:29:42.511546Z",
"iopub.status.busy": "2023-07-02T17:29:42.511120Z",
"iopub.status.idle": "2023-07-02T17:29:42.517343Z",
"shell.execute_reply": "2023-07-02T17:29:42.516488Z",
"shell.execute_reply.started": "2023-07-02T17:29:42.511512Z"
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'TtErHy3'"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train['text'][0]"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T17:29:44.621137Z",
"iopub.status.busy": "2023-07-02T17:29:44.620754Z",
"iopub.status.idle": "2023-07-02T17:29:44.950276Z",
"shell.execute_reply": "2023-07-02T17:29:44.949560Z",
"shell.execute_reply.started": "2023-07-02T17:29:44.621109Z"
},
"tags": []
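},
"outputs": [],
"source": [
"import cv2\n",
"\n",
"def preprocessing(image_path):\n",
"    # Read as grayscale (flag 0); cv2.imread returns None if the file cannot be read\n",
"    image = cv2.imread(image_path, 0)\n",
"    if image is None:\n",
"        raise FileNotFoundError(image_path)\n",
"    # Thresholding and morphology experiments, left disabled\n",
"    # (the last one needs a structuring element `kernel`, which this notebook does not define):\n",
"    # _, image = cv2.threshold(image, 230, 255, cv2.THRESH_BINARY)\n",
"    # ret, image = cv2.threshold(cv2.GaussianBlur(image, (3,3), 0), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)\n",
"    # ret, image = cv2.threshold(cv2.GaussianBlur(image, (5,5), 0), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)\n",
"    # image = cv2.dilate(cv2.erode(cv2.dilate(image, kernel, iterations=1), kernel), kernel)\n",
"    # Replicate the single channel three times, since the processor expects a 3-channel image\n",
"    return processor(cv2.merge([image, image, image]), return_tensors=\"pt\").pixel_values\n",
"\n",
"def solve_captcha(file_path):\n",
"    # Generate token IDs with the fine-tuned model and decode them to text\n",
"    generated_ids = model.generate(preprocessing(file_path))\n",
"    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]\n",
"    return generated_text"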
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T17:46:35.049756Z",
"iopub.status.busy": "2023-07-02T17:46:35.049363Z",
"iopub.status.idle": "2023-07-02T19:29:16.001986Z",
"shell.execute_reply": "2023-07-02T19:29:16.001278Z",
"shell.execute_reply.started": "2023-07-02T17:46:35.049731Z"
},
"tags": []
},
"outputs": [],
"source": [
"predicted_values = [solve_captcha(img_file) for img_file in df['file_name']]"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T19:53:42.196913Z",
"iopub.status.busy": "2023-07-02T19:53:42.196492Z",
"iopub.status.idle": "2023-07-02T19:53:42.216670Z",
"shell.execute_reply": "2023-07-02T19:53:42.215702Z",
"shell.execute_reply.started": "2023-07-02T19:53:42.196881Z"
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"[('utPHmUu', 'utPHUu'),\n",
" ('quWhtpt', 'quWhtPt'),\n",
" ('hvv2JJK', 'hVv2JK'),\n",
" ('vqpqpq4', 'VqPqpq4'),\n",
" ('GypHTKU', 'GyPHTKU'),\n",
" ('aXBKaaX', 'aXBKaX'),\n",
" ('C4hpRrp', 'G4hPRrP'),\n",
" ('pL4UEsv', 'PL4UEsv'),\n",
" ('f2wuxFu', 'f2WuxFu'),\n",
" ('QePx5cC', 'QePxcC'),\n",
" ('s6M2pWW', 's66M2pWW'),\n",
" ('exfCW4M', 'exfCWAM'),\n",
" ('CCrJjCf', 'CCrJJCf'),\n",
" ('UMWaxF8', 'UMWAxF8'),\n",
" ('t9tnnkr', 't9tnKr'),\n",
" ('VMHhjHK', 'VMHjHK'),\n",
" ('aeqrxnS', 'aeqrxS'),\n",
" ('Ecnmhyj', 'Ecnhnyj'),\n",
" ('K2WQY4W', 'k2wQY4w'),\n",
" ('sd6baaX', 'sd6baX'),\n",
" ('pje2kc2', 'Pje2kc2'),\n",
" ('QvaVv6k', 'QvaVvvv6k'),\n",
" ('fYrHLKe', 'fYrHLke'),\n",
" ('qsCGPWU', 'qsCGPwU'),\n",
" ('yb6wvjp', 'yb6wyjP'),\n",
" ('vQ7EKPv', 'vQ7EKpv'),\n",
" ('8eewnfL', '8ewnfL'),\n",
" ('e64wsQK', 'e64wSQK'),\n",
" ('yqdEkAp', 'yqdEkap'),\n",
" ('yRVCEmH', 'yRVCEmh'),\n",
" ('JpqkKHJ', 'JPqkKHJ'),\n",
" ('JYcknbj', 'JYcksnbj'),\n",
" ('SSHD6P2', 'sSHD6P2'),\n",
" ('d8yxwxF', 'd8yxF'),\n",
" ('mvTMBu7', 'nvTMBu7'),\n",
" ('pDVhkAa', 'pDVAkAa'),\n",
" ('8LM2EBH', '8LM2EEBH'),\n",
" ('qpexxeH', 'qpexceH'),\n",
" ('qJpVPSh', 'qJPVPSh'),\n",
" ('xGfxcHB', 'xGfxchB'),\n",
" ('TXsqhMG', 'TXsqMG'),\n",
" ('aVaUaeP', 'aVaUaep'),\n",
" ('wmsBbEb', 'WmsBbEb'),\n",
" ('E5Mqdun', 'ESMqdun'),\n",
" ('jf99937', 'jf9937'),\n",
" ('jwv7ywf', 'jWV7Ywf')]"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Collect (label, prediction) pairs for every captcha the model misread\n",
"mismatches = [(label, pred) for label, pred in zip(df['text'], predicted_values) if label != pred]\n",
"mismatches"
]
},
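{
"cell_type": "markdown",
"metadata": {},
"source": [
"Many of the pairs listed above differ only in letter case (for example `p` versus `P`). The next cell is a small illustrative sketch, not part of the original run, that counts such case-only errors on a few pairs copied from that output. The helper name `is_case_only_error` is hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A few (label, prediction) pairs copied from the mismatch output above\n",
"pairs = [\n",
"    ('quWhtpt', 'quWhtPt'),\n",
"    ('pL4UEsv', 'PL4UEsv'),\n",
"    ('utPHmUu', 'utPHUu'),\n",
"    ('exfCW4M', 'exfCWAM'),\n",
"]\n",
"\n",
"def is_case_only_error(label, pred):\n",
"    # True when the strings differ, but only in upper/lower case\n",
"    return label != pred and label.lower() == pred.lower()\n",
"\n",
"case_only = [pair for pair in pairs if is_case_only_error(*pair)]\n",
"print(len(case_only), 'of', len(pairs), 'sampled errors differ only in letter case')"
]
},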
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"execution": {
"iopub.execute_input": "2023-07-02T19:52:08.024581Z",
"iopub.status.busy": "2023-07-02T19:52:08.024178Z",
"iopub.status.idle": "2023-07-02T19:52:08.029239Z",
"shell.execute_reply": "2023-07-02T19:52:08.028504Z",
"shell.execute_reply.started": "2023-07-02T19:52:08.024556Z"
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"('jwv7ywf', 'jWV7Ywf')"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['text'][1117],predicted_values[1117]"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-29T22:13:56.023823Z",
"iopub.status.busy": "2023-06-29T22:13:56.023442Z",
"iopub.status.idle": "2023-06-29T22:13:56.029904Z",
"shell.execute_reply": "2023-06-29T22:13:56.028956Z",
"shell.execute_reply.started": "2023-06-29T22:13:56.023798Z"
},
"tags": []
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAIcAAAAmCAAAAADLzNedAAAEzUlEQVR4nO2XwXXrSA5Fb8/x/r8MhAyEDMgM1BmoM3AIPg7BGfTPwMyAzADOAMoAE8HMokiKkixLi5lZDTb2oQpVFyjUK9QfAPbX0YD8/XeymPVdb1xbjlNixs5N2885/V0FRptp/V6VwAl230y2tRerorKA5c88RchvRleessjE4AxSWVMgCXyvzWCK+GcCSV/8aC9/xVhEmKiI7eCMw83ojGoRKrxbUlIZpwwwE3tfOCpjArLF5g8weDnuGclBPTklADJBxR2Hee3I8EOv9v8pspDZL/UrcU6fuXHbP+IwwcgIWtC9MxjQN6OtH4meyqIiiwUkCrwzLZhUTp9ZoLZ/1rk/4EB9A+mGBcNNpd3pcEsi6+sAOUUWNUIvKqZI1B/MzwmoiiygWlSy27muOUBdBmMSAOS+RQ7fhNBqtzymiAYClcv6dd62YUnu60OExjFCDqiWgggfvjoT9u3GLDQGBDXKbEYIkadfDTNjimgDn8Tg5R1AFuuXQCHrLrNRuT2oIK8AKmLOIvkZuHpB5UfbFNQf7CmKti8YqdosGoh1WyoLcoLL0L4SID9e+ch5nMWxfYs1qgqP4hsluuHodwZrVKs1Fausdpwr8Y/X2+kqPzirX99fjqgQOSVv8RDk5SgRPlx+NfmeImPKRWW3CQPUtaBXnRHms6Yf02e2/Ayy+ve3Rxi8GOCjUeeUCJlBxhRrrDIdN25Vl1iYTIe2cR5vQ8xnqFWRHmK0+kDHYTd4tAVJioKVQiass+0uV06XIHJ1/byex2E3kOcI3h6DvMyex9Lv5vXnbkgqzreee2fXSpRTZoOey6POGOBZToKhmwh+5MBZwvOda0haSu5hVI4JqD9wGuctuDjWVgHC9PqElm44OKuWm1nB+4KhvLmj6nPIAvXHnmAk4bJeJApKeqV/AmLLsSyRJRzGt/cZA4YB6wyWqzhjGBuGgH1l89wczmp3d4O71sCnOBbrxxnjyJTQci91ttxyQE1dwBQA5HAYz+6/o4Cq97eRnKCF8RSH7EoiAPSGey2dCUxT1SIoYpYQoEn82apUQNq7BPDF17bZ+4nDuouZFutxam2ncqgoWjXpTHuzhFgLn47pYRt0mY97g9ZzV3S/CwzXmuth7ib7buORQ6jA7CgZjj3qDO/Vxz2iE2Due1tqLw8QgHdLCwAUB0aQdXK1G9P50e5yfFculeOYyG1na/KMXcwOG+cWvQmJp+wf935oJ3XzlKiqzKkKIq+Jr94cAG4Ap+cofsqHexY58KcaxojVV5OnSO9Mc7Cyr3tz5LMq9g1H5Xw9Wje3oG6iMgZ1WqYnQ2bsTKa5Tla/2SKvRfZZDpkK1TS3+C0hNaZ3Rk6RzmEfauexAgnJDv3svPpdBpXP9qfnf63LQOd3qB1qLCoyRCUU0pHQHGcVYO2pprrqqmVdxjdV8wyHHMNYW1v1MFbrZQTmr4bhQIzQhCpHCSylq4NhPlj+0PL/wIHjbE/aAtLmbe/Z3oG9gIwCA9gTSJetueRQ5hcfn+S40Rr1KBvG8qqejwgwKil6YVU7sPWFPYMfUNlDPV/sj3/9+POywbeiXzXCCTqnKoyNms4DYuKX/Mk6fcDxI2N7SAqyxDVG6zsuVPa/xPEftbu6/j+2/3Nc2r8BfnyIF4iSw0YAAAAASUVORK5CYII=\n",
"text/plain": [
"<PIL.PngImagePlugin.PngImageFile image mode=L size=135x38>"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Image.open(df['file_name'][18])"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"execution": {
"iopub.execute_input": "2023-06-29T22:18:23.458106Z",
"iopub.status.busy": "2023-06-29T22:18:23.457704Z",
"iopub.status.idle": "2023-06-29T22:18:28.624912Z",
"shell.execute_reply": "2023-06-29T22:18:28.624253Z",
"shell.execute_reply.started": "2023-06-29T22:18:23.458083Z"
},
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/saturncloud/envs/saturn/lib/python3.9/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (64) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.\n",
" warnings.warn(\n"
]
},
{
"data": {
"text/plain": [
"'3Vn7kw'"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"solve_captcha('3yvh7KW.png')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JJbH8MtBPBdo"
},
"source": [
"##### Push Model to Hub (My Profile!!!)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "63SWFV94PBdo"
},
"outputs": [],
"source": [
"kwargs = {\n",
" \"finetuned_from\" : model.config._name_or_path,\n",
" \"tasks\" : \"image-to-text\",\n",
" \"tags\" : [\"image-to-text\"],\n",
"}\n",
"\n",
"if args.push_to_hub:\n",
" trainer.push_to_hub(\"All Dunn!!!\")\n",
"else:\n",
" trainer.create_model_card(**kwargs)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5aSPONY_PBdp"
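},
"source": [
"### Notes & Other Takeaways From This Project\n",
"****\n",
"- The Character Error Rate (CER) on the evaluation set was 0.0284 after five epochs (see the `trainer.evaluate()` output above). I am pleased with that result.\n",
"- Context about the metric: zero (0) is a perfect score. One (1) is normally the worst score, but CER can exceed 1 when the prediction contains many insertion errors.\n",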
"****"
]
},
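{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the note about insertion errors concrete, the next cell sketches a per-string CER in plain Python: the character-level edit distance divided by the number of reference characters. The function names are illustrative, and the `evaluate` CER metric may aggregate edits over the whole dataset, so its value need not equal an average of per-string scores."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def edit_distance(ref, hyp):\n",
"    # Dynamic-programming Levenshtein distance over characters\n",
"    prev = list(range(len(hyp) + 1))\n",
"    for i, r in enumerate(ref, start=1):\n",
"        curr = [i]\n",
"        for j, h in enumerate(hyp, start=1):\n",
"            cost = 0 if r == h else 1\n",
"            curr.append(min(\n",
"                prev[j] + 1,         # deletion\n",
"                curr[j - 1] + 1,     # insertion\n",
"                prev[j - 1] + cost,  # substitution or match\n",
"            ))\n",
"        prev = curr\n",
"    return prev[-1]\n",
"\n",
"def char_error_rate(ref, hyp):\n",
"    # CER = edit distance divided by the number of reference characters\n",
"    return edit_distance(ref, hyp) / len(ref)\n",
"\n",
"# One missing character out of seven reference characters\n",
"print(char_error_rate('utPHmUu', 'utPHUu'))\n",
"# Three inserted characters against two reference characters gives a CER above 1\n",
"print(char_error_rate('ab', 'abxyz'))"
]
},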
{
"cell_type": "markdown",
"metadata": {
"id": "SKuM2BqePBdp"
},
"source": [
"### Citations\n",
"\n",
"##### For Transformer Checkpoint\n",
"- @misc{li2021trocr,\n",
" title={TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models},\n",
" author={Minghao Li and Tengchao Lv and Lei Cui and Yijuan Lu and Dinei Florencio and Cha Zhang and Zhoujun Li and Furu Wei},\n",
" year={2021},\n",
" eprint={2109.10282},\n",
" archivePrefix={arXiv},\n",
" primaryClass={cs.CL}\n",
"}\n",
"\n",
"##### For CER Metric\n",
"- @inproceedings{morris2004,\n",
"author = {Morris, Andrew and Maier, Viktoria and Green, Phil},\n",
"year = {2004},\n",
"month = {01},\n",
"pages = {},\n",
"title = {From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition.}\n",
"}"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "saturn (Python 3)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.15"
},
"vscode": {
"interpreter": {
"hash": "a52fe47989fdc78fafbb981021cec52a6b82df6453830b9ffbd04250493e6cab"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}