BaezCrdrm/basic-text-recognition.ipynb

## basic-text-recognition.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Basic text recognition.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "authorship_tag": "ABX9TyMOMNCnD5eDmi0DcH4AbPSi",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/BaezCrdrm/c9386b02c7a3ba041a0f1e319f2ffee7/basic-text-recognition.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Images to Text: Basic Recognition\n",
        "This notebook is a simple experiment in recognizing text in an image with `pytesseract`. The approach only works on simple images that contain text.\n",
        "\n",
        "Based on https://towardsdatascience.com/building-a-simple-text-recognizer-in-python-93e453ddb759"
      ],
      "metadata": {
        "id": "6VucjPoAAnBl"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Import libraries and configure\n",
        "\n"
      ],
      "metadata": {
        "id": "wolWktXzBfeJ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install Pillow pytesseract\n",
        "!sudo apt install tesseract-ocr"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "RbS9ZA_N27z_",
        "outputId": "0ed4e941-7e70-4626-bae0-9ba2982af63f"
      },
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Requirement already satisfied: Pillow in /usr/local/lib/python3.7/dist-packages (7.1.2)\n",
            "Requirement already satisfied: pytesseract in /usr/local/lib/python3.7/dist-packages (0.3.8)\n",
            "Reading package lists... Done\n",
            "Building dependency tree       \n",
            "Reading state information... Done\n",
            "tesseract-ocr is already the newest version (4.00~git2288-10f4998a-2).\n",
            "0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "BbwsDwKp2paY"
      },
      "outputs": [],
      "source": [
        "# adds image processing capabilities\n",
        "from PIL import Image, ImageEnhance\n",
        "# will convert the image to text string\n",
        "import pytesseract\n",
        "\n",
        "pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "from google.colab import files as FILE\n",
        "import requests\n",
        "import os"
      ],
      "metadata": {
        "id": "ulkW4Nd93GAI"
      },
      "execution_count": 3,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Download the image"
      ],
      "metadata": {
        "id": "wQPXGgR2BnHb"
      }
    },
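    {
      "cell_type": "code",
      "source": [
        "downloadUrl = \"https://inventwithpython.com/blogstatic/diploma_legal_notes.png\"\n",
        "response = requests.get(downloadUrl)\n",
        "response.raise_for_status()  # stop here if the download failed\n",
        "# The source is a PNG saved under a .jpg name; Pillow detects the format from the file contents.\n",
        "with open('image_name.jpg', 'wb') as handler:\n",
        "    handler.write(response.content)\n",
        "\n",
        "# FILE.download('image_name.jpg')"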
      ],
      "metadata": {
        "id": "7iYB8wWr-DDR"
      },
      "execution_count": 4,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Read the downloaded image"
      ],
      "metadata": {
        "id": "efFaOth8Bo_w"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "img = Image.open('image_name.jpg')"
      ],
      "metadata": {
        "id": "XabFaO8r20D9"
      },
      "execution_count": 5,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Enhancer\n",
        "This step improves the image for use with `pytesseract`."
      ],
      "metadata": {
        "id": "DEPrX15U8qss"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# sharpen the image, then raise the contrast of the sharpened result\n",
        "# (building both enhancers from `img` would discard the sharpening)\n",
        "img_edit = ImageEnhance.Sharpness(img).enhance(20.0)\n",
        "img_edit = ImageEnhance.Contrast(img_edit).enhance(1.5)\n",
        "# save the new image\n",
        "img_edit.save(\"edited_image.png\")"
      ],
      "metadata": {
        "id": "_cStVICd8sTW"
      },
      "execution_count": 6,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Export"
      ],
      "metadata": {
        "id": "vj3RqU9F8soF"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Convert the image to text and store the result in a variable"
      ],
      "metadata": {
        "id": "hai13YrOBtQq"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# result = pytesseract.image_to_string(img)\n",
        "result = pytesseract.image_to_string(img_edit)"
      ],
      "metadata": {
        "id": "dJ3Z6HsE9BrF"
      },
      "execution_count": 7,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Save the variable to a file"
      ],
      "metadata": {
        "id": "5GVofdnuBzLO"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "with open('text_result.txt', mode='w') as file:\n",
        "    file.write(result)\n",
        "print('ready!')\n",
        "FILE.download('text_result.txt')"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "id": "j-IIye7Y7xEL",
        "outputId": "6b69f3bb-133b-4f39-b074-29b5e5de30e0"
      },
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "ready!\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "application/javascript": [
              "\n",
              "    async function download(id, filename, size) {\n",
              "      if (!google.colab.kernel.accessAllowed) {\n",
              "        return;\n",
              "      }\n",
              "      const div = document.createElement('div');\n",
              "      const label = document.createElement('label');\n",
              "      label.textContent = `Downloading \"${filename}\": `;\n",
              "      div.appendChild(label);\n",
              "      const progress = document.createElement('progress');\n",
              "      progress.max = size;\n",
              "      div.appendChild(progress);\n",
              "      document.body.appendChild(div);\n",
              "\n",
              "      const buffers = [];\n",
              "      let downloaded = 0;\n",
              "\n",
              "      const channel = await google.colab.kernel.comms.open(id);\n",
              "      // Send a message to notify the kernel that we're ready.\n",
              "      channel.send({})\n",
              "\n",
              "      for await (const message of channel.messages) {\n",
              "        // Send a message to notify the kernel that we're ready.\n",
              "        channel.send({})\n",
              "        if (message.buffers) {\n",
              "          for (const buffer of message.buffers) {\n",
              "            buffers.push(buffer);\n",
              "            downloaded += buffer.byteLength;\n",
              "            progress.value = downloaded;\n",
              "          }\n",
              "        }\n",
              "      }\n",
              "      const blob = new Blob(buffers, {type: 'application/binary'});\n",
              "      const a = document.createElement('a');\n",
              "      a.href = window.URL.createObjectURL(blob);\n",
              "      a.download = filename;\n",
              "      div.appendChild(a);\n",
              "      a.click();\n",
              "      div.remove();\n",
              "    }\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.Javascript object>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "application/javascript": [
              "download(\"download_78f0ff86-ad75-4642-9552-9dea62dca2ca\", \"text_result.txt\", 709)"
            ],
            "text/plain": [
              "<IPython.core.display.Javascript object>"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Remove images"
      ],
      "metadata": {
        "id": "JLfJkoZ09eeY"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "os.remove('edited_image.png')\n",
        "os.remove('image_name.jpg')"
      ],
      "metadata": {
        "id": "tvPeK5aq8QRS"
      },
      "execution_count": 9,
      "outputs": []
    }
  ]
}
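
The download cell in the notebook chooses the local file name by hand. As a minimal sketch, assuming one wanted to derive the extension from the downloaded bytes instead, the leading signature ("magic") bytes of the data can be checked. The names `guess_image_extension` and `image_filename` are hypothetical helpers for illustration and are not part of the notebook or of any library; only PNG, JPEG, and GIF are covered.

```python
# Hypothetical helpers (not part of the notebook): choose a file extension
# from the leading signature bytes of downloaded image data.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"
GIF_SIGNATURES = (b"GIF87a", b"GIF89a")


def guess_image_extension(data: bytes, default: str = ".bin") -> str:
    """Return an extension matching the image signature, or `default`."""
    if data.startswith(PNG_SIGNATURE):
        return ".png"
    if data.startswith(JPEG_SIGNATURE):
        return ".jpg"
    if data.startswith(GIF_SIGNATURES):  # startswith accepts a tuple of prefixes
        return ".gif"
    return default


def image_filename(stem: str, data: bytes) -> str:
    """Build a file name such as 'image_name.png' from a stem and raw bytes."""
    return stem + guess_image_extension(data)
```

In the notebook this could be called as `image_filename('image_name', response_bytes)`, where `response_bytes` stands for the downloaded content, before opening the output file. The later cells that open and delete the image would then need to use the same returned name.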