JannesKlaas/Group Specialops.ipynb

## Group Specialops.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "ORTEC (2).ipynb",
      "version": "0.3.2",
      "views": {},
      "default_view": {},
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  },
  "cells": [
    {
      "metadata": {
        "id": "gUH91JWlJAQX",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "# Invoice reading NLP system\n",
        "Remember this image? IT IS BACK!!!\n",
        "![System image](https://storage.googleapis.com/aibootcamp/general_assets/ml_system_architecture.png)\n",
        "\n",
        "\n",
        "This week is all about system building. Because hardly ever does a ML system stand alone. Your success in building a system for Ortec Finance depends as much on what is around your neural net as it depends on the neural net itself. This baseline is my approach to the problem. Much in this notebook was hacked together so I am sure you can improve on many points. Perhaps you even come up with a completely different approach.\n",
        "\n",
        "## The approach, character wise classification:\n",
        "The goal of the task is to extract information from the invoice. The invoice has been run through optical character recognition (OCR). OCR turns PDFs into texts but often messes up the order and confuses come characters. **To extract information from this text, we classify each character by category**. \n",
        "\n",
        "Take an example, if we just wanted to get the amount we would classify the characters like this:\n",
        "\n",
        "|T|O|T|A|L|:| |€| |4|3|6|.|0|0|\n",
        "|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|\n",
        "|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|\n",
        "\n",
        "We classify our text into 6 classes here:\n",
        "\n",
        "Ignore:           0\n",
        "Sender Name:      1 \n",
        "Sender KVK:       2 \n",
        "Sender IBAN:      3 \n",
        "Invoice Reference:4\n",
        "Total:            5\n",
        "\n",
        "These are the classes that the training data generator tags. But the class of a character does not only depend on the character. It depends on its surroundings as well. To train our model, we create substrings of our invoice that include a certain amount of preceeding and succeeding characters. The amount of preceding and succeeding characters is defined in the `PADDING` global variable. \n",
        "\n",
        "If for example we wanted to classify the character '€' from the example above and had `PADDING = 3` we would feed\n",
        "'L: € 43' into our network. You can see how the amount of padding has a great influence on the performance of our system.\n",
        "\n",
        "## Post processing:\n",
        "A significant part of model performance stems from what is done with the outputs of the neural net. This approach groups predictions to prediction sequences and only keeps predictions in which 5 consecutive characters were grouped into the same category. An approach to try would be to allow sequences to be interrupted by one character. Another nice add on would be to rank predicted sequences by the total confidence the neural network has in the sequence. \n",
        "\n",
        "## Some tips:\n",
        "For this assignment you can dive pretty deep into software development. \n",
        "You might find these jupyter tricks helpful: https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/\n",
        "\n",
        "Especially debugging with `pdb` really makes things easier: https://docs.python.org/3.5/library/pdb.html#debugger-commands\n",
        "\n",
        "Basically, if anything crashes, you can start a new cell and enter `%debug`. You then come to a command line in which you can look around what happened at the crash.\n",
        "The debugger has some special commands. For example `p my_var` prints out a variable. This also works for other python operations, e.g. `p len(my_list)`.\n",
        "\n",
        "Good luck with building a great system!"
      ]
    },
    {
      "metadata": {
        "id": "J_mZzOVTJT6D",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          },
          "output_extras": [
            {
              "item_id": 2
            }
          ],
          "base_uri": "https://localhost:8080/",
          "height": 207
        },
        "outputId": "e6145fb3-3cbd-41fa-b32c-58249221f642",
        "executionInfo": {
          "status": "ok",
          "timestamp": 1521490934760,
          "user_tz": -60,
          "elapsed": 1979,
          "user": {
            "displayName": "Fernando Lasso",
            "photoUrl": "//lh5.googleusercontent.com/-nEu68SAVl9Y/AAAAAAAAAAI/AAAAAAAABSo/fc8BpWFBAC8/s50-c-k-no/photo.jpg",
            "userId": "102520388349909004575"
          }
        }
      },
      "cell_type": "code",
      "source": [
        "# Setup \n",
        "!wget https://storage.googleapis.com/aibootcamp/data/ortec_templates_updated.zip"
      ],
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "--2018-03-19 20:21:50--  https://storage.googleapis.com/aibootcamp/data/ortec_templates_updated.zip\r\n",
            "Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.28.128, 2607:f8b0:400e:c03::80\r\n",
            "Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.28.128|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 20419 (20K) [application/zip]\n",
            "Saving to: ‘ortec_templates_updated.zip’\n",
            "\n",
            "ortec_templates_upd 100%[===================>]  19.94K  --.-KB/s    in 0s      \n",
            "\n",
            "2018-03-19 20:21:50 (61.8 MB/s) - ‘ortec_templates_updated.zip’ saved [20419/20419]\n",
            "\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "6hjsuqgc-uCS",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          },
          "output_extras": [
            {
              "item_id": 1
            }
          ],
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "outputId": "0e2846ef-6559-4e83-f25e-2e38778d1677",
        "executionInfo": {
          "status": "ok",
          "timestamp": 1521492806716,
          "user_tz": -60,
          "elapsed": 1828,
          "user": {
            "displayName": "Fernando Lasso",
            "photoUrl": "//lh5.googleusercontent.com/-nEu68SAVl9Y/AAAAAAAAAAI/AAAAAAAABSo/fc8BpWFBAC8/s50-c-k-no/photo.jpg",
            "userId": "102520388349909004575"
          }
        }
      },
      "cell_type": "code",
      "source": [
        "!ls"
      ],
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "datalab  __MACOSX  ortec_templates_updated.zip\ttemplates\r\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "qXnlL2boJjII",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          },
          "output_extras": [
            {
              "item_id": 1
            }
          ],
          "base_uri": "https://localhost:8080/",
          "height": 328
        },
        "outputId": "79ba2535-2501-4669-b01c-760b284591d4",
        "executionInfo": {
          "status": "ok",
          "timestamp": 1521490938241,
          "user_tz": -60,
          "elapsed": 1465,
          "user": {
            "displayName": "Fernando Lasso",
            "photoUrl": "//lh5.googleusercontent.com/-nEu68SAVl9Y/AAAAAAAAAAI/AAAAAAAABSo/fc8BpWFBAC8/s50-c-k-no/photo.jpg",
            "userId": "102520388349909004575"
          }
        }
      },
      "cell_type": "code",
      "source": [
        "!unzip ortec_templates_updated.zip"
      ],
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Archive:  ortec_templates_updated.zip\r\n",
            "   creating: templates/\r\n",
            "  inflating: templates/.DS_Store     \r\n",
            "   creating: __MACOSX/\r\n",
            "   creating: __MACOSX/templates/\r\n",
            "  inflating: __MACOSX/templates/._.DS_Store  \r\n",
            "  inflating: templates/invoicegen.py  \r\n",
            "  inflating: __MACOSX/templates/._invoicegen.py  \r\n",
            "   creating: templates/__pycache__/\r\n",
            "  inflating: templates/__pycache__/invoicegen.cpython-35.pyc  \r\n",
            "   creating: templates/.ipynb_checkpoints/\r\n",
            "  inflating: templates/.ipynb_checkpoints/Untitled1-checkpoint.ipynb  \r\n",
            "  inflating: templates/.ipynb_checkpoints/Untitled-checkpoint.ipynb  \r\n",
            "  inflating: templates/.ipynb_checkpoints/Baseline Ortec-checkpoint.ipynb  \r\n",
            "  inflating: templates/TEMPLATE_1.txt  \r\n",
            "  inflating: __MACOSX/templates/._TEMPLATE_1.txt  \r\n",
            "  inflating: templates/TEMPLATE_2.txt  \r\n",
            "  inflating: __MACOSX/templates/._TEMPLATE_2.txt  \r\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "_DnStomIJjRk",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          },
          "output_extras": [
            {
              "item_id": 1
            }
          ],
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "outputId": "7bc7725b-6c9d-4241-8b64-b3242e520a2e",
        "executionInfo": {
          "status": "ok",
          "timestamp": 1521490939762,
          "user_tz": -60,
          "elapsed": 1489,
          "user": {
            "displayName": "Fernando Lasso",
            "photoUrl": "//lh5.googleusercontent.com/-nEu68SAVl9Y/AAAAAAAAAAI/AAAAAAAABSo/fc8BpWFBAC8/s50-c-k-no/photo.jpg",
            "userId": "102520388349909004575"
          }
        }
      },
      "cell_type": "code",
      "source": [
        "!ls"
      ],
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "datalab  __MACOSX  ortec_templates_updated.zip\ttemplates\r\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "caddquRKJ2vC",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "!pip install -q keras"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "i3EuDqPXJAQZ",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## Loading templates"
      ]
    },
    {
      "metadata": {
        "id": "Mq9VfteLJAQZ",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "# System hyper parameters here\n",
        "\n",
        "# How many characters before and after the main char to feed the NN\n",
        "PADDING = 20 \n",
        "\n",
        "\n",
        "'''\n",
        "Ignore:           0\n",
        "Sender Name:      1 \n",
        "Sender KVK:       2 \n",
        "Sender IBAN:      3 \n",
        "Invoice Reference:4\n",
        "Total:            5\n",
        "'''\n",
        "N_CLASSES = 6"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "B5nyjD3ZJAQd",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "# Invoice data generator\n",
        "from templates.invoicegen import create_invoice"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "U80fgzU4JAQg",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          },
          "output_extras": [
            {
              "item_id": 1
            }
          ],
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "outputId": "5af409b9-23d0-442f-a748-10446a091b9e",
        "executionInfo": {
          "status": "ok",
          "timestamp": 1521492814005,
          "user_tz": -60,
          "elapsed": 1641,
          "user": {
            "displayName": "Fernando Lasso",
            "photoUrl": "//lh5.googleusercontent.com/-nEu68SAVl9Y/AAAAAAAAAAI/AAAAAAAABSo/fc8BpWFBAC8/s50-c-k-no/photo.jpg",
            "userId": "102520388349909004575"
          }
        }
      },
      "cell_type": "code",
      "source": [
        "\n",
        "# Your friendly tokenizer\n",
        "from keras.preprocessing.text import Tokenizer\n",
        "\n",
        "# Numpy\n",
        "import numpy as np\n",
        "\n",
        "#Pandas of course too\n",
        "import pandas as pd"
      ],
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Using TensorFlow backend.\n"
          ],
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {
        "id": "h8AtNKmOJAQn",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "# Create 100 invoices for each template\n",
        "\n",
        "invoices = []\n",
        "targets = []\n",
        "\n",
        "# Load template 1\n",
        "with open('templates/TEMPLATE_1.txt', 'r') as content_file:\n",
        "    content = content_file.read()\n",
        "\n",
        "# Create invoices from template\n",
        "for i in range(100):\n",
        "    inv, tar = create_invoice(content)\n",
        "    invoices.append(inv)\n",
        "    targets.append(tar)\n",
        "    \n",
        "# Load template 2\n",
        "with open('templates/TEMPLATE_2.txt', 'r') as content_file:\n",
        "    content = content_file.read()\n",
        "    \n",
        "# Create invoices from template\n",
        "for i in range(100):\n",
        "    inv, tar = create_invoice(content)\n",
        "    invoices.append(inv)\n",
        "    targets.append(tar)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "5U69d52YJAQr",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          },
          "output_extras": [
            {
              "item_id": 1
            }
          ],
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "outputId": "33a4fdcb-c92b-4833-f204-c08042728a9a",
        "executionInfo": {
          "status": "ok",
          "timestamp": 1521492815439,
          "user_tz": -60,
          "elapsed": 756,
          "user": {
            "displayName": "Fernando Lasso",
            "photoUrl": "//lh5.googleusercontent.com/-nEu68SAVl9Y/AAAAAAAAAAI/AAAAAAAABSo/fc8BpWFBAC8/s50-c-k-no/photo.jpg",
            "userId": "102520388349909004575"
          }
        }
      },
      "cell_type": "code",
      "source": [
        "len(targets)"
      ],
      "execution_count": 6,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "200"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 6
        }
      ]
    },
    {
      "metadata": {
        "id": "Xwm5ToJGJAQv",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## Generate substring"
      ]
    },
    {
      "metadata": {
        "id": "Lv7HCR6ZJAQv",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "# Create our tokenizer\n",
        "# We will tokenize on character level!\n",
        "# We will NOT remove any characters\n",
        "tokenizer = Tokenizer(char_level=True, filters=None)\n",
        "tokenizer.fit_on_texts(invoices)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "fardNlOkJAQz",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "def gen_sub(inv,tar,pad, m = None):\n",
        "    '''\n",
        "    Generates a substring from invoice inv and target list tar \n",
        "    using the character at index m as a midpoint.\n",
        "    \n",
        "    Params:\n",
        "    inv - an invoice string\n",
        "    tar - a target list specifying the type of each item\n",
        "    pad - the amount of padding to attach before and after the focus character\n",
        "    \n",
        "    Returns:\n",
        "    sub - a string with pad characters, the focus character, pad characters\n",
        "    '''\n",
        "    # If no focus character index is set, choose at random\n",
        "    if m == None:\n",
        "        m = np.random.randint(0,len(inv))\n",
        "        \n",
        "    l = m - pad # define the lower bound of our substring\n",
        "    h = m + pad + 1 # define the upper (high) of our substring\n",
        "\n",
        "    # Sometimes, our lower bound could be below zero\n",
        "    # In this case we attach the remaining characters from the back of the string\n",
        "    if l < 0:\n",
        "        s1 = ''.zfill(-1*l)\n",
        "        s2 = inv[None:h]\n",
        "        # Create substring\n",
        "        sub = s1 + s2\n",
        "        \n",
        "        # Ensure the substring has the right length\n",
        "        assert(len(sub) == pad*2 +1)\n",
        "        return sub, tar[m]\n",
        "    \n",
        "    # Our lower bound might be positive but our upper bound might \n",
        "    # still be above the length of the invoice\n",
        "    elif h >= len(inv):\n",
        "        # Calc how many chars are too much\n",
        "        overlap = h - len(inv)\n",
        "        # Get string from lower bound to end\n",
        "        s1 = inv[l:None]\n",
        "        # Get string from the front of the doc\n",
        "        s2 = ''.zfill(overlap)\n",
        "        sub = s1 + s2\n",
        "        # Make sure our string has the correct length\n",
        "        assert(len(sub) == pad*2 +1)\n",
        "        return sub, tar[m]\n",
        "    \n",
        "    # Upper and lower bound lie within the length of the invoice\n",
        "    else: \n",
        "        sub = inv[l:h]\n",
        "        assert(len(sub) == pad*2 +1)\n",
        "        return sub, tar[m]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "LloL1ejTJAQ5",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## Generate dataset for training"
      ]
    },
    {
      "metadata": {
        "id": "O1rC068zJAQ5",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "def gen_dataset(sample_size, n_classes, invoices, targets, tokenizer):\n",
        "    '''\n",
        "    Generate a dataset of inputs and outputs for our neural network\n",
        "    \n",
        "    Params:\n",
        "    sample_size - desired sample size\n",
        "    n_classes - number of classes\n",
        "    invoices - list of invoices to sample from\n",
        "    targets - list of corresonding targets to sample from\n",
        "    tokenizer - a keras tokenizer fit on the invoices\n",
        "    \n",
        "    The function creates balanced samples by randomly sampling untill \n",
        "    an equal amount of samples of all types is created.\n",
        "    \n",
        "    Characters are one hot encoded\n",
        "    \n",
        "    Returns:\n",
        "    x_arr: a numpy array of shape (sample_size, seqence length, number of unique characters)\n",
        "    y_arr: a numpy array of shape (sample_size,)\n",
        "    '''\n",
        "    \n",
        "    # Create a budget\n",
        "    budget = [sample_size / n_classes] * n_classes\n",
        "    \n",
        "    # Setup holding variables\n",
        "    X_train = []\n",
        "    y_train = []\n",
        "\n",
        "    # While there is still a budget left...\n",
        "    while sum(budget) > 0:\n",
        "        # ... get a random invoice and target list\n",
        "        index = np.random.randint(0,len(invoices))\n",
        "        inv = invoices[index]\n",
        "        tar = targets[index]\n",
        "        # ... sample up to 10 items from this invoice \n",
        "        for j in range(10):\n",
        "            # Get an item\n",
        "            x, y = gen_sub(inv,tar,PADDING)\n",
        "            # if we still have a budget for this items target\n",
        "            if budget[y] > 0:\n",
        "                # Tokenize to one hot\n",
        "                xm = tokenizer.texts_to_matrix(x)\n",
        "                # Add data and target\n",
        "                X_train.append(xm)\n",
        "                y_train.append(y)\n",
        "                budget[y] -= 1\n",
        "      \n",
        "    # Create numpy arrays from all data and targets\n",
        "    x_arr = np.array(X_train)\n",
        "    y_arr = np.array(y_train)\n",
        "    return x_arr,y_arr"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "yGA5pV-3JAQ8",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "# Ger data\n",
        "train_size = 12000\n",
        "val_size = 120\n",
        "\n",
        "x_tr, y_tr = gen_dataset(train_size, N_CLASSES, invoices, targets, tokenizer)\n",
        "x_val, y_val = gen_dataset(val_size, N_CLASSES, invoices, targets, tokenizer)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "crz627iyJAQ_",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          },
          "output_extras": [
            {
              "item_id": 1
            }
          ],
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "outputId": "0b33203c-ad3a-4849-bda9-57bec518190d",
        "executionInfo": {
          "status": "ok",
          "timestamp": 1521492822922,
          "user_tz": -60,
          "elapsed": 418,
          "user": {
            "displayName": "Fernando Lasso",
            "photoUrl": "//lh5.googleusercontent.com/-nEu68SAVl9Y/AAAAAAAAAAI/AAAAAAAABSo/fc8BpWFBAC8/s50-c-k-no/photo.jpg",
            "userId": "102520388349909004575"
          }
        }
      },
      "cell_type": "code",
      "source": [
        "x_tr.shape"
      ],
      "execution_count": 11,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(12000, 41, 83)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 11
        }
      ]
    },
    {
      "metadata": {
        "id": "82CcOTIgJARD",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## Model building"
      ]
    },
    {
      "metadata": {
        "id": "JJ977SYzJARD",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "from keras.models import Sequential\n",
        "from keras.layers import SimpleRNN, Dense,Activation, Conv1D, MaxPool1D,LSTM"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "IFzNMcqSJARG",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "# Model to recognize types\n",
        "model = Sequential()\n",
        "model.add(Conv1D(63,2,input_shape=(None,83)))\n",
        "model.add(MaxPool1D(2))\n",
        "model.add(Conv1D(42,2))\n",
        "model.add(MaxPool1D(2))\n",
        "model.add(LSTM(20))\n",
        "model.add(Dense(6))\n",
        "model.add(Activation('softmax'))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "UNYPtr9zJARJ",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "# sparse_categorical_crossentropy is like categorical crossentropy but without converting targets to one hot\n",
        "model.compile(loss='sparse_categorical_crossentropy',optimizer='adam', metrics=['acc'])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "p2JL9ObQJARL",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          },
          "output_extras": [
            {
              "item_id": 76
            },
            {
              "item_id": 155
            },
            {
              "item_id": 187
            },
            {
              "item_id": 188
            }
          ],
          "base_uri": "https://localhost:8080/",
          "height": 348
        },
        "outputId": "451f06bc-3312-412f-ddf3-1c5c72d29bc2",
        "executionInfo": {
          "status": "ok",
          "timestamp": 1521492900427,
          "user_tz": -60,
          "elapsed": 48817,
          "user": {
            "displayName": "Fernando Lasso",
            "photoUrl": "//lh5.googleusercontent.com/-nEu68SAVl9Y/AAAAAAAAAAI/AAAAAAAABSo/fc8BpWFBAC8/s50-c-k-no/photo.jpg",
            "userId": "102520388349909004575"
          }
        }
      },
      "cell_type": "code",
      "source": [
        "#Not to many epochs (avoid overfitting)\n",
        "model.fit(x_tr,y_tr,batch_size=32,epochs=8,validation_data=(x_val,y_val))"
      ],
      "execution_count": 19,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Train on 12000 samples, validate on 120 samples\n",
            "Epoch 1/8\n",
            "12000/12000 [==============================] - 6s 504us/step - loss: 0.3104 - acc: 0.9229 - val_loss: 0.0897 - val_acc: 0.9750\n",
            "Epoch 2/8\n",
            "12000/12000 [==============================] - 6s 481us/step - loss: 0.0506 - acc: 0.9878 - val_loss: 0.0960 - val_acc: 0.9750\n",
            "Epoch 3/8\n",
            "12000/12000 [==============================] - 6s 505us/step - loss: 0.0272 - acc: 0.9942 - val_loss: 0.0675 - val_acc: 0.9917\n",
            "Epoch 4/8\n",
            " 2880/12000 [======>.......................] - ETA: 4s - loss: 0.0208 - acc: 0.9955"
          ],
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": [
            "12000/12000 [==============================] - 6s 501us/step - loss: 0.0222 - acc: 0.9951 - val_loss: 0.0751 - val_acc: 0.9833\n",
            "Epoch 5/8\n",
            "12000/12000 [==============================] - 6s 496us/step - loss: 0.0200 - acc: 0.9952 - val_loss: 0.0518 - val_acc: 0.9917\n",
            "Epoch 6/8\n",
            "12000/12000 [==============================] - 6s 500us/step - loss: 0.0141 - acc: 0.9966 - val_loss: 0.0616 - val_acc: 0.9917\n",
            "Epoch 7/8\n",
            " 8160/12000 [===================>..........] - ETA: 1s - loss: 0.0097 - acc: 0.9979"
          ],
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": [
            "12000/12000 [==============================] - 6s 479us/step - loss: 0.0094 - acc: 0.9980 - val_loss: 0.0578 - val_acc: 0.9917\n",
            "Epoch 8/8\n",
            "12000/12000 [==============================] - 6s 488us/step - loss: 0.0104 - acc: 0.9975 - val_loss: 0.0617 - val_acc: 0.9917\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "<keras.callbacks.History at 0x7f6067cbb400>"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 19
        }
      ]
    },
    {
      "metadata": {
        "id": "64vCXL2TJARO",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## Generate demo invoice"
      ]
    },
    {
      "metadata": {
        "id": "VVq9JpSCJARO",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          },
          "output_extras": [
            {
              "item_id": 1
            }
          ],
          "base_uri": "https://localhost:8080/",
          "height": 288
        },
        "outputId": "1e2c669b-1738-45d7-8983-7e096b9c8251",
        "executionInfo": {
          "status": "ok",
          "timestamp": 1521492900920,
          "user_tz": -60,
          "elapsed": 458,
          "user": {
            "displayName": "Fernando Lasso",
            "photoUrl": "//lh5.googleusercontent.com/-nEu68SAVl9Y/AAAAAAAAAAI/AAAAAAAABSo/fc8BpWFBAC8/s50-c-k-no/photo.jpg",
            "userId": "102520388349909004575"
          }
        }
      },
      "cell_type": "code",
      "source": [
        "'''\n",
        "To make predictions from our model, we need to create\n",
        "sequences around every character of the invoice.\n",
        "\n",
        "We then make a prediction for every character based on its surrounding sequence.\n",
        "'''\n",
        "\n",
        "def get_invoice(specific=-1):\n",
        "    if(specific>-1):\n",
        "        index = specific\n",
        "    else:\n",
        "        # Choose a random invoice:\n",
        "        index = np.random.randint(0,len(invoices))\n",
        "    inv = invoices[index]\n",
        "    tar = targets[index]\n",
        "\n",
        "\n",
        "    chars = [] # Holds the individual characters\n",
        "    data_seq = [] # Holds the sequences around the characters\n",
        "    y_actual = [] # Holds the true targets for each character\n",
        "\n",
        "    # Loop over the character indices\n",
        "    for i in range(len(inv) -1):\n",
        "        # Create the sequence around this character\n",
        "        x,y = gen_sub(inv,tar,PADDING,m=i)\n",
        "        # Tokenize the sequence to one-hot vectors\n",
        "        xm = tokenizer.texts_to_matrix(x)\n",
        "        # Get the character itself\n",
        "        c = inv[i]\n",
        "\n",
        "        chars.append(c)\n",
        "        data_seq.append(xm)\n",
        "        y_actual.append(y)\n",
        "    return chars, data_seq, y_actual\n",
        "\n",
        "# Look at a specific invoice\n",
        "characters,data, y_true = get_invoice(100)\n",
        "\n",
        "# For demo purposes, we can look at what our invoice looks like\n",
        "df = pd.DataFrame({'Char':characters,'Target':y_true})\n",
        "\n",
        "# Show all characters belonging to the amount\n",
        "df[df.Target == 5]"
      ],
      "execution_count": 20,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Char</th>\n",
              "      <th>Target</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>308</th>\n",
              "      <td>1</td>\n",
              "      <td>5</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>309</th>\n",
              "      <td>7</td>\n",
              "      <td>5</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>310</th>\n",
              "      <td>0</td>\n",
              "      <td>5</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>311</th>\n",
              "      <td>5</td>\n",
              "      <td>5</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>312</th>\n",
              "      <td>0</td>\n",
              "      <td>5</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>313</th>\n",
              "      <td>.</td>\n",
              "      <td>5</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>314</th>\n",
              "      <td>7</td>\n",
              "      <td>5</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>315</th>\n",
              "      <td>7</td>\n",
              "      <td>5</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "    Char  Target\n",
              "308    1       5\n",
              "309    7       5\n",
              "310    0       5\n",
              "311    5       5\n",
              "312    0       5\n",
              "313    .       5\n",
              "314    7       5\n",
              "315    7       5"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 20
        }
      ]
    },
    {
      "metadata": {
        "id": "_KdrkkkYJARm",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## Making predictions"
      ]
    },
    {
      "metadata": {
        "id": "e-4IlZFkJARs",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "# Generate predictions from the data\n",
        "def get_preds(data):\n",
        "    # Create test data for predictions with the neural net\n",
        "    x_test = np.array(data)\n",
        "    #print(x_test.shape)\n",
        "    # Make predictions\n",
        "    y_pred = model.predict(x_test)\n",
        "\n",
        "    # Get the most likely class\n",
        "    y_pred = y_pred.argmax(axis=1)\n",
        "\n",
        "    # Show what our model predictions look like\n",
        "    #df['Predicted'] = y_pred\n",
        "\n",
        "    # Show all chars that are predicted to belong to the amount\n",
        "    #df[df.Predicted == 5]\n",
        "\n",
        "    return y_pred\n",
        "\n",
        "y_pred = get_preds(data)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "PkkBXAx47QEV",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          },
          "output_extras": [
            {
              "item_id": 1
            }
          ],
          "base_uri": "https://localhost:8080/",
          "height": 498
        },
        "outputId": "9ad8c9d3-2064-45f8-ad86-06e9628377e4",
        "executionInfo": {
          "status": "ok",
          "timestamp": 1521492902329,
          "user_tz": -60,
          "elapsed": 419,
          "user": {
            "displayName": "Fernando Lasso",
            "photoUrl": "//lh5.googleusercontent.com/-nEu68SAVl9Y/AAAAAAAAAAI/AAAAAAAABSo/fc8BpWFBAC8/s50-c-k-no/photo.jpg",
            "userId": "102520388349909004575"
          }
        }
      },
      "cell_type": "code",
      "source": [
        "# Group consecutive characters with the same predicted class into one dataframe.\n",
        "# Each row describes one run: start index (From), end index (To, exclusive),\n",
        "# class (Type), length and the characters themselves (Text).\n",
        "# If cut_off > 0, runs of at most cut_off characters are reset to class 0\n",
        "# (pred is modified in place) and the grouping is recomputed.\n",
        "def melt_info(inv, pred,cut_off=0):\n",
        "    cut_off=int(cut_off)\n",
        "    fromIdx = []\n",
        "    toIdx = []\n",
        "    types = []\n",
        "    fromIdx.append(0)\n",
        "    lastType = pred[0]\n",
        "    types.append(lastType)\n",
        "    # Loop over the predicted classes and start a new group at every change\n",
        "    for i in range(len(inv) -1):\n",
        "        curType = pred[i]\n",
        "        if(curType != lastType):\n",
        "            toIdx.append(i)\n",
        "            fromIdx.append(i)\n",
        "            types.append(curType)\n",
        "            lastType = curType\n",
        "    # Close the last group\n",
        "    toIdx.append(len(inv)-1)\n",
        "    melted = pd.DataFrame({'From':fromIdx,'To':toIdx,'Type':types})\n",
        "    melted['length'] = [row['To']-row['From'] for idx, row in melted.iterrows()]\n",
        "    melted['Text'] = [inv[row['From']:row['To']] for idx, row in melted.iterrows()]\n",
        "    if(cut_off>0):\n",
        "        # Reset short groups to the ignore class, then regroup\n",
        "        for index, row in melted.iterrows():\n",
        "            if(row['length']<cut_off+1):\n",
        "                pred[row['From']:row['To']] = 0\n",
        "        return melt_info(inv, pred)\n",
        "    else:\n",
        "        return melted, pred\n",
        "\n",
        "melted,preds = melt_info(characters,y_pred,0)\n",
        "display(melted)"
      ],
      "execution_count": 22,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>From</th>\n",
              "      <th>To</th>\n",
              "      <th>Type</th>\n",
              "      <th>length</th>\n",
              "      <th>Text</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>0</td>\n",
              "      <td>127</td>\n",
              "      <td>0</td>\n",
              "      <td>127</td>\n",
              "      <td>[A, m, a, z, o, n,  , W, e, b,  , S, e, r, v, ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>127</td>\n",
              "      <td>128</td>\n",
              "      <td>4</td>\n",
              "      <td>1</td>\n",
              "      <td>[:]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>128</td>\n",
              "      <td>129</td>\n",
              "      <td>0</td>\n",
              "      <td>1</td>\n",
              "      <td>[\\n]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>129</td>\n",
              "      <td>139</td>\n",
              "      <td>4</td>\n",
              "      <td>10</td>\n",
              "      <td>[\\n, B, D, S, 9, G, \\n, B, i, l]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>139</td>\n",
              "      <td>140</td>\n",
              "      <td>0</td>\n",
              "      <td>1</td>\n",
              "      <td>[l]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>140</td>\n",
              "      <td>141</td>\n",
              "      <td>4</td>\n",
              "      <td>1</td>\n",
              "      <td>[ ]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>6</th>\n",
              "      <td>141</td>\n",
              "      <td>177</td>\n",
              "      <td>0</td>\n",
              "      <td>36</td>\n",
              "      <td>[t, o,  , A, d, d, r, e, s, s, :, \\n, \\n, A, T...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>177</td>\n",
              "      <td>178</td>\n",
              "      <td>1</td>\n",
              "      <td>1</td>\n",
              "      <td>[.]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>8</th>\n",
              "      <td>178</td>\n",
              "      <td>185</td>\n",
              "      <td>0</td>\n",
              "      <td>7</td>\n",
              "      <td>[\\n, B, o, o, m, p, j]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9</th>\n",
              "      <td>185</td>\n",
              "      <td>186</td>\n",
              "      <td>1</td>\n",
              "      <td>1</td>\n",
              "      <td>[e]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>10</th>\n",
              "      <td>186</td>\n",
              "      <td>306</td>\n",
              "      <td>0</td>\n",
              "      <td>120</td>\n",
              "      <td>[s,  , 4, 0, \\n, 3, 0, 1, 1, X, B,  , R, o, t,...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>11</th>\n",
              "      <td>306</td>\n",
              "      <td>307</td>\n",
              "      <td>5</td>\n",
              "      <td>1</td>\n",
              "      <td>[\\n]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>12</th>\n",
              "      <td>307</td>\n",
              "      <td>308</td>\n",
              "      <td>0</td>\n",
              "      <td>1</td>\n",
              "      <td>[\\n]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>13</th>\n",
              "      <td>308</td>\n",
              "      <td>319</td>\n",
              "      <td>5</td>\n",
              "      <td>11</td>\n",
              "      <td>[1, 7, 0, 5, 0, ., 7, 7, \\n, \\n, T]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>14</th>\n",
              "      <td>319</td>\n",
              "      <td>1577</td>\n",
              "      <td>0</td>\n",
              "      <td>1258</td>\n",
              "      <td>[h, i, s,  , i, n, v, o, i, c, e,  , i, s,  , ...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "    From    To  Type  length  \\\n",
              "0      0   127     0     127   \n",
              "1    127   128     4       1   \n",
              "2    128   129     0       1   \n",
              "3    129   139     4      10   \n",
              "4    139   140     0       1   \n",
              "5    140   141     4       1   \n",
              "6    141   177     0      36   \n",
              "7    177   178     1       1   \n",
              "8    178   185     0       7   \n",
              "9    185   186     1       1   \n",
              "10   186   306     0     120   \n",
              "11   306   307     5       1   \n",
              "12   307   308     0       1   \n",
              "13   308   319     5      11   \n",
              "14   319  1577     0    1258   \n",
              "\n",
              "                                                 Text  \n",
              "0   [A, m, a, z, o, n,  , W, e, b,  , S, e, r, v, ...  \n",
              "1                                                 [:]  \n",
              "2                                                [\\n]  \n",
              "3                    [\\n, B, D, S, 9, G, \\n, B, i, l]  \n",
              "4                                                 [l]  \n",
              "5                                                 [ ]  \n",
              "6   [t, o,  , A, d, d, r, e, s, s, :, \\n, \\n, A, T...  \n",
              "7                                                 [.]  \n",
              "8                              [\\n, B, o, o, m, p, j]  \n",
              "9                                                 [e]  \n",
              "10  [s,  , 4, 0, \\n, 3, 0, 1, 1, X, B,  , R, o, t,...  \n",
              "11                                               [\\n]  \n",
              "12                                               [\\n]  \n",
              "13                [1, 7, 0, 5, 0, ., 7, 7, \\n, \\n, T]  \n",
              "14  [h, i, s,  , i, n, v, o, i, c, e,  , i, s,  , ...  "
            ]
          },
          "metadata": {
            "tags": []
          }
        }
      ]
    },
    {
      "metadata": {
        "id": "huqTFe-K7QEZ",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
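      },
      "cell_type": "code",
      "source": [
        "# Clean up the characters classified as sender name (class 1).\n",
        "# Keeps only the first name group that is at least 9 characters long and\n",
        "# resets every other name group to class 0. Note: pred is modified in place.\n",
        "def cleanNames(inv,pred,melted):\n",
        "    # Keep one name only\n",
        "    found_name = False\n",
        "    # Iterate over the groups\n",
        "    for index, row in melted.iterrows():\n",
        "        # Check if the group is categorised as a name\n",
        "        if(row['Type']==1):\n",
        "            # Discard groups that are too short or come after the first name\n",
        "            if(found_name or row['length']<9):\n",
        "                pred[row['From']:row['To']] = 0\n",
        "            else:\n",
        "                found_name = True\n",
        "    return pred\n",
        "\n",
        "y_cleaned = cleanNames(characters,y_pred,melted)\n",
        "#y_cleaned"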
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "WcQM9lvw7QEc",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
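      },
      "cell_type": "code",
      "source": [
        "# Use a regex to narrow down the prediction for one class.\n",
        "# For every group of class type_idx, the group's text (widened by `padding`\n",
        "# characters on both sides) is searched for `pattern`. On a match, only the\n",
        "# matched span keeps the class; all other characters of that class are reset to 0.\n",
        "import re\n",
        "def cleanByRegex(inv,pred,melted,type_idx,pattern,padding):\n",
        "    # Iterate over the rows of the melted frame\n",
        "    for index,row in melted.iterrows():\n",
        "        if(row['Type']==type_idx):\n",
        "            # Clamp the window start at 0: a negative start would slice from the end\n",
        "            start = max(row['From']-padding, 0)\n",
        "            subString = ''.join(inv[start:row['To']+padding])\n",
        "            find = re.search(pattern,subString,flags=0)\n",
        "            if(find):\n",
        "                print(\"success on \"+str(type_idx))\n",
        "                # Translate the match span back to invoice indices\n",
        "                from_idx = start+find.span()[0]\n",
        "                to_idx = start+find.span()[1]\n",
        "                # Reset every character of this class outside the match\n",
        "                for pos, elem in enumerate(pred):\n",
        "                    if((pos < from_idx or pos >= to_idx) and elem==type_idx):\n",
        "                        pred[pos]=0\n",
        "\n",
        "    return pred"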
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "7ejmgGDW7QEf",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "# This method is still a work in progress. Eventually it should post-process\n",
        "# the features that contain newlines. For now it behaves like cleanByRegex.\n",
        "def cleanByNewline(inv,pred,melted,type_idx,pattern,padding):\n",
        "    # Iterate over the rows of the melted frame\n",
        "    for index,row in melted.iterrows():\n",
        "        if(row['Type']==type_idx):\n",
        "            # Clamp the window start at 0: a negative start would slice from the end\n",
        "            start = max(row['From']-padding, 0)\n",
        "            subString = ''.join(inv[start:row['To']+padding])\n",
        "            find = re.search(pattern,subString,flags=0)\n",
        "            if(find):\n",
        "                print(\"success on \"+str(type_idx))\n",
        "                from_idx = start+find.span()[0]\n",
        "                to_idx = start+find.span()[1]\n",
        "                # Reset every character of this class outside the match\n",
        "                for pos, elem in enumerate(pred):\n",
        "                    if((pos < from_idx or pos >= to_idx) and elem==type_idx):\n",
        "                        pred[pos]=0\n",
        "    return pred"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "luFpLAOT7QEi",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          },
          "output_extras": [
            {
              "item_id": 1
            },
            {
              "item_id": 2
            }
          ],
          "base_uri": "https://localhost:8080/",
          "height": 393
        },
        "outputId": "52df7deb-684d-4bc3-ef7c-d86215d8cca1",
        "executionInfo": {
          "status": "ok",
          "timestamp": 1521492905347,
          "user_tz": -60,
          "elapsed": 729,
          "user": {
            "displayName": "Fernando Lasso",
            "photoUrl": "//lh5.googleusercontent.com/-nEu68SAVl9Y/AAAAAAAAAAI/AAAAAAAABSo/fc8BpWFBAC8/s50-c-k-no/photo.jpg",
            "userId": "102520388349909004575"
          }
        }
      },
      "cell_type": "code",
      "source": [
        "## Show the confusion matrix\n",
        "from sklearn.metrics import accuracy_score, confusion_matrix\n",
        "import matplotlib\n",
        "import matplotlib.pyplot as plt\n",
        "import seaborn as sns\n",
        "\n",
        "# Rows are the true classes, columns the predicted classes\n",
        "cnf_matrix = confusion_matrix(y_true, y_pred)\n",
        "\n",
        "fig, ax = plt.subplots(1)\n",
        "ax = sns.heatmap(cnf_matrix, ax=ax, cmap=plt.cm.Greens, annot=True)\n",
        "plt.title('Confusion matrix of validation set')\n",
        "plt.ylabel('True category')\n",
        "plt.xlabel('Predicted category')\n",
        "plt.show();\n",
        "\n",
        "'''\n",
        "Ignore:            0\n",
        "Sender Name:       1\n",
        "Sender KVK:        2\n",
        "Sender IBAN:       3\n",
        "Invoice Reference: 4\n",
        "Total:             5\n",
        "'''"
      ],
      "execution_count": 26,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAdUAAAFnCAYAAADwu9OJAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJzt3XdYFFf7N/DvLkUEC6IuFtSgT0yI\nIgF7FwICtlgRoxgN9pKoqGB5NAYb1sSuiS1CbBgToyDGWGJBjJIomhh9YkSsgIAgvZz3D9/sTyK4\nuswyOHw/Xntd7OzOnHtY4eY+c84ZlRBCgIiIiEpMLXcARERESsGkSkREJBEmVSIiIokwqRIREUmE\nSZWIiEgiTKpEREQSYVIlCCGwdetW9OjRA+7u7nB1dcWnn36KtLS0Eh136tSp6Ny5M06dOvXK+16+\nfBm+vr4lal9qYWFhePLkSZGvLV++HDt37pSknZCQELRv3x7r168v8bFWr16NWbNmAQA+/PBDXL16\n9bn3XLhwAS4uLjqPdenSJVy7dg0AEBwcjM8//7zE8ekrJycH3333nWztExXHWO4ASH7Lli3D+fPn\nsXnzZlhbWyMjIwMLFizA6NGjERISApVKpddxDx06hIiICNSvX/+V923WrBk2b96sV7uGsmrVKjg5\nOaFSpUrPvebn5ydZO0eOHMGkSZMwYMAAyY4JANu3by/R/vv27UPz5s3x9ttvY8iQIRJFpZ/ff/8d\n3333HXr37i1rHET/xkq1nEtJScGOHTuwePFiWFtbAwDMzc0xZ84cjBgxAkIIZGdnY86cOXB3d4en\npycWL16M/Px8AICLiwt27dqF/v37o0OHDli8eDEAwMfHBwUFBfD19cXJkyfh4uKCCxcuaNv953le\nXh5mzZoFd3d3uLm5YcKECXjy5AmioqLg5uYGAHq1/28+Pj7YtGkTBg4ciDZt2iAkJATr1q2Dh4cH\nunXrhri4OADAzZs3MWjQIHh6esLNzQ0HDx4EAMyYMQN///03fHx8cOHCBQQEBGDRokXo2bMnwsPD\nERAQgHXr1uHy5cvo0qUL0tPTAQAbNmzAxx9//Fw8xZ3TkiVL8Ntvv+GLL77A6tWrC+3Tv39/RERE\naJ8fPXoUXl5eAIC9e/fC09MTXbt2xeDBg3H37t3n2nz2M1i3bh06d+6M3r174+zZs9r3ZGZmYtKk\nSXB3d4eLiwuCgoIAADt37sT333+PpUuXYuvWrYUq4Hv37sHX1xfu7u7o0aOHtoK8c+cOOnTogK+/\n/ho9e/ZEx44dERYWVuTns3LlSri7u8Pd3R1Dhw7Fw4cPAQAXL15Ev3794ObmBi8vL8TFxSExMRET\nJkzAb7/9hg8++KDI4xHJRlC5duLECeHm5vbC92zcuFGMHDlS5ObmiszMTNGvXz/x3XffCSGEcHZ2\nFlOmTBF5eXniwYMHokmTJuL+/ftCCCEaN26s/drZ2Vn88ssv2mP+8/z48eNi6NChoqCgQBQUFIiV\nK1eKn3/+WZw7d064urqWqP1nDRkyRIwYMULk5uaKY8eOCQcHB7Fv3z4hhBATJ04UK1euFEIIMXr0\naLFx40YhhBDnz58XzZo1Ezk5Oc+dj7+/v+jZs6fIysrSPl+7dq0QQojAwECxfPly8eDBA9GxY0fx\n8OHDV/qeDhkyRPv1szZt2iSmT5+ufT59+nSxZcsWkZiYKJo2baqNLSAgQMycOVMIIcSqVau0X//z\nPb9x44Zo2bKlSEhIEHl5eWLcuHHC2dlZCCHE5s2bxYgRI0RBQYFISUkRrVq10n5uz8b17HE/+ugj\nsWHDBiGEEHfu3BHNmzcXcXFxIi4uTrzzzjtix44dQgghwsLCivy/dv36ddG1a1ft9/nrr78W+/fv\nF2lpaaJly5bi9OnTQgghfvjh
B9GnTx8hhBD79u0TH3744XPHIpIbK9VyLiUlBdWrV3/he06cOAEv\nLy8YGxvDzMwMPXv2xJkzZ7Sv9+zZE0ZGRrC2tkb16tVx//79l27fysoKf/31F3788UdtldSxY0eD\ntO/s7AxjY2M0btwYmZmZcHd3BwA0btwY8fHxAJ5WcP9cy23evDmys7ORkJBQ5PHatm2LChUqPLd9\n8uTJOHz4MGbMmIFx48ZBo9E89x5d51QUDw8PnDx5Evn5+cjLy8OJEyfg4eGB6tWr4+LFi6hVqxYA\noEWLFtrKuyi//PILWrZsiRo1asDIyAi9evXSvvbRRx9h3bp1UKlUqFq1Kt58803cuXOn2GPl5ubi\n7Nmz2oqxbt26aN26Nc6dOwcAyMvLQ9++fQEATZo0wb179547RpUqVZCUlIQffvgBjx8/ho+PD3r3\n7o2LFy/C2toa7du3BwD06NEDt2/fLvIYRGUFr6mWc9WqVdN2tRUnKSkJVatW1T6vWrUqHj16pH3+\n7DVGIyMjbdfsy2jWrBlmz56NHTt2wN/fHy4uLpg7d65B2rewsNC+59nnarUaBQUFAIBTp05h/fr1\nSE5OhkqlghBC+9q/PRvTv9vx9PTEtm3bnuvCfdlzKkq9evVQu3Zt/Prrr8jNzYWtrS1q166N/Px8\nrFq1CseOHUN+fj7S09Nha2tb7HEeP36MypUra59XqVJF+/WtW7ewePFi3Lx5E2q1Gg8ePNAmxaKk\npKRACPHc8ZKSkgA8/V6bm5sDKPx9fpa1tTVWr16NLVu2IDAwEC1btsS8efOQmpqKuLg4eHh4aN9r\namqqPTZRWcRKtZx799138ejRo+dGhebm5mLlypXIzMxEjRo1kJKSon0tJSUFNWrUeKV2/v0L9fHj\nx9qvPTw8sGPHDhw/fhyZmZnPDVCSov2XkZubi0mTJmHs2LGIiIjAgQMH9Bqk9fDhQ/zwww/o3r07\n1qxZU+R79D0nd3d3/PTTT/jpp5/g6ekJ4Omo5GPHjiE4OBgRERFFXsN9VpUqVQqN7E5OTtZ+/dln\nn+HNN99EeHg4Dh8+jLfffvuFx6pWrRrUanWhz/Nlej/+rU2bNti0aRPOnDmD2rVrY9myZdBoNGjY\nsCEOHz6sfZw9exZNmzZ9pWMTlSYm1XKuSpUqGDFiBPz9/REbGwvg6WCVOXPm4Pfff0fFihXRpUsX\nhIaGIj8/HxkZGfj+++/RuXPnV2qnZs2a2ukYYWFhyM7OBvB0ROnatWsBAJaWlmjYsOFz+0rR/svI\nzMxERkaG9pf29u3bYWJigoyMDACAsbExUlNTdR5nwYIFGDFiBGbOnInw8HD88ccfz71H33Nyd3dH\nZGQkjh8/rq3gHj16hLp168LKygrJyckIDw/XDpQqiqOjIy5evIikpCTk5+fjwIED2tcePXoEOzs7\nGBkZ4cyZM4iNjS10/v+eZmVsbIwOHTpg9+7dAIDbt2/jwoULaNeunc5z+cfp06cxb948FBQUwNzc\nHG+//TZUKhUcHByQkJCAS5cuAQDi4uIwbdo0CCFgbGyMJ0+eQPAmW1TGMKkSJk6cCC8vL4wdOxbu\n7u7o27cvqlevrq2yfHx8UKtWLXTv3h39+vVDly5dtFXSyxo3bhy2bduGHj164K+//sJ//vMfAMB7\n772Hq1evomvXrvD09MT//vc/DB8+vNC+UrT/Mv75A6N3797o3bs36tevD1dXV4wZMwYZGRnw8PCA\nt7d3sSNYgafXSu/cuQNvb29UqlQJkydPxuzZs5/rktb3nGxtbVFQUABra2vtaO0ePXogJSUFbm5u\n8PPzw6RJk/DgwYNiR0Lb2dnB29sbffr0Qd++feHk5KR9bezYsQgKCkKPHj1w/vx5TJgwAatXr8bF\nixfh6uqKZcuWYdGiRYWON2/ePERFRcHDwwPjx4/H/PnzUbt2bZ3n8o+WLVsiKysL7u7u6N69O8
LC\nwvDJJ5/AzMwMq1atQmBgIDw9PTF+/Hh4eHhApVKhefPmiI+PR8eOHV/pcgORoakE/9QjIiKSBCtV\nIiIiiTCpEhERSYRJlYiISCJMqkRERBJhUiUiIpJImV1RSeVmI3cIZCAZh/+UOwQyEBX0u6MRlX1m\nRuYGO3ZJft+LH4tfRlMOZTapEhFROaHn7SXLInb/EhERSYSVKhERyUtB5R2TKhERyUtB3b9MqkRE\nJC/l5FQmVSIikpmCKlUF9WQTEdFrSV2Chw7Xr1+Hq6srgoODAQABAQHo2bMnfHx84OPjgxMnTgAA\nDhw4gH79+mHAgAHYu3cvgKf3WPbz88OgQYMwZMgQxMXF6WyPlSoREcnLQJVqRkYGAgMD0bZt20Lb\np0yZAmdn50LvW7t2LUJDQ2FiYoL+/fvDzc0Nx48fR5UqVbB8+XKcPn0ay5cvx+eff/7CNlmpEhGR\nIpmamuLLL7+ERqN54fsuXboEe3t7VK5cGWZmZnByckJ0dDQiIyPh5uYGAGjXrh2io6N1tsmkSkRE\n8lKV4PECxsbGMDMze257cHAwhg4dismTJyMpKQmJiYmwsrLSvm5lZYWEhIRC29VqNVQqFXJycl7c\n5sucLxERkcGoS2+g0vvvvw9LS0vY2dlh06ZNWLNmDRwdHQu9RwhR5L7FbX8WK1UiIpKXgSrVorRt\n2xZ2dnYAABcXF1y/fh0ajQaJiYna98THx0Oj0UCj0SAhIQHA00FLQgiYmpq+8PhMqkREJC+VSv/H\nK5o4caJ2FG9UVBTefPNNODg4ICYmBqmpqUhPT0d0dDRatGiB9u3b4/DhwwCA48ePo3Xr1jqPz+5f\nIiKSl4F6f69cuYKgoCDcvXsXxsbGiIiIwJAhQzBp0iRUrFgR5ubmWLRoEczMzODn5wdfX1+oVCqM\nHz8elStXRrdu3XD27FkMGjQIpqamWLx4se5TES/TSSwD3vpNuXjrN+Xird+Uy6C3fuvfUO99RehN\nCSMpOVaqREQkr1IcqGRoTKpERCQv5eRUJlUiIpKZgtb+ZVIlIiJ5sfuXiIhIIsrJqUyqREQkMwV1\n/3LxByIiIomwUiUiInkpp1BlUiUiIplxoBIREZFElJNTmVSJiEhmChqoxKRKRETyUtCQWSZVIiKS\nl4IqVQX9fUBERCQvVqpERCQv5RSqTKpERCQzBXX/MqkSEZG8FHQhkkmViIjkxUqViIhIIsrJqUyq\nREQkMwUtU6ignmwiIiJ5sVIlIiJ58ZoqERGRRJSTU5lUiYhIXipWqkRERNJgUiUiIpKIgnIqkyoR\nEclLraCsyik1REREEmFS1ZOxkTGWjf4vxI93ULdG7SLfU6miBXbPXo/YkCj8sfkE+nboVuJ2F/nO\nwLUtJ/HH5hNY+FGAdrtNzdo4tOBr/L75OP7YfAJjew4tcVukvx8jjqJ3976FHu++44T09HS5QyMJ\n/XzyFBzeccTdu/fkDuW1plKp9H6UNez+1dP3n23BL39eeuF7VoyZi/tJ8WgwuDUa2zTEhk8W4/uz\nEcgvyNerzYFdeqGLQ1s0G+0GIQROLg9Fv47dse/UIXw1ZSnCzx/HF/s3w6ZmbcRsOoqTl8/h99jr\nerVFJePm7go3d1ft84jwIzhy+AgsLCxkjIqklJmZiS9WrELVqlXlDuW1VxaTo75YqeopMPhzfPr1\n8mJfNzUxxSDn97Hgm1UAgOt3bsJlmpc2oY7sNhh/bD6Bv3dE4puZa2BmalZo/w+7DsBcnymFtg3o\n1APbjuxBTm4OcvNysePoPgzo1B0AsPFQCL4K3wkAuJNwH/+7dwuNbRpKdr6kv+zsbKxdtQ6T/D6R\nOxSS0Ia1G9GjV3dYWJjLHcprT0mVqkGTanp6OmJjYxEbG4uMjAxDNlXqzv0R/cLX36xri8zsLAzr\n6oWrXx1D1OqDeM+xAwCgQ9NWCBw2FS7TB8LWpy0ep6chcN
hUnW02tmmIv+7Fap//dT8Wb9f7DwBg\n/+lwpGc9/R63sXNCbStrnL5yXt/TIwnt3/cd3nV0QL369eQOhSRy4/oNRJ49hyFDB8sdiiKoVPo/\nyhqDdP/GxMRgwYIFSE1NRbVq1SCEQHx8PKytrTFnzhy89dZbhmi2TLG0qALLSlWQlZONJiNc0LVF\nZ4TO2YiGQ9ujZxs37D7xA+4/eggA2HBwB76d+yWmbZqPw4uC0UBjg6oWlWFibALvLr2Ql58H+1Gu\nMK9ghqycbG0bmdlZsDD7v7+S69Wsg5PLQ2FZqQp8V0xD4uOkUj9vKqygoAA7tgXji7Wfyx0KSUQI\ngfnzFiBglj9MTEzkDkcRymLFqS+DJNWFCxdiwYIFaNSoUaHtV69exWeffYaQkBBDNFumPE5Pg5Ha\nCOt/+BoAcOTCSdyOv4c2dk6wrFQFfdp7oGvzTgAAtVoNU2NTAIDHjCEAnnb/vmFdD/N2rNAeMz0r\nE2amFbTPzStUxJOs/xv4EpdwDw2HtsMbteohfMEOZOVkI/z8MYOfKxXv0m+XYW5eEf95s5HuN9Nr\nIXTPPjRs1BBOzR3lDkUxlJRUDdL9K4R4LqECQJMmTZCfr98gnddNXMLT0YCVzStpt+UX5CO/IB/3\nHj3E9h/3ws63C+x8u+Ct4Z1Q74OWOo95Le5/+E+dN7TP36xri99jb8DUxBQfeXhDrX76cd56EIdD\n53/SJm2Sz6kTp9ChUwe5wyAJnTh2AsePnYBLR1e4dHTFgwcPMdhrMM5H/SJ3aFQGGCSpOjg4YMyY\nMQgNDcWxY8dw7Ngx7NmzB76+vmjVqpUhmixzHqenIuLCSUwdMBoA0OptR7xhbYNf/ryEA5FH0LeD\nJ2pUtQIA9GrbFdMHjtN5zD0nD2JU98EwN6sICzNzjOo+GDuPf4ec3BzMHDQBQ936AwAszMzRpVlb\nXL75h+FOkF7Kn39eh21DW7nDIAmt3bgGJ04fw7FTR3Hs1FHUqmWNkD0haNVa9x/GVDRVCf6VNQbp\n/p0xYwZ++eUXREZG4vLlywAAjUaDCRMmwNHx9e8y0VjWwMnlodrnJ5btRV5+Ht6b7o2IRcGwH/V0\nKoXviqn4evrn+HtHJB6np2HggnFITktBcloKFu5cgxPL9kKtViM+5RFGf+5fqI3tR/Y+1+6+U4fQ\n/E17/LbhCIQQ+ObYdzh47igAoO+8kVg9PhD+XuNgbGSMA5FHsO3IHgN+F+hlxD98iBo1qssdBlGZ\npqTuX5UQQsgdRFFUbjZyh0AGknH4T7lDIAMpi5UDScPMyHBTh6rObK33vo8XRkkYSclx8QciIpKV\nktb+ZVIlIiJZKan7l0mViIhkpaSkymUKiYiIJMJKlYiIZKWgQpVJlYiI5KWk7l8mVSIikhWTKhER\nkUSYVImIiCTCpEpERCQRBeVUTqkhIiLlun79OlxdXREcHAwAuH//PoYNG4YhQ4Zg2LBhSEhIAAAc\nOHAA/fr1w4ABA7B379O113Nzc+Hn54dBgwZhyJAhiIuL09kekyoREclKpVLp/XiRjIwMBAYGom3b\nttptn3/+Oby8vBAcHAw3Nzds3boVGRkZWLt2LbZt24YdO3Zg+/btSElJwcGDB1GlShXs3LkTY8aM\nwfLly3WeC5MqERHJylBJ1dTUFF9++SU0Go1229y5c+Hu7g4AqFatGlJSUnDp0iXY29ujcuXKMDMz\ng5OTE6KjoxEZGQk3NzcAQLt27RAdHa3zXJhUiYhIVmqVSu/HixgbG8PMzKzQNnNzcxgZGSE/Px/f\nfPMNevbsicTERFhZWWnfY2VlhYSEhELb1Wo1VCoVcnJyXnwuen4PiIiIJKFS6f/QR35+PqZPn442\nbdoU6hr+R3F3RH2ZO6UyqRIRkawM1f1bnBkzZqBBgwaYMGECAECj0SAxMVH7enx8PDQaDTQajXYg\nU25uLoQQMDU1feGxmV
SJiEhWqhL8e1UHDhyAiYkJPv74Y+02BwcHxMTEIDU1Fenp6YiOjkaLFi3Q\nvn17HD58GABw/PhxtG6t+2bqnKdKRESKdOXKFQQFBeHu3bswNjZGREQEHj16hAoVKsDHxwcA0KhR\nI3z66afw8/ODr68vVCoVxo8fj8qVK6Nbt244e/YsBg0aBFNTUyxevFhnmyrxMp3EMlC52cgdAhlI\nxuE/5Q6BDESfyoFeD2ZG5gY7tm2Qq977/u1/VMJISo6VKhERyYrLFBIREUlEQTmVSZWIiOTFSpWI\niEgiTKpEREQSUVJS5TxVIiIiibBSJSIiWSmoUGVSJSIieSmp+5dJlYiIZMWkSkREJBEmVSIiIoko\nKKcyqRIRkbyUVKlySg0REZFEWKkSEZGslFSpMqkSEZGsmFSJiIgkoqCcyqRKRETyYqVKREQkFSZV\nIiIiaSipUuWUGiIiIomwUiUiIlkpqFBlUiUiInkpqfuXSZWIiGTFpEpERCQRJlUiIiKJKCinMqkS\nEZG8lFSpckoNERGRRMpspZp5+LrcIRARUSlQUqVaZpMqERGVD0yqREREEmFSJSIikoiCciqTKhER\nyYuVKhERkUSUlFQ5pYaIiEgirFSJiEhWSqpUmVSJiEhWCsqpTKpERCQvVqpERERSYVIlIiKSBitV\nIiIiiaiVk1M5pYaIiEgqrFSJiEhW7P4lIiKSiFpBSVVn929qamppxEFEROWUSqXS+1HW6Eyq3bp1\nw9SpU3Hu3LnSiIeIiMoZdQkeZY3OmI4fP47u3bvj22+/Rd++fbFhwwbEx8eXRmxERFQOqFUqvR9l\njc6kamJiAmdnZyxZsgTLly/Hzz//DDc3N0ydOhVJSUmlESMRESlYuer+zczMxHfffYehQ4fCz88P\nvXr1wpkzZ/Dee+/h448/Lo0YiYiIXgs6R/+6urqiS5cumDp1Kpo1a6bd7unpifDwcIMGR0REylcW\nu3H1pTOpjh49GkOHDi3ytVWrVkkeEBERlS+G6sYtKCjA3LlzcePGDZiYmODTTz+Fubk5pk+fjvz8\nfNSsWRNLly6FqakpDhw4gO3bt0OtVsPLywsDBgzQq02dSfXs2bPo06cPKleurFcDREREL2KoUbw/\n/fQT0tLSsGvXLty+fRsLFiyAlZUVPvjgA3h6emLFihUIDQ1F7969sXbtWoSGhsLExAT9+/eHm5sb\nLC0tX7lNnUk1KysLLi4usLW1hYmJiXZ7SEjIKzdGRET0b4bq/r1165b2smX9+vVx79493LhxA/Pm\nzQMAODs7Y8uWLbC1tYW9vb22eHRyckJ0dDRcXFxeuU2dSXXcuHGvfFAiIqKXZaju38aNG2P79u34\n8MMPERsbi7i4OGRmZsLU1BQAUL16dSQkJCAxMRFWVlba/aysrJCQkKBXmzqr7latWkGtVuPq1av4\n/fffYWJiglatWunVGBER0b8Zap5q586dYW9vj8GDB2P79u1o2LBhoR5XIUSR+xW3/WXorFS/+OIL\nnDlzBs2bNwcAzJ8/H127dsXo0aP1bpSIiKg0TJ48Wfu1q6srrK2tkZWVBTMzMzx8+BAajQYajQaJ\niYna98XHx+Pdd9/Vqz2dlWpUVBR27doFf39/+Pv7Y/fu3Th+/LhejREREf2bqgSPF7l27RpmzJgB\nAPj555/xzjvvoF27doiIiAAAHDlyBB07doSDgwNiYmKQmpqK9PR0REdHo0WLFnqdi85KtaCgAGr1\n/+VeY2PjMrmKBRERvZ4MNVCpcePGEEKgf//+qFChApYtWwYjIyNtgVinTh307t0bJiYm8PPzg6+v\nL1QqFcaPH6/3jBeV0NF5PH/+fNy5cwft2rUD8HSKTf369TFz5ky9GnxZWfkZBj0+ERG9PDMjc4Md\ne/Bh/QfEhniskzCSktNZqc6cORPh4eG4dOkSVCoVevXqBU9Pz9KIjYiIygEl9X7qTKp3
795Fs2bN\nCi1ReP/+fVhbW8PIyMigwRERkfKVq2UKR40ahdjYWJibm0OlUiEjIwPW1tZIT0/HZ599Bnd399KI\nk4iIFEo5KfUlkmrnzp3Rvn17dOzYEQBw5swZnD9/Hj4+Phg7diyTKhER0f+nc0pNTEyMNqECQPv2\n7fHbb7+hRo0aMDbWmZOJiIheSEk3KX+pKTXBwcHalZV+/fVXpKSkIDo6ujTiIyIihSuLyVFfOqfU\nxMXFYdWqVbh27RoKCgrQqFEjTJgwATk5OTA3N0fDhg0NEhin1BARlR2GnFIz4qdP9N73q/e+kDCS\nktNZqdarVw9BQUFITEyERqMpjZiIiKgcUVKlqvOaamRkJFxdXbU3Kl+4cCGXKSQiIskYaplCOehM\nqitXrsSePXtQs2ZNAMCYMWOwfv16gwdGRETlg5IGKulMqubm5qhRo4b2uZWVVaFb5xAREdFTOq+p\nmpmZ4fz58wCAx48f49ChQ6hQoYLBAyMiovKhLFac+tJZqc6dOxebN29GTEwMunbtilOnTiEwMLA0\nYiMionJApVLp/ShrdFaqt2/fxsaNGwttO3r0KOrWrWuwoMqTqHPnsWLpSmRkZKBOndr4bME8WNey\nljssKqG7d++hl+f7sKlno93W1L4JFiyeL2NUJBX+3EpLZ3X3Gik2qd65cwdxcXEICgpCQEAA/pnO\nmpeXh4ULF8LV1bXUglSqjIxM+E8NwPpNa2H3jh1CdnyDwHkLsGb9KrlDIwloNDXx/aH9codBEuPP\nrfTKYsWpr2KTakJCAsLCwnD37l2sXbtWu12tVsPb21vvBlNTU1GlShW991eS81HnYWNjA7t37AAA\nffr2xoqlK5Geng4LCwuZoyOiovDnVnpKuqZabFJ1dHSEo6MjOnfu/FxVWpIlCidMmICvv/5a7/2V\nJPZWLOo90z1obmEOS0tL3I6Ng907b8sYGUnhyZN0TJowGX//fQt16tbBNH8/NGxkmBXIqPTw51Z6\n5SKp/qNNmzYICQlBcnIyACA3Nxf79u3D6dOni90nJCSk2NcePnyoR5jKlJWVBdMKpoW2VTCrgMzM\nTJkiIqlYWJjDs4cnPhw+FLVr18KO7cH4ZMJk7P9hH29E8Zrjzy29iM7rw5MmTcKff/6Jb7/9Funp\n6Th+/Dg+/fTTF+6zbds2/Pnnn0hOTn7ukZeXJ1Xsr72KFSsiJzun0LaszCyYmxtujU0qHZaWlpg5\nOwB169aBWq3G0GE+SHqUhNhbsXKHRiXEn1vplavRv9nZ2fjss8/g4+MDf39/pKSkIDAw8IUDldau\nXYv58+dj9uzZMDUt/BddVFRUyaNWCFvbNxARfkT7PC0tDampqajfoL5sMZE0Uh+nIjUtDTY2/zdK\nPj8/n1WqAtja8udWauoyueAQkknsAAAWOklEQVSgfnRWqrm5ucjIyEBBQQGSk5NhaWmJuLi4F+7T\nuHFjbNy4schfIAEBAfpHqzAtW7fE/Xv3EX3xVwBA8PYQdOrSEebmFWWOjErqypWrGDl8FJKSkgAA\n+/Z+i9q1axWaYkOvJ/7cSq9cVarvv/8+9uzZgwEDBqBbt26wsrJCgwYNdB64YsWi/4M1adLk1aNU\nKDMzMwQtX4xF8xchMyML9RrUQ+CCeXKHRRJo174tBnp74cPBw6FWq6HRaLD8i2UwMjKSOzQqIf7c\nSk9JA5V03k/1WQ8fPsSjR49gZ2dn8L8QeD9VIqKyw5D3U50ZOUvvfRe2XSBhJCWns/v3woUL8Pf3\nBwBYW1tjyZIluHDhgsEDIyKi8kFJ3b86k+ry5csxbtw47fP58+djxYoVBg2KiIjodaTzmqoQotA1\nVBsbG6jVSlqpkYiI5KSka6o6k2qdOnWwdOlStGrVCkIInDp1CrVq1SqN2IiIqBxQKWhJfZ1nsmjR\nIlhYWGDnzp3YtWsXrK2tMX8+77RBRETSUKtUej/K
Gp2VaoUKFQpdUyUiIpJSWRxwpC8u70JERLJS\nKWhFJSZVIiKSVVnsxtXXS10dTk5ORkxMDACgoKDAoAERERG9rnQm1YMHD2LgwIGYMWMGACAwMBB7\n9+41eGBERFQ+lKvFH7Zu3Yrvv/8e1apVAwD4+/tjz549Bg+MiIjKB3UJ/pU1Oq+pVq5cudDi+GZm\nZjAxMTFoUEREVH6UxYpTXzqTarVq1bB//35kZ2fj6tWrCAsLg5WVVWnERkRE5YCSkqrO2nnevHmI\niYlBeno6Zs+ejezsbC7+QEREklFDpfejrNFZqVapUgVz5swpjViIiKgcUlKlqjOpdu7cucgTPnHi\nhCHiISIiem3pTKrffPON9uvc3FxERkYiOzvboEEREVH5oaTFH3Qm1bp16xZ6/sYbb8DX1xfDhg0z\nVExERFSOlKtlCiMjIws9f/DgAW7fvm2wgIiIqHxRq8refFN96Uyq69at036tUqlQqVIlzJs3z6BB\nERFR+VGuBioFBASgSZMmpRELERGVQ0rq/tVZcwcFBZVGHERERK89nZVqnTp14OPjAwcHh0LLE37y\nyScGDYyIiMqHcjX618bGBjY2NqURCxERlUNK6v4tNqkeOHAAvXr1woQJE0ozHiIiKmeUVKkWe001\nNDS0NOMgIqJySqVS6/0oa3R2/xIRERmSIbt/Dxw4gK+++grGxsb4+OOP8dZbb2H69OnIz89HzZo1\nsXTpUpiamuLAgQPYvn071Go1vLy8MGDAAL3aUwkhRFEv2Nvbo3r16s9tF0JApVIZfO3frPwMgx6f\niIhenpmRucGO/dUf6/Xed4Td2GJfS05Ohre3N/bt24eMjAysXr0aeXl56NSpEzw9PbFixQrUqlUL\nvXv3Rp8+fRAaGgoTExP0798fwcHBsLS0fOV4iq1U33nnHaxYseKVD0hERFQWREZGom3btqhUqRIq\nVaqEwMBAuLi4aBcwcnZ2xpYtW2Brawt7e3tUrlwZAODk5ITo6Gi4uLi8cpvFJlVTU9Pn1v0lIiKS\nmqFWVLpz5w6ysrIwZswYpKamYuLEicjMzISpqSkAoHr16khISEBiYiKsrKy0+1lZWSEhIUGvNotN\nqs2aNdPrgERERK/CkDcbT0lJwZo1a3Dv3j0MHToUz17xLObqZ7HbX0axQ6emTZum90GJiIhelkql\n0vvxItWrV4ejoyOMjY1Rv359WFhYwMLCAllZWQCAhw8fQqPRQKPRIDExUbtffHw8NBqNXudS9sYj\nExFRuWKoKTUdOnTAuXPnUFBQgOTkZGRkZKBdu3aIiIgAABw5cgQdO3aEg4MDYmJikJqaivT0dERH\nR6NFixZ6nQun1BARkawM1f1rbW0Nd3d3eHl5AQBmz54Ne3t7+Pv7Y/fu3ahTpw569+4NExMT+Pn5\nwdfXFyqVCuPHj9cOWnpVxU6pkRun1BARlR2GnFITfGOL3vsOefMjCSMpOXb/EhERSYTdv0REJKty\nsaA+ERFRaTDUPFU5MKkSEZGsDDlPtbQxqRIRkazK4t1m9MWkSkREsuI1VSIiIoko6ZqqcmpuIiIi\nmbFSJSIiWbH7l4iISCJK6v5lUiUiIllxSg0RUREEyuRS4lTGsVIlIiKSiEpBY2aZVImISFZKqlSV\n8+cBERGRzFipEhGRrDilhoiISCJqBXX/MqkSEZGsWKkSERFJREkDlZhUiYhIVpxSQ0REJBElVarK\n+fOAiIhIZqxUiYhIVlz7l4iISCJK6v5lUiUiIllxSg0REZFEWKkSERFJhFNqiIiIJKKkZQqV8+cB\nERGRzFipEhGRrDhQiYiISCIcqERERCQRVqpEREQSYaVKREQkEbWCxswyqRIRkayUVKkq588DIiIi\nmbFSJSIiWXGgEhERkUSU1P3LpEpERLJipUpERCQRJlUiIiKpsPuXiIhIGkqqVDmlhoiISCKsVImI\nSFYc/UtERCQR
JXX/MqkSEZGsmFSJiIgkwu5fIiIiibBSJSIikoihkmpmZiYCAgLw6NEjZGdnY9y4\ncXj77bcxffp05Ofno2bNmli6dClMTU1x4MABbN++HWq1Gl5eXhgwYIBebaqEEELi85BEVn6G3CEQ\n0SsSKJO/TkgCFY0sDHbsmKSLeu9rb9W82NfCwsJw9+5djBw5Enfv3sVHH30EJycndOrUCZ6enlix\nYgVq1aqF3r17o0+fPggNDYWJiQn69++P4OBgWFpavnI8nKdKRESyUqlUej9epFu3bhg5ciQA4P79\n+7C2tkZUVBTee+89AICzszMiIyNx6dIl2Nvbo3LlyjAzM4OTkxOio6P1Ohd2/xIRkawMfU3V29sb\nDx48wIYNGzB8+HCYmpoCAKpXr46EhAQkJibCyspK+34rKyskJCTo1RaTKhERycrQo3937dqFP/74\nA9OmTcOzVzyLu/pZkqui7P4lIiJZqUrw70WuXLmC+/fvAwDs7OyQn58PCwsLZGVlAQAePnwIjUYD\njUaDxMRE7X7x8fHQaDR6nQuTqsyizp3HwH6D0NPzfYz2HYOHDx7KHRJJhJ+tcn337ffo26Mf+vTo\ni9G+YxF7K1bukF5rhkqqFy5cwJYtWwAAiYmJyMjIQLt27RAREQEAOHLkCDp27AgHBwfExMQgNTUV\n6enpiI6ORosWLfQ7F47+lU9GRia6de2O9ZvWwu4dO4Ts+AaRZ89hzfpVcodGJVReP9vyMPr375t/\nY/gQX+zevwvW1hrs3RWKQwfDsC14i9yhGZQhR//++ThG733fqmpf7GtZWVmYNWsW7t+/j6ysLEyY\nMAFNmzaFv78/srOzUadOHSxatAgmJiY4fPgwNm/eDJVKhSFDhqBXr156xcNrqjI6H3UeNjY2sHvH\nDgDQp29vrFi6Eunp6bCwMNx/YDI8frbKdfOvm6jfoB6srZ92D7Zq0xJfrFT2H0uvKzMzMyxfvvy5\n7Vu3bn1um4eHBzw8PErcpkG7f4sqgh88eGDIJl8rsbdiUa+ejfa5uYU5LC0tcTs2TsaoSAr8bJXL\n3qEZ4uLu4H83/gchBI4e+Qlt2raRO6zXmqG6f+VgkKT6448/wtnZGW3btoW/vz+ePHmifW369OmG\naPK1lJWVBdMKpoW2VTCrgMzMTJkiIqnws1UujaYmJk6agIF9B6FT2y7YvXMPPpkyUe6wXmtMqjps\n2rQJ+/fvx9mzZ+Hk5ARfX1+kpaUBKNlQZaWpWLEicrJzCm3LysyCubm5TBGRVPjZKte136/hq42b\ncTDiAE6dO4lPJk/EJ+Mn83dbCRhq8Qc5GCSpGhkZwdLSEmq1GgMHDsTIkSPh6+uLpKSkMvlNkIut\n7Ru4ffv/ugPT0tKQmpqK+g3qyxYTScPWlp+tUkWdOw+Hdx1Qu05tAEBXz664+ddNJCenyBzZ60xV\ngkfZYpCk6uTkhNGjR2vnArm6umLixIkYNmwYbt26ZYgmX0stW7fE/Xv3EX3xVwBA8PYQdOrSEebm\nFWWOjEqKn61yvWHbAJd+u4SUlKdJ9PTPZ1CjRg1Uq/bq68TSU0qqVA02pSYqKgqtWrUqdNJPnjxB\nWFgYvLy8dO5fHqbUAMAv5y9gyaIlyMzIQr0G9RC4YB5q1Kwhd1gkgfL42ZaHKTUAsH7NBoQfOgyV\nSgWLShaY5u8Hx+aOcodlUIacUnMz7U+9921Y+S0JIyk5zlMlIsmUl6RaHjGpvhzOUyUiIlmVxVG8\n+mJSJSIiWZXFa6P6YlIlIiJZsVIlIiKSCJMqERGRRNj9S0REJBElVaq8nyoREZFEWKkSEZGs2P1L\nREQkESV1/zKpEhGRzJhUiYiIJKGclMqkSkREMuM1VSIiIskoJ6lySg0REZFEWKkSEZGslFOnMqkS\nEZHslJNWmVSJiEhWShqoxGuqREREEmGlSkREsuKKSkRERBJRUlJl9y8REZFEmF
SJiIgkwu5fIiKS\nFUf/EhER0XNYqRIRkayUNFCJSZWIiGTGpEpERCQJ5aRUXlMlIiKSDCtVIiKSlZJG/zKpEhGRzJhU\niYiIJKGclMqkSkREslNOWmVSJSIiWSnpmipH/xIREUmESZWIiEgi7P4lIiJZcZlCIiIiyTCpEhER\nSUI5KZVJlYiIZKak0b9MqkREJDMmVSIiIkkoJ6VySg0REZFkWKkSEZHMDFerLly4EJcuXYJKpcLM\nmTPRrFkzg7UFMKkSEZHMDDVQ6fz584iNjcXu3bvx119/YebMmdi9e7dB2voHu3+JiEiRIiMj4erq\nCgBo1KgRHj9+jCdPnhi0TSZVIiKSlaoE/14kMTER1apV0z63srJCQkKCQc+lzHb/mhmZyx0CERGV\ngtL6fS+EMHgbrFSJiEiRNBoNEhMTtc/j4+NRs2ZNg7bJpEpERIrUvn17REREAACuXr0KjUaDSpUq\nGbTNMtv9S0REVBJOTk5o0qQJvL29oVKpMHfuXIO3qRKl0clMRERUDrD7l4iISCJMqkRERBJhUpXZ\nwoULMXDgQHh7e+Py5ctyh0MSu379OlxdXREcHCx3KCSxJUuWYODAgejXrx+OHDkidzhURnCgkozk\nWEKLSk9GRgYCAwPRtm1buUMhiZ07dw43btzA7t27kZycjD59+qBr165yh0VlACtVGcmxhBaVHlNT\nU3z55ZfQaDRyh0ISa9myJb744gsAQJUqVZCZmYn8/HyZo6KygElVRnIsoUWlx9jYGGZmZnKHQQZg\nZGQEc/OnqwCFhoaiU6dOMDIykjkqKgvY/VuGcHYT0evl6NGjCA0NxZYtW+QOhcoIJlUZybGEFhFJ\n49SpU9iwYQO++uorVK5cWe5wqIxg96+M5FhCi4hKLi0tDUuWLMHGjRthaWkpdzhUhrBSlZEcS2hR\n6bly5QqCgoJw9+5dGBsbIyIiAqtXr+YvYQUICwtDcnIyJk2apN0WFBSEOnXqyBgVlQVcppCIiEgi\n7P4lIiKSCJMqERGRRJhUiYiIJMKkSkREJBEmVSIiIokwqdJr486dO2jatCl8fHzg4+MDb29v+Pn5\nITU1Ve9j7t27FwEBAQCAyZMn4+HDh8W+Nzo6GnFxcS997Ly8PLz11lt6x6Zvu0QkHyZVeq1YWVlh\nx44d2LFjB3bt2gWNRoP169dLcuyVK1fC2tq62Ne//fZbWZKbXO0S0avj4g/0WmvZsqX2dnkuLi7w\n9PREXFwcVq1ahbCwMAQHB0MIASsrK8yfPx/VqlVDSEgIdu7ciVq1ahW6g4yLiwu2bt2KevXqYf78\n+bhy5QoAYPjw4TA2Nsbhw4dx+fJlzJgxAw0aNMC8efOQmZmJjIwMTJkyBe3atcPNmzcxbdo0VKxY\nEa1bty4y5qysLMyYMQP3798HAEyZMgWtWrXCN998g++//x4mJiaoUKECVq5ciaioqJdqNy4uDtOm\nTYNKpUKzZs1w8uRJbNy4ETY2Nli4cCGuXr0KAGjTpg0mTZqEqKgorFu3DhUqVED79u2xceNG/Pjj\nj7CwsEBOTg6cnZ1x6NAhLlRB9KoE0WsiLi5OdOzYUfs8Ly9PBAQEiI0bNwohhHB2dhZ79uwRQghx\n79490bNnT5GdnS2EEGLbtm1i0aJFIjU1VbRq1UokJSUJIYQYM2aM8Pf31+5/69YtsX//fjFx4kQh\nhBCPHz8WI0eOFHl5eWLIkCHizJkzQgghRo4cKSIjI4UQQsTHxwtnZ2eRm5srpkyZIkJCQoQQQkRE\nRIjGjRs/dx5r1qwRixcvFkII8ffff4upU6cKIYTYsmWLSEtLE0II8d///lfs2LFDCCFeql0/Pz+x\nfft2IYQQJ0+eFG+99Za4deuW+OGHH8SoUaNEQUGByMvLE/379xdRUVHi3LlzwsnJSSQnJwshhAgI\nCBD79u0TQgjx008/iSlTpuj3IRGVc6xU6b
WSlJQEHx8fAEBBQQFatGiBYcOGaV93dHQEAPz6669I\nSEiAr68vACAnJwc2NjaIjY1F3bp1tbfca926Na5du1aojcuXL2urzCpVqmDTpk3PxREVFYX09HSs\nXbsWwNPbvD169AjXr1/HqFGjADytCoty+fJlDBo0CADwxhtvYOnSpQAAS0tLjBo1Cmq1Gnfv3i3y\n5grFtXvt2jWMGDECANCpUyftbckuXbqEtm3bQqVSwcjICC1atEBMTAyaNm0KW1tbbSXq7e2NZcuW\noW/fvggPD0f//v2L+QSI6EWYVOm18s811eKYmJgAeHqD8GbNmmHjxo2FXo+JiYFKpdI+LygoeO4Y\nKpWqyO3PMjU1xerVq2FlZVVouxACavXToQrF3bS6qOM/ePAAQUFBOHToEKpXr46goKBXaregoEDb\nLgDt18+e6z/x/bPtn+8VADg4OCAtLQ03b97EjRs3iv2DgIhejAOVSJHs7e1x+fJl7U3fw8PDcfTo\nUdSvXx937txBamoqhBCIjIx8bl9HR0ecOnUKAPDkyRMMGDAAOTk5UKlUyM3NBQA0b94c4eHhAJ5W\nzwsWLAAANGrUCL/99hsAFHnsfx//zp07+PDDD/Ho0SNUq1YN1atXR0pKCk6fPo2cnBwAeKl2GzZs\niF9//RUAcObMGaSnpwMA3n33XZw9exZCCOTl5eH8+fNwcHAoMi4vLy/MmjULXbt2fS4ZE9HLYaVK\nimRtbY1Zs2Zh9OjRqFixIszMzBAUFISqVatizJgxGDx4MOrWrYu6desiKyur0L6enp6Ijo6Gt7c3\n8vPzMXz4cJiamqJ9+/aYO3cuZs6ciVmzZmHOnDk4dOgQcnJyMHbsWADA+PHj4e/vj8OHD8PR0RHG\nxs//iPn4+OC///0vPvjgAxQUFGDSpEmws7NDgwYN0L9/f9SvXx8ff/wxPv30U3Tu3Pml2p04cSKm\nTZuGgwcPwtHREbVq1YKRkRE8PDwQHR2NQYMGoaCgAK6urmjevDmioqKei6tXr15YtGgRPv/8cwN8\nIkTlA+9SQ6QAMTExyM7ORosWLZCYmAhPT0+cPXu2UBevLv9U88uXLzdgpETKxkqVSAHMzc21XcG5\nubmYN2/eKyXUiRMn4tGjR1i1apWhQiQqF1ipEhERSYQDlYiIiCTCpEpERCQRJlUiIiKJMKkSERFJ\nhEmViIhIIkyqREREEvl/7zbEbaPeua8AAAAASUVORK5CYII=\n",
            "text/plain": [
              "<matplotlib.figure.Figure at 0x7f60652da358>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'\\nIgnore:           0\\nSender Name:      1 \\nSender KVK:       2 \\nSender IBAN:      3 \\nInvoice Reference:4\\nTotal:            5\\n'"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 26
        }
      ]
    },
    {
      "metadata": {
        "id": "tvk0pFS87QEn",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Test on many invoices and average the accuracies"
      ]
    },
    {
      "metadata": {
        "id": "CRp3ZKWX7QEo",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          },
          "output_extras": [
            {
              "item_id": 62
            },
            {
              "item_id": 85
            }
          ],
          "base_uri": "https://localhost:8080/",
          "height": 5717
        },
        "outputId": "845479be-6da0-4978-860f-545eaae97deb",
        "executionInfo": {
          "status": "ok",
          "timestamp": 1521493143132,
          "user_tz": -60,
          "elapsed": 28940,
          "user": {
            "displayName": "Fernando Lasso",
            "photoUrl": "//lh5.googleusercontent.com/-nEu68SAVl9Y/AAAAAAAAAAI/AAAAAAAABSo/fc8BpWFBAC8/s50-c-k-no/photo.jpg",
            "userId": "102520388349909004575"
          }
        }
      },
      "cell_type": "code",
      "source": [
        "#Function to generate random invoices and test the accuracy of the model versus\n",
        "#the accuracy of the post processing estimations\n",
        "def test_quality(inv_n,seed=0):\n",
        "    accuracies = []\n",
        "    accuracies_clean = []\n",
        "    '''\n",
        "    Ignore:           0\n",
        "    Sender Name:      1 \n",
        "    Sender KVK:       2 \n",
        "    Sender IBAN:      3 \n",
        "    Invoice Reference:4\n",
        "    Total:            5\n",
        "    '''\n",
        "    #Regex patterns\n",
        "    patterns = [r'[0-9]{8}',\n",
        "                r'[A-Z]{2}[0-9]{2}[A-Z]{4}[0-9]{10}',\n",
        "                r'([A-Z]|[0-9]){5}',\n",
        "                r'€ [0-9]']\n",
        "    #Paddings around regex search \n",
        "    paddings = [1,20,1,5]\n",
        "    \n",
        "    #Generate inv_n amount of invoices \n",
        "    for i in range(inv_n):\n",
        "        #Get specific invoice or generate random\n",
        "        if(seed>0):\n",
        "          chars,data,y_true = get_invoice(seed)\n",
        "        else:\n",
        "          chars,data,y_true = get_invoice()\n",
        "          \n",
        "        #Get predicted ys\n",
        "        y_pred = get_preds(data)\n",
        "        #Calculate accuracy in percentage\n",
        "        accuracies.append(sum(1 for x,y in zip(y_true,y_pred) if x == y) / len(y_true))\n",
        "        \n",
        "        #Get melted info and clean predictions from small length prediction\n",
        "        melted,pred =melt_info(chars,y_pred,5)\n",
        "        \n",
        "        #Clean name sections\n",
        "        y_cleaned = cleanNames(chars,y_pred,melted)\n",
        "        \n",
        "        #Match regex per category to have more precise estimate\n",
        "        for j in range(0,4):\n",
        "            y_cleaned = cleanByRegex(chars,y_cleaned,melted,j+2,patterns[j],paddings[j])\n",
        "        #Calculate accuracy of cleaned predictions\n",
        "        accuracies_clean.append(sum(1 for x,y in zip(y_true,y_cleaned) if x == y) / len(y_true))\n",
        "    print(\"Accuracy before cleaning: \"+str(round(sum(accuracies)/len(accuracies),4)))\n",
        "    print(\"Accuracy after cleaning:\" +str(round(sum(accuracies_clean)/len(accuracies_clean),4)))\n",
        "\n",
        "test_quality(100)\n",
        "\n",
        "#Get example\n",
        "#test_quality(1,20)"
      ],
      "execution_count": 29,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "success on 1\n",
            "success on 4\n",
            "success on 1\n",
            "success on 4\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 4\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 5\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 4\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 4\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 5\n",
            "success on 1\n",
            "success on 4\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 4\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 4\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 5\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 5\n",
            "success on 4\n",
            "success on 4\n",
            "success on 1\n",
            "success on 4\n",
            "success on 1\n",
            "success on 4\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 4\n",
            "success on 4\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 4\n",
            "success on 4\n",
            "success on 1\n",
            "success on 4\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 4\n",
            "success on 1\n",
            "success on 4\n",
            "success on 4\n",
            "success on 4\n",
            "success on 4\n",
            "success on 1\n",
            "success on 4\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 4\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 4\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 5\n",
            "success on 4\n",
            "success on 4\n",
            "success on 4\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 4\n",
            "success on 4\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 4\n",
            "success on 4\n",
            "success on 4\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 4\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 5\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 4\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 5\n",
            "success on 4\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": [
            "success on 4\n",
            "success on 4\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 4\n",
            "success on 1\n",
            "success on 4\n",
            "success on 1\n",
            "success on 4\n",
            "success on 1\n",
            "success on 4\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 4\n",
            "success on 1\n",
            "success on 4\n",
            "success on 4\n",
            "success on 1\n",
            "success on 4\n",
            "success on 1\n",
            "success on 4\n",
            "success on 4\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 4\n",
            "success on 1\n",
            "success on 4\n",
            "success on 1\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 4\n",
            "success on 5\n",
            "success on 1\n",
            "success on 2\n",
            "success on 3\n",
            "success on 5\n",
            "success on 1\n",
            "success on 4\n",
            "Accuracy before cleaning: 0.9815\n",
            "Accuracy after cleaning:0.979\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "zRYMUUrTJARy",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## Obtain system outputs from predictions"
      ]
    },
    {
      "metadata": {
        "id": "rZHf701mJARz",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "from itertools import groupby\n",
        "# Create groups by the predicted output\n",
        "# The this code will return a tuple with the format\n",
        "# (category, length, starting index)\n",
        "\n",
        "# TODO: This code is ugly and very hard to understand\n",
        "# But it works\n",
        "\n",
        "# Group by predicted category\n",
        "g = groupby(enumerate(y_pred), lambda x:x[1])\n",
        "\n",
        "# Create list of groups\n",
        "l = [(x[0], list(x[1])) for x in g]\n",
        "\n",
        "# Create list with tuples of groups\n",
        "groups = [(x[0], len(x[1]), x[1][0][0]) for x in l]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "M40M3Y8eJAR2",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "# Show grouping\n",
        "groups[:10]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "_J6W-RV2JAR4",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "'''\n",
        "We only want to consider sequences of predictions of the same type \n",
        "that have a minimum length. This way we remove the noise\n",
        "But we also might remove some good predictions\n",
        "\n",
        "The min length is set to 5 here, certainly a value to experiment with\n",
        "'''\n",
        "candidates = []\n",
        "# Loop over all groups\n",
        "for group in groups:\n",
        "    \n",
        "    # Unpack group\n",
        "    category, length, index = group\n",
        "    \n",
        "    # Ignore the ignore category and only consider category sequences longer than 5\n",
        "    if category != 0 and length > 5:\n",
        "        # Create text\n",
        "        candidate_text = ''.join(chars[index:index+length])\n",
        "        # Remove line breaks, this is just one way to prettify outputs!\n",
        "        candidate_text = candidate_text.replace('\\n','')\n",
        "        candidates.append((candidate_text,category))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "q92fYU5HJAR7",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "cell_type": "code",
      "source": [
        "# Show predictions\n",
        "\n",
        "'''\n",
        "Ignore:           0\n",
        "Sender Name:      1 \n",
        "Sender KVK:       2 \n",
        "Sender IBAN:      3 \n",
        "Invoice Reference:4\n",
        "Total:            5\n",
        "'''\n",
        "\n",
        "sorted(candidates, key=lambda tup: tup[1])"
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}