Maurits2014/public_assignment-1.ipynb

## public_assignment-1.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "PUBLIC_Assignment-1.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/Maurits2014/bc21d21a1bfd9e4dfaf339b06df998a3/public_assignment-1.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "h3oFJyy3EP3o",
        "colab_type": "text"
      },
      "source": [
        "**Name_Surname (s-number), Name_Surname (s-number), Name_Surname (s-number)**\n",
        "\n",
        "!!! Write here who contributed to which exercise and how (don't be too specific)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "A5WfZ7CTGxzv",
        "colab_type": "text"
      },
      "source": [
        "# Prelude"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3vyJOhNEsIP9",
        "colab_type": "code",
        "outputId": "fc17a95c-78c5-41f5-f15d-5dcc34ce2170",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 153
        }
      },
      "source": [
        "# Necessary modules (optionals are commented out)\n",
        "import os\n",
        "import sys\n",
        "from os import path as op\n",
        "from collections import Counter, defaultdict\n",
        "import re\n",
        "import spacy\n",
        "spacy.cli.download(\"en_core_web_sm\")\n",
        "import nltk\n",
        "nltk.download('punkt')\n",
        "from nltk.tokenize import word_tokenize\n",
        "from nltk.tokenize import sent_tokenize\n",
        "nltk.download('averaged_perceptron_tagger')\n",
        "import warnings # to silence stanford pos tagger"
      ],
      "execution_count": 2,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
            "You can now load the model via spacy.load('en_core_web_sm')\n",
            "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
            "[nltk_data]   Package punkt is already up-to-date!\n",
            "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
            "[nltk_data]     /root/nltk_data...\n",
            "[nltk_data]   Package averaged_perceptron_tagger is already up-to-\n",
            "[nltk_data]       date!\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "EhUb0WWNFiCd",
        "colab_type": "code",
        "outputId": "cb5f4112-7b52-482e-a72c-a184e371c2e5",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 85
        }
      },
      "source": [
        "# Environment info (DELETEME)\n",
        "print(f\"Python version: {sys.version}\")\n",
        "print(f\"NLTK version: {nltk.__version__}\")\n",
        "print(f\"SpaCy: {spacy.__version__}\")"
      ],
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Python version: 3.6.9 (default, Nov  7 2019, 10:44:02) \n",
            "[GCC 8.3.0]\n",
            "NLTK version: 3.2.5\n",
            "SpaCy: 2.2.4\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "LI92ICJpGkpG",
        "colab_type": "code",
        "outputId": "fd5c00df-f10c-487e-f771-822b9825ffa3",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "# IMPORT Google drive\n",
        "from google.colab import drive\n",
        "drive.mount('/content/gdrive')"
      ],
      "execution_count": 29,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount(\"/content/gdrive\", force_remount=True).\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "qezn-moTr9WP",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "!unzip '../pmb.zip'"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "uVcVzuBfubUT",
        "colab_type": "code",
        "outputId": "86810bf4-a155-431e-b844-56e8f9300f43",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "%  ls PMB_DIR"
      ],
      "execution_count": 37,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "ls: cannot access 'PMB_DIR': No such file or directory\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Ogz93UCHjvOk",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "% less ../pmb/p27/d0857/nl.raw"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "_ynVVx0T_nqC",
        "colab_type": "code",
        "outputId": "e20f0898-8cdf-4109-ded1-daea77c7cf5f",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "d = defaultdict(list)\n",
        "d['newkey1'].append('newtoken')\n",
        "d['newkey2']\n"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 11
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-2YAUJV1toc3",
        "colab_type": "code",
        "outputId": "1d8a450f-b0c6-4a6f-99c3-c4e209cd461e",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "d"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "defaultdict(list, {'newkey1': ['newtoken'], 'newkey2': []})"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 12
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CtlBo2-EFfCJ",
        "colab_type": "text"
      },
      "source": [
        "!!! **SET ENVIRONMENT VARIABLES**"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "iOmkgw3zFhB_",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "outputId": "fb03f663-9803-4d62-cfe9-b889b2a83f1d"
      },
      "source": [
        "# Change me based on where this notebook is\n",
        "THIS_DIR = '/content/gdrive/My Drive/Anno4ML/assign-1/PUBLIC' \n",
        "PMB_DIR = '../pmb'  # you can change this\n",
        "TOOL_DIR = '../'    # you can change this"
      ],
      "execution_count": 38,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "ls: cannot access 'PMB_DIR': No such file or directory\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "kUyZf1c_mrmR",
        "colab_type": "code",
        "outputId": "9587f89d-284c-49bf-e9ba-1fa081a3415b",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "os.chdir(THIS_DIR)\n",
        "print(f\"Current working dir: {os.getcwd()}\")"
      ],
      "execution_count": 21,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Current working dir: /content/gdrive/My Drive/Anno4ML/assign-1/PUBLIC\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zkUe40gsqF66",
        "colab_type": "text"
      },
      "source": [
        "---\n",
        "# Ex1: Reading and analyzing a corpus\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yNI4N7nAloBr",
        "colab_type": "text"
      },
      "source": [
        "## 1A: Read pmb directory in a dictionary"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "RyLTcV2GF6YP",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Reading the directory will take some time \n",
        "# So you might want to initially limit your code to particular pXX\n",
        "# before the code will be functioning properly\n",
        "# write pmb.txt file in the current directory\n",
        "pmb['01/0777']"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Xm-ALqrp74FB",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        ""
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uj8-kKr7luha",
        "colab_type": "text"
      },
      "source": [
        "## 1B: Write pmb dictionary in a single file"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "I-p_aIUNqrDv",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# write pmb dictionary in a single file"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "AHKMGFec1ah0",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "% less pmb.txt"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EFUPDtztcKXb",
        "colab_type": "text"
      },
      "source": [
        "## 1C: top 10 words"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "rCwyn5hdL3wa",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# You can also print out table in an ASCII if you want\n",
        "#    DE        EN          IT         NL\n",
        "# token 99 | token 99 | token 99 | token 99 | \n",
        "# token 00 | token 99 | token 99 | token 99 |\n",
        "# or use the mardown as shown below"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EKb2zOYCKbwT",
        "colab_type": "text"
      },
      "source": [
        "DE | EN | IT | NL \n",
        "--- | --- | --- | ---\n",
        "token 99 | token 99 | token 99 | token 99\n",
        "token 99 | token 99 | token 99 | token 99\n",
        "token 99 | token 99 | token 99 | token 99\n",
        "\n",
        "Pointing out major similarities and differences (max 100 words)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2iDGE-YInGMv",
        "colab_type": "text"
      },
      "source": [
        "---\n",
        "# Ex2: Stand-off tokenization\n",
        "---\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oZo_DB_P_UrE",
        "colab_type": "text"
      },
      "source": [
        "### Read pmb.txt file"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "JPI42WY9kt0i",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# read pmb.txt data in pmb dictionary\n",
        "def read_pmb_from_file(pmb_file):\n",
        "  with open(pmb_file) as FILE:\n",
        "    pmb_list = re.split('\\n### (.+) ###\\n', '\\n'+FILE.read())[1:]\n",
        "  pmb = defaultdict(dict)\n",
        "  for info, raw in zip(pmb_list[::2], pmb_list[1::2]):\n",
        "    pd, lang, corpus = info.split()\n",
        "    pmb[pd]['met'] = corpus\n",
        "    pmb[pd][lang] = raw\n",
        "  return pmb"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "GqscbBgrm6QO",
        "colab_type": "code",
        "outputId": "f7f1375f-1bb2-483f-b323-dd722d5cdcc4",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "pmb = read_pmb_from_file('pmb.txt')\n",
        "print(f\"Number of docs = {len(pmb)}\")"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Number of docs = 2171\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "i2uF0ZOvxwBC",
        "colab_type": "code",
        "outputId": "8880db71-e809-4f80-c42b-94732ef285c4",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 108
        }
      },
      "source": [
        "pmb['03/0937']"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'DE': 'Er stritt es ab.\\n',\n",
              " 'EN': 'He denied it.\\n',\n",
              " 'IT': 'La negò.\\n',\n",
              " 'NL': 'Hij ontkende het.\\n',\n",
              " 'met': 'Tatoeba'}"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 27
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "I65M0-Yj_hVb",
        "colab_type": "text"
      },
      "source": [
        "## 2A: Tokenize pmb dictionary with NLTK or SpaCy"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "4bCv4BkYsoi7",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "sample_sentence = \"John doesn't live in U.S. anymore.\\nHe  has moved to Groningen\""
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "gQlX5StGrAHQ",
        "colab_type": "code",
        "outputId": "d1acd6f2-95d0-46ad-83ce-603003d46ec9",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 53
        }
      },
      "source": [
        "# Example of running sentence and word tokenizers from NLTK\n",
        "for sen in sent_tokenize(sample_sentence):\n",
        "    print(word_tokenize(sen))"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "['John', 'does', \"n't\", 'live', 'in', 'U.S.', 'anymore', '.']\n",
            "['He', 'has', 'moved', 'to', 'Groningen']\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "uY_-EfYVrtwz",
        "colab_type": "code",
        "outputId": "ff432b26-be36-44af-d0ed-5c90c56356e3",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 53
        }
      },
      "source": [
        "# Use SpaCy sentence and word tokenizer\n",
        "# This is not the most effcient way of running it \n",
        "# but it is simple and feasible for our corpus \n",
        "nlp = spacy.load(\"en_core_web_sm\")\n",
        "for sen in nlp(sample_sentence).sents:\n",
        "    print([ tok.text for tok in nlp(sen.text) ])"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "['John', 'does', \"n't\", 'live', 'in', 'U.S.', 'anymore', '.', '\\n']\n",
            "['He', ' ', 'has', 'moved', 'to', 'Groningen']\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "N-1yopkKv3XK",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "nltk_sen_tok = pmb_en_seg(pmb, 'nltk')\n",
        "spcy_sen_tok = pmb_en_seg(pmb, spacy.load(\"en_core_web_sm\"))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "kxzBh1SR0dxi",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "nltk_sen_tok['01/0777']"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "gUDtVUwxLtwq",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "spcy_sen_tok['01/0777']"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5ApgzVUXAgp6",
        "colab_type": "text"
      },
      "source": [
        "## 2B: Convert token list into TOIS stand-off annotation"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "cl_CgRCLQiwY",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "nltk_seg = sen_tok_to_off_seg(pmb, nltk_sen_tok, 'en.nltk.seg')\n",
        "spcy_seg = sen_tok_to_off_seg(pmb, spcy_sen_tok, 'en.spcy.seg')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "scd4N5_HhJ0i",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "pmb['01/0777']['EN']"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "4BPjOsupQoVR",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "nltk_seg['01/0777']"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "uLEQJODw0Wgh",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "% less en.spcy.raw.seg"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "R_yPKhddMNUM",
        "colab_type": "text"
      },
      "source": [
        "## 2C*: Remedy for altering double quotes"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "NR7GWF2lMS1v",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def whatever_functions_you_want_to_define(x):\n",
        "    # Insert this in sen_tok_to_off_seg to fix the mentioned shortcoming\n",
        "    return None"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4RG75YGDVTqn",
        "colab_type": "text"
      },
      "source": [
        "---\n",
        "# Ex3: Part-of-speech tagging\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "K2XvQ3wiOnHF",
        "colab_type": "text"
      },
      "source": [
        "## 3A: Read token lists from stand-off segementation"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "idDjv0tOOqVI",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "nltk_sen_tok = apply_off_seg('pmb.txt', 'en.nltk.seg')\n",
        "spcy_sen_tok = apply_off_seg('pmb.txt', 'en.spcy.seg')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WBLiG72GO15A",
        "colab_type": "text"
      },
      "source": [
        "## 3B: Tag data with 5 taggers"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dNHM_UGUNNOd",
        "colab_type": "text"
      },
      "source": [
        "You need to obtain the necessary files for the POS taggers (links are in the function comments). **Don't change these functions**. Put the downloaded files in appropriate locations that the provided functions will work without changing.\n",
        "\n",
        "Also you need to make the following files executable. Put correct paths there and execute the provided command."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "UoyylqRZemZl",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "! chmod +x ../hunpos/hunpos-1.0-linux/hunpos-*; chmod +x ../senna/senna-linux64"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "F4rYKrC334Ja",
        "colab_type": "code",
        "outputId": "1ea91ffb-a249-43fb-b0d7-dfafdb2da462",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 90
        }
      },
      "source": [
        "% ls ../hunpos/hunpos-1.0-linux/hun"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "\u001b[0m\u001b[01;34mhunpos\u001b[0m/  \u001b[01;34mPUBLIC\u001b[0m/         \u001b[01;34mstanford-postagger-full-2018-10-16\u001b[0m/\n",
            "\u001b[01;34mmine\u001b[0m/    \u001b[01;34mQuincy\u001b[0m/         stanford-postagger-full-2018-10-16.zip\n",
            "\u001b[01;34mpmb\u001b[0m/     \u001b[01;34msenna\u001b[0m/\n",
            "pmb.zip  senna-v3.0.tgz\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "08SdOi52rcik",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "! tar -xzf ../hunpos/"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "g3APeTqDVTM3",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def hun_pos_tagger(sents, progress=False):\n",
        "    #https://code.google.com/archive/p/hunpos/downloads\n",
        "    hp = nltk.tag.HunposTagger(op.join(TOOL_DIR, 'hunpos/english.model'), \n",
        "                               path_to_bin=op.join(TOOL_DIR, 'hunpos/hunpos-1.0-linux/hunpos-tag'), \n",
        "                               encoding='utf-8')\n",
        "    pos_tags = []\n",
        "    for i, sen in enumerate(sents, start=1):\n",
        "        if progress and i % 100 == 0: print('.', end='')\n",
        "        pos_sen = [ pos.decode('utf-8') for (_, pos) in hp.tag(sen) ]\n",
        "        pos_tags.append(pos_sen)\n",
        "    return pos_tags\n",
        "\n",
        "def nltk_pos_tagger(sents, progress=False):\n",
        "    '''Uses averaged_perceptron_tagger'''\n",
        "    #nltk.download('averaged_perceptron_tagger')\n",
        "    pos_tags = []\n",
        "    for i, sen in enumerate(sents, start=1):\n",
        "        if progress and i % 100 == 0: print('.', end='')\n",
        "        pos_sen = [ pos for (_, pos) in nltk.pos_tag(sen) ]\n",
        "        pos_tags.append(pos_sen)\n",
        "    return pos_tags\n",
        "\n",
        "def senna_pos_tagger(sents, progress=False):\n",
        "    #https://ronan.collobert.com/senna/download.html\n",
        "    senna_tagger = nltk.tag.SennaTagger(op.join(TOOL_DIR, 'senna'))\n",
        "    pos_tags = []\n",
        "    for i, sen in enumerate(sents, start=1):\n",
        "        if progress and i % 100 == 0: print('.', end='')\n",
        "        pos_sen = [ pos for (_, pos) in senna_tagger.tag(sen) ]\n",
        "        pos_tags.append(pos_sen) \n",
        "    return pos_tags \n",
        "\n",
        "def stanford_pos_tagger(sents, progress=False):\n",
        "    #https://nlp.stanford.edu/software/tagger.html#Download\n",
        "    model = op.join(TOOL_DIR, 'stanford-postagger-full-2018-10-16/models/wsj-0-18-bidirectional-distsim.tagger')\n",
        "    jar_path = op.join(TOOL_DIR, 'stanford-postagger-full-2018-10-16/stanford-postagger.jar')\n",
        "    # ignore depriciate warning\n",
        "    with warnings.catch_warnings(record=True):\n",
        "        warnings.simplefilter(\"ignore\")\n",
        "        stanford_tagger = nltk.tag.StanfordPOSTagger(model, path_to_jar=jar_path)\n",
        "    pos_tags = []\n",
        "    for i, sen in enumerate(sents, start=1):\n",
        "        if progress and i % 100 == 0: print('.', end='')\n",
        "        pos_sen = [ pos for (_, pos) in stanford_tagger.tag(sen) ]\n",
        "        pos_tags.append(pos_sen) \n",
        "    return pos_tags\n",
        "\n",
        "def spacy_pos_tagger(sents, progress=False):\n",
        "    #https://spacy.io/usage/spacy-101#annotations-pos-deps\n",
        "    nlp = spacy.load('en_core_web_sm')\n",
        "    pos_tags = []\n",
        "    for i, sen in enumerate(sents, start=1):\n",
        "        if progress and i % 100 == 0: print('.', end='')\n",
        "        input = spacy.tokens.doc.Doc(nlp.vocab, sen)\n",
        "        pos_sen = [ t.tag_ for t in nlp.tagger(input) ]\n",
        "        pos_tags.append(pos_sen) \n",
        "    return pos_tags"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aATvnVj2S2Wj",
        "colab_type": "code",
        "outputId": "5887af17-2797-435b-d15c-b6e140cf88f7",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 334
        }
      },
      "source": [
        "sample_docs = [\"This is a sample text .\".split(), \"There are no non ascii symbols\".split()]\n",
        "taggers = (hun_pos_tagger, nltk_pos_tagger, senna_pos_tagger, stanford_pos_tagger, spacy_pos_tagger)\n",
        "for pos_tagger in taggers:\n",
        "    print(f\"{pos_tagger.__name__}:\\t{pos_tagger(sample_docs)}\")"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "error",
          "ename": "NameError",
          "evalue": "ignored",
          "traceback": [
            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
            "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
            "\u001b[0;32m<ipython-input-3-42f7d62c9038>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      2\u001b[0m \u001b[0mtaggers\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mhun_pos_tagger\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnltk_pos_tagger\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msenna_pos_tagger\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstanford_pos_tagger\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mspacy_pos_tagger\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      3\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mpos_tagger\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtaggers\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m     \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf\"{pos_tagger.__name__}:\\t{pos_tagger(sample_docs)}\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
            "\u001b[0;32m<ipython-input-2-524cf1a5d6e6>\u001b[0m in \u001b[0;36mhun_pos_tagger\u001b[0;34m(sents, progress)\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mhun_pos_tagger\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msents\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mprogress\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      2\u001b[0m     \u001b[0;31m#https://code.google.com/archive/p/hunpos/downloads\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m     hp = nltk.tag.HunposTagger(op.join(TOOL_DIR, 'hunpos/english.model'), \n\u001b[0m\u001b[1;32m      4\u001b[0m                                \u001b[0mpath_to_bin\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mop\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mTOOL_DIR\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'hunpos/hunpos-1.0-linux/hunpos-tag'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m                                encoding='utf-8')\n",
            "\u001b[0;31mNameError\u001b[0m: name 'nltk' is not defined"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hsnIHocQPM7k",
        "colab_type": "text"
      },
      "source": [
        "---\n",
        "# Ex4: Compare Annotations\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sToo5Cb_PPOw",
        "colab_type": "text"
      },
      "source": [
        "## 4A: Contrast Segmentations"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "KTHZZI8QPPiW",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "contrast_seg('pmb.txt', 'en.nltk.seg', 'en.spcy.seg', 'seg.html')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f7HZam2hPPo0",
        "colab_type": "text"
      },
      "source": [
        "## 4B: Discussing different types of segemntation differences"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "qdhvFLRDTgha",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# code that reports the total number of TOIS labels \n",
        "# and the counts and % when the two segementations differ"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "E7sBSTsSQBfp",
        "colab_type": "text"
      },
      "source": [
        "Discussion of differences goes here"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KpQHaOMVPyH6",
        "colab_type": "text"
      },
      "source": [
        "## 4C: Contrast 5 taggers"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "GOc5_ibxP0us",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# pos5_nltk shows that nltk segmentation was used\n",
        "# to obtain tokens for POS tagging\n",
        "contrast_pos(tagged5_nltk, 'pmb.txt', 'pos5_nltk.html')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2lY79ZNlT_YM",
        "colab_type": "text"
      },
      "source": [
        "## 4D Different POS tag voting scenarios "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "yNNj-H9hUAMA",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# You can print a table structure using the code\n",
        "# or make the raw numbers print and then put them nicely in the table below"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SGK7mgsxY53Q",
        "colab_type": "text"
      },
      "source": [
        "Scenario | Count | % \n",
        "--- | ---   | --- \n",
        "Total       | 999 | 10%\n",
        "Unanimous   | 999 | 10%\n",
        "4-1      | 999 | 10%\n",
        "3-2      | 999 | 10%\n",
        "3-1-1 | 999 | 10%\n",
        "\n",
        "Order the scenarios according to their names, for example, 4-1 > 3-2 > 3-1-1 > 2-2-1 > 1-1-1-1-1. In the name numbers are always sorted X-Y, where X >= Y"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kXCy7uRWP008",
        "colab_type": "text"
      },
      "source": [
        "## 4E: Discussing different types of POS tagging votes"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4gIfOIYwP9al",
        "colab_type": "text"
      },
      "source": [
        "Discussion of differences and the arguments for correct POS tags go here "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AQXKJOx7hq1B",
        "colab_type": "text"
      },
      "source": [
        "For the sample file `pos_5_nltk.html`, the following explanations are satisfactory.\n",
        "\n",
        "1. ***sample*** should be NN as \"*Nouns that are used as modifiers, whether  in isolation or in sequences, should be  tagged as nouns (NN, NNS) rather than as adjectives (JJ)*\" (see p12).  \n",
        "2. ***non*** should be FW as \"*Use your judgment as to what is a foreign  word.  For me, yoga is an NN,  while bete noire and persona non grata should be tagged bete/FW noire/FW and persona/FW non/FW grata/FW, respectively.*\" (see p3). \n",
        "3. ***ascii*** should be NP as \"*Abbreviations and initials should be tagged as if  they were spelled out*\" (see p18 and p32).\n",
        "\n",
        "The reference pages are from the [Part-of-Speech Tagging Guidelines for the Penn Treebank Project](https://repository.upenn.edu/cgi/viewcontent.cgi?article=1603&context=cis_reports)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "JRCpV4_kUDLp",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        ""
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}