p16i/something.ipynb

## something.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The NLTK corpus \"reuters\" comprises 10788 rather short news articles, which makes it a good input for our toy-example pipeline.  \n",
    "\n",
    "This document explores which steps we can perform on a single text to extract some basic information from it, e.g. in the form of a vector. The input is a text; the output could be [word count, type/token ratio, readability score].  \n",
    "These operations will ultimately be performed on the entire corpus, presumably in a for loop, to get this information for each text."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The Reuters Corpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 127,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " The corpus consists of 10788 files\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "\n",
    "from nltk.corpus import reuters\n",
    "\n",
    "files = reuters.fileids()\n",
    "print(f' The corpus consists of {len(files)} files')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although the corpus also provides these articles in tokenized form, I am taking an article in its raw format here. This article will then go through the steps of tokenization, stemming, lemmatization etc., as if it had just been scraped from somewhere."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'NOVA &lt;NVA.A.TO> NOT PLANNING DOME &lt;DMP> BID\\n  Nova, an Alberta Corp, chief executive\\n  Robert Blair expressed hope that Dome Petroleum Ltd &lt;DMP>\\n  remains under Canadian ownership, but added that his company\\n  plans no bid of its own for debt-troubled Dome.\\n      \"We\\'ve no plan to bid,\" Blair told reporters after a speech\\n  to a business group, although he stressed that Nova and 57\\n  pct-owned Husky Oil Ltd &lt;HYO> were interested in Dome\\'s\\n  extensive Western Canadian energy holdings.\\n      \"But being interested can sometimes be different from\\n  making a bid,\" Blair said.\\n      TransCanada PipeLines Ltd &lt;TRP> yesterday bid 4.30 billion\\n  dlrs for Dome, but Dome said it was discontinuing talks with\\n  TransCanada and was considering a proposal from another company\\n  and was also talking with another possible buyer, both rumored\\n  to be offshore.\\n      Asked by reporters if Dome should remain in Canadian hands,\\n  Blair replied, \"Yes. I think that we still need to be building\\n  as much Canadian position in this industry as we can and I\\n  think it would be best if Dome ends up in the hands of Canadian\\n  management.\"\\n      He said he did not know who other possible bidders were.\\n      Blair said that any move to put Dome\\'s financial house in\\n  order \"will remove one of the general problems of attitude that\\n  have hung over Western Canadian industry.\"\\n      He added, however, that the energy industry still faced \"a\\n  couple of tough, tough additional years.\"\\n      Asked about Nova\\'s 1987 prospects, Blair predicted that\\n  Nova\\'s net profit would rise this year to more than 150 mln\\n  dlrs from last year\\'s net profit of 100.2 mln dlrs due to\\n  improved product prices and continued cost-cutting.\\n  \\n\\n'"
      ]
     },
     "execution_count": 128,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw_text = reuters.raw('test/16577')\n",
    "raw_text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tokenization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 129,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['NOVA &lt;NVA.A.TO> NOT PLANNING DOME &lt;DMP> BID\\n  Nova, an Alberta Corp, chief executive\\n  Robert Blair expressed hope that Dome Petroleum Ltd &lt;DMP>\\n  remains under Canadian ownership, but added that his company\\n  plans no bid of its own for debt-troubled Dome.', '\"We\\'ve no plan to bid,\" Blair told reporters after a speech\\n  to a business group, although he stressed that Nova and 57\\n  pct-owned Husky Oil Ltd &lt;HYO> were interested in Dome\\'s\\n  extensive Western Canadian energy holdings.', '\"But being interested can sometimes be different from\\n  making a bid,\" Blair said.', 'TransCanada PipeLines Ltd &lt;TRP> yesterday bid 4.30 billion\\n  dlrs for Dome, but Dome said it was discontinuing talks with\\n  TransCanada and was considering a proposal from another company\\n  and was also talking with another possible buyer, both rumored\\n  to be offshore.', 'Asked by reporters if Dome should remain in Canadian hands,\\n  Blair replied, \"Yes.', 'I think that we still need to be building\\n  as much Canadian position in this industry as we can and I\\n  think it would be best if Dome ends up in the hands of Canadian\\n  management.\"', 'He said he did not know who other possible bidders were.', 'Blair said that any move to put Dome\\'s financial house in\\n  order \"will remove one of the general problems of attitude that\\n  have hung over Western Canadian industry.\"', 'He added, however, that the energy industry still faced \"a\\n  couple of tough, tough additional years.\"', \"Asked about Nova's 1987 prospects, Blair predicted that\\n  Nova's net profit would rise this year to more than 150 mln\\n  dlrs from last year's net profit of 100.2 mln dlrs due to\\n  improved product prices and continued cost-cutting.\"]\n"
     ]
    }
   ],
   "source": [
    "# tokenize the raw text into sentences\n",
    "from nltk import sent_tokenize\n",
    "\n",
    "sentences = sent_tokenize(raw_text)\n",
    "\n",
    "print(sentences)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Count the sentences. The sentence count is needed for the average sentence length, which is a component of most readability scores anyway."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "the text contains 10 sentences\n"
     ]
    }
   ],
   "source": [
    "n_sentences = len(sentences)\n",
    "\n",
    "print(f'the text contains {n_sentences} sentences')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 131,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['NOVA', '&', 'lt', ';', 'NVA.A.TO', '>', 'NOT', 'PLANNING', 'DOME', '&', 'lt', ';', 'DMP', '>', 'BID', 'Nova', ',', 'an', 'Alberta', 'Corp', ',', 'chief', 'executive', 'Robert', 'Blair', 'expressed', 'hope', 'that', 'Dome', 'Petroleum', 'Ltd', '&', 'lt', ';', 'DMP', '>', 'remains', 'under', 'Canadian', 'ownership', ',', 'but', 'added', 'that', 'his', 'company', 'plans', 'no', 'bid', 'of', 'its', 'own', 'for', 'debt-troubled', 'Dome', '.'], ['``', 'We', \"'ve\", 'no', 'plan', 'to', 'bid', ',', \"''\", 'Blair', 'told', 'reporters', 'after', 'a', 'speech', 'to', 'a', 'business', 'group', ',', 'although', 'he', 'stressed', 'that', 'Nova', 'and', '57', 'pct-owned', 'Husky', 'Oil', 'Ltd', '&', 'lt', ';', 'HYO', '>', 'were', 'interested', 'in', \"Dome's\", 'extensive', 'Western', 'Canadian', 'energy', 'holdings', '.'], ['``', 'But', 'being', 'interested', 'can', 'sometimes', 'be', 'different', 'from', 'making', 'a', 'bid', ',', \"''\", 'Blair', 'said', '.'], ['TransCanada', 'PipeLines', 'Ltd', '&', 'lt', ';', 'TRP', '>', 'yesterday', 'bid', '4.30', 'billion', 'dlrs', 'for', 'Dome', ',', 'but', 'Dome', 'said', 'it', 'was', 'discontinuing', 'talks', 'with', 'TransCanada', 'and', 'was', 'considering', 'a', 'proposal', 'from', 'another', 'company', 'and', 'was', 'also', 'talking', 'with', 'another', 'possible', 'buyer', ',', 'both', 'rumored', 'to', 'be', 'offshore', '.'], ['Asked', 'by', 'reporters', 'if', 'Dome', 'should', 'remain', 'in', 'Canadian', 'hands', ',', 'Blair', 'replied', ',', '``', 'Yes', '.'], ['I', 'think', 'that', 'we', 'still', 'need', 'to', 'be', 'building', 'as', 'much', 'Canadian', 'position', 'in', 'this', 'industry', 'as', 'we', 'can', 'and', 'I', 'think', 'it', 'would', 'be', 'best', 'if', 'Dome', 'ends', 'up', 'in', 'the', 'hands', 'of', 'Canadian', 'management', '.', \"''\"], ['He', 'said', 'he', 'did', 'not', 'know', 'who', 'other', 'possible', 'bidders', 'were', '.'], ['Blair', 'said', 'that', 'any', 'move', 'to', 'put', 'Dome', \"'s\", 
'financial', 'house', 'in', 'order', '``', 'will', 'remove', 'one', 'of', 'the', 'general', 'problems', 'of', 'attitude', 'that', 'have', 'hung', 'over', 'Western', 'Canadian', 'industry', '.', \"''\"], ['He', 'added', ',', 'however', ',', 'that', 'the', 'energy', 'industry', 'still', 'faced', '``', 'a', 'couple', 'of', 'tough', ',', 'tough', 'additional', 'years', '.', \"''\"], ['Asked', 'about', 'Nova', \"'s\", '1987', 'prospects', ',', 'Blair', 'predicted', 'that', 'Nova', \"'s\", 'net', 'profit', 'would', 'rise', 'this', 'year', 'to', 'more', 'than', '150', 'mln', 'dlrs', 'from', 'last', 'year', \"'s\", 'net', 'profit', 'of', '100.2', 'mln', 'dlrs', 'due', 'to', 'improved', 'product', 'prices', 'and', 'continued', 'cost-cutting', '.']] "
     ]
    }
   ],
   "source": [
    "# tokenize each sentence into a list of individual tokens\n",
    "# note that word_tokenize returns both words and punctuation; the punctuation will be removed in a later step\n",
    "# (I assume the punctuation is important for POS-tagging - I will look this up)\n",
    "from nltk import word_tokenize\n",
    "\n",
    "# raw_text and sentences were already created in the cells above, so they are reused here\n",
    "words_in_sentences = [word_tokenize(sentence) for sentence in sentences]\n",
    "\n",
    "print(words_in_sentences, end=\" \")"
   ]
  },
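  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the sentences split into tokens, the average sentence length mentioned earlier can be derived. The following is only a minimal sketch, not part of the executed pipeline: the helper name `average_sentence_length` and the decision to skip punctuation-only tokens are assumptions made for illustration, and the two sample sentences are copied from the tokenizer output above.\n",
    "\n",
    "```python\n",
    "import string\n",
    "\n",
    "\n",
    "def average_sentence_length(tokenized_sentences):\n",
    "    # Mean number of word tokens per sentence.\n",
    "    # Assumption: tokens made up only of punctuation characters are not counted.\n",
    "    punctuation = set(string.punctuation)\n",
    "    lengths = []\n",
    "    for sentence in tokenized_sentences:\n",
    "        words = [tok for tok in sentence if not all(ch in punctuation for ch in tok)]\n",
    "        lengths.append(len(words))\n",
    "    return sum(lengths) / len(lengths) if lengths else 0.0\n",
    "\n",
    "\n",
    "sample = [\n",
    "    ['He', 'said', 'he', 'did', 'not', 'know', 'who', 'other', 'possible', 'bidders', 'were', '.'],\n",
    "    ['Asked', 'by', 'reporters', 'if', 'Dome', 'should', 'remain', 'in', 'Canadian', 'hands', ',', 'Blair', 'replied', ',', '``', 'Yes', '.'],\n",
    "]\n",
    "print(average_sentence_length(sample))\n",
    "```\n",
    "\n",
    "Whether punctuation should count towards sentence length depends on the readability score that is eventually chosen, so the filter may need to be adjusted."
   ]
  },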
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cleaning / Noise Removal / Normalisation\n",
    "\n",
    "Open questions:\n",
    "\n",
    "- Should this be a separate step?\n",
    "- Is lowercasing bad for POS-tagging?\n",
    "- Is the rule basically: keep the text intact for POS-tagging, and for everything else normalise / remove what you can?"
   ]
  },
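  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One candidate cleaning step, sketched under assumptions: the raw article contains the HTML entity `&lt;`, which `word_tokenize` splits into the three tokens `&`, `lt` and `;` (see the tokenizer output above). Python's standard-library `html.unescape` can restore the `<` character before tokenizing. The helper names `unescape_entities` and `strip_punctuation_tokens` are hypothetical and not part of the pipeline; whether punctuation should be stripped before or after POS-tagging is left open here.\n",
    "\n",
    "```python\n",
    "import html\n",
    "import string\n",
    "\n",
    "\n",
    "def unescape_entities(text):\n",
    "    # Convert HTML entities such as '&lt;' back into the characters they stand for.\n",
    "    return html.unescape(text)\n",
    "\n",
    "\n",
    "def strip_punctuation_tokens(tokens):\n",
    "    # Drop tokens that consist solely of punctuation characters.\n",
    "    punctuation = set(string.punctuation)\n",
    "    return [tok for tok in tokens if not all(ch in punctuation for ch in tok)]\n",
    "\n",
    "\n",
    "headline = 'NOVA &lt;NVA.A.TO> NOT PLANNING DOME &lt;DMP> BID'\n",
    "print(unescape_entities(headline))\n",
    "\n",
    "tokens = ['Nova', ',', 'an', 'Alberta', 'Corp', ',', 'chief', 'executive']\n",
    "print(strip_punctuation_tokens(tokens))\n",
    "```\n",
    "\n",
    "Unescaping works on the raw string, so it would have to run before `sent_tokenize`; the punctuation filter works on token lists and could run at any later point."
   ]
  },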
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# POS-tagging"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 132,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[('NOVA', 'NNP'), ('&', 'CC'), ('lt', 'NN'), (';', ':'), ('NVA.A.TO', 'NNP'), ('>', 'NNP'), ('NOT', 'NNP'), ('PLANNING', 'NNP'), ('DOME', 'NNP'), ('&', 'CC'), ('lt', 'NN'), (';', ':'), ('DMP', 'NNP'), ('>', 'NNP'), ('BID', 'NNP'), ('Nova', 'NNP'), (',', ','), ('an', 'DT'), ('Alberta', 'NNP'), ('Corp', 'NNP'), (',', ','), ('chief', 'JJ'), ('executive', 'NN'), ('Robert', 'NNP'), ('Blair', 'NNP'), ('expressed', 'VBD'), ('hope', 'NN'), ('that', 'IN'), ('Dome', 'NNP'), ('Petroleum', 'NNP'), ('Ltd', 'NNP'), ('&', 'CC'), ('lt', 'NN'), (';', ':'), ('DMP', 'NNP'), ('>', 'NN'), ('remains', 'VBZ'), ('under', 'IN'), ('Canadian', 'JJ'), ('ownership', 'NN'), (',', ','), ('but', 'CC'), ('added', 'VBD'), ('that', 'IN'), ('his', 'PRP$'), ('company', 'NN'), ('plans', 'VBZ'), ('no', 'DT'), ('bid', 'NN'), ('of', 'IN'), ('its', 'PRP$'), ('own', 'JJ'), ('for', 'IN'), ('debt-troubled', 'JJ'), ('Dome', 'NNP'), ('.', '.')], [('``', '``'), ('We', 'PRP'), (\"'ve\", 'VBP'), ('no', 'DT'), ('plan', 'NN'), ('to', 'TO'), ('bid', 'VB'), (',', ','), (\"''\", \"''\"), ('Blair', 'NNP'), ('told', 'VBD'), ('reporters', 'NNS'), ('after', 'IN'), ('a', 'DT'), ('speech', 'NN'), ('to', 'TO'), ('a', 'DT'), ('business', 'NN'), ('group', 'NN'), (',', ','), ('although', 'IN'), ('he', 'PRP'), ('stressed', 'VBD'), ('that', 'IN'), ('Nova', 'NNP'), ('and', 'CC'), ('57', 'CD'), ('pct-owned', 'JJ'), ('Husky', 'NNP'), ('Oil', 'NNP'), ('Ltd', 'NNP'), ('&', 'CC'), ('lt', 'NN'), (';', ':'), ('HYO', 'NNP'), ('>', 'NNPS'), ('were', 'VBD'), ('interested', 'JJ'), ('in', 'IN'), (\"Dome's\", 'NNP'), ('extensive', 'JJ'), ('Western', 'NNP'), ('Canadian', 'NNP'), ('energy', 'NN'), ('holdings', 'NNS'), ('.', '.')], [('``', '``'), ('But', 'CC'), ('being', 'VBG'), ('interested', 'JJ'), ('can', 'MD'), ('sometimes', 'RB'), ('be', 'VB'), ('different', 'JJ'), ('from', 'IN'), ('making', 'VBG'), ('a', 'DT'), ('bid', 'NN'), (',', ','), (\"''\", \"''\"), ('Blair', 'NNP'), ('said', 'VBD'), ('.', '.')], [('TransCanada', 'NNP'), 
('PipeLines', 'NNP'), ('Ltd', 'NNP'), ('&', 'CC'), ('lt', 'NN'), (';', ':'), ('TRP', 'NNP'), ('>', 'NNP'), ('yesterday', 'NN'), ('bid', 'VBD'), ('4.30', 'CD'), ('billion', 'CD'), ('dlrs', 'NN'), ('for', 'IN'), ('Dome', 'NNP'), (',', ','), ('but', 'CC'), ('Dome', 'NNP'), ('said', 'VBD'), ('it', 'PRP'), ('was', 'VBD'), ('discontinuing', 'VBG'), ('talks', 'NNS'), ('with', 'IN'), ('TransCanada', 'NNP'), ('and', 'CC'), ('was', 'VBD'), ('considering', 'VBG'), ('a', 'DT'), ('proposal', 'NN'), ('from', 'IN'), ('another', 'DT'), ('company', 'NN'), ('and', 'CC'), ('was', 'VBD'), ('also', 'RB'), ('talking', 'VBG'), ('with', 'IN'), ('another', 'DT'), ('possible', 'JJ'), ('buyer', 'NN'), (',', ','), ('both', 'DT'), ('rumored', 'VBN'), ('to', 'TO'), ('be', 'VB'), ('offshore', 'RB'), ('.', '.')], [('Asked', 'VBN'), ('by', 'IN'), ('reporters', 'NNS'), ('if', 'IN'), ('Dome', 'NNP'), ('should', 'MD'), ('remain', 'VB'), ('in', 'IN'), ('Canadian', 'JJ'), ('hands', 'NNS'), (',', ','), ('Blair', 'NNP'), ('replied', 'VBD'), (',', ','), ('``', '``'), ('Yes', 'UH'), ('.', '.')], [('I', 'PRP'), ('think', 'VBP'), ('that', 'IN'), ('we', 'PRP'), ('still', 'RB'), ('need', 'VB'), ('to', 'TO'), ('be', 'VB'), ('building', 'VBG'), ('as', 'IN'), ('much', 'JJ'), ('Canadian', 'JJ'), ('position', 'NN'), ('in', 'IN'), ('this', 'DT'), ('industry', 'NN'), ('as', 'IN'), ('we', 'PRP'), ('can', 'MD'), ('and', 'CC'), ('I', 'PRP'), ('think', 'VBP'), ('it', 'PRP'), ('would', 'MD'), ('be', 'VB'), ('best', 'RB'), ('if', 'IN'), ('Dome', 'NNP'), ('ends', 'VBZ'), ('up', 'RP'), ('in', 'IN'), ('the', 'DT'), ('hands', 'NNS'), ('of', 'IN'), ('Canadian', 'JJ'), ('management', 'NN'), ('.', '.'), (\"''\", \"''\")], [('He', 'PRP'), ('said', 'VBD'), ('he', 'PRP'), ('did', 'VBD'), ('not', 'RB'), ('know', 'VB'), ('who', 'WP'), ('other', 'JJ'), ('possible', 'JJ'), ('bidders', 'NNS'), ('were', 'VBD'), ('.', '.')], [('Blair', 'NNP'), ('said', 'VBD'), ('that', 'IN'), ('any', 'DT'), ('move', 'NN'), ('to', 'TO'), ('put', 'VB'), 
('Dome', 'NNP'), (\"'s\", 'POS'), ('financial', 'JJ'), ('house', 'NN'), ('in', 'IN'), ('order', 'NN'), ('``', '``'), ('will', 'MD'), ('remove', 'VB'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('general', 'JJ'), ('problems', 'NNS'), ('of', 'IN'), ('attitude', 'NN'), ('that', 'WDT'), ('have', 'VBP'), ('hung', 'VBN'), ('over', 'IN'), ('Western', 'JJ'), ('Canadian', 'JJ'), ('industry', 'NN'), ('.', '.'), (\"''\", \"''\")], [('He', 'PRP'), ('added', 'VBD'), (',', ','), ('however', 'RB'), (',', ','), ('that', 'IN'), ('the', 'DT'), ('energy', 'NN'), ('industry', 'NN'), ('still', 'RB'), ('faced', 'VBN'), ('``', '``'), ('a', 'DT'), ('couple', 'NN'), ('of', 'IN'), ('tough', 'JJ'), (',', ','), ('tough', 'JJ'), ('additional', 'JJ'), ('years', 'NNS'), ('.', '.'), (\"''\", \"''\")], [('Asked', 'VBN'), ('about', 'IN'), ('Nova', 'NNP'), (\"'s\", 'POS'), ('1987', 'CD'), ('prospects', 'NNS'), (',', ','), ('Blair', 'NNP'), ('predicted', 'VBD'), ('that', 'IN'), ('Nova', 'NNP'), (\"'s\", 'POS'), ('net', 'JJ'), ('profit', 'NN'), ('would', 'MD'), ('rise', 'VB'), ('this', 'DT'), ('year', 'NN'), ('to', 'TO'), ('more', 'JJR'), ('than', 'IN'), ('150', 'CD'), ('mln', 'JJ'), ('dlrs', 'NN'), ('from', 'IN'), ('last', 'JJ'), ('year', 'NN'), (\"'s\", 'POS'), ('net', 'JJ'), ('profit', 'NN'), ('of', 'IN'), ('100.2', 'CD'), ('mln', 'NN'), ('dlrs', 'NN'), ('due', 'JJ'), ('to', 'TO'), ('improved', 'JJ'), ('product', 'NN'), ('prices', 'NNS'), ('and', 'CC'), ('continued', 'JJ'), ('cost-cutting', 'NN'), ('.', '.')]]\n"
     ]
    }
   ],
   "source": [
    "# POS (part-of-speech) tag every sentence in words_in_sentences\n",
    "\n",
    "pos_tagged_text = [nltk.pos_tag(sentence) for sentence in words_in_sentences]\n",
    "\n",
    "print(pos_tagged_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 133,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[]\n",
      "[[('NOVA', 'NNP'), ('&', 'CC'), ('lt', 'NN'), (';', ':'), ('NVA.A.TO', 'NNP'), ('>', 'NNP'), ('NOT', 'NNP'), ('PLANNING', 'NNP'), ('DOME', 'NNP'), ('&', 'CC'), ('lt', 'NN'), (';', ':'), ('DMP', 'NNP'), ('>', 'NNP'), ('BID', 'NNP'), ('Nova', 'NNP'), (',', ','), ('an', 'DT'), ('Alberta', 'NNP'), ('Corp', 'NNP'), (',', ','), ('chief', 'JJ'), ('executive', 'NN'), ('Robert', 'NNP'), ('Blair', 'NNP'), ('expressed', 'VBD'), ('hope', 'NN'), ('that', 'IN'), ('Dome', 'NNP'), ('Petroleum', 'NNP'), ('Ltd', 'NNP'), ('&', 'CC'), ('lt', 'NN'), (';', ':'), ('DMP', 'NNP'), ('>', 'NN'), ('remains', 'VBZ'), ('under', 'IN'), ('Canadian', 'JJ'), ('ownership', 'NN'), (',', ','), ('but', 'CC'), ('added', 'VBD'), ('that', 'IN'), ('his', 'PRP$'), ('company', 'NN'), ('plans', 'VBZ'), ('no', 'DT'), ('bid', 'NN'), ('of', 'IN'), ('its', 'PRP$'), ('own', 'JJ'), ('for', 'IN'), ('debt-troubled', 'JJ'), ('Dome', 'NNP'), ('.', '.')], [('``', '``'), ('We', 'PRP'), (\"'ve\", 'VBP'), ('no', 'DT'), ('plan', 'NN'), ('to', 'TO'), ('bid', 'VB'), (',', ','), (\"''\", \"''\"), ('Blair', 'NNP'), ('told', 'VBD'), ('reporters', 'NNS'), ('after', 'IN'), ('a', 'DT'), ('speech', 'NN'), ('to', 'TO'), ('a', 'DT'), ('business', 'NN'), ('group', 'NN'), (',', ','), ('although', 'IN'), ('he', 'PRP'), ('stressed', 'VBD'), ('that', 'IN'), ('Nova', 'NNP'), ('and', 'CC'), ('57', 'CD'), ('pct-owned', 'JJ'), ('Husky', 'NNP'), ('Oil', 'NNP'), ('Ltd', 'NNP'), ('&', 'CC'), ('lt', 'NN'), (';', ':'), ('HYO', 'NNP'), ('>', 'NNPS'), ('were', 'VBD'), ('interested', 'JJ'), ('in', 'IN'), (\"Dome's\", 'NNP'), ('extensive', 'JJ'), ('Western', 'NNP'), ('Canadian', 'NNP'), ('energy', 'NN'), ('holdings', 'NNS'), ('.', '.')], [('``', '``'), ('But', 'CC'), ('being', 'VBG'), ('interested', 'JJ'), ('can', 'MD'), ('sometimes', 'RB'), ('be', 'VB'), ('different', 'JJ'), ('from', 'IN'), ('making', 'VBG'), ('a', 'DT'), ('bid', 'NN'), (',', ','), (\"''\", \"''\"), ('Blair', 'NNP'), ('said', 'VBD'), ('.', '.')], [('TransCanada', 'NNP'), 
('PipeLines', 'NNP'), ('Ltd', 'NNP'), ('&', 'CC'), ('lt', 'NN'), (';', ':'), ('TRP', 'NNP'), ('>', 'NNP'), ('yesterday', 'NN'), ('bid', 'VBD'), ('4.30', 'CD'), ('billion', 'CD'), ('dlrs', 'NN'), ('for', 'IN'), ('Dome', 'NNP'), (',', ','), ('but', 'CC'), ('Dome', 'NNP'), ('said', 'VBD'), ('it', 'PRP'), ('was', 'VBD'), ('discontinuing', 'VBG'), ('talks', 'NNS'), ('with', 'IN'), ('TransCanada', 'NNP'), ('and', 'CC'), ('was', 'VBD'), ('considering', 'VBG'), ('a', 'DT'), ('proposal', 'NN'), ('from', 'IN'), ('another', 'DT'), ('company', 'NN'), ('and', 'CC'), ('was', 'VBD'), ('also', 'RB'), ('talking', 'VBG'), ('with', 'IN'), ('another', 'DT'), ('possible', 'JJ'), ('buyer', 'NN'), (',', ','), ('both', 'DT'), ('rumored', 'VBN'), ('to', 'TO'), ('be', 'VB'), ('offshore', 'RB'), ('.', '.')], [('Asked', 'VBN'), ('by', 'IN'), ('reporters', 'NNS'), ('if', 'IN'), ('Dome', 'NNP'), ('should', 'MD'), ('remain', 'VB'), ('in', 'IN'), ('Canadian', 'JJ'), ('hands', 'NNS'), (',', ','), ('Blair', 'NNP'), ('replied', 'VBD'), (',', ','), ('``', '``'), ('Yes', 'UH'), ('.', '.')], [('I', 'PRP'), ('think', 'VBP'), ('that', 'IN'), ('we', 'PRP'), ('still', 'RB'), ('need', 'VB'), ('to', 'TO'), ('be', 'VB'), ('building', 'VBG'), ('as', 'IN'), ('much', 'JJ'), ('Canadian', 'JJ'), ('position', 'NN'), ('in', 'IN'), ('this', 'DT'), ('industry', 'NN'), ('as', 'IN'), ('we', 'PRP'), ('can', 'MD'), ('and', 'CC'), ('I', 'PRP'), ('think', 'VBP'), ('it', 'PRP'), ('would', 'MD'), ('be', 'VB'), ('best', 'RB'), ('if', 'IN'), ('Dome', 'NNP'), ('ends', 'VBZ'), ('up', 'RP'), ('in', 'IN'), ('the', 'DT'), ('hands', 'NNS'), ('of', 'IN'), ('Canadian', 'JJ'), ('management', 'NN'), ('.', '.'), (\"''\", \"''\")], [('He', 'PRP'), ('said', 'VBD'), ('he', 'PRP'), ('did', 'VBD'), ('not', 'RB'), ('know', 'VB'), ('who', 'WP'), ('other', 'JJ'), ('possible', 'JJ'), ('bidders', 'NNS'), ('were', 'VBD'), ('.', '.')], [('Blair', 'NNP'), ('said', 'VBD'), ('that', 'IN'), ('any', 'DT'), ('move', 'NN'), ('to', 'TO'), ('put', 'VB'), 
('Dome', 'NNP'), (\"'s\", 'POS'), ('financial', 'JJ'), ('house', 'NN'), ('in', 'IN'), ('order', 'NN'), ('``', '``'), ('will', 'MD'), ('remove', 'VB'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('general', 'JJ'), ('problems', 'NNS'), ('of', 'IN'), ('attitude', 'NN'), ('that', 'WDT'), ('have', 'VBP'), ('hung', 'VBN'), ('over', 'IN'), ('Western', 'JJ'), ('Canadian', 'JJ'), ('industry', 'NN'), ('.', '.'), (\"''\", \"''\")], [('He', 'PRP'), ('added', 'VBD'), (',', ','), ('however', 'RB'), (',', ','), ('that', 'IN'), ('the', 'DT'), ('energy', 'NN'), ('industry', 'NN'), ('still', 'RB'), ('faced', 'VBN'), ('``', '``'), ('a', 'DT'), ('couple', 'NN'), ('of', 'IN'), ('tough', 'JJ'), (',', ','), ('tough', 'JJ'), ('additional', 'JJ'), ('years', 'NNS'), ('.', '.'), (\"''\", \"''\")], [('Asked', 'VBN'), ('about', 'IN'), ('Nova', 'NNP'), (\"'s\", 'POS'), ('1987', 'CD'), ('prospects', 'NNS'), (',', ','), ('Blair', 'NNP'), ('predicted', 'VBD'), ('that', 'IN'), ('Nova', 'NNP'), (\"'s\", 'POS'), ('net', 'JJ'), ('profit', 'NN'), ('would', 'MD'), ('rise', 'VB'), ('this', 'DT'), ('year', 'NN'), ('to', 'TO'), ('more', 'JJR'), ('than', 'IN'), ('150', 'CD'), ('mln', 'JJ'), ('dlrs', 'NN'), ('from', 'IN'), ('last', 'JJ'), ('year', 'NN'), (\"'s\", 'POS'), ('net', 'JJ'), ('profit', 'NN'), ('of', 'IN'), ('100.2', 'CD'), ('mln', 'NN'), ('dlrs', 'NN'), ('due', 'JJ'), ('to', 'TO'), ('improved', 'JJ'), ('product', 'NN'), ('prices', 'NNS'), ('and', 'CC'), ('continued', 'JJ'), ('cost-cutting', 'NN'), ('.', '.')]]\n"
     ]
    }
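   ],
   "source": [
    "# some tokens were not really analysed: the tagger returns the token itself as the tag, e.g. ('``', '``')\n",
    "# goal: remove such (word, tag) pairs\n",
    "# NOTE: pos_tagged_text is a list of sentences, so each `pair` below is a whole sentence (a list of tuples),\n",
    "# not a single (word, tag) tuple. pair[0] == pair[1] therefore compares the first two tuples of a sentence,\n",
    "# which never match here, so pairs_to_remove stays empty and nothing is removed.\n",
    "pairs_to_remove = [pair for pair in pos_tagged_text if pair[0] == pair[1]]\n",
    "\n",
    "print(pairs_to_remove)\n",
    "\n",
    "# this step might be unnecessary with the function I wrote; there should no longer be an error for these kinds of pairs.\n",
    "\n",
    "pos_tagged_text_cleaned = [pair for pair in pos_tagged_text if pair not in pairs_to_remove]\n",
    "\n",
    "print(pos_tagged_text_cleaned)"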
   ]
  },
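  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first output line above is `[]` because `pos_tagged_text` is a list of sentences, so the comprehension compares whole sentences rather than individual (word, tag) pairs. A filter that descends into each sentence might look like the sketch below. It is an illustration only: `drop_self_tagged` is a hypothetical helper, and the sample is a shortened sentence taken from the tagger output above.\n",
    "\n",
    "```python\n",
    "def drop_self_tagged(tagged_sentences):\n",
    "    # Keep only (word, tag) pairs whose tag differs from the word itself,\n",
    "    # i.e. drop pairs such as (',', ',') or ('.', '.') but keep ('Yes', 'UH').\n",
    "    return [\n",
    "        [(word, tag) for word, tag in sentence if word != tag]\n",
    "        for sentence in tagged_sentences\n",
    "    ]\n",
    "\n",
    "\n",
    "sample = [[('Asked', 'VBN'), ('by', 'IN'), ('reporters', 'NNS'), (',', ','), ('``', '``'), ('Yes', 'UH'), ('.', '.')]]\n",
    "print(drop_self_tagged(sample))\n",
    "```\n",
    "\n",
    "Note that this only catches punctuation that the tagger labels with itself; tokens such as `;` (tagged `:`) or `>` (tagged as a noun above) would need a separate rule."
   ]
  },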
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make sense of this output, we can look up the POS tags in a list, e.g.:\n",
    "\n",
    "https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/  \n",
    "\n",
    "Most importantly for our purposes, tags beginning with N denote nouns, tags beginning with V denote verbs, tags beginning with R denote adverbs (except RP, which marks a particle), and tags beginning with J denote adjectives.  \n",
    "\n",
    "NN noun, singular 'desk'<br>\n",
    "NNS noun, plural 'desks'<br>\n",
    "NNP proper noun, singular 'Harrison'<br>\n",
    "NNPS proper noun, plural 'Americans'<br>\n",
    "<br>\n",
    "VB verb, base form 'take'<br>\n",
    "VBD verb, past tense 'took'<br>\n",
    "VBG verb, gerund/present participle 'taking'<br>\n",
    "VBN verb, past participle 'taken'<br>\n",
    "VBP verb, non-3rd person singular present 'take'<br>\n",
    "VBZ verb, 3rd person singular present 'takes'<br>\n",
    "<br>\n",
    "RB adverb 'very', 'silently'<br>\n",
    "RBR adverb, comparative 'better'<br>\n",
    "RBS adverb, superlative 'best'<br>\n",
    "RP particle 'give up'<br>\n",
    "<br>\n",
    "JJ adjective 'big'<br>\n",
    "JJR adjective, comparative 'bigger'<br>\n",
    "JJS adjective, superlative 'biggest'<br>\n",
    "\n",
    "This is important because NLTK's WordNetLemmatizer needs to know the word class of a given word: e.g. \"book\" used as a verb has a different meaning than \"book\" used as a noun. However, WordNetLemmatizer expects a different set of tags ('v' for verb, 'a' for adjective etc.) than the pos_tag() function returns, so we need to convert them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lemmatizing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('NOVA', 'n'), ('&', None), ('lt', 'n'), (';', None), ('NVA.A.TO', 'n'), ('>', 'n'), ('NOT', 'n'), ('PLANNING', 'n'), ('DOME', 'n'), ('&', None), ('lt', 'n'), (';', None), ('DMP', 'n'), ('>', 'n'), ('BID', 'n'), ('Nova', 'n'), (',', None), ('an', None), ('Alberta', 'n'), ('Corp', 'n'), (',', None), ('chief', 'a'), ('executive', 'n'), ('Robert', 'n'), ('Blair', 'n'), ('expressed', 'v'), ('hope', 'n'), ('that', None), ('Dome', 'n'), ('Petroleum', 'n'), ('Ltd', 'n'), ('&', None), ('lt', 'n'), (';', None), ('DMP', 'n'), ('>', 'n'), ('remains', 'v'), ('under', None), ('Canadian', 'a'), ('ownership', 'n'), (',', None), ('but', None), ('added', 'v'), ('that', None), ('his', None), ('company', 'n'), ('plans', 'v'), ('no', None), ('bid', 'n'), ('of', None), ('its', None), ('own', 'a'), ('for', None), ('debt-troubled', 'a'), ('Dome', 'n'), ('.', None), ('``', None), ('We', None), (\"'ve\", 'v'), ('no', None), ('plan', 'n'), ('to', None), ('bid', 'v'), (',', None), (\"''\", None), ('Blair', 'n'), ('told', 'v'), ('reporters', 'n'), ('after', None), ('a', None), ('speech', 'n'), ('to', None), ('a', None), ('business', 'n'), ('group', 'n'), (',', None), ('although', None), ('he', None), ('stressed', 'v'), ('that', None), ('Nova', 'n'), ('and', None), ('57', None), ('pct-owned', 'a'), ('Husky', 'n'), ('Oil', 'n'), ('Ltd', 'n'), ('&', None), ('lt', 'n'), (';', None), ('HYO', 'n'), ('>', 'n'), ('were', 'v'), ('interested', 'a'), ('in', None), (\"Dome's\", 'n'), ('extensive', 'a'), ('Western', 'n'), ('Canadian', 'n'), ('energy', 'n'), ('holdings', 'n'), ('.', None), ('``', None), ('But', None), ('being', 'v'), ('interested', 'a'), ('can', None), ('sometimes', 'r'), ('be', 'v'), ('different', 'a'), ('from', None), ('making', 'v'), ('a', None), ('bid', 'n'), (',', None), (\"''\", None), ('Blair', 'n'), ('said', 'v'), ('.', None), ('TransCanada', 'n'), ('PipeLines', 'n'), ('Ltd', 'n'), ('&', None), ('lt', 'n'), (';', None), ('TRP', 'n'), ('>', 'n'), ('yesterday', 'n'), ('bid', 
'v'), ('4.30', None), ('billion', None), ('dlrs', 'n'), ('for', None), ('Dome', 'n'), (',', None), ('but', None), ('Dome', 'n'), ('said', 'v'), ('it', None), ('was', 'v'), ('discontinuing', 'v'), ('talks', 'n'), ('with', None), ('TransCanada', 'n'), ('and', None), ('was', 'v'), ('considering', 'v'), ('a', None), ('proposal', 'n'), ('from', None), ('another', None), ('company', 'n'), ('and', None), ('was', 'v'), ('also', 'r'), ('talking', 'v'), ('with', None), ('another', None), ('possible', 'a'), ('buyer', 'n'), (',', None), ('both', None), ('rumored', 'v'), ('to', None), ('be', 'v'), ('offshore', 'r'), ('.', None), ('Asked', 'v'), ('by', None), ('reporters', 'n'), ('if', None), ('Dome', 'n'), ('should', None), ('remain', 'v'), ('in', None), ('Canadian', 'a'), ('hands', 'n'), (',', None), ('Blair', 'n'), ('replied', 'v'), (',', None), ('``', None), ('Yes', None), ('.', None), ('I', None), ('think', 'v'), ('that', None), ('we', None), ('still', 'r'), ('need', 'v'), ('to', None), ('be', 'v'), ('building', 'v'), ('as', None), ('much', 'a'), ('Canadian', 'a'), ('position', 'n'), ('in', None), ('this', None), ('industry', 'n'), ('as', None), ('we', None), ('can', None), ('and', None), ('I', None), ('think', 'v'), ('it', None), ('would', None), ('be', 'v'), ('best', 'r'), ('if', None), ('Dome', 'n'), ('ends', 'v'), ('up', 'r'), ('in', None), ('the', None), ('hands', 'n'), ('of', None), ('Canadian', 'a'), ('management', 'n'), ('.', None), (\"''\", None), ('He', None), ('said', 'v'), ('he', None), ('did', 'v'), ('not', 'r'), ('know', 'v'), ('who', None), ('other', 'a'), ('possible', 'a'), ('bidders', 'n'), ('were', 'v'), ('.', None), ('Blair', 'n'), ('said', 'v'), ('that', None), ('any', None), ('move', 'n'), ('to', None), ('put', 'v'), ('Dome', 'n'), (\"'s\", None), ('financial', 'a'), ('house', 'n'), ('in', None), ('order', 'n'), ('``', None), ('will', None), ('remove', 'v'), ('one', None), ('of', None), ('the', None), ('general', 'a'), ('problems', 'n'), ('of', None), 
('attitude', 'n'), ('that', None), ('have', 'v'), ('hung', 'v'), ('over', None), ('Western', 'a'), ('Canadian', 'a'), ('industry', 'n'), ('.', None), (\"''\", None), ('He', None), ('added', 'v'), (',', None), ('however', 'r'), (',', None), ('that', None), ('the', None), ('energy', 'n'), ('industry', 'n'), ('still', 'r'), ('faced', 'v'), ('``', None), ('a', None), ('couple', 'n'), ('of', None), ('tough', 'a'), (',', None), ('tough', 'a'), ('additional', 'a'), ('years', 'n'), ('.', None), (\"''\", None), ('Asked', 'v'), ('about', None), ('Nova', 'n'), (\"'s\", None), ('1987', None), ('prospects', 'n'), (',', None), ('Blair', 'n'), ('predicted', 'v'), ('that', None), ('Nova', 'n'), (\"'s\", None), ('net', 'a'), ('profit', 'n'), ('would', None), ('rise', 'v'), ('this', None), ('year', 'n'), ('to', None), ('more', 'a'), ('than', None), ('150', None), ('mln', 'a'), ('dlrs', 'n'), ('from', None), ('last', 'a'), ('year', 'n'), (\"'s\", None), ('net', 'a'), ('profit', 'n'), ('of', None), ('100.2', None), ('mln', 'n'), ('dlrs', 'n'), ('due', 'a'), ('to', None), ('improved', 'a'), ('product', 'n'), ('prices', 'n'), ('and', None), ('continued', 'a'), ('cost-cutting', 'n'), ('.', None)]\n"
     ]
    }
   ],
   "source": [
    "from nltk.corpus import wordnet\n",
    "# WordNet POS tags are: NOUN = 'n', ADJ = 'a', VERB = 'v', ADV = 'r', ADJ_SAT = 's'\n",
    "# WordNetLemmatizer takes pos values like \"v\" for verb or \"a\" for adjective, but the POS tagger returns tags like \"NN\" or \"PRP\"\n",
    "\n",
    "\n",
    "def convert_pos_tags(pos_tagger_tag):\n",
    "    \n",
    "    tag_dict = {\"J\": wordnet.ADJ,\n",
    "            \"N\": wordnet.NOUN,\n",
    "            \"V\": wordnet.VERB,\n",
    "            \"R\": wordnet.ADV}\n",
    "    \n",
    "    if pos_tagger_tag[0] in tag_dict:\n",
    "        return tag_dict[pos_tagger_tag[0]]\n",
    "    else:\n",
    "        return None \n",
    "    \n",
    "wordnet_pos_tags = []\n",
    "for sentence in pos_tagged_text_cleaned:\n",
    "    for word in sentence: \n",
    "        wordnet_pos_tags.append((word[0], convert_pos_tags(word[1])))\n",
    "    \n",
    "\n",
    "print(wordnet_pos_tags)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We feed each of the above pairs into `wnl.lemmatize()`, specifying its first element as the `word` argument and its second element as the optional `pos` argument. If the second element is `None`, we call the function without `pos`, which then defaults to noun:\n",
    "\n",
    "`wnl.lemmatize(word=pair[0], pos=pair[1]) if pair[1] else wnl.lemmatize(word=pair[0])`\n",
    "\n",
    "This returns the lemma for each pair."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 135,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " the lemmatized text contains 331 words\n",
      " the lemmatized texts contains 169 unique words\n"
     ]
    }
   ],
   "source": [
    "from nltk.stem import WordNetLemmatizer\n",
    "\n",
    "wnl = WordNetLemmatizer()\n",
    "\n",
    "lemmatized_text = [wnl.lemmatize(word=pair[0], pos=pair[1]) if pair[1] else wnl.lemmatize(word=pair[0]) for pair in wordnet_pos_tags]\n",
    "\n",
    "# note: both counts include punctuation tokens, since none were removed above\n",
    "print(f' the lemmatized text contains {len(lemmatized_text)} words') # 297 earlier; this is now 331 (34 extra tokens)\n",
    "print(f' the lemmatized texts contains {len(set(lemmatized_text))} unique words') # 166 unique lemmas earlier; this is now 169 (3 extra), why?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Take a look at the lemmatized text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 136,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['NOVA', '&', 'lt', ';', 'NVA.A.TO', '>', 'NOT', 'PLANNING', 'DOME', '&', 'lt', ';', 'DMP', '>', 'BID', 'Nova', ',', 'an', 'Alberta', 'Corp', ',', 'chief', 'executive', 'Robert', 'Blair', 'express', 'hope', 'that', 'Dome', 'Petroleum', 'Ltd', '&', 'lt', ';', 'DMP', '>', 'remain', 'under', 'Canadian', 'ownership', ',', 'but', 'add', 'that', 'his', 'company', 'plan', 'no', 'bid', 'of', 'it', 'own', 'for', 'debt-troubled', 'Dome', '.', '``', 'We', \"'ve\", 'no', 'plan', 'to', 'bid', ',', \"''\", 'Blair', 'tell', 'reporter', 'after', 'a', 'speech', 'to', 'a', 'business', 'group', ',', 'although', 'he', 'stress', 'that', 'Nova', 'and', '57', 'pct-owned', 'Husky', 'Oil', 'Ltd', '&', 'lt', ';', 'HYO', '>', 'be', 'interested', 'in', \"Dome's\", 'extensive', 'Western', 'Canadian', 'energy', 'holding', '.', '``', 'But', 'be', 'interested', 'can', 'sometimes', 'be', 'different', 'from', 'make', 'a', 'bid', ',', \"''\", 'Blair', 'say', '.', 'TransCanada', 'PipeLines', 'Ltd', '&', 'lt', ';', 'TRP', '>', 'yesterday', 'bid', '4.30', 'billion', 'dlrs', 'for', 'Dome', ',', 'but', 'Dome', 'say', 'it', 'be', 'discontinue', 'talk', 'with', 'TransCanada', 'and', 'be', 'consider', 'a', 'proposal', 'from', 'another', 'company', 'and', 'be', 'also', 'talk', 'with', 'another', 'possible', 'buyer', ',', 'both', 'rumor', 'to', 'be', 'offshore', '.', 'Asked', 'by', 'reporter', 'if', 'Dome', 'should', 'remain', 'in', 'Canadian', 'hand', ',', 'Blair', 'reply', ',', '``', 'Yes', '.', 'I', 'think', 'that', 'we', 'still', 'need', 'to', 'be', 'build', 'a', 'much', 'Canadian', 'position', 'in', 'this', 'industry', 'a', 'we', 'can', 'and', 'I', 'think', 'it', 'would', 'be', 'best', 'if', 'Dome', 'end', 'up', 'in', 'the', 'hand', 'of', 'Canadian', 'management', '.', \"''\", 'He', 'say', 'he', 'do', 'not', 'know', 'who', 'other', 'possible', 'bidder', 'be', '.', 'Blair', 'say', 'that', 'any', 'move', 'to', 'put', 'Dome', \"'s\", 'financial', 'house', 'in', 'order', '``', 'will', 'remove', 'one', 
'of', 'the', 'general', 'problem', 'of', 'attitude', 'that', 'have', 'hang', 'over', 'Western', 'Canadian', 'industry', '.', \"''\", 'He', 'add', ',', 'however', ',', 'that', 'the', 'energy', 'industry', 'still', 'face', '``', 'a', 'couple', 'of', 'tough', ',', 'tough', 'additional', 'year', '.', \"''\", 'Asked', 'about', 'Nova', \"'s\", '1987', 'prospect', ',', 'Blair', 'predict', 'that', 'Nova', \"'s\", 'net', 'profit', 'would', 'rise', 'this', 'year', 'to', 'more', 'than', '150', 'mln', 'dlrs', 'from', 'last', 'year', \"'s\", 'net', 'profit', 'of', '100.2', 'mln', 'dlrs', 'due', 'to', 'improved', 'product', 'price', 'and', 'continued', 'cost-cutting', '.'] "
     ]
    }
   ],
   "source": [
    "print(lemmatized_text, end = \" \")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tokenizing 2 \n",
    "no hierarchy, sentences --> words. Just the whole text as a bag of words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 137,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['NOVA', '&', 'lt', ';', 'NVA.A.TO', '>', 'NOT', 'PLANNING', 'DOME', '&', 'lt', ';', 'DMP', '>', 'BID', 'Nova', ',', 'an', 'Alberta', 'Corp', ',', 'chief', 'executive', 'Robert', 'Blair', 'expressed', 'hope', 'that', 'Dome', 'Petroleum', 'Ltd', '&', 'lt', ';', 'DMP', '>', 'remains', 'under', 'Canadian', 'ownership', ',', 'but', 'added', 'that', 'his', 'company', 'plans', 'no', 'bid', 'of', 'its', 'own', 'for', 'debt-troubled', 'Dome', '.', '``', 'We', \"'ve\", 'no', 'plan', 'to', 'bid', ',', \"''\", 'Blair', 'told', 'reporters', 'after', 'a', 'speech', 'to', 'a', 'business', 'group', ',', 'although', 'he', 'stressed', 'that', 'Nova', 'and', '57', 'pct-owned', 'Husky', 'Oil', 'Ltd', '&', 'lt', ';', 'HYO', '>', 'were', 'interested', 'in', \"Dome's\", 'extensive', 'Western', 'Canadian', 'energy', 'holdings', '.', '``', 'But', 'being', 'interested', 'can', 'sometimes', 'be', 'different', 'from', 'making', 'a', 'bid', ',', \"''\", 'Blair', 'said', '.', 'TransCanada', 'PipeLines', 'Ltd', '&', 'lt', ';', 'TRP', '>', 'yesterday', 'bid', '4.30', 'billion', 'dlrs', 'for', 'Dome', ',', 'but', 'Dome', 'said', 'it', 'was', 'discontinuing', 'talks', 'with', 'TransCanada', 'and', 'was', 'considering', 'a', 'proposal', 'from', 'another', 'company', 'and', 'was', 'also', 'talking', 'with', 'another', 'possible', 'buyer', ',', 'both', 'rumored', 'to', 'be', 'offshore', '.', 'Asked', 'by', 'reporters', 'if', 'Dome', 'should', 'remain', 'in', 'Canadian', 'hands', ',', 'Blair', 'replied', ',', '``', 'Yes', '.', 'I', 'think', 'that', 'we', 'still', 'need', 'to', 'be', 'building', 'as', 'much', 'Canadian', 'position', 'in', 'this', 'industry', 'as', 'we', 'can', 'and', 'I', 'think', 'it', 'would', 'be', 'best', 'if', 'Dome', 'ends', 'up', 'in', 'the', 'hands', 'of', 'Canadian', 'management', '.', \"''\", 'He', 'said', 'he', 'did', 'not', 'know', 'who', 'other', 'possible', 'bidders', 'were', '.', 'Blair', 'said', 'that', 'any', 'move', 'to', 'put', 'Dome', \"'s\", 'financial', 
'house', 'in', 'order', '``', 'will', 'remove', 'one', 'of', 'the', 'general', 'problems', 'of', 'attitude', 'that', 'have', 'hung', 'over', 'Western', 'Canadian', 'industry', '.', \"''\", 'He', 'added', ',', 'however', ',', 'that', 'the', 'energy', 'industry', 'still', 'faced', '``', 'a', 'couple', 'of', 'tough', ',', 'tough', 'additional', 'years', '.', \"''\", 'Asked', 'about', 'Nova', \"'s\", '1987', 'prospects', ',', 'Blair', 'predicted', 'that', 'Nova', \"'s\", 'net', 'profit', 'would', 'rise', 'this', 'year', 'to', 'more', 'than', '150', 'mln', 'dlrs', 'from', 'last', 'year', \"'s\", 'net', 'profit', 'of', '100.2', 'mln', 'dlrs', 'due', 'to', 'improved', 'product', 'prices', 'and', 'continued', 'cost-cutting', '.']\n"
     ]
    }
   ],
   "source": [
    "words = word_tokenize(raw_text)\n",
    "print(words)\n",
    "\n",
    "#  Nova's becomes ['Nova'][''s'] which is not ideal... \n",
    "# on the model of He's = two words He + is \n",
    "# does not distinguish genetive s and is as a clitic. I suppose this can be remedied with the POS tagger – is it wortht he hassle though? "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 138,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " the mean sentence length of this text is 33.1\n",
      " the median sentence length of this text is 35.0\n"
     ]
    }
   ],
   "source": [
    "import statistics\n",
    "\n",
    "sentence_length = []\n",
    "for sentence in words_in_sentences:\n",
    "    sentence_length.append(len(sentence))\n",
    "\n",
    "\n",
    "# I suppose the mean might be problematic, since it is not robust to outliers. If due to a tokenization error a sentence ends up being very long.. (is that likely to happen?) \n",
    "mean_sentence_length = statistics.mean(sentence_length)\n",
    "median_sentence_length = statistics.median(sentence_length)\n",
    "\n",
    "print(f' the mean sentence length of this text is {mean_sentence_length}')\n",
    "print(f' the median sentence length of this text is {median_sentence_length}')\n",
    "\n",
    "# This includes punctuation marks in the word count. If we never remove punctuation, this is not a problem for comparison? But I suppose I should remove punctuation so it actually reflects sentence legnth. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 139,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " the mean sentence length of this text is 29.2\n",
      " the median sentence length of this text is 34.0\n"
     ]
    }
   ],
   "source": [
    "sentence_length_no_punct = []\n",
    "\n",
    "for sentence in words_in_sentences:\n",
    "    sentence_without_punctuation = [word for word in sentence if not word in string.punctuation]\n",
    "    sentence_length_no_punct.append(len(sentence_without_punctuation))\n",
    "\n",
    "mean_sentence_length_no_punct = statistics.mean(sentence_length_no_punct)\n",
    "median_sentence_length_no_punct = statistics.median(sentence_length_no_punct)\n",
    "\n",
    "print(f' the mean sentence length of this text is {mean_sentence_length_no_punct}')\n",
    "print(f' the median sentence length of this text is {median_sentence_length_no_punct}')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 140,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['NOVA', 'lt', 'NVA.A.TO', 'NOT', 'PLANNING', 'DOME', 'lt', 'DMP', 'BID', 'Nova', 'an', 'Alberta', 'Corp', 'chief', 'executive', 'Robert', 'Blair', 'expressed', 'hope', 'that', 'Dome', 'Petroleum', 'Ltd', 'lt', 'DMP', 'remains', 'under', 'Canadian', 'ownership', 'but', 'added', 'that', 'his', 'company', 'plans', 'no', 'bid', 'of', 'its', 'own', 'for', 'debt-troubled', 'Dome']\n",
      "['``', 'We', \"'ve\", 'no', 'plan', 'to', 'bid', \"''\", 'Blair', 'told', 'reporters', 'after', 'a', 'speech', 'to', 'a', 'business', 'group', 'although', 'he', 'stressed', 'that', 'Nova', 'and', '57', 'pct-owned', 'Husky', 'Oil', 'Ltd', 'lt', 'HYO', 'were', 'interested', 'in', \"Dome's\", 'extensive', 'Western', 'Canadian', 'energy', 'holdings']\n",
      "['``', 'But', 'being', 'interested', 'can', 'sometimes', 'be', 'different', 'from', 'making', 'a', 'bid', \"''\", 'Blair', 'said']\n",
      "['TransCanada', 'PipeLines', 'Ltd', 'lt', 'TRP', 'yesterday', 'bid', '4.30', 'billion', 'dlrs', 'for', 'Dome', 'but', 'Dome', 'said', 'it', 'was', 'discontinuing', 'talks', 'with', 'TransCanada', 'and', 'was', 'considering', 'a', 'proposal', 'from', 'another', 'company', 'and', 'was', 'also', 'talking', 'with', 'another', 'possible', 'buyer', 'both', 'rumored', 'to', 'be', 'offshore']\n",
      "['Asked', 'by', 'reporters', 'if', 'Dome', 'should', 'remain', 'in', 'Canadian', 'hands', 'Blair', 'replied', '``', 'Yes']\n",
      "['I', 'think', 'that', 'we', 'still', 'need', 'to', 'be', 'building', 'as', 'much', 'Canadian', 'position', 'in', 'this', 'industry', 'as', 'we', 'can', 'and', 'I', 'think', 'it', 'would', 'be', 'best', 'if', 'Dome', 'ends', 'up', 'in', 'the', 'hands', 'of', 'Canadian', 'management', \"''\"]\n",
      "['He', 'said', 'he', 'did', 'not', 'know', 'who', 'other', 'possible', 'bidders', 'were']\n",
      "['Blair', 'said', 'that', 'any', 'move', 'to', 'put', 'Dome', \"'s\", 'financial', 'house', 'in', 'order', '``', 'will', 'remove', 'one', 'of', 'the', 'general', 'problems', 'of', 'attitude', 'that', 'have', 'hung', 'over', 'Western', 'Canadian', 'industry', \"''\"]\n",
      "['He', 'added', 'however', 'that', 'the', 'energy', 'industry', 'still', 'faced', '``', 'a', 'couple', 'of', 'tough', 'tough', 'additional', 'years', \"''\"]\n",
      "['Asked', 'about', 'Nova', \"'s\", '1987', 'prospects', 'Blair', 'predicted', 'that', 'Nova', \"'s\", 'net', 'profit', 'would', 'rise', 'this', 'year', 'to', 'more', 'than', '150', 'mln', 'dlrs', 'from', 'last', 'year', \"'s\", 'net', 'profit', 'of', '100.2', 'mln', 'dlrs', 'due', 'to', 'improved', 'product', 'prices', 'and', 'continued', 'cost-cutting']\n"
     ]
    }
   ],
   "source": [
    "# this includes sthese ['``'] – where do they come from? These should definitely be taken care of in a data cleaning step, which precedes all these steps\n",
    "# I get rid of them later for POS-tagging, but really this should be taken care of before\n",
    "\n",
    "for sentence in words_in_sentences:\n",
    "    sentence_without_punctuation = [word for word in sentence if not word in string.punctuation]\n",
    "    print(sentence_without_punctuation)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 141,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NOVA & lt ; NVA.A.TO > NOT PLANNING DOME & lt ; DMP > BID Nova , an Alberta Corp , chief executive Robert Blair expressed hope that Dome Petroleum Ltd & lt ; DMP > remains under Canadian ownership , but added that his company plans no bid of its own for debt-troubled Dome . `` We 've no plan to bid , '' Blair told reporters after a speech to a business group , although he stressed that Nova and 57 pct-owned Husky Oil Ltd & lt ; HYO > were interested in Dome's extensive Western Canadian energy holdings . `` But being interested can sometimes be different from making a bid , '' Blair said . TransCanada PipeLines Ltd & lt ; TRP > yesterday bid 4.30 billion dlrs for Dome , but Dome said it was discontinuing talks with TransCanada and was considering a proposal from another company and was also talking with another possible buyer , both rumored to be offshore . Asked by reporters if Dome should remain in Canadian hands , Blair replied , `` Yes . I think that we still need to be building as much Canadian position in this industry as we can and I think it would be best if Dome ends up in the hands of Canadian management . '' He said he did not know who other possible bidders were . Blair said that any move to put Dome 's financial house in order `` will remove one of the general problems of attitude that have hung over Western Canadian industry . '' He added , however , that the energy industry still faced `` a couple of tough , tough additional years . '' Asked about Nova 's 1987 prospects , Blair predicted that Nova 's net profit would rise this year to more than 150 mln dlrs from last year 's net profit of 100.2 mln dlrs due to improved product prices and continued cost-cutting .\n"
     ]
    }
   ],
   "source": [
    "text_as_string = ' '.join(words)\n",
    "# is it a problem that punctuation is like this? \n",
    "print(text_as_string)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cleaning – Punctuation removal"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 143,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~\n"
     ]
    }
   ],
   "source": [
    "# the list'words' lists punctuation as well. To get the true word count, we remove the punctuation \n",
    "\n",
    "import string\n",
    "\n",
    "print(string.punctuation)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 144,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['NOVA', 'lt', 'NVA.A.TO', 'NOT', 'PLANNING', 'DOME', 'lt', 'DMP', 'BID', 'Nova', 'an', 'Alberta', 'Corp', 'chief', 'executive', 'Robert', 'Blair', 'expressed', 'hope', 'that', 'Dome', 'Petroleum', 'Ltd', 'lt', 'DMP', 'remains', 'under', 'Canadian', 'ownership', 'but', 'added', 'that', 'his', 'company', 'plans', 'no', 'bid', 'of', 'its', 'own', 'for', 'debt-troubled', 'Dome', '``', 'We', \"'ve\", 'no', 'plan', 'to', 'bid', \"''\", 'Blair', 'told', 'reporters', 'after', 'a', 'speech', 'to', 'a', 'business', 'group', 'although', 'he', 'stressed', 'that', 'Nova', 'and', '57', 'pct-owned', 'Husky', 'Oil', 'Ltd', 'lt', 'HYO', 'were', 'interested', 'in', \"Dome's\", 'extensive', 'Western', 'Canadian', 'energy', 'holdings', '``', 'But', 'being', 'interested', 'can', 'sometimes', 'be', 'different', 'from', 'making', 'a', 'bid', \"''\", 'Blair', 'said', 'TransCanada', 'PipeLines', 'Ltd', 'lt', 'TRP', 'yesterday', 'bid', '4.30', 'billion', 'dlrs', 'for', 'Dome', 'but', 'Dome', 'said', 'it', 'was', 'discontinuing', 'talks', 'with', 'TransCanada', 'and', 'was', 'considering', 'a', 'proposal', 'from', 'another', 'company', 'and', 'was', 'also', 'talking', 'with', 'another', 'possible', 'buyer', 'both', 'rumored', 'to', 'be', 'offshore', 'Asked', 'by', 'reporters', 'if', 'Dome', 'should', 'remain', 'in', 'Canadian', 'hands', 'Blair', 'replied', '``', 'Yes', 'I', 'think', 'that', 'we', 'still', 'need', 'to', 'be', 'building', 'as', 'much', 'Canadian', 'position', 'in', 'this', 'industry', 'as', 'we', 'can', 'and', 'I', 'think', 'it', 'would', 'be', 'best', 'if', 'Dome', 'ends', 'up', 'in', 'the', 'hands', 'of', 'Canadian', 'management', \"''\", 'He', 'said', 'he', 'did', 'not', 'know', 'who', 'other', 'possible', 'bidders', 'were', 'Blair', 'said', 'that', 'any', 'move', 'to', 'put', 'Dome', \"'s\", 'financial', 'house', 'in', 'order', '``', 'will', 'remove', 'one', 'of', 'the', 'general', 'problems', 'of', 'attitude', 'that', 'have', 'hung', 'over', 'Western', 'Canadian', 
'industry', \"''\", 'He', 'added', 'however', 'that', 'the', 'energy', 'industry', 'still', 'faced', '``', 'a', 'couple', 'of', 'tough', 'tough', 'additional', 'years', \"''\", 'Asked', 'about', 'Nova', \"'s\", '1987', 'prospects', 'Blair', 'predicted', 'that', 'Nova', \"'s\", 'net', 'profit', 'would', 'rise', 'this', 'year', 'to', 'more', 'than', '150', 'mln', 'dlrs', 'from', 'last', 'year', \"'s\", 'net', 'profit', 'of', '100.2', 'mln', 'dlrs', 'due', 'to', 'improved', 'product', 'prices', 'and', 'continued', 'cost-cutting']"
     ]
    }
   ],
   "source": [
    "word_list = [token for token in words if not token in string.punctuation]\n",
    "\n",
    "# works overall, but weirdly keeps ‘\"' and '``', apparently only when they come in pairs?\n",
    "# can I do a greedy pattern matching here? in order to catch them all? \n",
    "\n",
    "\n",
    "print(word_list, end = \"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Word Count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 145,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "292\n"
     ]
    }
   ],
   "source": [
    "# after punctuation removal, we can count the words items in word_list to get the word_count \n",
    "\n",
    "word_count = len(word_list)\n",
    "\n",
    "print(word_count)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Remove Stopwords"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 146,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \"it's\", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\", 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma', 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\", 'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n"
     ]
    }
   ],
   "source": [
    "# remove stopwords \n",
    "\n",
    "from nltk.corpus import stopwords\n",
    "\n",
    "# you can inspect the list of English stop words here: \n",
    "\n",
    "print(stopwords.words('english'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 147,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['NOVA', 'lt', 'NVA.A.TO', 'NOT', 'PLANNING', 'DOME', 'lt', 'DMP', 'BID', 'Nova', 'Alberta', 'Corp', 'chief', 'executive', 'Robert', 'Blair', 'expressed', 'hope', 'Dome', 'Petroleum', 'Ltd', 'lt', 'DMP', 'remains', 'Canadian', 'ownership', 'added', 'company', 'plans', 'bid', 'debt-troubled', 'Dome', '``', 'We', \"'ve\", 'plan', 'bid', \"''\", 'Blair', 'told', 'reporters', 'speech', 'business', 'group', 'although', 'stressed', 'Nova', '57', 'pct-owned', 'Husky', 'Oil', 'Ltd', 'lt', 'HYO', 'interested', \"Dome's\", 'extensive', 'Western', 'Canadian', 'energy', 'holdings', '``', 'But', 'interested', 'sometimes', 'different', 'making', 'bid', \"''\", 'Blair', 'said', 'TransCanada', 'PipeLines', 'Ltd', 'lt', 'TRP', 'yesterday', 'bid', '4.30', 'billion', 'dlrs', 'Dome', 'Dome', 'said', 'discontinuing', 'talks', 'TransCanada', 'considering', 'proposal', 'another', 'company', 'also', 'talking', 'another', 'possible', 'buyer', 'rumored', 'offshore', 'Asked', 'reporters', 'Dome', 'remain', 'Canadian', 'hands', 'Blair', 'replied', '``', 'Yes', 'I', 'think', 'still', 'need', 'building', 'much', 'Canadian', 'position', 'industry', 'I', 'think', 'would', 'best', 'Dome', 'ends', 'hands', 'Canadian', 'management', \"''\", 'He', 'said', 'know', 'possible', 'bidders', 'Blair', 'said', 'move', 'put', 'Dome', \"'s\", 'financial', 'house', 'order', '``', 'remove', 'one', 'general', 'problems', 'attitude', 'hung', 'Western', 'Canadian', 'industry', \"''\", 'He', 'added', 'however', 'energy', 'industry', 'still', 'faced', '``', 'couple', 'tough', 'tough', 'additional', 'years', \"''\", 'Asked', 'Nova', \"'s\", '1987', 'prospects', 'Blair', 'predicted', 'Nova', \"'s\", 'net', 'profit', 'would', 'rise', 'year', '150', 'mln', 'dlrs', 'last', 'year', \"'s\", 'net', 'profit', '100.2', 'mln', 'dlrs', 'due', 'improved', 'product', 'prices', 'continued', 'cost-cutting']\n"
     ]
    }
   ],
   "source": [
    "# remove stopwords from word_list\n",
    "\n",
    "filtered_words = [word for word in word_list if not word in stopwords.words('english')]\n",
    "\n",
    "print(filtered_words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 148,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "without stop words, the next contains 197 words\n"
     ]
    }
   ],
   "source": [
    "print(f'without stop words, the next contains {len(filtered_words)} words')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Stemming"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 149,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['nova', 'lt', 'nva.a.to', 'not', 'plan', 'dome', 'lt', 'dmp', 'bid', 'nova', 'alberta', 'corp', 'chief', 'execut', 'robert', 'blair', 'express', 'hope', 'dome', 'petroleum', 'ltd', 'lt', 'dmp', 'remain', 'canadian', 'ownership', 'ad', 'compani', 'plan', 'bid', 'debt-troubl', 'dome', '``', 'We', \"'ve\", 'plan', 'bid', \"''\", 'blair', 'told', 'report', 'speech', 'busi', 'group', 'although', 'stress', 'nova', '57', 'pct-own', 'huski', 'oil', 'ltd', 'lt', 'hyo', 'interest', \"dome'\", 'extens', 'western', 'canadian', 'energi', 'hold', '``', 'but', 'interest', 'sometim', 'differ', 'make', 'bid', \"''\", 'blair', 'said', 'transcanada', 'pipelin', 'ltd', 'lt', 'trp', 'yesterday', 'bid', '4.30', 'billion', 'dlr', 'dome', 'dome', 'said', 'discontinu', 'talk', 'transcanada', 'consid', 'propos', 'anoth', 'compani', 'also', 'talk', 'anoth', 'possibl', 'buyer', 'rumor', 'offshor', 'ask', 'report', 'dome', 'remain', 'canadian', 'hand', 'blair', 'repli', '``', 'ye', 'I', 'think', 'still', 'need', 'build', 'much', 'canadian', 'posit', 'industri', 'I', 'think', 'would', 'best', 'dome', 'end', 'hand', 'canadian', 'manag', \"''\", 'He', 'said', 'know', 'possibl', 'bidder', 'blair', 'said', 'move', 'put', 'dome', \"'s\", 'financi', 'hous', 'order', '``', 'remov', 'one', 'gener', 'problem', 'attitud', 'hung', 'western', 'canadian', 'industri', \"''\", 'He', 'ad', 'howev', 'energi', 'industri', 'still', 'face', '``', 'coupl', 'tough', 'tough', 'addit', 'year', \"''\", 'ask', 'nova', \"'s\", '1987', 'prospect', 'blair', 'predict', 'nova', \"'s\", 'net', 'profit', 'would', 'rise', 'year', '150', 'mln', 'dlr', 'last', 'year', \"'s\", 'net', 'profit', '100.2', 'mln', 'dlr', 'due', 'improv', 'product', 'price', 'continu', 'cost-cut']\n"
     ]
    }
   ],
   "source": [
    "from nltk.stem import PorterStemmer\n",
    "\n",
    "ps = PorterStemmer(); \n",
    "\n",
    "# this list should not contain punctuation, or numbers perhap? \n",
    "# at which step should I remove them? \n",
    "stemmed_words = [ps.stem(word) for word in filtered_words]\n",
    "\n",
    "print(stemmed_words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 150,
   "metadata": {},
   "outputs": [],
   "source": [
    "# average sentence length? \n",
    "# some readability score \n",
    "# misspellings? \n",
    "# sentiment analysis "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reading Ease score = 206.835 - (1.015 × ASL) - (84.6 × ASW)\n",
    "Here,\n",
    "ASL = average sentence length (number of words divided by number of sentences)\n",
    "ASW = average word length in syllables (number of syllables divided by number of words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Readability score\n",
    "\n",
    "pip install py-readability-metrics  \n",
    "to read more about this: https://github.com/cdimascio/py-readability-metrics\n",
    "pip install textstat\n",
    "\n",
    "- is one of them better?  \n",
    "textstat seems to be much easier (one line of code versus 4 lines of code...) AND it has more metrics (which we probably don't need, but still). So for the moment I tend towards textstat. \n",
    "\n",
    "\n",
    "py-readability-metrics allows:\n",
    "\n",
    "```\n",
    "r.flesch_kincaid()  \n",
    "r.flesch()  \n",
    "r.gunning_fog()  \n",
    "r.coleman_liau()  \n",
    "r.dale_chall()  \n",
    "r.ari()  \n",
    "r.linsear_write()  \n",
    "r.smog()  \n",
    "r.spache()  \n",
    "```\n",
    "\n",
    "\n",
    "textstat allows:  \n",
    "\n",
    "```\n",
    "textstat.flesch_reading_ease(test_data)\n",
    "textstat.smog_index(test_data)\n",
    "textstat.flesch_kincaid_grade(test_data)\n",
    "textstat.coleman_liau_index(test_data)\n",
    "textstat.automated_readability_index(test_data)\n",
    "textstat.dale_chall_readability_score(test_data)\n",
    "textstat.difficult_words(test_data)\n",
    "textstat.linsear_write_formula(test_data)\n",
    "textstat.gunning_fog(test_data)\n",
    "textstat.text_standard(test_data)\n",
    "````\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 151,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " Flesch reading ease: -83.49\n",
      " Flesch-Kincaid grade: 58.7\n"
     ]
    }
   ],
   "source": [
    "# this will be more meaningful with properly cleaned input...\n",
    "\n",
    "import textstat\n",
    "\n",
    "print(f' Flesch reading ease: {textstat.flesch_reading_ease(raw_text)}')\n",
    "print(f' Flesch-Kincaid grade: {textstat.flesch_kincaid_grade(raw_text)}')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 152,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " the Flesch-Kincaid score for this text is 13.077063829787232\n",
      " the Flesch-Kincaid grade level for this text is 13\n"
     ]
    }
   ],
   "source": [
    "# this will be more meaningful with properly cleaned input...\n",
    "\n",
    "from readability import Readability\n",
    "\n",
    "r = Readability(raw_text)\n",
    "fk = r.flesch_kincaid()\n",
    "print(f' the Flesch-Kincaid score for this text is {fk.score}')\n",
    "print(f' the Flesch-Kincaid grade level for this text is {fk.grade_level}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above is very much in need of interpretation. Is Flesch-Kincaid grade level = Dlesch-Kincaid grade? If so, why are they so different between textstat and py-readability-metrics?"
   ]
  },
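  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A possible starting point for the interpretation, assuming both libraries implement the standard formulas: the Flesch reading ease and the Flesch-Kincaid grade level are different formulas over the same two quantities, ASL and ASW as defined above. The standard grade-level formula is\n",
    "\n",
    "$$\\text{FKGL} = 0.39 \\cdot \\text{ASL} + 11.8 \\cdot \\text{ASW} - 15.59$$\n",
    "\n",
    "Under that assumption, the two textstat outputs (reading ease $-83.49$, grade $58.7$) give two equations in the two unknowns:\n",
    "\n",
    "$$0.39 \\cdot \\text{ASL} + 11.8 \\cdot \\text{ASW} = 58.7 + 15.59 = 74.29$$\n",
    "\n",
    "$$1.015 \\cdot \\text{ASL} + 84.6 \\cdot \\text{ASW} = 206.835 + 83.49 = 290.325$$\n",
    "\n",
    "Solving them gives $\\text{ASL} \\approx 136$ and $\\text{ASW} \\approx 1.8$. An average of about 136 words per sentence is far above the mean sentence length of 33.1 tokens computed earlier, which suggests (without having checked textstat's internals) that textstat splits this raw text into far fewer sentences than NLTK does. The gap between the two libraries may therefore come from sentence and syllable counting rather than from the formula itself."
   ]
  },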
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Vectorisation magic"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 153,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from time import time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 154,
   "metadata": {},
   "outputs": [],
   "source": [
    "n = 10000000\n",
    "x = np.random.rand(n)\n",
    "y = np.random.rand(n)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 155,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " the code took 4.947542905807495 seconds\n"
     ]
    }
   ],
   "source": [
    "start_time = time()\n",
    "z1 = []\n",
    "for k in range (n):\n",
    "    z1.append(x[k] + y[k])\n",
    "end_time = time()\n",
    "t2 = end_time - start_time\n",
    "print(f' the code took {t2} seconds')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 156,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " the code took 0.10487699508666992 seconds\n"
     ]
    }
   ],
   "source": [
    "start_time = time()\n",
    "z2 = x + y\n",
    "end_time = time()\n",
    "t2 = end_time - start_time\n",
    "print(f' the code took {t2} seconds')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "does vectorisation make sense (speed things up) for string data?   \n",
    "https://stackoverflow.com/questions/49112552/vectorized-string-operations-in-numpy-why-are-they-rather-slow/49134333"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This does not sound good: The dtype of any numpy array containing string values is the maximum length of any string present in the array. Once set, it will only be able to store new string having length not more than the maximum length at the time of the creation. If we try to reassign some another string value having length greater than the maximum length of the existing elements, it simply discards all the values beyond the maximum length.  \n",
    "\n",
    "https://www.geeksforgeeks.org/modify-numpy-array-to-store-an-arbitrary-length-string/#:~:text=NumPy%20provides%20two%20fundamental%20objects,string%20present%20in%20the%20array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 157,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['NOVA', '&', 'lt', ';', 'NVA.A.TO', '>', 'NOT', 'PLANNING', 'DOME',\n",
       "       '&', 'lt', ';', 'DMP', '>', 'BID', 'Nova', ',', 'an', 'Alberta',\n",
       "       'Corp', ',', 'chief', 'executive', 'Robert', 'Blair', 'expressed',\n",
       "       'hope', 'that', 'Dome', 'Petroleum', 'Ltd', '&', 'lt', ';', 'DMP',\n",
       "       '>', 'remains', 'under', 'Canadian', 'ownership', ',', 'but',\n",
       "       'added', 'that', 'his', 'company', 'plans', 'no', 'bid', 'of',\n",
       "       'its', 'own', 'for', 'debt-troubled', 'Dome', '.', '``', 'We',\n",
       "       \"'ve\", 'no', 'plan', 'to', 'bid', ',', \"''\", 'Blair', 'told',\n",
       "       'reporters', 'after', 'a', 'speech', 'to', 'a', 'business',\n",
       "       'group', ',', 'although', 'he', 'stressed', 'that', 'Nova', 'and',\n",
       "       '57', 'pct-owned', 'Husky', 'Oil', 'Ltd', '&', 'lt', ';', 'HYO',\n",
       "       '>', 'were', 'interested', 'in', \"Dome's\", 'extensive', 'Western',\n",
       "       'Canadian', 'energy', 'holdings', '.', '``', 'But', 'being',\n",
       "       'interested', 'can', 'sometimes', 'be', 'different', 'from',\n",
       "       'making', 'a', 'bid', ',', \"''\", 'Blair', 'said', '.',\n",
       "       'TransCanada', 'PipeLines', 'Ltd', '&', 'lt', ';', 'TRP', '>',\n",
       "       'yesterday', 'bid', '4.30', 'billion', 'dlrs', 'for', 'Dome', ',',\n",
       "       'but', 'Dome', 'said', 'it', 'was', 'discontinuing', 'talks',\n",
       "       'with', 'TransCanada', 'and', 'was', 'considering', 'a',\n",
       "       'proposal', 'from', 'another', 'company', 'and', 'was', 'also',\n",
       "       'talking', 'with', 'another', 'possible', 'buyer', ',', 'both',\n",
       "       'rumored', 'to', 'be', 'offshore', '.', 'Asked', 'by', 'reporters',\n",
       "       'if', 'Dome', 'should', 'remain', 'in', 'Canadian', 'hands', ',',\n",
       "       'Blair', 'replied', ',', '``', 'Yes', '.', 'I', 'think', 'that',\n",
       "       'we', 'still', 'need', 'to', 'be', 'building', 'as', 'much',\n",
       "       'Canadian', 'position', 'in', 'this', 'industry', 'as', 'we',\n",
       "       'can', 'and', 'I', 'think', 'it', 'would', 'be', 'best', 'if',\n",
       "       'Dome', 'ends', 'up', 'in', 'the', 'hands', 'of', 'Canadian',\n",
       "       'management', '.', \"''\", 'He', 'said', 'he', 'did', 'not', 'know',\n",
       "       'who', 'other', 'possible', 'bidders', 'were', '.', 'Blair',\n",
       "       'said', 'that', 'any', 'move', 'to', 'put', 'Dome', \"'s\",\n",
       "       'financial', 'house', 'in', 'order', '``', 'will', 'remove', 'one',\n",
       "       'of', 'the', 'general', 'problems', 'of', 'attitude', 'that',\n",
       "       'have', 'hung', 'over', 'Western', 'Canadian', 'industry', '.',\n",
       "       \"''\", 'He', 'added', ',', 'however', ',', 'that', 'the', 'energy',\n",
       "       'industry', 'still', 'faced', '``', 'a', 'couple', 'of', 'tough',\n",
       "       ',', 'tough', 'additional', 'years', '.', \"''\", 'Asked', 'about',\n",
       "       'Nova', \"'s\", '1987', 'prospects', ',', 'Blair', 'predicted',\n",
       "       'that', 'Nova', \"'s\", 'net', 'profit', 'would', 'rise', 'this',\n",
       "       'year', 'to', 'more', 'than', '150', 'mln', 'dlrs', 'from', 'last',\n",
       "       'year', \"'s\", 'net', 'profit', 'of', '100.2', 'mln', 'dlrs', 'due',\n",
       "       'to', 'improved', 'product', 'prices', 'and', 'continued',\n",
       "       'cost-cutting', '.'], dtype=object)"
      ]
     },
     "execution_count": 157,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "words_array = np.array(words, dtype = 'object') \n",
    "words_array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 162,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 907 µs, sys: 1.22 ms, total: 2.12 ms\n",
      "Wall time: 6.96 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array(['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',\n",
       "       \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself',\n",
       "       'yourselves', 'he', 'him', 'his', 'himself', 'she', \"she's\", 'her',\n",
       "       'hers', 'herself', 'it', \"it's\", 'its', 'itself', 'they', 'them',\n",
       "       'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom',\n",
       "       'this', 'that', \"that'll\", 'these', 'those', 'am', 'is', 'are',\n",
       "       'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',\n",
       "       'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and',\n",
       "       'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at',\n",
       "       'by', 'for', 'with', 'about', 'against', 'between', 'into',\n",
       "       'through', 'during', 'before', 'after', 'above', 'below', 'to',\n",
       "       'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',\n",
       "       'again', 'further', 'then', 'once', 'here', 'there', 'when',\n",
       "       'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\n",
       "       'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',\n",
       "       'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will',\n",
       "       'just', 'don', \"don't\", 'should', \"should've\", 'now', 'd', 'll',\n",
       "       'm', 'o', 're', 've', 'y', 'ain', 'aren', \"aren't\", 'couldn',\n",
       "       \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\",\n",
       "       'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma',\n",
       "       'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\",\n",
       "       'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\",\n",
       "       'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"],\n",
       "      dtype=object)"
      ]
     },
     "execution_count": 162,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "stopwords_array = np.array(stopwords.words('english'), dtype = 'object')\n",
    "stopwords_array\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}