{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project: Part of Speech Tagging with Hidden Markov Models \n",
"---\n",
"### Introduction\n",
"\n",
"Part of speech tagging is the process of determining the syntactic category of a word from the words in its surrounding context. It is often used to help disambiguate natural language phrases because it can be done quickly with high accuracy. Tagging can be used for many NLP tasks like determining correct pronunciation during speech synthesis (for example, _dis_-count as a noun vs dis-_count_ as a verb), for information retrieval, and for word sense disambiguation.\n",
"\n",
"In this notebook, you'll use the [Pomegranate](http://pomegranate.readthedocs.io/) library to build a hidden Markov model for part of speech tagging using a \"universal\" tagset. Hidden Markov models have been able to achieve [>96% tag accuracy with larger tagsets on realistic text corpora](http://www.coli.uni-saarland.de/~thorsten/publications/Brants-ANLP00.pdf). Hidden Markov models have also been used for speech recognition and speech generation, machine translation, gene recognition for bioinformatics, and human gesture recognition for computer vision, and more. \n",
"\n",
"![](_post-hmm.png)\n",
"\n",
"The notebook already contains some code to get you started. You only need to add some new functionality in the areas indicated to complete the project; you will not need to modify the included code beyond what is requested. Sections that begin with **'IMPLEMENTATION'** in the header indicate that you must provide code in the block that follows. Instructions will be provided for each section, and the specifics of the implementation are marked in the code block with a 'TODO' statement. Please be sure to read the instructions carefully!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-info\">\n",
"**Note:** Once you have completed all of the code implementations, you need to finalize your work by exporting the iPython Notebook as an HTML document. Before exporting the notebook to html, all of the code cells need to have been run so that reviewers can see the final implementation and output. You must then **export the notebook** by running the last cell in the notebook, or by using the menu above and navigating to **File -> Download as -> HTML (.html)** Your submissions should include both the `html` and `ipynb` files.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-info\">\n",
"**Note:** Code and Markdown cells can be executed using the `Shift + Enter` keyboard shortcut. Markdown cells can be edited by double-clicking the cell to enter edit mode.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Road Ahead\n",
"You must complete Steps 1-3 below to pass the project. The section on Step 4 includes references & resources you can use to further explore HMM taggers.\n",
"\n",
"- [Step 1](#Step-1:-Read-and-preprocess-the-dataset): Review the provided interface to load and access the text corpus\n",
"- [Step 2](#Step-2:-Build-a-Most-Frequent-Class-tagger): Build a Most Frequent Class tagger to use as a baseline\n",
"- [Step 3](#Step-3:-Build-an-HMM-tagger): Build an HMM Part of Speech tagger and compare to the MFC baseline\n",
"- [Step 4](#Step-4:-[Optional]-Improving-model-performance): (Optional) Improve the HMM tagger"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"**Note:** Make sure you have selected a **Python 3** kernel in Workspaces or the hmm-tagger conda environment if you are running the Jupyter server on your own machine.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Jupyter \"magic methods\" -- only need to be run once per kernel restart\n",
"%load_ext autoreload\n",
"%aimport helpers, tests\n",
"%autoreload 1"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from IPython.core.interactiveshell import InteractiveShell\n",
"InteractiveShell.ast_node_interactivity = \"all\""
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# import python modules -- this cell needs to be run again if you make changes to any of the files\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"from IPython.core.display import HTML\n",
"from itertools import chain\n",
"from collections import Counter, defaultdict\n",
"from helpers import show_model, Dataset\n",
"from pomegranate import State, HiddenMarkovModel, DiscreteDistribution"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Read and preprocess the dataset\n",
"---\n",
"We'll start by reading in a text corpus and splitting it into a training and testing dataset. The data set is a copy of the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) (originally from the [NLTK](https://www.nltk.org/) library) that has already been pre-processed to only include the [universal tagset](https://arxiv.org/pdf/1104.2086.pdf). You should expect to get slightly higher accuracy using this simplified tagset than the same model would achieve on a larger tagset like the full [Penn treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html), but the process you'll follow would be the same.\n",
"\n",
"The `Dataset` class provided in helpers.py will read and parse the corpus. You can generate your own datasets compatible with the reader by writing them to the following format. The dataset is stored in plaintext as a collection of words and corresponding tags. Each sentence starts with a unique identifier on the first line, followed by one tab-separated word/tag pair on each following line. Sentences are separated by a single blank line.\n",
"\n",
"Example from the Brown corpus. \n",
"```\n",
"b100-38532\n",
"Perhaps\tADV\n",
"it\tPRON\n",
"was\tVERB\n",
"right\tADJ\n",
";\t.\n",
";\t.\n",
"\n",
"b100-35577\n",
"...\n",
"```"
]
},
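{
"cell_type": "markdown",
"metadata": {},
"source": [
"The provided `Dataset` class in helpers.py handles this parsing for you. Purely as an illustration of the format above, a minimal reader sketch (a hypothetical `read_tagged_corpus` helper, not part of the project code) could look like:\n",
"\n",
"```\n",
"def read_tagged_corpus(path):\n",
"    \"\"\"Parse the id / word<TAB>tag / blank-line format shown above.\"\"\"\n",
"    sentences = {}\n",
"    key, pairs = None, []\n",
"    with open(path) as f:\n",
"        for line in f:\n",
"            line = line.rstrip(\"\\n\")\n",
"            if not line:  # a blank line ends the current sentence\n",
"                if key is not None:\n",
"                    sentences[key] = pairs\n",
"                key, pairs = None, []\n",
"            elif key is None:  # the first line of a block is the sentence id\n",
"                key = line\n",
"            else:  # every other line is one word<TAB>tag pair\n",
"                word, tag = line.split(\"\\t\")\n",
"                pairs.append((word, tag))\n",
"    if key is not None:  # in case the file does not end with a blank line\n",
"        sentences[key] = pairs\n",
"    return sentences\n",
"```"
]
},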
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 57340 sentences in the corpus.\n",
"There are 45872 sentences in the training set.\n",
"There are 11468 sentences in the testing set.\n"
]
}
],
"source": [
"data = Dataset(\"tags-universal.txt\", \"brown-universal.txt\", train_test_split=0.8)\n",
"\n",
"print(\"There are {} sentences in the corpus.\".format(len(data)))\n",
"print(\"There are {} sentences in the training set.\".format(len(data.training_set)))\n",
"print(\"There are {} sentences in the testing set.\".format(len(data.testing_set)))\n",
"\n",
"assert len(data) == len(data.training_set) + len(data.testing_set), \\\n",
" \"The number of sentences in the training set + testing set should sum to the number of sentences in the corpus\""
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"helpers.Dataset"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"frozenset({'.',\n",
" 'ADJ',\n",
" 'ADP',\n",
" 'ADV',\n",
" 'CONJ',\n",
" 'DET',\n",
" 'NOUN',\n",
" 'NUM',\n",
" 'PRON',\n",
" 'PRT',\n",
" 'VERB',\n",
" 'X'})"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(data)\n",
"data.tagset\n",
"\n",
"# Read more on dataset class "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Dataset Interface\n",
"\n",
"You can access (mostly) immutable references to the dataset through a simple interface provided through the `Dataset` class, which represents an iterable collection of sentences along with easy access to partitions of the data for training & testing. Review the reference below, then run and review the next few cells to make sure you understand the interface before moving on to the next step.\n",
"\n",
"```\n",
"Dataset-only Attributes:\n",
" training_set - reference to a Subset object containing the samples for training\n",
" testing_set - reference to a Subset object containing the samples for testing\n",
"\n",
"Dataset & Subset Attributes:\n",
" sentences - a dictionary with an entry {sentence_key: Sentence()} for each sentence in the corpus\n",
" keys - an immutable ordered (not sorted) collection of the sentence_keys for the corpus\n",
" vocab - an immutable collection of the unique words in the corpus\n",
" tagset - an immutable collection of the unique tags in the corpus\n",
" X - returns an array of words grouped by sentences ((w11, w12, w13, ...), (w21, w22, w23, ...), ...)\n",
" Y - returns an array of tags grouped by sentences ((t11, t12, t13, ...), (t21, t22, t23, ...), ...)\n",
" N - returns the number of distinct samples (individual words or tags) in the dataset\n",
"\n",
"Methods:\n",
" stream() - returns an flat iterable over all (word, tag) pairs across all sentences in the corpus\n",
" __iter__() - returns an iterable over the data as (sentence_key, Sentence()) pairs\n",
" __len__() - returns the nubmer of sentences in the dataset\n",
"```\n",
"\n",
"For example, consider a Subset, `subset`, of the sentences `{\"s0\": Sentence((\"See\", \"Spot\", \"run\"), (\"VERB\", \"NOUN\", \"VERB\")), \"s1\": Sentence((\"Spot\", \"ran\"), (\"NOUN\", \"VERB\"))}`. The subset will have these attributes:\n",
"\n",
"```\n",
"subset.keys == {\"s1\", \"s0\"} # unordered\n",
"subset.vocab == {\"See\", \"run\", \"ran\", \"Spot\"} # unordered\n",
"subset.tagset == {\"VERB\", \"NOUN\"} # unordered\n",
"subset.X == ((\"Spot\", \"ran\"), (\"See\", \"Spot\", \"run\")) # order matches .keys\n",
"subset.Y == ((\"NOUN\", \"VERB\"), (\"VERB\", \"NOUN\", \"VERB\")) # order matches .keys\n",
"subset.N == 7 # there are a total of seven observations over all sentences\n",
"len(subset) == 2 # because there are two sentences\n",
"```\n",
"\n",
"<div class=\"alert alert-block alert-info\">\n",
"**Note:** The `Dataset` class is _convenient_, but it is **not** efficient. It is not suitable for huge datasets because it stores multiple redundant copies of the same data.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"57340"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"tuple"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<method-wrapper '__iter__' of tuple object at 0x7341388>"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(data.sentences)\n",
"len(data.sentences)\n",
"type(data.keys)\n",
"data.stream\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Sentences\n",
"\n",
"`Dataset.sentences` is a dictionary of all sentences in the training corpus, each keyed to a unique sentence identifier. Each `Sentence` is itself an object with two attributes: a tuple of the words in the sentence named `words` and a tuple of the tag corresponding to each word named `tags`."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sentence: b100-38532\n",
"words:\n",
"\t('Perhaps', 'it', 'was', 'right', ';', ';')\n",
"tags:\n",
"\t('ADV', 'PRON', 'VERB', 'ADJ', '.', '.')\n"
]
},
{
"data": {
"text/plain": [
"Sentence(words=('Perhaps', 'it', 'was', 'right', ';', ';'), tags=('ADV', 'PRON', 'VERB', 'ADJ', '.', '.'))"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"key = 'b100-38532'\n",
"print(\"Sentence: {}\".format(key))\n",
"print(\"words:\\n\\t{!s}\".format(data.sentences[key].words))\n",
"print(\"tags:\\n\\t{!s}\".format(data.sentences[key].tags))\n",
"data.sentences[key]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-info\">\n",
"**Note:** The underlying iterable sequence is **unordered** over the sentences in the corpus; it is not guaranteed to return the sentences in a consistent order between calls. Use `Dataset.stream()`, `Dataset.keys`, `Dataset.X`, or `Dataset.Y` attributes if you need ordered access to the data.\n",
"</div>\n",
"\n",
"#### Counting Unique Elements\n",
"\n",
"You can access the list of unique words (the dataset vocabulary) via `Dataset.vocab` and the unique list of tags via `Dataset.tagset`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are a total of 1161192 samples of 56057 unique words in the corpus.\n",
"There are 928458 samples of 50536 unique words in the training set.\n",
"There are 232734 samples of 25112 unique words in the testing set.\n",
"There are 5521 words in the test set that are missing in the training set.\n"
]
}
],
"source": [
"print(\"There are a total of {} samples of {} unique words in the corpus.\"\n",
" .format(data.N, len(data.vocab)))\n",
"print(\"There are {} samples of {} unique words in the training set.\"\n",
" .format(data.training_set.N, len(data.training_set.vocab)))\n",
"print(\"There are {} samples of {} unique words in the testing set.\"\n",
" .format(data.testing_set.N, len(data.testing_set.vocab)))\n",
"print(\"There are {} words in the test set that are missing in the training set.\"\n",
" .format(len(data.testing_set.vocab - data.training_set.vocab)))\n",
"\n",
"assert data.N == data.training_set.N + data.testing_set.N, \\\n",
" \"The number of training + test samples should sum to the total number of samples\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Accessing word and tag Sequences\n",
"The `Dataset.X` and `Dataset.Y` attributes provide access to ordered collections of matching word and tag sequences for each sentence in the dataset."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sentence 1: ('Mr.', 'Podger', 'had', 'thanked', 'him', 'gravely', ',', 'and', 'now', 'he', 'made', 'use', 'of', 'the', 'advice', '.')\n",
"\n",
"Labels 1: ('NOUN', 'NOUN', 'VERB', 'VERB', 'PRON', 'ADV', '.', 'CONJ', 'ADV', 'PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN', '.')\n",
"\n",
"Sentence 2: ('But', 'there', 'seemed', 'to', 'be', 'some', 'difference', 'of', 'opinion', 'as', 'to', 'how', 'far', 'the', 'board', 'should', 'go', ',', 'and', 'whose', 'advice', 'it', 'should', 'follow', '.')\n",
"\n",
"Labels 2: ('CONJ', 'PRT', 'VERB', 'PRT', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'ADP', 'ADV', 'ADV', 'DET', 'NOUN', 'VERB', 'VERB', '.', 'CONJ', 'DET', 'NOUN', 'PRON', 'VERB', 'VERB', '.')\n",
"\n"
]
},
{
"data": {
"text/plain": [
"('Mr.',\n",
" 'Podger',\n",
" 'had',\n",
" 'thanked',\n",
" 'him',\n",
" 'gravely',\n",
" ',',\n",
" 'and',\n",
" 'now',\n",
" 'he',\n",
" 'made',\n",
" 'use',\n",
" 'of',\n",
" 'the',\n",
" 'advice',\n",
" '.')"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"('NOUN',\n",
" 'NOUN',\n",
" 'VERB',\n",
" 'VERB',\n",
" 'PRON',\n",
" 'ADV',\n",
" '.',\n",
" 'CONJ',\n",
" 'ADV',\n",
" 'PRON',\n",
" 'VERB',\n",
" 'NOUN',\n",
" 'ADP',\n",
" 'DET',\n",
" 'NOUN',\n",
" '.')"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# accessing words with Dataset.X and tags with Dataset.Y \n",
"for i in range(2): \n",
" print(\"Sentence {}:\".format(i + 1), data.X[i])\n",
" print()\n",
" print(\"Labels {}:\".format(i + 1), data.Y[i])\n",
" print()\n",
" \n",
"data.X[0]\n",
"data.Y[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Accessing (word, tag) Samples\n",
"The `Dataset.stream()` method returns an iterator that chains together every pair of (word, tag) entries across all sentences in the entire corpus."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Stream (word, tag) pairs:\n",
"\n",
"\t ('Mr.', 'NOUN')\n",
"\t ('Podger', 'NOUN')\n",
"\t ('had', 'VERB')\n",
"\t ('thanked', 'VERB')\n",
"\t ('him', 'PRON')\n",
"\t ('gravely', 'ADV')\n",
"\t (',', '.')\n"
]
}
],
"source": [
"# use Dataset.stream() (word, tag) samples for the entire corpus\n",
"print(\"\\nStream (word, tag) pairs:\\n\")\n",
"for i, pair in enumerate(data.stream()):\n",
" print(\"\\t\", pair)\n",
" if i > 5: break"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"928458"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"928458"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Extracting words & tags from Training \n",
"# Complete ordered lists of words & tags . Need to check later if it should contain the training only or whole dataset. \n",
"\n",
"words = []\n",
"tags = []\n",
"\n",
"for word, tag in data.training_set.stream(): \n",
" words.append(word)\n",
" tags.append(tag)\n",
" \n",
"len(words)\n",
"len(tags) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Scratch Block \n",
"\n",
"for tags in data.training_set.tagset:\n",
" pair_count[tags] = {}\n",
" \n",
"# Generalize the function so that it works for Tag - Word count \n",
"# and also works for word - tag count \n",
"# so it needs to take 2 streams, it should be up to us that we pass in. \n",
"\n",
"# Lets do the first case first, i.e. make a generalized version of tag - word count function \n",
"# what could be the sequences for this \n",
"# for this, the first sequence will be a tag list \n",
"# the 2nd sequence could be a tag, word list \n",
"\n",
"# for the first sequence, we can choose the tagset frozen set,if we can iterate it than it seems we can use it as a sequence \n",
"# for the 2nd sequence, we need to a tag, word list \n",
"\n",
"\n",
"\n",
"type(data.training_set.tagset)\n",
"\n",
"for i in data.training_set.tagset:\n",
" print(i)\n",
" \n",
"\n",
"\n",
" # creating the code to be pasted in the function pair count below. \n",
"for word, tag in data.training_set.stream():\n",
" if tag in pair_count.keys(): # checking if the tag exists in the keys \n",
" if word in pair_count[tag]:\n",
" pair_count[tag][word]+=1\n",
" else:\n",
" pair_count[tag][word] = 1\n",
" else:\n",
" pair_count[tag] = {}\n",
" pair_count[tag][word] = 1\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"For both our baseline tagger and the HMM model we'll build, we need to estimate the frequency of tags & words from the frequency counts of observations in the training corpus. In the next several cells you will complete functions to compute the counts of several sets of counts. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Build a Most Frequent Class tagger\n",
"---\n",
"\n",
"Perhaps the simplest tagger (and a good baseline for tagger performance) is to simply choose the tag most frequently assigned to each word. This \"most frequent class\" tagger inspects each observed word in the sequence and assigns it the label that was most often assigned to that word in the corpus."
]
},
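{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, if \"time\" were tagged as a NOUN 1244 times in training (the number used in the docstring below) and as a VERB far less often, the MFC tagger would always predict NOUN for \"time\". A minimal sketch of the idea on a toy corpus (made-up observations, not from Brown):\n",
"\n",
"```\n",
"from collections import Counter\n",
"\n",
"# toy (word, tag) observations -- illustrative only\n",
"observations = [(\"time\", \"NOUN\"), (\"time\", \"NOUN\"), (\"time\", \"VERB\"), (\"flies\", \"VERB\")]\n",
"\n",
"counts = {}\n",
"for word, tag in observations:\n",
"    counts.setdefault(word, Counter())[tag] += 1\n",
"\n",
"# pick the most frequently observed tag for each word\n",
"mfc = {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}\n",
"print(mfc)  # {'time': 'NOUN', 'flies': 'VERB'}\n",
"```"
]
},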
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### IMPLEMENTATION: Pair Counts\n",
"\n",
"Complete the function below that computes the joint frequency counts for two input sequences."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div class=\"alert alert-block alert-success\">Your emission counts look good!</div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# sequences_A = tags \n",
"# sequences_B = words, training data\n",
"\n",
"def pair_counts(sequences_A, sequences_B):\n",
" \"\"\"Return a dictionary keyed to each unique value in the first sequence list\n",
" that counts the number of occurrences of the corresponding value from the\n",
" second sequences list.\n",
" \n",
" For example, if sequences_A is tags and sequences_B is the corresponding\n",
" words, then if 1244 sequences contain the word \"time\" tagged as a NOUN, then\n",
" you should return a dictionary such that pair_counts[NOUN][time] == 1244\n",
" \"\"\"\n",
" # TODO: Finish this function!\n",
" \n",
" pair_count_dict = {}\n",
" \n",
" for i, j in zip(sequences_A, sequences_B):\n",
" if i in pair_count_dict.keys(): # if the tag exists in the dictionary \n",
" if j in pair_count_dict[i].keys(): # if the word exist in the current key values \n",
" pair_count_dict[i][j] += 1\n",
" else: # if the word does not exist in the noun, than we need to add it \n",
" pair_count_dict[i][j] = 1\n",
" else: \n",
" pair_count_dict[i] = {}\n",
" pair_count_dict[i][j] = 1\n",
" return pair_count_dict\n",
"\n",
"\n",
"# Calculate C(t_i, w_i)\n",
"emission_counts = pair_counts(tags, words )\n",
"\n",
"assert len(emission_counts) == 12, \\\n",
" \"Uh oh. There should be 12 tags in your dictionary.\"\n",
"assert max(emission_counts[\"NOUN\"], key=emission_counts[\"NOUN\"].get) == 'time', \\\n",
" \"Hmmm...'time' is expected to be the most common NOUN.\"\n",
"HTML('<div class=\"alert alert-block alert-success\">Your emission counts look good!</div>')"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"word_count : {'Whenever': {'ADV': 1}, 'artists': {'NOUN': 1}, ',': {'.': 2}, 'indeed': {'ADV': 1}, 'turned': {'VERB': 1}, 'to': {'ADP': 1}, 'actual': {'ADJ': 1}, 'representations': {'NOUN': 1}, 'or': {'CONJ': 1}}\n",
"max_key is .\n"
]
},
{
"data": {
"text/plain": [
"{'Whenever': 'ADV',\n",
" 'artists': 'NOUN',\n",
" ',': '.',\n",
" 'indeed': 'ADV',\n",
" 'turned': 'VERB',\n",
" 'to': 'ADP',\n",
" 'actual': 'ADJ',\n",
" 'representations': 'NOUN',\n",
" 'or': 'CONJ'}"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"# Scratch block for developing mfc table dictionary \n",
"# using the above function we can also get the word count \n",
"\n",
"word_count = pair_counts(words[:10],tags[:10] )\n",
"\n",
"# using this, we need to find the most frequent class label, this means \n",
"# that we have to find the pos tag with the maximum value \n",
"\n",
"type(word_count)\n",
"print('word_count : ',word_count ) \n",
"\n",
"# so we need to iterate the keys, which are unique words \n",
"# and save in form of a table, most probably another dict \n",
"# with the word followed by the value\n",
"# for each dict key, we need to find the pos tag with the maximum value\n",
"\n",
"max_key = max(word_count[','], key=word_count[','].get) \n",
"print('max_key is ',max_key)\n",
"\n",
"# so this way, we will iterate the words, for each word we will save the key \n",
"\n",
"mfc_table_dic = {}\n",
"for i in word_count:\n",
" mfc_table_dic[i] = max(word_count[i], key=word_count[i].get) \n",
"mfc_table_dic\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Need to learn how this is working, this is a cooler implementation of what I was trying to do \n",
"# I found this on medium. However, my code is simpler for me to understand. \n",
"#mfc_table = dict((word, max(tags.keys(), key=lambda key: tags[key])) for word, tags in word_count.items())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### IMPLEMENTATION: Most Frequent Class Tagger\n",
"\n",
"Use the `pair_counts()` function and the training dataset to find the most frequent class label for each word in the training data, and populate the `mfc_table` below. The table keys should be words, and the values should be the appropriate tag string.\n",
"\n",
"The `MFCTagger` class is provided to mock the interface of Pomegranite HMM models so that they can be used interchangeably."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div class=\"alert alert-block alert-success\">Your MFC tagger has all the correct words!</div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a lookup table mfc_table where mfc_table[word] contains the tag label most frequently assigned to that word\n",
"from collections import namedtuple\n",
"\n",
"FakeState = namedtuple(\"FakeState\", \"name\")\n",
"\n",
"class MFCTagger:\n",
" # NOTE: You should not need to modify this class or any of its methods\n",
" missing = FakeState(name=\"<MISSING>\")\n",
" \n",
" def __init__(self, table):\n",
" self.table = defaultdict(lambda: MFCTagger.missing)\n",
" self.table.update({word: FakeState(name=tag) for word, tag in table.items()})\n",
" \n",
" def viterbi(self, seq):\n",
" \"\"\"This method simplifies predictions by matching the Pomegranate viterbi() interface\"\"\"\n",
" return 0., list(enumerate([\"<start>\"] + [self.table[w] for w in seq] + [\"<end>\"]))\n",
"\n",
"\n",
"# TODO: calculate the frequency of each tag being assigned to each word (hint: similar, but not\n",
"# the same as the emission probabilities) and use it to fill the mfc_table\n",
"\n",
"word_counts = pair_counts(words, tags)\n",
"\n",
"mfc_table = {}\n",
"\n",
"for i in word_counts:\n",
" mfc_table[i] = max(word_counts[i], key=word_counts[i].get) \n",
"\n",
"# DO NOT MODIFY BELOW THIS LINE\n",
"mfc_model = MFCTagger(mfc_table) # Create a Most Frequent Class tagger instance\n",
"\n",
"assert len(mfc_table) == len(data.training_set.vocab), \"\"\n",
"assert all(k in data.training_set.vocab for k in mfc_table.keys()), \"\"\n",
"assert sum(int(k not in mfc_table) for k in data.testing_set.vocab) == 5521, \"\"\n",
"HTML('<div class=\"alert alert-block alert-success\">Your MFC tagger has all the correct words!</div>')"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Whenever', 'artists']"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"{'Whenever': 'ADV', 'artists': 'NOUN'}"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"{'Whenever': {'ADV': 12}, 'artists': {'NOUN': 34}}"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Scratch block \n",
"list(mfc_table)[:2]\n",
"# Just anaalyzing how the tables look like \n",
"mfc_first2pairs = {k: mfc_table[k] for k in list(mfc_table)[:2]}\n",
"mfc_first2pairs\n",
"\n",
"\n",
"word_count_first2pairs = {k: word_counts[k] for k in list(word_counts)[:2]}\n",
"word_count_first2pairs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Making Predictions with a Model\n",
"The helper functions provided below interface with Pomegranate network models & the mocked MFCTagger to take advantage of the [missing value](http://pomegranate.readthedocs.io/en/latest/nan.html) functionality in Pomegranate through a simple sequence decoding function. Run these functions, then run the next cell to see some of the predictions made by the MFC tagger."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"def replace_unknown(sequence):\n",
" \"\"\"Return a copy of the input sequence where each unknown word is replaced\n",
" by the literal string value 'nan'. Pomegranate will ignore these values\n",
" during computation.\n",
" \"\"\"\n",
" return [w if w in data.training_set.vocab else 'nan' for w in sequence]\n",
"\n",
"def simplify_decoding(X, model):\n",
" \"\"\"X should be a 1-D sequence of observations for the model to predict\"\"\"\n",
" _, state_path = model.viterbi(replace_unknown(X))\n",
" return [state[1].name for state in state_path[1:-1]] # do not show the start/end state predictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Explanation in words of the Algorithm "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1- First function that we created was Pair Count function, which takes 2 sequences. It creates a dictioary, whose keys are\n",
"unique values in the first sequence & than the values for these keys are count of occurances of these keys in 2nd sequence. We can use this function for creating 2 types of tables . a) a Tag - word count , with keys of unique tags & their occurances in the 2nd sequence.\n",
"b) we can also use it in the reverse order, where word is a the key & all the assigned tags with their counts as its values. For example : {'Whenever': {'ADV': 12}, 'artists': {'NOUN': 34}}\n",
"\n",
"2- Using this pair count function, we can develop a table, whose keys are unique words & each key has a value which is the most \n",
"frequent tag assigned to this word. We do this for our complete training dataset. \n",
"{'Whenever': 'ADV', 'artists': 'NOUN'}\n",
"\n",
"3- Than we need to predict the tags of the test sequences using previously assigned tags. We loop over the sentences, in form of sequences of words, than we check each word in the table & return the tag assigned to that word. This way for a sequence of \n",
"sentences, we return a sequence of corresponding tags. \n",
"\n",
"4- For this prediction, we have used some extra functions as well. \n",
"\n",
"5- Simplify decoding function takes 2 arguments, the first one is the sequence of words, the 2nd is a model. The model here is \n",
"an instance of class mfctagger. Will explain this in coming points. \n",
"\n",
"6- The function replace_unknown is called inside the simplify decoding & it just scnas the words in the sequence & remove those\n",
"words which are not in training vocabulary. \n",
"\n",
"7 - The simplify_decoding function inside calls a method to the model argument, the method name is veterbi, what this method does is that it returns a tuple, with a 0 as first element & a list as the 2nd element. This 2nd element is tthe state path or the list of the tags which were assigned to the words, we call them state because thats what they are in Hidden markov models. \n",
"\n"
]
},
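{
"cell_type": "markdown",
"metadata": {},
"source": [
"Putting the pieces together, the whole MFC prediction path for one sentence is just a chain of lookups. A recap using the objects already defined above (only the call sequence, no new logic):\n",
"\n",
"```\n",
"sentence = data.sentences['b100-38532'].words   # ('Perhaps', 'it', 'was', ...)\n",
"\n",
"# 1. words outside the training vocabulary are mapped to the string 'nan'\n",
"cleaned = replace_unknown(sentence)\n",
"\n",
"# 2. MFCTagger.viterbi() wraps a table lookup in the Pomegranate return format\n",
"_, state_path = mfc_model.viterbi(cleaned)\n",
"\n",
"# 3. simplify_decoding() drops the <start>/<end> markers and keeps the tag names\n",
"predicted = simplify_decoding(sentence, mfc_model)\n",
"```"
]
},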
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tuple"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"('and',\n",
" 'August',\n",
" '15',\n",
" ',',\n",
" 'November',\n",
" '15',\n",
" ',',\n",
" 'February',\n",
" '17',\n",
" ',',\n",
" 'and',\n",
" 'May',\n",
" '15',\n",
" ',',\n",
" '(',\n",
" 'Cranston',\n",
" ')',\n",
" '.')"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of words : 18\n"
]
},
{
"data": {
"text/plain": [
"__main__.MFCTagger"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"simplify_decoding_result: ['CONJ', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'CONJ', 'NOUN', 'NUM', '.', '.', 'NOUN', '.', '.']\n",
"Number of tags produced: 18\n"
]
},
{
"data": {
"text/plain": [
"tuple"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"(0.0,\n",
" [(0, '<start>'),\n",
" (1, FakeState(name='CONJ')),\n",
" (2, FakeState(name='NOUN')),\n",
" (3, FakeState(name='NUM')),\n",
" (4, FakeState(name='.')),\n",
" (5, FakeState(name='NOUN')),\n",
" (6, FakeState(name='NUM')),\n",
" (7, FakeState(name='.')),\n",
" (8, FakeState(name='NOUN')),\n",
" (9, FakeState(name='NUM')),\n",
" (10, FakeState(name='.')),\n",
" (11, FakeState(name='CONJ')),\n",
" (12, FakeState(name='NOUN')),\n",
" (13, FakeState(name='NUM')),\n",
" (14, FakeState(name='.')),\n",
" (15, FakeState(name='.')),\n",
" (16, FakeState(name='NOUN')),\n",
" (17, FakeState(name='.')),\n",
" (18, FakeState(name='.')),\n",
" (19, '<end>')])"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Scratch block \n",
"\n",
"type(data.sentences['b100-28144'].words)\n",
"data.sentences['b100-28144'].words\n",
"print('Number of words : ',len(data.sentences['b100-28144'].words))\n",
"type(mfc_model)\n",
"\n",
"simplify_decoding_result = simplify_decoding(data.sentences['b100-28144'].words, mfc_model)\n",
"print('simplify_decoding_result: ',simplify_decoding_result)\n",
"\n",
"print('Number of tags produced: ', len(simplify_decoding_result))\n",
"\n",
"# Lets analyze model.vertibi result \n",
"\n",
"mfc_modelviterbi_result = mfc_model.viterbi(replace_unknown(data.sentences['b100-28144'].words))\n",
"# Question : What does model.verbi is doing ?\n",
"# Answer : here the model is mfc_model, it takes the sequence of words\n",
"# It returns a tuple. I domnt know what the 0.0 means, after this it contains an array of stages with the name of pos tag attached. \n",
"# I think it is as stated just to be similar to pomegranate interface, otherwise we are just looping through sentences, \n",
"# sending the words as a seq to simplify decoding, which in turns remove the words which are not in training via repalce_unknown.\n",
"# after that we call model.vertibi on it. \n",
"# mfc model is an instance of mfctagger class which is intantiated using the mfc table we created. & than we call \n",
"# vertibi method on this instance of mfc tagger class. \n",
"type(mfc_modelviterbi_result)\n",
"mfc_modelviterbi_result\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example Decoding Sequences with MFC Tagger"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sentence Key: b100-28144\n",
"\n",
"Predicted labels:\n",
"-----------------\n",
"['CONJ', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'CONJ', 'NOUN', 'NUM', '.', '.', 'NOUN', '.', '.']\n",
"\n",
"Actual labels:\n",
"--------------\n",
"('CONJ', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'CONJ', 'NOUN', 'NUM', '.', '.', 'NOUN', '.', '.')\n",
"\n",
"\n",
"Sentence Key: b100-23146\n",
"\n",
"Predicted labels:\n",
"-----------------\n",
"['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'VERB', '.', 'ADP', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'DET', 'NOUN', '.']\n",
"\n",
"Actual labels:\n",
"--------------\n",
"('PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'VERB', '.', 'ADP', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'DET', 'NOUN', '.')\n",
"\n",
"\n",
"Sentence Key: b100-35462\n",
"\n",
"Predicted labels:\n",
"-----------------\n",
"['DET', 'ADJ', 'NOUN', 'VERB', 'VERB', 'VERB', 'ADP', 'DET', 'ADJ', 'ADJ', 'NOUN', 'ADP', 'DET', 'ADJ', 'NOUN', '.', 'ADP', 'ADJ', 'NOUN', '.', 'CONJ', 'ADP', 'DET', '<MISSING>', 'ADP', 'ADJ', 'ADJ', '.', 'ADJ', '.', 'CONJ', 'ADJ', 'NOUN', 'ADP', 'ADV', 'NOUN', '.']\n",
"\n",
"Actual labels:\n",
"--------------\n",
"('DET', 'ADJ', 'NOUN', 'VERB', 'VERB', 'VERB', 'ADP', 'DET', 'ADJ', 'ADJ', 'NOUN', 'ADP', 'DET', 'ADJ', 'NOUN', '.', 'ADP', 'ADJ', 'NOUN', '.', 'CONJ', 'ADP', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', '.', 'ADJ', '.', 'CONJ', 'ADJ', 'NOUN', 'ADP', 'ADJ', 'NOUN', '.')\n",
"\n",
"\n"
]
}
],
"source": [
"for key in data.testing_set.keys[:3]:\n",
" print(\"Sentence Key: {}\\n\".format(key))\n",
" print(\"Predicted labels:\\n-----------------\")\n",
" print(simplify_decoding(data.sentences[key].words, mfc_model))\n",
" print()\n",
" print(\"Actual labels:\\n--------------\")\n",
" print(data.sentences[key].tags)\n",
" print(\"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluating Model Accuracy\n",
"\n",
"The function below will evaluate the accuracy of the MFC tagger on the collection of all sentences from a text corpus. "
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"def accuracy(X, Y, model):\n",
" \"\"\"Calculate the prediction accuracy by using the model to decode each sequence\n",
" in the input X and comparing the prediction with the true labels in Y.\n",
" \n",
" The X should be an array whose first dimension is the number of sentences to test,\n",
" and each element of the array should be an iterable of the words in the sequence.\n",
" The arrays X and Y should have the exact same shape.\n",
" \n",
" X = [(\"See\", \"Spot\", \"run\"), (\"Run\", \"Spot\", \"run\", \"fast\"), ...]\n",
" Y = [(), (), ...]\n",
" \"\"\"\n",
" correct = total_predictions = 0\n",
" for observations, actual_tags in zip(X, Y):\n",
" \n",
" # The model.viterbi call in simplify_decoding will return None if the HMM\n",
" # raises an error (for example, if a test sentence contains a word that\n",
" # is out of vocabulary for the training set). Any exception counts the\n",
" # full sentence as an error (which makes this a conservative estimate).\n",
" try:\n",
" most_likely_tags = simplify_decoding(observations, model)\n",
" correct += sum(p == t for p, t in zip(most_likely_tags, actual_tags))\n",
" except:\n",
" pass\n",
" total_predictions += len(observations)\n",
" return correct / total_predictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Evaluate the accuracy of the MFC tagger\n",
"Run the next cell to evaluate the accuracy of the tagger on the training and test corpus."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"training accuracy mfc_model: 95.72%\n",
"testing accuracy mfc_model: 93.01%\n"
]
},
{
"data": {
"text/html": [
"<div class=\"alert alert-block alert-success\">Your MFC tagger accuracy looks correct!</div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mfc_training_acc = accuracy(data.training_set.X, data.training_set.Y, mfc_model)\n",
"print(\"training accuracy mfc_model: {:.2f}%\".format(100 * mfc_training_acc))\n",
"\n",
"mfc_testing_acc = accuracy(data.testing_set.X, data.testing_set.Y, mfc_model)\n",
"print(\"testing accuracy mfc_model: {:.2f}%\".format(100 * mfc_testing_acc))\n",
"\n",
"assert mfc_training_acc >= 0.955, \"Uh oh. Your MFC accuracy on the training set doesn't look right.\"\n",
"assert mfc_testing_acc >= 0.925, \"Uh oh. Your MFC accuracy on the testing set doesn't look right.\"\n",
"HTML('<div class=\"alert alert-block alert-success\">Your MFC tagger accuracy looks correct!</div>')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Build an HMM tagger\n",
"---\n",
"The HMM tagger has one hidden state for each possible tag, and parameterized by two distributions: the emission probabilties giving the conditional probability of observing a given **word** from each hidden state, and the transition probabilities giving the conditional probability of moving between **tags** during the sequence.\n",
"\n",
"We will also estimate the starting probability distribution (the probability of each **tag** being the first tag in a sequence), and the terminal probability distribution (the probability of each **tag** being the last tag in a sequence).\n",
"\n",
"The maximum likelihood estimate of these distributions can be calculated from the frequency counts as described in the following sections where you'll implement functions to count the frequencies, and finally build the model. The HMM model will make predictions according to the formula:\n",
"\n",
"$$t_i^n = \\underset{t_i^n}{\\mathrm{argmax}} \\prod_{i=1}^n P(w_i|t_i) P(t_i|t_{i-1})$$\n",
"\n",
"Refer to Speech & Language Processing [Chapter 10](https://web.stanford.edu/~jurafsky/slp3/10.pdf) for more information."
]
},
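{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the product above concrete, the score of one candidate tag sequence under toy (made-up) probabilities can be computed directly; the Viterbi algorithm simply finds the tag sequence that maximizes this score:\n",
"\n",
"```\n",
"# toy probabilities -- illustrative only, not estimated from the corpus\n",
"emission = {('NOUN', 'time'): 0.01, ('VERB', 'flies'): 0.02}    # P(w_i | t_i)\n",
"transition = {('<start>', 'NOUN'): 0.4, ('NOUN', 'VERB'): 0.3}  # P(t_i | t_{i-1})\n",
"\n",
"words_seq = ['time', 'flies']\n",
"tags_seq = ['NOUN', 'VERB']\n",
"\n",
"score, prev = 1.0, '<start>'\n",
"for w, t in zip(words_seq, tags_seq):\n",
"    score *= emission[(t, w)] * transition[(prev, t)]\n",
"    prev = t\n",
"print(score)  # 0.01*0.4 * 0.02*0.3 = 2.4e-05\n",
"```"
]
},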
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### IMPLEMENTATION: Unigram Counts\n",
"\n",
"Complete the function below to estimate the co-occurrence frequency of each symbol over all of the input sequences. The unigram probabilities in our HMM model are estimated from the formula below, where N is the total number of samples in the input. (You only need to compute the counts for now.)\n",
"\n",
"$$P(tag_1) = \\frac{C(tag_1)}{N}$$"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div class=\"alert alert-block alert-success\">Your tag unigrams look good!</div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def unigram_counts(sequences):\n",
" \"\"\"Return a dictionary keyed to each unique value in the input sequence list that\n",
" counts the number of occurrences of the value in the sequences list. The sequences\n",
" collection should be a 2-dimensional array.\n",
" \n",
" For example, if the tag NOUN appears 275558 times over all the input sequences,\n",
" then you should return a dictionary such that your_unigram_counts[NOUN] == 275558.\n",
" \"\"\"\n",
" # TODO: Finish this function!\n",
" cnt = Counter()\n",
" for token in sequences:\n",
" cnt[token]+=1\n",
" return cnt\n",
"\n",
"# TODO: call unigram_counts with a list of tag sequences from the training set\n",
"tag_unigrams = unigram_counts(tags)\n",
"\n",
"assert set(tag_unigrams.keys()) == data.training_set.tagset, \\\n",
" \"Uh oh. It looks like your tag counts doesn't include all the tags!\"\n",
"assert min(tag_unigrams, key=tag_unigrams.get) == 'X', \\\n",
" \"Hmmm...'X' is expected to be the least common class\"\n",
"assert max(tag_unigrams, key=tag_unigrams.get) == 'NOUN', \\\n",
" \"Hmmm...'NOUN' is expected to be the most common class\"\n",
"HTML('<div class=\"alert alert-block alert-success\">Your tag unigrams look good!</div>')"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['ADV', 'NOUN', '.', 'ADV']"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0\n",
"1\n",
"2\n"
]
},
{
"data": {
"text/plain": [
"Counter({('ADV', 'NOUN'): 1, ('NOUN', '.'): 1, ('.', 'ADV'): 1})"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# For the bigram, we have to read two words, & than save them in a tuple or something. Bi gram does not mean to create all \n",
"# possible pairs of words. \n",
"\n",
"\n",
"tags_1 = tags[:4]\n",
"tags_2 = tags[:4]\n",
"tags_1\n",
"\n",
"bigram_list = []\n",
"\n",
"for i in range (len(tags_1)-1):\n",
"\n",
" print(i)\n",
" bigram_list.append((tags_1[i],tags_1[i+1] ))\n",
"\n",
"bi_cnt = Counter()\n",
"for pairs in bigram_list:\n",
" bi_cnt[pairs]+=1\n",
"\n",
"bi_cnt"
]
},
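{
"cell_type": "markdown",
"metadata": {},
"source": [
"An equivalent, more compact way to build the same counts is to zip the list against itself shifted by one. Note that, like the loop above, running this over one flat tag list also counts the pairs that span sentence boundaries:\n",
"\n",
"```\n",
"from collections import Counter\n",
"\n",
"tags_1 = ['ADV', 'NOUN', '.', 'ADV']\n",
"# zip(seq, seq[1:]) yields the consecutive pairs ('ADV','NOUN'), ('NOUN','.'), ('.','ADV')\n",
"bi_cnt = Counter(zip(tags_1, tags_1[1:]))\n",
"print(bi_cnt)\n",
"```"
]
},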
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### IMPLEMENTATION: Bigram Counts\n",
"\n",
"Complete the function below to estimate the co-occurrence frequency of each pair of symbols in each of the input sequences. These counts are used in the HMM model to estimate the bigram probability of two tags from the frequency counts according to the formula: $$P(tag_2|tag_1) = \\frac{C(tag_2|tag_1)}{C(tag_2)}$$\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div class=\"alert alert-block alert-success\">Your tag bigrams look good!</div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def bigram_counts(sequences):\n",
" \"\"\"Return a dictionary keyed to each unique PAIR of values in the input sequences\n",
" list that counts the number of occurrences of pair in the sequences list. The input\n",
" should be a 2-dimensional array.\n",
" \n",
" For example, if the pair of tags (NOUN, VERB) appear 61582 times, then you should\n",
" return a dictionary such that your_bigram_counts[(NOUN, VERB)] == 61582\n",
" \"\"\"\n",
"\n",
" # TODO: Finish this function!\n",
" bigram_list = []\n",
"\n",
" for i in range (len(tags)-1):\n",
" bigram_list.append((tags[i],tags[i+1] ))\n",
" bi_cnt = Counter()\n",
" for pairs in bigram_list:\n",
" bi_cnt[pairs]+=1\n",
" return bi_cnt\n",
"# TODO: call bigram_counts with a list of tag sequences from the training set\n",
"tag_bigrams = bigram_counts(tags)\n",
"\n",
"assert len(tag_bigrams) == 144, \\\n",
" \"Uh oh. There should be 144 pairs of bigrams (12 tags x 12 tags)\"\n",
"assert min(tag_bigrams, key=tag_bigrams.get) in [('X', 'NUM'), ('PRON', 'X')], \\\n",
" \"Hmmm...The least common bigram should be one of ('X', 'NUM') or ('PRON', 'X').\"\n",
"assert max(tag_bigrams, key=tag_bigrams.get) in [('DET', 'NOUN')], \\\n",
" \"Hmmm...('DET', 'NOUN') is expected to be the most common bigram.\"\n",
"HTML('<div class=\"alert alert-block alert-success\">Your tag bigrams look good!</div>')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(data.Y)\n",
"#data.X\n",
"ist_tag_count = Counter()\n",
"for tupi in data.Y:\n",
" ist_tag_count[tupi[0]]+=1\n",
"ist_tag_count['NOUN']\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### IMPLEMENTATION: Sequence Starting Counts\n",
"Complete the code below to estimate the bigram probabilities of a sequence starting with each tag."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div class=\"alert alert-block alert-success\">Your starting tag counts look good!</div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def starting_counts(sequences):\n",
" \"\"\"Return a dictionary keyed to each unique value in the input sequences list\n",
" that counts the number of occurrences where that value is at the beginning of\n",
" a sequence.\n",
" \n",
" For example, if 8093 sequences start with NOUN, then you should return a\n",
" dictionary such that your_starting_counts[NOUN] == 8093\n",
" \"\"\"\n",
" # TODO: Finish this function!\n",
" ist_tag_count = Counter()\n",
" for tupi in sequences:\n",
" ist_tag_count[tupi[0]]+=1\n",
" return ist_tag_count\n",
"\n",
"# TODO: Calculate the count of each tag starting a sequence\n",
"tag_starts = starting_counts(data.training_set.Y)\n",
"\n",
"assert len(tag_starts) == 12, \"Uh oh. There should be 12 tags in your dictionary.\"\n",
"assert min(tag_starts, key=tag_starts.get) == 'X', \"Hmmm...'X' is expected to be the least common starting bigram.\"\n",
"assert max(tag_starts, key=tag_starts.get) == 'DET', \"Hmmm...'DET' is expected to be the most common starting bigram.\"\n",
"HTML('<div class=\"alert alert-block alert-success\">Your starting tag counts look good!</div>')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### IMPLEMENTATION: Sequence Ending Counts\n",
"Complete the function below to estimate the bigram probabilities of a sequence ending with each tag."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div class=\"alert alert-block alert-success\">Your ending tag counts look good!</div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def ending_counts(sequences):\n",
" \"\"\"Return a dictionary keyed to each unique value in the input sequences list\n",
" that counts the number of occurrences where that value is at the end of\n",
" a sequence.\n",
" \n",
" For example, if 18 sequences end with DET, then you should return a\n",
" dictionary such that your_starting_counts[DET] == 18\n",
" \"\"\"\n",
" # TODO: Finish this function!\n",
" last_tag_count = Counter()\n",
" for tupi in sequences:\n",
" last_tag_count[tupi[-1]]+=1\n",
" return last_tag_count\n",
"\n",
"# TODO: Calculate the count of each tag ending a sequence\n",
"tag_ends = ending_counts(data.training_set.Y)\n",
"\n",
"assert len(tag_ends) == 12, \"Uh oh. There should be 12 tags in your dictionary.\"\n",
"assert min(tag_ends, key=tag_ends.get) in ['X', 'CONJ'], \"Hmmm...'X' or 'CONJ' should be the least common ending bigram.\"\n",
"assert max(tag_ends, key=tag_ends.get) == '.', \"Hmmm...'.' is expected to be the most common ending bigram.\"\n",
"HTML('<div class=\"alert alert-block alert-success\">Your ending tag counts look good!</div>')"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# P_of_Word_given_tag_dict \n",
"\n",
"# I need to pass in a prob distribution to the states when I create states. \n",
"# since for each state, I need to have emission distribution, i.e. what words it can produce. \n",
"# It should be of the following fashion, where we can replace the yes no with our words \n",
"# {\"yes\": 0.1, \"no\": 0.9}\n",
"\n",
" # Words \n",
"# Pos Count\n",
"\n",
"# I need to divide the above by total tags of each \n",
"\n",
"\n",
" \n",
"# for each state which is a tag in our case, we need to create a distribution \n",
"# since we have multiple states, therefore we will loop over all states "
]
},
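{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of that loop, assuming the `emission_counts` and `tag_unigrams` objects built earlier in this notebook (the exact shape of the Step 3 implementation may differ):\n",
"\n",
"```\n",
"states = {}\n",
"for tag in data.training_set.tagset:\n",
"    # P(word | tag) = C(tag, word) / C(tag)\n",
"    emission_dist = DiscreteDistribution({\n",
"        word: count / tag_unigrams[tag]\n",
"        for word, count in emission_counts[tag].items()\n",
"    })\n",
"    states[tag] = State(emission_dist, name=tag)\n",
"```"
]
},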
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Counter({'ADV': 44877,\n",
" 'NOUN': 220632,\n",
" '.': 117757,\n",
" 'VERB': 146161,\n",
" 'ADP': 115808,\n",
" 'ADJ': 66754,\n",
" 'CONJ': 30537,\n",
" 'DET': 109671,\n",
" 'PRT': 23906,\n",
" 'NUM': 11878,\n",
" 'PRON': 39383,\n",
" 'X': 1094})"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"collections.Counter"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"44877"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"#Dict_Of_Tags_Words\n",
"\n",
"# Lets also get the count of each tag = C(t)\n",
"tag_count = unigram_counts(tags)\n",
"tag_count\n",
"type(tag_count)\n",
"tag_count.get('ADV')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Dict_Of_Tags_Words"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"# Here we are changing the dictionary of tag- words to freq by dividing the each word count by the number of times a tag \n",
"# occurred. \n",
"\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"44877"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tag_count.get('ADV')"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Whenever': 0.0002673975533123872,\n",
" 'indeed': 0.002050047908728302,\n",
" 'almost': 0.006729505091695078,\n",
" 'Yes': 0.0015375359315462263,\n",
" 'About': 0.0007576264010517637,\n",
" 'back': 0.012812799429551887,\n",
" 'therefore': 0.0027853911803373665,\n",
" 'not': 0.0793725070748936,\n",
" 'deeply': 0.0006462107538382691,\n",
" 'enough': 0.004211511464670099,\n",
" 'so': 0.024154912315885645,\n",
" 'first': 0.005125119771820755,\n",
" 'then': 0.017781937295273748,\n",
" 'most': 0.012077456157942822,\n",
" 'Meanwhile': 0.00037881320052588185,\n",
" 'again': 0.009269781848162757,\n",
" 'more': 0.02157006930053257,\n",
" 'away': 0.008021926599371615,\n",
" 'when': 0.03141921251420549,\n",
" 'also': 0.017670521648060253,\n",
" 'usually': 0.003364752545847539,\n",
" 'sometimes': 0.002941373086436259,\n",
" 'where': 0.015085678632707178,\n",
" 'preferably': 0.00017826503554159147,\n",
" 'because': 0.0025848430153530763,\n",
" 'about': 0.009403480624818949,\n",
" 'especially': 0.0025848430153530763,\n",
" 'manifestly': 0.00011141564721349466,\n",
" 'now': 0.01833901553134122,\n",
" 'why': 0.0043229271118835925,\n",
" 'heavily': 0.0010473070838068498,\n",
" 'just': 0.013013347594536177,\n",
" 'once': 0.0062615593733984,\n",
" 'only': 0.0229961895848653,\n",
" 'honestly': 0.00024511442386968826,\n",
" 'steadily': 0.00035653007108318294,\n",
" 'unblinkingly': 4.456625888539787e-05,\n",
" 'Just': 0.0020277647792856027,\n",
" 'else': 0.002896806827550861,\n",
" 'very': 0.01209973928738552,\n",
" 'much': 0.007799095304944627,\n",
" 'right': 0.0034761681930610337,\n",
" 'voraciously': 2.2283129442698933e-05,\n",
" 'well': 0.012278004322927112,\n",
" 'Because': 0.0004679457182966776,\n",
" 'doubtless': 0.00020054816498429038,\n",
" 'farther': 0.0005793613655101723,\n",
" 'negatively': 4.456625888539787e-05,\n",
" 'later': 0.005526216101789335,\n",
" 'currently': 0.0005793613655101723,\n",
" 'thus': 0.0028076743097800653,\n",
" 'inevitably': 0.0006016444949528711,\n",
" 'far': 0.006640372573924282,\n",
" 'No': 0.0021168972970563985,\n",
" 'enterprisingly': 2.2283129442698933e-05,\n",
" 'Even': 0.0033424694164048397,\n",
" 'how': 0.010740468391380886,\n",
" 'recently': 0.001960915390957506,\n",
" 'Never': 0.0006016444949528711,\n",
" 'Then': 0.006105577467299507,\n",
" 'coldly': 0.0001336987766561936,\n",
" 'directly': 0.002428861109254184,\n",
" 'When': 0.01025023954364151,\n",
" 'no': 0.003632150099159926,\n",
" 'longer': 0.002339728591483388,\n",
" 'as': 0.018628696214096307,\n",
" 'high': 0.0006016444949528711,\n",
" 'really': 0.004701740312409474,\n",
" 'surely': 0.0007353432716090647,\n",
" 'often': 0.006283842502841099,\n",
" 'primarily': 0.001002740824921452,\n",
" 'superbly': 0.00017826503554159147,\n",
" 'successfully': 0.0005570782360674733,\n",
" 'inside': 0.001025023954364151,\n",
" 'likely': 0.0004902288477393765,\n",
" 'All': 0.0010473070838068498,\n",
" 'practically': 0.0009358914365933552,\n",
" 'Why': 0.002540276756467678,\n",
" 'anhydrously': 2.2283129442698933e-05,\n",
" 'surprisingly': 0.00028968068275508614,\n",
" 'Perhaps': 0.001648951578759721,\n",
" 'How': 0.0038104151347015175,\n",
" 'ever': 0.005570782360674733,\n",
" 'even': 0.0166900639525815,\n",
" 'realistically': 0.0001336987766561936,\n",
" 'essentially': 0.0008467589188225595,\n",
" 'too': 0.01439490161998351,\n",
" 'yet': 0.003854981393586915,\n",
" 'ago': 0.004367493370768991,\n",
" 'rapidly': 0.0012255721193484414,\n",
" 'weekly': 0.0001336987766561936,\n",
" 'already': 0.004790872830180271,\n",
" 'fair': 2.2283129442698933e-05,\n",
" 'reasonably': 0.0005793613655101723,\n",
" 'Not': 0.0034538850636183344,\n",
" 'visibly': 0.00011141564721349466,\n",
" 'highly': 0.0017603672259732157,\n",
" 'i.e.': 0.0008021926599371615,\n",
" 'truly': 0.0008021926599371615,\n",
" 'never': 0.012233438064041714,\n",
" 'heretofore': 0.0001336987766561936,\n",
" 'generally': 0.0022283129442698933,\n",
" 'somewhat': 0.0022060298148271944,\n",
" 'newly': 0.0005125119771820755,\n",
" 'for': 0.00011141564721349466,\n",
" 'perhaps': 0.0038104151347015175,\n",
" 'slightly': 0.0014929696726608285,\n",
" 'gently': 0.0005793613655101723,\n",
" 'outward': 8.913251777079573e-05,\n",
" 'certainly': 0.0021391804264990974,\n",
" 'alone': 0.003632150099159926,\n",
" 'all': 0.003208770639748646,\n",
" 'always': 0.007865944693272724,\n",
" 'present': 0.0010695902132495487,\n",
" 'together': 0.004679457182966776,\n",
" 'rhythmically': 2.2283129442698933e-05,\n",
" 'More': 0.00033424694164048397,\n",
" 'maybe': 0.0011810058604630434,\n",
" 'pretty': 0.001002740824921452,\n",
" 'however': 0.006774071350580476,\n",
" 'adequately': 0.00031196381219778506,\n",
" 'low': 0.00024511442386968826,\n",
" 'that': 0.0009581745660360541,\n",
" 'progressively': 0.00011141564721349466,\n",
" 'lately': 0.00020054816498429038,\n",
" 'uncomfortably': 6.68493883280968e-05,\n",
" 'curiously': 0.00015598190609889253,\n",
" 'totally': 0.00042337945941127973,\n",
" 'peculiarly': 0.00015598190609889253,\n",
" 'fully': 0.0013815540254473338,\n",
" 'tremendously': 0.00020054816498429038,\n",
" 'basically': 0.00035653007108318294,\n",
" 'aside': 0.001002740824921452,\n",
" 'widely': 0.0010473070838068498,\n",
" 'outside': 0.0008690420482652584,\n",
" 'along': 0.0026739755331238717,\n",
" 'earlier': 0.0011810058604630434,\n",
" 'constantly': 0.0006907770127236669,\n",
" 'liberally': 8.913251777079573e-05,\n",
" 'exactly': 0.0018717828731867104,\n",
" 'However': 0.003008222474764356,\n",
" 'near': 0.00037881320052588185,\n",
" 'entirely': 0.0015821021904316242,\n",
" 'yes': 0.0009358914365933552,\n",
" 'overboard': 0.00017826503554159147,\n",
" 'quite': 0.004033246429128507,\n",
" 'equally': 0.0011364396015776455,\n",
" 'instead': 0.0022283129442698933,\n",
" 'better': 0.0028745236981081623,\n",
" 'merely': 0.0023842948503687857,\n",
" 'nowhere': 0.00024511442386968826,\n",
" 'actually': 0.0021168972970563985,\n",
" 'precisely': 0.0008021926599371615,\n",
" 'Nearly': 0.00011141564721349466,\n",
" 'Currently': 2.2283129442698933e-05,\n",
" 'some': 0.00033424694164048397,\n",
" 'way': 0.00015598190609889253,\n",
" 'nice': 2.2283129442698933e-05,\n",
" 'there': 0.009470330013147047,\n",
" 'subconsciously': 8.913251777079573e-05,\n",
" 'consciously': 0.00020054816498429038,\n",
" 'here': 0.01067361900305279,\n",
" 'masterfully': 2.2283129442698933e-05,\n",
" 'So': 0.0031196381219778507,\n",
" 'materially': 8.913251777079573e-05,\n",
" 'commercially': 0.00022283129442698932,\n",
" 'Really': 0.00015598190609889253,\n",
" 'Typically': 6.68493883280968e-05,\n",
" 'before': 0.00271854179200927,\n",
" 'undoubtedly': 0.00028968068275508614,\n",
" 'possibly': 0.0009581745660360541,\n",
" 'divinely': 6.68493883280968e-05,\n",
" 'long': 0.003587583840274528,\n",
" 'immediately': 0.0020277647792856027,\n",
" 'Wherefore': 4.456625888539787e-05,\n",
" 'Finally': 0.000980457695478753,\n",
" 'strongly': 0.0006462107538382691,\n",
" 'Maybe': 0.0012255721193484414,\n",
" 'Mostly': 0.00011141564721349466,\n",
" 'Meantime': 4.456625888539787e-05,\n",
" 'Indeed': 0.0007353432716090647,\n",
" 'solemnly': 0.0001336987766561936,\n",
" 'Apart': 0.00022283129442698932,\n",
" 'obviously': 0.0014929696726608285,\n",
" 'extremely': 0.0009136083071506563,\n",
" 'overly': 0.00015598190609889253,\n",
" 'around': 0.00396639704080041,\n",
" 'slowly': 0.001983198520400205,\n",
" 'namely': 0.0006016444949528711,\n",
" 'mutually': 0.00020054816498429038,\n",
" 'otherwise': 0.0011364396015776455,\n",
" 'soon': 0.003097354992535152,\n",
" 'elsewhere': 0.0008021926599371615,\n",
" 'badly': 0.0005793613655101723,\n",
" 'due': 0.00035653007108318294,\n",
" 'Usually': 0.00035653007108318294,\n",
" 'Later': 0.0006684938832809679,\n",
" 'intriguingly': 4.456625888539787e-05,\n",
" 'greatly': 0.0011587227310203446,\n",
" 'Now': 0.004367493370768991,\n",
" 'Well': 0.0004679457182966776,\n",
" 'pointedly': 6.68493883280968e-05,\n",
" 'hell-for-leather': 2.2283129442698933e-05,\n",
" 'ablaze': 6.68493883280968e-05,\n",
" 'uneasily': 0.00011141564721349466,\n",
" 'prior': 0.0006239276243955701,\n",
" 'daily': 0.0008913251777079573,\n",
" 'approximately': 0.0011364396015776455,\n",
" 'two-fold': 2.2283129442698933e-05,\n",
" 'jealously': 2.2283129442698933e-05,\n",
" 'suddenly': 0.0024734273681395816,\n",
" 'happily': 0.00033424694164048397,\n",
" 'behind': 0.00044566258885397864,\n",
" 'necessarily': 0.0009581745660360541,\n",
" 'accidentally': 0.0001336987766561936,\n",
" 'rather': 0.0043229271118835925,\n",
" 'Distally': 2.2283129442698933e-05,\n",
" 'effectively': 0.0005125119771820755,\n",
" 'still': 0.012478552487911403,\n",
" 'Simultaneously': 0.0001336987766561936,\n",
" 'mostly': 0.0006684938832809679,\n",
" 'moreover': 0.00035653007108318294,\n",
" 'There': 0.0007353432716090647,\n",
" 'regardless': 0.0006462107538382691,\n",
" 'presumably': 0.0006239276243955701,\n",
" 'Therefore': 0.0009136083071506563,\n",
" 'less': 0.0036098669697172273,\n",
" 'frequently': 0.0015598190609889253,\n",
" 'accurately': 0.00037881320052588185,\n",
" 'forth': 0.001314704637119237,\n",
" 'Here': 0.0024065779798114846,\n",
" 'relatively': 0.0015152528021035274,\n",
" 'increasingly': 0.0007799095304944627,\n",
" 'permanently': 0.00022283129442698932,\n",
" 'nervously': 4.456625888539787e-05,\n",
" 'virtually': 0.0006462107538382691,\n",
" 'Where': 0.0015821021904316242,\n",
" 'aristocratically': 2.2283129442698933e-05,\n",
" 'incredibly': 0.0001336987766561936,\n",
" 'though': 0.0005570782360674733,\n",
" 'poorly': 0.00017826503554159147,\n",
" 'sensibly': 6.68493883280968e-05,\n",
" 'apparently': 0.0017826503554159146,\n",
" 'Third': 8.913251777079573e-05,\n",
" 'spiritually': 0.00011141564721349466,\n",
" 'nearly': 0.0022951623325979903,\n",
" 'sufficiently': 0.0006684938832809679,\n",
" 'Also': 0.001270138378233839,\n",
" 'early': 0.0017603672259732157,\n",
" 'magnificently': 0.00015598190609889253,\n",
" 'easily': 0.0018494997437440114,\n",
" 'socially': 0.00022283129442698932,\n",
" 'forward': 0.0017826503554159146,\n",
" 'by': 0.0008467589188225595,\n",
" 'chiefly': 0.00040109632996858076,\n",
" 'finally': 0.002250596073712592,\n",
" 'reluctantly': 8.913251777079573e-05,\n",
" 'fruitlessly': 2.2283129442698933e-05,\n",
" 'Thus': 0.003030505604207055,\n",
" 'alike': 0.00020054816498429038,\n",
" 'great': 0.00024511442386968826,\n",
" 'hardly': 0.0016712347082024198,\n",
" 'anyhow': 0.00033424694164048397,\n",
" 'late': 0.0008244757893798605,\n",
" 'Plus': 6.68493883280968e-05,\n",
" 'mainly': 0.0006239276243955701,\n",
" 'provocatively': 2.2283129442698933e-05,\n",
" 'exceedingly': 0.00015598190609889253,\n",
" 'sharply': 0.0005570782360674733,\n",
" 'notably': 0.00024511442386968826,\n",
" 'Too': 0.0005125119771820755,\n",
" 'easy': 0.00020054816498429038,\n",
" 'rarely': 0.0006239276243955701,\n",
" 'formerly': 0.0005793613655101723,\n",
" 'economically': 0.0001336987766561936,\n",
" 'anyway': 0.0007353432716090647,\n",
" 'strictly': 0.0006462107538382691,\n",
" 'probably': 0.004256077723555496,\n",
" 'Less': 0.00011141564721349466,\n",
" 'Admirably': 2.2283129442698933e-05,\n",
" 'partially': 0.00040109632996858076,\n",
" 'earnestly': 0.00020054816498429038,\n",
" 'consistently': 0.00040109632996858076,\n",
" 'gradually': 0.0007130601421663659,\n",
" 'hard': 0.0010695902132495487,\n",
" 'Certainly': 0.0004902288477393765,\n",
" 'advisedly': 2.2283129442698933e-05,\n",
" 'positively': 0.00015598190609889253,\n",
" 'Second': 0.0001336987766561936,\n",
" 'respectively': 0.0004902288477393765,\n",
" 'considerably': 0.0008021926599371615,\n",
" 'further': 0.0016266684493170221,\n",
" 'medically': 4.456625888539787e-05,\n",
" 'fussily': 2.2283129442698933e-05,\n",
" 'Still': 0.0008913251777079573,\n",
" 'originally': 0.00035653007108318294,\n",
" 'p.m.': 0.0008913251777079573,\n",
" 'academically': 0.00011141564721349466,\n",
" 'normally': 0.0006684938832809679,\n",
" 'physically': 0.00031196381219778506,\n",
" 'ruefully': 6.68493883280968e-05,\n",
" 'fairly': 0.0010473070838068498,\n",
" 'altogether': 0.00044566258885397864,\n",
" 'Moreover': 0.0012255721193484414,\n",
" 'across': 0.00017826503554159147,\n",
" 'continuously': 0.00042337945941127973,\n",
" 'somehow': 0.0011141564721349466,\n",
" 'devotedly': 2.2283129442698933e-05,\n",
" 'seemingly': 0.00024511442386968826,\n",
" 'tight': 0.00011141564721349466,\n",
" 'Approximately': 0.0001336987766561936,\n",
" 'simply': 0.0031196381219778507,\n",
" 'Sometimes': 0.0009358914365933552,\n",
" 'etc.': 0.0011587227310203446,\n",
" 'perfectly': 0.0006016444949528711,\n",
" 'Especially': 8.913251777079573e-05,\n",
" 'next': 0.0005347951066247743,\n",
" 'formally': 0.00028968068275508614,\n",
" 'Naturally': 0.00028968068275508614,\n",
" 'insanely': 2.2283129442698933e-05,\n",
" 'quietly': 0.000980457695478753,\n",
" 'naturally': 0.0009136083071506563,\n",
" 'justly': 0.00011141564721349466,\n",
" 'readily': 0.0006907770127236669,\n",
" 'e.g.': 0.0005125119771820755,\n",
" 'Ever': 0.00024511442386968826,\n",
" 'As': 0.0009581745660360541,\n",
" 'purely': 0.0005347951066247743,\n",
" 'safely': 0.00024511442386968826,\n",
" 'above': 0.0015375359315462263,\n",
" 'Better': 0.0001336987766561936,\n",
" 'Again': 0.0008913251777079573,\n",
" 'but': 0.00044566258885397864,\n",
" 'sooner': 0.00022283129442698932,\n",
" 'scarcely': 0.0004902288477393765,\n",
" 'grimly': 0.00020054816498429038,\n",
" 'best': 0.0009581745660360541,\n",
" 'such': 0.00028968068275508614,\n",
" 'aye': 2.2283129442698933e-05,\n",
" 'nay': 2.2283129442698933e-05,\n",
" 'sadly': 0.00015598190609889253,\n",
" 'frankly': 0.00020054816498429038,\n",
" 'Along': 8.913251777079573e-05,\n",
" 'tensely': 6.68493883280968e-05,\n",
" 'upstairs': 0.00042337945941127973,\n",
" 'dangerously': 6.68493883280968e-05,\n",
" 'carefully': 0.0014484034137754306,\n",
" 'constructively': 4.456625888539787e-05,\n",
" 'unluckily': 2.2283129442698933e-05,\n",
" 'algebraically': 6.68493883280968e-05,\n",
" 'sympathetically': 0.00011141564721349466,\n",
" 'plainly': 0.00035653007108318294,\n",
" 'last': 0.0005347951066247743,\n",
" 'atrociously': 2.2283129442698933e-05,\n",
" 'demonstrably': 4.456625888539787e-05,\n",
" 'thence': 0.00011141564721349466,\n",
" 'verie': 2.2283129442698933e-05,\n",
" 'Sure': 0.00040109632996858076,\n",
" 'slimly': 2.2283129442698933e-05,\n",
" 'evenly': 6.68493883280968e-05,\n",
" 'jointly': 0.0001336987766561936,\n",
" 'abroad': 0.0008690420482652584,\n",
" 'eventually': 0.0007353432716090647,\n",
" 'unwittingly': 0.00011141564721349466,\n",
" 'loudly': 0.00028968068275508614,\n",
" 'Luckily': 4.456625888539787e-05,\n",
" 'completely': 0.0018940660026294093,\n",
" 'Often': 0.00040109632996858076,\n",
" 'empirically': 8.913251777079573e-05,\n",
" 'transversely': 2.2283129442698933e-05,\n",
" 'meanwhile': 0.00020054816498429038,\n",
" 'overnight': 0.00022283129442698932,\n",
" 'swiftly': 0.0002673975533123872,\n",
" 'southward': 0.0001336987766561936,\n",
" 'supposedly': 0.0002673975533123872,\n",
" 'little': 0.0020277647792856027,\n",
" 'anywhere': 0.0006907770127236669,\n",
" 'privately': 0.0001336987766561936,\n",
" 'continually': 0.0004679457182966776,\n",
" 'domestically': 2.2283129442698933e-05,\n",
" 'besides': 0.00017826503554159147,\n",
" 'Rather': 0.00020054816498429038,\n",
" 'briefly': 0.0005793613655101723,\n",
" 'bad': 6.68493883280968e-05,\n",
" 'freely': 0.00037881320052588185,\n",
" 'excitedly': 0.00011141564721349466,\n",
" 'hopefully': 0.00011141564721349466,\n",
" 'Last': 4.456625888539787e-05,\n",
" 'clearly': 0.002050047908728302,\n",
" 'neatly': 0.00035653007108318294,\n",
" 'largely': 0.001270138378233839,\n",
" 'since': 0.0006016444949528711,\n",
" 'inward': 0.0001336987766561936,\n",
" 'thermodynamically': 4.456625888539787e-05,\n",
" 'Once': 0.0009136083071506563,\n",
" 'hastily': 0.00024511442386968826,\n",
" 'fortunately': 6.68493883280968e-05,\n",
" 'Only': 0.001604385319874323,\n",
" 'backward': 0.00028968068275508614,\n",
" 'securely': 6.68493883280968e-05,\n",
" 'literally': 0.00035653007108318294,\n",
" 'indoors': 0.00011141564721349466,\n",
" 'Initially': 8.913251777079573e-05,\n",
" 'impeccably': 4.456625888539787e-05,\n",
" 'straight': 0.0010473070838068498,\n",
" 'beyond': 0.0001336987766561936,\n",
" 'organizationally': 2.2283129442698933e-05,\n",
" 'warmly': 0.00015598190609889253,\n",
" 'after': 0.00022283129442698932,\n",
" 'ahead': 0.0017158009670878178,\n",
" 'perpetually': 4.456625888539787e-05,\n",
" 'favorably': 0.00028968068275508614,\n",
" 'instantly': 0.00031196381219778506,\n",
" 'tightly': 0.00033424694164048397,\n",
" 'doubtfully': 2.2283129442698933e-05,\n",
" 'faster': 0.0002673975533123872,\n",
" 'prematurely': 6.68493883280968e-05,\n",
" 'irresistibly': 2.2283129442698933e-05,\n",
" 'helpfully': 6.68493883280968e-05,\n",
" 'aimlessly': 2.2283129442698933e-05,\n",
" 'radically': 0.0002673975533123872,\n",
" 'regularly': 0.00040109632996858076,\n",
" 'Above': 2.2283129442698933e-05,\n",
" 'many': 0.00015598190609889253,\n",
" 'genuinely': 0.00017826503554159147,\n",
" 'prominently': 0.0001336987766561936,\n",
" 'axially': 2.2283129442698933e-05,\n",
" 'below': 0.0011587227310203446,\n",
" 'indirectly': 0.0002673975533123872,\n",
" 'thereof': 0.00028968068275508614,\n",
" 'Yeah': 0.00028968068275508614,\n",
" 'Particularly': 4.456625888539787e-05,\n",
" 'previously': 0.0009136083071506563,\n",
" 'understandingly': 6.68493883280968e-05,\n",
" 'occasionally': 0.0006684938832809679,\n",
" 'overhead': 8.913251777079573e-05,\n",
" 'fast': 0.0008467589188225595,\n",
" 'quickly': 0.001604385319874323,\n",
" 'Generally': 0.0002673975533123872,\n",
" 'everywhere': 0.0007130601421663659,\n",
" 'lazily': 2.2283129442698933e-05,\n",
" 'confidentially': 8.913251777079573e-05,\n",
" 'perennially': 2.2283129442698933e-05,\n",
" 'symbolically': 4.456625888539787e-05,\n",
" 'concretely': 4.456625888539787e-05,\n",
" 'close': 0.0016712347082024198,\n",
" 'mentally': 0.00028968068275508614,\n",
" 'theretofore': 2.2283129442698933e-05,\n",
" 'invisibly': 4.456625888539787e-05,\n",
" 'Carefully': 6.68493883280968e-05,\n",
" 'particularly': 0.0024957104975822804,\n",
" 'painfully': 0.00024511442386968826,\n",
" 'thereby': 0.0006239276243955701,\n",
" 'bravely': 6.68493883280968e-05,\n",
" 'nearby': 0.00040109632996858076,\n",
" 'underneath': 2.2283129442698933e-05,\n",
" 'litle': 2.2283129442698933e-05,\n",
" 'inasmuch': 2.2283129442698933e-05,\n",
" 'heartily': 0.00015598190609889253,\n",
" 'sure': 0.00031196381219778506,\n",
" 'half': 0.00031196381219778506,\n",
" 'quick': 0.0001336987766561936,\n",
" 'Real': 4.456625888539787e-05,\n",
" 'Actually': 0.0006016444949528711,\n",
" 'Instead': 0.0006907770127236669,\n",
" 'appreciatively': 4.456625888539787e-05,\n",
" 'dimly': 0.00017826503554159147,\n",
" 'intently': 8.913251777079573e-05,\n",
" 'Apparently': 0.00033424694164048397,\n",
" 'self-consciously': 6.68493883280968e-05,\n",
" 'exceptionally': 0.00015598190609889253,\n",
" 'wholly': 0.00042337945941127973,\n",
" 'tenuously': 2.2283129442698933e-05,\n",
" 'Furthermore': 0.0007130601421663659,\n",
" 'aboard': 0.00028968068275508614,\n",
" 'thither': 2.2283129442698933e-05,\n",
" 'homewards': 2.2283129442698933e-05,\n",
" 'rebelliously': 6.68493883280968e-05,\n",
" 'lightly': 0.0005793613655101723,\n",
" 'least': 0.00028968068275508614,\n",
" 'brilliantly': 0.00015598190609889253,\n",
" 'principally': 0.00020054816498429038,\n",
" 'bitterly': 0.0002673975533123872,\n",
" 'nearest': 4.456625888539787e-05,\n",
" 'upon': 0.0006239276243955701,\n",
" 'with': 6.68493883280968e-05,\n",
" 'officially': 0.00031196381219778506,\n",
" 'unofficially': 2.2283129442698933e-05,\n",
" 'closer': 0.0006684938832809679,\n",
" 'forever': 0.0006462107538382691,\n",
" 'openly': 0.0006684938832809679,\n",
" 'softly': 0.0005793613655101723,\n",
" 'Obviously': 0.0004902288477393765,\n",
" 'infernally': 2.2283129442698933e-05,\n",
" 'knowingly': 6.68493883280968e-05,\n",
" 'hence': 0.00037881320052588185,\n",
" 'stiffly': 0.00015598190609889253,\n",
" 'impatiently': 0.0001336987766561936,\n",
" 'firmly': 0.0008244757893798605,\n",
" 'dearly': 6.68493883280968e-05,\n",
" 'counter': 6.68493883280968e-05,\n",
" 'fundamentally': 0.00015598190609889253,\n",
" 'seldom': 0.0006016444949528711,\n",
" 'boldly': 0.00015598190609889253,\n",
" 'dramatically': 0.00015598190609889253,\n",
" 'aback': 2.2283129442698933e-05,\n",
" 'solidly': 0.00020054816498429038,\n",
" 'conclusively': 0.00011141564721349466,\n",
" 'methodically': 0.00011141564721349466,\n",
" 'quarterly': 2.2283129442698933e-05,\n",
" 'ontologically': 2.2283129442698933e-05,\n",
" 'miserably': 6.68493883280968e-05,\n",
" 'subjectively': 0.0001336987766561936,\n",
" 'plumb': 8.913251777079573e-05,\n",
" 'thereto': 0.00022283129442698932,\n",
" 'simultaneously': 0.0005347951066247743,\n",
" 'narrowly': 0.0001336987766561936,\n",
" 'beautifully': 0.00024511442386968826,\n",
" 'A.D.': 0.00017826503554159147,\n",
" 'Recently': 0.0002673975533123872,\n",
" 'Very': 0.0004902288477393765,\n",
" 'barely': 0.0006016444949528711,\n",
" 'outdoors': 6.68493883280968e-05,\n",
" 'nowadays': 0.00017826503554159147,\n",
" 'Statistically': 2.2283129442698933e-05,\n",
" 'artificially': 0.00011141564721349466,\n",
" 'Frequently': 0.0001336987766561936,\n",
" 'foremost': 0.0001336987766561936,\n",
" 'wherever': 0.00044566258885397864,\n",
" 'glibly': 6.68493883280968e-05,\n",
" 'vaguely': 0.0002673975533123872,\n",
" 'conceivably': 0.00017826503554159147,\n",
" 'Yet': 0.0011364396015776455,\n",
" 'out-of-doors': 2.2283129442698933e-05,\n",
" 'upward': 0.00035653007108318294,\n",
" 'Occasionally': 0.00031196381219778506,\n",
" 'Similarly': 0.00031196381219778506,\n",
" 'profoundly': 0.0001336987766561936,\n",
" 'harmoniously': 2.2283129442698933e-05,\n",
" 'responsibly': 2.2283129442698933e-05,\n",
" 'evidently': 0.00040109632996858076,\n",
" 'partly': 0.0008913251777079573,\n",
" 'deliberately': 0.0005347951066247743,\n",
" 'incomparably': 6.68493883280968e-05,\n",
" 'incidentally': 8.913251777079573e-05,\n",
" 'thoughtfully': 0.00022283129442698932,\n",
" 'seriously': 0.0008690420482652584,\n",
" 'religiously': 8.913251777079573e-05,\n",
" 'Almost': 0.0005125119771820755,\n",
" 'properly': 0.0009136083071506563,\n",
" 'popularly': 0.0001336987766561936,\n",
" 'imperfectly': 4.456625888539787e-05,\n",
" 'Commonly': 2.2283129442698933e-05,\n",
" 'westward': 0.00017826503554159147,\n",
" 'deeper': 0.00015598190609889253,\n",
" 'hereby': 0.00015598190609889253,\n",
" 'persistently': 6.68493883280968e-05,\n",
" 'conveniently': 0.00011141564721349466,\n",
" 'analytically': 2.2283129442698933e-05,\n",
" 'ruthlessly': 4.456625888539787e-05,\n",
" 'wildly': 0.00044566258885397864,\n",
" 'shrewdly': 4.456625888539787e-05,\n",
" 'politely': 0.00020054816498429038,\n",
" 'imprudently': 4.456625888539787e-05,\n",
" 'unfortunately': 0.00028968068275508614,\n",
" 'Closely': 6.68493883280968e-05,\n",
" 'inordinately': 4.456625888539787e-05,\n",
" 'Conversely': 6.68493883280968e-05,\n",
" 'Somehow': 0.00028968068275508614,\n",
" 'astray': 4.456625888539787e-05,\n",
" 'Wherever': 6.68493883280968e-05,\n",
" 'parallel': 0.00020054816498429038,\n",
" 'Surely': 0.00017826503554159147,\n",
" 'suitably': 6.68493883280968e-05,\n",
" 'sometime': 0.0001336987766561936,\n",
" 'gorgeously': 2.2283129442698933e-05,\n",
" 'generously': 0.00015598190609889253,\n",
" 'logically': 0.0001336987766561936,\n",
" 'hereafter': 6.68493883280968e-05,\n",
" 'momentarily': 8.913251777079573e-05,\n",
" 'succinctly': 2.2283129442698933e-05,\n",
" 'frantically': 0.00015598190609889253,\n",
" 'grudgingly': 0.0001336987766561936,\n",
" 'Consequently': 0.00015598190609889253,\n",
" 'overwhelmingly': 0.00015598190609889253,\n",
" 'thick': 4.456625888539787e-05,\n",
" 'similarly': 0.00031196381219778506,\n",
" 'Besides': 0.00044566258885397864,\n",
" 'resolutely': 4.456625888539787e-05,\n",
" 'terribly': 0.00031196381219778506,\n",
" 'Incidentally': 8.913251777079573e-05,\n",
" 'Forever': 4.456625888539787e-05,\n",
" 'urgently': 0.00011141564721349466,\n",
" 'Off-Broadway': 2.2283129442698933e-05,\n",
" 'comparatively': 0.00020054816498429038,\n",
" 'Below': 0.00011141564721349466,\n",
" 'unanimously': 0.00022283129442698932,\n",
" 'publicly': 0.00044566258885397864,\n",
" 'stunningly': 2.2283129442698933e-05,\n",
" 'mutely': 4.456625888539787e-05,\n",
" 'immensely': 0.0001336987766561936,\n",
" 'whenever': 0.0005570782360674733,\n",
" 'faithfully': 8.913251777079573e-05,\n",
" 'Basically': 4.456625888539787e-05,\n",
" 'Hence': 0.0005125119771820755,\n",
" 'Far': 0.00022283129442698932,\n",
" 'temporarily': 0.00037881320052588185,\n",
" 'furiously': 0.00024511442386968826,\n",
" 'clockwise': 6.68493883280968e-05,\n",
" 'counter-clockwise': 2.2283129442698933e-05,\n",
" 'Momentarily': 2.2283129442698933e-05,\n",
" 'Fourth': 2.2283129442698933e-05,\n",
" 'chronologically': 2.2283129442698933e-05,\n",
" 'cosily': 2.2283129442698933e-05,\n",
" 'correctly': 0.00020054816498429038,\n",
" 'solely': 0.00040109632996858076,\n",
" 'desperately': 0.00033424694164048397,\n",
" 'apart': 0.0008690420482652584,\n",
" 'Thereupon': 4.456625888539787e-05,\n",
" 'blindly': 0.0001336987766561936,\n",
" 'within': 0.00017826503554159147,\n",
" 'repeatedly': 0.00040109632996858076,\n",
" 'inexorably': 2.2283129442698933e-05,\n",
" 'visually': 6.68493883280968e-05,\n",
" 'agilely': 2.2283129442698933e-05,\n",
" 'classically': 4.456625888539787e-05,\n",
" 'nevertheless': 0.0004902288477393765,\n",
" 'nearer': 0.00015598190609889253,\n",
" 'clear': 0.00017826503554159147,\n",
" 'absently': 8.913251777079573e-05,\n",
" 'dizzily': 2.2283129442698933e-05,\n",
" 'hopelessly': 0.00015598190609889253,\n",
" 'any': 0.0002673975533123872,\n",
" 'closely': 0.0011810058604630434,\n",
" 'cruelly': 6.68493883280968e-05,\n",
" 'actively': 0.00024511442386968826,\n",
" 'ineptly': 2.2283129442698933e-05,\n",
" 'superficially': 6.68493883280968e-05,\n",
" \"o'clock\": 0.0007130601421663659,\n",
" 'nominally': 6.68493883280968e-05,\n",
" 'Anyway': 0.00015598190609889253,\n",
" 'Most': 0.00015598190609889253,\n",
" 'Much': 0.0001336987766561936,\n",
" 'adamantly': 2.2283129442698933e-05,\n",
" 'blithely': 6.68493883280968e-05,\n",
" 'continentally': 2.2283129442698933e-05,\n",
" 'finely': 8.913251777079573e-05,\n",
" 'initially': 0.0002673975533123872,\n",
" 'ordinarily': 0.00024511442386968826,\n",
" 'affectionately': 6.68493883280968e-05,\n",
" 'significantly': 0.00020054816498429038,\n",
" 'acutely': 0.00011141564721349466,\n",
" 'Eventually': 0.00017826503554159147,\n",
" 'eagerly': 0.0002673975533123872,\n",
" 'authentically': 4.456625888539787e-05,\n",
" 'automatically': 0.0005793613655101723,\n",
" 'silently': 0.0002673975533123872,\n",
" 'ominously': 6.68493883280968e-05,\n",
" 'Already': 0.00033424694164048397,\n",
" 'uniformly': 8.913251777079573e-05,\n",
" 'awful': 6.68493883280968e-05,\n",
" 'roughly': 0.0004679457182966776,\n",
" 'Back': 0.00022283129442698932,\n",
" 'Inevitably': 6.68493883280968e-05,\n",
" 'stark': 2.2283129442698933e-05,\n",
" 'Ordinarily': 4.456625888539787e-05,\n",
" 'schematically': 6.68493883280968e-05,\n",
" 'verbally': 6.68493883280968e-05,\n",
" 'Secondly': 8.913251777079573e-05,\n",
" 'soundly': 4.456625888539787e-05,\n",
" 'dreamlessly': 2.2283129442698933e-05,\n",
" 'extensively': 0.00017826503554159147,\n",
" 'Suddenly': 0.00042337945941127973,\n",
" 'proportionately': 0.0001336987766561936,\n",
" 'experimentally': 0.0001336987766561936,\n",
" 'aloud': 0.00020054816498429038,\n",
" 'Somewhere': 0.00022283129442698932,\n",
" 'somewhere': 0.0007353432716090647,\n",
" 'utterly': 0.00040109632996858076,\n",
" 'curvaceously': 2.2283129442698933e-05,\n",
" 'subtly': 8.913251777079573e-05,\n",
" 'competently': 0.00011141564721349466,\n",
" 'That': 2.2283129442698933e-05,\n",
" 'halfway': 0.00017826503554159147,\n",
" 'Feebly': 2.2283129442698933e-05,\n",
" 'separately': 0.00020054816498429038,\n",
" 'purposely': 8.913251777079573e-05,\n",
" 'Sooner': 4.456625888539787e-05,\n",
" 'Hardly': 0.0001336987766561936,\n",
" 'presently': 0.00044566258885397864,\n",
" 'rationally': 4.456625888539787e-05,\n",
" 'prickly': 2.2283129442698933e-05,\n",
" 'tentatively': 0.00011141564721349466,\n",
" 'either': 0.0005125119771820755,\n",
" 'somewheres': 2.2283129442698933e-05,\n",
" \"O'Clock\": 4.456625888539787e-05,\n",
" 'professedly': 4.456625888539787e-05,\n",
" 'thereafter': 0.0002673975533123872,\n",
" 'That-a-way': 2.2283129442698933e-05,\n",
" 'upright': 0.00011141564721349466,\n",
" 'downstairs': 0.00020054816498429038,\n",
" 'financially': 0.00017826503554159147,\n",
" 'victoriously': 2.2283129442698933e-05,\n",
" 'girlishly': 4.456625888539787e-05,\n",
" 'upwards': 0.00011141564721349466,\n",
" 'True': 0.00015598190609889253,\n",
" 'annually': 0.00028968068275508614,\n",
" 'indelibly': 2.2283129442698933e-05,\n",
" 'Essentially': 6.68493883280968e-05,\n",
" 'First': 0.0007353432716090647,\n",
" 'gloriously': 2.2283129442698933e-05,\n",
" 'morally': 0.00015598190609889253,\n",
" 'politically': 0.00020054816498429038,\n",
" 'sociologically': 2.2283129442698933e-05,\n",
" 'importantly': 8.913251777079573e-05,\n",
" 'nicely': 0.00017826503554159147,\n",
" 'specifically': 0.0006462107538382691,\n",
" 'real': 0.00028968068275508614,\n",
" 'vice': 0.0001336987766561936,\n",
" 'versa': 0.0001336987766561936,\n",
" 'awfully': 0.00015598190609889253,\n",
" 'according': 0.00015598190609889253,\n",
" 'farthest': 2.2283129442698933e-05,\n",
" 'deep': 0.0002673975533123872,\n",
" 'oftener': 2.2283129442698933e-05,\n",
" 'wide': 0.00017826503554159147,\n",
" 'fatally': 6.68493883280968e-05,\n",
" 'a.m.': 0.00042337945941127973,\n",
" 'Out': 0.00017826503554159147,\n",
" 'mechanically': 6.68493883280968e-05,\n",
" 'overseas': 0.00017826503554159147,\n",
" 'consequently': 0.0002673975533123872,\n",
" 'idly': 0.0001336987766561936,\n",
" 'unexpectedly': 0.00022283129442698932,\n",
" 'reproducibly': 2.2283129442698933e-05,\n",
" 'definitely': 0.00037881320052588185,\n",
" 'wordlessly': 2.2283129442698933e-05,\n",
" 'excellently': 8.913251777079573e-05,\n",
" 'enthusiastically': 6.68493883280968e-05,\n",
" 'systematically': 0.00017826503554159147,\n",
" 'numerically': 4.456625888539787e-05,\n",
" 'Nevertheless': 0.0007799095304944627,\n",
" 'culturally': 4.456625888539787e-05,\n",
" 'geographically': 8.913251777079573e-05,\n",
" 'despairingly': 6.68493883280968e-05,\n",
" 'limply': 2.2283129442698933e-05,\n",
" 'N-no': 2.2283129442698933e-05,\n",
" 'justifiably': 0.00011141564721349466,\n",
" 'afloat': 0.00011141564721349466,\n",
" 'Jist': 4.456625888539787e-05,\n",
" 'lots': 2.2283129442698933e-05,\n",
" 'steady': 6.68493883280968e-05,\n",
" 'therein': 0.00017826503554159147,\n",
" 'efficaciously': 4.456625888539787e-05,\n",
" 'unavoidably': 6.68493883280968e-05,\n",
" 'therefrom': 6.68493883280968e-05,\n",
" 'Cautiously': 4.456625888539787e-05,\n",
" 'photographically': 2.2283129442698933e-05,\n",
" 'mighty': 0.00022283129442698932,\n",
" 'Silently': 2.2283129442698933e-05,\n",
" 'Pretty': 6.68493883280968e-05,\n",
" 'scornfully': 4.456625888539787e-05,\n",
" 'afterwards': 0.0001336987766561936,\n",
" 'concurrently': 2.2283129442698933e-05,\n",
" 'endlessly': 0.0001336987766561936,\n",
" 'haggardly': 2.2283129442698933e-05,\n",
" 'Apprehensively': 2.2283129442698933e-05,\n",
" 'dogmatically': 4.456625888539787e-05,\n",
" 'exquisitely': 4.456625888539787e-05,\n",
" 'apprehensively': 4.456625888539787e-05,\n",
" 'concededly': 2.2283129442698933e-05,\n",
" 'gracefully': 0.00017826503554159147,\n",
" 'second': 8.913251777079573e-05,\n",
" 'elaborately': 0.0001336987766561936,\n",
" 'hereabouts': 4.456625888539787e-05,\n",
" 'Likewise': 0.00011141564721349466,\n",
" 'dimensionally': 2.2283129442698933e-05,\n",
" 'exclusively': 0.00042337945941127973,\n",
" 'paradoxically': 0.00015598190609889253,\n",
" 'twice': 0.0011587227310203446,\n",
" 'purposively': 2.2283129442698933e-05,\n",
" 'Soon': 0.00037881320052588185,\n",
" 'A.M.': 0.00024511442386968826,\n",
" 'greenly': 2.2283129442698933e-05,\n",
" 'thoroughly': 0.0007130601421663659,\n",
" 'technically': 0.00011141564721349466,\n",
" 'Together': 0.00015598190609889253,\n",
" 'noisily': 6.68493883280968e-05,\n",
" 'Someday': 8.913251777079573e-05,\n",
" 'harder': 0.00031196381219778506,\n",
" 'savagely': 6.68493883280968e-05,\n",
" 'personally': 0.0005347951066247743,\n",
" 'likewise': 0.0002673975533123872,\n",
" 'vigorously': 0.00017826503554159147,\n",
" 'suspiciously': 4.456625888539787e-05,\n",
" 'open': 0.0001336987766561936,\n",
" 'Inside': 0.00015598190609889253,\n",
" 'astonishingly': 8.913251777079573e-05,\n",
" 'downhill': 6.68493883280968e-05,\n",
" 'Microscopically': 6.68493883280968e-05,\n",
" 'ornately': 2.2283129442698933e-05,\n",
" 'commonly': 0.00044566258885397864,\n",
" 'erroneously': 2.2283129442698933e-05,\n",
" 'hither': 2.2283129442698933e-05,\n",
" 'yon': 2.2283129442698933e-05,\n",
" 'Convulsively': 2.2283129442698933e-05,\n",
" 'Aside': 0.0001336987766561936,\n",
" 'dismally': 4.456625888539787e-05,\n",
" 'Inland': 2.2283129442698933e-05,\n",
" 'under': 0.00020054816498429038,\n",
" 'Heavily': 2.2283129442698933e-05,\n",
" 'differently': 0.00035653007108318294,\n",
" 'Next': 0.00022283129442698932,\n",
" 'Fortunately': 0.00031196381219778506,\n",
" 'ingeniously': 2.2283129442698933e-05,\n",
" 'round': 0.0002673975533123872,\n",
" 'slow': 4.456625888539787e-05,\n",
" 'loud': 8.913251777079573e-05,\n",
" 'strong': 6.68493883280968e-05,\n",
" 'recurrently': 2.2283129442698933e-05,\n",
" 'Afterwards': 8.913251777079573e-05,\n",
" 'cautiously': 0.00011141564721349466,\n",
" 'ashore': 0.0001336987766561936,\n",
" 'unconsciously': 0.00015598190609889253,\n",
" 'obliquely': 2.2283129442698933e-05,\n",
" 'stupidly': 4.456625888539787e-05,\n",
" 'selectively': 4.456625888539787e-05,\n",
" 'busily': 0.00017826503554159147,\n",
" 'humbly': 6.68493883280968e-05,\n",
" 'instinctively': 8.913251777079573e-05,\n",
" 'wilfully': 4.456625888539787e-05,\n",
" 'honorably': 4.456625888539787e-05,\n",
" 'neither': 0.0001336987766561936,\n",
" 'alreadeh': 2.2283129442698933e-05,\n",
" 'perversely': 6.68493883280968e-05,\n",
" 'Alone': 4.456625888539787e-05,\n",
" 'ill': 4.456625888539787e-05,\n",
" 'higher': 0.00028968068275508614,\n",
" 'eye-to-eye': 2.2283129442698933e-05,\n",
" 'Otherwise': 0.00028968068275508614,\n",
" 'monthly': 6.68493883280968e-05,\n",
" 'collectively': 8.913251777079573e-05,\n",
" 'wherein': 6.68493883280968e-05,\n",
" 'drastically': 0.00015598190609889253,\n",
" 'Slowly': 0.0001336987766561936,\n",
" 'invariably': 0.0006239276243955701,\n",
" 'palely': 2.2283129442698933e-05,\n",
" 'accusingly': 4.456625888539787e-05,\n",
" 'intimately': 0.00011141564721349466,\n",
" 'dialectically': 2.2283129442698933e-05,\n",
" 'vividly': 0.00015598190609889253,\n",
" 'grossly': 6.68493883280968e-05,\n",
" 'midway': 8.913251777079573e-05,\n",
" 'profusely': 6.68493883280968e-05,\n",
" 'ultimately': 0.00028968068275508614,\n",
" 'predominantly': 0.0001336987766561936,\n",
" 'supra': 2.2283129442698933e-05,\n",
" 'Partly': 6.68493883280968e-05,\n",
" 'unusually': 0.0001336987766561936,\n",
" 'distractedly': 2.2283129442698933e-05,\n",
" 'Regardless': 0.00011141564721349466,\n",
" 'unfairly': 4.456625888539787e-05,\n",
" 'fondly': 4.456625888539787e-05,\n",
" 'satisfactorily': 0.00020054816498429038,\n",
" 'Reputedly': 2.2283129442698933e-05,\n",
" 'underwater': 6.68493883280968e-05,\n",
" 'fine': 6.68493883280968e-05,\n",
" 'nonetheless': 0.00011141564721349466,\n",
" 'perpendicularly': 2.2283129442698933e-05,\n",
" 'Formally': 2.2283129442698933e-05,\n",
" 'Entirely': 2.2283129442698933e-05,\n",
" 'unhesitatingly': 4.456625888539787e-05,\n",
" 'alongside': 6.68493883280968e-05,\n",
" 'surreptitiously': 6.68493883280968e-05,\n",
" 'shyly': 4.456625888539787e-05,\n",
" 'technologically': 2.2283129442698933e-05,\n",
" 'purposefully': 2.2283129442698933e-05,\n",
" 'Clearly': 0.00015598190609889253,\n",
" 'skillfully': 8.913251777079573e-05,\n",
" 'Historically': 8.913251777079573e-05,\n",
" 'sweetly': 6.68493883280968e-05,\n",
" 'shortly': 0.00040109632996858076,\n",
" 'homogeneously': 4.456625888539787e-05,\n",
" 'drunkenly': 8.913251777079573e-05,\n",
" 'asleep': 0.00037881320052588185,\n",
" 'wherewith': 2.2283129442698933e-05,\n",
" 'except': 0.0001336987766561936,\n",
" 'zealously': 4.456625888539787e-05,\n",
" 'overtly': 6.68493883280968e-05,\n",
" 'Precisely': 4.456625888539787e-05,\n",
" 'wittingly': 2.2283129442698933e-05,\n",
" 'funny': 2.2283129442698933e-05,\n",
" 'Immediately': 0.00011141564721349466,\n",
" 'loosely': 0.00017826503554159147,\n",
" 'locally': 0.00022283129442698932,\n",
" 'insufficiently': 6.68493883280968e-05,\n",
" 'perforce': 2.2283129442698933e-05,\n",
" 'alternatively': 4.456625888539787e-05,\n",
" 'vastly': 0.00015598190609889253,\n",
" 'Behind': 4.456625888539787e-05,\n",
" 'sobbingly': 2.2283129442698933e-05,\n",
" 'cheaply': 4.456625888539787e-05,\n",
" 'truthfully': 8.913251777079573e-05,\n",
" 'Presently': 0.00017826503554159147,\n",
" 'a-la-Aristotle': 2.2283129442698933e-05,\n",
" 'intensely': 0.00017826503554159147,\n",
" 'Obligingly': 2.2283129442698933e-05,\n",
" 'substantially': 0.0005347951066247743,\n",
" 'smartly': 4.456625888539787e-05,\n",
" 'spectacularly': 4.456625888539787e-05,\n",
" 'typically': 0.00024511442386968826,\n",
" 'distinctly': 0.00020054816498429038,\n",
" 'free': 0.00015598190609889253,\n",
" 'mistakenly': 6.68493883280968e-05,\n",
" 'irregularly': 8.913251777079573e-05,\n",
" 'beforehand': 4.456625888539787e-05,\n",
" 'subsequently': 0.00017826503554159147,\n",
" 'stealthily': 2.2283129442698933e-05,\n",
" 'upstream': 6.68493883280968e-05,\n",
" 'subsequent': 2.2283129442698933e-05,\n",
" 'patiently': 0.00015598190609889253,\n",
" 'Accordingly': 0.0002673975533123872,\n",
" 'wrong': 6.68493883280968e-05,\n",
" 'casually': 0.00022283129442698932,\n",
" 'appreciably': 0.0001336987766561936,\n",
" 'specially': 0.0001336987766561936,\n",
" 'imperiously': 4.456625888539787e-05,\n",
" 'Farther': 4.456625888539787e-05,\n",
" 'downright': 4.456625888539787e-05,\n",
" 'Systematically': 2.2283129442698933e-05,\n",
" 'conspicuously': 0.00015598190609889253,\n",
" 'racially': 2.2283129442698933e-05,\n",
" 'Softly': 2.2283129442698933e-05,\n",
" 'smoothly': 0.00020054816498429038,\n",
" 'sensitively': 2.2283129442698933e-05,\n",
" 'periodically': 0.00011141564721349466,\n",
" 'hypothalamically': 2.2283129442698933e-05,\n",
" 'formidably': 2.2283129442698933e-05,\n",
" 'between': 2.2283129442698933e-05,\n",
" 'Originally': 4.456625888539787e-05,\n",
" 'rigidly': 0.00017826503554159147,\n",
" 'Some': 0.00011141564721349466,\n",
" 'semantically': 4.456625888539787e-05,\n",
" 'yearly': 8.913251777079573e-05,\n",
" 'nonspecifically': 4.456625888539787e-05,\n",
" 'opposite': 4.456625888539787e-05,\n",
" 'Ideally': 8.913251777079573e-05,\n",
" 'Oddly': 6.68493883280968e-05,\n",
" 'rightly': 8.913251777079573e-05,\n",
" 'Strangely': 2.2283129442698933e-05,\n",
" 'whereby': 0.00037881320052588185,\n",
" 'legitimately': 4.456625888539787e-05,\n",
" 'hull-first': 2.2283129442698933e-05,\n",
" 'individually': 0.00031196381219778506,\n",
" 'psychologically': 4.456625888539787e-05,\n",
" 'erotically': 2.2283129442698933e-05,\n",
" 'amazingly': 6.68493883280968e-05,\n",
" 'alertly': 2.2283129442698933e-05,\n",
" 'sleepily': 2.2283129442698933e-05,\n",
" 'friendlily': 2.2283129442698933e-05,\n",
" 'absolutely': 0.0004679457182966776,\n",
" 'adrift': 2.2283129442698933e-05,\n",
" 'clairaudiently': 2.2283129442698933e-05,\n",
" 'courteously': 6.68493883280968e-05,\n",
" 'linearly': 6.68493883280968e-05,\n",
" 'emphatically': 4.456625888539787e-05,\n",
" 'henceforth': 6.68493883280968e-05,\n",
" 'remarkably': 0.00037881320052588185,\n",
" 'irreparably': 2.2283129442698933e-05,\n",
" 'inversely': 8.913251777079573e-05,\n",
" 'unquestionably': 0.0001336987766561936,\n",
" 'richly': 0.00011141564721349466,\n",
" 'legally': 8.913251777079573e-05,\n",
" 'severely': 0.00020054816498429038,\n",
" 'Rarely': 4.456625888539787e-05,\n",
" 'ideally': 8.913251777079573e-05,\n",
" 'intermittently': 4.456625888539787e-05,\n",
" 'scholastically': 2.2283129442698933e-05,\n",
" 'Outside': 0.00024511442386968826,\n",
" 'afterward': 0.00028968068275508614,\n",
" 'aptly': 8.913251777079573e-05,\n",
" 'ironically': 6.68493883280968e-05,\n",
" 'thoughtlessly': 2.2283129442698933e-05,\n",
" 'willfully': 2.2283129442698933e-05,\n",
" 'upside': 0.00015598190609889253,\n",
" 'apiece': 4.456625888539787e-05,\n",
" 'good': 0.00017826503554159147,\n",
" 'Further': 0.00028968068275508614,\n",
" 'insofar': 8.913251777079573e-05,\n",
" 'Gradually': 0.0001336987766561936,\n",
" 'Beyond': 4.456625888539787e-05,\n",
" 'unfunnily': 2.2283129442698933e-05,\n",
" 'implicitly': 6.68493883280968e-05,\n",
" 'eminently': 6.68493883280968e-05,\n",
" 'roundly': 4.456625888539787e-05,\n",
" 'linguistically': 2.2283129442698933e-05,\n",
" 'Altogether': 8.913251777079573e-05,\n",
" 'idiotically': 2.2283129442698933e-05,\n",
" 'functionally': 4.456625888539787e-05,\n",
" 'Anyhow': 6.68493883280968e-05,\n",
" ...}"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Dict_Of_Tags_Words['ADV']"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pomegranate.hmm.HiddenMarkovModel"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAATwAAACCCAYAAADbsnS3AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAIABJREFUeJzt3XlcVPX+P/DXmYVhG3aQzVhENFTIBSJ3c1fMJc2uZWGEWZeyeysts2/5sN0yu1dTKtNU1DQ1tVJJLCUTw1QQIhJlEVkccACBYWBm3r8/+M25sg8wzAHm83w85lF4tvcMH97zOed8zvvDEREYhmHMgUjoABiGYUyFJTyGYcwGS3gMw5gNlvAYhjEbLOExDGM2WMJjGMZssITHMIzZYAmPYRizwRIewzBmQ2Li47HHOhiGMbY9ABYZsiLr4TEMYzZYwmMYxmywhMcwjNkw9TW8Xq2mpgZ5eXm4c+cOysvLodFoUF5eDgCQyWSwsbGBg4MDbG1tYWNjAzc3N1hYWAgcNdNb1NTUAABycnJQVVXVoA1aWloCAGxsbGBvbw87OzvY2trC0dGRX2YOWMJrJ41Gg7S0NJw/fx6///470tPTkZubCwAoKipq174kEgn8/PwAAIMGDcK9996LoKAgjB8/Ht7e3kaPnekdtFot0tPTkZSUhKSkJGRkZCA3NxeFhYUd2p++rfXr1w8hISEYMWIERo4ciX79+hkz7G6BM3E9vB53l1alUiE+Ph4HDx4EABw9ehRKpRJyuRwjRozAfffdB19fXwCAj48P7rnnHtjb28Pe3h5isRgODg4A6r999d+6lZWVqKqqwo0bN5CRkQEAyMjIQEZGBjIzM1FbW4v77rsPM2bMAADMmjULYWFh4DjO9B8AI7iamhrEx8dj//79AOrbYHl5OWxtbREaGoqQkBD4+vry7dDX1xdyubxBG9T3/iorK1FRUcG3w5KSEly7dg0AkJWVhUuXLiElJQVqtRr9+vXD9OnTsWhR/Q3Q8PDw7toGDb5LCyIy5avHKCgooFWrVpGTkxOJRCIaPXo0jR49mtavX09paWmk1Wq75LgqlYqOHz9OMTEx5OfnR35+fgSAgoODaefOnVRbW9slx2W6n4KCAlq5ciU5ODg0aIOffPIJpaSkkEaj6ZLjqtVqOn36NL3++usUFBREqO+oUGBgIG3atImqqqq65LidsJsMzEEs4TVSVlZGy5cvJwsLC3J3d6e1a9dSQUGBoDH98ccf9Nhjj5FEIiEfHx/67LPPSKPRdFmDZ4SjVCpJqVRSTEwMyWQycnd3p3fffZfy8/MFi+nixYt08eJFWrp0KVlZWZG7uztt2bKF6urqBIupEYMTHrtLyzCM+TA0Mxrp1W3t27eP9u3bRx4eHuTq6kqxsbFUU1MjdFgNZGdn0/PPP08WFhY0YsQIGjFiBKWnpwsdFmMk+/fvJw8PD/Lw8KA+ffp0yzZ469Ytvg2GhYVRenp6d2iD7JTWUBqNhl566SXiOI44jqPo6GgqLS0VOqxWpaen0wMPPEAPPPAAWVtb065du4QOiekEjUZDy5cvJ47jKCoqiqKiouj27dtCh9Wq9PR0CgsLI2tra7K2tqa9e/cKGQ5LeIZQq9U0c+ZMsrKyori4OIqLixM6JIPV1dVRXV0d/fvf/yaO42jdunVCh8R0gEql4tvgN998I3Q47VJXV0cvvPACvfDCC8RxHK1fv16oUFjCa4tGo6FHHnmE7Ozs6Pz580KH0ymffvopcRxHmzZtEjoUph20Wi0tWLCAnJycenwb/Pjjj4njOPrss8+EODxLeG15/fXXydLSkn755RehQzGKd999l8RiMZ06dUroUBgDvfbaaySTyej06dNCh2IUa9euJYlEIkQbZAmvNb/99huJxWLasmWL0KEY1fz588nT05PKy8uFDoVpRVJSEiUlJZFYLKYvvvhC6HCMRqfT0bx58+iee+4xdRtkw1IYhmEaM6tHy/TvNTg4GN7e3vjxxx+766MyHVJaWooBAwYgMjISH330kdDhMM3QarUYOnQoAMDd3R0nTpzoVW1QoVAgKCgIS5YswYcffmiqw7JHy5pz6NAhOnToEIlEIkpLSxM6nC6xYcMGsra2puLiYiouLhY6HKaRgwcPkkgkIpFIRJmZmUKH0yXWr19v6jZo8CmtWfXwxo0bBwBwcnLCoUOHhAzFqM6dOweJRAIvLy/Y2dnBx8cHK1asAACsXLlS4OiYu02YMAFyuRwAcOTIEYGj6RoqlQqenp5YtWoVAOCVV17p6kMa3MMzm4RXUlICd3d3AMD+/fsxd+5coUIxqjt37sDOzo7/WSwWQyaT8f8WFhYGHx8fuLu7w9vbGxEREXBychIqXLNWXFwMDw8PHD58GEB9FZze4tSpU8jLy8OgQYMwcOBAvPLKK0hOTgYA/PHHH119eIMTntnUwzt+/Dgkkvq3O3nyZIGjMR65XA5vb2/k5+cDqL9GVF1djerqagD1vQh9kdHa2lps2rQJzz33nGDxmrNTp05BIpFgwoQJQodidO+88w5OnTrF/+zo6AiVSgUAWL58OYYMGcLXfNSXTBOC2SS81NRUDB48GABga2srcDTGNWHCBOzduxd1dXXNLq+trQUAWFpaYuHChaYMjbnLuXPnMHz48F7X/oD6Ara//vor39aUSiW/bMuWLdBqtdBqtQCAp59+Gl988YUgcbJhKQzDmA2zSXhZWVkICAhAQECA0KEY3ZgxY6DT6VpdRyqVIiYmBs7OziaKimnsxo0bfEn/3mbw4MF8D66x2traBsv0w3KEYDYJ79atW/Dw8ICHh4fQoRjdmDFjWmxsd/vXv/5lgmiYlhQXF6NPnz5Ch9ElhgwZ0mYbFIlE8PPzQ3R0tImiaiYGwY4sAI7jetUgT70BAwbA0dGxxeVSqRSRkZHw9PQ0YVRMY1qtFmKxWOgwusTgwYPb/NsiInz00UeQSqUmiqops0l4IpEIGo0GGo1G6FCMjuM4jB07tsU/Jq1Wa4qxUEwbbGxsUFVVJXQYXUIul7d69iSRSDBs2DDBh4OZTcLz9vZGXl4e8vLyhA6lS4wdOxYiUdNfp1Qqxbx589C/f38BomLu5uTkhJKSEqHD6DJDhw5tsZen0WiwYcMGwc+wzCbhBQQEICsrC1lZWUKH0iVGjx7d7LCUuro6vPbaawJExDQ2aNAgXLlyRegwukxISEizE8tLpVI89NBDGD16tABRNWQ2CY9hGMZsEl5wcDAyMzORmZmJ8vJyocMxumHDhsHKyor/WSKRQCKRYPLkyRg2bJiAkTF6wcHByMrKQlVVVa+8ljdkyBB+4PHdtFot3nvvPQEiaspsnrSYMGECP1bt559/xpw5c4yyX5VKhdLSUgD15Zmqq6v5xlxdXQ21Ws2va2Njw3f5HRwcYG1tzY+Lc3Z25h996wiJRIKwsDCcOXMGRMTfnFm9enWH98kY16hRo0BEOHnyJABg9uzZRtt3VVUVFAoFgPp2WFFRAa1WC7VazT9mCAB2dnYQi
8WwtLSEXC6Hq6srXFxcAKDZ09H2GDJkCO5+Nl9/NzYqKgpBQUGd2rexmE3Cc3Z2xvDhwwEA33//vUEJr7CwEGlpacjOzkZ2djYAICcnBzk5Obh58yaf4IzF3t4ebm5u8PLygp+fH3x9fQEAfn5+CAgIwODBg/lKG80ZP348zp07B61Wi5CQEAD1NzOY7sHd3R33338/XzygPQnvxo0buHz5MgDg77//xtWrV3H16lVkZ2ejuLjYKO3Q0dER7u7uCAgIQGBgIPr3748BAwbwA4Xt7e1b3X7AgAGQSqX8tWT9F/ibb77Z6diMxWyqpQDAp59+CqC+11NQUMAnj5s3b+LMmTMA6is7pKSkICUlhf/GtLOza5B8fH190bdvX7i4uMDZ2blBL83W1pY/tbS2toZMJuOPX1lZyTeGiooKVFZWNugdlpSUoLi4GDdu3OATKwDk5uZCrVaD4zj4+fnhvvvuQ3BwMAAgPDwcI0eOhFwuR0JCAiZNmgTgf6WHelNFjt5g3bp1eP/99wHUJzFra+sGy/U98/Pnz+Pnn3/G77//juTkZBQVFfHreHp6on///ujfvz/8/Pzg6ekJV1dXuLq6AgBcXFzg4OAAjuMgk8kaHKO8vBw6nQ61tbUoLy+HQqHg23lhYSEKCwtx9epVZGVl4erVqygvL+fv/gcGBiI0NBTh4eGYNGkSAgMDm7y/e++9F3/99RckEgl/dmGChMfKQzXn9u3bAOobzOLFi6FWq5GYmIicnBz+2yg4OBjBwcEICQnh/1/f5RcKESEnJwepqan869KlSwCAa9euQSwWIyQkBPfffz9iY2MREBCAv/76CwAEHwbANKRQKODj4wMAWL9+PZYtW4Zbt27h4MGDOHHiBF9xpKKiAn379kV4eDjCwsIQGhrKX4ttrZdvbDdv3sSFCxcAAMnJyUhOTkZSUhIqKirg4+ODKVOmAAAiIiIwdepUREVFIS4uDs7OzvwXtgmKJbCEd7f09HR89913fK/nwoULkMlkGDlyJMaMGYMxY8bg/vvvB1B/na0nKSoqwq+//orExEQkJibi8uXLkEqlmDp1KoD6Ht7s2bPh5uYmcKSM3pIlSwDUX0seOHAgEhISYGlpiYkTJ/KlyyZPntxsD6o70Gg0OH/+POLj4xEfHw8A+P3332FnZ4eAgABcuHDB1GXIDE54ZnOXlmEYptf28JRKJfbv348dO3bg7NmzcHFxwfTp0wHU93qmTZtm0lMDU1EoFDh27Bi+//57APWFT6urqzFhwgQsXrwY8+fPb3LdiDGNrKwsfPnll4iNjQVQfxd/ypQpWLBgAebNm9ej6+TdvHkT3377LbZv347Lly/D398fS5cuBQBER0d3dZVt85zEJykpiRYsWEALFiwgqVRKcrmclixZQr/88gtptdquPny3VFVVRbt27aIpU6aQWCwmJycnWrFiBa1YsYLy8/OFDs8sXLx4kSIiIojjOPL396d169bRunXr6Pbt20KH1iX+/PNPeu6550gul5NcLidbW1t6+eWXSaFQdNUhzWcibp1OR0eOHKExY8YQABoxYgSNGDGCduzYQVVVVV1xyB4rPz+f3nnnHfLw8CAPDw+ysLCgxYsX05UrV4QOrVfKzMykBQsWEMdxFBoaSocPHzarL97y8nIqLy+n9evXU58+fUgul9Mbb7xBFRUVxj6UeSS8M2fO0AMPPEAcx1FERAT9/PPPxj5Er6RWq0mtVtNXX31FgwcPJpFIRJGRkZSXlyd0aD2eSqWiN954g9544w2SyWQ0ZMgQOnToEOl0OqFDE1RlZSW999575OTkRF5eXnTgwAFj7r73JrycnBzKycmh2bNnEwCaOHEiXbhwwRi7Nks6nY52795N/v7+ZGVlRatWraJVq1aRSqUSOrQeJzExkQICAvhTuQ0bNpBGoxE6rG6lpKSEIiMj+U5KRESEMeau7X0JT6fT0ZYtW/jGdO+999KxY8c6s0vmLmq1mj755BOys7MjOzs7GjhwIP32229Ch9Uj6HQ6WrduHUkkEoqIiKD8/Hx2fbQNP//8M/n7+5O/vz95enrS6dOnO7M7gxMeG5bCMIz5MDQzGunVIQqFgqZMmUISiYReffVVevXVV9kpVxfJzc2l3Nxcmjp1KonFYnr99deNcqE9Ozub3nzzTXrzzTcpKCiILl++bIRohadSqWjevHkkkUho3bp1Zn+trj2USiUplUqaM2cOSSQS+u9//9vRXfWeU9qUlBTy8/MjPz8/+v333zuyC6YDdDodxcbGkkwmo5kzZ1J5eXm796FSqWj37t00fvx44jiOLCwsyMLCggDQ/v37uyBq06moqKCKigoaP348OTk50ZkzZ4QOqcfS6XT0zjvvEMdx9NZbb3VkF70j4Z04cYJsbW1p/PjxXTmGh2nFuXPnyMPDg4KCgujmzZsGbZOWlkYrV64kBwcH4jiOxGIxoX7QOQEgkUhEO3fu7OLIu86dO3coLCyMwsLCyMPDgw3rMZLPP/+cxGIxvfrqq+3d1OCE1y3LQ/3yyy8AgDlz5mDBggX48ssvBZ3pyJyFh4cjOTkZkydPxsSJE/nfzd3TDZaXl+Obb77Bpk2bAACpqakNygQ1nr5PJBKhpqbGNG/AyHQ6HR577DH+wfizZ8/2yrmOhRAdHQ1LS0s8+eST8Pf35//NmLpdwktOTuZLGs2ePRtfffVVr53arqfw8vJCQkICxo0bxz/cfubMGaSlpeHrr7/Gzp07odFoGkwG3tz8Gnocx+HKlSt8IczGlEplmzFJJJJWHw3UT1tpaWkJAHzJLn3ZJAsLiw4Vili1ahXi4+P5qiYs2RnX4sWLcf36dfzzn/8EUF+Saty4cUbbf7d6lrasrAxDhw7FwIEDAQBHjx7tVBXgxhqXSvLy8sKlS5f4OmJtra9n4s+s27hx4wZfRFVf2VkkEjVIdIawsLDoVr08qVQKW1tb2Nvbw8rKCtbW1nB0dGxQ19De3h7+/v5YtWoVPv/8c0RFRXV5XHl5efDz80NgYCAyMjJaXbe1MmByuRxeXl4A6qsuR0dH89WBWtuHs7Nzs7OsVVZWNvtlY6y/CyLCvHnzAIAvh9bGlxOrlsIwDNNYt+rhPfzwwzh//jxfyrorC2/qv80mTpyIEydOtHrazHGc2fbqGvvhhx8A1FeccXBwgFKphIWFRbOTt7REJpPh5ZdfxksvvdTscltb2zav2apUqhZ7iFqtFhUVFQDAlz5Xq9UgIpSVlTXZXj/vQ1lZGVQqFVQqFZRKJb+t/ue//voLgwcPxokTJwx+r52xZs0avPXWWwDqrxWOHDnSoO30bVt/oV6pVOLixYsAgE2bNuG7775DVFQUNm3a1KAiNwA8/fTT2Lp1K1auXMlXZm5JVFQUX7Nv5cqV7XlrbdJXeA4KCsKSJUvw8ccft7Z6z6uWEh8fTwDo1KlT7b1D0yEAyN3dnQDQqlWr2lyXaeill14iZ2dnSkpKovfff59GjBhBHMcRx3EklUpJJBI1uDN790smk9Ebb7wh9Ftol61bt5KFhQVlZ2eb5Hg6nY58fX1p
xowZBICioqIM3lb/Obfk3XffJQD05JNPNln266+/EgDy9PRs9bG4yspKcnBwoIKCAiooKDA4tvbavHkzWVhYtDVCoOcNSwkPD6eZM2ca/EF0FgD65ZdfSCwWE8dxdPToUTp69GiL6zINKZVKcnJyajBuKjs7m7Kzs2nDhg00btw4EovF/KtxwnvllVcEjL59dDodDRo0iCIjI012zJMnT9KwYcMoMzOTAJBcLqfKykqqrKxsc9u2Eh4R0bhx4/i/gcb69+9PAOjHH39scfvt27dTRERE22+kk2pqasjDw6OtoSo9K+GdPn2aAJi0CIC+QXz44YcEgBwdHcnR0ZGuX7/e4rpMQ2vXriUHBwdSq9XNLi8qKqLY2FiaMmUKSaVSvgfIcRzFxMSYONqOS0xMJAAmfTpk0aJF/JMHo0aNIgC0bds22rZtW5vbGpLw9uzZQwDo8ccfb7LsnXfeIQC0YMGCFrcfN26csSuetGjt2rXk7OzcWo+zZyW8F198kQYNGtTuD6Iz7m4Qc+bM4RvJ0KFDmzy21lrjKSwspKVLl9LSpUvJy8uLpFIpeXl50TPPPENFRUXNHrelBmnosry8PHrooYfooYceIltbW3Jzc6PHHnuMSkpK2vMRdFpubi5xHEfx8fFtrlteXk579uyhPXv20KOPPkp79+41QYTGsXr1avL19TXJscrKyqisrIwcHByotLSUiOpPpwHQmDFjaMyYMW3uw5CEl5OTQwDI39+/ybIbN26QSCQimUzGx3C3rKwscnNzo9raWgPfVeekpaURgNaetOpZCa9///5tXkcztrsbRFlZGQUEBFBAQECz10taajyFhYXUt29f8vT0JE9PT0pISKCKigo6efIkubu7k4+PDxUVFTVJfK01SEOWPfbYY/Tnn3/Sn3/+SWVlZfTss88SAJOecukNHTqUnn/+eZMf15RGjhxJzzzzjEmOtXnzZtq8eXOD3tWdO3fIxsaG//1fvXq11X0YkvBqamoIAFlZWTW7fMqUKQSg2edbV69eTS+++KIB78Y4dDodeXp60rvvvtvSKqxaCsMwTBOGZkYjvZqora0lkUhEhw4d6lD27yg0+gZMSUmhlJQUsrKyIgD01VdftbiuXnR0NAGgnTt3Nnk2dPv27QSAnnnmmSa9A3Syh9f4QnN2djZ/Z83Uli5dSpMmTTL5cU3J1dWVNm7caJJjhYaGUmhoaJMbBpGRkfzv35BRBS21Ib3q6moCQNbW1s0u11/jGzZsWIN/12q11LdvX0pNTTXg3RjP1KlT6amnnmppcc85pc3NzSUAlJSU1KEPoqNaahD6RGVlZcVfpG5pXQ8PDwJAN2/ebHLbPD8/nwCQl5cXeXl5NTl2ZxJe4zkB1Go1ASCO45rdrivpyz31RhqNhjQaDYlEIpNUd0lLS+MvjzS+QH/mzBn+9+/t7d1qyS5DEt7169cJAPXr16/Z5SqVihwcHAgApaam8gkuPj6ehg8f3s531nlPPPFEa6M4es4prX6AaHeZMvHJJ5/E0qVLoVKp8PDDD/MDVZujUCgA1A+QbjxIWv/zrVu3cOvWLaPG2PizsrCwACDMI2/29vYoLy83+XFNobq6GtXV1dDpdCaZoH3r1q0oKChAQUEBJBIJOI7jX2PHjuXXy8/P7/Tg599++w1A/aNmzbG0tMSjjz4KANi2bRu2bdsGAPjqq6/w1FNPderYHSGXy3Hnzp1O70fwhKevumHspNAZ//nPfzB8+HBcu3YNTz75ZIvrubm5AQBKSkqaPHOo/9nNzY1fT08/Er7xA/Y9MXEUFRXBw8ND6DC6hFwuh1wuh6WlZbPPlBpTXV0d4uLicP36dVy/fr3Z3snbb7+Nt99+G0B94umMzZs3A2i9GsmSJUsAALt27cKuXbugUChw/Phx/OMf/+jUsTtCoVA0+TvqCMETnouLC2QyGW7cuCF0KDyZTIZvv/0Wjo6OOHLkSIvr6au6JCQkICEhocEyfSWQWbNm8evpubu7AwAKCwsb/PulS5c6Hbup3bhxA56enkKH0aXc3Nz4R526ytGjR3HvvffCz88Pfn5+za4TGRmJyMhIiMViHDlyBKWlpR061ttvv42zZ8/iqaeewujRo1tcLywsDEFBQVAoFFAoFFi8eDGmTZvGV6IxpcLCQqMkPMGv4RERjR07lp544ol2nNF3HgwYTPzDDz8Qx3EtrltUVEQ+Pj5NhqUkJCSQh4dHi8NSnnjiCQJAMTEx/LirjIwMevzxxzt9fc+UtFotubu703vvvWfS43aUUqk0uIjp3WbPnk2zZ8/ugoj+JyIiwqBBxUTEP262YcOGZpff3RZ0Oh0plUr66aef6KeffuJn+4uOjm5xwPjd9APz9a8TJ04Y/J6MRaVSkbW1NW3durWlVXrONTyGYRiTMTQzGunVrI8++oicnZ2prq6u3dm/PdDCw+ytWb16davrFBUV8UNPPD09SSKRkKenJy1durTZJy2I6iclWrRoEbm6upKNjQ3Z2NjQrFmzKC8vr8W4Wou5Pe/HmM6ePUsAKD093WTH7IynnnqKOI6j4OBgWrNmDV28eJEuXrzY5nYbN24kOzu7Lmufd//uJk6cSBMnTmxzvca/75aWASAbGxsaMGAADRgwgKKiouj8+fMGx1ZYWMg/D923b1+jTOjUXidPniQAlJub29IqPWdYClH9ODKJRELbt29v94fBCGfhwoU0ZMgQocMwWFRUFF/FRSKR8AnBzc2NnnvuOTp+/DjV1NQ02S47O5vEYjHt27dPgKiZxYsXU0hISGur9KyER0T09NNPk6+vr0HXFRjhpaamkkgkooMHDwodisFeeOEFfta0xi+pVEpAfSWXGTNmUGxsLBUWFlJhYSEREc2bN49CQ0MFfgfmJz8/nywsLNrqDPW8hJebm0uWlpa0Zs0agz8MxvTq6uqorq6ORo0aRaGhoT1qHtbXXnuNZDJZq6d/AEgsFpNIJOJf4eHhFBMTQwDo8OHDQr8NsxIVFUXe3t5tdYQMTnjdquLxxo0bsXz5cvz0008AgAcffNAkQTGGW716NQDggw8+QGBgIOzt7WFjY9NgUhyZTAZra2v+v8D/Js+5m52dXZNK04ZUO9ZrrupxbW0tqqqq+J/vrnh85coVJCQkQK1Wt+Md/w/HcXB1dUVmZiYcHBw6tA/GMKdPnwYATJgwAfv27cP8+fNbW93gisfdKuEBwLx585CUlAQASExMRL9+/bo8KMYw3377LRYuXAgAiImJgZWVFerq6lBZWdkg0ehLpusTkk6na3ZQdXOzk5WXl7drUqDGiVQsFsPOzo7/WV/C3NraGhqNBhkZGQaXo9dPIKXT6RAREYHHH38cMTExmDx5Mnbu3Amg9clzmI4pKSlBWFgYACA4OBjfffddW5uwSXwYhmEa63bz0m7btg2TJk0CUD/BzunTp+Hj4yNwVMzhw4exaNEixMTEAAA+/fRTgSNqv9jYWD7+lkgkEmg0GvTp0weRkZEAgOeeew733HMPgPpnh2fOnMlPJao/xWeMQ61WY+7cufzPn3/+uXEPYOjFPiO
9DFJaWkqlpaUUEhJCfn5+lJaWZuimTBeIi4sjCwsLWrZsGel0uh51o+JuO3bsaHFyIYlEQiKRiB588EHat29fqxPYfPbZZ3yp+i+++MKE76B3U6vVNH/+fLK3t6f09PT2jO/seXdpm6NQKGj06NEkl8vpyJEj7d2c6SSNRkMrVqwgjuPo5Zdf7rGJTu/gwYMNEpz+/318fOiDDz6g4uJig/e1Zs0aWrNmDXEcRx9//HEXRm0eqqqqaPr06WRnZ0enT59u7+a9I+ER1Wf96OhoEolEtGLFClKpVE3mnGCMLycnhyZOnEiWlpb09ddfCx2OUeinApVKpfTII49QQkICJSQkdCqRb9iwgTiOo+XLl5tsjofeJi8vj8LDw8nFxYWSk5M7sovek/D0tm7dSvb29jRw4EAaOHAg/fbbb53ZHdMMnU7Hz6kgl8spKCioow2wW6qurqYdO3aQQqEw6n76oVcnAAAGZUlEQVT37t1Ltra2FB4eTnl5eUbdd2927NgxOnbsGLm4uFBQUBBlZGR0dFe9L+ERERUUFPCzdYlEIlqwYEGz0yoy7Xfu3DkaM2YMSSQSkkgktHLlymYfs2Kal5GRQYMGDSJHR0eKjY2l2NjYHn8JoKsolUp69tln+YHdixcvNmi+3VawaikMwzBNGJoZjfQymj179pC/vz9ZWVnRihUrjH6aYi5SU1P5GmmTJk2iS5cu0aVLl4QOq0eqqqqil19+me8ljxo1yqSTd3d3Op2O9uzZQx4eHtSnTx/avXs37d692xi77p2ntI2p1Wr65JNPyMXFhaysrGjZsmWUmZlJmZmZxj5Ur6HT6Sg+Pp7i4+Np6tSpfLmkY8eOCR1ar3H58mW6fPkyhYeHk0gkooULF3bm+lSvcPToURo6dCiJRCKKjo6m27dvG3P35pHw9CorK2njxo0UEBDAXxeYMWMG7du3j93R/f+Ki4vpk08+ocGDB/PDMR588EH64Ycf2LWmLqLT6ejbb7+lQYMGkVgspkcffdRsbrap1WqKi4ujuLg4Cg0NJY7jaM6cOV01vaN5JTw9rVZLBw4coAMHDtC0adNILBaTo6MjLVu2jE6dOsVX+jAHSqWSlEol7dmzh2bNmkVSqZTs7OwoKirK4MKXjHFotVqKi4uj4cOHEwAaPnw4DR8+nLZu3UplZWVCh2dUWVlZ9H//93/k7u7On9rPnz+/q+/298xqKcZWUFCAuLg47Ny5E1euXOErXEybNg2zZs3CpEmTjDMxSDfx119/4dixY/j++++RmJgIoP4LbeLEiXjiiScwd+5cWFlZCRyleTt37hw2btwIADhw4AAAYPr06Vi4cCGmT58Oe3t7IcNrt+zsbADAwYMH8c033yA5ORl9+vTB008/jWXLlgEAvL29uzqMnlstpatcu3YNR48eBQB8//33OHPmDOrq6jBw4ECMHj2an/czPDwc/fr1g0jUfW9g60sdAcDZs2eRmJiIxMRE3Lp1Cw4ODpg2bRoeeughABBslimmbeXl5Th06BD27t3Lz3p3//33AwCmTp2KSZMmYdiwYXzFl+7g9u3bOHv2LOLj4xEfH4+///4bAODo6Ii5c+fi0UcfxYMPPtik7FcXY9VSGIZhGjObHl5jd+7cwa+//sr3jpKTkwHU955sbGwwePBgBAcHY8iQIejXrx98fX0BAH5+fiY5Lbx9+zZ/upCTk4OrV6/i8uXLuHLlCv7++29oNBoAgJOTE0aNGoWxY8di9OjRGDFiBF/Hjek5SktLcfLkScTHxwMA4uPjkZ+fD6lUiuDgYISGhmLYsGEAgMDAQAQGBnbZBOgajYZvc3/++ScuXLgAAEhOTsa1a9fAcRyGDh2KKVOmYMqUKQCAUaNGwcLCokviMQA7pW0vfeXcK1euICUlBampqUhNTUVaWlqTCY/d3d3h7e0NFxcXODs7w9nZGUD9pOKWlpb8dZi7K/4C/ytuqdVqUVFRgaqqKn7fJSUlUCgUKCkpQW5uLioqKvjtRCIR+vbtiyFDhiA4OBghISEIDg4GUN/4u/PpN9NxV69eRXJyMv9KSUkBAFRWVgKorw7t7+8PDw8PuLm5wdXVFQDg6uoKGxsbyGQySCQSyOVy6P/Oy8rK+H1UVVVBoVCguLgYAFBUVISioiJkZ2ejrq4OQH1bHz58OAAgNDQUoaGhCAsLg4uLi+k+iLaxhGdMFRUVyMnJAVB/kTY7Oxs3b95EaWkp/wLqv6Vra2v5Sr76yr96+pLmUqkUtra2sLGx4ZOls7MzXF1d4ezsjL59+8LPz4/vVfr4+HSr6ziMsAoKCvD333/j6tWryM7ORlFREf9lCQAKhQLV1dWoqanhK1Lr6a/nyuVyWFtbw9XVlU+U7u7ucHNzQ2BgIPr374/+/fv3lJsoLOExDGM22E0LhmGYxljCYxjGbLCExzCM2WAJj2EYs2HqAVv7TXw8hmF6v/OGrmjqu7QMwzCCYae0DMOYDZbwGIYxGyzhMQxjNljCYxjGbLCExzCM2WAJj2EYs8ESHsMwZoMlPIZhzAZLeAzDmA2W8BiGMRss4TEMYzZYwmMYxmywhMcwjNlgCY9hGLPBEh7DMGaDJTyGYcwGS3gMw5gNlvAYhjEbLOExDGM2WMJjGMZssITHMIzZYAmPYRizwRIewzBm4/8BCWDPgNr6+WAAAAAASUVORK5CYII=\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x7f093f177898>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Scratch block \n",
"# Lets add one state first, on transition & bake the model \n",
"\n",
"model = HiddenMarkovModel(name=\"Example Model\")\n",
"\n",
"noun_emissions = DiscreteDistribution(Dict_Of_Tags_Words['NOUN'])\n",
"noun_state = State(noun_emissions, name=\"Noun\")\n",
"\n",
"ADV_emissions = DiscreteDistribution(Dict_Of_Tags_Words['ADV'])\n",
"ADV_state = State(ADV_emissions, name=\"ADV\")\n",
"\n",
"\n",
"model.add_states(noun_state, ADV_state)\n",
"\n",
"# Transitions \n",
"model.add_transition(model.start, noun_state, 0.5)\n",
"model.add_transition(model.start, ADV_state, 0.5)\n",
"\n",
"model.add_transition(noun_state, noun_state, 0.8)\n",
"model.add_transition(noun_state, ADV_state, 0.2)\n",
"\n",
"\n",
"\n",
"model.add_transition(ADV_state, ADV_state, 0.4) \n",
"model.add_transition(ADV_state, noun_state, 0.6)\n",
"\n",
"type(model)\n",
"model.bake()\n",
"show_model(model, figsize=(5, 5), filename=\"example.png\", overwrite=True, show_ends=False)"
]
},
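{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before scaling this up to all twelve tags, it's worth poking at the toy model. The sketch below is not part of the project code -- it only inspects the `model` built above: `bake()` adds explicit start/end states, and the dense transition matrix should mirror the probabilities we just wired in."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: inspect the baked toy model (assumes the cell above has been run).\n",
"# bake() finalizes the topology and adds explicit start/end states, so the\n",
"# two tag states become four states in total.\n",
"print([s.name for s in model.states])\n",
"\n",
"# The dense transition matrix is indexed in the same order as model.states.\n",
"print(model.dense_transition_matrix())"
]
},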
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pomegranate.hmm.HiddenMarkovModel"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# loop through possible states \n",
"# create an emission prob distribution \n",
"# create a state adding prob distro to it \n",
"# add the state to the model \n",
"\n",
"test_model = HiddenMarkovModel(name=\"base-hmm-tagger\")\n",
"type(test_model)\n",
"\n",
"Dict_Of_Tags_Words = pair_counts(tags,words)\n",
"for tag in Dict_Of_Tags_Words.keys():\n",
" for word in Dict_Of_Tags_Words[tag].keys():\n",
" Dict_Of_Tags_Words[tag][word] = (Dict_Of_Tags_Words[tag][word])/(tag_count.get(tag))\n",
" \n",
"# Initiating a counter object for counting the occurances of tags at the start of sequences.\n",
"#This is a counter object whic can be accessed to get the number of times each tag was at the start of a sequence.\n",
"#Now we have divided the word count by the total tag count. This is done to have emission probability of words for each tag. \n",
"# The emission distribution at each state should be estimated with the formula: 𝑃(𝑤|𝑡)=𝐶(𝑡,𝑤)/𝐶(𝑡)\n",
"\n",
"\n",
"# Setting up Probabilities \n",
"tag_start_count = starting_counts(data.training_set.Y) \n",
"sum_of_tag_start_count_values = sum(tag_start_count.values()) # sum of tags at the sequence\n",
"\n",
"\n",
"tag_end_count = ending_counts(data.training_set.Y) # Counter of tags at the end of sequences \n",
"sum_of_tag_end_count_values = sum(tag_end_count.values())\n"
]
},
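{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check on the rescaled counts (a minimal sketch, assuming `Dict_Of_Tags_Words` from the cell above): after dividing every count by C(t), each tag's emission probabilities should sum to 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check (sketch): each tag's emission probabilities should sum to ~1,\n",
"# since every count C(t, w) was divided by the tag total C(t).\n",
"for tag, word_probs in Dict_Of_Tags_Words.items():\n",
"    total = sum(word_probs.values())\n",
"    assert abs(total - 1.0) < 1e-6, (tag, total)\n",
"\n",
"# Peek at the five most probable words for one tag.\n",
"print(sorted(Dict_Of_Tags_Words['ADV'].items(), key=lambda kv: kv[1], reverse=True)[:5])"
]
},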
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"\n",
"state_list = []\n",
"for tag in Dict_Of_Tags_Words.keys(): # Dictionary of tags, with prob of words \n",
" \n",
"# Task 1 Add one state per tag \n",
" state_emission = DiscreteDistribution(Dict_Of_Tags_Words[tag]) \n",
" state_initiated = State(state_emission, name = tag) \n",
" test_model.add_state(state_initiated)\n",
" state_list.append(state_initiated)\n",
" \n",
"\n",
"# Task 2 \n",
"# Add an edge from the starting state basic_model.start to each tag\n",
"# The transition probability should be estimated with the formula: 𝑃(𝑡|𝑠𝑡𝑎𝑟𝑡)=𝐶(𝑠𝑡𝑎𝑟𝑡,𝑡)/𝐶(𝑠𝑡𝑎𝑟𝑡)\n",
"# For this step, for each tag we need the count it was in the start of the sequence \n",
"# and we need a total number of start. we divide these to get the prob of tag being reached from start state. \n",
"# For this step, we calculated the tag_start_count for each tag, this value gives us the number of times a specific tag was in the start.\n",
"# we divide this value by the number of times a seq started, or the sum of times the tags were in the start. \n",
"# tag_start_count = starting_counts(data.training_set.Y), this was initialized in the start. \n",
"# tag_start_count.get(('NOUN')) This is what we call for each state. \n",
"\n",
"\n",
" Prob_Tag_At_Start = (tag_start_count.get(tag))/(sum_of_tag_start_count_values)\n",
" test_model.add_transition(test_model.start, state_initiated,Prob_Tag_At_Start )\n",
" \n",
"# Task 3 \n",
"# Add an edge from each tag to the end state basic_model.end\n",
"# The transition probability should be estimated with the formula: 𝑃(𝑒𝑛𝑑|𝑡)=𝐶(𝑡,𝑒𝑛𝑑)𝐶(𝑡)\n",
" Prob_Tag_At_End = tag_end_count.get((tag))/(sum(tag_end_count.values()))\n",
" test_model.add_transition( state_initiated,test_model.end, Prob_Tag_At_End )\n",
" \n",
"\n",
" \n",
"\n",
" \n",
" \n",
"test_model.bake() "
]
},
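{
"cell_type": "markdown",
"metadata": {},
"source": [
"The step still missing here is Task 4: an edge between every pair of tag states, weighted by P(t2|t1) = C(t1,t2) / C(t1). The next cell prints those pairwise probabilities; the sketch below shows how such edges would be wired up, where `tag_transition_prob` is a hypothetical name for a dict keyed by (tag1, tag2) like the one printed. In the finished tagger these edges must be added before `bake()`, not after it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of Task 4 -- tag_transition_prob is a hypothetical dict keyed by (tag1, tag2)\n",
"# holding P(t2|t1) = C(t1,t2) / C(t1), like the one printed by the next cell.\n",
"# In the finished model these edges are added BEFORE test_model.bake().\n",
"for from_state in state_list:\n",
"    for to_state in state_list:\n",
"        prob = tag_transition_prob[(from_state.name, to_state.name)]\n",
"        test_model.add_transition(from_state, to_state, prob)"
]
},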
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('ADV', 'ADV'): 0.09664193239298527,\n",
" ('ADV', 'NOUN'): 0.006708002465644149,\n",
" ('ADV', '.'): 0.06435286225022717,\n",
" ('ADV', 'VERB'): 0.07413742379978243,\n",
" ('ADV', 'ADP'): 0.054866675877314176,\n",
" ('ADV', 'ADJ'): 0.09202444797315516,\n",
" ('ADV', 'CONJ'): 0.025903002914497167,\n",
" ('ADV', 'DET'): 0.030199414612796453,\n",
" ('ADV', 'PRT'): 0.054630636660252654,\n",
" ('ADV', 'NUM'): 0.050260986698097324,\n",
" ('ADV', 'PRON'): 0.05431277454739355,\n",
" ('ADV', 'X'): 0.003656307129798903,\n",
" ('NOUN', 'ADV'): 0.13075740356975735,\n",
" ('NOUN', 'NOUN'): 0.14994651727763877,\n",
" ('NOUN', '.'): 0.5323929787613475,\n",
" ('NOUN', 'VERB'): 0.23948248848872133,\n",
" ('NOUN', 'ADP'): 0.46601271069356176,\n",
" ('NOUN', 'ADJ'): 0.043083560535698236,\n",
" ('NOUN', 'CONJ'): 0.4329501915708812,\n",
" ('NOUN', 'DET'): 0.032697796135715,\n",
" ('NOUN', 'PRT'): 0.16610892662929808,\n",
" ('NOUN', 'NUM'): 0.1510355278666442,\n",
" ('NOUN', 'PRON'): 0.11413554071553716,\n",
" ('NOUN', 'X'): 0.06764168190127971,\n",
" ('.', 'ADV'): 0.20536132094391335,\n",
" ('.', 'NOUN'): 0.07312175930961964,\n",
" ('.', '.'): 0.14118056676036245,\n",
" ('.', 'VERB'): 0.07580681577164908,\n",
" ('.', 'ADP'): 0.1128419452887538,\n",
" ('.', 'ADJ'): 0.07302933157563592,\n",
" ('.', 'CONJ'): 0.3408651799456397,\n",
" ('.', 'DET'): 0.16018819925048555,\n",
" ('.', 'PRT'): 0.16121475780138877,\n",
" ('.', 'NUM'): 0.18134366054891396,\n",
" ('.', 'PRON'): 0.3199857806667852,\n",
" ('.', 'X'): 0.15722120658135283,\n",
" ('VERB', 'ADV'): 0.3362524232903269,\n",
" ('VERB', 'NOUN'): 0.06453732912723449,\n",
" ('VERB', '.'): 0.09936564280679705,\n",
" ('VERB', 'VERB'): 0.18445412935051073,\n",
" ('VERB', 'ADP'): 0.2153478170765405,\n",
" ('VERB', 'ADJ'): 0.12619468496269887,\n",
" ('VERB', 'CONJ'): 0.06893277008219537,\n",
" ('VERB', 'DET'): 0.2168121016494789,\n",
" ('VERB', 'PRT'): 0.39985777629047103,\n",
" ('VERB', 'NUM'): 0.1113823876073413,\n",
" ('VERB', 'PRON'): 0.20351420663738162,\n",
" ('VERB', 'X'): 0.025594149908592323,\n",
" ('ADP', 'ADV'): 0.040243331773514274,\n",
" ('ADP', 'NOUN'): 0.1358189201929004,\n",
" ('ADP', '.'): 0.009341270582640523,\n",
" ('ADP', 'VERB'): 0.032087903065797306,\n",
" ('ADP', 'ADP'): 0.02026630284609008,\n",
" ('ADP', 'ADJ'): 0.1428079216226743,\n",
" ('ADP', 'CONJ'): 0.007138880702099093,\n",
" ('ADP', 'DET'): 0.4818137885129159,\n",
" ('ADP', 'PRT'): 0.07010792269723082,\n",
" ('ADP', 'NUM'): 0.2918841555817478,\n",
" ('ADP', 'PRON'): 0.20595180661706827,\n",
" ('ADP', 'X'): 0.048446069469835464,\n",
" ('ADJ', 'ADV'): 0.01443946787886891,\n",
" ('ADJ', 'NOUN'): 0.19791326734109285,\n",
" ('ADJ', '.'): 0.05663357592329968,\n",
" ('ADJ', 'VERB'): 0.007991187799755065,\n",
" ('ADJ', 'ADP'): 0.05093775904946118,\n",
" ('ADJ', 'ADJ'): 0.056311232285705726,\n",
" ('ADJ', 'CONJ'): 0.08193339227821986,\n",
" ('ADJ', 'DET'): 0.0035560904888256696,\n",
" ('ADJ', 'PRT'): 0.054421484146239436,\n",
" ('ADJ', 'NUM'): 0.039484761744401416,\n",
" ('ADJ', 'PRON'): 0.0063986999466775,\n",
" ('ADJ', 'X'): 0.0283363802559415,\n",
" ('CONJ', 'ADV'): 0.06147915413240636,\n",
" ('CONJ', 'NOUN'): 0.034002320606258386,\n",
" ('CONJ', '.'): 0.0051971432696145455,\n",
" ('CONJ', 'VERB'): 0.04113272350353377,\n",
" ('CONJ', 'ADP'): 0.01918693009118541,\n",
" ('CONJ', 'ADJ'): 0.05051382688677832,\n",
" ('CONJ', 'CONJ'): 0.0003274715918394079,\n",
" ('CONJ', 'DET'): 0.04228100409406315,\n",
" ('CONJ', 'PRT'): 0.031791182130009206,\n",
" ('CONJ', 'NUM'): 0.048408823034180835,\n",
" ('CONJ', 'PRON'): 0.05225604956453292,\n",
" ('CONJ', 'X'): 0.016453382084095063,\n",
" ('DET', 'ADV'): 0.04318470485995053,\n",
" ('DET', 'NOUN'): 0.3117680118931071,\n",
" ('DET', '.'): 0.011778492998293096,\n",
" ('DET', 'VERB'): 0.04832342416923803,\n",
" ('DET', 'ADP'): 0.008470917380491848,\n",
" ('DET', 'ADJ'): 0.39302513707043774,\n",
" ('DET', 'CONJ'): 0.0022923011428758553,\n",
" ('DET', 'DET'): 0.006054472011744217,\n",
" ('DET', 'PRT'): 0.009244541119384254,\n",
" ('DET', 'NUM'): 0.09033507324465398,\n",
" ('DET', 'PRON'): 0.02782926643475611,\n",
" ('DET', 'X'): 0.1425959780621572,\n",
" ('PRT', 'ADV'): 0.019297190097377275,\n",
" ('PRT', 'NOUN'): 0.0038299068131549367,\n",
" ('PRT', '.'): 0.015243255178036125,\n",
" ('PRT', 'VERB'): 0.10186027736537107,\n",
" ('PRT', 'ADP'): 0.018910610665929816,\n",
" ('PRT', 'ADJ'): 0.006995835455553225,\n",
" ('PRT', 'CONJ'): 0.009332940367423126,\n",
" ('PRT', 'DET'): 0.01843696145745001,\n",
" ('PRT', 'PRT'): 0.010917761231490002,\n",
" ('PRT', 'NUM'): 0.009850143121737666,\n",
" ('PRT', 'PRON'): 0.004265799964451667,\n",
" ('PRT', 'X'): 0.0018281535648994515,\n",
" ('NUM', 'ADV'): 0.005392517325133142,\n",
" ('NUM', 'NOUN'): 0.02055005620218282,\n",
" ('NUM', '.'): 0.027310478357974472,\n",
" ('NUM', 'VERB'): 0.0036877142329349143,\n",
" ('NUM', 'ADP'): 0.013487841945288754,\n",
" ('NUM', 'ADJ'): 0.010591125625430686,\n",
" ('NUM', 'CONJ'): 0.014605232996037594,\n",
" ('NUM', 'DET'): 0.0015592089066389474,\n",
" ('NUM', 'PRT'): 0.0029281351961850583,\n",
" ('NUM', 'NUM'): 0.02256272099680081,\n",
" ('NUM', 'PRON'): 0.0030469999746083336,\n",
" ('NUM', 'X'): 0.002742230347349177,\n",
" ('PRON', 'ADV'): 0.04677228870022506,\n",
" ('PRON', 'NOUN'): 0.0015410275934587912,\n",
" ('PRON', '.'): 0.03463063766909823,\n",
" ('PRON', 'VERB'): 0.1906117226893631,\n",
" ('PRON', 'ADP'): 0.019135120198949987,\n",
" ('PRON', 'ADJ'): 0.0053779548791083685,\n",
" ('PRON', 'CONJ'): 0.014932704587877002,\n",
" ('PRON', 'DET'): 0.0063553719761833116,\n",
" ('PRON', 'PRT'): 0.03844223207562955,\n",
" ('PRON', 'NUM'): 0.003367570298029971,\n",
" ('PRON', 'PRON'): 0.008074549932712083,\n",
" ('PRON', 'X'): 0.0009140767824497258,\n",
" ('X', 'ADV'): 0.00015598190609889253,\n",
" ('X', 'NOUN'): 0.0002628811777076761,\n",
" ('X', '.'): 0.002573095442309162,\n",
" ('X', 'VERB'): 0.00042418976334316266,\n",
" ('X', 'ADP'): 0.0005353688864327162,\n",
" ('X', 'ADJ'): 4.4941127123468255e-05,\n",
" ('X', 'CONJ'): 0.0007859318204145791,\n",
" ('X', 'DET'): 4.55909037028932e-05,\n",
" ('X', 'PRT'): 0.0003346440224211495,\n",
" ('X', 'NUM'): 8.418925745074928e-05,\n",
" ('X', 'PRON'): 0.00022852499809562503,\n",
" ('X', 'X'): 0.5045703839122486}"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
" \n",
"# Task 4 : Add an edge between every pair of tags\n",
"# The transition probability should be estimated with the formula: 𝑃(𝑡2|𝑡1)=𝐶(𝑡1,𝑡2)/𝐶(𝑡1)\n",
"# uni = unigram_counts(tags)\n",
"# uni.get('NOUN'), we use this to find the number of times a tag occurred. \n",
"# We find the number of times a tag occurs with another tag & divide that by the total number of times the first tag occurred. \n",
"# unigrams are the denominator, bi gram function is the numerator here. \n",
"\n",
"# We than loop over all the states, \n",
"\n",
"bi = bigram_counts(data.training_set.Y)\n",
"uni = unigram_counts(tags)\n",
"\n",
"\n",
"transition_prob_dict = {}\n",
"\n",
" for tag1 in Dict_Of_Tags_Words.keys():\n",
" for tag2 in Dict_Of_Tags_Words.keys():\n",
" transition_prob_dict[tag1,tag2] = (bi.get((tag1,tag2)))/(uni.get(tag2))\n",
"transition_prob_dict "
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"for state1 in state_list:\n",
" for state2 in state_list:\n",
" test_model.add_transition( state1,state2, transition_prob_dict[(state1.name,state2.name)])\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div class=\"alert alert-block alert-success\">Your HMM network topology looks good!</div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_model.bake() \n",
"#test_model\n",
"assert all(tag in set(s.name for s in test_model.states) for tag in data.training_set.tagset), \\\n",
" \"Every state in your network should use the name of the associated tag, which must be one of the training set tags.\"\n",
"assert test_model.edge_count() == 168, \\\n",
" (\"Your network should have an edge from the start node to each state, one edge between every \" +\n",
" \"pair of tags (states), and an edge from each state to the end node.\")\n",
"HTML('<div class=\"alert alert-block alert-success\">Your HMM network topology looks good!</div>')"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"220632"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"5868"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"45872\n"
]
},
{
"data": {
"text/plain": [
"int"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"6469"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"uni = unigram_counts(tags)\n",
"uni.get('NOUN')\n",
"\n",
"bi.get(('NOUN','ADV')) \n",
"\n",
"t = sum(tag_start_count.values())\n",
"print(t)\n",
"type(t)\n",
"tag_start_count.get(('NOUN'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### IMPLEMENTATION: Basic HMM Tagger\n",
"Use the tag unigrams and bigrams calculated above to construct a hidden Markov tagger.\n",
"\n",
"- Add one state per tag\n",
" - The emission distribution at each state should be estimated with the formula: $P(w|t) = \\frac{C(t, w)}{C(t)}$\n",
"- Add an edge from the starting state `basic_model.start` to each tag\n",
" - The transition probability should be estimated with the formula: $P(t|start) = \\frac{C(start, t)}{C(start)}$\n",
"- Add an edge from each tag to the end state `basic_model.end`\n",
" - The transition probability should be estimated with the formula: $P(end|t) = \\frac{C(t, end)}{C(t)}$\n",
"- Add an edge between _every_ pair of tags\n",
" - The transition probability should be estimated with the formula: $P(t_2|t_1) = \\frac{C(t_1, t_2)}{C(t_1)}$"
]
},
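{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a worked example using the counts printed in the exploration cell above: the training set contains $C(start) = 45872$ sequences, of which $C(start, NOUN) = 6469$ begin with a NOUN, so the edge from the start state to the NOUN state should get weight $P(NOUN|start) = \\frac{6469}{45872} \\approx 0.141$."
]
},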
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div class=\"alert alert-block alert-success\">Your HMM network topology looks good!</div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"basic_model = HiddenMarkovModel(name=\"base-hmm-tagger\")\n",
"\n",
"Dict_Of_Tags_Words = pair_counts(tags,words)\n",
"\n",
"for tag in Dict_Of_Tags_Words.keys():\n",
" for word in Dict_Of_Tags_Words[tag].keys():\n",
" Dict_Of_Tags_Words[tag][word] = (Dict_Of_Tags_Words[tag][word])/(tag_count.get(tag))\n",
"\n",
"# Setting up Probabilities \n",
"tag_start_count = starting_counts(data.training_set.Y) \n",
"sum_of_tag_start_count_values = sum(tag_start_count.values()) # sum of tags at the sequence\n",
"\n",
"tag_end_count = ending_counts(data.training_set.Y) # Counter of tags at the end of sequences \n",
"sum_of_tag_end_count_values = sum(tag_end_count.values())\n",
"\n",
"\n",
"state_list = []\n",
"for tag in Dict_Of_Tags_Words.keys(): # Dictionary of tags, with prob of words \n",
" \n",
"# Task 1 Add one state per tag \n",
" state_emission = DiscreteDistribution(Dict_Of_Tags_Words[tag]) \n",
" state_initiated = State(state_emission, name = tag) \n",
" basic_model.add_state(state_initiated)\n",
" state_list.append(state_initiated)\n",
" \n",
"\n",
" Prob_Tag_At_Start = (tag_start_count.get(tag))/(sum_of_tag_start_count_values)\n",
" basic_model.add_transition(basic_model.start, state_initiated,Prob_Tag_At_Start )\n",
" \n",
"\n",
" Prob_Tag_At_End = tag_end_count.get((tag))/(sum(tag_end_count.values()))\n",
" basic_model.add_transition( state_initiated,basic_model.end, Prob_Tag_At_End )\n",
" \n",
" \n",
" \n",
"basic_model.bake() \n",
"bi = bigram_counts(data.training_set.Y)\n",
"uni = unigram_counts(tags)\n",
"\n",
"\n",
"transition_prob_dict = {}\n",
"\n",
"for tag1 in Dict_Of_Tags_Words.keys():\n",
" for tag2 in Dict_Of_Tags_Words.keys():\n",
" transition_prob_dict[tag1,tag2] = (bi.get((tag1,tag2)))/(uni.get(tag2))\n",
" \n",
"for state1 in state_list:\n",
" for state2 in state_list:\n",
" basic_model.add_transition( state1,state2, transition_prob_dict[(state1.name,state2.name)])\n",
" \n",
"\n",
"basic_model.bake()\n",
"assert all(tag in set(s.name for s in basic_model.states) for tag in data.training_set.tagset), \\\n",
" \"Every state in your network should use the name of the associated tag, which must be one of the training set tags.\"\n",
"assert basic_model.edge_count() == 168, \\\n",
" (\"Your network should have an edge from the start node to each state, one edge between every \" +\n",
" \"pair of tags (states), and an edge from each state to the end node.\")\n",
"HTML('<div class=\"alert alert-block alert-success\">Your HMM network topology looks good!</div>')"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"training accuracy basic hmm model: 97.41%\n",
"testing accuracy basic hmm model: 95.67%\n"
]
},
{
"data": {
"text/html": [
"<div class=\"alert alert-block alert-success\">Your HMM tagger accuracy looks correct! Congratulations, you've finished the project.</div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hmm_training_acc = accuracy(data.training_set.X, data.training_set.Y, basic_model)\n",
"print(\"training accuracy basic hmm model: {:.2f}%\".format(100 * hmm_training_acc))\n",
"\n",
"hmm_testing_acc = accuracy(data.testing_set.X, data.testing_set.Y, basic_model)\n",
"print(\"testing accuracy basic hmm model: {:.2f}%\".format(100 * hmm_testing_acc))\n",
"\n",
"assert hmm_training_acc > 0.97, \"Uh oh. Your HMM accuracy on the training set doesn't look right.\"\n",
"assert hmm_testing_acc > 0.955, \"Uh oh. Your HMM accuracy on the testing set doesn't look right.\"\n",
"HTML('<div class=\"alert alert-block alert-success\">Your HMM tagger accuracy looks correct! Congratulations, you\\'ve finished the project.</div>')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example Decoding Sequences with the HMM Tagger"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sentence Key: b100-28144\n",
"\n",
"Predicted labels:\n",
"-----------------\n",
"['CONJ', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'CONJ', 'NOUN', 'NUM', '.', '.', 'NOUN', '.', '.']\n",
"\n",
"Actual labels:\n",
"--------------\n",
"('CONJ', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'CONJ', 'NOUN', 'NUM', '.', '.', 'NOUN', '.', '.')\n",
"\n",
"\n",
"Sentence Key: b100-23146\n",
"\n",
"Predicted labels:\n",
"-----------------\n",
"['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'VERB', '.', 'ADP', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'DET', 'NOUN', '.']\n",
"\n",
"Actual labels:\n",
"--------------\n",
"('PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'VERB', '.', 'ADP', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'DET', 'NOUN', '.')\n",
"\n",
"\n",
"Sentence Key: b100-35462\n",
"\n",
"Predicted labels:\n",
"-----------------\n",
"['DET', 'ADJ', 'NOUN', 'VERB', 'VERB', 'VERB', 'ADP', 'DET', 'ADJ', 'ADJ', 'NOUN', 'ADP', 'DET', 'ADJ', 'NOUN', '.', 'ADP', 'ADJ', 'NOUN', '.', 'CONJ', 'ADP', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', '.', 'ADJ', '.', 'CONJ', 'ADJ', 'NOUN', 'ADP', 'ADJ', 'NOUN', '.']\n",
"\n",
"Actual labels:\n",
"--------------\n",
"('DET', 'ADJ', 'NOUN', 'VERB', 'VERB', 'VERB', 'ADP', 'DET', 'ADJ', 'ADJ', 'NOUN', 'ADP', 'DET', 'ADJ', 'NOUN', '.', 'ADP', 'ADJ', 'NOUN', '.', 'CONJ', 'ADP', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', '.', 'ADJ', '.', 'CONJ', 'ADJ', 'NOUN', 'ADP', 'ADJ', 'NOUN', '.')\n",
"\n",
"\n"
]
}
],
"source": [
"for key in data.testing_set.keys[:3]:\n",
" print(\"Sentence Key: {}\\n\".format(key))\n",
" print(\"Predicted labels:\\n-----------------\")\n",
" print(simplify_decoding(data.sentences[key].words, basic_model))\n",
" print()\n",
" print(\"Actual labels:\\n--------------\")\n",
" print(data.sentences[key].tags)\n",
" print(\"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Finishing the project\n",
"---\n",
"\n",
"<div class=\"alert alert-block alert-info\">\n",
"**Note:** **SAVE YOUR NOTEBOOK**, then run the next cell to generate an HTML copy. You will zip & submit both this file and the HTML copy for review.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!!jupyter nbconvert *.ipynb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: [Optional] Improving model performance\n",
"---\n",
"There are additional enhancements that can be incorporated into your tagger that improve performance on larger tagsets where the data sparsity problem is more significant. The data sparsity problem arises because the same amount of data split over more tags means there will be fewer samples in each tag, and there will be more missing data tags that have zero occurrences in the data. The techniques in this section are optional.\n",
"\n",
"- [Laplace Smoothing](https://en.wikipedia.org/wiki/Additive_smoothing) (pseudocounts)\n",
" Laplace smoothing is a technique where you add a small, non-zero value to all observed counts to offset for unobserved values.\n",
"\n",
"- Backoff Smoothing\n",
" Another smoothing technique is to interpolate between n-grams for missing data. This method is more effective than Laplace smoothing at combatting the data sparsity problem. Refer to chapters 4, 9, and 10 of the [Speech & Language Processing](https://web.stanford.edu/~jurafsky/slp3/) book for more information.\n",
"\n",
"- Extending to Trigrams\n",
" HMM taggers have achieved better than 96% accuracy on this dataset with the full Penn treebank tagset using an architecture described in [this](http://www.coli.uni-saarland.de/~thorsten/publications/Brants-ANLP00.pdf) paper. Altering your HMM to achieve the same performance would require implementing deleted interpolation (described in the paper), incorporating trigram probabilities in your frequency tables, and re-implementing the Viterbi algorithm to consider three consecutive states instead of two.\n",
"\n",
"### Obtain the Brown Corpus with a Larger Tagset\n",
"Run the code below to download a copy of the brown corpus with the full NLTK tagset. You will need to research the available tagset information in the NLTK docs and determine the best way to extract the subset of NLTK tags you want to explore. If you write the following the format specified in Step 1, then you can reload the data using all of the code above for comparison.\n",
"\n",
"Refer to [Chapter 5](http://www.nltk.org/book/ch05.html) of the NLTK book for more information on the available tagsets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"from nltk import pos_tag, word_tokenize\n",
"from nltk.corpus import brown\n",
"\n",
"nltk.download('brown')\n",
"training_corpus = nltk.corpus.brown\n",
"training_corpus.tagged_sents()[0]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}