Skip to content

Instantly share code, notes, and snippets.

@israeldi
Created December 19, 2019 07:17
Show Gist options
  • Save israeldi/405b9632be5ce762d86c957475f843da to your computer and use it in GitHub Desktop.
Save israeldi/405b9632be5ce762d86c957475f843da to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stats507 Homework 3, February 6th, 2019\n",
"### Israel Diego [(Go to Home Page)](https://israeldi.github.io/coursework/) \n",
"#### israeldi@umich.edu\n",
"\n",
"This notebook shows solutions to homework 3 for Stats507\n",
"\n",
"## Table of Contents\n",
"\n",
"1. [Problem 1: Counting Word Bigrams](#Problem-1:-Counting-Word-Bigrams)\n",
"2. [Problem 2: More Fun with Vectors](#Problem-2:-More-Fun-with-Vectors)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem 1: Counting Word Bigrams\n",
"#### ([Back to Top](#Table-of-Contents))\n",
"#### Time Spent: 1 hour\n",
"In your previous homework, you wrote a function for counting character bigrams. Now,\n",
"let's write a function for counting word bigrams. That is, for each pair of words, say,\n",
"`cat` and `dog`, we want to count how many times the word \"cat\" occurred immediately\n",
"before the word \"dog\". We will represent this bigram by a tuple, `('cat', 'dog')`. For\n",
"our purposes, we will ignore all spaces, newlines, punctuation and capitalization in our\n",
"counting. So, as an example, the fragment of poem,\n",
"\n",
" Half a league, half a league,\n",
" Half a league onward,\n",
" All in the valley of Death\n",
" Rode the six hundred.\n",
"\n",
"includes the bigrams `('half', 'a')` and `('a', 'league')` both three times, the bigram\n",
"`('league', 'half')` appears twice,while the bigram `('in', 'the')` appears only once.\n",
"\n",
"1. Write a function `count_bigrams_in_file` that takes a filename as its only argument. Your function should read from the given file, and return a dictionary whose keys are bigrams (given in the tuple form above), and values are the counts for those bigrams. Again, your function should ignore punctuation, spaces, newlines and capitalization. The strings in your key tuples should be lower-case. Your function should use a try-catch statement to raise an error with an appropriate message to alert the user in the event that the given file cannot be opened, and a different error in the event that the provided argument isn't a string at all. **Hint:** you will find the Python function `str.strip()`, along with the string constants defined in the string documentation (https://docs.python.org/3/library/string.html ), useful in removing punctuation. **Hint:** be careful to check that your function handles newlines correctly. For example, in the poem above, one of the `('league', 'half')` bigrams spans a newline, but should be counted nonetheless. **Note:** be careful that your function does not accidentally count the empty string as a word (this is a common bug if you aren't careful about splitting the input text). Solutions that merely delete \"bad\" keys from the dictionary at the end will not receive full credit, as all edge cases can handled by correctly splitting the input."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"import string\n",
"\n",
"def bigram_hist(words):\n",
" bigramDict = {}\n",
" \n",
" for i in range(len(words) - 1):\n",
" key = (words[i], words[i + 1])\n",
" if(bigramDict.get(key) == None):\n",
" bigramDict[key] = 1\n",
" else:\n",
" bigramDict[key] += 1\n",
" \n",
" return(bigramDict)\n",
"\n",
"def count_bigrams_in_file(file): \n",
" # Check file name is a string\n",
" if not isinstance(file, str):\n",
" raise ValueError('Argument file name is not of type string!')\n",
" \n",
" extractedWords = []\n",
" \n",
" try:\n",
" f = open(file)\n",
" for line in f:\n",
" for word in line.split():\n",
" extractedWords.append(word.strip(string.punctuation).lower())\n",
" return(bigram_hist(extractedWords))\n",
" except FileNotFoundError:\n",
" print('File: ' + file + ' ,unable to be opened!')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Download the file `WandP.txt` from the course webpage: http://www-personal.umich.edu/~klevin/teaching/Winter2019/STATS507/WandP.txt. This is an ASCII copy of all of Tolstoi's novel *War and Peace*. Run your function on this file, and pickle the resulting dictionary in a file called `mb.bigrams.pickle`. Please include this file in your submission, along with `WandP.txt`, so that we can run your notebook directly from your submission."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"\n",
"file = open('mb.bigrams.pickle.txt', 'wb')\n",
"file.write(pickle.dumps(count_bigrams_in_file('WandP.txt')))\n",
"file.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. We say that word A is *collocated* with word B in a text if words A and B occur immediately one after another (in either order). That is, words A and B are collocated if and only if either of the tuples `(A, B)` or `(B, A)` are present in the text. Write a function `collocations` that takes a filename as its only argument and returns a dictionary. Your function should read from the given file (raising an appropriate error if the file cannot be opened or if the argument isn't a string at all) and return a dictionary whose keys are all the strings appearing in the file (again ignoring case and stripping away all spaces, newlines and punctuation) and the value of word A is a Python set containing all the words collocated with A. Again using the poem fragment above as an example, the string `'league'` should appear as a key, and should have as its value the set `{'a', 'half', 'onward'}`, while the string `'in'` should have the set `{'all', 'the'}` as its value. **Hint:** we didn't discuss Python sets in lecture, because they are essentially just dictionaries without values. See the documentation at https://docs.python.org/3/tutorial/datastructures.html#sets for more information."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"def collocations(file):\n",
" # Check file name is a string\n",
" if not isinstance(file, str):\n",
" raise ValueError('Argument file name is not of type string!')\n",
" \n",
" # Extract all words from file\n",
" extractedWords = []\n",
" try:\n",
" f = open(file)\n",
" for line in f:\n",
" for word in line.split():\n",
" extractedWords.append(word.strip(string.punctuation).lower())\n",
" except FileNotFoundError:\n",
" print('File: ' + file + ' ,unable to be opened!')\n",
" \n",
" # check collocations\n",
" bigramDict = {}\n",
" n = len(extractedWords)\n",
" for i in range(n):\n",
" key = extractedWords[i]\n",
" if n == 1:\n",
" bigramDict[key] = set('')\n",
" return (bigramDict)\n",
"\n",
" if i < (n - 1):\n",
" nextKey = extractedWords[i + 1]\n",
" if (bigramDict.get(key) == None):\n",
" bigramDict[key] = {nextKey}\n",
" elif nextKey not in bigramDict[key]:\n",
" bigramDict[key].add(nextKey)\n",
"\n",
" if i > 0:\n",
" prevKey = extractedWords[i - 1]\n",
" if (bigramDict.get(key) == None):\n",
" bigramDict[key] = {prevKey}\n",
" elif prevKey not in bigramDict[key]:\n",
" bigramDict[key].add(prevKey)\n",
"\n",
" return (bigramDict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"4. Run your function on the file `WandP.txt` and pickle the resulting dictionary in a file called `mb.colloc.pickle`. Please include this file in your submission."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"file = open('mb.colloc.pickle.txt', 'wb')\n",
"file.write(pickle.dumps(collocations('WandP.txt')))\n",
"file.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem 2: More Fun with Vectors\n",
"#### ([Back to Top](#Table-of-Contents))\n",
"#### Time Spent: 2 hours\n",
"In this exercise, we'll encounter our old friend the vector yet again, this time taking an\n",
"object-oriented approach.\n",
"1. Define a class `Vector`. Every vector should have a dimension (a non-negative integer) and a list or tuple of its entries. The initializer for your class should take the dimension as its first argument and a list or tuple of numbers (ints or floats), representing the vector's entries, as its second argument. Choose sensible default behavior for the case where the user applies only a dimension and no entries. The initializer should raise a sensible error in the case where the dimension is invalid (i.e., wrong type or a negative number), and should also raise an error in the event that the dimension and the number of supplied entries disagree."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Did you choose to make the vector's entries a tuple or a list (there is no wrong answer here, although I would say one is better than the other in this context)? Defend your choice."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* *I made vector entries into a tuple, to protect vectors from being modified. Any vector modifications or vector operations should instead return another tuple.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. Are the dimension and entries class attributes or instance attributes? Why is this the right design choice?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* *I made the dimension entries instance attributes, because I want every object to have their copies of these instance attributes.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"4. Implement the necessary operator(s) to support comparison (equality, less than, less or equal to, greater than, etc) of `Vector` objects. We will say that two `Vector` objects are equivalent if they have the same coordinates. Otherwise, comparison should be analogous to tuples in Python, so that comparison is done on the first coordinate first, then the second coordinate, then the third, and so on. So, for example, the two-dimensional vector (2, 4) is ordered before (less than) (2, 5). Attempting to compare two vectors of different dimensions should result in an error."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"5. Implement a method `Vector.dot` that takes a single `Vector` as its argument and returns the inner product of the caller with the given `Vector` object. Your method should raise an appropriate error in the event that the argument is not of the correct type or in the event that the dimensions of the two vectors do not agree."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"6. We would also like our `Vector` class to support scalar multiplication. Left- or right- multiplication by a scalar, e.g., `2*v` or `v*2`, where `v` is a `Vector` object, should result in a new `Vector` object with its entries all scaled by the given scalar. We will also follow `R` and `numpy` (which you will learn in a few weeks), and use * to denote entrywise vector-vector multiplication, so that for `Vector` objects `v` and `w`, `v*w` results in a new `Vector` object, with the $i$-th entry of `v*w` equal to the $i$-th entry of `v` multiplied by the $i$-th entry of `w`. Implement the appropriate operators to support this multiplication operation. Many languages have a convention for dealing with multiplication of vectors that differ in their dimension, but we will punt on this matter. Your method should raise an appropriate error in the event that v and w disagree in their dimensions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"7. For a real number $0\\leq p\\leq\\infty$, and a vector $v\\in\\mathbb{R}^{d}$, the $p$-norm of $v$, written $\\Vert v\\Vert_{p}$, is given by\n",
"\n",
"\\begin{align*}\n",
"\\Vert v\\Vert_{p}= & \\begin{cases}\n",
"\\sum_{i=1}^{d}1_{v_{i}\\neq0} & \\textrm{if } p=0\\\\\n",
"\\left(\\sum_{i=1}^{d}\\mid v_{i}\\mid^{p}\\right)^{1/p} & \\textrm{if }0<p<\\infty\\\\\n",
"max_{i=1,2,\\ldots,d\\mid v_{i}\\mid} & \\textrm{if }p=\\infty\n",
"\\end{cases}\n",
"\\end{align*}\n",
"\n",
"Strictly speaking, this is only a norm for $p\\geq1$, but that's beside the point. Implement a method `Vector.norm` that takes a single int or float `p` as an argument and returns the `p`-norm of the calling `Vector` object. Your method should work whether `p` is an integer or float. Your method should raise a sensible error in the event that $p$ is negative. **Hint:** see https://docs.python.org/3/library/functions.html# float for documentation on representing positive infinity in Python."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"class Vector:\n",
" # 1. Constructor for Vector Class\n",
" def __init__(self, dimension, entries = None):\n",
" # Error Checking\n",
" if not isinstance(dimension, int):\n",
" raise TypeError('Dimension should be an int!')\n",
" \n",
" if dimension < 0:\n",
" raise ValueError('Cannot Negative dimension!')\n",
" \n",
" if entries != None:\n",
" if not isinstance(entries, (list, tuple)):\n",
" raise TypeError('Input elements should be list or a tuple!')\n",
"\n",
" if not all(isinstance(elmt, (int, float)) for elmt in entries):\n",
" raise TypeError('Elements of list/tuple should of type int/float!')\n",
"\n",
" if dimension != len(entries):\n",
" raise ValueError('Dimension and Number of Elements in entries disagree!')\n",
" \n",
" # Initialize Instance Attributes\n",
" self.dimension = dimension\n",
" if entries == None:\n",
" self.entries = tuple([0 for _ in range(dimension)])\n",
" else:\n",
" self.entries = tuple(entries)\n",
" \n",
" # 2. I made vector entries into a tuple, to protect vectors from being modified. Any \n",
" # vector modifications or vector operations should instead return another tuple.\n",
" \n",
" # 3. I made the dimension entries instance attributes, because I want every object \n",
" # to have their own copies of these instance attributes.\n",
" \n",
" # 4. Defining operators for Vector Class\n",
" # Equality '=='\n",
" def __eq__(self, other):\n",
" if (self.dimension != other.dimension):\n",
" raise ValueError('Vectors should have the same dimension!')\n",
" if self.entries == other.entries:\n",
" return(True)\n",
" else:\n",
" return(False)\n",
" \n",
" # Not equal '!='\n",
" def __ne__(self, other):\n",
" if (self.dimension != other.dimension):\n",
" raise ValueError('Vectors should have the same dimension!')\n",
" if self.entries != other.entries:\n",
" return(True)\n",
" else:\n",
" return(False)\n",
" \n",
" # Less than '<' \n",
" def __lt__(self, other):\n",
" if (self.dimension != other.dimension):\n",
" raise ValueError('Vectors should have the same dimension!')\n",
" if self.entries < other.entries:\n",
" return(True)\n",
" else:\n",
" return(False)\n",
" \n",
" # Greater than '>'\n",
" def __gt__(self, other):\n",
" if (self.dimension != other.dimension):\n",
" raise ValueError('Vectors should have the same dimension!')\n",
" if self.entries > other.entries:\n",
" return(True)\n",
" else:\n",
" return(False)\n",
" \n",
" # Less than or equal '<='\n",
" def __le__(self, other):\n",
" if (self.dimension != other.dimension):\n",
" raise ValueError('Vectors should have the same dimension!')\n",
" if self.entries <= other.entries:\n",
" return(True)\n",
" else:\n",
" return(False)\n",
" \n",
" # Greater than or equal '>='\n",
" def __ge__(self, other):\n",
" if (self.dimension != other.dimension):\n",
" raise ValueError('Vectors should have the same dimension!')\n",
" if self.entries >= other.entries:\n",
" return(True)\n",
" else:\n",
" return(False)\n",
" \n",
" # 5. Vector.dot method\n",
" def dot(self, other):\n",
" if not isinstance(other, Vector):\n",
" raise TypeError('Should supply argument of type Vector!')\n",
" \n",
" if (self.dimension != other.dimension):\n",
" raise ValueError('Vectors should have the same dimension!')\n",
" \n",
" x = self.entries; y = other.entries\n",
" return(float(sum(x[i] * y[i] for i in range(len(x)))))\n",
" \n",
" # 6. Vector and Scalar multiplication\n",
" def __mul__(self, other):\n",
" if not isinstance(other, (int, float, Vector)):\n",
" raise TypeError('Should only multiply with scalars or Vectors!')\n",
" \n",
" if isinstance(other, (int, float)):\n",
" return(tuple([other * self.entries[i] for i in range(self.dimension)]))\n",
" \n",
" if isinstance(other, Vector):\n",
" if (self.dimension != other.dimension):\n",
" raise ValueError('Vectors should have the same dimension!')\n",
" return(tuple([other.entries[i] * self.entries[i] for i in range(self.dimension)]))\n",
" \n",
" def __rmul__(self, other):\n",
" if not isinstance(other, (int, float, Vector)):\n",
" raise TypeError('Should only multiply with scalars or Vectors!')\n",
" \n",
" if isinstance(other, (int, float)):\n",
" return(tuple([other * self.entries[i] for i in range(self.dimension)]))\n",
" \n",
" if isinstance(other, Vector):\n",
" if (self.dimension != other.dimension):\n",
" raise ValueError('Vectors should have the same dimension!')\n",
" return(tuple([other.entries[i] * self.entries[i] for i in range(self.dimension)]))\n",
" \n",
" # 7. Vector.norm method\n",
" def norm(self, p):\n",
" if not isinstance(p, (int, float)):\n",
" raise TypeError('Input should be of type int/float!')\n",
" \n",
" if p < 0:\n",
" raise ValueError('Input should be non-negative!')\n",
" \n",
" if p == 0:\n",
" return(float(sum(1 for x in self.entries if x != 0)))\n",
" \n",
" elif p > 0 and p != float('inf'):\n",
" return(sum(abs(x)**p for x in self.entries)**(1 / p))\n",
" \n",
" elif p == float('inf'):\n",
" return(max(abs(x) for x in self.entries))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment