Skip to content

Instantly share code, notes, and snippets.

@israeldi
Created December 19, 2019 07:24
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save israeldi/55475db8c71a2df4572830ae842a6669 to your computer and use it in GitHub Desktop.
Save israeldi/55475db8c71a2df4572830ae842a6669 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stats507 Homework 7, March 13, 2019\n",
"### Israel Diego [(Go to Home Page)](https://israeldi.github.io/coursework/) \n",
"#### israeldi@umich.edu\n",
"\n",
"This notebook shows solutions to homework 7 for Stats507\n",
"\n",
"## Table of Contents\n",
"\n",
"1. [Problem 1: Regular Expressions: Warmup](#Problem-1:-Regular-Expressions:-Warmup)\n",
"2. [Problem 2: Exploring Internet Traffic with Regexes](#Problem-2:-Exploring-Internet-Traffic-with-Regexes)\n",
"3. [Problem 3: Retrieving Data from the Web](#Problem-3:-Retrieving-Data-from-the-Web)\n",
"4. [Problem 4: Retrieving Data from the Web](#Problem-4:-Relational-Databases-and-SQL)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem 1: Regular Expressions: Warmup\n",
"#### Time Spent: 2 hours\n",
"#### ([Back to Top](#Table-of-Contents))\n",
"In this problem, you'll get practice with basic regular expressions. Pay particular attention to edge cases such as the empty string and single-character strings when writing your regexes. At the URL http://www.greenteapress.com/thinkpython/code/words.txt is a list of about $100,000$ English words.\n",
"1. Use `urllib` to open the URL and read the file, and produce a list of ASCII strings so that each line of the file corresponds to an element of the list. You will likely need to convert the raw bytes read from the webpage to ASCII characters, for which you should see the documentation for the string methods `encode` and `decode`. How many words are in the file?"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"113809"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import urllib.request\n",
"import re\n",
"\n",
"response = urllib.request.urlopen('http://www.greenteapress.com/thinkpython/code/words.txt')\n",
"words = []\n",
"for line in response:\n",
" words.append(line.decode(\"ascii\").replace('\\r\\n', ''))\n",
"\n",
"numWords = len(words)\n",
"numWords"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. It is a good habit to always look at your data to check that it makes sense. Have a look at the words in the list. Does anything jump out at you? **Note:** I am not requiring you to do anything specific, here. Just look at the data!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- *After removing unwanted characters, `\\r\\n`, each element in the list represents a word of type string*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. Write a regular expression that matches any string containing exactly four consecutive consonants. Compile this regular expression, and assign it to a variable called `four_consecutive_consonants`. Use this regex to determine how many words from the list start with exactly four consecutive consonants. For the purposes of this **specific** problem, the vowels are `a, e, i, o, u, y`. All other letters are consonants. Produce a list of all such words."
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"1629"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"four_consecutive_consonants = re.compile(r'[^aeiouy]{4}')\n",
"filterWords = list(filter(four_consecutive_consonants.search, words))\n",
"\n",
"num_four_consonants = len(filterWords)\n",
"num_four_consonants"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"4. Write a regular expression that matches any string that contains no instances of the letter `e`. Compile this regular expression, and assign it to a variable called `gadsby`. (*Gadsby* is the title of an English novel written in the $1930$s that contains *almost* no instances of the letter `e`). How many words in the list do not contain the letter `e`?"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"37641"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gadsby = re.compile(r'^[^e]*$')\n",
"filterWords = list(filter(gadsby.search, words))\n",
"\n",
"numGadsby = len(filterWords)\n",
"numGadsby"
]
},
{
"cell_type": "markdown",
"metadata": {
"scrolled": false
},
"source": [
"5. Write a regular expression that matches any string that begins and ends with a vowel and has no vowels in between. For the purposes of this specific problem, y is neither consonant nor vowel, so consonants are the $20$ letters that are not one of `a, e, i, o, u, y` and vowels are `a, e, i, o, u`. The words need not begin and end with the *same* vowel, so `angle` is a valid match. Compile this regular expression, and assign it to a variable called `vowel_vowel`. How many words begin and end with a vowel with no vowels in between?"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"169"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vowel_vowel = re.compile(r'^[aeiou]{1}[^aeiouy]*[aeiou]{1}$')\n",
"filterWords = list(filter(vowel_vowel.search, words))\n",
"\n",
"# vowel_vowel\n",
"num_vowel_vowel = len(filterWords)\n",
"num_vowel_vowel"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"6. Write a regular expression that matches any string whose last two characters are the first two characters in reverse order. So, for example, your regex should match `repeater` and `stats`, but not `neoprene`. Compile this regular expression and assign it to a variable called `bookends`. How many words in the list have this property? **Hint:** be careful of the cases in which the word is length less or equal to $3$. You may handle the case of a single character (e.g., `a`), as you like, but please give an explanation for your choice."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"886"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bookends = re.compile(r'^(\\w)(\\w).*\\2\\1$|^(\\w).\\3$|^(\\w)\\4$|^.$')\n",
"filterWords = list(filter(r.search, words))\n",
"\n",
"numBookends = len(filterWords)\n",
"numBookends"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* *I chose to include all single characters as part of the bookends list*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem 2: Exploring Internet Traffic with Regexes\n",
"#### Time Spent: 2 hours\n",
"#### ([Back to Top](#Table-of-Contents))\n",
"In this problem, you'll get a taste of a more realistic application of regular expressions. The\n",
"file http://umich.edu/~klevin/teaching/Winter2019/STATS507/SkypeIRC.txt contains data generated by web traffic associated with Skype and IRC, captured using the Wireshark program, a common tool for analyzing web traffic. The original data file can be found on the Wireshark wiki, https://wiki.wireshark.org/SampleCaptures, but please use the file provided on my website for this assignment.\n",
"1. Download the file from the URL above (or use `urllib` or `requests` to open it directly, being careful to convert the raw bytes back to UTF-8) and read its contents into a string. Each line of this file corresponds to a single packet sent over the internet. How many packets are in this file? Save the answer in a variable `n_packets`. **Note:** if you decide to download the file, don't forget to include a copy of it in your submission so that we can run your code."
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2263"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"response = urllib.request.urlopen('http://umich.edu/~klevin/teaching/Winter2019/STATS507/SkypeIRC.txt')\n",
"packets = []\n",
"for line in response:\n",
" packets.append(line.decode('utf-8'))\n",
" \n",
"n_packets = len(packets)\n",
"n_packets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Use regular expressions to extract all the IP addresses from the file and collect them in a Python list. An IP address consists of four numbers, which are displayed as `A.B.C.D` where `A,B,C` and `D` are each numbers between $0$ and $255$. How many unique IP addresses appear in the data set? Save the answer in a variable `ip_addresses`"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"368"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ip_addresses_set = set()\n",
"for line in packets:\n",
" [ip_addresses_set.add(x) for x in re.findall(r'\\d+\\.\\d+\\.\\d+\\.\\d+', line)]\n",
"\n",
"ip_addresses = len(ip_addresses_set)\n",
"ip_addresses"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. Write a function called `get_packets_by_regex` that takes a single raw string as its argument and returns all lines of the input file that match the input raw string as a regular expression. So, for example, `get_packets_by_regex(r'comcast')` will return all lines from the file containing the string `'comcast'`. Your function should perform appropriate error checking to ensure that the input is a string, but you do not need to check that it is a raw string."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [],
"source": [
"def get_packets_by_regex(string):\n",
" # Perform Error-checking\n",
" if not isinstance(string, str):\n",
" raise TypeError('Should a string!')\n",
" \n",
" r = re.compile(string)\n",
" return list(filter(r.search, packets))\n",
"\n",
"# get_packets_by_regex('comcast')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"4. The second piece of text (i.e., non-whitespace) on each line is a time stamp, counting the time (in seconds) since the beginning of the traffic recording. Using `matplotlib`, create a plot displaying how many packets appeared in each second of the recording. A histogram or line plot is the most obvious way to do this, but you should feel free to use a more creative way of displaying this information if you wish to do so. **Note:** in case it wasn't obvious, there is no need to use a regular expression for this subproblem if you do not want to."
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAfAAAAHwCAYAAABZrD3mAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzt3XmYbGV5rvH7UUQUVAS2BEEFFCOaGCU7BOcBTZSgqNEET0REIg5xinrikMQQTYwaj0OOiYpCAmoEIigcZ4IQZ3SDiiAqqMgobEVUcADkPX+s1VIUPVTv7uqqr/v+XVddXfXVqlrvV6u6n17fmlJVSJKkttxs0gVIkqTFM8AlSWqQAS5JUoMMcEmSGmSAS5LUIANckqQGGeBaM5Kcn+Qlk65jJSW5KsnTJl3HSkhSSZ44wfn/dpIvJvlFkm/M1pZki77OfSdVp1YPA1yrQpJ1Sf6tD+lfJrksyclJHjnp2qZZkkOTnDXw+Gl9wMzcLkvy/5Lca5bXbp/kLUm+3X/mFyf5aJJ95pjXPfr3fMBQ+8lJrk+ybqj9oiSvWqZ+DvdrtttDlzibfwJ+CPwm8MDZ2qrqF8AOwElLnJfEZpMuQFomxwG3Bg4GzgPuADwE2HaSRTXqZ8BdgQA7Aq8HPpzk7lV1DUCSnYHPAj8FXg58lW6FYG/g7cCdh9+0qr6R5FLgYf1rSbI5cD/gQrrl9f6+fbd+3qcsU5+OAT428PjdwBXACwbarhh+UZJbVNW1I87jbsCRVfW9+dqq6vsjVy3NwzVwNS/J1sCDgJdV1clV9b2q+lJVvaGqjp7ndU9J8pMkj03y1CQ/THLLoWnem+TE/v6hSc5K8udJLkjy8yQfTLLd0GsOSvL1ftj0W0n+Msmcv2tJ7prkhCTfT3J1kjOGh1j7kYW/SfKOvuaLkvzvoWnuluTUfr7fXMIwbVXV96vq0qraALwJuAvdWuSMf6ML+PVVdWxVfbOqzqmqtwK/M897n0IX4DP2oltDfc9Q+8OAXwCfH+jfKJ/rbyT5cJKfJflekqf0Hfp536fv9wH6S+BGbVV1TZLXJtmQ5JAk3wV+mWSzJI9J8tkkV/bfk4/0/2QwMyzefz6v6dfmXzZH202G0JPcOcnR/fv+LMnpSR6ItAADXKvBVf3tsUm2GOUFSZ4P/F9g36o6Efgvut+H/QamuR3weODwgZfuDDyln+4RwG7AEQOveQbwGuCVwO7Ai4GXAs+Zp5ytgI8Cj6QLv+OA45PcY2i6vwS+BuwBvA54fZL79fO9GfCBvg/3A54OHArckiXo/zn6X/3Da/u2bYBHAW+tqquGX1NVP5rnLU8B7j/wj9LDgFP723CAf74fcl7M5/r3wInAfYDDgKOSrB+hq4PuQbd8n9C/z6/oRnf+GVhPN8pwDXBiks0GhsXPB/6xv/9/52i7kSS3BT4F/AbwWODedMPuWWTNWouqypu35m/AH9MNgc6stb0B+P2hac4HXgK8CrgMuO/Q828FPjbw+NnA94HN+seH0v0xv/PANA8ECtitf3wBcMDQ+74Q+Poi+/MF4G+Gan/f0DTnzkwD/ME8tT1tnvkcCpw18Php/WuuAq7u7xdwwsA0e/Ztj9+E5XTX/rUP6R+fSrfZY0u6UPyNvv3Sof4v+Ln27/vOoWn+G3jPLHV8CPiPWdpf23+HtlmgH7fv57d+oO08ulEg5moDtuhft2//+HnAlcDWk/4d8tbezTVwrQpVdRxwR+AxdGuz9we+kOQVQ5O+AHg+3Q5FXx567p3AI5Ps1D9+Ot32y+sGprm4qi4YeHwacD2we78T1p2Ad6Tb+/uqJFfRhcJd56o9yZZJXt8PD/+of816brod+cyhx5fQbeuHbq10rtoW62d0a56/CzyT7h+FZw6WvAnvCUBVfZsujB/Wj5bsBZxaVVcDG4CHJtmdbo30k9DtoMjon+vnZ3l8z0WW+d2qutH28CR3T3JMku8k+QndNnuYZVv/It0XOL2qrlzi+2gNcic2rRrVDWWe1N9eleRdwKFJ3lD9zlfAZ+iGf59MtyY++PqvJjkDeFqSD9KF6FMWUcLMP8TPAj63iNe9oa/pJXRh+TPgKGDzoemGd6aqgXku55BrVdV5/f1vJNkBeB83DHGf2897d7ph+8U6BXgo8D/Axj7U6R8/lG7t9mrgS337pn6um+rqwQdJQvdP4TeBP6cbHQA4m5suo8VyqFybzDVwrWZfp/sndXC7+Ol0w80vSvK3s7zmnXTDyH8OfLaqvjn0/I5J7jTweE+636Nzquoy4GLgrlV13vBtnjofCBxVVcdV1ZnARcyzxj6Hr89T21K9CdgjyRMA+rXTjwPPTbLV8MT9dvP5nEK35r0P3RD6jFPp/kl4GPDp6vf+XuTnutcsj89ZuIvzuiOwK/DqqvpkVZ1D90/GcoTvGXSf7UKfmXQTBrial2TbJJ9Mt1f5vZPskuRJwF8BJ1fVTwanr6ov0YX4i5P8zdDbvY9u+PbZ3HjntRk/B45Mcp9+B7K3Ax+uqnP75w8F/qrfQ/o3k/xWuj3cXz5PF74FPD7JHkl+m26P7JF2xhvw38A36HbamqntTcB1879sYf3n9y7g7wf2+n4OXYBtSPKkvq/3SPJsbjrUP+wUup3rnsmNA/yzdEH5KG56+NihjPa5PiHJM5Ls1j+3N/DmRXX4pi6n2079zH5P/4cDb6EbhViqo+gOxftgkgck2TXJ45M8aBneW6ucAa7V4Cq6nb5eQDcMezbdHsv/CfzpbC+oqi/ShfhLBkO8qn4KHEu3Q9Wxs7z0fOBo4P/RbaP9DnDQwOvfRbft/AC6Y6M/DRwCfHee+l9EFxKfphuq/UJ/f2RVdT3dHvM3o9v2fRTwD3SHSy2Ht9Dtnb1/P7/v0u0NfxLdHvFn0n0ej+XG28tnq/UCus/tNgwEeHV7tJ/et39y6DWjfq6H0u3QeCbdP2EH9f+wbbJ+JGB/uhGNs+j+IXgpm7Z/wfB7/xh4MPAD4CN0Rxn89XK8t1a/VC3HP5HS6pHko8BFVfWMofZDgSdW1W9NpDBJGuBObFKvP775EXRr5vOdjESSJs4Al25wBrAN8IqqOmuhiSVpkhxClySpQe7EJklSg1bNEPp2221XO++886TLkCRpSU4//fQfVNW6haZbNQG+8847s2HDhkmXIUnSkiT53sJTOYQuSVKTDHBJkhpkgEuS1CADXJKkBhngkiQ1yACXJKlBBrgkSQ0ywCVJapABLklSgwxwSZIaZIBLktQgA1ySpAYZ4JIkNcgAlySpQQa4JEkNMsAlSWqQAS5JUoMMcEmSGmSAS5LUIANc0sQlk65Aao8BLklSgwxwSZIaZIBLktQgA1ySpAYZ4JIkNcgAlySpQQa4JEkNMsAlSWqQAS5JUoMMcEmSGmSAS5LUIANckqQGGeCSJDXIAJckqUEGuCRJDVqRAE9yRJLLk5w10LZNkpOSnNv/vH3fniT/kuS8JGcm2WMlapQkqSUrtQb+H8CjhtpeBpxcVbsBJ/ePAR4N7NbfDgHetkI1SpLUjBUJ8Kr6FHDFUPN+wJH9/SOBxw20H1WdLwBbJ9lhJeqUJKkVk9wGvn1VXQrQ/7xD374jcOHAdBf1bZIkqTeNO7FllraadcLkkCQbkmzYuHHjmMuSJGl6TDLAL5sZGu9/Xt63XwTcaWC6nYBLZnuDqjqsqtZX1fp169aNtVhJkqbJJAP8RODA/v6BwAkD7U/t90bfC/jxzFC7JEnqbLYSM0nyPuChwHZJLgL+DngtcGySg4ELgCf1k38E2Ac4D/gZcNBK1ChJUktWJMCr6slzPLX3LNMW8BfjrUiSpLZN405skiRpAQa4JEkNMsAlSWqQAS5JUoMMcEmSGmSAS5LUIANckqQGGeCSJDXIAJckqUEGuCRJDTLAJUmrTma7MPUqY4BLktQgA1ySpAYZ4JIkNcgAlySpQQa4JEkNMsAlSWqQAS5JUoMMcEmSGmSAS5LUIANckqQGGeCSJDXIAJckqUEGuCRJDTLAJUlqkAEuSVKDDHBJkhpkgEuS1CADXJKkBhngkiQ1yACXJKlBBrgkSQ0ywCVJapABLklSgwxwSZIaZIBLktQgA1ySpAYZ4JIkNcgAlySpQQa4JEkNMsAlSWqQAS5JUoMMcEmSGmSAS5LUIANckqQGGeCSJDXIAJckqUEGuCRJDTLAJUlqkAEuSVKDDHBJkhpkgEuS1CADXJKkBhngkiQ1yACXJKlBBrgkSQ0ywCVJapABLklSgwxwSZIaZIBLktQgA1ySpAYZ4JIkNcgAlySpQQa4JEkNMsAlSWqQAS5JUoMMcEmSGmSAS5LUIANckqQGGeCSJDVo4gGe5C+TnJ3krCTvS7JFkl2SnJbk3CTHJNl80nVKkjRNJhrgSXYEng+sr6rfAm4O7A+8DnhTVe0G/Ag4eHJVSpI0fSa+Bg5sBtwqyWbArYFLgYcD7++fPxJ43IRqkyRpKk00wKvqYuANwAV0wf1j4HTgyqq6rp/sImDH2V6f5JAkG5Js2Lhx40qULEnSVJj0EPrtgf2AXYA7AlsCj55l0prt9VV1WFWtr6r169atG1+hkiRNmUkPoT8C+G5Vbayqa4HjgfsDW/dD6gA7AZdMqkBJkqbRpAP8AmCvJLdOEmBv4OvAKcAT+2kOBE6YUH2SJE2lSW8DP41uZ7UzgK/19RwGvBR4UZLzgG2BwydWpCRJU2izhScZr6r6O+Dvhpq/A+w5gXIkSWrCpIfQJUnSJjDAJUlqkAEuSVKDDHBJkhpkgEuS1CADXJKkBhngkiQ1yACXJKlBBrgkSQ0ywCVJapABLklSgwxwSZIaZIBLktQgA1ySpAYZ4JIkNcgAlySpQQa4JEkNMsAlSWqQAS5JUoMMcEmSGmSAS5LUIANckqQGGeCSJDXIAJckqUEGuCRJDTLAJUlqkAEuSVKDDHBJkhpkgEuS1CADXBOVTLoCTZrfAWnTGOCSJDXIAJckqUEGuCRJDTLAJUlqkAEuSVKDDHBJkho0UoAnWZdkq/7+zZMclOSpSfwHQJKkCRg1gD8E7Nbf/0fgJcCLgP8zjqIkSdL8NhtxursDX+nvPwW4P3AVcDbwl2OoS5IkzWPUAP8VsHmSuwM/rqoL+uHzrcZXmiRJmsuoAf4x4FhgW+Dovu2ewMXjKEqSJM1v1AA/GDgQuBY4qm/bDjh0DDVJkqQFjBrgz6uqNww2VNWpSV40hpokSdICRt0L/ZVztP/NchUiSZJGN+8aeJKH93dvnuRhwOCF/3YFfjquwiRJ0twWGkI/vP+5BXDEQHsBlwHPG0dRkiRpfvMGeFXtApDkqKp66sqUJEmSFjLSNvCqemqSWyR5UJI/BUiyZZItx1ueJEmazajnQv9t4FvAO7lhWP0h3HhYXZIkrZBR90J/G/DKqroH3bHgAP8DPHAsVUmSpHmNGuD3At7T3y+AqroauNU4ipIkSfMbNcDPB353sCHJnsB5y12QJEla2KhnYvtb4MNJ3k53UZOXA88CnjG2yiRJ0pxG3Qv9Q8CjgXV0277vAjyhqj4xxtokSdIcRloDT3LHqjoDeM5Q+32r6stjqUySJM1p1G3gn0iyzWBDvw38I8tfkiRJWsioAX4YXYhvBZDk/sCJdJcZlSRp6iTdbbUaaQi9qv4lye2AjyT5J+A/gD+rqv8eZ3GSJGl2o+6FTlW9OsnWwDHAvlX1qfGVJUmS5jNngCe5kP6kLQNu1t/ek35coqruPLbqJEnSrOZbA3/KilWhNS2BGv5XUZI0rzkDvKr+ZyULkSRJoxv1amTHJ3nQUNuDkrx/PGVJkqT5jHoY2UOAzw21fR542PKWI0mSRjFqgP8C2HKobStuuLSoJElaQaMG+MeBdyS5LUD/863Ax8ZVmCRJmtuoAf5i4LbAFUkuB64Abge8cFyFSZKkuY16JrYfAX+UZAdgJ+DCqvr+WCuTJElzGvlMbABVdWmS7wNJcrO+7fqxVCZJm8hzC2gtGPUwsjsm+UCSHwLX0e28NnNbkiRbJ3l/km8kOSfJ/ZJsk+SkJOf2P2+/1PlIkrSajLoN/B3ANcDewFXAHnRXI3vWMtTwFuBjVXUP4HeAc4CXASdX1W7Ayf1jSZLUGzXA7w88vaq+AlRVfZXuUqIvXsrM+73ZHwwcTvfG11TVlcB+wJH9ZEcCj1vKfCRJWm1GDfBf0Q2dA1yZZB1wNbDjEue/K7AR+PckX07yriRbAttX1aXQbXcH7jDbi5MckmRDkg0bN25cYimSJLVj1AA/Ddinv/9xukuKHg9sWOL8N6Mbjn9bVd2X7p+CkYfLq+qwqlpfVevXrVu3xFIkSWrHqAF+ADBzcZMXAp8EzgL+1xLnfxFwUVWd1j9+P12gX9Yfskb/8/IlzkeSpFVlwQBPsjXdUPe1AFX186r6h6p66cww96bqjyW/MMlv9k17A1+n20HuwL7tQOCEpcxHkqTVZt7jwJP8EXAscCvgp0keV1WnLHMNzwPem2Rz4DvAQXT/WByb5GDgAuBJyzxPSZqIpPvpcepaqoVO5PJq4KXAEcAzgH+k2yN92fR7tq+f5am9l3M+kiStJgsNoe9aVW+tqp8B/wrcbQVqkiRJC1gowH/9fFVdxyJPvSpJksZjoUC+dZJPDTy+zdBjqurBy1+WJEmbZmY/g9VuoQA/eOjx4eMqRJIkjW7eAK+qI+d7XpIkTcaoJ3KRJElTxACXJKlBBrgkSQ2aM8CTfGHg/t+tTDmSJGkU862B3z3JFv39JV33W5IkLa/59kI/AfhWkvOBWw0f/z3D48CltS3xvN7SJMwZ4FV1UJIHAjsDv4fHgEuSNDUWOg78M8BnkmzuMeGSJE2Pkc5tXlVHJHkYcACwI3Ax8J6q+uQ4i5MkSbMb6TCyJH8OHAN8HzgeuBT4zyTPGGNtkiRpDqNeXeyvgEdW1VdnGpIcAxwHvHMchUmSpLmNeiKXbYGvD7V9E9hmecuRJEmjGDXAPwO8McmtAZJsCfwz8LlxFSZJkuY2aoA/C7g38OMklwFXAr8DPHNchUmSpLmNuhf6pcBDkuwE3BG4pKouGmtlkiRpTqPuxAZAH9oGtyRJE+bVyCRJapABLklSgxYM8CQ3S/LwJJuvREGSJGlhCwZ4VV0PnFBV16xAPZIkaQSjDqF/KsleY61Ea17S3SRJCxt1L/TvAR9NcgJwIfDrq/9W1SvHUZgkSZrbqAF+K+CD/f2dxlSLJEka0agncjlo3IVIkqTRjXwilyS7A08Etq+q5yb5TeCWVXXm2KqTJEmzGvV64E8CPgXsCDy1b74N8MYx1SVJkuYx6l7or6K7HvizgF/1bV+lu6CJJElaYaMG+B3oAhtu2AO9Bu5LkqQVNGqAnw4cMNS2P/DF5S1HkiSNYtSd2J4PfCLJwcCWST4O3B34g7FVJkmS5jTqYWTfSHIPYF/gQ3Qnc/lQVV01zuIkSdLsRj6MrKp+luSzwHeBSwxvSZImZ9TDyO6c5NPA+cCHgfOTfCbJXcZZnCRJmt2oO7EdSbcj29ZVdQfg9sCX+nZJkrTCRh1C/13gD6rqWoCquirJS4Efjq0ySZI0p1HXwL8A7DnUth74/PKWI0mSRjHnGniSVw08/DbwkSQfptsD/U7APsB/jrc8SZI0m/mG0O809Pj4/ucdgF8CHwC2GEdRkiRpfnMGuJcQlSRpei3mcqK3Bu4GbDXYXlWfW+6iJEnS/EYK8CRPBd4KXAP8fOCpAu48hrokSdI8Rl0Dfz3wx1V10jiLkSRJoxn1MLJrgFPHWIckSVqEUQP8b4E3JtlunMVIkqTRjBrg3wIeC1yW5Ff97fokvxpjbZIkaQ6jbgN/N3AUcAw33olNkiRNwKgBvi3wyqqqcRYjSZJGM+oQ+r8DB4yzEEmSNLpR18D3BJ6b5K+BywafqKoHL3tVkiRpXqMG+Dv7myRJmgIjBXhVHTnuQiRJ0uhGPZXq0+d6rqqOWL5yJEnSKEYdQh/ege03gLsCnwUMcEmSVtioQ+gPG27r18p3X/aKJEnSgkY9jGw2/wEcvEx1SJKkRRh1G/hw0N8aeApw5bJXJEmSFjTqNvDr6K79Pehi4BnLW44kSRrFqAG+y9Djq6vqB8tdjCRJGs2oO7F9b9yFSJKk0c0b4ElO4aZD54OqqvZe3pIkafVLwMtDaSkWWgN/zxztOwLPp9uZTZIkrbB5A7yqDh98nGRb4OV0O68dA7xqfKVJkqS5jHQceJLbJnk1cB6wPbBHVR1SVReNtTpJkjSreQM8ya2SvBz4Dt1Z1x5YVQdU1bdXpDppDUm6mySNYqFt4N8Fbg68HtgAbJ9k+8EJquqTSy0iyc3797+4qvZNsgtwNLANcAZwQFVds9T5SJK0WiwU4L+g2wv92XM8X8Cuy1DHC4BzgNv2j18HvKmqjk7ydrpTtr5tGeYjSdKqMO8QelXtXFW7zHNbcngn2Qn4I+Bd/eMADwfe309yJPC4pc5HkqTVZCkXM1kubwb+Cri+f7wtcGVVXdc/vojusLWbSHJIkg1JNmzcuHH8lUqSNCUmGuBJ9gUur6rTB5tnmXTW0x1U1WFVtb6q1q9bt24sNUqSNI1GPRf6uDwAeGySfYAt6LaBvxnYOslm/Vr4TsAlE6xRkqSpM9E18Kp6eVXtVFU7A/sDn6yqPwNOAZ7YT3YgcMKESpQkaSpNwzbw2bwUeFGS8+i2iR++wPSSJK0pkx5C/7WqOhU4tb//HWDPSdYjSdI0m9Y1cEmSNA8DXJKkBhngkiQ1yACXJKlBBrgkSQ0ywCVJapABvgp5TWlpZfk7p0kwwCVJapABLklSgwxwSZIaZIBLWjK3AUsrzwCXJKlBBrgkSQ0ywCVJapABLklas1ref8MAlySpQQa4JEkNMsAlSWqQAS5JUoMMcEmSGmSAS5LUIANckqQGGeCSpkLS9jG50kozwCVJapABLklSgwxwSZIaZICrWW4vlbSWGeCSJDXIAJckqUEGuCRJDTLAJUlqkAEuSVKDDHBJkhpkgEuS1CADXJKkBhngkiQ1yACXJKlBBrgkSQ0ywCVJapABLklSgwxwSZIaZIBLktQgA1ySpAYZ4JIkNcgAlySpQQa4Vp1k0hVIWip/jxdmgEuS1CADXJKkBhngkiQ1yACXJKlBBrgkSQ0ywCVJapABLklSgwxwTQWP+ZSkxTHAJUlqkAEuSVKDDHBJkhq02aQLkJbLNG5Hn6mparJ1SFp9XAOXJKlBBrgkSQ0ywCVJapABLknSLJLp3LdmhgEuSVKDDHBJkhpkgEuS1CADXNKymPbthVq7Vuv30gCXJKlBBrgkSQ0ywCVJatBEAzzJnZKckuScJGcneUHfvk2Sk5Kc2/+8/STrlCRp2kx6Dfw64MVVtTuwF/AXSe4JvAw4uap2A07uH0uSpN5EA7yqLq2qM/r7PwXOAXYE9gOO7Cc7EnjcZCqUJGk6TXoN/NeS7AzcFzgN2L6qLoUu5IE7TK4ySZKmz1QEeJKtgOOAF1bVTxbxukOSbEiyYePGjeMrUGOxWo/N1Ohm+w74vdA4rMbv1cQDPMkt6ML7vVV1fN98WZId+ud3AC6f7bVVdVhVra+q9evWrVuZgiVJmgKT3gs9wOHAOVX1xoGnTgQO7O8fCJyw0rVJkjTNNpvw/B8AHAB8LclX+rZXAK8Fjk1yMHAB8KQJ1SctiwSqJvd6LZ6f+drVyrKfaIBX1WeAubZM7L2StUiS1JKJbwOXJEmLZ4BLktQgA3zM1vIlFtdqv6W1wt/xyTLAJUlqkAEuSVKDDHBJkhpkgEuS1CADXJKkBhngkiQ1yACXJKlBBrgkaapM6/Hl01aXAS5JUoMMcEmSGmSAS5LUoElfD1yStArMbB9e6nXvNTrXwCVJapABLklSgwxwSZIaZIBLkjaZ260nxwCXJKlBBrgkSQ0ywCVJapDHgWtFub1M0qiW49jy5TKNf7tcA5ckqUEGuCRJDTLAJUlqkAGuJZvGbUNqm9+ptri8JsMAlySpQQa4JEkNMsAlSWqQAa4lcduXpsG4vofJeL/j/v5oKQxwSZIaZIBLktQgA1ySpAYZ4HNw29TqMO5tmJKmx1r7XTfAJUlqkAEuSVKDDHBJkhpkgE85t+FquazU98jvq7QyDHBJkhpkgEuS1CADfEyWexhx2oclp72+aeBntLwW83muxGe/qZu7/F5oUxngkiQ1yACXJKlBBrgkSQ0ywEfk4VwrZ61/zmu9/6vZalq2q6kvrTLAJUlqkAEuSVKDDHBJkhpkgI/BzLahwW1Ebi9avLXwma2FPm6qTf1shn/vJnFs9qbO0+/DeM18xqvlczbAJUlqkAEuSVKDDHBJkhpkgK9iq2E7zyS3V62Gz28pFur/Sm9bnu21bmteHaZ9mUxrbQa4JEkNMsAlSWqQAS5JUoMM8EWa1m0ho5ir9sVeV3mh7VXLcRzuXO+5nJ//8HsN1jr4c9R5jns73qS/e5M4Nnqx7z/pz2hcRvmdmu37O9/7zDfd4Pst1+/vcvz90Y0Z4JIkNcgAlySpQQa4JEkNMsAnyG0/c1vJz2Y1LIfV0IdRTPt2+JWc12pY5ovtw3JeX2K5Pr9JLgcDXJKkBhngkiQ1yACXJKlBm026gGk2yraNmWmqNn0ei3ntcE1VN7zHYo5XnnntOGtb7POjvv+oy2W49lH7M9cxtsvxWcz1Pst9XvH55jPYPtt3YaG+Tnrb62KOZd6U912o77N9fqPUM+q5E+ZqW+wyHWwfvr/Q6zf179koxrHtejm/A6P8XYTxfkajcg1ckqQGGeCSJDXIAJckqUEG+DKZ9HbBSVjsNr1p0cI5szflXOzjqmOUc2zP9Vp1JvW5TeL709LfhaX8nk1DX6Y2wJM8Ksk3k5yX5GWTrkeSpGkylQGe5ObAvwKPBu4JPDnJPSdblSRJ02MqAxzYEzivqr5TVdcARwP7TbgmSZKmxrQeB74jcOHA44uA3x+eKMkhwCH9w6uSfHMZa9gO+MFsT2zqdW2Xej3c+Y5/nOvY2PmOmZ3l/k36PO3nCx51e9sCn8d2yezLetR5LWa6UY9jXuzxrot83xst6009tnqh2pd6zO5Svjez9GnePs/Xvqnn4N7U+ketbcR9E7YDfrCbfG1LAAAI/UlEQVSU786ktw9vwrKa8+/3qPNazHlAxvD37S6jTDStAT7bx3GTw+ar6jDgsLEUkGyoqvXjeO9ptRb7DGuz3/Z57ViL/V4rfZ7WIfSLgDsNPN4JuGRCtUiSNHWmNcC/BOyWZJckmwP7AydOuCZJkqbGVA6hV9V1SZ4LfBy4OXBEVZ29wmWMZWh+yq3FPsPa7Ld9XjvWYr/XRJ9T03BGdkmStCjTOoQuSZLmYYBLktQgA3zIWjqFa5Lzk3wtyVeSbOjbtklyUpJz+5+3n3SdS5HkiCSXJzlroG3WPqbzL/2yPzPJHpOrfGnm6PehSS7ul/dXkuwz8NzL+35/M8kfTqbqpUlypySnJDknydlJXtC3r9rlPU+fV+2yTrJFki8m+Wrf57/v23dJclq/nI/pd4AmyS37x+f1z+88yfqXVVV56290O8x9G9gV2Bz4KnDPSdc1xv6eD2w31PZ64GX9/ZcBr5t0nUvs44OBPYCzFuojsA/wUbrzEOwFnDbp+pe534cCL5ll2nv23/VbArv0vwM3n3QfNqHPOwB79PdvA3yr79uqXd7z9HnVLut+eW3V378FcFq//I4F9u/b3w48u7//HODt/f39gWMm3YflurkGfmOewrXr75H9/SOBx02wliWrqk8BVww1z9XH/YCjqvMFYOskO6xMpctrjn7PZT/g6Kr6ZVV9FziP7nehKVV1aVWd0d//KXAO3VkdV+3ynqfPc2l+WffL66r+4S36WwEPB97ftw8v55nl/35g72QariW2dAb4jc12Ctf5fhlaV8Ankpzen5YWYPuquhS6Pw7AHSZW3fjM1ce1sPyf2w8XHzGweWTV9bsfJr0v3drZmljeQ32GVbysk9w8yVeAy4GT6EYSrqyq6/pJBvv16z73z/8Y2HZlKx4PA/zGRjqF6yrygKrag+6qb3+R5MGTLmjCVvvyfxtwV+A+wKXA/+nbV1W/k2wFHAe8sKp+Mt+ks7Q12e9Z+ryql3VV/aqq7kN3ls49gd1nm6z/uSr6PBsD/MbW1Clcq+qS/uflwAfofhEumxlG7H9ePrkKx2auPq7q5V9Vl/V/+K4H3skNQ6erpt9JbkEXZO+tquP75lW9vGfr81pY1gBVdSVwKt028K2TzJycbLBfv+5z//ztGH3z0lQzwG9szZzCNcmWSW4zcx/4A+Asuv4e2E92IHDCZCocq7n6eCLw1H7v5L2AH88Mva4GQ9t3H0+3vKHr9/793rq7ALsBX1zp+paq3655OHBOVb1x4KlVu7zn6vNqXtZJ1iXZur9/K+ARdNv+TwGe2E82vJxnlv8TgU9Wv0db8ya9F9203ej2TP0W3TaVv550PWPs5650e6N+FTh7pq9024ZOBs7tf24z6VqX2M/30Q0hXkv3n/jBc/WRbqjtX/tl/zVg/aTrX+Z+v7vv15l0f9R2GJj+r/t+fxN49KTr38Q+P5BuaPRM4Cv9bZ/VvLzn6fOqXdbAvYEv9307C3hl374r3T8j5wH/Bdyyb9+if3xe//yuk+7Dct08laokSQ1yCF2SpAYZ4JIkNcgAlySpQQa4JEkNMsAlSWqQAS5NoSSvSPKuFZzfZ5Pcd6Xmt6mS7JykBk7YMdd0j01y9ErVJU3CvL8EksYjyVUDD28N/BL4Vf/4mVX1mhWs5THAT6vqyys1z3GrqhOTvCbJvavqzEnXI42Da+DSBFTVVjM34ALgMQNt713hcp5Fd+KP1eZ9wCELTiU1ygCXplCSQ5O8p78/M2x8UJILk/woybOS/F5/takrk7x16PVPT3JOP+3Hk9xljvlsTncZxv8ZaNszyYYkP0lyWZLBU3TuleRz/Ty/muShA89tk+Tfk1zSz/eDA889I8l5Sa5IcmKSOw48V31/zu1f968zl3vsrzr1hiQ/SPId4I+G6n9aku8k+WmS7yb5s4GnTx2eXlpNDHCpHb9Pd+7qPwXeTHdKzEcA9wL+JMlDAJI8DngF8ARgHfBpurXR2ewGXF9VFw20vQV4S1Xdlu6KVsf277sj8GHgH4BtgJcAxyVZ17/u3XSbA+5Fd8nON/WvezjwT8CfADsA3wOGt0/vC/we8Dv9dH/Ytz+jf+6+wHpuONf1zDn8/4XudKC3Ae5PdyrRGecAOye57Rx9l5pmgEvteHVV/aKqPgFcDbyvqi6vqovpQnpmJ7RnAv9UVedUd/3j1wD3mWMtfGvgp0Nt1wJ3S7JdVV1VVV/o258CfKSqPlJV11fVScAGYJ/+4hmPBp5VVT+qqmuramat/s+AI6rqjKr6JfBy4H799atnvLaqrqyqC+guSnGfvv1PgDdX1YVVdQXdPwKDrgd+K8mtqurSqjp74LmZfm0924cptc4Al9px2cD9n8/yeKv+/l2At/TD3FfSXToxwI6zvOePgNsMtR0M3B34RpIvJdl34H2fNPO+/Xs/kG6t+k7AFVX1o1nmcUe6tW4Aquoq4IdD9Xx/4P7PBvpyR+DCgecG3+dqutGIZwGXJvlwknsMTDvTrytnqUlqngEurT4X0u3JvvXA7VZV9blZpj2X7qqUvw7Tqjq3qp5MNwz+OuD9/XD1hcC7h953y6p6bf/cNjOXeRxyCV34A78e+t4WuHiEvlzKja9ffefBJ6vq41X1SLp/Ir5Bd+3rGbsD51fVT0aYj9QcA1xafd4OvDzJvQCS3C7Jk2absKquBf4beMhMW5KnJFlXVddzw9rrr4D3AI9J8of9zmVbJHlokp2qu472R4F/S3L7JLdI8uD+tf8JHJTkPkluSTekf1pVnT9CX44Fnp9kpyS3B142UOf2/fHeW9IdhncVNxyKR9+nj44wD6lJBri0ylTVB+jWnI9O8hO6ayY/ep6XvAM4YODxo4Cz+2PV3wLs3297vxDYj24HuY10a93/mxv+jhxAt/38G8DlwAv7ek4G/hY4jm6N+q7A/iN2553Ax+muW38GcPzAczcDXky3hn8FXWA/Z+D5J/d9k1YlrwcuiSSfAZ63Wk7m0p+c5oCq+pNJ1yKNiwEuSVKDHEKXJKlBBrgkSQ0ywCVJapABLklSgwxwSZIaZIBLktQgA1ySpAb9f1c/6ZyM+MlwAAAAAElFTkSuQmCC\n",
"text/plain": [
"<Figure size 504x504 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"import math\n",
"import numpy as np\n",
"\n",
"# Get list of Time Stamps\n",
"timeStamps = []\n",
"for line in packets:\n",
" timeStamps.append(float(re.findall(r'\\d+\\.\\d+', line)[0]))\n",
"\n",
"# Specify the bins for each second\n",
"bins = np.linspace(0, math.ceil(timeStamps[-1]), math.ceil(timeStamps[-1]))\n",
"\n",
"# Plot the histogram showing web traffic per second\n",
"plt.rcParams['figure.figsize'] = [7, 7]\n",
"plt.hist(timeStamps, bins, facecolor = 'blue')\n",
"plt.title('Skype and IRC WebTraffic', fontsize = 14)\n",
"plt.xlabel('Time (seconds)', fontsize = 12)\n",
"plt.ylabel('Number of Packets', fontsize = 12)\n",
"_ = plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem 3: Retrieving Data from the Web\n",
"#### Time Spent: 3 hours\n",
"#### ([Back to Top](#Table-of-Contents))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this problem, we'll scrape data from Wikipedia using BeautifulSoup. Documentation for BeauitfulSoup can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/. As mentioned in lecture, there is another package, called requests, which is becoming quite popular, which you are welcome to use for this problem instead, if you wish. Documentation for the requests package can be found at http://docs.python-requests.org/en/master/ .\n",
"\n",
"Suppose you are trying to choose a city to vacation in. A major factor in your decision is weather. Conveniently, lots of weather information is present in the Wikipedia articles for most world cities. Your job in this problem is to use `BeautifulSoup` to retrieve weather information from Wikipedia articles. We should note that in practice, such information is typically more easily obtained from, for example, the National Oceanic and Atmospheric Administration (NOAA), in the case of American cities, and from analogous organizations in other countries.\n",
"1. Look at a few Wikipedia pages corresponding to cities. For example:\n",
" * https://en.wikipedia.org/wiki/Ann_Arbor,_Michigan\n",
" * https://en.wikipedia.org/wiki/Buenos_Aires\n",
" * https://en.wikipedia.org/wiki/Harbin\n",
"\n",
"Note that most city pages include a table titled something like \"Climate data for \\[Cityname\\] (normals YYYY-YYYY, extremes YYYY-YYYY)\" Find a Wikipedia page for a city that includes such a table (such as one of the three above). In your jupyter notebook, open the URL and read the HTML using either `urllib` or `requests`, and parse it with `BeautifulSoup` using the standard parser, `html.parser`. Have a look at the parsed HTML and find the climate data table, which will have the tag `table` and will contain a child tag `th` containing a string similar to\n",
"\n",
"`Climate data for \\[Cityname\\] (normals YYYY-YYYY, extremes YYYY-YYYY)`. \n",
"\n",
"Find the node in the `BeautifulSoup` object corresponding to this table. What is the structure of this node of the tree (e.g., how many children does the table have, what are their tags, etc.)? You may want to learn a bit about the structure of HTML tables by looking at the resources available on these websites:\n",
"- https://developer.mozilla.org/en-US/docs/Web/HTML/Element/table\n",
"- https://www.w3schools.com/html/html_tables.asp\n",
"- https://www.w3.org/TR/html401/struct/tables.html"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"response = urllib.request.urlopen('https://en.wikipedia.org/wiki/Ann_Arbor,_Michigan')\n",
"parsed = BeautifulSoup(response, 'html.parser')\n",
"soup = parsed.find('table', {'class': 'wikitable collapsible'})\n",
"\n",
"# Direct Children of tBody\n",
"tBodyChildren = [child.name for child in soup.tbody.children if child != '\\n']\n",
"\n",
"# All the children of tBody\n",
"tBodyTotChildren = [child.name for child in soup.tbody.descendants if child.name != None]\n",
"\n",
"# Number of direct children of tBody\n",
"n_tBodyChildren = len(tBodyChildren)\n",
"\n",
"# Total number of children of tBodyTotChildren\n",
"n_tBodyTotChildren = len(tBodyTotChildren)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- *A `table` node has only one `tbody` tag. This `tbody` tag has 12 children which represent its table rows. Each table row even more children of their own. In total, the `tbody` tag has 252 total number of children.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Write a function `retrieve_climate_table` that takes as its only argument a Wikipedia URL, and returns the `BeautifulSoup` object corresponding to the climate data table (if it exists in the page) and returns `None` if no such table exists on the page. You should check that the URL is retrieved successfully, and raise an error if `urllib2` fails to successfully read the website. You may notice that some city pages include more than one climate data table or several nested tables (see, for example, https://en.wikipedia.org/wiki/Los_Angeles). In this case, your function may arbitrarily choose one of the tables to return as a `BeautifulSoup` object. **Note:** a good way to check for edge cases is to test your script on the Wikipedia pages for a few of your favorite cities. The pages for Los Angeles, Hyderabad and Boston will give good examples of edge cases that you should be able to handle, but note that these are by no means exhaustive of all the possible edge cases. **Hint:** make use of the `contents` attribute of the `BeautifulSoup` objects and the ability to change the elements of the contents list to Unicode."
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [],
"source": [
"import urllib\n",
"\n",
"def retrieve_climate_table(url):\n",
" try:\n",
" response = urllib.request.urlopen(url)\n",
" except:\n",
" raise urllib.error.URLError('Unable to read website!')\n",
" parsed = BeautifulSoup(response, 'html.parser')\n",
" return(parsed.find('table', {'class': 'wikitable collapsible'}))\n",
"\n",
"# s = 'https://en.wikipedia.org/wiki/Los_Angeles'\n",
"# parsed = retrieve_climate_table(s)\n",
"# parsed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. As you look at some of the climate data tables, you may notice that different cities' tables contain different information. For example, not all cities include snowfall data. Write a function `list_climate_table_row_names` that takes as its only argument a Wikipedia URL and returns a list of the row names of the climate data table, or returns `None` if no such table exists. The list returned by your function should, ideally, consist solely of Python strings (either Unicode or ASCII), and should not include any `BeautifulSoup` objects or HTML (**Hint:** see the `BeautifulSoup` method `get_text()`). The list returned by your script should *not* include an entry corresponding to the `Climate data for`... row in the table. **Second hint:** you are looking for HTML table header (`th`) objects. The HTML attribute `scope` is your friend here, because in the context of an HTML table it tells you when a `th` tag is the header of a row or a column."
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
"def list_climate_table_row_names(url):\n",
" climateTable = retrieve_climate_table(url)\n",
" tableRows = [line for line in climateTable.find_all('th') if line.get('scope') == 'row']\n",
" rowNames = [row.get_text().replace('\\n', '') for row in tableRows]\n",
" return(rowNames)\n",
"\n",
"# url = 'https://en.wikipedia.org/wiki/Los_Angeles'\n",
"# list_climate_table_row_names(url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"4. The next natural step would be to write a function that takes a URL and a row name and retrieves the data from that row of the climate data table (if the table exists and has that row name). Doing this would require some complicated string wrangling to get right, so I'll spare you the trouble. Instead, please **briefly** describe either in pseudo code or in plain English how you would accomplish this, using the two functions you wrote above and the tools available to you in the `BeautifulSoup` package. **Note:** just to be clear, you **do not** have to write any code for this last step."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* For each `th` tag that has a row attribute, we can extract the climate data from each `td` tag that follows it. We can use a regular expression to get the temperature stored inside of each `td` tag. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem 4: Relational Databases and SQL\n",
"#### Time Spent: 2 hours\n",
"#### ([Back to Top](#Table-of-Contents))\n",
"In this problem, you'll interact with a toy SQL database using Python's built-in `sqlite3` package. Documentation can be found at https://docs.python.org/3/library/sqlite3.html . For this problem, we'll use a popular toy SQLite database, called Chinook, which represents a digital music collection. See the documentation at \n",
"\n",
"https://github.com/lerocha/chinook-database/blob/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite \n",
"\n",
"for a more detailed explanation. We'll use the `.sqlite` file `Chinook_Sqlite.sqlite`, which you should download from the GitHub page above. **Note:** Don't forget to save the file in the directory that you're going to compress and hand in, and make sure that you use a relative path when referring to the file, so that when we try to run your code on our machines the file path will still work! \n",
"1. Load the database using the Python `sqlite3` package. How many tables are in the database? Save the answer in the variable `n_tables`."
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [],
"source": [
"import sqlite3\n",
"\n",
"conn = sqlite3.connect('Chinook_Sqlite.sqlite')\n",
"c = conn.cursor()\n",
"\n",
"tables = []\n",
"for table in c.execute(\"SELECT name FROM sqlite_master WHERE type = 'table'\"):\n",
" tables.append(table)\n",
"\n",
"n_tables = len(tables)\n",
"# n_tables"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. What are the names of the tables in the database? Save the answer as a list of strings, `table_names`. **Note:** you should write Python `sqlite3` code to answer this; don't just look up the answer in the documentation!"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# SQL query is done in previous problem\n",
"table_names = [tup[0] for tup in tables]\n",
"# table_names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. Write a function `list_album_ids_by_letter` that takes as an argument a single character and returns a list of the primary keys of all the albums whose titles start with that character. Your function should ignore case, so that the inputs \"a\" and \"A\" yield the same results. Include error checking that raises an error in the event that the input is not a single character."
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [],
"source": [
"def list_album_ids_by_letter(char):\n",
" # Perform Error-checking\n",
" if not isinstance(char, str) or len(char) != 1:\n",
" raise ValueError('Should provide a single character!')\n",
"\n",
" c.execute('SELECT AlbumId FROM Album WHERE Title LIKE \"' + char + '%\"')\n",
" return([tup[0] for tup in c.fetchall()])\n",
"\n",
"# list_album_ids_by_letter('a')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"4. Write a function `list_song_ids_by_album_letter` that takes as an argument a single character and returns a list of the primary keys of all the songs whose album names begin with that letter. Again, your function should ignore case and perform error checking as in `list_album_ids_by_letter`. (again ignoring case). **Hint:** you'll need a JOIN statement here. Don't forget that you can use the `cursor.description` attribute to find out about tables and the names of their columns."
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"def list_song_ids_by_album_letter(char):\n",
" # Perform Error-checking\n",
" if not isinstance(char, str):\n",
" raise TypeError('Input should be a string!')\n",
" elif len(char) != 1:\n",
" raise ValueError('Input should be a single letter!')\n",
" \n",
" script = 'SELECT TrackId FROM Track JOIN (SELECT AlbumId FROM Album WHERE Title LIKE \"' + char + '%\") t_album ON t_album.AlbumId = Track.AlbumId' \n",
" c.execute(script)\n",
" return([tup[0] for tup in c.fetchall()])\n",
"\n",
"# list_song_ids_by_album_letter('A')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"5. Write a function `total_cost_by_album_letter` that takes as an argument a single character and returns the cost of buying every song whose album begins with that letter. This cost should be based on the tracks' unit prices, so that the cost of buying a set of tracks is simply the sum of the unit prices of all the tracks in the set. Again your function should ignore case and perform appropriate error checking."
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [],
"source": [
"def total_cost_by_album_letter(char):\n",
" # Perform Error-checking\n",
" if not isinstance(char, str):\n",
" raise TypeError('Input should be a string!')\n",
" elif len(char) != 1:\n",
" raise ValueError('Input should be a single letter!')\n",
" \n",
" char.upper()\n",
" script = 'SELECT UnitPrice FROM Track JOIN (SELECT AlbumId FROM Album WHERE Title LIKE \"' + char + '%\") t_album ON t_album.AlbumId = Track.AlbumId' \n",
" c.execute(script)\n",
" return(sum([tup[0] for tup in c.fetchall()]))\n",
"\n",
"# total_cost_by_album_letter('F')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment