@gtfierro
Created February 4, 2014 15:49
Answers for Python, Week 2
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Week 2: Introduction to Python (Solutions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're going to go over the assignment from Week 2. If you want to follow along, download the Indiegogo dataset from [http://fierro.me/data/indiegogo_subset.json](http://fierro.me/data/indiegogo_subset.json)\n",
"\n",
"I'll be explaining the answers one by one, line by line, in this format. The full scripts for each question can be found at the end."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's get the dataset into a format that's understandable by Python. Before, when we were working with CSV (comma-separated values) files, we had to manually split the file to find the lines and then split each line on the commas to get access to individual data points. With the JSON format, we can simply load the file into Python as a dictionary."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# this next line is an example of importing a module (an external library) that wraps up\n",
"# all the code we need for reading the file in a simple interface\n",
"import json\n",
"indiegogo = json.load(open('indiegogo_subset.json'))\n",
"print len(indiegogo.keys())"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1000\n"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check, I usually print out how many keys are in a dictionary to make sure the file loaded in correctly. Remember, dictionaries contain unique keys, so because we see `1000` above, we know that our file contains 1000 unique Indiegogo campaign records.\n",
"\n",
"Before we go further, let's see what a single one of these indiegogo records looks like:"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"{\"http://www.indiegogo.com/projects/help-chippy-get-a-new-camera\": {\n",
"    \"category\": \"Film\",\n",
"    \"campaign_eta\": \"0\",\n",
"    \"end_date\": \"June 22, 2013 (11:59pm PT)\",\n",
"    \"team_info\": [\n",
"        [\"Evan Cruz\", \"http://www.indiegogo.com/individuals/3088171\"]\n",
"    ],\n",
"    \"page_num\": \"4352\",\n",
"    \"amount_raised\": \"990\",\n",
"    \"campaign_title\": \"Help Chippy get a new camera\",\n",
"    \"cache_file_name\": \"93596ca0b61f4845a691cd89913b183f\",\n",
"    \"location\": \"Plantation, Florida, United States\",\n",
"    \"target_amount\": \"2500\",\n",
"    \"num_funders\": \"53\",\n",
"    \"perk_info\": [\n",
"        [\"\", \"18\", \"None\"],\n",
"        [\"\", \"0\", \"None\"]\n",
"    ],\n",
"    \"start_date\": \"April 23, 2013\",\n",
"    \"currency_code\": \"USD\"\n",
"}}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Starting from the outside and working our way in, we can see that the outermost **key** is a unique URL for an Indiegogo project: `\"http://www.indiegogo.com/projects/help-chippy-get-a-new-camera\"`. The **value** linked to that key is another dictionary, which contains keys such as `category`, `campaign_eta`, `location`, `perk_info`, etc. Each of these internal keys also has a matched value. Most of these are strings, but you can see a couple (`team_info` and `perk_info`) have lists of lists as their values.\n",
"\n",
"Our outermost dictionary is called `indiegogo`, so if I wanted to get access to the dictionary above, I would just use the code"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"indiegogo[\"http://www.indiegogo.com/projects/help-chippy-get-a-new-camera\"]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": [
"{u'amount_raised': u'990',\n",
" u'cache_file_name': u'93596ca0b61f4845a691cd89913b183f',\n",
" u'campaign_eta': u'0',\n",
" u'campaign_title': u'Help Chippy get a new camera',\n",
" u'category': u'Film',\n",
" u'currency_code': u'USD',\n",
" u'end_date': u'June 22, 2013 (11:59pm PT)',\n",
" u'location': u'Plantation, Florida, United States',\n",
" u'num_funders': u'53',\n",
" u'page_num': u'4352',\n",
" u'perk_info': [[u'', u'18', u'None'], [u'', u'0', u'None']],\n",
" u'start_date': u'April 23, 2013',\n",
" u'target_amount': u'2500',\n",
" u'team_info': [[u'Evan Cruz',\n",
" u'http://www.indiegogo.com/individuals/3088171']]}"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and to access a single value inside that internal dictionary, such as the location, I would add a second key:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"indiegogo[\"http://www.indiegogo.com/projects/help-chippy-get-a-new-camera\"][\"location\"]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": [
"u'Plantation, Florida, United States'"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These relationships exist for every key in the primary dictionary, `indiegogo`.\n",
"\n",
"**NOTE**: the small `u` in front of the strings (e.g. `u'Evan Cruz'`) means that they are Unicode strings, as opposed to the plain byte strings (`str`) that Python 2 uses by default. For our purposes today, Unicode strings act the same way as regular strings, so don't worry about the leading `u`."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Question 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Look through the data and figure out a way to extract the country of origin from the location field. Keep in mind that not all records may have a location.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From above, we know that if we have the URL of a project, we can access the location string:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"url = \"http://www.indiegogo.com/projects/help-chippy-get-a-new-camera\"\n",
"internal_dictionary = indiegogo[url] # use the variable 'url' as a shortcut\n",
"print internal_dictionary['location']"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Plantation, Florida, United States\n"
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given this string, how would we extract the portion that refers to the country? Because it's a string, we could certainly just take the last 13 characters. We use `[-13:]` to mean \"start 13 characters from the end, and then go to the end\"."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print internal_dictionary['location'][-13:]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"United States\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But what if we have a country that's not the United States? If we look at the third URL in the data file, \"http://www.indiegogo.com/projects/dorothy-documentary\", we can print its location and then see whether the same slice works:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"url = \"http://www.indiegogo.com/projects/dorothy-documentary\"\n",
"internal_dictionary = indiegogo[url]\n",
"print internal_dictionary['location']\n",
"print internal_dictionary['location'][-13:]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Gabriola, Canada\n",
"riola, Canada\n"
]
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whoops, that didn't work. We need a more generlized way of getting the country. Notice that each of the location strings has commas in it that delineate the components of the location: city, optionally the state, and the country. We can use the commas to separate the string into a list of chunks, so `'Plantation, Florida, United States'` will become `['Plantation', ' Florida', ' United States']`. Note that the spaces before 'Florida' and 'United States' are still preserved...we'll have to get rid of those later.\n",
"\n",
"Let's see if Python has any builtin tools that will help us divide the string up on the commas. We can use the `dir` function to see a list of all methods that we can use on strings:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"location = internal_dictionary['location']\n",
"dir(location)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 11,
"text": [
"['__add__',\n",
" '__class__',\n",
" '__contains__',\n",
" '__delattr__',\n",
" '__doc__',\n",
" '__eq__',\n",
" '__format__',\n",
" '__ge__',\n",
" '__getattribute__',\n",
" '__getitem__',\n",
" '__getnewargs__',\n",
" '__getslice__',\n",
" '__gt__',\n",
" '__hash__',\n",
" '__init__',\n",
" '__le__',\n",
" '__len__',\n",
" '__lt__',\n",
" '__mod__',\n",
" '__mul__',\n",
" '__ne__',\n",
" '__new__',\n",
" '__reduce__',\n",
" '__reduce_ex__',\n",
" '__repr__',\n",
" '__rmod__',\n",
" '__rmul__',\n",
" '__setattr__',\n",
" '__sizeof__',\n",
" '__str__',\n",
" '__subclasshook__',\n",
" '_formatter_field_name_split',\n",
" '_formatter_parser',\n",
" 'capitalize',\n",
" 'center',\n",
" 'count',\n",
" 'decode',\n",
" 'encode',\n",
" 'endswith',\n",
" 'expandtabs',\n",
" 'find',\n",
" 'format',\n",
" 'index',\n",
" 'isalnum',\n",
" 'isalpha',\n",
" 'isdecimal',\n",
" 'isdigit',\n",
" 'islower',\n",
" 'isnumeric',\n",
" 'isspace',\n",
" 'istitle',\n",
" 'isupper',\n",
" 'join',\n",
" 'ljust',\n",
" 'lower',\n",
" 'lstrip',\n",
" 'partition',\n",
" 'replace',\n",
" 'rfind',\n",
" 'rindex',\n",
" 'rjust',\n",
" 'rpartition',\n",
" 'rsplit',\n",
" 'rstrip',\n",
" 'split',\n",
" 'splitlines',\n",
" 'startswith',\n",
" 'strip',\n",
" 'swapcase',\n",
" 'title',\n",
" 'translate',\n",
" 'upper',\n",
" 'zfill']"
]
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There's a lot there, but you can ignore everything that starts and ends with two underscores `__`. It looks like there's a method called `split`. Let's find out more about that, by using the `help` function."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"help(location.split)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Help on built-in function split:\n",
"\n",
"split(...)\n",
" S.split([sep [,maxsplit]]) -> list of strings\n",
" \n",
" Return a list of the words in S, using sep as the\n",
" delimiter string. If maxsplit is given, at most maxsplit\n",
" splits are done. If sep is not specified or is None, any\n",
" whitespace string is a separator and empty strings are\n",
" removed from the result.\n",
"\n"
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, if we give the split function some delimiter string (like a comma ','), then it will return a list of the generated substrings, like so:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print location # the original location\n",
"print location.split(',') # splitting on commas"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Gabriola, Canada\n",
"[u'Gabriola', u' Canada']\n"
]
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can do the same with the Florida one:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"location = 'Plantation, Florida, United States'\n",
"print location\n",
"print location.split(',')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Plantation, Florida, United States\n",
"['Plantation', ' Florida', ' United States']\n"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In both cases, the country portion is the last item in the list, so we can use negative indexing to retrieve it:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print location.split(',')[-1]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" United States\n"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There's still that weird space in front of it, though. There are two simple ways we can fix this. The first is simply by splitting on a comma and a space instead of just a comma:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print location\n",
"print location.split(', ')\n",
"print location.split(', ')[-1]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Plantation, Florida, United States\n",
"['Plantation', 'Florida', 'United States']\n",
"United States\n"
]
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"or we can use the builtin `strip` string method to remove whitespace from the left and right sides of a string:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print location\n",
"print location.split(',')\n",
"print location.split(',')[-1].strip()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Plantation, Florida, United States\n",
"['Plantation', ' Florida', ' United States']\n",
"United States\n"
]
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I'll be using the former method, as it is a little bit cleaner.\n",
"\n",
"Now that we know how to extract the country from a single location string, let's try doing it for all of the Indiegogo projects. Remember that using the `keys()` method on the dictionary `indiegogo` will give us a list of all URLs. We can loop through this list, and use our country-finding code on each of them in turn."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"all_urls = indiegogo.keys()\n",
"first_ten_urls = all_urls[:10] # I'll use the first 10 to simplify printing them out here\n",
"print first_ten_urls"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[u'http://www.indiegogo.com/projects/help-chippy-get-a-new-camera', u'http://www.indiegogo.com/projects/fatherhood101', u'http://www.indiegogo.com/projects/dorothy-documentary', u'http://www.indiegogo.com/projects/colorado-farm-aid', u'http://www.indiegogo.com/projects/conjurers-the-black-magician-s-contribution-to-the-conjuring-arts', u'http://www.indiegogo.com/projects/guide-her-home', u'http://www.indiegogo.com/projects/anhedonia-a-modern-fairy-tale--3', u'http://www.indiegogo.com/projects/abotani-a-short-animated-folktale-from-arunachal-pradesh', u'http://www.indiegogo.com/projects/anhedonia-a-modern-fairy-tale--4', u'http://www.indiegogo.com/projects/creative-growers-organic-farm']\n"
]
}
],
"prompt_number": 18
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for url in first_ten_urls: # the 'url' variable will change every time the loop iterates\n",
"    print url\n",
"    internal_dictionary = indiegogo[url]\n",
"    location = internal_dictionary['location']\n",
"    country = location.split(', ')[-1]\n",
"    print country"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"http://www.indiegogo.com/projects/help-chippy-get-a-new-camera\n",
"United States\n",
"http://www.indiegogo.com/projects/fatherhood101\n",
"United States\n",
"http://www.indiegogo.com/projects/dorothy-documentary\n",
"Canada\n",
"http://www.indiegogo.com/projects/colorado-farm-aid\n",
"United States\n",
"http://www.indiegogo.com/projects/conjurers-the-black-magician-s-contribution-to-the-conjuring-arts\n",
"United States\n",
"http://www.indiegogo.com/projects/guide-her-home\n",
"United States\n",
"http://www.indiegogo.com/projects/anhedonia-a-modern-fairy-tale--3\n",
"United States\n",
"http://www.indiegogo.com/projects/abotani-a-short-animated-folktale-from-arunachal-pradesh\n",
"United Kingdom\n",
"http://www.indiegogo.com/projects/anhedonia-a-modern-fairy-tale--4\n",
"United States\n",
"http://www.indiegogo.com/projects/creative-growers-organic-farm\n",
"United States\n"
]
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just change `first_ten_urls` to `all_urls` in the code above to do the same process for all of the URLs instead of just the first 10. I print out the url every time the for-loop loops so that you can see how it changes.\n",
"\n",
"To simplify this code, how about we put the country-finding code inside a function? The function will take a URL as an input, and then return the portion of the location string that refers to the country."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def get_country(url):\n",
"    internal_dictionary = indiegogo[url]\n",
"    location = internal_dictionary['location']\n",
"    country = location.split(', ')[-1]\n",
"    return country\n",
"\n",
"# we can even simplify the function further...\n",
"#def get_country(url):\n",
"#    return indiegogo[url]['location'].split(', ')[-1]\n",
"\n",
"# call the function like this\n",
"print get_country(\"http://www.indiegogo.com/projects/help-chippy-get-a-new-camera\")\n",
"\n",
"# now simplify the for loop!\n",
"for url in first_ten_urls:\n",
"    print get_country(url)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"United States\n",
"United States\n",
"United States\n",
"Canada\n",
"United States\n",
"United States\n",
"United States\n",
"United States\n",
"United Kingdom\n",
"United States\n",
"United States\n"
]
}
],
"prompt_number": 20
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Question 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Use a Python data structure to organize the URLs of the projects by their country of origin.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we have to decide what we mean by \"organize\" (or, rather, what *I* meant by organize when I wrote this at 3 AM). It makes sense that there are two types of relations: being able to easily find the country for a given URL (which is what we already have), and being able to easily find all URLs that have the same country. We only know a few different types of data structures at this point, so we have to figure out which of those best suits our needs. Dictionaries allow us to look up information by using a key -- couldn't this key be our country, and then the corresponding value could be a list of all URLs for that country? Sounds good to me!\n",
"\n",
"We can declare a dictionary variable that will hold our information, and then use the same for-loop structure to sequentially add our URLs and countries to that dictionary."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"country_lookup = {} # can also use: country_lookup = dict()\n",
"for url in first_ten_urls:\n",
"    country = get_country(url)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 21
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is an excellent start, but how do we create and maintain the collections of URLs for each country? We know that the key into the `country_lookup` dictionary will be the string of the country, but what will our values be? Lists seem the most appropriate data structure to maintain a collection of URLs, so we'll use those. A regular dictionary in Python cannot figure out that it needs to add things with a common key to the same list, so as we loop through our data and get a new URL and a new country, we'll need to tell the `country_lookup` dictionary to create a list for a country if it doesn't exist already, and then add our URL to it. We can use the boolean expression `country not in country_lookup.keys()` to return False if we already have a list for that country, and True if we don't:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"country_lookup = {}\n",
"for url in first_ten_urls:\n",
"    country = get_country(url)\n",
"    if country not in country_lookup.keys(): # don't have a list for this country yet!\n",
"        print 'Creating list for', country\n",
"        country_lookup[country] = [] # create an empty list\n",
"    country_lookup[country].append(url) # at the end of the loop, append the URL to the relevant list\n",
"print country_lookup.keys()\n",
"print # print by itself just creates whitespace\n",
"print 'United States:', country_lookup['United States']\n",
"print\n",
"print 'Canada:', country_lookup['Canada']\n",
"print\n",
"print 'United Kingdom:', country_lookup['United Kingdom']"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Creating list for United States\n",
"Creating list for Canada\n",
"Creating list for United Kingdom\n",
"[u'United States', u'Canada', u'United Kingdom']\n",
"\n",
"United States: [u'http://www.indiegogo.com/projects/help-chippy-get-a-new-camera', u'http://www.indiegogo.com/projects/fatherhood101', u'http://www.indiegogo.com/projects/colorado-farm-aid', u'http://www.indiegogo.com/projects/conjurers-the-black-magician-s-contribution-to-the-conjuring-arts', u'http://www.indiegogo.com/projects/guide-her-home', u'http://www.indiegogo.com/projects/anhedonia-a-modern-fairy-tale--3', u'http://www.indiegogo.com/projects/anhedonia-a-modern-fairy-tale--4', u'http://www.indiegogo.com/projects/creative-growers-organic-farm']\n",
"\n",
"Canada: [u'http://www.indiegogo.com/projects/dorothy-documentary']\n",
"\n",
"United Kingdom: [u'http://www.indiegogo.com/projects/abotani-a-short-animated-folktale-from-arunachal-pradesh']\n"
]
}
],
"prompt_number": 22
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Question 3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**What are the 3 most common countries?**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see if we can get a solution working on the first ten URLs we've been working with so far. It is often very helpful to write and test your code on a subset of the data that you know before you test it on the full dataset. Here, we know that the United States is the most common country, so let's concentrate on writing code that tells us that, and then we can generalize it to the top 3 countries and *then* we can use our code on the full dataset.\n",
"\n",
"In the last question, we generated a convenient dictionary where the keys were the countries and the values were lists of all URLs that were for projects from that country. The length of each list will tell us how many projects are from that country. By looping through the dictionary, we can keep track of the largest value without having to worry about sorting.\n",
"\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"largest_count = 0 # initialize to 0 so we can use it to compare\n",
"for country in country_lookup.keys(): # loop through the countries\n",
"    count = len(country_lookup[country]) # get the number of URLs for that country\n",
"    if count > largest_count: # if it's more than our current largest...\n",
"        largest_country = country # then save the country,\n",
"        largest_count = count # and save the count\n",
"print largest_country, largest_count # after we're done, we have the largest country!"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"United States 8\n"
]
}
],
"prompt_number": 23
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's generalize this to keep track of the top 3, or top 4 or top $N$ values. The tricky part is not finding the top 3 counts, but knowing which countries are associated with those counts. Python offers us faster ways of doing this (and I'll show you right after), but a simple method that uses the basic aspects of Python we've learned thus far uses the `sort` method of lists."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"counts = []\n",
"country_count = []\n",
"for country in country_lookup.keys():\n",
"    count = len(country_lookup[country])\n",
"    counts.append(count)\n",
"    country_count.append((country, count)) # use a tuple to keep track of which count goes with which country\n",
"counts.sort() # sorts in order from lowest to highest\n",
"print counts[-3:] # print last 3 from sorted list (these are the 3 highest)\n",
"# now, iterate through counts[-3:] and find the countries with the associated counts from country_count\n",
"for count in counts[-3:]:\n",
"    print count\n",
"    for x in country_count: # x[0] is the country, x[1] is the count\n",
"        if x[1] == count: # we've found the country!\n",
"            print x[0]\n",
"            break # break because we don't need to search anymore"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[1, 1, 8]\n",
"1\n",
"Canada\n",
"1\n",
"Canada\n",
"8\n",
"United States\n"
]
}
],
"prompt_number": 24
},
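{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that Canada is printed twice above: Canada and the United Kingdom both have a count of 1, and the inner loop stops at the first country whose count matches, so countries with tied counts are reported under the same name. The counts themselves are still correct. Now we can just re-run all this code with `all_urls` instead of just `first_ten_urls`."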
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"country_lookup = {}\n",
"for url in all_urls:\n",
"    country = get_country(url)\n",
"    if country not in country_lookup.keys():\n",
"        country_lookup[country] = []\n",
"    country_lookup[country].append(url)\n",
"counts = []\n",
"country_count = []\n",
"for country in country_lookup.keys():\n",
"    count = len(country_lookup[country])\n",
"    counts.append(count)\n",
"    country_count.append((country, count))\n",
"counts.sort()\n",
"for count in counts[-3:]:\n",
"    print count\n",
"    for x in country_count:\n",
"        if x[1] == count:\n",
"            print x[0]\n",
"            break"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"34\n",
"United Kingdom\n",
"87\n",
"Canada\n",
"709\n",
"United States\n"
]
}
],
"prompt_number": 30
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To give you an idea of the power of Python, here's an example of how it's actually possible to do this in only a single line of code:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sorted([(x[0], len(x[1])) for x in country_lookup.items()], key=lambda x:x[1])[-3:]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 33,
"text": [
"[(u'United Kingdom', 34), (u'Canada', 87), (u'United States', 709)]"
]
}
],
"prompt_number": 33
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick explanation of this code, working from the inside out. The method `country_lookup.items()` will give me a list of tuples of (country, list of urls), one for every key/value pair in the dictionary. The formation `[(x[0], len(x[1])) for x in country_lookup.items()]` creates a new tuple `(x[0], len(x[1]))` for every tuple (named `x`) in the `.items()` list. This basically creates the pairings of (country, number of URLs) that we used above. The `sorted` function returns a new sorted list, as opposed to the `.sort()` method on lists, which sorts them in place. The last bit, `key=lambda x: x[1]`, tells the `sorted` function to sort the list using the second item in each of the created tuples."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Question 4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Find out how many unique categories of projects there are.**\n",
"\n",
"To answer this, we need to first figure out how to get the category of a project. Let's go back to our dictionary called `indiegogo`. Remember that the keys are URLs and the values are the data for that URL? We have the first URL for that dictionary, so we can just use that to see what the data looks like."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"url =\"http://www.indiegogo.com/projects/help-chippy-get-a-new-camera\"\n",
"# if we didn't know this URL, we could get a list of the URLs by using\n",
"# indiegogo.keys(), and then choosing the first one with\n",
"# url = indiegogo.keys()[0]\n",
"indiegogo[url]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 27,
"text": [
"{u'amount_raised': u'990',\n",
" u'cache_file_name': u'93596ca0b61f4845a691cd89913b183f',\n",
" u'campaign_eta': u'0',\n",
" u'campaign_title': u'Help Chippy get a new camera',\n",
" u'category': u'Film',\n",
" u'currency_code': u'USD',\n",
" u'end_date': u'June 22, 2013 (11:59pm PT)',\n",
" u'location': u'Plantation, Florida, United States',\n",
" u'num_funders': u'53',\n",
" u'page_num': u'4352',\n",
" u'perk_info': [[u'', u'18', u'None'], [u'', u'0', u'None']],\n",
" u'start_date': u'April 23, 2013',\n",
" u'target_amount': u'2500',\n",
" u'team_info': [[u'Evan Cruz',\n",
" u'http://www.indiegogo.com/individuals/3088171']]}"
]
}
],
"prompt_number": 27
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Aha! We can see that there is a key for `category`, so we can just access that information using:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print indiegogo[url]['category']"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Film\n"
]
}
],
"prompt_number": 34
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To generate a list of all unique categories, just iterate through all the projects, extract the category, and keep track of all the ones you've seen so far:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"unique_categories = [] # initialize to empty\n",
"for url in indiegogo.keys(): # loop through all projects\n",
"    category = indiegogo[url]['category'] # extract the category\n",
"    if category not in unique_categories: # if we haven't seen it yet...\n",
"        unique_categories.append(category) # then add it to our list!\n",
"print unique_categories # print our list"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[u'Film', u'Community', u'Writing', u'Food', u'Comic', u'Technology', u'Sports', u'Design', u'Education', u'Environment', u'Health', u'Theater', u'Gaming', u'Fashion', u'Politics', u'Music', u'Art', u'Photography', u'Video / Web', u'Transmedia', u'Small Business', u'Animals', u'Religion', u'Dance']\n"
]
}
],
"prompt_number": 35
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Question 5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Group projects into the following categories by amount of money raised: 0-100,000; 100,001-250,000; 250,001-500,000; 500,001-1,000,000; and > 1,000,000**\n",
"\n",
"We should be familiar with this pattern by now. We first find the way to extract the amount raised from each project (lo and behold, it seems like there is a key named `amount_raised`! How convenient!), and then loop through all the projects and categorize them.\n",
"\n",
"What's different this time is that we know what our categories are, so unlike Question 2, where we had to check if a category existed before we created a new list, we can just initialize a dictionary with those categories beforehand. It doesn't matter what we name them, so long as we know what they correspond to. Here, I'm naming them according to the upper bound of the respective category."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"money_categories = {100000: [],\n",
"                    250000: [],\n",
"                    500000: [],\n",
"                    1000000: [],\n",
"                    'hella money': []}"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 47
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dictionary currently stores the amount raised as a string, which isn't going to do us any good when we want to compare it against numbers: as the first cell below shows, the comparison silently gives the wrong answer. Thankfully, Python gives us a simple way of converting a string of a number into an integer, the `int()` function, which the second cell uses:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"amount_raised = indiegogo[url]['amount_raised']\n",
"print amount_raised\n",
"print amount_raised < 100000\n",
"# this gives us False because of how Python compares strings and ints"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"815\n",
"False\n"
]
}
],
"prompt_number": 40
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"amount_raised = int(indiegogo[url]['amount_raised']) # use int() to convert\n",
"print amount_raised\n",
"print amount_raised < 100000 # gives us the correct answer"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"815\n",
"True\n"
]
}
],
"prompt_number": 41
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we just loop through all of the projects, check if they are in the ranges specified above, and then add the URL to the related list:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for url in indiegogo.keys():\n",
"    amount_raised = int(indiegogo[url]['amount_raised'])\n",
"    if amount_raised < 100000:\n",
"        money_categories[100000].append(url)\n",
"    elif amount_raised < 250000:\n",
"        money_categories[250000].append(url)\n",
"    elif amount_raised < 500000:\n",
"        money_categories[500000].append(url)\n",
"    elif amount_raised < 1000000:\n",
"        money_categories[1000000].append(url)\n",
"    else:\n",
"        money_categories['hella money'].append(url)\n",
"# print out the counts. The total is 1000, which is correct\n",
"for upper_bound in money_categories:\n",
"    print upper_bound, len(money_categories[upper_bound])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"100000 999\n",
"250000 0\n",
"hella money 0\n",
"1000000 0\n",
"500000 1\n"
]
}
],
"prompt_number": 48
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}