@mcburton
Last active April 9, 2023 15:13

This is a test for a POGIL style lesson in a Python notebook.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Introduction to Pandas Data Structures\n",
"\n",
"\n",
"\n",
"## Learning Objectives\n",
"\n",
"- Learn about the three pandas data structures: Series, Dataframes, and Indexes\n",
"- Understand what kinds of data can be represented in each of the data structures\n",
"- How to create each of the data structures from Python data structures\n",
"\n",
"\n",
"First step is to import the pandas modules (see previous lesson)\n",
"\n",
"\n",
"\n",
"## Data used in this lesson\n",
"\n",
"snapshot of the information about a few of pittsburgh neighborhoods. \n",
"- Name\n",
"- Population\n",
"- area\n",
"\n",
"The data is formatted as a table in these instructional materaisl. You should expect to copy these values into your code."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# import pandas\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction to Pandas Data Structures\n",
"\n",
"* To understand Pandas, which is hard, it is helpful to start the data structures it adds to Python:\n",
" * Series - For one dimensional data (lists) \n",
" * Dataframe - For two dimensional data (spreadsheets)\n",
" * Index - For naming, selecting, and transforming data within a Pandas Series or Dataframe (column and row names)\n",
" \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Series\n",
"\n",
"* A one-dimensional array of indexed data\n",
"* Kind of like a blend of a Python list and dictionary\n",
"* You can create them from a Python list\n",
"\n",
"### How to create a Series\n",
"\n",
"To create a Series you must use the `pandas.Series()` function and pass it your list-like data as an argument."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.25\n",
"1 0.50\n",
"2 0.75\n",
"3 1.00\n",
"dtype: float64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a regular Python list of floating point numbers\n",
"my_list = [0.25, 0.5, 0.75, 1.0]\n",
"\n",
"# Transform that list into a Series\n",
"data = pd.Series(my_list)\n",
"\n",
"# Display the data in the series\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task\n",
"\n",
"- Look at the output above, you can see the data from the list but you can also see some additional information. Specifically, there are a set of numbers (0-3) and a line that says \"dtype.\"\n",
" - What do you think these additional pieces of information represent?\n",
"- In the code cell below, replicate the code above but replace the list of values with Python Integers instead of floating point numbers."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Create a series of integer numbers\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Look at the output, do you see anything different? The value next to `dtype` should be different.\n",
"\n",
"Unlike Python lists, a Pandas Series should have all data be of the same type."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Task\n",
"\n",
"| Neighborhood | Population | Area |\n",
"| ------------------------- | ---------- | ---- |\n",
"| East Liberty | 5869 | 0.58 |\n",
"| Greenfield | 7294 | 0.78 |\n",
"| Squirrel Hill North | 11363 | 1.22 |\n",
"| Bloomfield | 8442 | 0.70 |\n",
"| Central Business District | 3629 | 0.65 |\n",
"|Data source [2010 Pittsburgh Neighborhood Profiles](https://ucsur.pitt.edu/files/census/UCSUR_SF1_NeighborhoodProfiles_July2011.pdf)|\n",
"\n",
"\n",
"- Create a Python Dictionary with a set of keys and values. From the data table above, create a python dictionary of Neighborhood population. The values should represent population data and the keys should represent a name of the neighborhood.\n",
"- Create a series using that dictionary.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a dictionary and series here\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"#### Answer"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true,
"source_hidden": true
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"East Liberty 5869\n",
"Greenfield 7294\n",
"Squirrel Hill North 11363\n",
"Bloomfield 8442\n",
"Central Business District 3629\n",
"dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Answer\n",
"\n",
"neighborhood_dictionary = {\n",
" \"East Liberty\":5869,\n",
" \"Greenfield\":7294,\n",
" \"Squirrel Hill North\":11363,\n",
" \"Bloomfield\":8442,\n",
" \"Central Business District\":3629\n",
"}\n",
"\n",
"neighborhood_series = pd.Series(neighborhood_dictionary)\n",
"neighborhood_series\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"#### -----------------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Look at the resulting output, how has the data from the dictionary been represented in the series?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Components of a Series\n",
"\n",
"- A sequence of ordered data values\n",
"- An explicit named index for each value\n",
"- an implicit numerical index for each value\n",
"- a data type of all the values in the series\n",
"\n",
"So in the example above\n",
"\n",
"- The *named index* are the names of the neighborhoods\n",
"- the values are the populations of those neighborhoods (in 2010)\n",
"- the datatype is `int64` because all of the values are integer values\n",
"- What you can't see is the implicit numerical index, but we know it is there and we can use it!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Indexing into a Series to extract values\n",
"\n",
"You can index into a series using the same syntax as with Python lists. (See lesson TKTKTK).\n",
"\n",
"Using Python's *indexing* and *slicing* syntax you can extract a specific value or subset of a series, just as you would with a list. \n",
"\n",
"### Tasks - Indexing and Slicing a list\n",
"\n",
"* From the `neighborhood_series` data you created above, using indexing extract the first element"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# put your code here\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"#### Answer"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true,
"source_hidden": true
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"5869"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# answer\n",
"neighborhood_series[0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"#### ---------------------------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* From the `neighborhood_series`, using indexing to extract value for \"Squirrel Hill North\""
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# put your code here\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"#### Answer"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true,
"source_hidden": true
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"11363"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# answer\n",
"neighborhood_series[2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ---------------------------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* From the `neighborhood_series`, use slicing syntax to create a subset of the 2nd through 4th elements."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# put your code here\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Squirrel Hill North 11363\n",
"Bloomfield 8442\n",
"Central Business District 3629\n",
"dtype: int64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# answer\n",
"neighborhood_series[2:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* If you didn't know how many values were in the series, how would you using indexing to get the last item?"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# put your code here\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"#### Answer"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true,
"source_hidden": true
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"3629"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# answer\n",
"neighborhood_series[-1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ---------------------------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using the named index to extract values from a Pandas series \n",
"\n",
"Pandas series can also behave like a Python dictionary, that is, you can look up values by their names rather than position in the sequence. \n",
"\n",
"### Tasks - Extracting values by name\n",
"\n",
"* Extract the value for \"Greenfield\" from the series `neighborhood_series`"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# your code here\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Answer"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true,
"source_hidden": true
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"7294"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# answer\n",
"neighborhood_series[\"Greenfield\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ---------------------------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Extract a subset of values, that is a slice, of from \"Greenfield\" to the \"Central Business District\" but use the names, not the numerical positions. "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# your code here\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Greenfield 7294\n",
"Squirrel Hill North 11363\n",
"Bloomfield 8442\n",
"Central Business District 3629\n",
"dtype: int64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# answer\n",
"neighborhood_series[\"Greenfield\":\"Central Business District\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note how with named indexing the first and last are inclusive, but with numerical indexing the the first number is inclusive and the second number is exclusive."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary - Pandas Series\n",
"\n",
"TODO: summarize all the important points\n",
"- Creating a series with the pandas function `pd.Series(<data>)`\n",
"- If your data is structured as a list, it will be converted to a series with a single datatype\n",
" - If your data types are mixed, the series `dtype` will be \"object\"\n",
"- If your data is structured as a dictionary, your data will have a named index corresponding to the dictionary keys.\n",
" - You can add a named index manaully using the `pd.Series(<data>, index=<list of names>)` (TODO: add to lesson?\n",
"- You can extract values from a Series using the standard Python indexing and slicing syntax.\n",
" - This works for both the named and numerical index\n",
" \n",
"There is a lot more to learn about Pandas Series and we will cover some of those topics in future lessons. But now it is time to introduce the other important data structure for storing and manipulating two dimensional data, the Dataframe."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataframe\n",
"\n",
"Dataframes are a two-dimensional data structure, just like an Excel spreadsheet. They have rows and columns.\n",
"\n",
"What is key to understand is each column is a Dataframe is a Series under the hood.\n",
"\n",
"Let's start with the Pittsburgh neighborhood data table.\n",
"\n",
"| Neighborhood | Population | Area |\n",
"| ------------------------- | ---------- | ---- |\n",
"| East Liberty | 5869 | 0.58 |\n",
"| Greenfield | 7294 | 0.78 |\n",
"| Squirrel Hill North | 11363 | 1.22 |\n",
"| Bloomfield | 8442 | 0.70 |\n",
"| Central Business District | 3629 | 0.65 |\n",
"|Data source [2010 Pittsburgh Neighborhood Profiles](https://ucsur.pitt.edu/files/census/UCSUR_SF1_NeighborhoodProfiles_July2011.pdf)|\n",
"\n",
"And now lets create three lists from each of the columns in the data table above."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['East Liberty',\n",
" 'Greenfield',\n",
" 'Squirrel Hill North',\n",
" 'Bloomfield',\n",
" 'Central Business District']"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"neighborhoods = [\"East Liberty\", \"Greenfield\", \"Squirrel Hill North\", \"Bloomfield\", \"Central Business District\"]\n",
"neighborhoods"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Task\n",
"\n",
"Create two more series, one for the population values (called `population` and one for the area values( called `area`). Make sure the order of the values matches the `neighborhoods` series above!\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# answer\n",
"population = [5869, 7294, 11363, 8442, 3629]\n",
"area = [0.58, 0.78, 1.22, 0.70, 0.65]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating Dataframes using Python lists and dictionaries."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>neighborhoods</th>\n",
" <th>population</th>\n",
" <th>area</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>East Liberty</td>\n",
" <td>5869</td>\n",
" <td>0.58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Greenfield</td>\n",
" <td>7294</td>\n",
" <td>0.78</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Squirrel Hill North</td>\n",
" <td>11363</td>\n",
" <td>1.22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Bloomfield</td>\n",
" <td>8442</td>\n",
" <td>0.70</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Central Business District</td>\n",
" <td>3629</td>\n",
" <td>0.65</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" neighborhoods population area\n",
"0 East Liberty 5869 0.58\n",
"1 Greenfield 7294 0.78\n",
"2 Squirrel Hill North 11363 1.22\n",
"3 Bloomfield 8442 0.70\n",
"4 Central Business District 3629 0.65"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pgh_neighborhood_info = pd.DataFrame({\"neighborhoods\":neighborhoods,\n",
" \"population\": population,\n",
" \"area\": area})\n",
"pgh_neighborhood_info"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Look! We have re-created the table above but now it is loaded as a Pandas Dataframe.\n",
"\n",
"There are other ways to create dataframes, for example maybe your data is more row-centric and you have a list of lists.\n",
"\n",
"```python\n",
"[['East Liberty', 5869, 0.58],\n",
" ['Greenfield', 7294, 0.78],\n",
" ['Squirrel Hill North', 11363, 1.22],\n",
" ['Bloomfield', 8442, 0.7],\n",
" ['Central Business District', 3629, 0.65]]\n",
"```\n",
"### Task\n",
"\n",
"1. Copy the list of lists in the code above and save it to a variable called \"data\".\n",
"2. Create a new list with three string values: \"neighborhood\", \"population\", and \"area\" and save it to a variable called \"column_names\".\n",
"3. Recreate the `pgh_neighborhood_info` dataframe using the `pd.Dataframe()` function. Put the `data` variable as the first positional argument and the `column_names` as the value for the `columns` positional argument. See the section on python functions if you need a hint about positional and keyword arguments. \n",
"4. Display the dataframe and see if it is the same as the output above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# your code here\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>neighborhood</th>\n",
" <th>population</th>\n",
" <th>area</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>East Liberty</td>\n",
" <td>5869</td>\n",
" <td>0.58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Greenfield</td>\n",
" <td>7294</td>\n",
" <td>0.78</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Squirrel Hill North</td>\n",
" <td>11363</td>\n",
" <td>1.22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Bloomfield</td>\n",
" <td>8442</td>\n",
" <td>0.70</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Central Business District</td>\n",
" <td>3629</td>\n",
" <td>0.65</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" neighborhood population area\n",
"0 East Liberty 5869 0.58\n",
"1 Greenfield 7294 0.78\n",
"2 Squirrel Hill North 11363 1.22\n",
"3 Bloomfield 8442 0.70\n",
"4 Central Business District 3629 0.65"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# answer\n",
"\n",
"data = [['East Liberty', 5869, 0.58],\n",
" ['Greenfield', 7294, 0.78],\n",
" ['Squirrel Hill North', 11363, 1.22],\n",
" ['Bloomfield', 8442, 0.7],\n",
" ['Central Business District', 3629, 0.65]]\n",
"\n",
"column_names = [\"neighborhood\", \"population\", \"area\"]\n",
"\n",
"pgh_neighborhood_info = pd.DataFrame(data, columns=column_names)\n",
"pgh_neighborhood_info"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Slicing data out of a dataframe\n",
"\n",
"You can use indexing notation to extract individual columns from a dataframe. \n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.58\n",
"1 0.78\n",
"2 1.22\n",
"3 0.70\n",
"4 0.65\n",
"Name: area, dtype: float64"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pgh_neighborhood_info['area']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice the output is not a list, but a Series.\n",
"\n",
"#### Task\n",
"\n",
"1. Slice out the `population` column from the `pgh_neighborhood_info` dataframe.\n",
"2. Try to slice both the `neighborhood` and `population` columns, what happens?\n",
"3. Create a python list with two strings representing the names of the columns and save that to a variable (you pick the name).\n",
"4. Pass your newly created variable to dataframe within the slicing notation (that is, put it in teh square brackets). What happens now?"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 5869\n",
"1 7294\n",
"2 11363\n",
"3 8442\n",
"4 3629\n",
"Name: population, dtype: int64"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# answer 1\n",
"pgh_neighborhood_info['population']"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"ename": "KeyError",
"evalue": "('neighborhood', 'population')",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"File \u001b[0;32m/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3629\u001b[0m, in \u001b[0;36mIndex.get_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 3628\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m-> 3629\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_engine\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_loc\u001b[49m\u001b[43m(\u001b[49m\u001b[43mcasted_key\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 3630\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n",
"File \u001b[0;32m/opt/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx:136\u001b[0m, in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"File \u001b[0;32m/opt/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx:163\u001b[0m, in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"File \u001b[0;32mpandas/_libs/hashtable_class_helper.pxi:5198\u001b[0m, in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
"File \u001b[0;32mpandas/_libs/hashtable_class_helper.pxi:5206\u001b[0m, in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: ('neighborhood', 'population')",
"\nThe above exception was the direct cause of the following exception:\n",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"Input \u001b[0;32mIn [35]\u001b[0m, in \u001b[0;36m<cell line: 2>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# answer 2\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m \u001b[43mpgh_neighborhood_info\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mneighborhood\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m,\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mpopulation\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m]\u001b[49m\n",
"File \u001b[0;32m/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py:3505\u001b[0m, in \u001b[0;36mDataFrame.__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 3503\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcolumns\u001b[38;5;241m.\u001b[39mnlevels \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[1;32m 3504\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_getitem_multilevel(key)\n\u001b[0;32m-> 3505\u001b[0m indexer \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcolumns\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_loc\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 3506\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m is_integer(indexer):\n\u001b[1;32m 3507\u001b[0m indexer \u001b[38;5;241m=\u001b[39m [indexer]\n",
"File \u001b[0;32m/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3631\u001b[0m, in \u001b[0;36mIndex.get_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 3629\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_engine\u001b[38;5;241m.\u001b[39mget_loc(casted_key)\n\u001b[1;32m 3630\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[0;32m-> 3631\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(key) \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01merr\u001b[39;00m\n\u001b[1;32m 3632\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m:\n\u001b[1;32m 3633\u001b[0m \u001b[38;5;66;03m# If we have a listlike key, _check_indexing_error will raise\u001b[39;00m\n\u001b[1;32m 3634\u001b[0m \u001b[38;5;66;03m# InvalidIndexError. Otherwise we fall through and re-raise\u001b[39;00m\n\u001b[1;32m 3635\u001b[0m \u001b[38;5;66;03m# the TypeError.\u001b[39;00m\n\u001b[1;32m 3636\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_check_indexing_error(key)\n",
"\u001b[0;31mKeyError\u001b[0m: ('neighborhood', 'population')"
]
}
],
"source": [
"# answer 2\n",
"pgh_neighborhood_info['neighborhood','population']"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>neighborhood</th>\n",
" <th>population</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>East Liberty</td>\n",
" <td>5869</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Greenfield</td>\n",
" <td>7294</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Squirrel Hill North</td>\n",
" <td>11363</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Bloomfield</td>\n",
" <td>8442</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Central Business District</td>\n",
" <td>3629</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" neighborhood population\n",
"0 East Liberty 5869\n",
"1 Greenfield 7294\n",
"2 Squirrel Hill North 11363\n",
"3 Bloomfield 8442\n",
"4 Central Business District 3629"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# asnwer 3 & 4\n",
"foo = [\"neighborhood\", \"population\"]\n",
"pgh_neighborhood_info[foo]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why do we get a key error when we try to slice two column names?\n",
"If we don't wrap our values in a list, pandas will try to find a single column with a complex name `('neighborhood', 'population')`, which is possible, rather than two columns with the names \"neighborhood\" and \"population.\"\n",
"\n",
"This wart is an artifact of the way in which column names are treated in Pandas. As an Index!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Index\n",
"\n",
"* Pandas `Series` and `DataFrames` are containers for data\n",
"* The Index (and Indexing) is the mechanism to make that data retrievable\n",
"* In a `Series` the index is the key to each value in the list\n",
"* In a `DataFrame` the index is the column names, but there is also an index for each row\n",
"* Indexing allows you to merge or join disparate datasets together\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"### Tasks (todo)\n",
"\n",
"Index helps with alignment of data\n",
"- create lists or series with data not properly aligned. create a dataframe.\n",
"- create two series for population and area, with neighborhood name as an index. Have the data ordered differently\n",
"- create a dataframe from those two series. notice how the data get alighned.\n",
"Extracting data with the named and numerical indices\n",
"- use `iloc` to get the row at index position 1. Which element is it?\n",
"- use `loc` to get the row for the \"greenfield\" neighbrohood. What data structure do you get back?\n",
"- use the row, column indexing syntax to get the population of the central business district.\n",
"- use slicing and the named index to get the area of the middle three rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise\n",
"\n",
"* Using the `iloc` and slicing syntax slice the following dataframe based on the highlighted blocks in the image\n",
"* first think of the slicing syntax to grab just the rows you want THEN think of the slicing syntax for the columns you want\n",
"* Put the row slices *before* the comma and the column slices *after* the comma"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>7</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 2 3\n",
"1 4 5 6\n",
"2 7 8 9"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This is our example Dataframe\n",
"indexing_example = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])\n",
"indexing_example"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tasks\n",
"- Select the second two columns of the first two rows.\n",
"- select the entire third row\n",
"- select the first two columns\n",
"- select the first two columns of the second row\n",
"- "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Put the slicing syntax in your answer here\n",
"indexing_example.iloc[???]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
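
The Index section of the notebook lists data alignment and label-based lookup as tasks still to be written. Here is a minimal sketch of what those tasks seem to point at, assuming pandas is installed. The variable names (`population`, `area`, `aligned`, `greenfield_row`, `cbd_population`) are illustrative and not part of the notebook; the values are copied from the lesson's data table.

```python
import pandas as pd

# Two Series keyed by neighborhood name (values from the lesson's table).
population = pd.Series({
    "East Liberty": 5869,
    "Greenfield": 7294,
    "Squirrel Hill North": 11363,
    "Bloomfield": 8442,
    "Central Business District": 3629,
})

# Same neighborhoods, deliberately listed in a different order.
area = pd.Series({
    "Central Business District": 0.65,
    "Bloomfield": 0.70,
    "Squirrel Hill North": 1.22,
    "Greenfield": 0.78,
    "East Liberty": 0.58,
})

# When a DataFrame is built from Series, pandas matches values by
# index label, not by position, so each row stays consistent.
aligned = pd.DataFrame({"population": population, "area": area})

# Label-based row lookup with .loc returns that row as a Series.
greenfield_row = aligned.loc["Greenfield"]

# Row label, then column label, returns a single value.
cbd_population = aligned.loc["Central Business District", "population"]

print(aligned)
print(type(greenfield_row))
print(cbd_population)
```

The sketch sticks to label-based lookups with `loc` because the row order of the aligned dataframe is not something it relies on. Position-based lookups with `iloc` would depend on that order.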