Skip to content

Instantly share code, notes, and snippets.

@juandreww
Created October 31, 2020 08:12
Show Gist options
  • Save juandreww/9b82081cbd0544fa926bfcbc89b3b03d to your computer and use it in GitHub Desktop.
Save juandreww/9b82081cbd0544fa926bfcbc89b3b03d to your computer and use it in GitHub Desktop.
Created on Skills Network Labs
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/Logos/organization_logo/organization_logo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n",
"</center>\n",
"\n",
"# Analyzing a real world data-set with SQL and Python\n",
"\n",
"Estaimted time needed: **15** minutes\n",
"\n",
"## Objectives\n",
"\n",
"After complting this lab you will be able to:\n",
"\n",
"- Understand a dataset of selected socioeconomic indicators in Chicago\n",
"- Learn how to store data in an Db2 database on IBM Cloud instance\n",
"- Solve example problems to practice your SQL skills\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Selected Socioeconomic Indicators in Chicago\n",
"\n",
"The city of Chicago released a dataset of socioeconomic data to the Chicago City Portal.\n",
"This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” for each Chicago community area, for the years 2008 – 2012.\n",
"\n",
"Scores on the hardship index can range from 1 to 100, with a higher index number representing a greater level of hardship.\n",
"\n",
"A detailed description of the dataset can be found on [the city of Chicago's website](https://data.cityofchicago.org/Health-Human-Services/Census-Data-Selected-socioeconomic-indicators-in-C/kn9c-c2s2?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork-20127838&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork-20127838&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ), but to summarize, the dataset has the following variables:\n",
"\n",
"- **Community Area Number** (`ca`): Used to uniquely identify each row of the dataset\n",
"\n",
"- **Community Area Name** (`community_area_name`): The name of the region in the city of Chicago \n",
"\n",
"- **Percent of Housing Crowded** (`percent_of_housing_crowded`): Percent of occupied housing units with more than one person per room\n",
"\n",
"- **Percent Households Below Poverty** (`percent_households_below_poverty`): Percent of households living below the federal poverty line\n",
"\n",
"- **Percent Aged 16+ Unemployed** (`percent_aged_16_unemployed`): Percent of persons over the age of 16 years that are unemployed\n",
"\n",
"- **Percent Aged 25+ without High School Diploma** (`percent_aged_25_without_high_school_diploma`): Percent of persons over the age of 25 years without a high school education\n",
"\n",
"- **Percent Aged Under** 18 or Over 64:Percent of population under 18 or over 64 years of age (`percent_aged_under_18_or_over_64`): (ie. dependents)\n",
"\n",
"- **Per Capita Income** (`per_capita_income_`): Community Area per capita income is estimated as the sum of tract-level aggragate incomes divided by the total population\n",
"\n",
"- **Hardship Index** (`hardship_index`): Score that incorporates each of the six selected socioeconomic indicators\n",
"\n",
"In this Lab, we'll take a look at the variables in the socioeconomic indicators dataset and do some basic analysis with Python.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Connect to the database\n",
"\n",
"Let us first load the SQL extension and establish a connection with the database\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%load_ext sql"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Connected: ltz46454@BLUDB'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Remember the connection string is of the format:\n",
"# %sql ibm_db_sa://my-username:my-password@my-hostname:my-port/my-db-name\n",
"# Enter the connection string for your Db2 on Cloud database instance below\n",
"# i.e. copy after db2:// from the URI string in Service Credentials of your Db2 instance. Remove the double quotes at the end.\n",
"%sql ibm_db_sa://ltz46454:c7jb5513l%40jblmlj@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Store the dataset in a Table\n",
"\n",
"##### In many cases the dataset to be analyzed is available as a .CSV (comma separated values) file, perhaps on the internet. To analyze the data using SQL, it first needs to be stored in the database.\n",
"\n",
"##### We will first read the dataset source .CSV from the internet into pandas dataframe\n",
"\n",
"##### Then we need to create a table in our Db2 database to store the dataset. The PERSIST command in SQL \"magic\" simplifies the process of table creation and writing the data from a `pandas` dataframe into the table\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n"
]
},
{
"data": {
"text/plain": [
"'Persisted chicago_socioeconomic_data'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas\n",
"chicago_socioeconomic_data = pandas.read_csv('https://data.cityofchicago.org/resource/jcxq-k9xf.csv')\n",
"%sql PERSIST chicago_socioeconomic_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### You can verify that the table creation was successful by making a basic query like:\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>index</th>\n",
" <th>ca</th>\n",
" <th>community_area_name</th>\n",
" <th>percent_of_housing_crowded</th>\n",
" <th>percent_households_below_poverty</th>\n",
" <th>percent_aged_16_unemployed</th>\n",
" <th>percent_aged_25_without_high_school_diploma</th>\n",
" <th>percent_aged_under_18_or_over_64</th>\n",
" <th>per_capita_income_</th>\n",
" <th>hardship_index</th>\n",
" </tr>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>1.0</td>\n",
" <td>Rogers Park</td>\n",
" <td>7.7</td>\n",
" <td>23.6</td>\n",
" <td>8.7</td>\n",
" <td>18.2</td>\n",
" <td>27.5</td>\n",
" <td>23939</td>\n",
" <td>39.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>2.0</td>\n",
" <td>West Ridge</td>\n",
" <td>7.8</td>\n",
" <td>17.2</td>\n",
" <td>8.8</td>\n",
" <td>20.8</td>\n",
" <td>38.5</td>\n",
" <td>23040</td>\n",
" <td>46.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>3.0</td>\n",
" <td>Uptown</td>\n",
" <td>3.8</td>\n",
" <td>24.0</td>\n",
" <td>8.9</td>\n",
" <td>11.8</td>\n",
" <td>22.2</td>\n",
" <td>35787</td>\n",
" <td>20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>4.0</td>\n",
" <td>Lincoln Square</td>\n",
" <td>3.4</td>\n",
" <td>10.9</td>\n",
" <td>8.2</td>\n",
" <td>13.4</td>\n",
" <td>25.5</td>\n",
" <td>37524</td>\n",
" <td>17.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>5.0</td>\n",
" <td>North Center</td>\n",
" <td>0.3</td>\n",
" <td>7.5</td>\n",
" <td>5.2</td>\n",
" <td>4.5</td>\n",
" <td>26.2</td>\n",
" <td>57123</td>\n",
" <td>6.0</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(0, 1.0, 'Rogers Park', 7.7, 23.6, 8.7, 18.2, 27.5, 23939, 39.0),\n",
" (1, 2.0, 'West Ridge', 7.8, 17.2, 8.8, 20.8, 38.5, 23040, 46.0),\n",
" (2, 3.0, 'Uptown', 3.8, 24.0, 8.9, 11.8, 22.2, 35787, 20.0),\n",
" (3, 4.0, 'Lincoln Square', 3.4, 10.9, 8.2, 13.4, 25.5, 37524, 17.0),\n",
" (4, 5.0, 'North Center', 0.3, 7.5, 5.2, 4.5, 26.2, 57123, 6.0)]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT * FROM chicago_socioeconomic_data limit 5;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Problems\n",
"\n",
"### Problem 1\n",
"\n",
"##### How many rows are in the dataset?\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>1</th>\n",
" </tr>\n",
" <tr>\n",
" <td>78</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(Decimal('78'),)]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT COUNT(*) FROM chicago_socioeconomic_data;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Double-click **here** for the solution.\n",
"\n",
"<!-- Hint:\n",
"\n",
"%sql SELECT COUNT(*) FROM chicago_socioeconomic_data;\n",
"\n",
"Correct answer: 78\n",
"\n",
"-->\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem 2\n",
"\n",
"##### How many community areas in Chicago have a hardship index greater than 50.0?\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>1</th>\n",
" </tr>\n",
" <tr>\n",
" <td>38</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(Decimal('38'),)]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT COUNT(*) FROM chicago_socioeconomic_data WHERE hardship_index > 50;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Double-click **here** for the solution.\n",
"\n",
"<!-- Hint:\n",
"\n",
"%sql SELECT COUNT(*) FROM chicago_socioeconomic_data WHERE hardship_index > 50.0;\n",
"Correct answer: 38\n",
"-->\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem 3\n",
"\n",
"##### What is the maximum value of hardship index in this dataset?\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>1</th>\n",
" </tr>\n",
" <tr>\n",
" <td>98.0</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(98.0,)]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT MAX(hardship_index) FROM chicago_socioeconomic_data;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Double-click **here** for the solution.\n",
"\n",
"<!-- Hint:\n",
"\n",
"%sql SELECT MAX(hardship_index) FROM chicago_socioeconomic_data;\n",
"\n",
"Correct answer: 98.0\n",
"-->\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem 4\n",
"\n",
"##### Which community area which has the highest hardship index?\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>community_area_name</th>\n",
" </tr>\n",
" <tr>\n",
" <td>Riverdale</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[('Riverdale',)]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT community_area_name FROM chicago_socioeconomic_data WHERE hardship_index = (SELECT MAX(hardship_index) FROM chicago_socioeconomic_data);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Double-click **here** for the solution.\n",
"\n",
"<!-- Hint:\n",
"\n",
"## We can use the result of the last query to as an input to this query:\n",
"%sql SELECT community_area_name FROM chicago_socioeconomic_data where hardship_index=98.0\n",
"\n",
"## or another option:\n",
"%sql SELECT community_area_name FROM chicago_socioeconomic_data ORDER BY hardship_index DESC NULLS LAST FETCH FIRST ROW ONLY;\n",
"\n",
"## or you can use a sub-query to determine the max hardship index:\n",
"%sql select community_area_name from chicago_socioeconomic_data where hardship_index = ( select max(hardship_index) from chicago_socioeconomic_data ) \n",
"\n",
"Correct answer: 'Riverdale'\n",
"-->\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem 5\n",
"\n",
"##### Which Chicago community areas have per-capita incomes greater than $60,000?\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>community_area_name</th>\n",
" </tr>\n",
" <tr>\n",
" <td>Lake View</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Lincoln Park</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Near North Side</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Loop</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[('Lake View',), ('Lincoln Park',), ('Near North Side',), ('Loop',)]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT community_area_name FROM chicago_socioeconomic_data WHERE per_capita_income_ > 60000;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Double-click **here** for the solution.\n",
"\n",
"<!-- Hint:\n",
"\n",
"%sql SELECT community_area_name FROM chicago_socioeconomic_data WHERE per_capita_income_ > 60000;\n",
"\n",
"Correct answer:Lake View,Lincoln Park, Near North Side, Loop\n",
"-->\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem 6\n",
"\n",
"##### Create a scatter plot using the variables `per_capita_income_` and `hardship_index`. Explain the correlation between the two variables.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline\n",
"\n",
"plot = sns.jointplot(x=\"per_capita_income\", y='hardship_index', data=chicago_socioeconomic_data)\n",
"plo"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Double-click **here** for the solution.\n",
"\n",
"<!-- Hint:\n",
"# if the import command gives ModuleNotFoundError: No module named 'seaborn'\n",
"# then uncomment the following line i.e. delete the # to install the seaborn package \n",
"# !pip install seaborn\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"import seaborn as sns\n",
"\n",
"income_vs_hardship = %sql SELECT per_capita_income_, hardship_index FROM chicago_socioeconomic_data;\n",
"plot = sns.jointplot(x='per_capita_income_',y='hardship_index', data=income_vs_hardship.DataFrame())\n",
"\n",
"Correct answer:You can see that as Per Capita Income rises as the Hardship Index decreases. We see that the points on the scatter plot are somewhat closer to a straight line in the negative direction, so we have a negative correlation between the two variables. \n",
"-->\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Conclusion\n",
"\n",
"##### Now that you know how to do basic exploratory data analysis using SQL and python visualization tools, you can further explore this dataset to see how the variable `per_capita_income_` is related to `percent_households_below_poverty` and `percent_aged_16_unemployed`. Try to create interesting visualizations!\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"##### In this lab you learned how to store a real world data set from the internet in a database (Db2 on IBM Cloud), gain insights into data using SQL queries. You also visualized a portion of the data in the database to see what story it tells.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Author\n",
"\n",
"<a href=\"https://www.linkedin.com/in/ravahuja/\" target=\"_blank\">Rav Ahuja</a>\n",
"\n",
"## Change Log\n",
"\n",
"| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n",
"| ----------------- | ------- | ---------- | ---------------------------------- |\n",
"| 2020-08-28 | 2.0 | Lavanya | Moved lab to course repo in GitLab |\n",
"\n",
"<hr>\n",
"\n",
"## <h3 align=\"center\"> © IBM Corporation 2020. All rights reserved. <h3/>\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "conda-env-python-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.11"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment