juandreww/DB0201EN-Week3-1-4-Analyzing-v5-py.ipynb

## DB0201EN-Week3-1-4-Analyzing-v5-py.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "    <img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/Logos/organization_logo/organization_logo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\"  />\n",
    "</center>\n",
    "\n",
    "# Analyzing a real world data-set with SQL and Python\n",
    "\n",
    "Estaimted time needed: **15** minutes\n",
    "\n",
    "## Objectives\n",
    "\n",
    "After complting this lab you will be able to:\n",
    "\n",
    "-   Understand a dataset of selected socioeconomic indicators in Chicago\n",
    "-   Learn how to store data in an Db2 database on IBM Cloud instance\n",
    "-   Solve example problems to practice your SQL skills\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selected Socioeconomic Indicators in Chicago\n",
    "\n",
    "The city of Chicago released a dataset of socioeconomic data to the Chicago City Portal.\n",
    "This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” for each Chicago community area, for the years 2008 – 2012.\n",
    "\n",
    "Scores on the hardship index can range from 1 to 100, with a higher index number representing a greater level of hardship.\n",
    "\n",
    "A detailed description of the dataset can be found on [the city of Chicago's website](https://data.cityofchicago.org/Health-Human-Services/Census-Data-Selected-socioeconomic-indicators-in-C/kn9c-c2s2?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork-20127838&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork-20127838&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ), but to summarize, the dataset has the following variables:\n",
    "\n",
    "-   **Community Area Number** (`ca`): Used to uniquely identify each row of the dataset\n",
    "\n",
    "-   **Community Area Name** (`community_area_name`): The name of the region in the city of Chicago \n",
    "\n",
    "-   **Percent of Housing Crowded** (`percent_of_housing_crowded`): Percent of occupied housing units with more than one person per room\n",
    "\n",
    "-   **Percent Households Below Poverty** (`percent_households_below_poverty`): Percent of households living below the federal poverty line\n",
    "\n",
    "-   **Percent Aged 16+ Unemployed** (`percent_aged_16_unemployed`): Percent of persons over the age of 16 years that are unemployed\n",
    "\n",
    "-   **Percent Aged 25+ without High School Diploma** (`percent_aged_25_without_high_school_diploma`): Percent of persons over the age of 25 years without a high school education\n",
    "\n",
    "-   **Percent Aged Under** 18 or Over 64:Percent of population under 18 or over 64 years of age (`percent_aged_under_18_or_over_64`): (ie. dependents)\n",
    "\n",
    "-   **Per Capita Income** (`per_capita_income_`): Community Area per capita income is estimated as the sum of tract-level aggragate incomes divided by the total population\n",
    "\n",
    "-   **Hardship Index** (`hardship_index`): Score that incorporates each of the six selected socioeconomic indicators\n",
    "\n",
    "In this Lab, we'll take a look at the variables in the socioeconomic indicators dataset and do some basic analysis with Python.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Connect to the database\n",
    "\n",
    "Let us first load the SQL extension and establish a connection with the database\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext sql"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Connected: ltz46454@BLUDB'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Remember the connection string is of the format:\n",
    "# %sql ibm_db_sa://my-username:my-password@my-hostname:my-port/my-db-name\n",
    "# Enter the connection string for your Db2 on Cloud database instance below\n",
    "# i.e. copy after db2:// from the URI string in Service Credentials of your Db2 instance. Remove the double quotes at the end.\n",
    "%sql ibm_db_sa://ltz46454:c7jb5513l%40jblmlj@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Store the dataset in a Table\n",
    "\n",
    "##### In many cases the dataset to be analyzed is available as a .CSV (comma separated values) file, perhaps on the internet. To analyze the data using SQL, it first needs to be stored in the database.\n",
    "\n",
    "##### We will first read the dataset source .CSV from the internet into pandas dataframe\n",
    "\n",
    "##### Then we need to create a table in our Db2 database to store the dataset. The PERSIST command in SQL \"magic\" simplifies the process of table creation and writing the data from a `pandas` dataframe into the table\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'Persisted chicago_socioeconomic_data'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas\n",
    "chicago_socioeconomic_data = pandas.read_csv('https://data.cityofchicago.org/resource/jcxq-k9xf.csv')\n",
    "%sql PERSIST chicago_socioeconomic_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### You can verify that the table creation was successful by making a basic query like:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
      "Done.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table>\n",
       "    <tr>\n",
       "        <th>index</th>\n",
       "        <th>ca</th>\n",
       "        <th>community_area_name</th>\n",
       "        <th>percent_of_housing_crowded</th>\n",
       "        <th>percent_households_below_poverty</th>\n",
       "        <th>percent_aged_16_unemployed</th>\n",
       "        <th>percent_aged_25_without_high_school_diploma</th>\n",
       "        <th>percent_aged_under_18_or_over_64</th>\n",
       "        <th>per_capita_income_</th>\n",
       "        <th>hardship_index</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>0</td>\n",
       "        <td>1.0</td>\n",
       "        <td>Rogers Park</td>\n",
       "        <td>7.7</td>\n",
       "        <td>23.6</td>\n",
       "        <td>8.7</td>\n",
       "        <td>18.2</td>\n",
       "        <td>27.5</td>\n",
       "        <td>23939</td>\n",
       "        <td>39.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>1</td>\n",
       "        <td>2.0</td>\n",
       "        <td>West Ridge</td>\n",
       "        <td>7.8</td>\n",
       "        <td>17.2</td>\n",
       "        <td>8.8</td>\n",
       "        <td>20.8</td>\n",
       "        <td>38.5</td>\n",
       "        <td>23040</td>\n",
       "        <td>46.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>2</td>\n",
       "        <td>3.0</td>\n",
       "        <td>Uptown</td>\n",
       "        <td>3.8</td>\n",
       "        <td>24.0</td>\n",
       "        <td>8.9</td>\n",
       "        <td>11.8</td>\n",
       "        <td>22.2</td>\n",
       "        <td>35787</td>\n",
       "        <td>20.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>3</td>\n",
       "        <td>4.0</td>\n",
       "        <td>Lincoln Square</td>\n",
       "        <td>3.4</td>\n",
       "        <td>10.9</td>\n",
       "        <td>8.2</td>\n",
       "        <td>13.4</td>\n",
       "        <td>25.5</td>\n",
       "        <td>37524</td>\n",
       "        <td>17.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>4</td>\n",
       "        <td>5.0</td>\n",
       "        <td>North Center</td>\n",
       "        <td>0.3</td>\n",
       "        <td>7.5</td>\n",
       "        <td>5.2</td>\n",
       "        <td>4.5</td>\n",
       "        <td>26.2</td>\n",
       "        <td>57123</td>\n",
       "        <td>6.0</td>\n",
       "    </tr>\n",
       "</table>"
      ],
      "text/plain": [
       "[(0, 1.0, 'Rogers Park', 7.7, 23.6, 8.7, 18.2, 27.5, 23939, 39.0),\n",
       " (1, 2.0, 'West Ridge', 7.8, 17.2, 8.8, 20.8, 38.5, 23040, 46.0),\n",
       " (2, 3.0, 'Uptown', 3.8, 24.0, 8.9, 11.8, 22.2, 35787, 20.0),\n",
       " (3, 4.0, 'Lincoln Square', 3.4, 10.9, 8.2, 13.4, 25.5, 37524, 17.0),\n",
       " (4, 5.0, 'North Center', 0.3, 7.5, 5.2, 4.5, 26.2, 57123, 6.0)]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%sql SELECT * FROM chicago_socioeconomic_data limit 5;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Problems\n",
    "\n",
    "### Problem 1\n",
    "\n",
    "##### How many rows are in the dataset?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
      "Done.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table>\n",
       "    <tr>\n",
       "        <th>1</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>78</td>\n",
       "    </tr>\n",
       "</table>"
      ],
      "text/plain": [
       "[(Decimal('78'),)]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%sql SELECT COUNT(*) FROM chicago_socioeconomic_data;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Double-click **here** for the solution.\n",
    "\n",
    "<!-- Hint:\n",
    "\n",
    "%sql SELECT COUNT(*) FROM chicago_socioeconomic_data;\n",
    "\n",
    "Correct answer: 78\n",
    "\n",
    "-->\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Problem 2\n",
    "\n",
    "##### How many community areas in Chicago have a hardship index greater than 50.0?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
      "Done.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table>\n",
       "    <tr>\n",
       "        <th>1</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>38</td>\n",
       "    </tr>\n",
       "</table>"
      ],
      "text/plain": [
       "[(Decimal('38'),)]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%sql SELECT COUNT(*) FROM chicago_socioeconomic_data WHERE hardship_index > 50;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Double-click **here** for the solution.\n",
    "\n",
    "<!-- Hint:\n",
    "\n",
    "%sql SELECT COUNT(*) FROM chicago_socioeconomic_data WHERE hardship_index > 50.0;\n",
    "Correct answer: 38\n",
    "-->\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Problem 3\n",
    "\n",
    "##### What is the maximum value of hardship index in this dataset?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
      "Done.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table>\n",
       "    <tr>\n",
       "        <th>1</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>98.0</td>\n",
       "    </tr>\n",
       "</table>"
      ],
      "text/plain": [
       "[(98.0,)]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%sql SELECT MAX(hardship_index) FROM chicago_socioeconomic_data;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Double-click **here** for the solution.\n",
    "\n",
    "<!-- Hint:\n",
    "\n",
    "%sql SELECT MAX(hardship_index) FROM chicago_socioeconomic_data;\n",
    "\n",
    "Correct answer: 98.0\n",
    "-->\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Problem 4\n",
    "\n",
    "##### Which community area which has the highest hardship index?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
      "Done.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table>\n",
       "    <tr>\n",
       "        <th>community_area_name</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Riverdale</td>\n",
       "    </tr>\n",
       "</table>"
      ],
      "text/plain": [
       "[('Riverdale',)]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%sql SELECT community_area_name FROM chicago_socioeconomic_data WHERE hardship_index = (SELECT MAX(hardship_index) FROM chicago_socioeconomic_data);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Double-click **here** for the solution.\n",
    "\n",
    "<!-- Hint:\n",
    "\n",
    "## We can use the result of the last query to as an input to this query:\n",
    "%sql SELECT community_area_name FROM chicago_socioeconomic_data where hardship_index=98.0\n",
    "\n",
    "## or another option:\n",
    "%sql SELECT community_area_name FROM chicago_socioeconomic_data ORDER BY hardship_index DESC NULLS LAST FETCH FIRST ROW ONLY;\n",
    "\n",
    "## or you can use a sub-query to determine the max hardship index:\n",
    "%sql select community_area_name from chicago_socioeconomic_data where hardship_index = ( select max(hardship_index) from chicago_socioeconomic_data ) \n",
    "\n",
    "Correct answer: 'Riverdale'\n",
    "-->\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Problem 5\n",
    "\n",
    "##### Which Chicago community areas have per-capita incomes greater than $60,000?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
      "Done.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table>\n",
       "    <tr>\n",
       "        <th>community_area_name</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Lake View</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Lincoln Park</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Near North Side</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Loop</td>\n",
       "    </tr>\n",
       "</table>"
      ],
      "text/plain": [
       "[('Lake View',), ('Lincoln Park',), ('Near North Side',), ('Loop',)]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%sql SELECT community_area_name FROM chicago_socioeconomic_data WHERE per_capita_income_ > 60000;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Double-click **here** for the solution.\n",
    "\n",
    "<!-- Hint:\n",
    "\n",
    "%sql SELECT community_area_name FROM chicago_socioeconomic_data WHERE per_capita_income_ > 60000;\n",
    "\n",
    "Correct answer:Lake View,Lincoln Park, Near North Side, Loop\n",
    "-->\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Problem 6\n",
    "\n",
    "##### Create a scatter plot using the variables `per_capita_income_` and `hardship_index`. Explain the correlation between the two variables.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "%matplotlib inline\n",
    "\n",
    "plot = sns.jointplot(x=\"per_capita_income\", y='hardship_index', data=chicago_socioeconomic_data)\n",
    "plo"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Double-click **here** for the solution.\n",
    "\n",
    "<!-- Hint:\n",
    "# if the import command gives ModuleNotFoundError: No module named 'seaborn'\n",
    "# then uncomment the following line i.e. delete the # to install the seaborn package \n",
    "# !pip install seaborn\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "import seaborn as sns\n",
    "\n",
    "income_vs_hardship = %sql SELECT per_capita_income_, hardship_index FROM chicago_socioeconomic_data;\n",
    "plot = sns.jointplot(x='per_capita_income_',y='hardship_index', data=income_vs_hardship.DataFrame())\n",
    "\n",
    "Correct answer:You can see that as Per Capita Income rises as the Hardship Index decreases. We see that the points on the scatter plot are somewhat closer to a straight line in the negative direction, so we have a negative correlation between the two variables. \n",
    "-->\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Conclusion\n",
    "\n",
    "##### Now that you know how to do basic exploratory data analysis using SQL and python visualization tools, you can further explore this dataset to see how the variable `per_capita_income_` is related to `percent_households_below_poverty` and `percent_aged_16_unemployed`. Try to create interesting visualizations!\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "##### In this lab you learned how to store a real world data set from the internet in a database (Db2 on IBM Cloud), gain insights into data using SQL queries. You also visualized a portion of the data in the database to see what story it tells.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Author\n",
    "\n",
    "<a href=\"https://www.linkedin.com/in/ravahuja/\" target=\"_blank\">Rav Ahuja</a>\n",
    "\n",
    "## Change Log\n",
    "\n",
    "| Date (YYYY-MM-DD) | Version | Changed By | Change Description                 |\n",
    "| ----------------- | ------- | ---------- | ---------------------------------- |\n",
    "| 2020-08-28        | 2.0     | Lavanya    | Moved lab to course repo in GitLab |\n",
    "\n",
    "<hr>\n",
    "\n",
    "## <h3 align=\"center\"> © IBM Corporation 2020. All rights reserved. <h3/>\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python",
   "language": "python",
   "name": "conda-env-python-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.11"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"<center>\n",
	" <img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/Logos/organization_logo/organization_logo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n",
	"</center>\n",
	"\n",
	"# Analyzing a real world data-set with SQL and Python\n",
	"\n",
	"Estaimted time needed: 15 minutes\n",
	"\n",
	"## Objectives\n",
	"\n",
	"After complting this lab you will be able to:\n",
	"\n",
	"- Understand a dataset of selected socioeconomic indicators in Chicago\n",
	"- Learn how to store data in an Db2 database on IBM Cloud instance\n",
	"- Solve example problems to practice your SQL skills\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Selected Socioeconomic Indicators in Chicago\n",
	"\n",
	"The city of Chicago released a dataset of socioeconomic data to the Chicago City Portal.\n",
	"This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” for each Chicago community area, for the years 2008 – 2012.\n",
	"\n",
	"Scores on the hardship index can range from 1 to 100, with a higher index number representing a greater level of hardship.\n",
	"\n",
	"A detailed description of the dataset can be found on [the city of Chicago's website](https://data.cityofchicago.org/Health-Human-Services/Census-Data-Selected-socioeconomic-indicators-in-C/kn9c-c2s2?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork-20127838&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork-20127838&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ), but to summarize, the dataset has the following variables:\n",
	"\n",
	"- Community Area Number (`ca`): Used to uniquely identify each row of the dataset\n",
	"\n",
	"- Community Area Name (`community_area_name`): The name of the region in the city of Chicago \n",
	"\n",
	"- Percent of Housing Crowded (`percent_of_housing_crowded`): Percent of occupied housing units with more than one person per room\n",
	"\n",
	"- Percent Households Below Poverty (`percent_households_below_poverty`): Percent of households living below the federal poverty line\n",
	"\n",
	"- Percent Aged 16+ Unemployed (`percent_aged_16_unemployed`): Percent of persons over the age of 16 years that are unemployed\n",
	"\n",
	"- Percent Aged 25+ without High School Diploma (`percent_aged_25_without_high_school_diploma`): Percent of persons over the age of 25 years without a high school education\n",
	"\n",
	"- Percent Aged Under 18 or Over 64:Percent of population under 18 or over 64 years of age (`percent_aged_under_18_or_over_64`): (ie. dependents)\n",
	"\n",
	"- Per Capita Income (`per_capita_income_`): Community Area per capita income is estimated as the sum of tract-level aggragate incomes divided by the total population\n",
	"\n",
	"- Hardship Index (`hardship_index`): Score that incorporates each of the six selected socioeconomic indicators\n",
	"\n",
	"In this Lab, we'll take a look at the variables in the socioeconomic indicators dataset and do some basic analysis with Python.\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Connect to the database\n",
	"\n",
	"Let us first load the SQL extension and establish a connection with the database\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {},
	"outputs": [],
	"source": [
	"%load_ext sql"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"'Connected: ltz46454@BLUDB'"
	]
	},
	"execution_count": 2,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"# Remember the connection string is of the format:\n",
	"# %sql ibm_db_sa://my-username:my-password@my-hostname:my-port/my-db-name\n",
	"# Enter the connection string for your Db2 on Cloud database instance below\n",
	"# i.e. copy after db2:// from the URI string in Service Credentials of your Db2 instance. Remove the double quotes at the end.\n",
	"%sql ibm_db_sa://ltz46454:c7jb5513l%40jblmlj@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Store the dataset in a Table\n",
	"\n",
	"##### In many cases the dataset to be analyzed is available as a .CSV (comma separated values) file, perhaps on the internet. To analyze the data using SQL, it first needs to be stored in the database.\n",
	"\n",
	"##### We will first read the dataset source .CSV from the internet into pandas dataframe\n",
	"\n",
	"##### Then we need to create a table in our Db2 database to store the dataset. The PERSIST command in SQL \"magic\" simplifies the process of table creation and writing the data from a `pandas` dataframe into the table\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	" * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n"
	]
	},
	{
	"data": {
	"text/plain": [
	"'Persisted chicago_socioeconomic_data'"
	]
	},
	"execution_count": 3,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"import pandas\n",
	"chicago_socioeconomic_data = pandas.read_csv('https://data.cityofchicago.org/resource/jcxq-k9xf.csv')\n",
	"%sql PERSIST chicago_socioeconomic_data"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"##### You can verify that the table creation was successful by making a basic query like:\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	" * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
	"Done.\n"
	]
	},
	{
	"data": {
	"text/html": [
	"<table>\n",
	" <tr>\n",
	" <th>index</th>\n",
	" <th>ca</th>\n",
	" <th>community_area_name</th>\n",
	" <th>percent_of_housing_crowded</th>\n",
	" <th>percent_households_below_poverty</th>\n",
	" <th>percent_aged_16_unemployed</th>\n",
	" <th>percent_aged_25_without_high_school_diploma</th>\n",
	" <th>percent_aged_under_18_or_over_64</th>\n",
	" <th>per_capita_income_</th>\n",
	" <th>hardship_index</th>\n",
	" </tr>\n",
	" <tr>\n",
	" <td>0</td>\n",
	" <td>1.0</td>\n",
	" <td>Rogers Park</td>\n",
	" <td>7.7</td>\n",
	" <td>23.6</td>\n",
	" <td>8.7</td>\n",
	" <td>18.2</td>\n",
	" <td>27.5</td>\n",
	" <td>23939</td>\n",
	" <td>39.0</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <td>1</td>\n",
	" <td>2.0</td>\n",
	" <td>West Ridge</td>\n",
	" <td>7.8</td>\n",
	" <td>17.2</td>\n",
	" <td>8.8</td>\n",
	" <td>20.8</td>\n",
	" <td>38.5</td>\n",
	" <td>23040</td>\n",
	" <td>46.0</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <td>2</td>\n",
	" <td>3.0</td>\n",
	" <td>Uptown</td>\n",
	" <td>3.8</td>\n",
	" <td>24.0</td>\n",
	" <td>8.9</td>\n",
	" <td>11.8</td>\n",
	" <td>22.2</td>\n",
	" <td>35787</td>\n",
	" <td>20.0</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <td>3</td>\n",
	" <td>4.0</td>\n",
	" <td>Lincoln Square</td>\n",
	" <td>3.4</td>\n",
	" <td>10.9</td>\n",
	" <td>8.2</td>\n",
	" <td>13.4</td>\n",
	" <td>25.5</td>\n",
	" <td>37524</td>\n",
	" <td>17.0</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <td>4</td>\n",
	" <td>5.0</td>\n",
	" <td>North Center</td>\n",
	" <td>0.3</td>\n",
	" <td>7.5</td>\n",
	" <td>5.2</td>\n",
	" <td>4.5</td>\n",
	" <td>26.2</td>\n",
	" <td>57123</td>\n",
	" <td>6.0</td>\n",
	" </tr>\n",
	"</table>"
	],
	"text/plain": [
	"[(0, 1.0, 'Rogers Park', 7.7, 23.6, 8.7, 18.2, 27.5, 23939, 39.0),\n",
	" (1, 2.0, 'West Ridge', 7.8, 17.2, 8.8, 20.8, 38.5, 23040, 46.0),\n",
	" (2, 3.0, 'Uptown', 3.8, 24.0, 8.9, 11.8, 22.2, 35787, 20.0),\n",
	" (3, 4.0, 'Lincoln Square', 3.4, 10.9, 8.2, 13.4, 25.5, 37524, 17.0),\n",
	" (4, 5.0, 'North Center', 0.3, 7.5, 5.2, 4.5, 26.2, 57123, 6.0)]"
	]
	},
	"execution_count": 4,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"%sql SELECT * FROM chicago_socioeconomic_data limit 5;"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Problems\n",
	"\n",
	"### Problem 1\n",
	"\n",
	"##### How many rows are in the dataset?\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	" * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
	"Done.\n"
	]
	},
	{
	"data": {
	"text/html": [
	"<table>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" </tr>\n",
	" <tr>\n",
	" <td>78</td>\n",
	" </tr>\n",
	"</table>"
	],
	"text/plain": [
	"[(Decimal('78'),)]"
	]
	},
	"execution_count": 6,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"%sql SELECT COUNT(*) FROM chicago_socioeconomic_data;"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Double-click here for the solution.\n",
	"\n",
	"<!-- Hint:\n",
	"\n",
	"%sql SELECT COUNT(*) FROM chicago_socioeconomic_data;\n",
	"\n",
	"Correct answer: 78\n",
	"\n",
	"-->\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Problem 2\n",
	"\n",
	"##### How many community areas in Chicago have a hardship index greater than 50.0?\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	" * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
	"Done.\n"
	]
	},
	{
	"data": {
	"text/html": [
	"<table>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" </tr>\n",
	" <tr>\n",
	" <td>38</td>\n",
	" </tr>\n",
	"</table>"
	],
	"text/plain": [
	"[(Decimal('38'),)]"
	]
	},
	"execution_count": 7,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"%sql SELECT COUNT(*) FROM chicago_socioeconomic_data WHERE hardship_index > 50;"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Double-click here for the solution.\n",
	"\n",
	"<!-- Hint:\n",
	"\n",
	"%sql SELECT COUNT(*) FROM chicago_socioeconomic_data WHERE hardship_index > 50.0;\n",
	"Correct answer: 38\n",
	"-->\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Problem 3\n",
	"\n",
	"##### What is the maximum value of hardship index in this dataset?\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 9,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	" * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
	"Done.\n"
	]
	},
	{
	"data": {
	"text/html": [
	"<table>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" </tr>\n",
	" <tr>\n",
	" <td>98.0</td>\n",
	" </tr>\n",
	"</table>"
	],
	"text/plain": [
	"[(98.0,)]"
	]
	},
	"execution_count": 9,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"%sql SELECT MAX(hardship_index) FROM chicago_socioeconomic_data;"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Double-click here for the solution.\n",
	"\n",
	"<!-- Hint:\n",
	"\n",
	"%sql SELECT MAX(hardship_index) FROM chicago_socioeconomic_data;\n",
	"\n",
	"Correct answer: 98.0\n",
	"-->\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Problem 4\n",
	"\n",
	"##### Which community area which has the highest hardship index?\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 12,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	" * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
	"Done.\n"
	]
	},
	{
	"data": {
	"text/html": [
	"<table>\n",
	" <tr>\n",
	" <th>community_area_name</th>\n",
	" </tr>\n",
	" <tr>\n",
	" <td>Riverdale</td>\n",
	" </tr>\n",
	"</table>"
	],
	"text/plain": [
	"[('Riverdale',)]"
	]
	},
	"execution_count": 12,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"%sql SELECT community_area_name FROM chicago_socioeconomic_data WHERE hardship_index = (SELECT MAX(hardship_index) FROM chicago_socioeconomic_data);"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Double-click here for the solution.\n",
	"\n",
	"<!-- Hint:\n",
	"\n",
	"## We can use the result of the last query to as an input to this query:\n",
	"%sql SELECT community_area_name FROM chicago_socioeconomic_data where hardship_index=98.0\n",
	"\n",
	"## or another option:\n",
	"%sql SELECT community_area_name FROM chicago_socioeconomic_data ORDER BY hardship_index DESC NULLS LAST FETCH FIRST ROW ONLY;\n",
	"\n",
	"## or you can use a sub-query to determine the max hardship index:\n",
	"%sql select community_area_name from chicago_socioeconomic_data where hardship_index = ( select max(hardship_index) from chicago_socioeconomic_data ) \n",
	"\n",
	"Correct answer: 'Riverdale'\n",
	"-->\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Problem 5\n",
	"\n",
	"##### Which Chicago community areas have per-capita incomes greater than $60,000?\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 13,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	" * ibm_db_sa://ltz46454:***@dashdb-txn-sbox-yp-dal09-11.services.dal.bluemix.net:50000/BLUDB\n",
	"Done.\n"
	]
	},
	{
	"data": {
	"text/html": [
	"<table>\n",
	" <tr>\n",
	" <th>community_area_name</th>\n",
	" </tr>\n",
	" <tr>\n",
	" <td>Lake View</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <td>Lincoln Park</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <td>Near North Side</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <td>Loop</td>\n",
	" </tr>\n",
	"</table>"
	],
	"text/plain": [
	"[('Lake View',), ('Lincoln Park',), ('Near North Side',), ('Loop',)]"
	]
	},
	"execution_count": 13,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"%sql SELECT community_area_name FROM chicago_socioeconomic_data WHERE per_capita_income_ > 60000;"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Double-click here for the solution.\n",
	"\n",
	"<!-- Hint:\n",
	"\n",
	"%sql SELECT community_area_name FROM chicago_socioeconomic_data WHERE per_capita_income_ > 60000;\n",
	"\n",
	"Correct answer:Lake View,Lincoln Park, Near North Side, Loop\n",
	"-->\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Problem 6\n",
	"\n",
	"##### Create a scatter plot using the variables `per_capita_income_` and `hardship_index`. Explain the correlation between the two variables.\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true,
	"jupyter": {
	"outputs_hidden": true
	}
	},
	"outputs": [],
	"source": [
	"import matplotlib.pyplot as plt\n",
	"import seaborn as sns\n",
	"%matplotlib inline\n",
	"\n",
	"plot = sns.jointplot(x=\"per_capita_income\", y='hardship_index', data=chicago_socioeconomic_data)\n",
	"plo"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Double-click here for the solution.\n",
	"\n",
	"<!-- Hint:\n",
	"# if the import command gives ModuleNotFoundError: No module named 'seaborn'\n",
	"# then uncomment the following line i.e. delete the # to install the seaborn package \n",
	"# !pip install seaborn\n",
	"import matplotlib.pyplot as plt\n",
	"%matplotlib inline\n",
	"import seaborn as sns\n",
	"\n",
	"income_vs_hardship = %sql SELECT per_capita_income_, hardship_index FROM chicago_socioeconomic_data;\n",
	"plot = sns.jointplot(x='per_capita_income_',y='hardship_index', data=income_vs_hardship.DataFrame())\n",
	"\n",
	"Correct answer:You can see that as Per Capita Income rises as the Hardship Index decreases. We see that the points on the scatter plot are somewhat closer to a straight line in the negative direction, so we have a negative correlation between the two variables. \n",
	"-->\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Conclusion\n",
	"\n",
	"##### Now that you know how to do basic exploratory data analysis using SQL and python visualization tools, you can further explore this dataset to see how the variable `per_capita_income_` is related to `percent_households_below_poverty` and `percent_aged_16_unemployed`. Try to create interesting visualizations!\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Summary\n",
	"\n",
	"##### In this lab you learned how to store a real world data set from the internet in a database (Db2 on IBM Cloud), gain insights into data using SQL queries. You also visualized a portion of the data in the database to see what story it tells.\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Author\n",
	"\n",
	"<a href=\"https://www.linkedin.com/in/ravahuja/\" target=\"_blank\">Rav Ahuja</a>\n",
	"\n",
	"## Change Log\n",
	"\n",
	"\| Date (YYYY-MM-DD) \| Version \| Changed By \| Change Description \|\n",
	"\| ----------------- \| ------- \| ---------- \| ---------------------------------- \|\n",
	"\| 2020-08-28 \| 2.0 \| Lavanya \| Moved lab to course repo in GitLab \|\n",
	"\n",
	"<hr>\n",
	"\n",
	"## <h3 align=\"center\"> © IBM Corporation 2020. All rights reserved. <h3/>\n"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python",
	"language": "python",
	"name": "conda-env-python-py"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.6.11"
	},
	"widgets": {
	"state": {},
	"version": "1.1.2"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 4
	}