@maia-18
Created June 11, 2020 14:09
Created on Skills Network Labs
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Applied Data Science Capstone - Week 3\n",
"## Reading Toronto Wikipedia Places"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting package metadata (current_repodata.json): done\n",
"Solving environment: done\n",
"\n",
"# All requested packages already installed.\n",
"\n",
"\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"conda install -c anaconda beautifulsoup4"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting package metadata (current_repodata.json): done\n",
"Solving environment: failed with initial frozen solve. Retrying with flexible solve.\n",
"Collecting package metadata (repodata.json): done\n",
"Solving environment: failed with initial frozen solve. Retrying with flexible solve.\n",
"\n",
"PackagesNotFoundError: The following packages are not available from current channels:\n",
"\n",
" - beautifulsoup\n",
"\n",
"Current channels:\n",
"\n",
" - https://conda.anaconda.org/anaconda/linux-64\n",
" - https://conda.anaconda.org/anaconda/noarch\n",
" - https://repo.anaconda.com/pkgs/main/linux-64\n",
" - https://repo.anaconda.com/pkgs/main/noarch\n",
" - https://repo.anaconda.com/pkgs/r/linux-64\n",
" - https://repo.anaconda.com/pkgs/r/noarch\n",
"\n",
"To search for alternate channels that may provide the conda package you're\n",
"looking for, navigate to\n",
"\n",
" https://anaconda.org\n",
"\n",
"and use the search bar at the top of the page.\n",
"\n",
"\n",
"\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"conda install -c anaconda beautifulsoup"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting package metadata (current_repodata.json): done\n",
"Solving environment: done\n",
"\n",
"# All requested packages already installed.\n",
"\n",
"\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"conda install -c anaconda lxml\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Use the notebook to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the table of postal codes, and transform the data into a pandas DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from bs4 import BeautifulSoup\n",
"import pandas"
]
},
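{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
"# Download the page; 'lxml' selects the HTML parser (the 'xml' mode targets XML documents)\n",
"website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text\n",
"soup = BeautifulSoup(website_text,'lxml')\n",
"# Locate the postal code table by its CSS classes\n",
"table = soup.find('table',{'class':'wikitable sortable'})"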
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
"table_rows = table.find_all('tr')\n",
"data = []\n",
"for row in table_rows:\n",
"    # Collect the stripped text of every <td> cell in this row\n",
"    td = []\n",
"    for t in row.find_all('td'):\n",
"        td.append(t.text.strip())\n",
"    # The header row has only <th> cells, so it yields an empty list (cleaned later)\n",
"    data.append(td)\n",
"dftoronto = pandas.DataFrame(data, columns=['Postal Code', 'Borough', 'Neighborhood'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Only process the cells that have an assigned borough. Ignore cells whose borough is `Not assigned`."
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Postal Code</th>\n",
" <th>Borough</th>\n",
" <th>Neighborhood</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>M1B</td>\n",
" <td>Scarborough</td>\n",
" <td>Malvern, Rouge</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>M1C</td>\n",
" <td>Scarborough</td>\n",
" <td>Rouge Hill, Port Union, Highland Creek</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>M1E</td>\n",
" <td>Scarborough</td>\n",
" <td>Guildwood, Morningside, West Hill</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>M1G</td>\n",
" <td>Scarborough</td>\n",
" <td>Woburn</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>M1H</td>\n",
" <td>Scarborough</td>\n",
" <td>Cedarbrae</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>98</th>\n",
" <td>M9N</td>\n",
" <td>York</td>\n",
" <td>Weston</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99</th>\n",
" <td>M9P</td>\n",
" <td>Etobicoke</td>\n",
" <td>Westmount</td>\n",
" </tr>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>M9R</td>\n",
" <td>Etobicoke</td>\n",
" <td>Kingsview Village, St. Phillips, Martin Grove ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>M9V</td>\n",
" <td>Etobicoke</td>\n",
" <td>South Steeles, Silverstone, Humbergate, Jamest...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>M9W</td>\n",
" <td>Etobicoke</td>\n",
" <td>Northwest, West Humber - Clairville</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>103 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" Postal Code Borough \\\n",
"0 M1B Scarborough \n",
"1 M1C Scarborough \n",
"2 M1E Scarborough \n",
"3 M1G Scarborough \n",
"4 M1H Scarborough \n",
".. ... ... \n",
"98 M9N York \n",
"99 M9P Etobicoke \n",
"100 M9R Etobicoke \n",
"101 M9V Etobicoke \n",
"102 M9W Etobicoke \n",
"\n",
" Neighborhood \n",
"0 Malvern, Rouge \n",
"1 Rouge Hill, Port Union, Highland Creek \n",
"2 Guildwood, Morningside, West Hill \n",
"3 Woburn \n",
"4 Cedarbrae \n",
".. ... \n",
"98 Weston \n",
"99 Westmount \n",
"100 Kingsview Village, St. Phillips, Martin Grove ... \n",
"101 South Steeles, Silverstone, Humbergate, Jamest... \n",
"102 Northwest, West Humber - Clairville \n",
"\n",
"[103 rows x 3 columns]"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
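],
"source": [
"# Drop rows with no borough (e.g. the header row, which has no <td> cells)\n",
"dftoronto = dftoronto[dftoronto['Borough'].notnull()]\n",
"# Drop rows whose borough is 'Not assigned'\n",
"dftoronto = dftoronto[dftoronto['Borough'] != 'Not assigned'].reset_index(drop=True)\n",
"# Combine neighborhoods that share a postal code into one comma-separated string\n",
"dftoronto = dftoronto.groupby(['Postal Code','Borough'])['Neighborhood'].apply(lambda x: ','.join(x)).reset_index()\n",
"# If a neighborhood is 'Not assigned', fall back to the borough name\n",
"not_assigned = dftoronto['Neighborhood'] == 'Not assigned'\n",
"dftoronto.loc[not_assigned, 'Neighborhood'] = dftoronto.loc[not_assigned, 'Borough']\n",
"dftoronto"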
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### In the last cell of your notebook, use the `.shape` attribute to print the number of rows of your dataframe."
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(103, 3)"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dftoronto.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get the coordinates"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Libraries installed and imported\n"
]
}
],
"source": [
"#Install necessary libraries\n",
"\n",
"#Geopy library for handling geo-spatial data\n",
"#!pip install geopy\n",
"#Folium for map plotting based on lat and lng\n",
"#!pip install folium=0.5.0\n",
"\n",
"#importing other necessary libraries\n",
"import numpy as np\n",
"import pandas as pd\n",
"import requests\n",
"import random\n",
"import folium\n",
"from geopy.geocoders import Nominatim\n",
"from IPython.display import Image \n",
"from IPython.core.display import HTML \n",
"from pandas.io.json import json_normalize\n",
"\n",
"print('Libraries installed and imported')"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Postal Code</th>\n",
" <th>Latitude</th>\n",
" <th>Longitude</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>M1B</td>\n",
" <td>43.806686</td>\n",
" <td>-79.194353</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>M1C</td>\n",
" <td>43.784535</td>\n",
" <td>-79.160497</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>M1E</td>\n",
" <td>43.763573</td>\n",
" <td>-79.188711</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>M1G</td>\n",
" <td>43.770992</td>\n",
" <td>-79.216917</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>M1H</td>\n",
" <td>43.773136</td>\n",
" <td>-79.239476</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Postal Code Latitude Longitude\n",
"0 M1B 43.806686 -79.194353\n",
"1 M1C 43.784535 -79.160497\n",
"2 M1E 43.763573 -79.188711\n",
"3 M1G 43.770992 -79.216917\n",
"4 M1H 43.773136 -79.239476"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"geo_data = pd.read_csv(\"http://cocl.us/Geospatial_data\")\n",
"\n",
"geo_data.head()\n",
"#print(geo_data.shape)"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Postal Code</th>\n",
" <th>Borough</th>\n",
" <th>Neighborhood</th>\n",
" <th>Latitude</th>\n",
" <th>Longitude</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>M1B</td>\n",
" <td>Scarborough</td>\n",
" <td>Malvern, Rouge</td>\n",
" <td>43.806686</td>\n",
" <td>-79.194353</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>M1C</td>\n",
" <td>Scarborough</td>\n",
" <td>Rouge Hill, Port Union, Highland Creek</td>\n",
" <td>43.784535</td>\n",
" <td>-79.160497</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>M1E</td>\n",
" <td>Scarborough</td>\n",
" <td>Guildwood, Morningside, West Hill</td>\n",
" <td>43.763573</td>\n",
" <td>-79.188711</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>M1G</td>\n",
" <td>Scarborough</td>\n",
" <td>Woburn</td>\n",
" <td>43.770992</td>\n",
" <td>-79.216917</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>M1H</td>\n",
" <td>Scarborough</td>\n",
" <td>Cedarbrae</td>\n",
" <td>43.773136</td>\n",
" <td>-79.239476</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>98</th>\n",
" <td>M9N</td>\n",
" <td>York</td>\n",
" <td>Weston</td>\n",
" <td>43.706876</td>\n",
" <td>-79.518188</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99</th>\n",
" <td>M9P</td>\n",
" <td>Etobicoke</td>\n",
" <td>Westmount</td>\n",
" <td>43.696319</td>\n",
" <td>-79.532242</td>\n",
" </tr>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>M9R</td>\n",
" <td>Etobicoke</td>\n",
" <td>Kingsview Village, St. Phillips, Martin Grove ...</td>\n",
" <td>43.688905</td>\n",
" <td>-79.554724</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>M9V</td>\n",
" <td>Etobicoke</td>\n",
" <td>South Steeles, Silverstone, Humbergate, Jamest...</td>\n",
" <td>43.739416</td>\n",
" <td>-79.588437</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>M9W</td>\n",
" <td>Etobicoke</td>\n",
" <td>Northwest, West Humber - Clairville</td>\n",
" <td>43.706748</td>\n",
" <td>-79.594054</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>103 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" Postal Code Borough \\\n",
"0 M1B Scarborough \n",
"1 M1C Scarborough \n",
"2 M1E Scarborough \n",
"3 M1G Scarborough \n",
"4 M1H Scarborough \n",
".. ... ... \n",
"98 M9N York \n",
"99 M9P Etobicoke \n",
"100 M9R Etobicoke \n",
"101 M9V Etobicoke \n",
"102 M9W Etobicoke \n",
"\n",
" Neighborhood Latitude Longitude \n",
"0 Malvern, Rouge 43.806686 -79.194353 \n",
"1 Rouge Hill, Port Union, Highland Creek 43.784535 -79.160497 \n",
"2 Guildwood, Morningside, West Hill 43.763573 -79.188711 \n",
"3 Woburn 43.770992 -79.216917 \n",
"4 Cedarbrae 43.773136 -79.239476 \n",
".. ... ... ... \n",
"98 Weston 43.706876 -79.518188 \n",
"99 Westmount 43.696319 -79.532242 \n",
"100 Kingsview Village, St. Phillips, Martin Grove ... 43.688905 -79.554724 \n",
"101 South Steeles, Silverstone, Humbergate, Jamest... 43.739416 -79.588437 \n",
"102 Northwest, West Humber - Clairville 43.706748 -79.594054 \n",
"\n",
"[103 rows x 5 columns]"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"geo_data.columns = ['Postal Code', 'Latitude', 'Longitude']\n",
"\n",
"dftorontogeo = pd.merge(dftoronto, geo_data, on='Postal Code')\n",
"\n",
"dftorontogeo"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "conda-env-python-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
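
The cleaning and merge steps in the notebook above can be tried on a small in-memory sample, so the sketch below runs without network access. The sample rows, the M5A coordinates, and the names `rows`, `geo`, `merged`, and `unmatched` are illustrative assumptions, not data from the notebook. Unlike the notebook's default inner merge, this sketch uses a left merge with `indicator=True`, which is one possible way to see which postal codes have no coordinates instead of dropping them silently.

```python
import pandas as pd

# Hypothetical rows in the shape the scraper produces, including an empty
# header row (all None) and a 'Not assigned' borough.
rows = [
    [None, None, None],
    ["M1A", "Not assigned", "Not assigned"],
    ["M1B", "Scarborough", "Malvern, Rouge"],
    ["M5A", "Downtown Toronto", "Regent Park"],
    ["M5A", "Downtown Toronto", "Harbourfront"],
    ["M7A", "Queen's Park", "Not assigned"],
]
df = pd.DataFrame(rows, columns=["Postal Code", "Borough", "Neighborhood"])

# Keep only rows that have a real borough.
df = df[df["Borough"].notnull() & (df["Borough"] != "Not assigned")]

# Collapse rows that share a postal code into one comma-separated entry.
df = (
    df.groupby(["Postal Code", "Borough"])["Neighborhood"]
    .apply(lambda names: ", ".join(names))
    .reset_index()
)

# Fall back to the borough name where the neighborhood is 'Not assigned'.
missing = df["Neighborhood"] == "Not assigned"
df.loc[missing, "Neighborhood"] = df.loc[missing, "Borough"]

# Coordinates table; the M5A values are made up, and M7A is left out on
# purpose to show what the left merge reports.
geo = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M5A"],
        "Latitude": [43.806686, 43.65],
        "Longitude": [-79.194353, -79.36],
    }
)

# A left merge keeps every postal code; the _merge column marks unmatched rows.
merged = pd.merge(df, geo, on="Postal Code", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()

print(merged.drop(columns="_merge"))
print("Postal codes without coordinates:", unmatched)
```

If `unmatched` is empty, the left merge and the notebook's inner merge return the same rows, so the check costs nothing once the two tables are known to line up.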