@du2x
Created January 28, 2020 01:36
Created on Cognitive Class Labs
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Neighborhood Venue Category Patterns vs. Neighborhood Prevailing Social Classes"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Solving environment: done\n",
"\n",
"\n",
"==> WARNING: A newer version of conda exists. <==\n",
" current version: 4.5.11\n",
" latest version: 4.8.1\n",
"\n",
"Please update conda by running\n",
"\n",
" $ conda update -n base -c defaults conda\n",
"\n",
"\n",
"\n",
"# All requested packages already installed.\n",
"\n",
"Hello Capstone Project Course!\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"import requests\n",
"\n",
"CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # replace with your own Foursquare credentials; do not publish them\n",
"CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'\n",
"\n",
"VERSION = '20180605' # Foursquare API version\n",
"\n",
"!conda install -c conda-forge folium=0.5.0 --yes # comment out if folium is already installed\n",
"\n",
"print('Hello Capstone Project Course!')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Introduction \n",
"\n",
"For an entrepreneur, choosing where in a city to open a new establishment is an important and difficult decision, so it helps to have as much information as possible about each neighborhood.\n",
"\n",
"Similarly, the city government needs as much information as possible about each neighborhood in order to manage it properly.\n",
"\n",
"Among all the information available about a neighborhood, its prevailing social class stands out, because the needs and opportunities of a neighborhood are often associated with it.\n",
"\n",
"Here, we seek to develop a model that predicts the prevailing social class of each neighborhood from the categories of establishments located there. The model will be trained on two sources: the venues of each neighborhood, retrieved from the Foursquare API, and a report from UFMG (Federal University of Minas Gerais) that gives the majority social class of each neighborhood. \n",
"\n",
"If the model works well, we may use it to estimate this valuable information for neighborhoods in cities similar to Belo Horizonte that have no report on their neighborhoods' prevailing social classes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Data acquisition and cleaning\n",
"\n",
"The report on the prevailing social classes of Belo Horizonte's neighborhoods is published in PDF format. Fortunately, it is easy to copy its data and paste it into a CSV file. The resulting columns are \"Neighborhood\" and \"Class\". Let's look at the head of this dataset."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Neighborhood</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>AARAO REIS</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ALTO DOS PINHEIROS</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>ALTO PARAISO</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ALVARO CAMARGOS</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ALVORADA</td>\n",
" <td>low</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Neighborhood Class\n",
"0 AARAO REIS low\n",
"1 ALTO DOS PINHEIROS low\n",
"2 ALTO PARAISO low\n",
"3 ALVARO CAMARGOS low\n",
"4 ALVORADA low"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"bairros = pd.read_csv('bh_bairros_classes.csv') # it was previously saved\n",
"bairros.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will retrieve each neighborhood's venues with the Foursquare API by calling the \"explore\" endpoint, which requires location data (latitude and longitude). We use the geopy library to retrieve the location of each neighborhood."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"#!pip install geopy # uncomment if needed\n",
"from geopy import Nominatim\n",
"from geopy.exc import GeocoderUnavailable, GeocoderTimedOut, GeocoderServiceError\n",
"\n",
"geolocator = Nominatim(user_agent=\"ny_explorer\")\n",
"\n",
"def geolocator_belohorizonte(neighborhood, max_retries=5):\n",
"    # geocode a neighborhood of Belo Horizonte, retrying on transient errors\n",
"    for _ in range(max_retries):\n",
"        try:\n",
"            return geolocator.geocode(neighborhood + ', Belo Horizonte')\n",
"        except (GeocoderUnavailable, GeocoderTimedOut, GeocoderServiceError):\n",
"            continue  # geocoder unavailable or timed out, try again\n",
"    return None  # give up after max_retries failed attempts"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(245, 4)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Neighborhood</th>\n",
" <th>Class</th>\n",
" <th>Latitude</th>\n",
" <th>Longitude</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>AARAO REIS</td>\n",
" <td>low</td>\n",
" <td>-19.847221</td>\n",
" <td>-43.919508</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ALTO DOS PINHEIROS</td>\n",
" <td>low</td>\n",
" <td>-19.932567</td>\n",
" <td>-44.004875</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>ALVARO CAMARGOS</td>\n",
" <td>low</td>\n",
" <td>-19.916339</td>\n",
" <td>-44.007857</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ALVORADA</td>\n",
" <td>low</td>\n",
" <td>-30.031715</td>\n",
" <td>-51.049711</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ANA LUCIA</td>\n",
" <td>low</td>\n",
" <td>-19.887783</td>\n",
" <td>-43.906368</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Neighborhood Class Latitude Longitude\n",
"0 AARAO REIS low -19.847221 -43.919508\n",
"1 ALTO DOS PINHEIROS low -19.932567 -44.004875\n",
"2 ALVARO CAMARGOS low -19.916339 -44.007857\n",
"3 ALVORADA low -30.031715 -51.049711\n",
"4 ANA LUCIA low -19.887783 -43.906368"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(bairros.shape)\n",
"bairros.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The geopy results are not always accurate, so we had to remove the neighborhoods whose location could not be retrieved accurately. For example, ALVORADA above was placed at latitude -30.03, far from Belo Horizonte (around -19.92). These cases were identified manually with the help of a map visualization."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div style=\"width:100%;\"><div style=\"position:relative;width:100%;height:0;padding-bottom:60%;\"><iframe src=\"data:text/html;charset=utf-8;base64,\" style=\"position:absolute;width:100%;height:100%;left:0;top:0;border:none !important;\" allowfullscreen webkitallowfullscreen mozallowfullscreen></iframe></div></div>"
],
"text/plain": [
"<folium.folium.Map at 0x7f97cb9d4d30>"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import folium\n",
"import matplotlib.cm as cm\n",
"import matplotlib.colors as colors\n",
"\n",
"classes = {'low':0, 'regular':1, 'high':2, 'luxury':3}\n",
"\n",
"# map center (Belo Horizonte)\n",
"ilat = -19.9227318\n",
"ilon = -43.9450948\n",
"\n",
"# create map\n",
"map_classes = folium.Map(location=[ilat, ilon], zoom_start=11)\n",
"\n",
"# set color scheme: one color per social class\n",
"colors_array = cm.rainbow(np.linspace(0, 1, len(classes)))\n",
"rainbow = [colors.rgb2hex(i) for i in colors_array]\n",
"\n",
"# add markers to the map\n",
"for lat, lon, poi, socialclass in zip(bairros['Latitude'], bairros['Longitude'], bairros['Neighborhood'], bairros['Class']):\n",
"    label = folium.Popup(str(poi) + ' Class ' + socialclass, parse_html=True)\n",
"    folium.CircleMarker(\n",
"        [lat, lon],\n",
"        radius=5,\n",
"        popup=label,\n",
"        color=rainbow[classes[socialclass]],\n",
"        fill=True,\n",
"        fill_color=rainbow[classes[socialclass]],\n",
"        fill_opacity=0.7).add_to(map_classes)\n",
"\n",
"map_classes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The map makes it clear that geopy placed many points outside the bounds of Belo Horizonte. Besides that, as I live in Belo Horizonte, I could detect some neighborhoods far from downtown that were inaccurately located by geopy.\n",
"\n",
"So, we chose to restrict the analysis to neighborhoods not far from downtown.\n",
"\n",
"Also, the location of the \"Pindorama\" neighborhood, which is close to downtown, is remarkably wrong, so it will be removed too."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"df = bairros[(bairros['Latitude']<ilat+0.09) & (bairros['Latitude']>ilat-0.09) & (bairros['Longitude']>ilon-0.06) & (bairros['Longitude']<ilon+0.06)]\n",
"df = df[df['Neighborhood'] != 'PINDORAMA']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have the latitude and longitude of each neighborhood, so we are ready to call the Foursquare API \"explore\" endpoint.\n",
"\n",
"For each neighborhood we made 4 API calls: one for each of the categories \"food\", \"stores and services\" and \"professional\", and one for all categories combined.\n",
"\n",
"So there will be 4 resulting datasets (three category-specific ones and one with all categories combined), and each will look like this:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Foursquare category IDs used to filter the venue search\n",
"categories = {\n",
"    'food': '4d4b7105d754a06374d81259',\n",
"    'stores': '4d4b7105d754a06378d81259', \n",
"    'professional': '4d4b7105d754a06375d81259'\n",
"}\n",
"\n",
"# function that extracts the category of the venue\n",
"def get_category_type(row):\n",
"    try:\n",
"        categories_list = row['categories']\n",
"    except KeyError:\n",
"        categories_list = row['venue.categories']\n",
"\n",
"    if len(categories_list) == 0:\n",
"        return None\n",
"    else:\n",
"        return categories_list[0]['name']\n",
"\n",
"\n",
"# function that returns nearby venues by accessing foursquare \n",
"def getNearbyVenues(names, latitudes, longitudes, category, radius=500, LIMIT=150):\n",
"\n",
"    venues_list=[]\n",
"    for name, lat, lng in zip(names, latitudes, longitudes): \n",
"\n",
"        # create the API request URL\n",
"        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId={}'.format(\n",
"            CLIENT_ID, \n",
"            CLIENT_SECRET, \n",
"            VERSION, \n",
"            lat, \n",
"            lng, \n",
"            radius, \n",
"            LIMIT,\n",
"            category\n",
"        )\n",
"\n",
"        # make the GET request\n",
"        results = requests.get(url).json()[\"response\"]['groups'][0]['items']\n",
"\n",
"        # return only relevant information for each nearby venue\n",
"        venues_list.append([(\n",
"            name, \n",
"            lat, \n",
"            lng, \n",
"            v['venue']['id'], \n",
"            v['venue']['name'], \n",
"            v['venue']['location']['lat'], \n",
"            v['venue']['location']['lng'], \n",
"            v['venue']['categories'][0]['name']) for v in results])\n",
"\n",
"    # flatten the per-neighborhood lists into a single dataframe\n",
"    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])\n",
"    nearby_venues.columns = ['Neighborhood', \n",
"                  'Neighborhood Latitude', \n",
"                  'Neighborhood Longitude', \n",
"                  'ID', \n",
"                  'Venue', \n",
"                  'Venue Latitude', \n",
"                  'Venue Longitude', \n",
"                  'Venue Category']\n",
"\n",
"    return nearby_venues"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Neighborhood</th>\n",
" <th>Neighborhood Latitude</th>\n",
" <th>Neighborhood Longitude</th>\n",
" <th>ID</th>\n",
" <th>Venue</th>\n",
" <th>Venue Latitude</th>\n",
" <th>Venue Longitude</th>\n",
" <th>Venue Category</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>AARAO REIS</td>\n",
" <td>-19.847221</td>\n",
" <td>-43.919508</td>\n",
" <td>4eb1d86e77c814d925751c99</td>\n",
" <td>Chapa Mágica</td>\n",
" <td>-19.845448</td>\n",
" <td>-43.921754</td>\n",
" <td>BBQ Joint</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>AARAO REIS</td>\n",
" <td>-19.847221</td>\n",
" <td>-43.919508</td>\n",
" <td>5bf1cc4275eee40039f91adf</td>\n",
" <td>Burger King</td>\n",
" <td>-19.846823</td>\n",
" <td>-43.919360</td>\n",
" <td>Fast Food Restaurant</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>AARAO REIS</td>\n",
" <td>-19.847221</td>\n",
" <td>-43.919508</td>\n",
" <td>4daba4b84b22f071ead33715</td>\n",
" <td>Celo Burguer</td>\n",
" <td>-19.847524</td>\n",
" <td>-43.919394</td>\n",
" <td>Burger Joint</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>AARAO REIS</td>\n",
" <td>-19.847221</td>\n",
" <td>-43.919508</td>\n",
" <td>516dc84d498e618c69124919</td>\n",
" <td>bobs</td>\n",
" <td>-19.846710</td>\n",
" <td>-43.917326</td>\n",
" <td>Burger Joint</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>AARAO REIS</td>\n",
" <td>-19.847221</td>\n",
" <td>-43.919508</td>\n",
" <td>539991af498ea6a823188d29</td>\n",
" <td>Padaria Vila Verde</td>\n",
" <td>-19.847362</td>\n",
" <td>-43.921778</td>\n",
" <td>Bakery</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Neighborhood Neighborhood Latitude Neighborhood Longitude \\\n",
"0 AARAO REIS -19.847221 -43.919508 \n",
"1 AARAO REIS -19.847221 -43.919508 \n",
"2 AARAO REIS -19.847221 -43.919508 \n",
"3 AARAO REIS -19.847221 -43.919508 \n",
"4 AARAO REIS -19.847221 -43.919508 \n",
"\n",
" ID Venue Venue Latitude \\\n",
"0 4eb1d86e77c814d925751c99 Chapa Mágica -19.845448 \n",
"1 5bf1cc4275eee40039f91adf Burger King -19.846823 \n",
"2 4daba4b84b22f071ead33715 Celo Burguer -19.847524 \n",
"3 516dc84d498e618c69124919 bobs -19.846710 \n",
"4 539991af498ea6a823188d29 Padaria Vila Verde -19.847362 \n",
"\n",
" Venue Longitude Venue Category \n",
"0 -43.921754 BBQ Joint \n",
"1 -43.919360 Fast Food Restaurant \n",
"2 -43.919394 Burger Joint \n",
"3 -43.917326 Burger Joint \n",
"4 -43.921778 Bakery "
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# one call per category; the category must be passed by keyword after the other keyword arguments\n",
"bh_food_venues = getNearbyVenues(names=df['Neighborhood'],\n",
"                                 latitudes=df['Latitude'],\n",
"                                 longitudes=df['Longitude'],\n",
"                                 category=categories['food']\n",
"                                 )\n",
"\n",
"bh_pro_venues = getNearbyVenues(names=df['Neighborhood'],\n",
"                                latitudes=df['Latitude'],\n",
"                                longitudes=df['Longitude'],\n",
"                                category=categories['professional']\n",
"                                )\n",
"\n",
"bh_stores_venues = getNearbyVenues(names=df['Neighborhood'],\n",
"                                   latitudes=df['Latitude'],\n",
"                                   longitudes=df['Longitude'],\n",
"                                   category=categories['stores']\n",
"                                   )\n",
"\n",
"# an empty category ID retrieves venues of all categories\n",
"bh_all_venues = getNearbyVenues(names=df['Neighborhood'],\n",
"                                latitudes=df['Latitude'],\n",
"                                longitudes=df['Longitude'],\n",
"                                category=''\n",
"                                )\n",
"bh_food_venues.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Methodology "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So now we have 4 clean datasets:\n",
"\n",
"* bh_food_venues\n",
"* bh_stores_venues\n",
"* bh_pro_venues\n",
"* bh_all_venues\n",
"\n",
"We will use them to train classification models: the venue categories will be the features (after one-hot encoding) and the social class will be the target variable.\n",
"\n",
"The resulting models will be evaluated, and we will report which dataset and which classification algorithm work best for our goal.\n",
"\n",
"Let's pick one dataset and see if it's ready to go:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Neighborhood</th>\n",
" <th>Neighborhood Latitude</th>\n",
" <th>Neighborhood Longitude</th>\n",
" <th>ID</th>\n",
" <th>Venue</th>\n",
" <th>Venue Latitude</th>\n",
" <th>Venue Longitude</th>\n",
" <th>Venue Category</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>AARAO REIS</td>\n",
" <td>-19.847221</td>\n",
" <td>-43.919508</td>\n",
" <td>4eb1d86e77c814d925751c99</td>\n",
" <td>Chapa Mágica</td>\n",
" <td>-19.845448</td>\n",
" <td>-43.921754</td>\n",
" <td>BBQ Joint</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>AARAO REIS</td>\n",
" <td>-19.847221</td>\n",
" <td>-43.919508</td>\n",
" <td>5bf1cc4275eee40039f91adf</td>\n",
" <td>Burger King</td>\n",
" <td>-19.846823</td>\n",
" <td>-43.919360</td>\n",
" <td>Fast Food Restaurant</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>AARAO REIS</td>\n",
" <td>-19.847221</td>\n",
" <td>-43.919508</td>\n",
" <td>4daba4b84b22f071ead33715</td>\n",
" <td>Celo Burguer</td>\n",
" <td>-19.847524</td>\n",
" <td>-43.919394</td>\n",
" <td>Burger Joint</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>AARAO REIS</td>\n",
" <td>-19.847221</td>\n",
" <td>-43.919508</td>\n",
" <td>516dc84d498e618c69124919</td>\n",
" <td>bobs</td>\n",
" <td>-19.846710</td>\n",
" <td>-43.917326</td>\n",
" <td>Burger Joint</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>AARAO REIS</td>\n",
" <td>-19.847221</td>\n",
" <td>-43.919508</td>\n",
" <td>539991af498ea6a823188d29</td>\n",
" <td>Padaria Vila Verde</td>\n",
" <td>-19.847362</td>\n",
" <td>-43.921778</td>\n",
" <td>Bakery</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Neighborhood Neighborhood Latitude Neighborhood Longitude \\\n",
"0 AARAO REIS -19.847221 -43.919508 \n",
"1 AARAO REIS -19.847221 -43.919508 \n",
"2 AARAO REIS -19.847221 -43.919508 \n",
"3 AARAO REIS -19.847221 -43.919508 \n",
"4 AARAO REIS -19.847221 -43.919508 \n",
"\n",
" ID Venue Venue Latitude \\\n",
"0 4eb1d86e77c814d925751c99 Chapa Mágica -19.845448 \n",
"1 5bf1cc4275eee40039f91adf Burger King -19.846823 \n",
"2 4daba4b84b22f071ead33715 Celo Burguer -19.847524 \n",
"3 516dc84d498e618c69124919 bobs -19.846710 \n",
"4 539991af498ea6a823188d29 Padaria Vila Verde -19.847362 \n",
"\n",
" Venue Longitude Venue Category \n",
"0 -43.921754 BBQ Joint \n",
"1 -43.919360 Fast Food Restaurant \n",
"2 -43.919394 Burger Joint \n",
"3 -43.917326 Burger Joint \n",
"4 -43.921778 Bakery "
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bh_all_venues = pd.read_csv('bh_food_venues.csv')\n",
"bh_all_venues.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we are going to use the venue categories as features of our classification algorithms, it is appropriate to exclude neighborhoods with a small number of venues (fewer than 8), because they are likely to become outliers.\n",
"\n",
"Unfortunately, the stores dataset becomes too small after that restriction, so it will be discarded.\n",
"\n",
"Now let's take a quick look at the most common venue categories of **bh_all_venues**:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Bakery 296\n",
"Bar 259\n",
"Brazilian Restaurant 230\n",
"Gym / Fitness Center 193\n",
"Burger Joint 165\n",
" ... \n",
"Fish Market 1\n",
"Speakeasy 1\n",
"Travel Agency 1\n",
"Design Studio 1\n",
"Butcher 1\n",
"Name: Venue Category, Length: 272, dtype: int64"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bh_all_venues['Venue Category'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It looks like we are now ready to prepare our dataset for the classification algorithms.\n",
"\n",
"First we apply one-hot encoding and drop the columns that are neither features nor the target.\n",
"\n",
"Then we split the data into a training set and a test set and build KNN, SVM and logistic regression models.\n",
"\n",
"The target (y) will be tested in the following formats:\n",
"\n",
"* the actual class\n",
"* if the class == 'luxury'\n",
"* if the class == 'high'\n",
"* if the class == 'regular'\n",
"* if the class == 'low'\n",
"* if the class == 'high' or 'luxury'"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn import preprocessing\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn import svm\n",
"from sklearn import metrics\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"le = preprocessing.LabelEncoder()\n",
"le.fit([\"low\", \"regular\", \"high\", \"luxury\"])\n",
"\n",
"def prepare(df_venues): # returns a dataset with only features and target\n",
"    # one hot encoding\n",
"    onehot = pd.get_dummies(df_venues[['Venue Category']], prefix=\"\", prefix_sep=\"\")\n",
"\n",
"    # add neighborhood column back to dataframe\n",
"    onehot.insert(0, 'Neighborhood', df_venues['Neighborhood'].values) \n",
"\n",
"    # relative frequency of each venue category per neighborhood\n",
"    df_grouped = onehot.groupby('Neighborhood').mean().reset_index()\n",
"\n",
"    # dropping neighborhoods with fewer than 8 venues to avoid outliers\n",
"    min_venues_mask = df_venues.groupby('Neighborhood').count()['Venue']>=8 \n",
"    df_grouped = df_grouped[min_venues_mask.values]\n",
"\n",
"    # join the class labels in df (defined above) with the category frequencies\n",
"    df_merged = df.join(df_grouped.set_index('Neighborhood'), on='Neighborhood')\n",
"\n",
"    # neighborhoods without enough venues have no features, so drop them\n",
"    df_merged = df_merged.dropna()\n",
"    return df_merged.drop(['Latitude', 'Longitude', 'Neighborhood'], axis=1)\n",
"\n",
"\n",
"def build_evaluate(X, y):\n",
"    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)\n",
"\n",
"    LR = LogisticRegression(C=0.1, solver='liblinear', multi_class='auto').fit(X_train, y_train)\n",
"    yhat_lr = LR.predict(X_test)\n",
"    yhat_lr_prob = LR.predict_proba(X_test)\n",
"\n",
"    print(' Logistic regression accuracy score', metrics.accuracy_score(y_test, yhat_lr))\n",
"    # pass the labels explicitly in case the test split does not contain every class\n",
"    print(' Logistic regression log loss', metrics.log_loss(y_test, yhat_lr_prob, labels=LR.classes_))\n",
"\n",
"    # try k = 1..9 and keep the accuracy of each KNN model\n",
"    Ks = 10\n",
"    mean_acc = np.zeros((Ks-1))\n",
"    std_acc = np.zeros((Ks-1))\n",
"    for n in range(1,Ks):\n",
"\n",
"        #Train Model and Predict \n",
"        neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)\n",
"        yhat=neigh.predict(X_test)\n",
"        mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)\n",
"\n",
"        std_acc[n-1]=np.std(yhat==y_test)/np.sqrt(yhat.shape[0])\n",
"\n",
"    print(\" KNN accuracy score\", mean_acc.max(), \" with k=\", mean_acc.argmax()+1)\n",
"\n",
"    clf = svm.SVC(kernel='rbf', gamma='auto')\n",
"    clf.fit(X_train, y_train) \n",
"\n",
"    yhat_svm = clf.predict(X_test)\n",
"    print(\" SVM accuracy score\", metrics.accuracy_score(y_test, yhat_svm)) "
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Class</th>\n",
" <th>Acai House</th>\n",
" <th>American Restaurant</th>\n",
" <th>Arepa Restaurant</th>\n",
" <th>Argentinian Restaurant</th>\n",
" <th>Asian Restaurant</th>\n",
" <th>Australian Restaurant</th>\n",
" <th>BBQ Joint</th>\n",
" <th>Bagel Shop</th>\n",
" <th>Baiano Restaurant</th>\n",
" <th>...</th>\n",
" <th>Soup Place</th>\n",
" <th>South American Restaurant</th>\n",
" <th>Spanish Restaurant</th>\n",
" <th>Steakhouse</th>\n",
" <th>Sushi Restaurant</th>\n",
" <th>Syrian Restaurant</th>\n",
" <th>Taco Place</th>\n",
" <th>Tapas Restaurant</th>\n",
" <th>Vegetarian / Vegan Restaurant</th>\n",
" <th>Wings Joint</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>low</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.090909</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>low</td>\n",
" <td>0.041667</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.041667</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>low</td>\n",
" <td>0.050000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.100000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>low</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.052632</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.157895</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>low</td>\n",
" <td>0.055556</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.055556</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>240</th>\n",
" <td>luxury</td>\n",
" <td>0.037037</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.074074</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.074074</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>241</th>\n",
" <td>luxury</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.011628</td>\n",
" <td>0.0</td>\n",
" <td>0.011628</td>\n",
" <td>0.0</td>\n",
" <td>0.011628</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.023256</td>\n",
" <td>0.011628</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.058140</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>242</th>\n",
" <td>luxury</td>\n",
" <td>0.058824</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.058824</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>243</th>\n",
" <td>luxury</td>\n",
" <td>0.022727</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.022727</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.022727</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.022727</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.045455</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>244</th>\n",
" <td>luxury</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.086957</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.043478</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>130 rows × 72 columns</p>\n",
"</div>"
],
"text/plain": [
" Class Acai House American Restaurant Arepa Restaurant \\\n",
"0 low 0.000000 0.0 0.0 \n",
"4 low 0.041667 0.0 0.0 \n",
"5 low 0.050000 0.0 0.0 \n",
"6 low 0.000000 0.0 0.0 \n",
"11 low 0.055556 0.0 0.0 \n",
".. ... ... ... ... \n",
"240 luxury 0.037037 0.0 0.0 \n",
"241 luxury 0.000000 0.0 0.0 \n",
"242 luxury 0.058824 0.0 0.0 \n",
"243 luxury 0.022727 0.0 0.0 \n",
"244 luxury 0.000000 0.0 0.0 \n",
"\n",
" Argentinian Restaurant Asian Restaurant Australian Restaurant \\\n",
"0 0.000000 0.000000 0.0 \n",
"4 0.000000 0.000000 0.0 \n",
"5 0.000000 0.000000 0.0 \n",
"6 0.000000 0.000000 0.0 \n",
"11 0.000000 0.000000 0.0 \n",
".. ... ... ... \n",
"240 0.000000 0.000000 0.0 \n",
"241 0.000000 0.011628 0.0 \n",
"242 0.000000 0.000000 0.0 \n",
"243 0.022727 0.000000 0.0 \n",
"244 0.000000 0.000000 0.0 \n",
"\n",
" BBQ Joint Bagel Shop Baiano Restaurant ... Soup Place \\\n",
"0 0.090909 0.0 0.000000 ... 0.0 \n",
"4 0.000000 0.0 0.000000 ... 0.0 \n",
"5 0.100000 0.0 0.000000 ... 0.0 \n",
"6 0.052632 0.0 0.000000 ... 0.0 \n",
"11 0.000000 0.0 0.000000 ... 0.0 \n",
".. ... ... ... ... ... \n",
"240 0.074074 0.0 0.000000 ... 0.0 \n",
"241 0.011628 0.0 0.011628 ... 0.0 \n",
"242 0.000000 0.0 0.000000 ... 0.0 \n",
"243 0.000000 0.0 0.022727 ... 0.0 \n",
"244 0.086957 0.0 0.000000 ... 0.0 \n",
"\n",
" South American Restaurant Spanish Restaurant Steakhouse \\\n",
"0 0.0 0.0 0.000000 \n",
"4 0.0 0.0 0.041667 \n",
"5 0.0 0.0 0.000000 \n",
"6 0.0 0.0 0.157895 \n",
"11 0.0 0.0 0.055556 \n",
".. ... ... ... \n",
"240 0.0 0.0 0.000000 \n",
"241 0.0 0.0 0.023256 \n",
"242 0.0 0.0 0.058824 \n",
"243 0.0 0.0 0.000000 \n",
"244 0.0 0.0 0.000000 \n",
"\n",
" Sushi Restaurant Syrian Restaurant Taco Place Tapas Restaurant \\\n",
"0 0.000000 0.0 0.0 0.0 \n",
"4 0.000000 0.0 0.0 0.0 \n",
"5 0.000000 0.0 0.0 0.0 \n",
"6 0.000000 0.0 0.0 0.0 \n",
"11 0.000000 0.0 0.0 0.0 \n",
".. ... ... ... ... \n",
"240 0.000000 0.0 0.0 0.0 \n",
"241 0.011628 0.0 0.0 0.0 \n",
"242 0.000000 0.0 0.0 0.0 \n",
"243 0.022727 0.0 0.0 0.0 \n",
"244 0.043478 0.0 0.0 0.0 \n",
"\n",
" Vegetarian / Vegan Restaurant Wings Joint \n",
"0 0.000000 0.0 \n",
"4 0.000000 0.0 \n",
"5 0.000000 0.0 \n",
"6 0.000000 0.0 \n",
"11 0.000000 0.0 \n",
".. ... ... \n",
"240 0.074074 0.0 \n",
"241 0.058140 0.0 \n",
"242 0.000000 0.0 \n",
"243 0.045455 0.0 \n",
"244 0.000000 0.0 \n",
"\n",
"[130 rows x 72 columns]"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bh_food_venues = pd.read_csv('bh_food_venues.csv')\n",
"bh_pro_venues = pd.read_csv('bh_pro_venues.csv')\n",
"prepare(bh_all_venues)"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"### all venues category dataset ###\n",
" ### Actual class ###\n",
" Logistic regression accuracy score 0.6923076923076923\n",
" Logistic regression log loss 1.1909560870176388\n",
" KNN accuracy score 0.5769230769230769 with k= 8\n",
" SVM accuracy score 0.6923076923076923\n",
" ### if y == 'low' ###\n",
" Logistic regression accuracy score 0.9230769230769231\n",
" Logistic regression log loss 0.40203381145443345\n",
" KNN accuracy score 0.9230769230769231 with k= 6\n",
" SVM accuracy score 0.9230769230769231\n",
" ### if y == 'regular' ###\n",
" Logistic regression accuracy score 0.3076923076923077\n",
" Logistic regression log loss 0.7406334499138543\n",
" KNN accuracy score 0.6538461538461539 with k= 1\n",
" SVM accuracy score 0.3076923076923077\n",
" ### if y == 'high' ###\n",
" Logistic regression accuracy score 0.9615384615384616\n",
" Logistic regression log loss 0.34060287134407913\n",
" KNN accuracy score 1.0 with k= 1\n",
" SVM accuracy score 0.9615384615384616\n",
" ### if y == 'luxury' ###\n",
" Logistic regression accuracy score 0.8076923076923077\n",
" Logistic regression log loss 0.5148135568320493\n",
" KNN accuracy score 0.8076923076923077 with k= 2\n",
" SVM accuracy score 0.8076923076923077\n",
"### food venues category dataset ###\n",
" ### Actual class ###\n",
" Logistic regression accuracy score 0.6923076923076923\n",
" Logistic regression log loss 1.1909560870176388\n",
" KNN accuracy score 0.5769230769230769 with k= 8\n",
" SVM accuracy score 0.6923076923076923\n",
" ### if y == 'low' ###\n",
" Logistic regression accuracy score 0.9230769230769231\n",
" Logistic regression log loss 0.40203381145443345\n",
" KNN accuracy score 0.9230769230769231 with k= 6\n",
" SVM accuracy score 0.9230769230769231\n",
" ### if y == 'regular' ###\n",
" Logistic regression accuracy score 0.3076923076923077\n",
" Logistic regression log loss 0.7406334499138543\n",
" KNN accuracy score 0.6538461538461539 with k= 1\n",
" SVM accuracy score 0.3076923076923077\n",
" ### if y == 'high' ###\n",
" Logistic regression accuracy score 0.9615384615384616\n",
" Logistic regression log loss 0.34060287134407913\n",
" KNN accuracy score 1.0 with k= 1\n",
" SVM accuracy score 0.9615384615384616\n",
" ### if y == 'luxury' ###\n",
" Logistic regression accuracy score 0.8076923076923077\n",
" Logistic regression log loss 0.5148135568320493\n",
" KNN accuracy score 0.8076923076923077 with k= 2\n",
" SVM accuracy score 0.8076923076923077\n",
"### professional venues category dataset ###\n",
" ### Actual class ###\n",
" Logistic regression accuracy score 0.5714285714285714\n",
" Logistic regression log loss 1.2284283597098578\n",
" KNN accuracy score 0.5357142857142857 with k= 2\n",
" SVM accuracy score 0.5714285714285714\n",
" ### if y == 'low' ###\n",
" Logistic regression accuracy score 0.8214285714285714\n",
" Logistic regression log loss 0.488417167619632\n",
" KNN accuracy score 0.8214285714285714 with k= 4\n",
" SVM accuracy score 0.8214285714285714\n",
" ### if y == 'regular' ###\n",
" Logistic regression accuracy score 0.42857142857142855\n",
" Logistic regression log loss 0.6960464112425366\n",
" KNN accuracy score 0.6785714285714286 with k= 5\n",
" SVM accuracy score 0.42857142857142855\n",
" ### if y == 'high' ###\n",
" Logistic regression accuracy score 0.8928571428571429\n",
" Logistic regression log loss 0.3983842504318268\n",
" KNN accuracy score 0.9285714285714286 with k= 2\n",
" SVM accuracy score 0.8928571428571429\n",
" ### if y == 'luxury' ###\n",
" Logistic regression accuracy score 0.8571428571428571\n",
" Logistic regression log loss 0.4717453131865441\n",
" KNN accuracy score 0.8571428571428571 with k= 2\n",
" SVM accuracy score 0.8571428571428571\n"
]
}
],
"source": [
"names = ['all', 'food', 'professional']\n",
"datasets = [bh_all_venues, bh_food_venues, bh_pro_venues]\n",
"\n",
"for name, df_venues in zip(names, datasets):\n",
" \n",
" print('### {} venues category dataset ###'.format(name))\n",
" \n",
" df_merged = prepare(df_venues)\n",
"\n",
" y = le.transform(df_merged['Class'].values)\n",
" y0 = y == 0\n",
" y1 = y == 1\n",
" y2 = y == 2\n",
" y3 = y == 3\n",
"\n",
    "    X = df_merged.drop('Class', axis=1)  # pass axis by keyword; the positional form was removed in pandas 2.0\n",
"\n",
" print(' ### Actual class ###')\n",
" build_evaluate(X, y)\n",
"\n",
" print(' ### if y == \\'low\\' ###')\n",
" build_evaluate(X, y0)\n",
"\n",
" print(' ### if y == \\'regular\\' ###')\n",
" build_evaluate(X, y1)\n",
"\n",
" print(' ### if y == \\'high\\' ###')\n",
" build_evaluate(X, y2)\n",
"\n",
" print(' ### if y == \\'luxury\\' ###')\n",
" build_evaluate(X, y3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "As mentioned before, we built KNN, SVM and logistic regression models, but we show only the results obtained by logistic regression, because it gets the best scores (Jaccard index score) in most cases."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.1 All venue categories dataset scores"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* the actual class: 0.6666\n",
"* if the class == 'luxury': **0.9**\n",
"* if the class == 'high': 0.8666\n",
"* if the class == 'regular': 0.3333\n",
"* if the class == 'low': **0.9**\n",
"* if the class == 'high' or 'luxury': 0.7666"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.2 Food venue categories dataset scores"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* the actual class: 0.6923\n",
"* if the class == 'luxury': 0.8\n",
"* if the class == 'high': **0.96**\n",
"* if the class == 'regular': 0.3076\n",
"* if the class == 'low': **0.923**\n",
"* if the class == 'high' or 'luxury': 0.7692"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.3 Professional venue categories dataset scores"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* the actual class: 0.5714\n",
"* if the class == 'luxury': 0.8571\n",
"* if the class == 'high': 0.8928\n",
"* if the class == 'regular': 0.4286\n",
"* if the class == 'low': 0.8214\n",
"* if the class == 'high' or 'luxury': 0.75"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Discussion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Interestingly, the food venue categories dataset got the best overall results, closely followed by the all-categories dataset. This suggests that it may be possible to combine two or more category groups to get better scores, since some categories clearly degrade the score (see the professional venue categories dataset).\n",
"\n",
    "It is also interesting that the model built with the food venue categories dataset predicts the 'high' and 'low' classes remarkably well, and could likely be used in a different city similar to Belo Horizonte."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Even though we couldn't get a great model to predict the actual class of a neighborhood, we got interesting results predicting the 'high' and 'low' classes using the food categories dataset, and predicting the 'luxury' and 'low' classes using the all-categories dataset."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "conda-env-python-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}