@firmai
Last active January 29, 2024 19:17
AirBnB Valuation.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/firmai/0a20f90e9e6a8c13c048b9b163cbed8c/airbnb-valuation.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Aconh-UHZXEI"
},
"source": [
"**Airbnb Rental Valuation**\n",
"\n",
"Welcome to Airbnb Analysis Corp.! Your task is to set a competitive **daily accommodation rate** for a client's house in Bondi Beach. The owner currently charges $500, and we have been asked to estimate the **fair value** that the owner should be charging. While developing this model, you came to realise that Airbnb could use it to estimate the fair value of any property in its database: you are effectively creating a recommendation model for all prospective hosts! The house has the following characteristics and constraints.\n",
"\n",
"\n",
"1. The owner has been a host since **August 2010**\n",
"1. The location is **lon: 151.274506, lat: -33.889087**\n",
"1. The current review score rating is **95.0**\n",
"1. Number of reviews **53**\n",
"1. Minimum nights **4**\n",
"1. The house can accommodate **10** people.\n",
"1. The owner currently charges a cleaning fee of **370**\n",
"1. The house has **3 bathrooms, 5 bedrooms, 7 beds**.\n",
"1. The house is available for **255 of the next 365 days**\n",
"1. The client is **verified**, and they are a **superhost**.\n",
"1. The cancellation policy is **strict with a 14 days grace period**.\n",
"1. The host requires a security deposit of **$1,500**\n",
"\n",
"\n",
"*All values strictly apply to the month of July 2018*"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "lTnEOeuYZXEK"
},
"outputs": [],
"source": [
"from dateutil import parser\n",
"dict_client = {}\n",
"\n",
"dict_client[\"city\"] = \"Bondi Beach\"\n",
"dict_client[\"longitude\"] = 151.274506\n",
"dict_client[\"latitude\"] = -33.889087\n",
"dict_client[\"review_scores_rating\"] = 95\n",
"dict_client[\"number_of_reviews\"] = 53\n",
"dict_client[\"minimum_nights\"] = 4\n",
"dict_client[\"accommodates\"] = 10\n",
"dict_client[\"bathrooms\"] = 3\n",
"dict_client[\"bedrooms\"] = 5\n",
"dict_client[\"beds\"] = 7\n",
"dict_client[\"security_deposit\"] = 1500\n",
"dict_client[\"cleaning_fee\"] = 370\n",
"dict_client[\"property_type\"] = \"House\"\n",
"dict_client[\"room_type\"] = \"Entire home/apt\"\n",
"dict_client[\"availability_365\"] = 255\n",
"dict_client[\"host_identity_verified\"] = 1 ## 1 for yes, 0 for no\n",
"dict_client[\"host_is_superhost\"] = 1\n",
"dict_client[\"cancellation_policy\"] = \"strict_14_with_grace_period\"\n",
"dict_client[\"host_since\"] = parser.parse(\"01-08-2010\", dayfirst=True) ## 1 August 2010; without dayfirst this parses as 8 January\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XLqEwDW3ZXEN"
},
"source": [
"# Setup"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "g-5V7ujhZXEO"
},
"source": [
"First, let's make sure this notebook works well in both Python 2 and 3, import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "LsbHVGAqZXEP"
},
"outputs": [],
"source": [
"# To support both python 2 and python 3\n",
"from __future__ import division, print_function, unicode_literals\n",
"# Common imports\n",
"import numpy as np\n",
"import os\n",
"import pandas as pd\n",
"\n",
"# to make this notebook's output stable across runs\n",
"np.random.seed(42)\n",
"\n",
"# To plot pretty figures\n",
"%matplotlib inline\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"plt.rcParams['axes.labelsize'] = 14\n",
"plt.rcParams['xtick.labelsize'] = 12\n",
"plt.rcParams['ytick.labelsize'] = 12\n",
"\n",
"# Where to save the figures\n",
"PROJECT_ROOT_DIR = \".\"\n",
"CHAPTER_ID = \"end_to_end_project\"\n",
"IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, \"images\", CHAPTER_ID)\n",
"\n",
"def save_fig(fig_id, tight_layout=True, fig_extension=\"png\", resolution=300):\n",
"    path = os.path.join(IMAGES_PATH, fig_id + \".\" + fig_extension)\n",
"    print(\"Saving figure\", fig_id)\n",
"    if tight_layout:\n",
"        plt.tight_layout()\n",
"    try:\n",
"        plt.savefig(path, format=fig_extension, dpi=resolution)\n",
"    except Exception:\n",
"        # Fall back to the working directory if IMAGES_PATH does not exist\n",
"        plt.savefig(fig_id + \".\" + fig_extension, format=fig_extension, dpi=resolution)\n",
"\n",
"# Ignore useless warnings (see SciPy issue #5998)\n",
"import warnings\n",
"warnings.filterwarnings(action=\"ignore\", message=\"^internal gelsd\")\n",
"pd.options.display.max_columns = None"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "o6O0qpwJZXEQ"
},
"source": [
"# Get the data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "ACVfMcS3ZXEQ",
"outputId": "c0ee9078-9f82-4f15-8ca3-ce6bda4a32c3",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Be patient: loading from database (2 minutes)\n"
]
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"<ipython-input-3-1a096963100c>:15: DtypeWarning: Columns (36,54,55) have mixed types. Specify dtype option on import or set low_memory=False.\n",
" df = pd.read_csv(github_p+'sydney_airbnb.csv')\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"Done\n"
]
}
],
"source": [
"import pandas as pd\n",
"## This is simply a bit of importing logic that you don't have ..\n",
"## .. to concern yourself with for now.\n",
"\n",
"from pathlib import Path\n",
"\n",
"github_p = \"https://storage.googleapis.com/public-quant/course//content/\"\n",
"\n",
"my_file = Path(\"sydney_airbnb.csv\") # Defines path\n",
"if my_file.is_file(): # See if file exists\n",
"    print(\"Local file found\")\n",
"    df = pd.read_csv('sydney_airbnb.csv')\n",
"else:\n",
"    print(\"Be patient: loading from database (2 minutes)\")\n",
"    df = pd.read_csv(github_p+'sydney_airbnb.csv')\n",
"    print(\"Done\")"
]
},
{
"cell_type": "code",
"source": [
"df.head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 764
},
"id": "UhaVCzir9pVA",
"outputId": "7ad7b145-e2bf-41e2-94ea-c2d19ba5dfae"
},
"execution_count": 4,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" id listing_url \\\n",
"0 11156 https://www.airbnb.com/rooms/11156 \n",
"1 12351 https://www.airbnb.com/rooms/12351 \n",
"2 14250 https://www.airbnb.com/rooms/14250 \n",
"3 14935 https://www.airbnb.com/rooms/14935 \n",
"4 14974 https://www.airbnb.com/rooms/14974 \n",
"\n",
" name \\\n",
"0 An Oasis in the City \n",
"1 Sydney City & Harbour at the door \n",
"2 Manly Harbour House \n",
"3 Eco-conscious Travellers: Private Room \n",
"4 Eco-conscious Traveller: Sofa Couch \n",
"\n",
" summary \\\n",
"0 Very central to the city which can be reached ... \n",
"1 Come stay with Vinh & Stuart (Awarded as one o... \n",
"2 Beautifully renovated, spacious and quiet, our... \n",
"3 Welcome! This apartment will suit a short term... \n",
"4 Welcome! This apartment will suit a short term... \n",
"\n",
" space \\\n",
"0 Potts Pt. is a vibrant and popular inner-city... \n",
"1 We're pretty relaxed hosts, and we fully appre... \n",
"2 Our home is a thirty minute walk along the sea... \n",
"3 I live upstairs in my own room with my own bat... \n",
"4 Comes with a fully equipped gym and pool - whi... \n",
"\n",
" description \\\n",
"0 Very central to the city which can be reached ... \n",
"1 Come stay with Vinh & Stuart (Awarded as one o... \n",
"2 Beautifully renovated, spacious and quiet, our... \n",
"3 Welcome! This apartment will suit a short term... \n",
"4 Welcome! This apartment will suit a short term... \n",
"\n",
" neighborhood_overview \\\n",
"0 It is very close to everything and everywhere,... \n",
"1 Pyrmont is an inner-city village of Sydney, on... \n",
"2 Balgowlah Heights is one of the most prestigio... \n",
"3 NaN \n",
"4 NaN \n",
"\n",
" notes \\\n",
"0 $150.00 key security deposit, refundable on re... \n",
"1 We've a few reasons for the 6.00pm arrival tim... \n",
"2 NaN \n",
"3 The building can be hard to find, so please en... \n",
"4 I live upstairs in my own room with my own bat... \n",
"\n",
" transit \\\n",
"0 It is 7 minutes walk to the Kings Cross.train ... \n",
"1 Our home is centrally located and an easy walk... \n",
"2 Balgowlah - Manly bus # 131 or #132 (Bus stop... \n",
"3 DIRECTIONS VIA TAXI: Get dropped off at Renwic... \n",
"4 DIRECTIONS VIA TAXI: Get dropped off at Renwic... \n",
"\n",
" access \\\n",
"0 Kitchen & laundry facilities. Shared bathroom. \n",
"1 We look forward to welcoming you just as we wo... \n",
"2 Guests have access to whole house except locke... \n",
"3 I work from home most times - so if I'm home, ... \n",
"4 I work from home most times - so if I'm home, ... \n",
"\n",
" interaction \\\n",
"0 As much as they want. \n",
"1 As much or as little as you like. We live here... \n",
"2 NaN \n",
"3 I'm not a big chatter, so don't get offended i... \n",
"4 I'm not a big chatter, so don't get offended i... \n",
"\n",
" house_rules \\\n",
"0 Be considerate. No showering after 2330h. \n",
"1 We look forward to welcoming you to stay you j... \n",
"2 Standard Terms and Conditions of Temporary Hol... \n",
"3 1. Enjoy and always bring a smile during your ... \n",
"4 1. Enjoy and always bring a smile during your ... \n",
"\n",
" picture_url host_id \\\n",
"0 https://a0.muscache.com/im/pictures/2797669/17... 40855 \n",
"1 https://a0.muscache.com/im/pictures/763ad5c8-c... 17061 \n",
"2 https://a0.muscache.com/im/pictures/56935671/f... 55948 \n",
"3 https://a0.muscache.com/im/pictures/2257353/d3... 58796 \n",
"4 https://a0.muscache.com/im/pictures/2197966/6e... 58796 \n",
"\n",
" host_url host_name host_since \\\n",
"0 https://www.airbnb.com/users/show/40855 Colleen 23/09/09 \n",
"1 https://www.airbnb.com/users/show/17061 Stuart 14/05/09 \n",
"2 https://www.airbnb.com/users/show/55948 Heidi 20/11/09 \n",
"3 https://www.airbnb.com/users/show/58796 Kevin 30/11/09 \n",
"4 https://www.airbnb.com/users/show/58796 Kevin 30/11/09 \n",
"\n",
" host_location \\\n",
"0 Potts Point, New South Wales, Australia \n",
"1 Sydney, New South Wales, Australia \n",
"2 Sydney, New South Wales, Australia \n",
"3 Sydney, New South Wales, Australia \n",
"4 Sydney, New South Wales, Australia \n",
"\n",
" host_about host_response_time \\\n",
"0 Recently retired, I've lived & worked on 4 con... within a day \n",
"1 G'Day from Australia!\\r\\n\\r\\nHe's Vinh, and I'... within an hour \n",
"2 I am a Canadian who has made Australia her hom... within a few hours \n",
"3 I've moved countries twice in the span of 10 y... within an hour \n",
"4 I've moved countries twice in the span of 10 y... within an hour \n",
"\n",
" host_response_rate host_is_superhost \\\n",
"0 67% t \n",
"1 100% f \n",
"2 100% f \n",
"3 100% f \n",
"4 100% f \n",
"\n",
" host_thumbnail_url \\\n",
"0 https://a0.muscache.com/im/users/40855/profile... \n",
"1 https://a0.muscache.com/im/users/17061/profile... \n",
"2 https://a0.muscache.com/im/users/55948/profile... \n",
"3 https://a0.muscache.com/im/users/58796/profile... \n",
"4 https://a0.muscache.com/im/users/58796/profile... \n",
"\n",
" host_picture_url host_neighbourhood \\\n",
"0 https://a0.muscache.com/im/users/40855/profile... Potts Point \n",
"1 https://a0.muscache.com/im/users/17061/profile... Pyrmont \n",
"2 https://a0.muscache.com/im/users/55948/profile... Balgowlah \n",
"3 https://a0.muscache.com/im/users/58796/profile... Redfern \n",
"4 https://a0.muscache.com/im/users/58796/profile... Redfern \n",
"\n",
" host_listings_count host_total_listings_count \\\n",
"0 1.0 1.0 \n",
"1 2.0 2.0 \n",
"2 2.0 2.0 \n",
"3 2.0 2.0 \n",
"4 2.0 2.0 \n",
"\n",
" host_verifications host_has_profile_pic \\\n",
"0 ['email', 'phone', 'reviews'] t \n",
"1 ['email', 'phone', 'manual_online', 'reviews',... t \n",
"2 ['email', 'phone', 'reviews', 'jumio'] t \n",
"3 ['email', 'phone', 'facebook', 'reviews', 'jum... t \n",
"4 ['email', 'phone', 'facebook', 'reviews', 'jum... t \n",
"\n",
" host_identity_verified street neighbourhood \\\n",
"0 f Potts Point, NSW, Australia Potts Point \n",
"1 t Pyrmont, NSW, Australia Pyrmont \n",
"2 t Balgowlah, NSW, Australia Balgowlah \n",
"3 t Redfern, NSW, Australia Redfern \n",
"4 t Redfern, NSW, Australia Redfern \n",
"\n",
" neighbourhood_cleansed neighbourhood_group_cleansed city state \\\n",
"0 Sydney NaN Potts Point NSW \n",
"1 Sydney NaN Pyrmont NSW \n",
"2 Manly NaN Balgowlah NSW \n",
"3 Sydney NaN Redfern NSW \n",
"4 Sydney NaN Redfern NSW \n",
"\n",
" zipcode market smart_location country_code country latitude \\\n",
"0 2011 Sydney Potts Point, Australia AU Australia -33.869168 \n",
"1 2009 Sydney Pyrmont, Australia AU Australia -33.865153 \n",
"2 2093 Sydney Balgowlah, Australia AU Australia -33.800929 \n",
"3 2016 Sydney Redfern, Australia AU Australia -33.890765 \n",
"4 2016 Sydney Redfern, Australia AU Australia -33.889667 \n",
"\n",
" longitude is_location_exact property_type room_type accommodates \\\n",
"0 151.226562 t Apartment Private room 1 \n",
"1 151.191896 t Townhouse Private room 2 \n",
"2 151.261722 t House Entire home/apt 6 \n",
"3 151.200450 t Apartment Private room 2 \n",
"4 151.200896 t Apartment Shared room 1 \n",
"\n",
" bathrooms bedrooms beds bed_type \\\n",
"0 NaN 1.0 1.0 Real Bed \n",
"1 1.0 1.0 1.0 Real Bed \n",
"2 3.0 3.0 3.0 Real Bed \n",
"3 1.0 1.0 1.0 Real Bed \n",
"4 2.0 1.0 1.0 Pull-out Sofa \n",
"\n",
" amenities square_feet price \\\n",
"0 {TV,Kitchen,Elevator,\"Buzzer/wireless intercom... NaN $65.00 \n",
"1 {TV,Internet,Wifi,\"Air conditioning\",\"Paid par... NaN $98.00 \n",
"2 {TV,Wifi,\"Air conditioning\",Kitchen,\"Pets live... NaN $469.00 \n",
"3 {Internet,Wifi,\"Wheelchair accessible\",Pool,Ki... NaN $63.00 \n",
"4 {Internet,Wifi,Pool,Kitchen,Gym,Elevator,\"Buzz... 0.0 $39.00 \n",
"\n",
" weekly_price monthly_price security_deposit cleaning_fee guests_included \\\n",
"0 NaN NaN NaN NaN 1 \n",
"1 $800.00 NaN $0.00 $55.00 2 \n",
"2 $3,000.00 NaN $900.00 $100.00 6 \n",
"3 NaN NaN NaN NaN 1 \n",
"4 NaN NaN NaN NaN 1 \n",
"\n",
" extra_people minimum_nights maximum_nights calendar_updated \\\n",
"0 $0.00 2 180 4 weeks ago \n",
"1 $395.00 2 7 yesterday \n",
"2 $40.00 5 22 4 months ago \n",
"3 $40.00 2 1125 today \n",
"4 $0.00 2 1125 4 days ago \n",
"\n",
" has_availability availability_30 availability_60 availability_90 \\\n",
"0 t 9 39 69 \n",
"1 t 13 30 45 \n",
"2 t 0 0 0 \n",
"3 t 13 31 31 \n",
"4 t 24 50 50 \n",
"\n",
" availability_365 number_of_reviews first_review last_review \\\n",
"0 339 177 5/12/09 1/07/18 \n",
"1 188 468 24/07/10 27/06/18 \n",
"2 168 1 2/01/16 2/01/16 \n",
"3 215 172 28/11/11 26/06/18 \n",
"4 287 147 23/09/11 2/07/18 \n",
"\n",
" review_scores_rating review_scores_accuracy review_scores_cleanliness \\\n",
"0 92.0 9.0 9.0 \n",
"1 95.0 10.0 9.0 \n",
"2 100.0 10.0 10.0 \n",
"3 89.0 9.0 8.0 \n",
"4 90.0 9.0 8.0 \n",
"\n",
" review_scores_checkin review_scores_communication review_scores_location \\\n",
"0 10.0 10.0 10.0 \n",
"1 10.0 10.0 10.0 \n",
"2 10.0 8.0 10.0 \n",
"3 9.0 10.0 9.0 \n",
"4 9.0 9.0 9.0 \n",
"\n",
" review_scores_value instant_bookable cancellation_policy \\\n",
"0 9.0 f moderate \n",
"1 10.0 f strict_14_with_grace_period \n",
"2 10.0 f strict_14_with_grace_period \n",
"3 9.0 f moderate \n",
"4 9.0 f moderate \n",
"\n",
" require_guest_profile_picture require_guest_phone_verification \\\n",
"0 f f \n",
"1 t t \n",
"2 f f \n",
"3 f f \n",
"4 f f \n",
"\n",
" calculated_host_listings_count reviews_per_month \n",
"0 1 1.69 \n",
"1 2 4.83 \n",
"2 2 0.03 \n",
"3 2 2.14 \n",
"4 2 1.78 "
],
"text/html": [
"\n",
" <div id=\"df-41fb46e1-939e-4eb2-a870-e37faa4e627a\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>listing_url</th>\n",
" <th>name</th>\n",
" <th>summary</th>\n",
" <th>space</th>\n",
" <th>description</th>\n",
" <th>neighborhood_overview</th>\n",
" <th>notes</th>\n",
" <th>transit</th>\n",
" <th>access</th>\n",
" <th>interaction</th>\n",
" <th>house_rules</th>\n",
" <th>picture_url</th>\n",
" <th>host_id</th>\n",
" <th>host_url</th>\n",
" <th>host_name</th>\n",
" <th>host_since</th>\n",
" <th>host_location</th>\n",
" <th>host_about</th>\n",
" <th>host_response_time</th>\n",
" <th>host_response_rate</th>\n",
" <th>host_is_superhost</th>\n",
" <th>host_thumbnail_url</th>\n",
" <th>host_picture_url</th>\n",
" <th>host_neighbourhood</th>\n",
" <th>host_listings_count</th>\n",
" <th>host_total_listings_count</th>\n",
" <th>host_verifications</th>\n",
" <th>host_has_profile_pic</th>\n",
" <th>host_identity_verified</th>\n",
" <th>street</th>\n",
" <th>neighbourhood</th>\n",
" <th>neighbourhood_cleansed</th>\n",
" <th>neighbourhood_group_cleansed</th>\n",
" <th>city</th>\n",
" <th>state</th>\n",
" <th>zipcode</th>\n",
" <th>market</th>\n",
" <th>smart_location</th>\n",
" <th>country_code</th>\n",
" <th>country</th>\n",
" <th>latitude</th>\n",
" <th>longitude</th>\n",
" <th>is_location_exact</th>\n",
" <th>property_type</th>\n",
" <th>room_type</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>bed_type</th>\n",
" <th>amenities</th>\n",
" <th>square_feet</th>\n",
" <th>price</th>\n",
" <th>weekly_price</th>\n",
" <th>monthly_price</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>guests_included</th>\n",
" <th>extra_people</th>\n",
" <th>minimum_nights</th>\n",
" <th>maximum_nights</th>\n",
" <th>calendar_updated</th>\n",
" <th>has_availability</th>\n",
" <th>availability_30</th>\n",
" <th>availability_60</th>\n",
" <th>availability_90</th>\n",
" <th>availability_365</th>\n",
" <th>number_of_reviews</th>\n",
" <th>first_review</th>\n",
" <th>last_review</th>\n",
" <th>review_scores_rating</th>\n",
" <th>review_scores_accuracy</th>\n",
" <th>review_scores_cleanliness</th>\n",
" <th>review_scores_checkin</th>\n",
" <th>review_scores_communication</th>\n",
" <th>review_scores_location</th>\n",
" <th>review_scores_value</th>\n",
" <th>instant_bookable</th>\n",
" <th>cancellation_policy</th>\n",
" <th>require_guest_profile_picture</th>\n",
" <th>require_guest_phone_verification</th>\n",
" <th>calculated_host_listings_count</th>\n",
" <th>reviews_per_month</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>11156</td>\n",
" <td>https://www.airbnb.com/rooms/11156</td>\n",
" <td>An Oasis in the City</td>\n",
" <td>Very central to the city which can be reached ...</td>\n",
" <td>Potts Pt. is a vibrant and popular inner-city...</td>\n",
" <td>Very central to the city which can be reached ...</td>\n",
" <td>It is very close to everything and everywhere,...</td>\n",
" <td>$150.00 key security deposit, refundable on re...</td>\n",
" <td>It is 7 minutes walk to the Kings Cross.train ...</td>\n",
" <td>Kitchen &amp; laundry facilities. Shared bathroom.</td>\n",
" <td>As much as they want.</td>\n",
" <td>Be considerate. No showering after 2330h.</td>\n",
" <td>https://a0.muscache.com/im/pictures/2797669/17...</td>\n",
" <td>40855</td>\n",
" <td>https://www.airbnb.com/users/show/40855</td>\n",
" <td>Colleen</td>\n",
" <td>23/09/09</td>\n",
" <td>Potts Point, New South Wales, Australia</td>\n",
" <td>Recently retired, I've lived &amp; worked on 4 con...</td>\n",
" <td>within a day</td>\n",
" <td>67%</td>\n",
" <td>t</td>\n",
" <td>https://a0.muscache.com/im/users/40855/profile...</td>\n",
" <td>https://a0.muscache.com/im/users/40855/profile...</td>\n",
" <td>Potts Point</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>['email', 'phone', 'reviews']</td>\n",
" <td>t</td>\n",
" <td>f</td>\n",
" <td>Potts Point, NSW, Australia</td>\n",
" <td>Potts Point</td>\n",
" <td>Sydney</td>\n",
" <td>NaN</td>\n",
" <td>Potts Point</td>\n",
" <td>NSW</td>\n",
" <td>2011</td>\n",
" <td>Sydney</td>\n",
" <td>Potts Point, Australia</td>\n",
" <td>AU</td>\n",
" <td>Australia</td>\n",
" <td>-33.869168</td>\n",
" <td>151.226562</td>\n",
" <td>t</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Real Bed</td>\n",
" <td>{TV,Kitchen,Elevator,\"Buzzer/wireless intercom...</td>\n",
" <td>NaN</td>\n",
" <td>$65.00</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>$0.00</td>\n",
" <td>2</td>\n",
" <td>180</td>\n",
" <td>4 weeks ago</td>\n",
" <td>t</td>\n",
" <td>9</td>\n",
" <td>39</td>\n",
" <td>69</td>\n",
" <td>339</td>\n",
" <td>177</td>\n",
" <td>5/12/09</td>\n",
" <td>1/07/18</td>\n",
" <td>92.0</td>\n",
" <td>9.0</td>\n",
" <td>9.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>9.0</td>\n",
" <td>f</td>\n",
" <td>moderate</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>1</td>\n",
" <td>1.69</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>12351</td>\n",
" <td>https://www.airbnb.com/rooms/12351</td>\n",
" <td>Sydney City &amp; Harbour at the door</td>\n",
" <td>Come stay with Vinh &amp; Stuart (Awarded as one o...</td>\n",
" <td>We're pretty relaxed hosts, and we fully appre...</td>\n",
" <td>Come stay with Vinh &amp; Stuart (Awarded as one o...</td>\n",
" <td>Pyrmont is an inner-city village of Sydney, on...</td>\n",
" <td>We've a few reasons for the 6.00pm arrival tim...</td>\n",
" <td>Our home is centrally located and an easy walk...</td>\n",
" <td>We look forward to welcoming you just as we wo...</td>\n",
" <td>As much or as little as you like. We live here...</td>\n",
" <td>We look forward to welcoming you to stay you j...</td>\n",
" <td>https://a0.muscache.com/im/pictures/763ad5c8-c...</td>\n",
" <td>17061</td>\n",
" <td>https://www.airbnb.com/users/show/17061</td>\n",
" <td>Stuart</td>\n",
" <td>14/05/09</td>\n",
" <td>Sydney, New South Wales, Australia</td>\n",
" <td>G'Day from Australia!\\r\\n\\r\\nHe's Vinh, and I'...</td>\n",
" <td>within an hour</td>\n",
" <td>100%</td>\n",
" <td>f</td>\n",
" <td>https://a0.muscache.com/im/users/17061/profile...</td>\n",
" <td>https://a0.muscache.com/im/users/17061/profile...</td>\n",
" <td>Pyrmont</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>['email', 'phone', 'manual_online', 'reviews',...</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>Pyrmont, NSW, Australia</td>\n",
" <td>Pyrmont</td>\n",
" <td>Sydney</td>\n",
" <td>NaN</td>\n",
" <td>Pyrmont</td>\n",
" <td>NSW</td>\n",
" <td>2009</td>\n",
" <td>Sydney</td>\n",
" <td>Pyrmont, Australia</td>\n",
" <td>AU</td>\n",
" <td>Australia</td>\n",
" <td>-33.865153</td>\n",
" <td>151.191896</td>\n",
" <td>t</td>\n",
" <td>Townhouse</td>\n",
" <td>Private room</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Real Bed</td>\n",
" <td>{TV,Internet,Wifi,\"Air conditioning\",\"Paid par...</td>\n",
" <td>NaN</td>\n",
" <td>$98.00</td>\n",
" <td>$800.00</td>\n",
" <td>NaN</td>\n",
" <td>$0.00</td>\n",
" <td>$55.00</td>\n",
" <td>2</td>\n",
" <td>$395.00</td>\n",
" <td>2</td>\n",
" <td>7</td>\n",
" <td>yesterday</td>\n",
" <td>t</td>\n",
" <td>13</td>\n",
" <td>30</td>\n",
" <td>45</td>\n",
" <td>188</td>\n",
" <td>468</td>\n",
" <td>24/07/10</td>\n",
" <td>27/06/18</td>\n",
" <td>95.0</td>\n",
" <td>10.0</td>\n",
" <td>9.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>f</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>2</td>\n",
" <td>4.83</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>14250</td>\n",
" <td>https://www.airbnb.com/rooms/14250</td>\n",
" <td>Manly Harbour House</td>\n",
" <td>Beautifully renovated, spacious and quiet, our...</td>\n",
" <td>Our home is a thirty minute walk along the sea...</td>\n",
" <td>Beautifully renovated, spacious and quiet, our...</td>\n",
" <td>Balgowlah Heights is one of the most prestigio...</td>\n",
" <td>NaN</td>\n",
" <td>Balgowlah - Manly bus # 131 or #132 (Bus stop...</td>\n",
" <td>Guests have access to whole house except locke...</td>\n",
" <td>NaN</td>\n",
" <td>Standard Terms and Conditions of Temporary Hol...</td>\n",
" <td>https://a0.muscache.com/im/pictures/56935671/f...</td>\n",
" <td>55948</td>\n",
" <td>https://www.airbnb.com/users/show/55948</td>\n",
" <td>Heidi</td>\n",
" <td>20/11/09</td>\n",
" <td>Sydney, New South Wales, Australia</td>\n",
" <td>I am a Canadian who has made Australia her hom...</td>\n",
" <td>within a few hours</td>\n",
" <td>100%</td>\n",
" <td>f</td>\n",
" <td>https://a0.muscache.com/im/users/55948/profile...</td>\n",
" <td>https://a0.muscache.com/im/users/55948/profile...</td>\n",
" <td>Balgowlah</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>['email', 'phone', 'reviews', 'jumio']</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>Balgowlah, NSW, Australia</td>\n",
" <td>Balgowlah</td>\n",
" <td>Manly</td>\n",
" <td>NaN</td>\n",
" <td>Balgowlah</td>\n",
" <td>NSW</td>\n",
" <td>2093</td>\n",
" <td>Sydney</td>\n",
" <td>Balgowlah, Australia</td>\n",
" <td>AU</td>\n",
" <td>Australia</td>\n",
" <td>-33.800929</td>\n",
" <td>151.261722</td>\n",
" <td>t</td>\n",
" <td>House</td>\n",
" <td>Entire home/apt</td>\n",
" <td>6</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>Real Bed</td>\n",
" <td>{TV,Wifi,\"Air conditioning\",Kitchen,\"Pets live...</td>\n",
" <td>NaN</td>\n",
" <td>$469.00</td>\n",
" <td>$3,000.00</td>\n",
" <td>NaN</td>\n",
" <td>$900.00</td>\n",
" <td>$100.00</td>\n",
" <td>6</td>\n",
" <td>$40.00</td>\n",
" <td>5</td>\n",
" <td>22</td>\n",
" <td>4 months ago</td>\n",
" <td>t</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>168</td>\n",
" <td>1</td>\n",
" <td>2/01/16</td>\n",
" <td>2/01/16</td>\n",
" <td>100.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>8.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>f</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>2</td>\n",
" <td>0.03</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>14935</td>\n",
" <td>https://www.airbnb.com/rooms/14935</td>\n",
" <td>Eco-conscious Travellers: Private Room</td>\n",
" <td>Welcome! This apartment will suit a short term...</td>\n",
" <td>I live upstairs in my own room with my own bat...</td>\n",
" <td>Welcome! This apartment will suit a short term...</td>\n",
" <td>NaN</td>\n",
" <td>The building can be hard to find, so please en...</td>\n",
" <td>DIRECTIONS VIA TAXI: Get dropped off at Renwic...</td>\n",
" <td>I work from home most times - so if I'm home, ...</td>\n",
" <td>I'm not a big chatter, so don't get offended i...</td>\n",
" <td>1. Enjoy and always bring a smile during your ...</td>\n",
" <td>https://a0.muscache.com/im/pictures/2257353/d3...</td>\n",
" <td>58796</td>\n",
" <td>https://www.airbnb.com/users/show/58796</td>\n",
" <td>Kevin</td>\n",
" <td>30/11/09</td>\n",
" <td>Sydney, New South Wales, Australia</td>\n",
" <td>I've moved countries twice in the span of 10 y...</td>\n",
" <td>within an hour</td>\n",
" <td>100%</td>\n",
" <td>f</td>\n",
" <td>https://a0.muscache.com/im/users/58796/profile...</td>\n",
" <td>https://a0.muscache.com/im/users/58796/profile...</td>\n",
" <td>Redfern</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>['email', 'phone', 'facebook', 'reviews', 'jum...</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>Redfern, NSW, Australia</td>\n",
" <td>Redfern</td>\n",
" <td>Sydney</td>\n",
" <td>NaN</td>\n",
" <td>Redfern</td>\n",
" <td>NSW</td>\n",
" <td>2016</td>\n",
" <td>Sydney</td>\n",
" <td>Redfern, Australia</td>\n",
" <td>AU</td>\n",
" <td>Australia</td>\n",
" <td>-33.890765</td>\n",
" <td>151.200450</td>\n",
" <td>t</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Real Bed</td>\n",
" <td>{Internet,Wifi,\"Wheelchair accessible\",Pool,Ki...</td>\n",
" <td>NaN</td>\n",
" <td>$63.00</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>$40.00</td>\n",
" <td>2</td>\n",
" <td>1125</td>\n",
" <td>today</td>\n",
" <td>t</td>\n",
" <td>13</td>\n",
" <td>31</td>\n",
" <td>31</td>\n",
" <td>215</td>\n",
" <td>172</td>\n",
" <td>28/11/11</td>\n",
" <td>26/06/18</td>\n",
" <td>89.0</td>\n",
" <td>9.0</td>\n",
" <td>8.0</td>\n",
" <td>9.0</td>\n",
" <td>10.0</td>\n",
" <td>9.0</td>\n",
" <td>9.0</td>\n",
" <td>f</td>\n",
" <td>moderate</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>2</td>\n",
" <td>2.14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>14974</td>\n",
" <td>https://www.airbnb.com/rooms/14974</td>\n",
" <td>Eco-conscious Traveller: Sofa Couch</td>\n",
" <td>Welcome! This apartment will suit a short term...</td>\n",
" <td>Comes with a fully equipped gym and pool - whi...</td>\n",
" <td>Welcome! This apartment will suit a short term...</td>\n",
" <td>NaN</td>\n",
" <td>I live upstairs in my own room with my own bat...</td>\n",
" <td>DIRECTIONS VIA TAXI: Get dropped off at Renwic...</td>\n",
" <td>I work from home most times - so if I'm home, ...</td>\n",
" <td>I'm not a big chatter, so don't get offended i...</td>\n",
" <td>1. Enjoy and always bring a smile during your ...</td>\n",
" <td>https://a0.muscache.com/im/pictures/2197966/6e...</td>\n",
" <td>58796</td>\n",
" <td>https://www.airbnb.com/users/show/58796</td>\n",
" <td>Kevin</td>\n",
" <td>30/11/09</td>\n",
" <td>Sydney, New South Wales, Australia</td>\n",
" <td>I've moved countries twice in the span of 10 y...</td>\n",
" <td>within an hour</td>\n",
" <td>100%</td>\n",
" <td>f</td>\n",
" <td>https://a0.muscache.com/im/users/58796/profile...</td>\n",
" <td>https://a0.muscache.com/im/users/58796/profile...</td>\n",
" <td>Redfern</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>['email', 'phone', 'facebook', 'reviews', 'jum...</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>Redfern, NSW, Australia</td>\n",
" <td>Redfern</td>\n",
" <td>Sydney</td>\n",
" <td>NaN</td>\n",
" <td>Redfern</td>\n",
" <td>NSW</td>\n",
" <td>2016</td>\n",
" <td>Sydney</td>\n",
" <td>Redfern, Australia</td>\n",
" <td>AU</td>\n",
" <td>Australia</td>\n",
" <td>-33.889667</td>\n",
" <td>151.200896</td>\n",
" <td>t</td>\n",
" <td>Apartment</td>\n",
" <td>Shared room</td>\n",
" <td>1</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Pull-out Sofa</td>\n",
" <td>{Internet,Wifi,Pool,Kitchen,Gym,Elevator,\"Buzz...</td>\n",
" <td>0.0</td>\n",
" <td>$39.00</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>$0.00</td>\n",
" <td>2</td>\n",
" <td>1125</td>\n",
" <td>4 days ago</td>\n",
" <td>t</td>\n",
" <td>24</td>\n",
" <td>50</td>\n",
" <td>50</td>\n",
" <td>287</td>\n",
" <td>147</td>\n",
" <td>23/09/11</td>\n",
" <td>2/07/18</td>\n",
" <td>90.0</td>\n",
" <td>9.0</td>\n",
" <td>8.0</td>\n",
" <td>9.0</td>\n",
" <td>9.0</td>\n",
" <td>9.0</td>\n",
" <td>9.0</td>\n",
" <td>f</td>\n",
" <td>moderate</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>2</td>\n",
" <td>1.78</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-41fb46e1-939e-4eb2-a870-e37faa4e627a')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-41fb46e1-939e-4eb2-a870-e37faa4e627a button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-41fb46e1-939e-4eb2-a870-e37faa4e627a');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-1760236c-3c40-4fbe-ae47-801f903715c6\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-1760236c-3c40-4fbe-ae47-801f903715c6')\"\n",
" title=\"Suggest charts\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-1760236c-3c40-4fbe-ae47-801f903715c6 button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" </div>\n",
" </div>\n"
]
},
"metadata": {},
"execution_count": 4
}
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "ebrF0xLFZXES"
},
"outputs": [],
"source": [
"### To make this project easier, I will select only a small number of features\n",
"incl = [\"price\",\"city\",\"longitude\",\"latitude\",\"review_scores_rating\",\"number_of_reviews\",\"minimum_nights\",\"security_deposit\",\"cleaning_fee\",\"accommodates\",\"bathrooms\",\"bedrooms\",\"beds\",\"property_type\",\"room_type\",\"availability_365\" ,\"host_identity_verified\", \"host_is_superhost\",\"host_since\",\"cancellation_policy\"]\n",
"df = df[incl]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jWFkhwNdZXET"
},
"source": [
"Lets reformat the price to floats, it is currently a string (object). And lets makes sure the date is in a datetime format."
]
},
{
"cell_type": "code",
"source": [
"df[[\"price\"]].head()"
],
"metadata": {
"id": "oygW0Mptozfq",
"outputId": "2891980e-d702-4d27-fb98-4978633cbcf2",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
}
},
"execution_count": 6,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" price\n",
"0 $65.00 \n",
"1 $98.00 \n",
"2 $469.00 \n",
"3 $63.00 \n",
"4 $39.00 "
],
"text/html": [
"\n",
" <div id=\"df-e757b35f-afc3-4538-aedd-bee0915ced98\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>price</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>$65.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>$98.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>$469.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>$63.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>$39.00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-e757b35f-afc3-4538-aedd-bee0915ced98')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-e757b35f-afc3-4538-aedd-bee0915ced98 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-e757b35f-afc3-4538-aedd-bee0915ced98');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-2470b38c-29d3-4a76-b9e1-178444090e82\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-2470b38c-29d3-4a76-b9e1-178444090e82')\"\n",
" title=\"Suggest charts\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-2470b38c-29d3-4a76-b9e1-178444090e82 button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" </div>\n",
" </div>\n"
]
},
"metadata": {},
"execution_count": 6
}
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "2Y7AZMImZXEU"
},
"outputs": [],
"source": [
"import re\n",
"price_list = [\"price\",\"cleaning_fee\",\"security_deposit\"]\n",
"\n",
"for col in price_list:\n",
" df[col] = df[col].fillna(\"0\")\n",
" df[col] = df[col].apply(lambda x: float(re.compile('[^0-9eE.]').sub('', x)) if len(x)>0 else 0)\n",
"\n",
"df['host_since'] = pd.to_datetime(df['host_since'])"
]
},
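{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of the same cleaning step, the currency strings can also be converted without a row-by-row `apply`. The small `raw` frame below is made up for illustration and only mimics the `price` and `host_since` columns; the date format `%d/%m/%y` is an assumption based on values such as `30/11/09` in the preview above.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical sample mimicking the raw price and host_since columns.\n",
"raw = pd.DataFrame({\n",
"    'price': ['$65.00', '$1,250.00', None],\n",
"    'host_since': ['30/11/09', '2/07/18', None],\n",
"})\n",
"\n",
"# Keep only digits and the decimal point, then convert to float.\n",
"# Missing values become 0.0, as in the loop above.\n",
"cleaned = raw['price'].fillna('0').str.replace(r'[^0-9.]', '', regex=True)\n",
"prices = pd.to_numeric(cleaned, errors='coerce').fillna(0.0)\n",
"\n",
"# Parse day-first dates with an explicit (assumed) format.\n",
"host_since = pd.to_datetime(raw['host_since'], format='%d/%m/%y')\n",
"```\n",
"\n",
"With `errors='coerce'`, anything that still fails to parse becomes `NaN` and is then filled with 0 here. Whether 0 is a sensible stand-in for a missing fee is a modelling choice."
]
},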
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"id": "V2_un0LhZXEU",
"outputId": "cbdfdd92-5f7d-43a4-ddc4-a2466a8ae2ef",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 65.0\n",
"1 98.0\n",
"2 469.0\n",
"3 63.0\n",
"4 39.0\n",
"Name: price, dtype: float64"
]
},
"metadata": {},
"execution_count": 8
}
],
"source": [
"df[\"price\"].head()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "Qh-IpjZtZXEW",
"outputId": "54a874ee-f2c5-453d-e8e5-b295475658bc",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 452
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<Axes: >"
]
},
"metadata": {},
"execution_count": 9
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
],
"image/png": "\n"
},
"metadata": {}
}
],
"source": [
"## Winsorize for high price values, outliers.\n",
"\n",
"df.boxplot(column=\"price\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"id": "0rk7c4VuZXEW",
"outputId": "72bbc0f5-6da3-4c58-cf3d-36085f25cd93",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"13.808558337216192"
]
},
"metadata": {},
"execution_count": 10
}
],
"source": [
"## this is high, because we have a price we expect it to be high.\n",
"## however, it shouldn't be much above 3.\n",
"df[\"price\"].skew()"
]
},
{
"cell_type": "code",
"source": [
"# df[\"price\"]].clip(low_entry, high_entry)"
],
"metadata": {
"id": "zkRM_IsQpnjy"
},
"execution_count": 12,
"outputs": []
},
{
"cell_type": "code",
"source": [
"df[\"price\"].max()"
],
"metadata": {
"id": "MnGNC0LZpknd",
"outputId": "8a16dfab-ac69-4d1c-eb4a-abd10db5a64d",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"execution_count": 13,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"12999.0"
]
},
"metadata": {},
"execution_count": 13
}
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"id": "0pHgBvGwZXEX",
"outputId": "871085f7-93ec-45ae-955d-3d9e2dcb1492",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"1600.0"
]
},
"metadata": {},
"execution_count": 14
}
],
"source": [
"## This value is still relatively high\n",
"df[\"price\"].quantile(0.995) ## @99.5%"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "W4BY2ErJZXEX"
},
"outputs": [],
"source": [
"df = df[df[\"price\"]<df[\"price\"].quantile(0.995)].reset_index(drop=True)"
]
},
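{
"cell_type": "markdown",
"metadata": {},
"source": [
"The filter above trims the most expensive listings rather than winsorizing them. A minimal sketch of the difference, using made-up prices and an arbitrary 80% cap chosen only to keep the example small:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical prices with one extreme listing.\n",
"prices = pd.Series([50.0, 80.0, 120.0, 200.0, 13000.0])\n",
"cap = prices.quantile(0.8)\n",
"\n",
"# Trimming drops every row at or above the cap (the approach used above).\n",
"trimmed = prices[prices < cap]\n",
"\n",
"# Winsorizing keeps every row but limits values to the cap.\n",
"winsorized = prices.clip(upper=cap)\n",
"```\n",
"\n",
"Trimming shrinks the sample, while clipping keeps the rows and only caps the extreme values. Which one is preferable depends on whether the extreme listings are errors or genuine luxury properties."
]
},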
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6VOcuojRZXEY",
"outputId": "c7ecb493-8de3-40d0-e6bd-a2b01b0af21d",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"2.957872457159033"
]
},
"metadata": {},
"execution_count": 22
}
],
"source": [
"## This would do for now, it might also be worth transforming ..\n",
"## .. the price with a log function at a later stage\n",
"df[\"price\"].skew()"
]
},
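{
"cell_type": "markdown",
"metadata": {},
"source": [
"The log transform mentioned above could look roughly like this. The prices are made up, and `np.log1p` is used instead of a plain log on the assumption that zero prices (which do occur in this dataset) should stay finite.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical right-skewed prices.\n",
"prices = pd.Series([40.0, 65.0, 80.0, 98.0, 132.0, 225.0, 469.0, 1599.0])\n",
"\n",
"# log1p computes log(1 + x), so a price of 0 maps to 0 rather than -inf.\n",
"log_prices = np.log1p(prices)\n",
"\n",
"# expm1 inverts the transform, e.g. to turn predictions back into dollars.\n",
"recovered = np.expm1(log_prices)\n",
"```\n",
"\n",
"If a model is trained on `log_prices`, its predictions need to pass through `np.expm1` before they can be read as daily rates."
]
},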
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1m15b3s9ZXEZ",
"outputId": "e1289008-9cc4-402c-e1ad-7eaec4f20436",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"price 0\n",
"city 32\n",
"longitude 0\n",
"latitude 0\n",
"review_scores_rating 7466\n",
"number_of_reviews 0\n",
"minimum_nights 0\n",
"security_deposit 0\n",
"cleaning_fee 0\n",
"accommodates 0\n",
"bathrooms 22\n",
"bedrooms 8\n",
"beds 33\n",
"property_type 0\n",
"room_type 0\n",
"availability_365 0\n",
"host_identity_verified 34\n",
"host_is_superhost 34\n",
"host_since 34\n",
"cancellation_policy 0\n",
"dtype: int64"
]
},
"metadata": {},
"execution_count": 23
}
],
"source": [
"df.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lDW6Pf9GZXEa",
"outputId": "beaa1e9a-cfb8-4f84-e641-594750486406",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 26931 entries, 0 to 26930\n",
"Data columns (total 20 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 price 26931 non-null float64 \n",
" 1 city 26899 non-null object \n",
" 2 longitude 26931 non-null float64 \n",
" 3 latitude 26931 non-null float64 \n",
" 4 review_scores_rating 19465 non-null float64 \n",
" 5 number_of_reviews 26931 non-null int64 \n",
" 6 minimum_nights 26931 non-null int64 \n",
" 7 security_deposit 26931 non-null float64 \n",
" 8 cleaning_fee 26931 non-null float64 \n",
" 9 accommodates 26931 non-null int64 \n",
" 10 bathrooms 26909 non-null float64 \n",
" 11 bedrooms 26923 non-null float64 \n",
" 12 beds 26898 non-null float64 \n",
" 13 property_type 26931 non-null object \n",
" 14 room_type 26931 non-null object \n",
" 15 availability_365 26931 non-null int64 \n",
" 16 host_identity_verified 26897 non-null object \n",
" 17 host_is_superhost 26897 non-null object \n",
" 18 host_since 26897 non-null datetime64[ns]\n",
" 19 cancellation_policy 26931 non-null object \n",
"dtypes: datetime64[ns](1), float64(9), int64(4), object(6)\n",
"memory usage: 4.1+ MB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5TJ9xolVZXEa",
"outputId": "da660924-4b2c-4251-bb56-54921ebfca18",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 11492\n",
"365 743\n",
"364 476\n",
"89 414\n",
"90 324\n",
" ... \n",
"114 11\n",
"230 11\n",
"100 10\n",
"259 10\n",
"226 9\n",
"Name: availability_365, Length: 366, dtype: int64"
]
},
"metadata": {},
"execution_count": 25
}
],
"source": [
"df[\"availability_365\"].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "nYb7PcN4ZXEc",
"outputId": "cc903474-d6f3-4bf3-f2c0-9c855a78c83f",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 296
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-5e4cb894-de8d-4b05-82c2-fae837958324\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>price</th>\n",
" <th>longitude</th>\n",
" <th>latitude</th>\n",
" <th>review_scores_rating</th>\n",
" <th>number_of_reviews</th>\n",
" <th>minimum_nights</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>availability_365</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>26931.000000</td>\n",
" <td>26931.000000</td>\n",
" <td>26931.000000</td>\n",
" <td>19465.000000</td>\n",
" <td>26931.000000</td>\n",
" <td>26931.000000</td>\n",
" <td>26931.000000</td>\n",
" <td>26931.000000</td>\n",
" <td>26931.000000</td>\n",
" <td>26909.000000</td>\n",
" <td>26923.000000</td>\n",
" <td>26898.000000</td>\n",
" <td>26931.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>196.065464</td>\n",
" <td>151.210438</td>\n",
" <td>-33.862675</td>\n",
" <td>93.404932</td>\n",
" <td>14.070031</td>\n",
" <td>4.482010</td>\n",
" <td>293.870261</td>\n",
" <td>65.268687</td>\n",
" <td>3.357395</td>\n",
" <td>1.340964</td>\n",
" <td>1.600787</td>\n",
" <td>1.996542</td>\n",
" <td>101.575916</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>199.813830</td>\n",
" <td>0.079425</td>\n",
" <td>0.071861</td>\n",
" <td>9.358515</td>\n",
" <td>29.870227</td>\n",
" <td>14.421896</td>\n",
" <td>549.642202</td>\n",
" <td>84.886663</td>\n",
" <td>2.160004</td>\n",
" <td>0.638187</td>\n",
" <td>1.091213</td>\n",
" <td>1.506535</td>\n",
" <td>127.822623</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.000000</td>\n",
" <td>150.644964</td>\n",
" <td>-34.135212</td>\n",
" <td>20.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>80.000000</td>\n",
" <td>151.184336</td>\n",
" <td>-33.897653</td>\n",
" <td>90.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>132.000000</td>\n",
" <td>151.223029</td>\n",
" <td>-33.883161</td>\n",
" <td>96.000000</td>\n",
" <td>3.000000</td>\n",
" <td>2.000000</td>\n",
" <td>0.000000</td>\n",
" <td>40.000000</td>\n",
" <td>2.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>32.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>225.000000</td>\n",
" <td>151.264706</td>\n",
" <td>-33.832189</td>\n",
" <td>100.000000</td>\n",
" <td>13.000000</td>\n",
" <td>5.000000</td>\n",
" <td>400.000000</td>\n",
" <td>99.000000</td>\n",
" <td>4.000000</td>\n",
" <td>1.500000</td>\n",
" <td>2.000000</td>\n",
" <td>2.000000</td>\n",
" <td>179.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>1599.000000</td>\n",
" <td>151.339811</td>\n",
" <td>-33.389728</td>\n",
" <td>100.000000</td>\n",
" <td>468.000000</td>\n",
" <td>1000.000000</td>\n",
" <td>7000.000000</td>\n",
" <td>999.000000</td>\n",
" <td>16.000000</td>\n",
" <td>10.000000</td>\n",
" <td>46.000000</td>\n",
" <td>29.000000</td>\n",
" <td>365.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5e4cb894-de8d-4b05-82c2-fae837958324')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-5e4cb894-de8d-4b05-82c2-fae837958324 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-5e4cb894-de8d-4b05-82c2-fae837958324');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" price longitude latitude review_scores_rating \\\n",
"count 26931.000000 26931.000000 26931.000000 19465.000000 \n",
"mean 196.065464 151.210438 -33.862675 93.404932 \n",
"std 199.813830 0.079425 0.071861 9.358515 \n",
"min 0.000000 150.644964 -34.135212 20.000000 \n",
"25% 80.000000 151.184336 -33.897653 90.000000 \n",
"50% 132.000000 151.223029 -33.883161 96.000000 \n",
"75% 225.000000 151.264706 -33.832189 100.000000 \n",
"max 1599.000000 151.339811 -33.389728 100.000000 \n",
"\n",
" number_of_reviews minimum_nights security_deposit cleaning_fee \\\n",
"count 26931.000000 26931.000000 26931.000000 26931.000000 \n",
"mean 14.070031 4.482010 293.870261 65.268687 \n",
"std 29.870227 14.421896 549.642202 84.886663 \n",
"min 0.000000 1.000000 0.000000 0.000000 \n",
"25% 1.000000 1.000000 0.000000 0.000000 \n",
"50% 3.000000 2.000000 0.000000 40.000000 \n",
"75% 13.000000 5.000000 400.000000 99.000000 \n",
"max 468.000000 1000.000000 7000.000000 999.000000 \n",
"\n",
" accommodates bathrooms bedrooms beds \\\n",
"count 26931.000000 26909.000000 26923.000000 26898.000000 \n",
"mean 3.357395 1.340964 1.600787 1.996542 \n",
"std 2.160004 0.638187 1.091213 1.506535 \n",
"min 1.000000 0.000000 0.000000 0.000000 \n",
"25% 2.000000 1.000000 1.000000 1.000000 \n",
"50% 2.000000 1.000000 1.000000 1.000000 \n",
"75% 4.000000 1.500000 2.000000 2.000000 \n",
"max 16.000000 10.000000 46.000000 29.000000 \n",
"\n",
" availability_365 \n",
"count 26931.000000 \n",
"mean 101.575916 \n",
"std 127.822623 \n",
"min 0.000000 \n",
"25% 0.000000 \n",
"50% 32.000000 \n",
"75% 179.000000 \n",
"max 365.000000 "
]
},
"metadata": {},
"execution_count": 26
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Uy_enZlZZXEd",
"outputId": "ef71fa30-de2a-4bfc-f5e7-b6a994fa0616",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Saving figure attribute_histogram_plots\n"
]
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x1080 with 9 Axes>"
]
},
"metadata": {
"needs_background": "light"
}
}
],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"\n",
"try:\n",
" df.iloc[:,6:].hist(bins=50, figsize=(20,15))\n",
" save_fig(\"attribute_histogram_plots\")\n",
" plt.show()\n",
"except AttributeError:\n",
" pass\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "GgkH-ZWzZXEe",
"outputId": "779f5770-7c2d-4ceb-bd85-6a566f222571",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Bondi Beach 1671\n",
"Manly 958\n",
"Surry Hills 919\n",
"Bondi 785\n",
"Randwick 684\n",
"Sydney 682\n",
"Coogee 675\n",
"Darlinghurst 660\n",
"North Bondi 629\n",
"Newtown 490\n",
"Name: city, dtype: int64"
]
},
"metadata": {},
"execution_count": 28
}
],
"source": [
"## Even though our customer, sepecifcally wants information about..\n",
"## .. Bondi the addition of other areas will help the final prediction\n",
"\n",
"df[\"city\"].value_counts().head(10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OvITLiauZXEf"
},
"outputs": [],
"source": [
"## For this taks we will keep the top 20 Sydney locations\n",
"\n",
"list_of_20 = list(df[\"city\"].value_counts().head(10).index)\n",
"df = df[df[\"city\"].isin(list_of_20)].reset_index(drop=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "gbbpk03GZXEf",
"outputId": "04e8ea80-c922-4b7b-d3e6-610fadfe6562",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Apartment 5970\n",
"House 1497\n",
"Townhouse 271\n",
"Condominium 115\n",
"Loft 59\n",
"Guest suite 44\n",
"Other 33\n",
"Hostel 30\n",
"Bed and breakfast 25\n",
"Guesthouse 24\n",
"Serviced apartment 23\n",
"Villa 16\n",
"Bungalow 7\n",
"Boutique hotel 6\n",
"Tent 6\n",
"Cottage 6\n",
"Resort 5\n",
"Tiny house 5\n",
"Hotel 3\n",
"Cabin 2\n",
"Aparthotel 1\n",
"Earth house 1\n",
"Houseboat 1\n",
"Chalet 1\n",
"Yurt 1\n",
"Camper/RV 1\n",
"Name: property_type, dtype: int64"
]
},
"metadata": {},
"execution_count": 30
}
],
"source": [
"df[\"property_type\"].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0TnvlsBUZXEf"
},
"outputs": [],
"source": [
"## Remove rare property types (10 or fewer listings), which is necessary for..\n",
"## .. the eventual cross-validation step; this is similar to ..\n",
"## .. what was done with the cities above\n",
"\n",
"item_counts = df.groupby(['property_type']).size()\n",
"rare_items = list(item_counts.loc[item_counts <= 10].index.values)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "A5W30DNnZXEf"
},
"outputs": [],
"source": [
"df = df[~df[\"property_type\"].isin(rare_items)].reset_index(drop=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "kt9-HbbXZXEf"
},
"outputs": [],
"source": [
"# to make this notebook's output identical at every run\n",
"np.random.seed(42)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lq-WJp8TZXEg"
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# For illustration only. Sklearn has train_test_split()\n",
"def split_train_test(data, test_ratio):\n",
" shuffled_indices = np.random.permutation(len(data))\n",
" test_set_size = int(len(data) * test_ratio)\n",
" test_indices = shuffled_indices[:test_set_size]\n",
" train_indices = shuffled_indices[test_set_size:]\n",
" return data.iloc[train_indices], data.iloc[test_indices]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5Hrqdh3RZXEg",
"outputId": "42709019-1c84-4492-c645-c8a09a68a38d",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"6486 train + 1621 test\n"
]
}
],
"source": [
"train_set, test_set = split_train_test(df, 0.2)\n",
"print(len(train_set), \"train +\", len(test_set), \"test\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "79eD-QSnZXEg"
},
"outputs": [],
"source": [
"from zlib import crc32\n",
"\n",
"def test_set_check(identifier, test_ratio):\n",
" return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32\n",
"\n",
"def split_train_test_by_id(data, test_ratio, id_column):\n",
" ids = data[id_column]\n",
" in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))\n",
" return data.loc[~in_test_set], data.loc[in_test_set]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "J0AASY24ZXEh"
},
"source": [
"The implementation of `test_set_check()` above works fine in both Python 2 and Python 3. In earlier releases, the following implementation was proposed, which supported any hash function, but was much slower and did not support Python 2:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "iC7rt_JuZXEh"
},
"outputs": [],
"source": [
"import hashlib\n",
"\n",
"def test_set_check(identifier, test_ratio, hash=hashlib.md5):\n",
" return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fRgaKim3ZXEh"
},
"source": [
"If you want an implementation that supports any hash function and is compatible with both Python 2 and Python 3, here is one:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "V8hHq3PaZXEh"
},
"outputs": [],
"source": [
"def test_set_check(identifier, test_ratio, hash=hashlib.md5):\n",
" return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio"
]
},
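{
"cell_type": "markdown",
"metadata": {},
"source": [
"The point of hashing an identifier, rather than shuffling, is that a row's assignment depends only on its own id, so rows already in the data keep their place when new listings are appended. Below is a minimal sketch of that property on toy data. The helper name `in_test_set` and the integer ids are assumptions made for illustration, and `.tobytes()` is used here only to pass the id's raw bytes to `crc32` explicitly.\n",
"\n",
"```python\n",
"from zlib import crc32\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"\n",
"def in_test_set(identifier, test_ratio):\n",
"    # Hash the id's 8 raw bytes; the row is a test row if the hash\n",
"    # falls in the lowest test_ratio share of the 32-bit range\n",
"    return (crc32(np.int64(identifier).tobytes()) & 0xffffffff) < test_ratio * 2**32\n",
"\n",
"\n",
"# Hypothetical toy ids standing in for a stable row identifier\n",
"old_ids = pd.Series(range(1000))\n",
"new_ids = pd.Series(range(1500))  # the same rows plus 500 new ones\n",
"\n",
"old_mask = old_ids.apply(lambda i: in_test_set(i, 0.2))\n",
"new_mask = new_ids.apply(lambda i: in_test_set(i, 0.2))\n",
"\n",
"# Rows present in both versions keep their train/test assignment\n",
"stable = bool((old_mask.values == new_mask.values[:1000]).all())\n",
"print(stable, old_mask.mean())\n",
"```\n",
"\n",
"A shuffle-based split gives no such guarantee: re-running it on the larger dataset can move old rows from the training set into the test set."
]
},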
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "c85a4a--ZXEi"
},
"outputs": [],
"source": [
"df_with_id = df.reset_index() # adds an `index` column\n",
"train_set, test_set = split_train_test_by_id(df_with_id, 0.2, \"index\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7a3cLzzMZXEi"
},
"outputs": [],
"source": [
"df_with_id[\"id\"] = df_with_id[\"longitude\"] * 1000 + df_with_id[\"latitude\"]\n",
"train_set, test_set = split_train_test_by_id(df_with_id, 0.2, \"id\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RJ4scQdlZXEi",
"outputId": "d8f5414b-2da2-46b0-95bd-c47f7756ecf6",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 357
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-1cc7b124-01cf-4952-8521-d124ea1b6fd6\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>index</th>\n",
" <th>price</th>\n",
" <th>city</th>\n",
" <th>longitude</th>\n",
" <th>latitude</th>\n",
" <th>review_scores_rating</th>\n",
" <th>number_of_reviews</th>\n",
" <th>minimum_nights</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>property_type</th>\n",
" <th>room_type</th>\n",
" <th>availability_365</th>\n",
" <th>host_identity_verified</th>\n",
" <th>host_is_superhost</th>\n",
" <th>host_since</th>\n",
" <th>cancellation_policy</th>\n",
" <th>id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>111.0</td>\n",
" <td>Darlinghurst</td>\n",
" <td>151.216541</td>\n",
" <td>-33.880455</td>\n",
" <td>88.0</td>\n",
" <td>272</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>285</td>\n",
" <td>t</td>\n",
" <td>f</td>\n",
" <td>2009-03-12</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>151182.660345</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>130.0</td>\n",
" <td>Bondi Beach</td>\n",
" <td>151.273084</td>\n",
" <td>-33.891846</td>\n",
" <td>95.0</td>\n",
" <td>119</td>\n",
" <td>4</td>\n",
" <td>200.0</td>\n",
" <td>60.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>94</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>2012-01-18</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>151239.192454</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5</td>\n",
" <td>111.0</td>\n",
" <td>Sydney</td>\n",
" <td>151.268865</td>\n",
" <td>-33.885690</td>\n",
" <td>89.0</td>\n",
" <td>11</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>100.0</td>\n",
" <td>4</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>14</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>2010-12-14</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>151234.979210</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>9</td>\n",
" <td>990.0</td>\n",
" <td>Coogee</td>\n",
" <td>151.260116</td>\n",
" <td>-33.914816</td>\n",
" <td>98.0</td>\n",
" <td>13</td>\n",
" <td>7</td>\n",
" <td>3000.0</td>\n",
" <td>0.0</td>\n",
" <td>12</td>\n",
" <td>5.0</td>\n",
" <td>6.0</td>\n",
" <td>6.0</td>\n",
" <td>Villa</td>\n",
" <td>Entire home/apt</td>\n",
" <td>33</td>\n",
" <td>t</td>\n",
" <td>f</td>\n",
" <td>2011-10-02</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>151226.201484</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>12</td>\n",
" <td>202.0</td>\n",
" <td>Bondi</td>\n",
" <td>151.268418</td>\n",
" <td>-33.895158</td>\n",
" <td>91.0</td>\n",
" <td>90</td>\n",
" <td>1</td>\n",
" <td>1000.0</td>\n",
" <td>150.0</td>\n",
" <td>4</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>204</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>2011-03-31</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>151234.523342</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-1cc7b124-01cf-4952-8521-d124ea1b6fd6')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-1cc7b124-01cf-4952-8521-d124ea1b6fd6 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-1cc7b124-01cf-4952-8521-d124ea1b6fd6');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" index price city longitude latitude review_scores_rating \\\n",
"0 0 111.0 Darlinghurst 151.216541 -33.880455 88.0 \n",
"4 4 130.0 Bondi Beach 151.273084 -33.891846 95.0 \n",
"5 5 111.0 Sydney 151.268865 -33.885690 89.0 \n",
"9 9 990.0 Coogee 151.260116 -33.914816 98.0 \n",
"12 12 202.0 Bondi 151.268418 -33.895158 91.0 \n",
"\n",
" number_of_reviews minimum_nights security_deposit cleaning_fee \\\n",
"0 272 2 0.0 0.0 \n",
"4 119 4 200.0 60.0 \n",
"5 11 4 0.0 100.0 \n",
"9 13 7 3000.0 0.0 \n",
"12 90 1 1000.0 150.0 \n",
"\n",
" accommodates bathrooms bedrooms beds property_type room_type \\\n",
"0 2 1.0 1.0 1.0 Apartment Private room \n",
"4 2 1.0 1.0 1.0 Apartment Entire home/apt \n",
"5 4 1.0 2.0 2.0 Apartment Entire home/apt \n",
"9 12 5.0 6.0 6.0 Villa Entire home/apt \n",
"12 4 1.0 2.0 2.0 Apartment Entire home/apt \n",
"\n",
" availability_365 host_identity_verified host_is_superhost host_since \\\n",
"0 285 t f 2009-03-12 \n",
"4 94 t t 2012-01-18 \n",
"5 14 f f 2010-12-14 \n",
"9 33 t f 2011-10-02 \n",
"12 204 f f 2011-03-31 \n",
"\n",
" cancellation_policy id \n",
"0 strict_14_with_grace_period 151182.660345 \n",
"4 strict_14_with_grace_period 151239.192454 \n",
"5 strict_14_with_grace_period 151234.979210 \n",
"9 strict_14_with_grace_period 151226.201484 \n",
"12 strict_14_with_grace_period 151234.523342 "
]
},
"metadata": {},
"execution_count": 41
}
],
"source": [
"test_set.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eyyGG0fcZXEj"
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "v9LZBp0fZXEj",
"outputId": "f53186d0-0df3-47d9-d06a-966480296b33",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 357
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-38a1fab0-45bc-4013-abe5-2c334d1d2fa5\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>price</th>\n",
" <th>city</th>\n",
" <th>longitude</th>\n",
" <th>latitude</th>\n",
" <th>review_scores_rating</th>\n",
" <th>number_of_reviews</th>\n",
" <th>minimum_nights</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>property_type</th>\n",
" <th>room_type</th>\n",
" <th>availability_365</th>\n",
" <th>host_identity_verified</th>\n",
" <th>host_is_superhost</th>\n",
" <th>host_since</th>\n",
" <th>cancellation_policy</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4084</th>\n",
" <td>68.0</td>\n",
" <td>North Bondi</td>\n",
" <td>151.279684</td>\n",
" <td>-33.884092</td>\n",
" <td>93.0</td>\n",
" <td>3</td>\n",
" <td>7</td>\n",
" <td>150.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>2.5</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>House</td>\n",
" <td>Private room</td>\n",
" <td>4</td>\n",
" <td>t</td>\n",
" <td>f</td>\n",
" <td>2016-08-18</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" <tr>\n",
" <th>965</th>\n",
" <td>128.0</td>\n",
" <td>Surry Hills</td>\n",
" <td>151.212610</td>\n",
" <td>-33.891416</td>\n",
" <td>100.0</td>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>690.0</td>\n",
" <td>99.0</td>\n",
" <td>4</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>Townhouse</td>\n",
" <td>Entire home/apt</td>\n",
" <td>173</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>2014-10-31</td>\n",
" <td>moderate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8100</th>\n",
" <td>115.0</td>\n",
" <td>Darlinghurst</td>\n",
" <td>151.217882</td>\n",
" <td>-33.874271</td>\n",
" <td>98.0</td>\n",
" <td>8</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>30.0</td>\n",
" <td>3</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>12</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>2017-04-02</td>\n",
" <td>moderate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3882</th>\n",
" <td>125.0</td>\n",
" <td>Sydney</td>\n",
" <td>151.204837</td>\n",
" <td>-33.875924</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>150.0</td>\n",
" <td>50.0</td>\n",
" <td>4</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>3.0</td>\n",
" <td>Other</td>\n",
" <td>Shared room</td>\n",
" <td>363</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>2014-12-01</td>\n",
" <td>flexible</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1010</th>\n",
" <td>250.0</td>\n",
" <td>North Bondi</td>\n",
" <td>151.274298</td>\n",
" <td>-33.885652</td>\n",
" <td>100.0</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>80.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>363</td>\n",
" <td>t</td>\n",
" <td>f</td>\n",
" <td>2012-09-29</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-38a1fab0-45bc-4013-abe5-2c334d1d2fa5')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-38a1fab0-45bc-4013-abe5-2c334d1d2fa5 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-38a1fab0-45bc-4013-abe5-2c334d1d2fa5');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" price city longitude latitude review_scores_rating \\\n",
"4084 68.0 North Bondi 151.279684 -33.884092 93.0 \n",
"965 128.0 Surry Hills 151.212610 -33.891416 100.0 \n",
"8100 115.0 Darlinghurst 151.217882 -33.874271 98.0 \n",
"3882 125.0 Sydney 151.204837 -33.875924 NaN \n",
"1010 250.0 North Bondi 151.274298 -33.885652 100.0 \n",
"\n",
" number_of_reviews minimum_nights security_deposit cleaning_fee \\\n",
"4084 3 7 150.0 0.0 \n",
"965 4 5 690.0 99.0 \n",
"8100 8 2 0.0 30.0 \n",
"3882 0 2 150.0 50.0 \n",
"1010 4 2 0.0 80.0 \n",
"\n",
" accommodates bathrooms bedrooms beds property_type room_type \\\n",
"4084 2 2.5 1.0 1.0 House Private room \n",
"965 4 1.0 2.0 2.0 Townhouse Entire home/apt \n",
"8100 3 1.0 1.0 1.0 Apartment Entire home/apt \n",
"3882 4 1.0 1.0 3.0 Other Shared room \n",
"1010 2 1.0 1.0 1.0 Apartment Entire home/apt \n",
"\n",
" availability_365 host_identity_verified host_is_superhost host_since \\\n",
"4084 4 t f 2016-08-18 \n",
"965 173 t t 2014-10-31 \n",
"8100 12 f f 2017-04-02 \n",
"3882 363 f f 2014-12-01 \n",
"1010 363 t f 2012-09-29 \n",
"\n",
" cancellation_policy \n",
"4084 strict_14_with_grace_period \n",
"965 moderate \n",
"8100 moderate \n",
"3882 flexible \n",
"1010 strict_14_with_grace_period "
]
},
"metadata": {},
"execution_count": 43
}
],
"source": [
"test_set.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HuKh50cyZXEj"
},
"source": [
"The models used in this project can't read textual data, so we have to turn text categories into numeric codes. The code below creates city codes, this time for the purpose of stratified sampling.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Z70O8_ZrZXEk"
},
"outputs": [],
"source": [
"from sklearn import preprocessing\n",
"le = preprocessing.LabelEncoder()\n",
"\n",
"for col in [\"city\"]:\n",
" df[col+\"_code\"] = le.fit_transform(df[col])\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "oGpC3nWIZXEk"
},
"outputs": [],
"source": [
"## Similar to above encoding, here we encode binary 1, 0 for t and f.\n",
"\n",
"df[\"host_identity_verified\"] = df[\"host_identity_verified\"].apply(lambda x: 1 if x==\"t\" else 0)\n",
"df[\"host_is_superhost\"] = df[\"host_is_superhost\"].apply(lambda x: 1 if x==\"t\" else 0)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "k9feliBNZXEk"
},
"outputs": [],
"source": [
"from sklearn.model_selection import StratifiedShuffleSplit\n",
"\n",
"## we will stratify according to city\n",
"\n",
"split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)\n",
"train_index, test_index = next(split.split(df, df[\"city_code\"]))\n",
"\n",
"## drop the helper column before slicing so it does not end up in either set\n",
"del df[\"city_code\"]\n",
"strat_train_set = df.loc[train_index]\n",
"strat_test_set = df.loc[test_index]"
]
},
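{
"cell_type": "markdown",
"metadata": {},
"source": [
"Stratifying by city means each suburb keeps roughly the same share of listings in the test set as in the full data. The sketch below illustrates this on a toy frame. The data are made up, and it uses the `stratify` argument of `train_test_split` rather than the `StratifiedShuffleSplit` call above, so treat it as an illustration of the idea and not as this notebook's exact procedure.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hypothetical toy frame: three suburbs with unequal listing counts\n",
"toy = pd.DataFrame({\n",
"    'city': ['Bondi Beach'] * 60 + ['Manly'] * 30 + ['Newtown'] * 10,\n",
"    'price': range(100),\n",
"})\n",
"\n",
"train, test = train_test_split(\n",
"    toy, test_size=0.2, stratify=toy['city'], random_state=42\n",
")\n",
"\n",
"# Share of each suburb in the full data and in the test set\n",
"overall = toy['city'].value_counts(normalize=True).sort_index()\n",
"in_test = test['city'].value_counts(normalize=True).sort_index()\n",
"print(pd.DataFrame({'overall': overall, 'test': in_test}))\n",
"```\n",
"\n",
"With a purely random split, a small suburb such as the toy `Newtown` group could end up over- or under-represented in the test set, which would distort per-area error estimates."
]
},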
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0eDCFB1vZXEl",
"outputId": "e2586f3f-dde5-44da-c143-2b990729b610",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"city\n",
"Bondi 198.745223\n",
"Bondi Beach 199.879880\n",
"Coogee 196.574627\n",
"Darlinghurst 184.700000\n",
"Manly 223.447368\n",
"Newtown 117.938776\n",
"North Bondi 248.857143\n",
"Randwick 178.072993\n",
"Surry Hills 175.732240\n",
"Sydney 193.962687\n",
"Name: price, dtype: float64"
]
},
"metadata": {},
"execution_count": 48
}
],
"source": [
"## Average price per area\n",
"strat_test_set.groupby(\"city\")[\"price\"].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JDojCQEeZXEl"
},
"source": [
"# Discover and visualize the data to gain insights"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XpoW0SnUZXEl"
},
"outputs": [],
"source": [
"traval = strat_train_set.copy() ##traval - training and validation set"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "a1uLDcx_ZXEl",
"outputId": "a612acfc-1878-4c0d-eee6-6470dfe775ea",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 314
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Saving figure bad_visualization_plot\n"
]
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
}
}
],
"source": [
"traval.plot(kind=\"scatter\", x=\"longitude\", y=\"latitude\")\n",
"save_fig(\"bad_visualization_plot\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qIV__NDiZXEm",
"outputId": "14170acf-b876-47d4-a231-4f66419a7011",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 314
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Saving figure better_visualization_plot\n"
]
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
}
}
],
"source": [
"traval.plot(kind=\"scatter\", x=\"longitude\", y=\"latitude\", alpha=0.1)\n",
"save_fig(\"better_visualization_plot\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nUK4fLMkZXEm"
},
"source": [
"The argument `sharex=False` fixes a display bug (the x-axis values and legend were not displayed). This is a temporary fix (see: https://github.com/pandas-dev/pandas/issues/10611). Thanks to Wilmer Arellano for pointing it out."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LmGs6YM9ZXEm"
},
"outputs": [],
"source": [
"traval_co = traval[(traval[\"longitude\"]>151.16)&(traval[\"latitude\"]<-33.75)].reset_index(drop=True)\n",
"\n",
"traval_co = traval_co[traval_co[\"latitude\"]>-33.95].reset_index(drop=True)\n",
"\n",
"traval_co = traval_co[traval_co[\"price\"]<600].reset_index(drop=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "texzSnfjZXEn",
"outputId": "641d5ded-695c-409b-bf92-94864343ec71",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 530
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Saving figure housing_prices_scatterplot\n"
]
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x504 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
}
}
],
"source": [
"traval_co.plot(kind=\"scatter\", x=\"longitude\", y=\"latitude\", alpha=0.5,\n",
" s=traval_co[\"number_of_reviews\"]/2, label=\"Reviews\", figsize=(10,7),\n",
" c=\"price\", cmap=plt.get_cmap(\"jet\"), colorbar=True,\n",
" sharex=False)\n",
"plt.legend()\n",
"save_fig(\"housing_prices_scatterplot\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "R27lovILZXEo"
},
"outputs": [],
"source": [
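"## corr() is only defined for numeric columns, so select them explicitly\n",
"corr_matrix = traval.select_dtypes(include=[np.number]).corr()"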
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jg5vrs2EZXEo",
"outputId": "d86eb393-9f90-43dd-968d-c53fddb1c8bb",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"price 1.000000\n",
"accommodates 0.674368\n",
"bedrooms 0.668963\n",
"beds 0.582378\n",
"bathrooms 0.553773\n",
"cleaning_fee 0.529834\n",
"security_deposit 0.469423\n",
"longitude 0.157902\n",
"availability_365 0.148263\n",
"latitude 0.131160\n",
"review_scores_rating 0.067066\n",
"host_identity_verified 0.048821\n",
"minimum_nights 0.022103\n",
"host_is_superhost -0.016695\n",
"number_of_reviews -0.064011\n",
"Name: price, dtype: float64"
]
},
"metadata": {},
"execution_count": 55
}
],
"source": [
"corr_matrix[\"price\"].sort_values(ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tHoRs0SuZXEo",
"outputId": "8b3a08fe-0cca-4133-eeb5-6ba59164882f",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 602
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Saving figure scatter_matrix_plot\n"
]
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 864x576 with 25 Axes>"
]
},
"metadata": {
"needs_background": "light"
}
}
],
"source": [
"# from pandas.tools.plotting import scatter_matrix # For older versions of Pandas\n",
"from pandas.plotting import scatter_matrix\n",
"\n",
"attributes = [\"price\", \"accommodates\", \"bedrooms\",\n",
" \"cleaning_fee\",\"review_scores_rating\"]\n",
"scatter_matrix(traval[attributes], figsize=(12, 8))\n",
"save_fig(\"scatter_matrix_plot\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "sv2J8-7PZXEp",
"outputId": "9aeeb3b9-5586-4b47-984a-91d9813870e5",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 314
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Saving figure income_vs_house_value_scatterplot\n"
]
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
}
}
],
"source": [
"traval.plot(kind=\"scatter\", x=\"accommodates\", y=\"price\",\n",
" alpha=0.1)\n",
"save_fig(\"income_vs_house_value_scatterplot\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "PGPu3Zq_ZXEp",
"outputId": "23e932bd-78e4-4958-842b-10e6b393f020",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 357
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-fa214074-a9f7-484d-91da-28af693ae0da\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>price</th>\n",
" <th>city</th>\n",
" <th>longitude</th>\n",
" <th>latitude</th>\n",
" <th>review_scores_rating</th>\n",
" <th>number_of_reviews</th>\n",
" <th>minimum_nights</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>property_type</th>\n",
" <th>room_type</th>\n",
" <th>availability_365</th>\n",
" <th>host_identity_verified</th>\n",
" <th>host_is_superhost</th>\n",
" <th>host_since</th>\n",
" <th>cancellation_policy</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5484</th>\n",
" <td>200.0</td>\n",
" <td>Newtown</td>\n",
" <td>151.178552</td>\n",
" <td>-33.907150</td>\n",
" <td>96.0</td>\n",
" <td>61</td>\n",
" <td>2</td>\n",
" <td>250.0</td>\n",
" <td>85.0</td>\n",
" <td>4</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>House</td>\n",
" <td>Entire home/apt</td>\n",
" <td>127</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2016-01-22</td>\n",
" <td>moderate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1267</th>\n",
" <td>183.0</td>\n",
" <td>Randwick</td>\n",
" <td>151.249030</td>\n",
" <td>-33.906190</td>\n",
" <td>97.0</td>\n",
" <td>6</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>20.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2014-03-28</td>\n",
" <td>moderate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6658</th>\n",
" <td>175.0</td>\n",
" <td>Manly</td>\n",
" <td>151.288491</td>\n",
" <td>-33.802074</td>\n",
" <td>100.0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>40.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2014-01-09</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2522</th>\n",
" <td>85.0</td>\n",
" <td>Randwick</td>\n",
" <td>151.236423</td>\n",
" <td>-33.913614</td>\n",
" <td>94.0</td>\n",
" <td>20</td>\n",
" <td>3</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>90</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2015-11-22</td>\n",
" <td>flexible</td>\n",
" </tr>\n",
" <tr>\n",
" <th>722</th>\n",
" <td>80.0</td>\n",
" <td>Coogee</td>\n",
" <td>151.259342</td>\n",
" <td>-33.918435</td>\n",
" <td>92.0</td>\n",
" <td>139</td>\n",
" <td>30</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2014-01-07</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-fa214074-a9f7-484d-91da-28af693ae0da')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-fa214074-a9f7-484d-91da-28af693ae0da button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-fa214074-a9f7-484d-91da-28af693ae0da');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" price city longitude latitude review_scores_rating \\\n",
"5484 200.0 Newtown 151.178552 -33.907150 96.0 \n",
"1267 183.0 Randwick 151.249030 -33.906190 97.0 \n",
"6658 175.0 Manly 151.288491 -33.802074 100.0 \n",
"2522 85.0 Randwick 151.236423 -33.913614 94.0 \n",
"722 80.0 Coogee 151.259342 -33.918435 92.0 \n",
"\n",
" number_of_reviews minimum_nights security_deposit cleaning_fee \\\n",
"5484 61 2 250.0 85.0 \n",
"1267 6 4 0.0 20.0 \n",
"6658 2 2 0.0 40.0 \n",
"2522 20 3 0.0 0.0 \n",
"722 139 30 0.0 0.0 \n",
"\n",
" accommodates bathrooms bedrooms beds property_type room_type \\\n",
"5484 4 1.0 2.0 2.0 House Entire home/apt \n",
"1267 2 1.0 1.0 1.0 Apartment Private room \n",
"6658 2 1.0 1.0 1.0 Apartment Entire home/apt \n",
"2522 2 1.0 1.0 1.0 Apartment Private room \n",
"722 3 1.0 1.0 2.0 Apartment Private room \n",
"\n",
" availability_365 host_identity_verified host_is_superhost host_since \\\n",
"5484 127 1 0 2016-01-22 \n",
"1267 0 1 0 2014-03-28 \n",
"6658 0 1 0 2014-01-09 \n",
"2522 90 0 0 2015-11-22 \n",
"722 0 1 0 2014-01-07 \n",
"\n",
" cancellation_policy \n",
"5484 moderate \n",
"1267 moderate \n",
"6658 strict_14_with_grace_period \n",
"2522 flexible \n",
"722 strict_14_with_grace_period "
]
},
"metadata": {},
"execution_count": 58
}
],
"source": [
"traval.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "hcQJ17YIZXEp"
},
"outputs": [],
"source": [
"#### Some Feature Engineering"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eQMO4aceZXEq"
},
"outputs": [],
"source": [
"traval[\"bedrooms_per_person\"] = traval[\"bedrooms\"]/traval[\"accommodates\"]\n",
"traval[\"bathrooms_per_person\"] = traval[\"bathrooms\"]/traval[\"accommodates\"]\n",
"traval['host_since'] = pd.to_datetime(traval['host_since'])\n",
"traval['days_on_airbnb'] = (pd.to_datetime('today') - traval['host_since']).dt.days"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "i82bqIMTZXEq"
},
"source": [
"# Prepare the data for Machine Learning algorithms"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "f-0iksJhZXEq"
},
"outputs": [],
"source": [
"## Here I will forget about traval and use a more formal way of introducing...\n",
"## ..preprocessin using pipelines"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-WH9zsrXZXEq"
},
"outputs": [],
"source": [
"X = traval.copy().drop(\"price\", axis=1) # drop labels for training set"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xaXicvUAZXEr",
"outputId": "6b72ca39-81ed-4729-e66d-06eeb5468967",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 374
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"(5, 22)\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-9659e329-74e6-417e-8071-b58c3bfeca17\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>city</th>\n",
" <th>longitude</th>\n",
" <th>latitude</th>\n",
" <th>review_scores_rating</th>\n",
" <th>number_of_reviews</th>\n",
" <th>minimum_nights</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>property_type</th>\n",
" <th>room_type</th>\n",
" <th>availability_365</th>\n",
" <th>host_identity_verified</th>\n",
" <th>host_is_superhost</th>\n",
" <th>host_since</th>\n",
" <th>cancellation_policy</th>\n",
" <th>bedrooms_per_person</th>\n",
" <th>bathrooms_per_person</th>\n",
" <th>days_on_airbnb</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5594</th>\n",
" <td>Randwick</td>\n",
" <td>151.238806</td>\n",
" <td>-33.913834</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>800.0</td>\n",
" <td>80.0</td>\n",
" <td>6</td>\n",
" <td>1.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>House</td>\n",
" <td>Entire home/apt</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2013-11-27</td>\n",
" <td>moderate</td>\n",
" <td>0.500000</td>\n",
" <td>0.166667</td>\n",
" <td>2987.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5439</th>\n",
" <td>Newtown</td>\n",
" <td>151.184469</td>\n",
" <td>-33.894582</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>5000.0</td>\n",
" <td>100.0</td>\n",
" <td>11</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2014-07-16</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>0.272727</td>\n",
" <td>0.181818</td>\n",
" <td>2756.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3847</th>\n",
" <td>Bondi Beach</td>\n",
" <td>151.273077</td>\n",
" <td>-33.895142</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>7</td>\n",
" <td>271.0</td>\n",
" <td>27.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2015-12-07</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>0.500000</td>\n",
" <td>0.500000</td>\n",
" <td>2247.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1312</th>\n",
" <td>Randwick</td>\n",
" <td>151.245793</td>\n",
" <td>-33.920622</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>0.0</td>\n",
" <td>80.0</td>\n",
" <td>3</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2015-10-02</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>0.666667</td>\n",
" <td>0.333333</td>\n",
" <td>2313.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6194</th>\n",
" <td>Bondi Beach</td>\n",
" <td>151.273411</td>\n",
" <td>-33.888113</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>10</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2015-08-14</td>\n",
" <td>moderate</td>\n",
" <td>0.500000</td>\n",
" <td>0.500000</td>\n",
" <td>2362.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-9659e329-74e6-417e-8071-b58c3bfeca17')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-9659e329-74e6-417e-8071-b58c3bfeca17 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-9659e329-74e6-417e-8071-b58c3bfeca17');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" city longitude latitude review_scores_rating \\\n",
"5594 Randwick 151.238806 -33.913834 NaN \n",
"5439 Newtown 151.184469 -33.894582 NaN \n",
"3847 Bondi Beach 151.273077 -33.895142 NaN \n",
"1312 Randwick 151.245793 -33.920622 NaN \n",
"6194 Bondi Beach 151.273411 -33.888113 NaN \n",
"\n",
" number_of_reviews minimum_nights security_deposit cleaning_fee \\\n",
"5594 0 2 800.0 80.0 \n",
"5439 0 3 5000.0 100.0 \n",
"3847 0 7 271.0 27.0 \n",
"1312 0 3 0.0 80.0 \n",
"6194 0 10 0.0 0.0 \n",
"\n",
" accommodates bathrooms bedrooms beds property_type room_type \\\n",
"5594 6 1.0 3.0 3.0 House Entire home/apt \n",
"5439 11 2.0 3.0 4.0 Apartment Entire home/apt \n",
"3847 2 1.0 1.0 1.0 Apartment Private room \n",
"1312 3 1.0 2.0 2.0 Apartment Entire home/apt \n",
"6194 2 1.0 1.0 1.0 Apartment Private room \n",
"\n",
" availability_365 host_identity_verified host_is_superhost host_since \\\n",
"5594 0 0 0 2013-11-27 \n",
"5439 0 1 0 2014-07-16 \n",
"3847 0 0 0 2015-12-07 \n",
"1312 0 0 0 2015-10-02 \n",
"6194 0 1 0 2015-08-14 \n",
"\n",
" cancellation_policy bedrooms_per_person bathrooms_per_person \\\n",
"5594 moderate 0.500000 0.166667 \n",
"5439 strict_14_with_grace_period 0.272727 0.181818 \n",
"3847 strict_14_with_grace_period 0.500000 0.500000 \n",
"1312 strict_14_with_grace_period 0.666667 0.333333 \n",
"6194 moderate 0.500000 0.500000 \n",
"\n",
" days_on_airbnb \n",
"5594 2987.0 \n",
"5439 2756.0 \n",
"3847 2247.0 \n",
"1312 2313.0 \n",
"6194 2362.0 "
]
},
"metadata": {},
"execution_count": 62
}
],
"source": [
"sample_incomplete_rows = X[X.isnull().any(axis=1)].head()\n",
"print(sample_incomplete_rows.shape)\n",
"sample_incomplete_rows"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "4XwodyVPZXEr",
"outputId": "1855ce10-a079-4a18-e4d9-144b8719433b",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 114
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-23de0ab3-9b09-4b91-9ff6-3408892044a1\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>city</th>\n",
" <th>longitude</th>\n",
" <th>latitude</th>\n",
" <th>review_scores_rating</th>\n",
" <th>number_of_reviews</th>\n",
" <th>minimum_nights</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>property_type</th>\n",
" <th>room_type</th>\n",
" <th>availability_365</th>\n",
" <th>host_identity_verified</th>\n",
" <th>host_is_superhost</th>\n",
" <th>host_since</th>\n",
" <th>cancellation_policy</th>\n",
" <th>bedrooms_per_person</th>\n",
" <th>bathrooms_per_person</th>\n",
" <th>days_on_airbnb</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-23de0ab3-9b09-4b91-9ff6-3408892044a1')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-23de0ab3-9b09-4b91-9ff6-3408892044a1 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-23de0ab3-9b09-4b91-9ff6-3408892044a1');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [city, longitude, latitude, review_scores_rating, number_of_reviews, minimum_nights, security_deposit, cleaning_fee, accommodates, bathrooms, bedrooms, beds, property_type, room_type, availability_365, host_identity_verified, host_is_superhost, host_since, cancellation_policy, bedrooms_per_person, bathrooms_per_person, days_on_airbnb]\n",
"Index: []"
]
},
"metadata": {},
"execution_count": 63
}
],
"source": [
"# Rows Remove\n",
"sample_incomplete_rows.dropna(subset=[\"review_scores_rating\"]) # option 1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eX4eI_xeZXEr",
"outputId": "60560a2f-a9b9-43fb-ee5d-1a3df5b0da23",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 357
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-edfcb2bd-1907-4231-af18-64a4fd2f85f5\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>city</th>\n",
" <th>longitude</th>\n",
" <th>latitude</th>\n",
" <th>number_of_reviews</th>\n",
" <th>minimum_nights</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>property_type</th>\n",
" <th>room_type</th>\n",
" <th>availability_365</th>\n",
" <th>host_identity_verified</th>\n",
" <th>host_is_superhost</th>\n",
" <th>host_since</th>\n",
" <th>cancellation_policy</th>\n",
" <th>bedrooms_per_person</th>\n",
" <th>bathrooms_per_person</th>\n",
" <th>days_on_airbnb</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5594</th>\n",
" <td>Randwick</td>\n",
" <td>151.238806</td>\n",
" <td>-33.913834</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>800.0</td>\n",
" <td>80.0</td>\n",
" <td>6</td>\n",
" <td>1.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>House</td>\n",
" <td>Entire home/apt</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2013-11-27</td>\n",
" <td>moderate</td>\n",
" <td>0.500000</td>\n",
" <td>0.166667</td>\n",
" <td>2987.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5439</th>\n",
" <td>Newtown</td>\n",
" <td>151.184469</td>\n",
" <td>-33.894582</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>5000.0</td>\n",
" <td>100.0</td>\n",
" <td>11</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2014-07-16</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>0.272727</td>\n",
" <td>0.181818</td>\n",
" <td>2756.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3847</th>\n",
" <td>Bondi Beach</td>\n",
" <td>151.273077</td>\n",
" <td>-33.895142</td>\n",
" <td>0</td>\n",
" <td>7</td>\n",
" <td>271.0</td>\n",
" <td>27.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2015-12-07</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>0.500000</td>\n",
" <td>0.500000</td>\n",
" <td>2247.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1312</th>\n",
" <td>Randwick</td>\n",
" <td>151.245793</td>\n",
" <td>-33.920622</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>0.0</td>\n",
" <td>80.0</td>\n",
" <td>3</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2015-10-02</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>0.666667</td>\n",
" <td>0.333333</td>\n",
" <td>2313.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6194</th>\n",
" <td>Bondi Beach</td>\n",
" <td>151.273411</td>\n",
" <td>-33.888113</td>\n",
" <td>0</td>\n",
" <td>10</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2015-08-14</td>\n",
" <td>moderate</td>\n",
" <td>0.500000</td>\n",
" <td>0.500000</td>\n",
" <td>2362.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-edfcb2bd-1907-4231-af18-64a4fd2f85f5')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-edfcb2bd-1907-4231-af18-64a4fd2f85f5 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-edfcb2bd-1907-4231-af18-64a4fd2f85f5');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" city longitude latitude number_of_reviews minimum_nights \\\n",
"5594 Randwick 151.238806 -33.913834 0 2 \n",
"5439 Newtown 151.184469 -33.894582 0 3 \n",
"3847 Bondi Beach 151.273077 -33.895142 0 7 \n",
"1312 Randwick 151.245793 -33.920622 0 3 \n",
"6194 Bondi Beach 151.273411 -33.888113 0 10 \n",
"\n",
" security_deposit cleaning_fee accommodates bathrooms bedrooms beds \\\n",
"5594 800.0 80.0 6 1.0 3.0 3.0 \n",
"5439 5000.0 100.0 11 2.0 3.0 4.0 \n",
"3847 271.0 27.0 2 1.0 1.0 1.0 \n",
"1312 0.0 80.0 3 1.0 2.0 2.0 \n",
"6194 0.0 0.0 2 1.0 1.0 1.0 \n",
"\n",
" property_type room_type availability_365 host_identity_verified \\\n",
"5594 House Entire home/apt 0 0 \n",
"5439 Apartment Entire home/apt 0 1 \n",
"3847 Apartment Private room 0 0 \n",
"1312 Apartment Entire home/apt 0 0 \n",
"6194 Apartment Private room 0 1 \n",
"\n",
" host_is_superhost host_since cancellation_policy \\\n",
"5594 0 2013-11-27 moderate \n",
"5439 0 2014-07-16 strict_14_with_grace_period \n",
"3847 0 2015-12-07 strict_14_with_grace_period \n",
"1312 0 2015-10-02 strict_14_with_grace_period \n",
"6194 0 2015-08-14 moderate \n",
"\n",
" bedrooms_per_person bathrooms_per_person days_on_airbnb \n",
"5594 0.500000 0.166667 2987.0 \n",
"5439 0.272727 0.181818 2756.0 \n",
"3847 0.500000 0.500000 2247.0 \n",
"1312 0.666667 0.333333 2313.0 \n",
"6194 0.500000 0.500000 2362.0 "
]
},
"metadata": {},
"execution_count": 64
}
],
"source": [
"# Columns Remove\n",
"sample_incomplete_rows.drop([\"review_scores_rating\"], axis=1) # option 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "oF5OvJd8ZXEr",
"outputId": "b3a0b257-fc93-4a14-fd43-14f007fdadcc",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 357
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-45ff1141-25a2-4a4a-8852-5015a89a4fd2\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>city</th>\n",
" <th>longitude</th>\n",
" <th>latitude</th>\n",
" <th>review_scores_rating</th>\n",
" <th>number_of_reviews</th>\n",
" <th>minimum_nights</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>property_type</th>\n",
" <th>room_type</th>\n",
" <th>availability_365</th>\n",
" <th>host_identity_verified</th>\n",
" <th>host_is_superhost</th>\n",
" <th>host_since</th>\n",
" <th>cancellation_policy</th>\n",
" <th>bedrooms_per_person</th>\n",
" <th>bathrooms_per_person</th>\n",
" <th>days_on_airbnb</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5594</th>\n",
" <td>Randwick</td>\n",
" <td>151.238806</td>\n",
" <td>-33.913834</td>\n",
" <td>96.0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>800.0</td>\n",
" <td>80.0</td>\n",
" <td>6</td>\n",
" <td>1.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>House</td>\n",
" <td>Entire home/apt</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2013-11-27</td>\n",
" <td>moderate</td>\n",
" <td>0.500000</td>\n",
" <td>0.166667</td>\n",
" <td>2987.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5439</th>\n",
" <td>Newtown</td>\n",
" <td>151.184469</td>\n",
" <td>-33.894582</td>\n",
" <td>96.0</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>5000.0</td>\n",
" <td>100.0</td>\n",
" <td>11</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2014-07-16</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>0.272727</td>\n",
" <td>0.181818</td>\n",
" <td>2756.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3847</th>\n",
" <td>Bondi Beach</td>\n",
" <td>151.273077</td>\n",
" <td>-33.895142</td>\n",
" <td>96.0</td>\n",
" <td>0</td>\n",
" <td>7</td>\n",
" <td>271.0</td>\n",
" <td>27.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2015-12-07</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>0.500000</td>\n",
" <td>0.500000</td>\n",
" <td>2247.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1312</th>\n",
" <td>Randwick</td>\n",
" <td>151.245793</td>\n",
" <td>-33.920622</td>\n",
" <td>96.0</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>0.0</td>\n",
" <td>80.0</td>\n",
" <td>3</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2015-10-02</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>0.666667</td>\n",
" <td>0.333333</td>\n",
" <td>2313.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6194</th>\n",
" <td>Bondi Beach</td>\n",
" <td>151.273411</td>\n",
" <td>-33.888113</td>\n",
" <td>96.0</td>\n",
" <td>0</td>\n",
" <td>10</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2015-08-14</td>\n",
" <td>moderate</td>\n",
" <td>0.500000</td>\n",
" <td>0.500000</td>\n",
" <td>2362.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-45ff1141-25a2-4a4a-8852-5015a89a4fd2')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-45ff1141-25a2-4a4a-8852-5015a89a4fd2 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-45ff1141-25a2-4a4a-8852-5015a89a4fd2');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" city longitude latitude review_scores_rating \\\n",
"5594 Randwick 151.238806 -33.913834 96.0 \n",
"5439 Newtown 151.184469 -33.894582 96.0 \n",
"3847 Bondi Beach 151.273077 -33.895142 96.0 \n",
"1312 Randwick 151.245793 -33.920622 96.0 \n",
"6194 Bondi Beach 151.273411 -33.888113 96.0 \n",
"\n",
" number_of_reviews minimum_nights security_deposit cleaning_fee \\\n",
"5594 0 2 800.0 80.0 \n",
"5439 0 3 5000.0 100.0 \n",
"3847 0 7 271.0 27.0 \n",
"1312 0 3 0.0 80.0 \n",
"6194 0 10 0.0 0.0 \n",
"\n",
" accommodates bathrooms bedrooms beds property_type room_type \\\n",
"5594 6 1.0 3.0 3.0 House Entire home/apt \n",
"5439 11 2.0 3.0 4.0 Apartment Entire home/apt \n",
"3847 2 1.0 1.0 1.0 Apartment Private room \n",
"1312 3 1.0 2.0 2.0 Apartment Entire home/apt \n",
"6194 2 1.0 1.0 1.0 Apartment Private room \n",
"\n",
" availability_365 host_identity_verified host_is_superhost host_since \\\n",
"5594 0 0 0 2013-11-27 \n",
"5439 0 1 0 2014-07-16 \n",
"3847 0 0 0 2015-12-07 \n",
"1312 0 0 0 2015-10-02 \n",
"6194 0 1 0 2015-08-14 \n",
"\n",
" cancellation_policy bedrooms_per_person bathrooms_per_person \\\n",
"5594 moderate 0.500000 0.166667 \n",
"5439 strict_14_with_grace_period 0.272727 0.181818 \n",
"3847 strict_14_with_grace_period 0.500000 0.500000 \n",
"1312 strict_14_with_grace_period 0.666667 0.333333 \n",
"6194 moderate 0.500000 0.500000 \n",
"\n",
" days_on_airbnb \n",
"5594 2987.0 \n",
"5439 2756.0 \n",
"3847 2247.0 \n",
"1312 2313.0 \n",
"6194 2362.0 "
]
},
"metadata": {},
"execution_count": 65
}
],
"source": [
"median = X[\"review_scores_rating\"].median()\n",
"sample_incomplete_rows[\"review_scores_rating\"].fillna(median, inplace=True) # option 3\n",
"\n",
"sample_incomplete_rows"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LRGC0OEFZXEs"
},
"outputs": [],
"source": [
"from sklearn.impute import SimpleImputer\n",
"imputer = SimpleImputer(missing_values=np.nan, strategy='median')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CLvExC7xZXEs"
},
"source": [
"Remove the text attribute because median can only be calculated on numerical attributes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "mG3Tkpj5ZXEs"
},
"outputs": [],
"source": [
"cat_cols = [\"city\",\"cancellation_policy\",\"host_since\",\"room_type\",\"property_type\",\"host_since\"]\n",
"X_num = X.drop(cat_cols, axis=1)\n",
"# alternatively: X_num = X.select_dtypes(include=[int, float])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "EF1i_7GYZXEs",
"outputId": "3c16b362-ecb0-4869-f507-395ede376ab9",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"SimpleImputer(strategy='median')"
]
},
"metadata": {},
"execution_count": 68
}
],
"source": [
"imputer.fit(X_num)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0HhMdYSwZXEs",
"outputId": "a36bd71d-8cf9-4029-88e7-9a3ce4d61eb0",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([ 1.51259665e+02, -3.38885369e+01, 9.60000000e+01, 3.00000000e+00,\n",
" 3.00000000e+00, 0.00000000e+00, 5.00000000e+01, 2.00000000e+00,\n",
" 1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 4.00000000e+00,\n",
" 1.00000000e+00, 0.00000000e+00, 5.00000000e-01, 5.00000000e-01,\n",
" 2.62700000e+03])"
]
},
"metadata": {},
"execution_count": 69
}
],
"source": [
"imputer.statistics_"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8FHn3TgDZXEt"
},
"source": [
"Check that this is the same as manually computing the median of each attribute:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "d2IXsugRZXEt",
"outputId": "dfab33a4-1119-46d2-efff-368a30652350",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([ 1.51259665e+02, -3.38885369e+01, 9.60000000e+01, 3.00000000e+00,\n",
" 3.00000000e+00, 0.00000000e+00, 5.00000000e+01, 2.00000000e+00,\n",
" 1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 4.00000000e+00,\n",
" 1.00000000e+00, 0.00000000e+00, 5.00000000e-01, 5.00000000e-01,\n",
" 2.62700000e+03])"
]
},
"metadata": {},
"execution_count": 70
}
],
"source": [
"X_num.median().values"
]
},
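{
"cell_type": "markdown",
"metadata": {},
"source": [
"Comparing two long arrays by eye is error-prone, so the same check can be written as an assertion. The cell below is a minimal sketch on a made-up two-column frame (`toy`, `toy_imputer` and the values are illustrative assumptions, not part of the Airbnb data); the same `np.allclose` call could be applied to `imputer.statistics_` and `X_num.median().values`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.impute import SimpleImputer\n",
"\n",
"# Hypothetical toy data with one missing value per column\n",
"toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, 10.0],\n",
"                    'b': [2.0, 4.0, np.nan, 6.0]})\n",
"\n",
"toy_imputer = SimpleImputer(strategy='median').fit(toy)\n",
"\n",
"# pandas skips NaN when computing the median, as SimpleImputer does\n",
"medians_match = np.allclose(toy_imputer.statistics_, toy.median().values)\n",
"medians_match"
]
},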
{
"cell_type": "markdown",
"metadata": {
"id": "D-DxnF2LZXEt"
},
"source": [
"Transform the training set:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "mUAqpg8yZXEu"
},
"outputs": [],
"source": [
"X_num_np = imputer.transform(X_num)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NYr-ylqvZXEu"
},
"outputs": [],
"source": [
"X_num = pd.DataFrame(X_num_np, columns=X_num.columns,\n",
" index = list(X_num.index.values))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Qjt67ohqZXEu",
"outputId": "de8050eb-0a2f-4c9a-ccfc-51466643b2ce",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 270
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-e76ba259-ce98-4f89-a734-01077cdb21e7\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>longitude</th>\n",
" <th>latitude</th>\n",
" <th>review_scores_rating</th>\n",
" <th>number_of_reviews</th>\n",
" <th>minimum_nights</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>availability_365</th>\n",
" <th>host_identity_verified</th>\n",
" <th>host_is_superhost</th>\n",
" <th>bedrooms_per_person</th>\n",
" <th>bathrooms_per_person</th>\n",
" <th>days_on_airbnb</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5594</th>\n",
" <td>151.238806</td>\n",
" <td>-33.913834</td>\n",
" <td>96.0</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>800.0</td>\n",
" <td>80.0</td>\n",
" <td>6.0</td>\n",
" <td>1.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.500000</td>\n",
" <td>0.166667</td>\n",
" <td>2987.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5439</th>\n",
" <td>151.184469</td>\n",
" <td>-33.894582</td>\n",
" <td>96.0</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>5000.0</td>\n",
" <td>100.0</td>\n",
" <td>11.0</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.272727</td>\n",
" <td>0.181818</td>\n",
" <td>2756.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3847</th>\n",
" <td>151.273077</td>\n",
" <td>-33.895142</td>\n",
" <td>96.0</td>\n",
" <td>0.0</td>\n",
" <td>7.0</td>\n",
" <td>271.0</td>\n",
" <td>27.0</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.500000</td>\n",
" <td>0.500000</td>\n",
" <td>2247.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1312</th>\n",
" <td>151.245793</td>\n",
" <td>-33.920622</td>\n",
" <td>96.0</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>80.0</td>\n",
" <td>3.0</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.666667</td>\n",
" <td>0.333333</td>\n",
" <td>2313.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6194</th>\n",
" <td>151.273411</td>\n",
" <td>-33.888113</td>\n",
" <td>96.0</td>\n",
" <td>0.0</td>\n",
" <td>10.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.500000</td>\n",
" <td>0.500000</td>\n",
" <td>2362.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-e76ba259-ce98-4f89-a734-01077cdb21e7')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-e76ba259-ce98-4f89-a734-01077cdb21e7 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-e76ba259-ce98-4f89-a734-01077cdb21e7');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" longitude latitude review_scores_rating number_of_reviews \\\n",
"5594 151.238806 -33.913834 96.0 0.0 \n",
"5439 151.184469 -33.894582 96.0 0.0 \n",
"3847 151.273077 -33.895142 96.0 0.0 \n",
"1312 151.245793 -33.920622 96.0 0.0 \n",
"6194 151.273411 -33.888113 96.0 0.0 \n",
"\n",
" minimum_nights security_deposit cleaning_fee accommodates bathrooms \\\n",
"5594 2.0 800.0 80.0 6.0 1.0 \n",
"5439 3.0 5000.0 100.0 11.0 2.0 \n",
"3847 7.0 271.0 27.0 2.0 1.0 \n",
"1312 3.0 0.0 80.0 3.0 1.0 \n",
"6194 10.0 0.0 0.0 2.0 1.0 \n",
"\n",
" bedrooms beds availability_365 host_identity_verified \\\n",
"5594 3.0 3.0 0.0 0.0 \n",
"5439 3.0 4.0 0.0 1.0 \n",
"3847 1.0 1.0 0.0 0.0 \n",
"1312 2.0 2.0 0.0 0.0 \n",
"6194 1.0 1.0 0.0 1.0 \n",
"\n",
" host_is_superhost bedrooms_per_person bathrooms_per_person \\\n",
"5594 0.0 0.500000 0.166667 \n",
"5439 0.0 0.272727 0.181818 \n",
"3847 0.0 0.500000 0.500000 \n",
"1312 0.0 0.666667 0.333333 \n",
"6194 0.0 0.500000 0.500000 \n",
"\n",
" days_on_airbnb \n",
"5594 2987.0 \n",
"5439 2756.0 \n",
"3847 2247.0 \n",
"1312 2313.0 \n",
"6194 2362.0 "
]
},
"metadata": {},
"execution_count": 73
}
],
"source": [
"X_num.loc[sample_incomplete_rows.index.values]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6pHX0HJTZXEu",
"outputId": "e4224d9b-d221-4620-c121-2583f2a12dec",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
},
"text/plain": [
"'median'"
]
},
"metadata": {},
"execution_count": 74
}
],
"source": [
"imputer.strategy"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9p6IGSz9ZXEv"
},
"source": [
"Now let's preprocess the categorical input feature, `ocean_proximity`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jt18aFVyZXEv",
"outputId": "e60470a4-a8cd-4dec-fcc6-903240be691f",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 363
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-93677505-3490-4d99-b479-31a143e85ba6\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>city</th>\n",
" <th>property_type</th>\n",
" <th>room_type</th>\n",
" <th>cancellation_policy</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5484</th>\n",
" <td>Newtown</td>\n",
" <td>House</td>\n",
" <td>Entire home/apt</td>\n",
" <td>moderate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1267</th>\n",
" <td>Randwick</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>moderate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6658</th>\n",
" <td>Manly</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2522</th>\n",
" <td>Randwick</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>flexible</td>\n",
" </tr>\n",
" <tr>\n",
" <th>722</th>\n",
" <td>Coogee</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3150</th>\n",
" <td>Manly</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2865</th>\n",
" <td>Surry Hills</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4906</th>\n",
" <td>Bondi Beach</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" <tr>\n",
" <th>575</th>\n",
" <td>Coogee</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5827</th>\n",
" <td>Newtown</td>\n",
" <td>House</td>\n",
" <td>Private room</td>\n",
" <td>flexible</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-93677505-3490-4d99-b479-31a143e85ba6')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-93677505-3490-4d99-b479-31a143e85ba6 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-93677505-3490-4d99-b479-31a143e85ba6');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" city property_type room_type cancellation_policy\n",
"5484 Newtown House Entire home/apt moderate\n",
"1267 Randwick Apartment Private room moderate\n",
"6658 Manly Apartment Entire home/apt strict_14_with_grace_period\n",
"2522 Randwick Apartment Private room flexible\n",
"722 Coogee Apartment Private room strict_14_with_grace_period\n",
"3150 Manly Apartment Entire home/apt strict_14_with_grace_period\n",
"2865 Surry Hills Apartment Entire home/apt strict_14_with_grace_period\n",
"4906 Bondi Beach Apartment Entire home/apt strict_14_with_grace_period\n",
"575 Coogee Apartment Entire home/apt strict_14_with_grace_period\n",
"5827 Newtown House Private room flexible"
]
},
"metadata": {},
"execution_count": 75
}
],
"source": [
"X_cat = X.select_dtypes(include=[object])\n",
"X_cat.head(10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "avVCgxAFZXEv"
},
"outputs": [],
"source": [
"from sklearn.preprocessing import OrdinalEncoder"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "H0KRRktvZXEv",
"outputId": "9159236b-b125-41f1-944f-b2148445a1e1",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-ee476872-36dc-462e-ad86-fbe3178db0b1\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>city</th>\n",
" <th>property_type</th>\n",
" <th>room_type</th>\n",
" <th>cancellation_policy</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5484</th>\n",
" <td>Newtown</td>\n",
" <td>House</td>\n",
" <td>Entire home/apt</td>\n",
" <td>moderate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1267</th>\n",
" <td>Randwick</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>moderate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6658</th>\n",
" <td>Manly</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2522</th>\n",
" <td>Randwick</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>flexible</td>\n",
" </tr>\n",
" <tr>\n",
" <th>722</th>\n",
" <td>Coogee</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-ee476872-36dc-462e-ad86-fbe3178db0b1')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-ee476872-36dc-462e-ad86-fbe3178db0b1 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-ee476872-36dc-462e-ad86-fbe3178db0b1');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" city property_type room_type cancellation_policy\n",
"5484 Newtown House Entire home/apt moderate\n",
"1267 Randwick Apartment Private room moderate\n",
"6658 Manly Apartment Entire home/apt strict_14_with_grace_period\n",
"2522 Randwick Apartment Private room flexible\n",
"722 Coogee Apartment Private room strict_14_with_grace_period"
]
},
"metadata": {},
"execution_count": 77
}
],
"source": [
"X_cat.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wBwZIFPrZXEw",
"outputId": "02c74a9d-41ec-4af1-b09a-40743b03a7ca",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([[5., 6., 0., 1.],\n",
" [7., 0., 1., 1.],\n",
" [4., 0., 0., 2.],\n",
" [7., 0., 1., 0.],\n",
" [2., 0., 1., 2.],\n",
" [4., 0., 0., 2.],\n",
" [8., 0., 0., 2.],\n",
" [1., 0., 0., 2.],\n",
" [2., 0., 0., 2.],\n",
" [5., 6., 1., 0.]])"
]
},
"metadata": {},
"execution_count": 78
}
],
"source": [
"ordinal_encoder = OrdinalEncoder()\n",
"X_cat_enc = ordinal_encoder.fit_transform(X_cat)\n",
"X_cat_enc[:10]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "iz8WYbg6ZXEx",
"outputId": "c70c5ee5-cbfe-4dce-8a92-f4685b8b0221",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[array(['Bondi', 'Bondi Beach', 'Coogee', 'Darlinghurst', 'Manly',\n",
" 'Newtown', 'North Bondi', 'Randwick', 'Surry Hills', 'Sydney'],\n",
" dtype=object),\n",
" array(['Apartment', 'Bed and breakfast', 'Condominium', 'Guest suite',\n",
" 'Guesthouse', 'Hostel', 'House', 'Loft', 'Other',\n",
" 'Serviced apartment', 'Townhouse', 'Villa'], dtype=object),\n",
" array(['Entire home/apt', 'Private room', 'Shared room'], dtype=object),\n",
" array(['flexible', 'moderate', 'strict_14_with_grace_period',\n",
" 'super_strict_60'], dtype=object)]"
]
},
"metadata": {},
"execution_count": 79
}
],
"source": [
"ordinal_encoder.categories_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "VTp9-r3LZXEx",
"outputId": "b51e56b8-c245-4fee-889b-1b2c0c33a343",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<6485x29 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 25940 stored elements in Compressed Sparse Row format>"
]
},
"metadata": {},
"execution_count": 80
}
],
"source": [
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
"cat_encoder = OneHotEncoder()\n",
"X_cat_1hot = cat_encoder.fit_transform(X_cat)\n",
"X_cat_1hot"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FRN0MRzoZXEx"
},
"source": [
"By default, the `OneHotEncoder` class returns a sparse array, but we can convert it to a dense array if needed by calling the `toarray()` method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ejwzN3QXZXEx",
"outputId": "1cb93388-b58a-4aea-adc6-cf2b91366823",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([[0., 0., 0., ..., 1., 0., 0.],\n",
" [0., 0., 0., ..., 1., 0., 0.],\n",
" [0., 0., 0., ..., 0., 1., 0.],\n",
" ...,\n",
" [1., 0., 0., ..., 1., 0., 0.],\n",
" [0., 0., 0., ..., 0., 1., 0.],\n",
" [0., 0., 1., ..., 0., 1., 0.]])"
]
},
"metadata": {},
"execution_count": 81
}
],
"source": [
"X_cat_1hot.toarray()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GQAIib7XZXEy"
},
"source": [
"Alternatively, you can set `sparse=False` when creating the `OneHotEncoder`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UbmeOaTTZXEy",
"outputId": "602cb2ca-5b14-4ef0-eb5a-89b880351fb2",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([[0., 0., 0., ..., 1., 0., 0.],\n",
" [0., 0., 0., ..., 1., 0., 0.],\n",
" [0., 0., 0., ..., 0., 1., 0.],\n",
" ...,\n",
" [1., 0., 0., ..., 1., 0., 0.],\n",
" [0., 0., 0., ..., 0., 1., 0.],\n",
" [0., 0., 1., ..., 0., 1., 0.]])"
]
},
"metadata": {},
"execution_count": 83
}
],
"source": [
"cat_encoder = OneHotEncoder(sparse=False)\n",
"X_cat_1hot = cat_encoder.fit_transform(X_cat)\n",
"X_cat_1hot"
]
},
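{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which keyword requests dense output depends on the installed scikit-learn release. The cell below is a minimal sketch of one way to stay version-tolerant: it inspects the constructor signature and picks whichever name is accepted. The frame `demo` and the names `dense_kw` and `dense_encoder` are assumptions for illustration and do not come from the Airbnb data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import inspect\n",
"import pandas as pd\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
"# Pick whichever keyword this scikit-learn version accepts\n",
"params = inspect.signature(OneHotEncoder).parameters\n",
"dense_kw = 'sparse_output' if 'sparse_output' in params else 'sparse'\n",
"\n",
"# Hypothetical toy data, not the Airbnb listings\n",
"demo = pd.DataFrame({'room_type': ['Private room', 'Entire home/apt', 'Private room']})\n",
"\n",
"dense_encoder = OneHotEncoder(**{dense_kw: False})\n",
"demo_1hot = dense_encoder.fit_transform(demo)  # dense NumPy array, one column per category\n",
"demo_1hot"
]
},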
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "BXYuh9cBZXEy",
"outputId": "a5b5aac6-a296-427f-956d-e046d62bf5bf",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[array(['Bondi', 'Bondi Beach', 'Coogee', 'Darlinghurst', 'Manly',\n",
" 'Newtown', 'North Bondi', 'Randwick', 'Surry Hills', 'Sydney'],\n",
" dtype=object),\n",
" array(['Apartment', 'Bed and breakfast', 'Condominium', 'Guest suite',\n",
" 'Guesthouse', 'Hostel', 'House', 'Loft', 'Other',\n",
" 'Serviced apartment', 'Townhouse', 'Villa'], dtype=object),\n",
" array(['Entire home/apt', 'Private room', 'Shared room'], dtype=object),\n",
" array(['flexible', 'moderate', 'strict_14_with_grace_period',\n",
" 'super_strict_60'], dtype=object)]"
]
},
"metadata": {},
"execution_count": 84
}
],
"source": [
"cat_encoder.categories_"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "R__DyD99ZXEz"
},
"source": [
"Let's create a custom transformer to add extra attributes:"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6eUaohqrZXEz"
},
"source": [
"#### **Now let's create a pipeline for preprocessing that is built on the techniques we used up and till now and introduce some new pipeline techniques.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2lH84PvDZXEz"
},
"outputs": [],
"source": [
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"from datetime import datetime\n",
"numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']\n",
"\n",
"# Receive numpy array, convert to pandas for features, convert back to array for output.\n",
"\n",
"class CombinedAttributesAdder(BaseEstimator, TransformerMixin):\n",
" def __init__(self, popularity = True, num_cols=[]): # no *args or **kargs\n",
" self.popularity = popularity\n",
" self.num_cols = num_cols\n",
" def fit(self, X, y=None):\n",
" return self # nothing else to do\n",
" def transform(self, X, y=None):\n",
"\n",
" ### Some feature engineering\n",
" X = pd.DataFrame(X, columns=self.num_cols)\n",
" X[\"bedrooms_per_person\"] = X[\"bedrooms\"]/X[\"accommodates\"]\n",
" X[\"bathrooms_per_person\"] = X[\"bathrooms\"]/X[\"accommodates\"]\n",
"\n",
" global feats\n",
" feats = [\"bedrooms_per_person\",\"bathrooms_per_person\"]\n",
"\n",
" if self.popularity:\n",
" X[\"past_and_future_popularity\"]=X[\"number_of_reviews\"]/(X[\"availability_365\"]+1)\n",
" feats.append(\"past_and_future_popularity\")\n",
"\n",
" return X.values\n",
" else:\n",
" return X.values\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QZN0uWl0ZXEz"
},
"outputs": [],
"source": [
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"X = strat_train_set.copy().drop(\"price\",axis=1)\n",
"Y = strat_train_set[\"price\"]\n",
"\n",
"num_cols = list(X.select_dtypes(include=numerics).columns)\n",
"cat_cols = list(X.select_dtypes(include=[object]).columns)\n",
"\n",
"num_pipeline = Pipeline([\n",
" ('imputer', SimpleImputer(strategy='median')),\n",
" ('attribs_adder', CombinedAttributesAdder(num_cols=num_cols,popularity=True)),\n",
" ('std_scaler', StandardScaler()),\n",
" ])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "iOIWdl0nZXE0"
},
"outputs": [],
"source": [
"from sklearn.compose import ColumnTransformer\n",
"import itertools\n",
"\n",
"\n",
"numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']\n",
"\n",
"mid_pipeline = ColumnTransformer([\n",
" (\"num\", num_pipeline, num_cols),\n",
" (\"cat\", OneHotEncoder(),cat_cols ),\n",
" ])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lGYW5ngnZXE1"
},
"outputs": [],
"source": [
"mid_pipeline.fit(X) # this one specifically has to be fitted for the cat names\n",
"cat_encoder = mid_pipeline.named_transformers_[\"cat\"]\n",
"sublists = [list(bas) for bas in cat_encoder.categories_]\n",
"one_cols = list(itertools.chain(*sublists))\n",
"\n",
"## In this class, I will be converting numpy back to pandas\n",
"\n",
"class ToPandasDF(BaseEstimator, TransformerMixin):\n",
" def __init__(self, fit_index = [] ): # no *args or **kargs\n",
" self.fit_index = fit_index\n",
" def fit(self, X_df, y=None):\n",
" return self # nothing else to do\n",
" def transform(self, X_df, y=None):\n",
" global cols\n",
" cols = num_cols.copy()\n",
" cols.extend(feats)\n",
" cols.extend(one_cols) # one in place of cat\n",
" X_df = pd.DataFrame(X_df, columns=cols,index=self.fit_index)\n",
"\n",
" return X_df\n",
"\n",
"def pipe(inds):\n",
" return Pipeline([\n",
" (\"mid\", mid_pipeline),\n",
" (\"PD\", ToPandasDF(inds)),\n",
" ])\n",
"\n",
"params = {\"inds\" : list(X.index)}\n",
"\n",
"X_pr = pipe(**params).fit_transform(X) # Now we have done all the preprocessing instead of\n",
" #.. doing it bit by bit. The pipeline becomes\n",
" #.. extremely handy in the cross-validation step."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "q1NzMdZMZXE2"
},
"source": [
"# Select and train a model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "PmMUHQUrZXE3",
"outputId": "aaa7e283-6697-4b01-8d2b-55aaed1bbe30",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"LinearRegression()"
]
},
"metadata": {},
"execution_count": 90
}
],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"Y_pr = Y.copy() # just for naming convention, _pr for processed.\n",
"\n",
"lin_reg = LinearRegression()\n",
"lin_reg.fit(X_pr, Y_pr)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Z3Vy7xZrZXE4",
"outputId": "fcd3b708-ba58-4d14-9265-a27dcc0f2187",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Predictions: [213.73373314 41.43930499 159.42015301 51.73470971 57.8647697 ]\n"
]
}
],
"source": [
"# let's try the full preprocessing pipeline on a few training instances\n",
"some_data = X.iloc[:5]\n",
"some_labels = Y.iloc[:5]\n",
"some_data_prepared = pipe(inds=list(some_data.index)).transform(some_data)\n",
"\n",
"print(\"Predictions:\", lin_reg.predict(some_data_prepared))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kouA5WArZXE4"
},
"source": [
"Compare against the actual values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OSu92IcSZXE4",
"outputId": "cd410e9a-02ae-48d1-94e4-7e49dcf8dccf",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Labels: [200.0, 183.0, 175.0, 85.0, 80.0]\n"
]
}
],
"source": [
"print(\"Labels:\", list(some_labels))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3T4flEmtZXE5"
},
"outputs": [],
"source": [
"## Naturally, these metrics are not that fair, because it is insample.\n",
"## However the first model is linear so overfitting is less likley.\n",
"## We will look at some out of sample validation later on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cTdwKQ6QZXE5",
"outputId": "c5824d5e-5305-4175-bee2-69f0e1c2f18f",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"105.96524352798795"
]
},
"metadata": {},
"execution_count": 94
}
],
"source": [
"from sklearn.metrics import mean_squared_error, mean_absolute_error\n",
"\n",
"X_pred = lin_reg.predict(X_pr)\n",
"lin_mse = mean_squared_error(Y, X_pred)\n",
"lin_rmse = np.sqrt(lin_mse)\n",
"lin_rmse"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cETn4gAkZXE5",
"outputId": "4f420c6e-0f94-483a-a1e4-1801ee6e81eb",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"67.26978406629445"
]
},
"metadata": {},
"execution_count": 95
}
],
"source": [
"from sklearn.metrics import mean_absolute_error\n",
"\n",
"lin_mae = mean_absolute_error(Y, X_pred)\n",
"lin_mae"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZRK9Iu9oZXE6",
"outputId": "ba5e7dda-8853-4a96-a67a-fd5d5f2751ae",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"DecisionTreeRegressor(random_state=42)"
]
},
"metadata": {},
"execution_count": 96
}
],
"source": [
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"tree_reg = DecisionTreeRegressor(random_state=42)\n",
"tree_reg.fit(X_pr, Y)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vNI9NayEZXE6",
"outputId": "333c2913-c465-4e38-da0d-130d12a8c5c0",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.0"
]
},
"metadata": {},
"execution_count": 97
}
],
"source": [
"X_pred = tree_reg.predict(X_pr)\n",
"tree_mse = mean_squared_error(Y, X_pred)\n",
"tree_rmse = np.sqrt(tree_mse)\n",
"tree_rmse ## Model is complex and overfits completely."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "A4PNKzEzZXE6"
},
"source": [
"# Fine-tune your model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Bsd5tpbMZXE7"
},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_score\n",
"\n",
"scores = cross_val_score(DecisionTreeRegressor(random_state=42), X_pr, Y,\n",
" scoring=\"neg_mean_squared_error\", cv=10)\n",
"tree_rmse_scores = np.sqrt(-scores)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QmwGenEsZXE7",
"outputId": "dfb7cbf9-bf73-4321-d040-0ebe82561ece",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Scores: [156.28105238 124.28821071 134.0732385 128.15088853 139.53568724\n",
" 132.41886055 131.85578893 140.82139989 145.64308755 147.38275838]\n",
"Mean: 138.04509726649928\n",
"Standard deviation: 9.275161339670507\n"
]
}
],
"source": [
"def display_scores(scores):\n",
" print(\"Scores:\", scores)\n",
" print(\"Mean:\", scores.mean())\n",
" print(\"Standard deviation:\", scores.std())\n",
"\n",
"display_scores(tree_rmse_scores)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dZdAf0FVZXE7",
"outputId": "aea19a7a-0295-4069-ce80-a938d57ccf92",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Scores: [8.49207574 8.14863276 8.16271694 8.23897312 8.05176911 8.1944819\n",
" 8.28255886 8.13214541 8.09775686 8.57665211]\n",
"Mean: 8.237776280571119\n",
"Standard deviation: 0.16196568113321622\n"
]
}
],
"source": [
"lin_scores = cross_val_score(LinearRegression(), X_pr, Y,\n",
" scoring=\"neg_mean_absolute_error\", cv=10)\n",
"lin_rmse_scores = np.sqrt(-lin_scores)\n",
"display_scores(lin_rmse_scores)\n",
"## bad performance, might need some regularisation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bQ87leZjZXE7",
"outputId": "a40d7c88-2c46-4ebb-d654-c93831fe0fb5",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"RandomForestRegressor(random_state=42)"
]
},
"metadata": {},
"execution_count": 101
}
],
"source": [
"from sklearn.ensemble import RandomForestRegressor\n",
"\n",
"forest_reg = RandomForestRegressor(random_state=42)\n",
"forest_reg.fit(X_pr, Y)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "i3eHPk_FZXE8",
"outputId": "327da65d-8c97-4116-c19a-56c24cb658f4",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"37.13182906316284"
]
},
"metadata": {},
"execution_count": 103
}
],
"source": [
"X_pred = forest_reg.predict(X_pr)\n",
"forest_mse = mean_squared_error(Y, X_pred)\n",
"forest_rmse = np.sqrt(forest_mse)\n",
"forest_rmse"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "a7WxR5raZXE8",
"outputId": "55e5cb0f-f2cf-45a7-d604-b7b43b968714",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Scores: [100.1462783 99.02900267 90.66372495 95.33295731 94.68716428\n",
" 95.99410459 98.76173054 97.77569275 98.33173491 118.6137359 ]\n",
"Mean: 98.93361262081221\n",
"Standard deviation: 7.060795137979904\n"
]
}
],
"source": [
"#might take 40 seconds\n",
"\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"forest_scores = cross_val_score(forest_reg, X_pr, Y,\n",
" scoring=\"neg_mean_squared_error\", cv=10)\n",
"forest_rmse_scores = np.sqrt(-forest_scores)\n",
"display_scores(forest_rmse_scores)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "q-ECZNJ2ZXE8",
"outputId": "9635be4c-0750-42e4-9fda-024deb889f4d",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"count 10.000000\n",
"mean 106.860441\n",
"std 7.980777\n",
"min 101.432398\n",
"25% 102.532852\n",
"50% 103.917220\n",
"75% 106.914584\n",
"max 127.604757\n",
"dtype: float64"
]
},
"metadata": {},
"execution_count": 105
}
],
"source": [
"scores = cross_val_score(lin_reg, X_pr, Y, scoring=\"neg_mean_squared_error\", cv=10)\n",
"pd.Series(np.sqrt(-scores)).describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XSzo6ZnXZXE9",
"outputId": "e9498960-9511-4802-bae3-00093226fcbe",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"115.17213979847577"
]
},
"metadata": {},
"execution_count": 106
}
],
"source": [
"from sklearn.svm import SVR\n",
"\n",
"svm_reg = SVR(kernel=\"linear\")\n",
"svm_reg.fit( X_pr, Y,)\n",
"X_pred = svm_reg.predict(X_pr)\n",
"svm_mse = mean_squared_error(Y, X_pred)\n",
"svm_rmse = np.sqrt(svm_mse)\n",
"svm_rmse"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "GE_D08IFZXE9",
"outputId": "ca275551-9b10-420f-cba4-269cb06b3ec8",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),\n",
" param_grid=[{'max_features': [2, 4, 6, 8],\n",
" 'n_estimators': [3, 10, 30]},\n",
" {'bootstrap': [False], 'max_features': [2, 3, 4],\n",
" 'n_estimators': [3, 10]}],\n",
" return_train_score=True, scoring='neg_mean_squared_error')"
]
},
"metadata": {},
"execution_count": 107
}
],
"source": [
"## 50 Seconds to run this code block.\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"param_grid = [\n",
" # try 12 (3×4) combinations of hyperparameters\n",
" {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},\n",
" # then try 6 (2×3) combinations with bootstrap set as False\n",
" {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},\n",
" ]\n",
"\n",
"forest_reg = RandomForestRegressor(random_state=42)\n",
"# train across 5 folds, that's a total of (12+6)*5=90 rounds of training\n",
"grid_search = GridSearchCV(forest_reg, param_grid, cv=5,\n",
" scoring='neg_mean_squared_error', return_train_score=True)\n",
"grid_search.fit( X_pr, Y)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2Kkj9tt5ZXE-"
},
"source": [
"The best hyperparameter combination found:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "FOO2uIjyZXE-",
"outputId": "fa8b27bc-68ea-496c-9616-bcbb9a2e2dc6",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"{'max_features': 8, 'n_estimators': 30}"
]
},
"metadata": {},
"execution_count": 108
}
],
"source": [
"grid_search.best_params_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ETCJEzqeZXE_",
"outputId": "a375ea67-fd86-42d4-eb3b-d50459299b4a",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"RandomForestRegressor(max_features=8, n_estimators=30, random_state=42)"
]
},
"metadata": {},
"execution_count": 109
}
],
"source": [
"grid_search.best_estimator_"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OxG9iaZ5ZXE_"
},
"source": [
"Let's look at the score of each hyperparameter combination tested during the grid search:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ybvom2NLZXE_",
"outputId": "d9c48476-5c0e-4e3e-91d3-45fb9a5f6014",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"122.39284414100575 {'max_features': 2, 'n_estimators': 3}\n",
"107.52558724072307 {'max_features': 2, 'n_estimators': 10}\n",
"103.24120436817456 {'max_features': 2, 'n_estimators': 30}\n",
"117.01056323911455 {'max_features': 4, 'n_estimators': 3}\n",
"104.52887121598017 {'max_features': 4, 'n_estimators': 10}\n",
"100.54607113831867 {'max_features': 4, 'n_estimators': 30}\n",
"114.82721863582032 {'max_features': 6, 'n_estimators': 3}\n",
"102.40550231127109 {'max_features': 6, 'n_estimators': 10}\n",
"99.29218505733148 {'max_features': 6, 'n_estimators': 30}\n",
"113.39151446263007 {'max_features': 8, 'n_estimators': 3}\n",
"102.32887396868892 {'max_features': 8, 'n_estimators': 10}\n",
"99.00755576481376 {'max_features': 8, 'n_estimators': 30}\n",
"117.6929070419739 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}\n",
"103.26241827713363 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}\n",
"119.41447085064425 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}\n",
"105.63627556700698 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}\n",
"115.8025113703555 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}\n",
"103.66218346831528 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}\n",
"\n",
"Best grid-search performance: 99.00755576481376\n"
]
}
],
"source": [
"cvres = grid_search.cv_results_\n",
"for mean_score, params in zip(cvres[\"mean_test_score\"], cvres[\"params\"]):\n",
" print(np.sqrt(-mean_score), params)\n",
"\n",
"print(\"\")\n",
"print(\"Best grid-search performance: \", np.sqrt(-cvres[\"mean_test_score\"].max()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "SpbhLoVfZXFA",
"outputId": "84133b91-ba69-41bd-b51f-9f7a4ece78e0",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 530
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-fc8bc834-f406-4caa-9b99-decf5f43d5af\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>mean_fit_time</th>\n",
" <th>std_fit_time</th>\n",
" <th>mean_score_time</th>\n",
" <th>std_score_time</th>\n",
" <th>param_max_features</th>\n",
" <th>param_n_estimators</th>\n",
" <th>param_bootstrap</th>\n",
" <th>params</th>\n",
" <th>split0_test_score</th>\n",
" <th>split1_test_score</th>\n",
" <th>split2_test_score</th>\n",
" <th>split3_test_score</th>\n",
" <th>split4_test_score</th>\n",
" <th>mean_test_score</th>\n",
" <th>std_test_score</th>\n",
" <th>rank_test_score</th>\n",
" <th>split0_train_score</th>\n",
" <th>split1_train_score</th>\n",
" <th>split2_train_score</th>\n",
" <th>split3_train_score</th>\n",
" <th>split4_train_score</th>\n",
" <th>mean_train_score</th>\n",
" <th>std_train_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.026091</td>\n",
" <td>0.002264</td>\n",
" <td>0.005229</td>\n",
" <td>0.000997</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>NaN</td>\n",
" <td>{'max_features': 2, 'n_estimators': 3}</td>\n",
" <td>-16555.022466</td>\n",
" <td>-13597.055770</td>\n",
" <td>-13638.039236</td>\n",
" <td>-14055.813587</td>\n",
" <td>-17054.110426</td>\n",
" <td>-14980.008297</td>\n",
" <td>1506.661461</td>\n",
" <td>18</td>\n",
" <td>-3758.787383</td>\n",
" <td>-4104.555748</td>\n",
" <td>-4184.716932</td>\n",
" <td>-4050.117879</td>\n",
" <td>-4046.602223</td>\n",
" <td>-4028.956033</td>\n",
" <td>144.032695</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.068738</td>\n",
" <td>0.001743</td>\n",
" <td>0.008437</td>\n",
" <td>0.000537</td>\n",
" <td>2</td>\n",
" <td>10</td>\n",
" <td>NaN</td>\n",
" <td>{'max_features': 2, 'n_estimators': 10}</td>\n",
" <td>-12260.708984</td>\n",
" <td>-9805.314089</td>\n",
" <td>-10820.006739</td>\n",
" <td>-10569.772097</td>\n",
" <td>-14352.957648</td>\n",
" <td>-11561.751911</td>\n",
" <td>1606.154048</td>\n",
" <td>11</td>\n",
" <td>-2159.404627</td>\n",
" <td>-2301.210635</td>\n",
" <td>-2226.197450</td>\n",
" <td>-2135.518113</td>\n",
" <td>-2068.878288</td>\n",
" <td>-2178.241823</td>\n",
" <td>79.450122</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.282132</td>\n",
" <td>0.067924</td>\n",
" <td>0.025084</td>\n",
" <td>0.005109</td>\n",
" <td>2</td>\n",
" <td>30</td>\n",
" <td>NaN</td>\n",
" <td>{'max_features': 2, 'n_estimators': 30}</td>\n",
" <td>-10838.332181</td>\n",
" <td>-9061.133565</td>\n",
" <td>-9598.273290</td>\n",
" <td>-9842.449315</td>\n",
" <td>-13953.543045</td>\n",
" <td>-10658.746279</td>\n",
" <td>1745.350790</td>\n",
" <td>6</td>\n",
" <td>-1742.879858</td>\n",
" <td>-1760.021413</td>\n",
" <td>-1738.622983</td>\n",
" <td>-1670.326953</td>\n",
" <td>-1554.314992</td>\n",
" <td>-1693.233240</td>\n",
" <td>75.906064</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.046774</td>\n",
" <td>0.004537</td>\n",
" <td>0.007873</td>\n",
" <td>0.000975</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>NaN</td>\n",
" <td>{'max_features': 4, 'n_estimators': 3}</td>\n",
" <td>-14263.993918</td>\n",
" <td>-12641.935663</td>\n",
" <td>-13288.890088</td>\n",
" <td>-13525.348068</td>\n",
" <td>-14737.191810</td>\n",
" <td>-13691.471910</td>\n",
" <td>736.546963</td>\n",
" <td>15</td>\n",
" <td>-3554.445408</td>\n",
" <td>-3943.233637</td>\n",
" <td>-3761.036002</td>\n",
" <td>-3538.098711</td>\n",
" <td>-3404.721173</td>\n",
" <td>-3640.306986</td>\n",
" <td>189.557099</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.137139</td>\n",
" <td>0.004381</td>\n",
" <td>0.011503</td>\n",
" <td>0.000295</td>\n",
" <td>4</td>\n",
" <td>10</td>\n",
" <td>NaN</td>\n",
" <td>{'max_features': 4, 'n_estimators': 10}</td>\n",
" <td>-11013.902089</td>\n",
" <td>-9836.500710</td>\n",
" <td>-10254.743855</td>\n",
" <td>-10532.813608</td>\n",
" <td>-12993.464325</td>\n",
" <td>-10926.284918</td>\n",
" <td>1102.209069</td>\n",
" <td>9</td>\n",
" <td>-2085.552922</td>\n",
" <td>-2147.860768</td>\n",
" <td>-2174.949850</td>\n",
" <td>-1951.323770</td>\n",
" <td>-1870.559938</td>\n",
" <td>-2046.049450</td>\n",
" <td>116.885333</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-fc8bc834-f406-4caa-9b99-decf5f43d5af')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-fc8bc834-f406-4caa-9b99-decf5f43d5af button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-fc8bc834-f406-4caa-9b99-decf5f43d5af');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" mean_fit_time std_fit_time mean_score_time std_score_time \\\n",
"0 0.026091 0.002264 0.005229 0.000997 \n",
"1 0.068738 0.001743 0.008437 0.000537 \n",
"2 0.282132 0.067924 0.025084 0.005109 \n",
"3 0.046774 0.004537 0.007873 0.000975 \n",
"4 0.137139 0.004381 0.011503 0.000295 \n",
"\n",
" param_max_features param_n_estimators param_bootstrap \\\n",
"0 2 3 NaN \n",
"1 2 10 NaN \n",
"2 2 30 NaN \n",
"3 4 3 NaN \n",
"4 4 10 NaN \n",
"\n",
" params split0_test_score \\\n",
"0 {'max_features': 2, 'n_estimators': 3} -16555.022466 \n",
"1 {'max_features': 2, 'n_estimators': 10} -12260.708984 \n",
"2 {'max_features': 2, 'n_estimators': 30} -10838.332181 \n",
"3 {'max_features': 4, 'n_estimators': 3} -14263.993918 \n",
"4 {'max_features': 4, 'n_estimators': 10} -11013.902089 \n",
"\n",
" split1_test_score split2_test_score split3_test_score split4_test_score \\\n",
"0 -13597.055770 -13638.039236 -14055.813587 -17054.110426 \n",
"1 -9805.314089 -10820.006739 -10569.772097 -14352.957648 \n",
"2 -9061.133565 -9598.273290 -9842.449315 -13953.543045 \n",
"3 -12641.935663 -13288.890088 -13525.348068 -14737.191810 \n",
"4 -9836.500710 -10254.743855 -10532.813608 -12993.464325 \n",
"\n",
" mean_test_score std_test_score rank_test_score split0_train_score \\\n",
"0 -14980.008297 1506.661461 18 -3758.787383 \n",
"1 -11561.751911 1606.154048 11 -2159.404627 \n",
"2 -10658.746279 1745.350790 6 -1742.879858 \n",
"3 -13691.471910 736.546963 15 -3554.445408 \n",
"4 -10926.284918 1102.209069 9 -2085.552922 \n",
"\n",
" split1_train_score split2_train_score split3_train_score \\\n",
"0 -4104.555748 -4184.716932 -4050.117879 \n",
"1 -2301.210635 -2226.197450 -2135.518113 \n",
"2 -1760.021413 -1738.622983 -1670.326953 \n",
"3 -3943.233637 -3761.036002 -3538.098711 \n",
"4 -2147.860768 -2174.949850 -1951.323770 \n",
"\n",
" split4_train_score mean_train_score std_train_score \n",
"0 -4046.602223 -4028.956033 144.032695 \n",
"1 -2068.878288 -2178.241823 79.450122 \n",
"2 -1554.314992 -1693.233240 75.906064 \n",
"3 -3404.721173 -3640.306986 189.557099 \n",
"4 -1870.559938 -2046.049450 116.885333 "
]
},
"metadata": {},
"execution_count": 111
}
],
"source": [
"# Top five results as presented in a dataframe\n",
"pd.DataFrame(grid_search.cv_results_).head(5)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UVW3yJYXZXFB",
"outputId": "064d0157-04e7-4816-e35d-c2fccdad0158",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),\n",
" n_iter=5,\n",
" param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f0bcda4c190>,\n",
" 'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f0bcda4c790>},\n",
" random_state=42, scoring='neg_mean_squared_error')"
]
},
"metadata": {},
"execution_count": 112
}
],
"source": [
"from sklearn.model_selection import RandomizedSearchCV\n",
"from scipy.stats import randint\n",
"\n",
"param_distribs = {\n",
" 'n_estimators': randint(low=1, high=200),\n",
" 'max_features': randint(low=1, high=8),\n",
" }\n",
"\n",
"forest_reg = RandomForestRegressor(random_state=42)\n",
"rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,\n",
" n_iter=5, cv=5, scoring='neg_mean_squared_error', random_state=42)\n",
"rnd_search.fit( X_pr, Y)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0Elj0p-3ZXFB",
"outputId": "6eac1ab4-470c-457f-ba65-dd0bd647c05d",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"98.00290801665426 {'max_features': 7, 'n_estimators': 180}\n",
"101.90428852795614 {'max_features': 5, 'n_estimators': 15}\n",
"101.22457105793195 {'max_features': 3, 'n_estimators': 72}\n",
"100.66971697857105 {'max_features': 5, 'n_estimators': 21}\n",
"98.2244904709187 {'max_features': 7, 'n_estimators': 122}\n",
"Best grid-search performance: 98.00290801665426\n"
]
}
],
"source": [
"cvres = rnd_search.cv_results_\n",
"for mean_score, params in zip(cvres[\"mean_test_score\"], cvres[\"params\"]):\n",
" print(np.sqrt(-mean_score), params)\n",
"\n",
"print(\"Best grid-search performance: \", np.sqrt(-cvres[\"mean_test_score\"].max()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "hf382EgvZXFB",
"outputId": "026c55b8-7419-4b2f-cdd4-5150f9a0c5ce",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([5.21516849e-02, 4.26770168e-02, 1.83638010e-02, 2.43596502e-02,\n",
" 2.55534252e-02, 5.27505928e-02, 4.16792757e-02, 1.30305162e-01,\n",
" 9.97171303e-02, 9.72528704e-02, 1.21297875e-01, 3.02257465e-02,\n",
" 6.78430943e-03, 2.03658936e-03, 2.09530971e-02, 4.03208286e-02,\n",
" 2.78096690e-02, 2.73549738e-03, 5.15415355e-03, 1.78486022e-03,\n",
" 2.70978771e-03, 2.97859493e-03, 1.40150013e-03, 5.68502898e-03,\n",
" 2.33712174e-03, 1.58189052e-03, 2.13285259e-03, 1.25708518e-02,\n",
" 1.55850071e-04, 6.47611317e-04, 5.90204089e-05, 7.76555510e-04,\n",
" 5.55735018e-04, 2.24163824e-02, 1.23442982e-04, 5.09695472e-04,\n",
" 1.67304588e-04, 2.14379870e-03, 4.11953985e-04, 4.49042422e-02,\n",
" 3.31642784e-02, 2.71052020e-03, 4.70309008e-03, 4.44714650e-03,\n",
" 6.60464381e-03, 1.87864242e-04])"
]
},
"metadata": {},
"execution_count": 114
}
],
"source": [
"feature_importances = grid_search.best_estimator_.feature_importances_\n",
"feature_importances"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jND7SmfKZXFC"
},
"outputs": [],
"source": [
"feats = pd.DataFrame()\n",
"feats[\"Name\"] = list(X_pr.columns)\n",
"feats[\"Score\"] = feature_importances"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LXbIqFlvZXFC",
"outputId": "f7d35e0f-55f0-4da5-c7c1-80cc65335fbb",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 676
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-1ae2b63e-67ef-49db-b9cb-c0e9e7974afc\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>accommodates</td>\n",
" <td>0.13031</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>beds</td>\n",
" <td>0.12130</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>bathrooms</td>\n",
" <td>0.09972</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>bedrooms</td>\n",
" <td>0.09725</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>security_deposit</td>\n",
" <td>0.05275</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>longitude</td>\n",
" <td>0.05215</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39</th>\n",
" <td>Entire home/apt</td>\n",
" <td>0.04490</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>latitude</td>\n",
" <td>0.04268</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>cleaning_fee</td>\n",
" <td>0.04168</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>bathrooms_per_person</td>\n",
" <td>0.04032</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40</th>\n",
" <td>Private room</td>\n",
" <td>0.03316</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>availability_365</td>\n",
" <td>0.03023</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>past_and_future_popularity</td>\n",
" <td>0.02781</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>minimum_nights</td>\n",
" <td>0.02555</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>number_of_reviews</td>\n",
" <td>0.02436</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>House</td>\n",
" <td>0.02242</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>bedrooms_per_person</td>\n",
" <td>0.02095</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>review_scores_rating</td>\n",
" <td>0.01836</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>Apartment</td>\n",
" <td>0.01257</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>host_identity_verified</td>\n",
" <td>0.00678</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" Name Score\n",
"7 accommodates 0.13031\n",
"10 beds 0.12130\n",
"8 bathrooms 0.09972\n",
"9 bedrooms 0.09725\n",
"5 security_deposit 0.05275\n",
"0 longitude 0.05215\n",
"39 Entire home/apt 0.04490\n",
"1 latitude 0.04268\n",
"6 cleaning_fee 0.04168\n",
"15 bathrooms_per_person 0.04032\n",
"40 Private room 0.03316\n",
"11 availability_365 0.03023\n",
"16 past_and_future_popularity 0.02781\n",
"4 minimum_nights 0.02555\n",
"3 number_of_reviews 0.02436\n",
"33 House 0.02242\n",
"14 bedrooms_per_person 0.02095\n",
"2 review_scores_rating 0.01836\n",
"27 Apartment 0.01257\n",
"12 host_identity_verified 0.00678"
]
},
"metadata": {},
"execution_count": 116
}
],
"source": [
"feats.sort_values(\"Score\",ascending=False).round(5).head(20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "WMCMhuiSZXFC",
"outputId": "f415c157-0408-4572-feb6-b382cb989743",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 357
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-f212cb2d-461d-48bd-988c-f0d024eef577\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>price</th>\n",
" <th>city</th>\n",
" <th>longitude</th>\n",
" <th>latitude</th>\n",
" <th>review_scores_rating</th>\n",
" <th>number_of_reviews</th>\n",
" <th>minimum_nights</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>property_type</th>\n",
" <th>room_type</th>\n",
" <th>availability_365</th>\n",
" <th>host_identity_verified</th>\n",
" <th>host_is_superhost</th>\n",
" <th>host_since</th>\n",
" <th>cancellation_policy</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1098</th>\n",
" <td>110.0</td>\n",
" <td>Darlinghurst</td>\n",
" <td>151.219597</td>\n",
" <td>-33.876213</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>250.0</td>\n",
" <td>50.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2014-07-16</td>\n",
" <td>moderate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4811</th>\n",
" <td>225.0</td>\n",
" <td>Manly</td>\n",
" <td>151.288501</td>\n",
" <td>-33.806436</td>\n",
" <td>100.0</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>750.0</td>\n",
" <td>219.0</td>\n",
" <td>4</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2012-09-18</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3678</th>\n",
" <td>91.0</td>\n",
" <td>Bondi Beach</td>\n",
" <td>151.277634</td>\n",
" <td>-33.888887</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>200.0</td>\n",
" <td>25.0</td>\n",
" <td>1</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2013-10-14</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2982</th>\n",
" <td>45.0</td>\n",
" <td>Bondi Beach</td>\n",
" <td>151.281182</td>\n",
" <td>-33.889755</td>\n",
" <td>84.0</td>\n",
" <td>25</td>\n",
" <td>7</td>\n",
" <td>400.0</td>\n",
" <td>50.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Other</td>\n",
" <td>Private room</td>\n",
" <td>361</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2015-05-07</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5140</th>\n",
" <td>115.0</td>\n",
" <td>Manly</td>\n",
" <td>151.282273</td>\n",
" <td>-33.793341</td>\n",
" <td>100.0</td>\n",
" <td>14</td>\n",
" <td>1</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>House</td>\n",
" <td>Private room</td>\n",
" <td>19</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2016-10-22</td>\n",
" <td>moderate</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" price city longitude latitude review_scores_rating \\\n",
"1098 110.0 Darlinghurst 151.219597 -33.876213 NaN \n",
"4811 225.0 Manly 151.288501 -33.806436 100.0 \n",
"3678 91.0 Bondi Beach 151.277634 -33.888887 NaN \n",
"2982 45.0 Bondi Beach 151.281182 -33.889755 84.0 \n",
"5140 115.0 Manly 151.282273 -33.793341 100.0 \n",
"\n",
" number_of_reviews minimum_nights security_deposit cleaning_fee \\\n",
"1098 0 2 250.0 50.0 \n",
"4811 1 3 750.0 219.0 \n",
"3678 0 3 200.0 25.0 \n",
"2982 25 7 400.0 50.0 \n",
"5140 14 1 0.0 0.0 \n",
"\n",
" accommodates bathrooms bedrooms beds property_type room_type \\\n",
"1098 2 1.0 1.0 1.0 Apartment Entire home/apt \n",
"4811 4 1.0 2.0 2.0 Apartment Entire home/apt \n",
"3678 1 1.0 1.0 1.0 Apartment Private room \n",
"2982 2 1.0 1.0 1.0 Other Private room \n",
"5140 2 1.0 1.0 1.0 House Private room \n",
"\n",
" availability_365 host_identity_verified host_is_superhost host_since \\\n",
"1098 0 0 0 2014-07-16 \n",
"4811 0 0 0 2012-09-18 \n",
"3678 0 1 0 2013-10-14 \n",
"2982 361 1 0 2015-05-07 \n",
"5140 19 1 0 2016-10-22 \n",
"\n",
" cancellation_policy \n",
"1098 moderate \n",
"4811 strict_14_with_grace_period \n",
"3678 strict_14_with_grace_period \n",
"2982 strict_14_with_grace_period \n",
"5140 moderate "
]
},
"metadata": {},
"execution_count": 117
}
],
"source": [
"strat_test_set.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RnZVVxx6ZXFC"
},
"outputs": [],
"source": [
"### Now we can test the out-of-sample performance on the held-out test set.\n",
"\n",
"final_model = grid_search.best_estimator_\n",
"\n",
"X_test = strat_test_set.drop(\"price\", axis=1)\n",
"y_test = strat_test_set[\"price\"].copy()\n",
"\n",
"X_test_prepared = pipe(list(X_test.index)).transform(X_test)\n",
"final_predictions = final_model.predict(X_test_prepared)\n",
"\n",
"final_mse = mean_squared_error(y_test, final_predictions)\n",
"final_rmse = np.sqrt(final_mse)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "D-wE7tl0ZXFD",
"outputId": "fbad4694-777e-49d8-b761-5b3259233a1c",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"96.41999111314205"
]
},
"metadata": {},
"execution_count": 119
}
],
"source": [
"final_rmse"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "oBs0b4fpZXFD"
},
"outputs": [],
"source": [
"final_mae = mean_absolute_error(y_test, final_predictions)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Yx0NPbruZXFD",
"outputId": "d996d802-41cf-4f6c-9647-0b9581d6eef0",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"58.29936292642828"
]
},
"metadata": {},
"execution_count": 121
}
],
"source": [
"final_mae ## not too bad"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0imyjZlhZXFE"
},
"outputs": [],
"source": [
"## Value Estimation for Client"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tvUezT8LZXFE"
},
"outputs": [],
"source": [
"df_client = pd.DataFrame.from_dict(dict_client, orient='index').T"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "s2SRq3CbZXFE",
"outputId": "9dca0441-b013-40d5-b6cb-332dc478d707",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 162
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-df048ef0-bf1d-4fa9-805b-ba89e6550ce2\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>city</th>\n",
" <th>longitude</th>\n",
" <th>latitude</th>\n",
" <th>review_scores_rating</th>\n",
" <th>number_of_reviews</th>\n",
" <th>minimum_nights</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>property_type</th>\n",
" <th>room_type</th>\n",
" <th>availability_365</th>\n",
" <th>host_identity_verified</th>\n",
" <th>host_is_superhost</th>\n",
" <th>cancellation_policy</th>\n",
" <th>host_since</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bondi Beach</td>\n",
" <td>151.275</td>\n",
" <td>-33.8891</td>\n",
" <td>95</td>\n",
" <td>53</td>\n",
" <td>4</td>\n",
" <td>10</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" <td>7</td>\n",
" <td>1500</td>\n",
" <td>370</td>\n",
" <td>House</td>\n",
" <td>Entire home/apt</td>\n",
" <td>255</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>2010-01-08</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" city longitude latitude review_scores_rating number_of_reviews \\\n",
"0 Bondi Beach 151.275 -33.8891 95 53 \n",
"\n",
" minimum_nights accommodates bathrooms bedrooms beds security_deposit \\\n",
"0 4 10 3 5 7 1500 \n",
"\n",
" cleaning_fee property_type room_type availability_365 \\\n",
"0 370 House Entire home/apt 255 \n",
"\n",
" host_identity_verified host_is_superhost cancellation_policy \\\n",
"0 1 1 strict_14_with_grace_period \n",
"\n",
" host_since \n",
"0 2010-01-08 "
]
},
"metadata": {},
"execution_count": 123
}
],
"source": [
"df_client"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LJlvhehKZXFE"
},
"outputs": [],
"source": [
"df_client = pipe(list(df_client.index)).transform(df_client)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "M57QpQ6QZXFF"
},
"outputs": [],
"source": [
"client_pred = final_model.predict(df_client)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "EqfZb-JHZXFF",
"outputId": "9478826f-76c7-4139-d943-a5fdc6daa392",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"\u001b[1;31m648.5666666666667\u001b[0m\n",
"\u001b[1;31m-500\u001b[0m\n",
"\u001b[1;31m= 148.56666666666672\u001b[0m\n"
]
}
],
"source": [
"### Client should be charging about $150 more.\n",
"print('\\x1b[1;31m'+str(client_pred[0])+'\\x1b[0m')\n",
"print('\\x1b[1;31m'+str(-500)+'\\x1b[0m')\n",
"print('\\x1b[1;31m'+\"= \"+str(client_pred[0]-500)+'\\x1b[0m')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4Pk0__NRZXFF"
},
"source": [
"#### We can compute a crude 95% confidence interval for the test RMSE:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vY9kd2GXZXFF"
},
"outputs": [],
"source": [
"from scipy import stats"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bSQ2bTtyZXFG",
"outputId": "cec56fd2-f99b-4085-edf8-92e8e36bd3a3",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"22.0"
]
},
"metadata": {},
"execution_count": 128
}
],
"source": [
"y_test.min()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "4HmHsc9ZZXFG",
"outputId": "902f5cea-d795-4964-8679-b1be0ff3588a",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"RMSE Interval: [ 86.99451312 105.0027812 ]\n"
]
}
],
"source": [
"## This calculates a 95% confidence interval for the test RMSE\n",
"\n",
"confidence = 0.95\n",
"squared_errors = (final_predictions - y_test) ** 2\n",
"mean = squared_errors.mean()\n",
"m = len(squared_errors)\n",
"\n",
"## t-interval for the mean squared error; the square root of its bounds gives the RMSE interval\n",
"RMSE_int = np.sqrt(stats.t.interval(confidence, m - 1,\n",
"                                    loc=np.mean(squared_errors),\n",
"                                    scale=stats.sem(squared_errors)))\n",
"\n",
"print(\"RMSE Interval: \", RMSE_int)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HxLsGqt0ZXFG"
},
"source": [
"We could also compute the interval manually like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tIK_sxZKZXFG",
"outputId": "47413b2b-57c7-403b-baab-a040035e06a4",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(86.99451311678477, 105.00278120169172)"
]
},
"metadata": {},
"execution_count": 130
}
],
"source": [
"tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)\n",
"tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)\n",
"np.sqrt(mean - tmargin), np.sqrt(mean + tmargin)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "F1XBucP5ZXFG"
},
"source": [
"Alternatively, we could use z-scores rather than t-scores:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_Eps1ZorZXFG",
"outputId": "43fa3750-36ae-475d-cd1a-b05820688585",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(87.0019317577372, 104.99663443624668)"
]
},
"metadata": {},
"execution_count": 131
}
],
"source": [
"zscore = stats.norm.ppf((1 + confidence) / 2)\n",
"zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)\n",
"np.sqrt(mean - zmargin), np.sqrt(mean + zmargin)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Jh82mE0JZXFH",
"outputId": "1800971c-e198-451f-926f-dbf3828e3837",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"MAE Interval: (54.55796996620988, 62.040755886646565)\n"
]
}
],
"source": [
"## What about the MAE? Same approach on the absolute errors (no square root needed)\n",
"\n",
"absolute_errors = (final_predictions - y_test).abs()\n",
"mean = absolute_errors.mean()\n",
"m = len(absolute_errors)\n",
"\n",
"MAE_int = stats.t.interval(confidence, m - 1,\n",
" loc=np.mean(absolute_errors),\n",
" scale=stats.sem(absolute_errors))\n",
"\n",
"print(\"MAE Interval: \", MAE_int)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xSyJvcfxZXFH"
},
"source": [
"# Extra material"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0me0PggXZXFH"
},
"source": [
"## You can also include the parameter optimisation in a pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "X4P6k25XZXFH",
"outputId": "5e549c56-cfff-4af4-d633-1028353009e3",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 357
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-161c4797-35e5-4ceb-bdec-daea3ebdf05a\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>city</th>\n",
" <th>longitude</th>\n",
" <th>latitude</th>\n",
" <th>review_scores_rating</th>\n",
" <th>number_of_reviews</th>\n",
" <th>minimum_nights</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>property_type</th>\n",
" <th>room_type</th>\n",
" <th>availability_365</th>\n",
" <th>host_identity_verified</th>\n",
" <th>host_is_superhost</th>\n",
" <th>host_since</th>\n",
" <th>cancellation_policy</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5484</th>\n",
" <td>Newtown</td>\n",
" <td>151.178552</td>\n",
" <td>-33.907150</td>\n",
" <td>96.0</td>\n",
" <td>61</td>\n",
" <td>2</td>\n",
" <td>250.0</td>\n",
" <td>85.0</td>\n",
" <td>4</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>House</td>\n",
" <td>Entire home/apt</td>\n",
" <td>127</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2016-01-22</td>\n",
" <td>moderate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1267</th>\n",
" <td>Randwick</td>\n",
" <td>151.249030</td>\n",
" <td>-33.906190</td>\n",
" <td>97.0</td>\n",
" <td>6</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>20.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2014-03-28</td>\n",
" <td>moderate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6658</th>\n",
" <td>Manly</td>\n",
" <td>151.288491</td>\n",
" <td>-33.802074</td>\n",
" <td>100.0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>40.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2014-01-09</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2522</th>\n",
" <td>Randwick</td>\n",
" <td>151.236423</td>\n",
" <td>-33.913614</td>\n",
" <td>94.0</td>\n",
" <td>20</td>\n",
" <td>3</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>90</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2015-11-22</td>\n",
" <td>flexible</td>\n",
" </tr>\n",
" <tr>\n",
" <th>722</th>\n",
" <td>Coogee</td>\n",
" <td>151.259342</td>\n",
" <td>-33.918435</td>\n",
" <td>92.0</td>\n",
" <td>139</td>\n",
" <td>30</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2014-01-07</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
"          city   longitude    latitude  review_scores_rating  \\\n",
"5484   Newtown  151.178552  -33.907150                  96.0   \n",
"1267  Randwick  151.249030  -33.906190                  97.0   \n",
"6658     Manly  151.288491  -33.802074                 100.0   \n",
"2522  Randwick  151.236423  -33.913614                  94.0   \n",
"722     Coogee  151.259342  -33.918435                  92.0   \n",
"\n",
"      number_of_reviews  minimum_nights  security_deposit  cleaning_fee  \\\n",
"5484                 61               2             250.0          85.0   \n",
"1267                  6               4               0.0          20.0   \n",
"6658                  2               2               0.0          40.0   \n",
"2522                 20               3               0.0           0.0   \n",
"722                 139              30               0.0           0.0   \n",
"\n",
"      accommodates  bathrooms  bedrooms  beds  property_type        room_type  \\\n",
"5484             4        1.0       2.0   2.0          House  Entire home/apt   \n",
"1267             2        1.0       1.0   1.0      Apartment     Private room   \n",
"6658             2        1.0       1.0   1.0      Apartment  Entire home/apt   \n",
"2522             2        1.0       1.0   1.0      Apartment     Private room   \n",
"722              3        1.0       1.0   2.0      Apartment     Private room   \n",
"\n",
"      availability_365  host_identity_verified  host_is_superhost  host_since  \\\n",
"5484               127                       1                  0  2016-01-22   \n",
"1267                 0                       1                  0  2014-03-28   \n",
"6658                 0                       1                  0  2014-01-09   \n",
"2522                90                       0                  0  2015-11-22   \n",
"722                  0                       1                  0  2014-01-07   \n",
"\n",
"              cancellation_policy  \n",
"5484                     moderate  \n",
"1267                     moderate  \n",
"6658  strict_14_with_grace_period  \n",
"2522                     flexible  \n",
"722   strict_14_with_grace_period  "
]
},
"metadata": {},
"execution_count": 133
}
],
"source": [
"X.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ApZFzmykZXFH",
"outputId": "71240bbe-ee31-4e5f-a275-58b6405fa6d7",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"5484    200.0\n",
"1267    183.0\n",
"6658    175.0\n",
"2522     85.0\n",
"722      80.0\n",
"Name: price, dtype: float64"
]
},
"metadata": {},
"execution_count": 134
}
],
"source": [
"Y.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MilKfoaeZXFI"
},
"outputs": [],
"source": [
"from scipy.stats import randint\n",
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.model_selection import RandomizedSearchCV\n",
"from sklearn.pipeline import Pipeline\n",
"\n",
"class Optimise(BaseEstimator, TransformerMixin):\n",
"    # Final pipeline step: runs a randomised hyperparameter search and\n",
"    # returns the best random forest instead of a transformed matrix.\n",
"    def __init__(self, Y=None):  # no *args or **kwargs; None avoids a mutable default\n",
"        self.Y = Y\n",
"    def fit(self, X_df, y=None):\n",
"        return self  # nothing else to do\n",
"    def transform(self, X_df, y=None):\n",
"        # scipy's randint samples integers from [low, high)\n",
"        param_distribs = {\n",
"            'n_estimators': randint(low=1, high=200),\n",
"            'max_features': randint(low=1, high=8),\n",
"        }\n",
"\n",
"        forest_reg = RandomForestRegressor(random_state=42)\n",
"        rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,\n",
"                                        n_iter=5, cv=5, scoring='neg_mean_squared_error',\n",
"                                        random_state=42)\n",
"\n",
"        rnd_search.fit(X_df, self.Y)\n",
"\n",
"        return rnd_search.best_estimator_\n",
"\n",
"def pipe_full(inds, Y):\n",
"    return Pipeline([\n",
"        (\"first\", pipe(inds)),\n",
"        (\"opt\", Optimise(Y)),\n",
"    ])\n",
"\n",
"params = {\"inds\": list(X.index), \"Y\": Y}\n",
"\n",
"# A single call now runs all the preprocessing and the hyperparameter search,\n",
"# instead of doing it bit by bit. The pipeline becomes extremely handy in the\n",
"# cross-validation step.\n",
"modell = pipe_full(**params).fit_transform(X)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "o8QfCuirZXFI"
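},
"outputs": [],
"source": [
"# Apply the same preprocessing to the test set, then predict prices.\n",
"# Despite its name, X_pred holds the predicted prices (the target), not features.\n",
"X_test_prepared = pipe(list(X_test.index)).transform(X_test)\n",
"X_pred = modell.predict(X_test_prepared)"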
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "K_YsHpVGZXFJ",
"outputId": "17d2a74d-f878-41a3-988c-804680e1fcf6",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([142.5       , 268.67222222,  78.78888889,  67.51666667,\n",
"        93.05      , 119.42222222,  65.49444444, 189.74444444,\n",
"       113.12222222, 287.10555556])"
]
},
"metadata": {},
"execution_count": 137
}
],
"source": [
"X_pred[:10]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
},
"nav_menu": {
"height": "279px",
"width": "309px"
},
"toc": {
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"toc_cell": false,
"toc_position": {},
"toc_section_display": "block",
"toc_window_display": false
},
"colab": {
"name": "AirBnB Valuation.ipynb",
"provenance": [],
"include_colab_link": true
}
},
"nbformat": 4,
"nbformat_minor": 0
}