@thedatajango
Last active May 7, 2022 05:07
Step-by-Step Data Science project execution
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Problem Statement: Predict Melbourne house prices using historical data."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.linear_model import SGDRegressor"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Load the housing price data from a CSV file"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv('./data/Melbourne_housing_FULL.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Suburb: Suburb\n",
"\n",
"Address: Address\n",
"\n",
"Rooms: Number of rooms\n",
"\n",
"Price: Price in Australian dollars\n",
"\n",
"Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed; N/A - price or highest bid not available.\n",
"\n",
"Type: br - bedroom(s); h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.\n",
"\n",
"SellerG: Real estate agent\n",
"\n",
"Date: Date sold\n",
"\n",
"Distance: Distance from the CBD in kilometres\n",
"\n",
"Regionname: General region (West, North West, North, North East, etc.)\n",
"\n",
"Propertycount: Number of properties that exist in the suburb\n",
"\n",
"Bedroom2: Scraped number of bedrooms (from a different source)\n",
"\n",
"Bathroom: Number of bathrooms\n",
"\n",
"Car: Number of car spots\n",
"\n",
"Landsize: Land size in metres\n",
"\n",
"BuildingArea: Building size in metres\n",
"\n",
"YearBuilt: Year the house was built\n",
"\n",
"CouncilArea: Governing council for the area\n",
"\n",
"Lattitude: Latitude of the property (the column name is misspelled in the dataset)\n",
"\n",
"Longtitude: Longitude of the property (the column name is misspelled in the dataset)\n",
"\n",
"Postcode: Postal code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's have a look at the shape of the DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(34857, 21)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's look at the column names in the dataset"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',\n",
" 'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',\n",
" 'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',\n",
" 'Longtitude', 'Regionname', 'Propertycount'],\n",
" dtype='object')"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Check whether all records have a \"Price\". Keep only the records with a non-null \"Price\"; these form the labelled data we split into training and test sets."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"27247"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.Price.count() # Some historic records have missing values in target variable."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"data_no_nulls_in_target = data[data.Price.notnull()].copy()  # Keep only records with a non-null \"Price\"; copy() avoids SettingWithCopyWarning on later in-place changes"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(27247, 21)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_no_nulls_in_target.shape"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"data_no_nulls_in_target.reset_index(drop=True, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"data_no_nulls_in_target.to_csv(index=False, path_or_buf='./data/Melbourne_housing.csv')"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import ShuffleSplit\n",
"\n",
"# One random 80/20 split; random_state makes it reproducible\n",
"shuffleSplit = ShuffleSplit(n_splits=1, test_size=0.2, random_state=42)\n",
"\n",
"# split() yields positional indices, so select rows with iloc\n",
"for train_index, test_index in shuffleSplit.split(data_no_nulls_in_target):\n",
"    training_set = data_no_nulls_in_target.iloc[train_index]\n",
"    test_set = data_no_nulls_in_target.iloc[test_index]"
]
},
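{
"cell_type": "markdown",
"metadata": {},
"source": [
"A single shuffled 80/20 split can also be written with scikit-learn's `train_test_split`. The cell below is only a minimal sketch on a small made-up frame (`toy`, `toy_train`, and `toy_test` are hypothetical names, not part of the Melbourne data), meant to show the shape of the result rather than to replace the split used in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hypothetical toy data standing in for data_no_nulls_in_target\n",
"toy = pd.DataFrame({'Rooms': range(10), 'Price': range(100, 110)})\n",
"\n",
"# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle\n",
"toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)\n",
"print(toy_train.shape, toy_test.shape)"
]
},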
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(21797, 21)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Just to get an idea, let's have a look at the first 5 records"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Suburb</th>\n",
" <th>Address</th>\n",
" <th>Rooms</th>\n",
" <th>Type</th>\n",
" <th>Price</th>\n",
" <th>Method</th>\n",
" <th>SellerG</th>\n",
" <th>Date</th>\n",
" <th>Distance</th>\n",
" <th>Postcode</th>\n",
" <th>...</th>\n",
" <th>Bathroom</th>\n",
" <th>Car</th>\n",
" <th>Landsize</th>\n",
" <th>BuildingArea</th>\n",
" <th>YearBuilt</th>\n",
" <th>CouncilArea</th>\n",
" <th>Lattitude</th>\n",
" <th>Longtitude</th>\n",
" <th>Regionname</th>\n",
" <th>Propertycount</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>17957</th>\n",
" <td>Brunswick East</td>\n",
" <td>2/20 Lyndhurst Cr</td>\n",
" <td>2</td>\n",
" <td>u</td>\n",
" <td>580000.0</td>\n",
" <td>VB</td>\n",
" <td>Nelson</td>\n",
" <td>26/08/2017</td>\n",
" <td>4.0</td>\n",
" <td>3057.0</td>\n",
" <td>...</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>61.0</td>\n",
" <td>1970.0</td>\n",
" <td>Moreland City Council</td>\n",
" <td>-37.76134</td>\n",
" <td>144.97796</td>\n",
" <td>Northern Metropolitan</td>\n",
" <td>5533.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6753</th>\n",
" <td>Richmond</td>\n",
" <td>21 Lambert St</td>\n",
" <td>3</td>\n",
" <td>t</td>\n",
" <td>1025000.0</td>\n",
" <td>SP</td>\n",
" <td>Jellis</td>\n",
" <td>3/12/2016</td>\n",
" <td>2.6</td>\n",
" <td>3121.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Yarra City Council</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Northern Metropolitan</td>\n",
" <td>14949.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1833</th>\n",
" <td>Brunswick West</td>\n",
" <td>11/7 Egginton St</td>\n",
" <td>2</td>\n",
" <td>t</td>\n",
" <td>380000.0</td>\n",
" <td>VB</td>\n",
" <td>Nelson</td>\n",
" <td>7/11/2016</td>\n",
" <td>5.9</td>\n",
" <td>3055.0</td>\n",
" <td>...</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>85.0</td>\n",
" <td>1970.0</td>\n",
" <td>Moreland City Council</td>\n",
" <td>-37.76070</td>\n",
" <td>144.93930</td>\n",
" <td>Northern Metropolitan</td>\n",
" <td>7082.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19795</th>\n",
" <td>Pascoe Vale</td>\n",
" <td>3/246 Cumberland Rd</td>\n",
" <td>2</td>\n",
" <td>u</td>\n",
" <td>456000.0</td>\n",
" <td>S</td>\n",
" <td>Brad</td>\n",
" <td>21/10/2017</td>\n",
" <td>8.5</td>\n",
" <td>3044.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Moreland City Council</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Northern Metropolitan</td>\n",
" <td>7485.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17054</th>\n",
" <td>Glen Waverley</td>\n",
" <td>31 Paxton Dr</td>\n",
" <td>3</td>\n",
" <td>h</td>\n",
" <td>1300000.0</td>\n",
" <td>S</td>\n",
" <td>hockingstuart</td>\n",
" <td>19/08/2017</td>\n",
" <td>16.7</td>\n",
" <td>3150.0</td>\n",
" <td>...</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>733.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Monash City Council</td>\n",
" <td>-37.89014</td>\n",
" <td>145.18269</td>\n",
" <td>Eastern Metropolitan</td>\n",
" <td>15321.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 21 columns</p>\n",
"</div>"
],
"text/plain": [
" Suburb Address Rooms Type Price Method \\\n",
"17957 Brunswick East 2/20 Lyndhurst Cr 2 u 580000.0 VB \n",
"6753 Richmond 21 Lambert St 3 t 1025000.0 SP \n",
"1833 Brunswick West 11/7 Egginton St 2 t 380000.0 VB \n",
"19795 Pascoe Vale 3/246 Cumberland Rd 2 u 456000.0 S \n",
"17054 Glen Waverley 31 Paxton Dr 3 h 1300000.0 S \n",
"\n",
" SellerG Date Distance Postcode ... Bathroom Car \\\n",
"17957 Nelson 26/08/2017 4.0 3057.0 ... 1.0 1.0 \n",
"6753 Jellis 3/12/2016 2.6 3121.0 ... NaN NaN \n",
"1833 Nelson 7/11/2016 5.9 3055.0 ... 1.0 1.0 \n",
"19795 Brad 21/10/2017 8.5 3044.0 ... NaN NaN \n",
"17054 hockingstuart 19/08/2017 16.7 3150.0 ... 2.0 3.0 \n",
"\n",
" Landsize BuildingArea YearBuilt CouncilArea Lattitude \\\n",
"17957 NaN 61.0 1970.0 Moreland City Council -37.76134 \n",
"6753 NaN NaN NaN Yarra City Council NaN \n",
"1833 0.0 85.0 1970.0 Moreland City Council -37.76070 \n",
"19795 NaN NaN NaN Moreland City Council NaN \n",
"17054 733.0 NaN NaN Monash City Council -37.89014 \n",
"\n",
" Longtitude Regionname Propertycount \n",
"17957 144.97796 Northern Metropolitan 5533.0 \n",
"6753 NaN Northern Metropolitan 14949.0 \n",
"1833 144.93930 Northern Metropolitan 7082.0 \n",
"19795 NaN Northern Metropolitan 7485.0 \n",
"17054 145.18269 Eastern Metropolitan 15321.0 \n",
"\n",
"[5 rows x 21 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exploratory Data Analysis\n",
"### As part of data analysis we refine the data. Below are some common activities.\n",
"\n",
"- **Missing:** Check for missing or incomplete data; impute (`fillna`) with appropriate values\n",
"- **Quality:** Check for duplicates, accuracy, and unusual data\n",
"- **Parse:** Parse existing data and create new features, e.g. extract year and month from a date\n",
"- **Convert:** Turn free text into coded values (LabelEncoder, one-hot encoding, or LabelBinarizer)\n",
"- **Derive:** Derive a new feature from existing features, e.g. gender from the titles Mr. and Mrs.\n",
"- **Calculate:** Percentages, proportions\n",
"- **Remove:** Remove redundant or not-so-useful data\n",
"- **Merge:** Merge multiple columns, e.g. first name and surname into a full name\n",
"- **Aggregate:** e.g. roll up by year, cluster by area\n",
"- **Filter:** e.g. exclude based on location\n",
"- **Sample:** e.g. extract a representative subset\n",
"- **Summary:** pandas `describe` function or statistics such as the mean"
]
},
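{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make a few of the activities listed above concrete, here is a minimal sketch covering Parse, Missing, and Convert. It runs on three made-up rows that borrow column names from the dataset; the frame `toy` and the new columns `SaleYear` and `SaleMonth` are hypothetical, and median imputation is just one possible choice, not necessarily the one this project settles on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy records using a few of the dataset's column names\n",
"toy = pd.DataFrame({\n",
"    'Date': ['26/08/2017', '3/12/2016', '7/11/2016'],\n",
"    'Type': ['u', 't', 't'],\n",
"    'BuildingArea': [61.0, None, 85.0],\n",
"})\n",
"\n",
"# Parse: the dates are day-first strings, so extract year and month from them\n",
"parsed = pd.to_datetime(toy['Date'], format='%d/%m/%Y')\n",
"toy['SaleYear'] = parsed.dt.year\n",
"toy['SaleMonth'] = parsed.dt.month\n",
"\n",
"# Missing: fill the numeric gap with the column median (one of several options)\n",
"toy['BuildingArea'] = toy['BuildingArea'].fillna(toy['BuildingArea'].median())\n",
"\n",
"# Convert: one-hot encode the categorical property type\n",
"toy = pd.get_dummies(toy, columns=['Type'])\n",
"toy"
]
},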
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Check whether any columns contain nulls"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Suburb False\n",
"Address False\n",
"Rooms False\n",
"Type False\n",
"Price False\n",
"Method False\n",
"SellerG False\n",
"Date False\n",
"Distance True\n",
"Postcode True\n",
"Bedroom2 True\n",
"Bathroom True\n",
"Car True\n",
"Landsize True\n",
"BuildingArea True\n",
"YearBuilt True\n",
"CouncilArea True\n",
"Lattitude True\n",
"Longtitude True\n",
"Regionname True\n",
"Propertycount True\n",
"dtype: bool"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.isnull().any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Look at distinct data types in our data"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"float64 12\n",
"object 8\n",
"int64 1\n",
"dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.dtypes.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Check whether numerical columns contain nulls.\n",
" * As the output below shows, most of the numerical columns contain nulls.\n",
" * \"Postcode\", \"Lattitude\", and \"Longtitude\" are related features: all three are tied to \"Address\". It may be possible to use \"Postcode\" alone to represent all three."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Rooms False\n",
"Price False\n",
"Distance True\n",
"Postcode True\n",
"Bedroom2 True\n",
"Bathroom True\n",
"Car True\n",
"Landsize True\n",
"BuildingArea True\n",
"YearBuilt True\n",
"Lattitude True\n",
"Longtitude True\n",
"Propertycount True\n",
"dtype: bool"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.select_dtypes(['float64','int64']).isnull().any()"
]
},
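{
"cell_type": "markdown",
"metadata": {},
"source": [
"`isnull().any()` only says whether a column has at least one null. When deciding between imputing and dropping a column, the fraction of missing values is usually more informative. The cell below is a minimal sketch of that idea on a made-up frame (`toy` and `null_fraction` are hypothetical names); the same expression could be applied to `training_set`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with gaps in two numeric columns\n",
"toy = pd.DataFrame({\n",
"    'Rooms': [2, 3, 2, 3],\n",
"    'Landsize': [np.nan, np.nan, 0.0, 733.0],\n",
"    'YearBuilt': [1970.0, np.nan, 1970.0, np.nan],\n",
"})\n",
"\n",
"# isnull() gives booleans; their mean is the fraction of missing values per column\n",
"null_fraction = toy.select_dtypes('number').isnull().mean().sort_values(ascending=False)\n",
"print(null_fraction)"
]
},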
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Check whether categorical/text columns contain nulls.\n",
" * \"CouncilArea\" and \"Regionname\" contain nulls. These two features are tied to \"Address\". It may be possible to use \"Postcode\" to represent both of them as well."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Suburb False\n",
"Address False\n",
"Type False\n",
"Method False\n",
"SellerG False\n",
"Date False\n",
"CouncilArea True\n",
"Regionname True\n",
"dtype: bool"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.select_dtypes(['object']).isnull().any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Big Decision (importance of Statistical Data Analysis - Bivariate data analysis):\n",
" #### Should we remove the \"CouncilArea\", \"Regionname\", \"Lattitude\", and \"Longtitude\" features and keep \"Postcode\" or \"Suburb\", since either may represent the address? In other words, we need to check whether \"Postcode\" or \"Suburb\" is an important feature for predicting \"Price\".\n",
" * Before making a decision, we have to fix some problems with the \"Postcode\" column by doing some statistical data analysis.\n",
"\n",
"#### Step 1: Check whether \"Postcode\" is the same for all addresses in the same \"Suburb\" (a column with no nulls). If so, we can write some logic to populate \"Postcode\" based on \"Suburb\".\n",
"#### Step 2: Are \"Postcode\" and \"Suburb\" important features for predicting \"Price\"?"
]
},
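{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two steps above can be sketched on a handful of made-up rows. This is an illustration under assumptions, not the analysis itself: the frame `toy`, its prices, and the helper names `postcodes_per_suburb` and `suburb_to_postcode` are hypothetical, and comparing median prices per postcode is only a rough first look at feature importance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy rows; one Footscray record is missing its postcode\n",
"toy = pd.DataFrame({\n",
"    'Suburb': ['Footscray', 'Footscray', 'Seddon', 'Yarraville', 'Yarraville'],\n",
"    'Postcode': [3011.0, np.nan, 3011.0, 3013.0, 3013.0],\n",
"    'Price': [800000.0, 750000.0, 900000.0, 1100000.0, 1000000.0],\n",
"})\n",
"\n",
"# Step 1: does every suburb map to at most one postcode? (nunique ignores NaN)\n",
"postcodes_per_suburb = toy.groupby('Suburb')['Postcode'].nunique()\n",
"print((postcodes_per_suburb <= 1).all())\n",
"\n",
"# If so, fill a missing postcode from other rows of the same suburb\n",
"suburb_to_postcode = toy.dropna(subset=['Postcode']).groupby('Suburb')['Postcode'].first()\n",
"toy['Postcode'] = toy['Postcode'].fillna(toy['Suburb'].map(suburb_to_postcode))\n",
"\n",
"# Step 2 (rough first look): how much does the median price vary by postcode?\n",
"print(toy.groupby('Postcode')['Price'].median())"
]
},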
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"df_null_postcodes = training_set[training_set['Postcode'].isnull()]"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Suburb</th>\n",
" <th>Address</th>\n",
" <th>Rooms</th>\n",
" <th>Type</th>\n",
" <th>Price</th>\n",
" <th>Method</th>\n",
" <th>SellerG</th>\n",
" <th>Date</th>\n",
" <th>Distance</th>\n",
" <th>Postcode</th>\n",
" <th>...</th>\n",
" <th>Bathroom</th>\n",
" <th>Car</th>\n",
" <th>Landsize</th>\n",
" <th>BuildingArea</th>\n",
" <th>YearBuilt</th>\n",
" <th>CouncilArea</th>\n",
" <th>Lattitude</th>\n",
" <th>Longtitude</th>\n",
" <th>Regionname</th>\n",
" <th>Propertycount</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>23051</th>\n",
" <td>Fawkner Lot</td>\n",
" <td>1/3 Brian St</td>\n",
" <td>3</td>\n",
" <td>h</td>\n",
" <td>616000.0</td>\n",
" <td>SP</td>\n",
" <td>Brad</td>\n",
" <td>6/01/2018</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1 rows × 21 columns</p>\n",
"</div>"
],
"text/plain": [
" Suburb Address Rooms Type Price Method SellerG \\\n",
"23051 Fawkner Lot 1/3 Brian St 3 h 616000.0 SP Brad \n",
"\n",
" Date Distance Postcode ... Bathroom Car Landsize \\\n",
"23051 6/01/2018 NaN NaN ... NaN NaN NaN \n",
"\n",
" BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname \\\n",
"23051 NaN NaN NaN NaN NaN NaN \n",
"\n",
" Propertycount \n",
"23051 NaN \n",
"\n",
"[1 rows x 21 columns]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_null_postcodes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Check whether any row with a null \"Postcode\" has a non-null \"Suburb\""
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Suburb</th>\n",
" <th>Address</th>\n",
" <th>Rooms</th>\n",
" <th>Type</th>\n",
" <th>Price</th>\n",
" <th>Method</th>\n",
" <th>SellerG</th>\n",
" <th>Date</th>\n",
" <th>Distance</th>\n",
" <th>Postcode</th>\n",
" <th>...</th>\n",
" <th>Bathroom</th>\n",
" <th>Car</th>\n",
" <th>Landsize</th>\n",
" <th>BuildingArea</th>\n",
" <th>YearBuilt</th>\n",
" <th>CouncilArea</th>\n",
" <th>Lattitude</th>\n",
" <th>Longtitude</th>\n",
" <th>Regionname</th>\n",
" <th>Propertycount</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>23051</th>\n",
" <td>Fawkner Lot</td>\n",
" <td>1/3 Brian St</td>\n",
" <td>3</td>\n",
" <td>h</td>\n",
" <td>616000.0</td>\n",
" <td>SP</td>\n",
" <td>Brad</td>\n",
" <td>6/01/2018</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1 rows × 21 columns</p>\n",
"</div>"
],
"text/plain": [
" Suburb Address Rooms Type Price Method SellerG \\\n",
"23051 Fawkner Lot 1/3 Brian St 3 h 616000.0 SP Brad \n",
"\n",
" Date Distance Postcode ... Bathroom Car Landsize \\\n",
"23051 6/01/2018 NaN NaN ... NaN NaN NaN \n",
"\n",
" BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname \\\n",
"23051 NaN NaN NaN NaN NaN NaN \n",
"\n",
" Propertycount \n",
"23051 NaN \n",
"\n",
"[1 rows x 21 columns]"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_null_postcodes[~ df_null_postcodes['Suburb'].isnull()]"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"training_set = training_set[~ training_set['Postcode'].isnull()]"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(21796, 21)"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Check whether distinct \"Suburb\" values have distinct \"Postcode\" values"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(337,)"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set['Suburb'].value_counts().shape"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(205,)"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set['Postcode'].value_counts().shape"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"ser_postcode_suburb = training_set.groupby(['Postcode', 'Suburb']).size()"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(ser_postcode_suburb)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Postcode Suburb \n",
"3000.0 Melbourne 117\n",
"3002.0 East Melbourne 30\n",
"3003.0 West Melbourne 37\n",
"3006.0 Southbank 46\n",
"3008.0 Docklands 6\n",
"3011.0 Footscray 168\n",
" Seddon 76\n",
"3012.0 Brooklyn 31\n",
" Kingsville 50\n",
" Maidstone 118\n",
" West Footscray 128\n",
"3013.0 Yarraville 194\n",
"3015.0 Newport 168\n",
" South Kingsville 24\n",
" Spotswood 38\n",
"3016.0 Williamstown 140\n",
" Williamstown North 19\n",
"3018.0 Altona 87\n",
" Seaholme 17\n",
"3019.0 Braybrook 64\n",
"dtype: int64"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ser_postcode_suburb.head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Calling reset_index on a pandas Series returns a DataFrame when the drop flag is \"False\" (the default)\n",
"https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reset_index.html"
]
},
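{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of that behaviour, using made-up rows (`toy`, `counts`, `as_frame`, and `dropped` are hypothetical names). With the default `drop=False` the index levels become columns and the result is a DataFrame; with `drop=True` the index is discarded and the result stays a Series. The `name` argument labels the value column directly, which avoids a separate `rename` step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy rows\n",
"toy = pd.DataFrame({\n",
"    'Postcode': [3011.0, 3011.0, 3011.0, 3013.0],\n",
"    'Suburb': ['Footscray', 'Footscray', 'Seddon', 'Yarraville'],\n",
"})\n",
"\n",
"counts = toy.groupby(['Postcode', 'Suburb']).size()  # Series with a two-level index\n",
"as_frame = counts.reset_index(name='count')           # DataFrame: Postcode, Suburb, count\n",
"dropped = counts.reset_index(drop=True)               # still a Series; index levels discarded\n",
"print(type(counts).__name__, type(as_frame).__name__, type(dropped).__name__)"
]
},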
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"df_postcode_suburb = training_set.groupby(['Postcode', 'Suburb']).size().reset_index(name='count')"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Postcode</th>\n",
" <th>Suburb</th>\n",
" <th>count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3000.0</td>\n",
" <td>Melbourne</td>\n",
" <td>117</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3002.0</td>\n",
" <td>East Melbourne</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3003.0</td>\n",
" <td>West Melbourne</td>\n",
" <td>37</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3006.0</td>\n",
" <td>Southbank</td>\n",
" <td>46</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3008.0</td>\n",
" <td>Docklands</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>3011.0</td>\n",
" <td>Footscray</td>\n",
" <td>168</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>3011.0</td>\n",
" <td>Seddon</td>\n",
" <td>76</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>3012.0</td>\n",
" <td>Brooklyn</td>\n",
" <td>31</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>3012.0</td>\n",
" <td>Kingsville</td>\n",
" <td>50</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>3012.0</td>\n",
" <td>Maidstone</td>\n",
" <td>118</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>3012.0</td>\n",
" <td>West Footscray</td>\n",
" <td>128</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>3013.0</td>\n",
" <td>Yarraville</td>\n",
" <td>194</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>3015.0</td>\n",
" <td>Newport</td>\n",
" <td>168</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>3015.0</td>\n",
" <td>South Kingsville</td>\n",
" <td>24</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>3015.0</td>\n",
" <td>Spotswood</td>\n",
" <td>38</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>3016.0</td>\n",
" <td>Williamstown</td>\n",
" <td>140</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>3016.0</td>\n",
" <td>Williamstown North</td>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>3018.0</td>\n",
" <td>Altona</td>\n",
" <td>87</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>3018.0</td>\n",
" <td>Seaholme</td>\n",
" <td>17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>3019.0</td>\n",
" <td>Braybrook</td>\n",
" <td>64</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Postcode Suburb count\n",
"0 3000.0 Melbourne 117\n",
"1 3002.0 East Melbourne 30\n",
"2 3003.0 West Melbourne 37\n",
"3 3006.0 Southbank 46\n",
"4 3008.0 Docklands 6\n",
"5 3011.0 Footscray 168\n",
"6 3011.0 Seddon 76\n",
"7 3012.0 Brooklyn 31\n",
"8 3012.0 Kingsville 50\n",
"9 3012.0 Maidstone 118\n",
"10 3012.0 West Footscray 128\n",
"11 3013.0 Yarraville 194\n",
"12 3015.0 Newport 168\n",
"13 3015.0 South Kingsville 24\n",
"14 3015.0 Spotswood 38\n",
"15 3016.0 Williamstown 140\n",
"16 3016.0 Williamstown North 19\n",
"17 3018.0 Altona 87\n",
"18 3018.0 Seaholme 17\n",
"19 3019.0 Braybrook 64"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_postcode_suburb.head(20)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Postcode = 3000.0\n",
"*************************\n",
" Suburb count\n",
"0 Melbourne 117\n",
"--------------------------------------------------\n",
"Postcode = 3002.0\n",
"*************************\n",
" Suburb count\n",
"1 East Melbourne 30\n",
"--------------------------------------------------\n",
"Postcode = 3003.0\n",
"*************************\n",
" Suburb count\n",
"2 West Melbourne 37\n",
"--------------------------------------------------\n",
"Postcode = 3006.0\n",
"*************************\n",
" Suburb count\n",
"3 Southbank 46\n",
"--------------------------------------------------\n",
"Postcode = 3008.0\n",
"*************************\n",
" Suburb count\n",
"4 Docklands 6\n",
"--------------------------------------------------\n",
"Postcode = 3011.0\n",
"*************************\n",
" Suburb count\n",
"5 Footscray 168\n",
"6 Seddon 76\n",
"--------------------------------------------------\n",
"Postcode = 3012.0\n",
"*************************\n",
" Suburb count\n",
"7 Brooklyn 31\n",
"8 Kingsville 50\n",
"9 Maidstone 118\n",
"10 West Footscray 128\n",
"--------------------------------------------------\n",
"Postcode = 3013.0\n",
"*************************\n",
" Suburb count\n",
"11 Yarraville 194\n",
"--------------------------------------------------\n",
"Postcode = 3015.0\n",
"*************************\n",
" Suburb count\n",
"12 Newport 168\n",
"13 South Kingsville 24\n",
"14 Spotswood 38\n",
"--------------------------------------------------\n",
"Postcode = 3016.0\n",
"*************************\n",
" Suburb count\n",
"15 Williamstown 140\n",
"16 Williamstown North 19\n",
"--------------------------------------------------\n",
"Postcode = 3018.0\n",
"*************************\n",
" Suburb count\n",
"17 Altona 87\n",
"18 Seaholme 17\n",
"--------------------------------------------------\n",
"Postcode = 3019.0\n",
"*************************\n",
" Suburb count\n",
"19 Braybrook 64\n",
"--------------------------------------------------\n",
"Postcode = 3020.0\n",
"*************************\n",
" Suburb count\n",
"20 Albion 53\n",
"21 Sunshine 137\n",
"22 Sunshine North 97\n",
"23 Sunshine West 131\n",
"--------------------------------------------------\n",
"Postcode = 3021.0\n",
"*************************\n",
" Suburb count\n",
"24 Albanvale 7\n",
"25 Kealba 32\n",
"26 Kings Park 10\n",
"27 St Albans 80\n",
"--------------------------------------------------\n",
"Postcode = 3022.0\n",
"*************************\n",
" Suburb count\n",
"28 Ardeer 17\n",
"--------------------------------------------------\n",
"Postcode = 3023.0\n",
"*************************\n",
" Suburb count\n",
"29 Burnside 4\n",
"30 Burnside Heights 9\n",
"31 Cairnlea 15\n",
"32 Caroline Springs 20\n",
"33 Deer Park 31\n",
"--------------------------------------------------\n",
"Postcode = 3024.0\n",
"*************************\n",
" Suburb count\n",
"34 Wyndham Vale 42\n",
"--------------------------------------------------\n",
"Postcode = 3025.0\n",
"*************************\n",
" Suburb count\n",
"35 Altona North 118\n",
"--------------------------------------------------\n",
"Postcode = 3027.0\n",
"*************************\n",
" Suburb count\n",
"36 Williams Landing 5\n",
"--------------------------------------------------\n",
"Postcode = 3028.0\n",
"*************************\n",
" Suburb count\n",
"37 Altona Meadows 21\n",
"38 Laverton 6\n",
"39 Seabrook 3\n",
"--------------------------------------------------\n",
"Postcode = 3029.0\n",
"*************************\n",
" Suburb count\n",
"40 Hoppers Crossing 107\n",
"41 Tarneit 29\n",
"42 Truganina 18\n",
"--------------------------------------------------\n",
"Postcode = 3030.0\n",
"*************************\n",
" Suburb count\n",
"43 Derrimut 14\n",
"44 Point Cook 68\n",
"45 Werribee 120\n",
"46 Werribee South 2\n",
"--------------------------------------------------\n",
"Postcode = 3031.0\n",
"*************************\n",
" Suburb count\n",
"47 Flemington 80\n",
"48 Kensington 150\n",
"--------------------------------------------------\n",
"Postcode = 3032.0\n",
"*************************\n",
" Suburb count\n",
"49 Ascot Vale 167\n",
"50 Maribyrnong 146\n",
"51 Travancore 10\n",
"--------------------------------------------------\n",
"Postcode = 3033.0\n",
"*************************\n",
" Suburb count\n",
"52 Keilor East 172\n",
"--------------------------------------------------\n",
"Postcode = 3034.0\n",
"*************************\n",
" Suburb count\n",
"53 Avondale Heights 117\n",
"--------------------------------------------------\n",
"Postcode = 3036.0\n",
"*************************\n",
" Suburb count\n",
"54 Keilor 13\n",
"--------------------------------------------------\n",
"Postcode = 3037.0\n",
"*************************\n",
" Suburb count\n",
"55 Delahey 17\n",
"56 Hillside 54\n",
"57 Sydenham 21\n",
"58 Taylors Hill 28\n",
"--------------------------------------------------\n",
"Postcode = 3038.0\n",
"*************************\n",
" Suburb count\n",
"59 Keilor Downs 35\n",
"60 Keilor Lodge 5\n",
"61 Taylors Lakes 51\n",
"--------------------------------------------------\n",
"Postcode = 3039.0\n",
"*************************\n",
" Suburb count\n",
"62 Moonee Ponds 201\n",
"--------------------------------------------------\n",
"Postcode = 3040.0\n",
"*************************\n",
" Suburb count\n",
"63 Aberfeldie 58\n",
"64 Essendon 291\n",
"65 Essendon West 31\n",
"--------------------------------------------------\n",
"Postcode = 3041.0\n",
"*************************\n",
" Suburb count\n",
"66 Essendon North 33\n",
"67 Strathmore 115\n",
"68 Strathmore Heights 11\n",
"--------------------------------------------------\n",
"Postcode = 3042.0\n",
"*************************\n",
" Suburb count\n",
"69 Airport West 125\n",
"70 Keilor Park 29\n",
"71 Niddrie 100\n",
"--------------------------------------------------\n",
"Postcode = 3043.0\n",
"*************************\n",
" Suburb count\n",
"72 Gladstone Park 44\n",
"73 Gowanbrae 33\n",
"74 Tullamarine 41\n",
"--------------------------------------------------\n",
"Postcode = 3044.0\n",
"*************************\n",
" Suburb count\n",
"75 Pascoe Vale 253\n",
"--------------------------------------------------\n",
"Postcode = 3046.0\n",
"*************************\n",
" Suburb count\n",
"76 Glenroy 275\n",
"77 Hadfield 66\n",
"78 Oak Park 87\n",
"--------------------------------------------------\n",
"Postcode = 3047.0\n",
"*************************\n",
" Suburb count\n",
"79 Broadmeadows 62\n",
"80 Dallas 27\n",
"81 Jacana 44\n",
"--------------------------------------------------\n",
"Postcode = 3048.0\n",
"*************************\n",
" Suburb count\n",
"82 Coolaroo 7\n",
"83 Meadow Heights 38\n",
"--------------------------------------------------\n",
"Postcode = 3049.0\n",
"*************************\n",
" Suburb count\n",
"84 Attwood 12\n",
"85 Westmeadows 33\n",
"--------------------------------------------------\n",
"Postcode = 3051.0\n",
"*************************\n",
" Suburb count\n",
"86 North Melbourne 95\n",
"--------------------------------------------------\n",
"Postcode = 3052.0\n",
"*************************\n",
" Suburb count\n",
"87 Parkville 42\n",
"--------------------------------------------------\n",
"Postcode = 3053.0\n",
"*************************\n",
" Suburb count\n",
"88 Carlton 57\n",
"--------------------------------------------------\n",
"Postcode = 3054.0\n",
"*************************\n",
" Suburb count\n",
"89 Carlton North 57\n",
"90 Princes Hill 10\n",
"--------------------------------------------------\n",
"Postcode = 3055.0\n",
"*************************\n",
" Suburb count\n",
"91 Brunswick West 169\n",
"--------------------------------------------------\n",
"Postcode = 3056.0\n",
"*************************\n",
" Suburb count\n",
"92 Brunswick 323\n",
"--------------------------------------------------\n",
"Postcode = 3057.0\n",
"*************************\n",
" Suburb count\n",
"93 Brunswick East 127\n",
"--------------------------------------------------\n",
"Postcode = 3058.0\n",
"*************************\n",
" Suburb count\n",
"94 Coburg 243\n",
"95 Coburg North 99\n",
"--------------------------------------------------\n",
"Postcode = 3059.0\n",
"*************************\n",
" Suburb count\n",
"96 Greenvale 63\n",
"--------------------------------------------------\n",
"Postcode = 3060.0\n",
"*************************\n",
" Suburb count\n",
"97 Fawkner 125\n",
"--------------------------------------------------\n",
"Postcode = 3061.0\n",
"*************************\n",
" Suburb count\n",
"98 Campbellfield 15\n",
"--------------------------------------------------\n",
"Postcode = 3064.0\n",
"*************************\n",
" Suburb count\n",
"99 Craigieburn 168\n",
"100 Kalkallo 1\n",
"101 Mickleham 7\n",
"102 Roxburgh Park 66\n",
"--------------------------------------------------\n",
"Postcode = 3065.0\n",
"*************************\n",
" Suburb count\n",
"103 Fitzroy 87\n",
"--------------------------------------------------\n",
"Postcode = 3066.0\n",
"*************************\n",
" Suburb count\n",
"104 Collingwood 69\n",
"--------------------------------------------------\n",
"Postcode = 3067.0\n",
"*************************\n",
" Suburb count\n",
"105 Abbotsford 84\n",
"--------------------------------------------------\n",
"Postcode = 3068.0\n",
"*************************\n",
" Suburb count\n",
"106 Clifton Hill 91\n",
"107 Fitzroy North 119\n",
"--------------------------------------------------\n",
"Postcode = 3070.0\n",
"*************************\n",
" Suburb count\n",
"108 Northcote 270\n",
"--------------------------------------------------\n",
"Postcode = 3071.0\n",
"*************************\n",
" Suburb count\n",
"109 Thornbury 215\n",
"--------------------------------------------------\n",
"Postcode = 3072.0\n",
"*************************\n",
" Suburb count\n",
"110 Preston 332\n",
"--------------------------------------------------\n",
"Postcode = 3073.0\n",
"*************************\n",
" Suburb count\n",
"111 Reservoir 592\n",
"--------------------------------------------------\n",
"Postcode = 3074.0\n",
"*************************\n",
" Suburb count\n",
"112 Thomastown 70\n",
"--------------------------------------------------\n",
"Postcode = 3075.0\n",
"*************************\n",
" Suburb count\n",
"113 Lalor 67\n",
"--------------------------------------------------\n",
"Postcode = 3076.0\n",
"*************************\n",
" Suburb count\n",
"114 Epping 122\n",
"--------------------------------------------------\n",
"Postcode = 3078.0\n",
"*************************\n",
" Suburb count\n",
"115 Alphington 52\n",
"116 Fairfield 57\n",
"--------------------------------------------------\n",
"Postcode = 3079.0\n",
"*************************\n",
" Suburb count\n",
"117 Ivanhoe 136\n",
"118 Ivanhoe East 38\n",
"--------------------------------------------------\n",
"Postcode = 3081.0\n",
"*************************\n",
" Suburb count\n",
"119 Bellfield 21\n",
"120 Heidelberg Heights 97\n",
"121 Heidelberg West 87\n",
"--------------------------------------------------\n",
"Postcode = 3082.0\n",
"*************************\n",
" Suburb count\n",
"122 Mill Park 116\n",
"--------------------------------------------------\n",
"Postcode = 3083.0\n",
"*************************\n",
" Suburb count\n",
"123 Bundoora 111\n",
"124 Kingsbury 34\n",
"--------------------------------------------------\n",
"Postcode = 3084.0\n",
"*************************\n",
" Suburb count\n",
"125 Eaglemont 28\n",
"126 Heidelberg 63\n",
"127 Rosanna 105\n",
"128 Viewbank 56\n",
"129 viewbank 1\n",
"--------------------------------------------------\n",
"Postcode = 3085.0\n",
"*************************\n",
" Suburb count\n",
"130 MacLeod 53\n",
"131 Yallambie 36\n",
"--------------------------------------------------\n",
"Postcode = 3087.0\n",
"*************************\n",
" Suburb count\n",
"132 Watsonia 67\n",
"133 Watsonia North 21\n",
"--------------------------------------------------\n",
"Postcode = 3088.0\n",
"*************************\n",
" Suburb count\n",
"134 Briar Hill 16\n",
"135 Greensborough 90\n",
"136 St Helena 9\n",
"--------------------------------------------------\n",
"Postcode = 3089.0\n",
"*************************\n",
" Suburb count\n",
"137 Diamond Creek 19\n",
"--------------------------------------------------\n",
"Postcode = 3093.0\n",
"*************************\n",
" Suburb count\n",
"138 Lower Plenty 17\n",
"--------------------------------------------------\n",
"Postcode = 3094.0\n",
"*************************\n",
" Suburb count\n",
"139 Montmorency 48\n",
"--------------------------------------------------\n",
"Postcode = 3095.0\n",
"*************************\n",
" Suburb count\n",
"140 Eltham 61\n",
"141 Eltham North 16\n",
"142 Research 4\n",
"--------------------------------------------------\n",
"Postcode = 3096.0\n",
"*************************\n",
" Suburb count\n",
"143 Wattle Glen 1\n",
"--------------------------------------------------\n",
"Postcode = 3099.0\n",
"*************************\n",
" Suburb count\n",
"144 Hurstbridge 2\n",
"--------------------------------------------------\n",
"Postcode = 3101.0\n",
"*************************\n",
" Suburb count\n",
"145 Kew 264\n",
"--------------------------------------------------\n",
"Postcode = 3102.0\n",
"*************************\n",
" Suburb count\n",
"146 Kew East 77\n",
"--------------------------------------------------\n",
"Postcode = 3103.0\n",
"*************************\n",
" Suburb count\n",
"147 Balwyn 187\n",
"148 Deepdene 6\n",
"--------------------------------------------------\n",
"Postcode = 3104.0\n",
"*************************\n",
" Suburb count\n",
"149 Balwyn North 245\n",
"--------------------------------------------------\n",
"Postcode = 3105.0\n",
"*************************\n",
" Suburb count\n",
"150 Bulleen 106\n",
"--------------------------------------------------\n",
"Postcode = 3106.0\n",
"*************************\n",
" Suburb count\n",
"151 Templestowe 54\n",
"--------------------------------------------------\n",
"Postcode = 3107.0\n",
"*************************\n",
" Suburb count\n",
"152 Templestowe Lower 141\n",
"--------------------------------------------------\n",
"Postcode = 3108.0\n",
"*************************\n",
" Suburb count\n",
"153 Doncaster 164\n",
"--------------------------------------------------\n",
"Postcode = 3109.0\n",
"*************************\n",
" Suburb count\n",
"154 Doncaster East 95\n",
"--------------------------------------------------\n",
"Postcode = 3111.0\n",
"*************************\n",
" Suburb count\n",
"155 Donvale 34\n",
"--------------------------------------------------\n",
"Postcode = 3113.0\n",
"*************************\n",
" Suburb count\n",
"156 North Warrandyte 5\n",
"157 Warrandyte 8\n",
"--------------------------------------------------\n",
"Postcode = 3115.0\n",
"*************************\n",
" Suburb count\n",
"158 Wonga Park 2\n",
"--------------------------------------------------\n",
"Postcode = 3116.0\n",
"*************************\n",
" Suburb count\n",
"159 Chirnside Park 8\n",
"--------------------------------------------------\n",
"Postcode = 3121.0\n",
"*************************\n",
" Suburb count\n",
"160 Burnley 10\n",
"161 Cremorne 31\n",
"162 Richmond 349\n",
"--------------------------------------------------\n",
"Postcode = 3122.0\n",
"*************************\n",
" Suburb count\n",
"163 Hawthorn 253\n",
"--------------------------------------------------\n",
"Postcode = 3123.0\n",
"*************************\n",
" Suburb count\n",
"164 Hawthorn East 159\n",
"--------------------------------------------------\n",
"Postcode = 3124.0\n",
"*************************\n",
" Suburb count\n",
"165 Camberwell 222\n",
"--------------------------------------------------\n",
"Postcode = 3125.0\n",
"*************************\n",
" Suburb count\n",
"166 Burwood 131\n",
"--------------------------------------------------\n",
"Postcode = 3126.0\n",
"*************************\n",
" Suburb count\n",
"167 Canterbury 67\n",
"--------------------------------------------------\n",
"Postcode = 3127.0\n",
"*************************\n",
" Suburb count\n",
"168 Mont Albert 63\n",
"169 Surrey Hills 162\n",
"--------------------------------------------------\n",
"Postcode = 3128.0\n",
"*************************\n",
" Suburb count\n",
"170 Box Hill 67\n",
"--------------------------------------------------\n",
"Postcode = 3130.0\n",
"*************************\n",
" Suburb count\n",
"171 Blackburn 49\n",
"172 Blackburn North 30\n",
"173 Blackburn South 32\n",
"--------------------------------------------------\n",
"Postcode = 3131.0\n",
"*************************\n",
" Suburb count\n",
"174 Forest Hill 44\n",
"175 Nunawading 41\n",
"--------------------------------------------------\n",
"Postcode = 3132.0\n",
"*************************\n",
" Suburb count\n",
"176 Mitcham 75\n",
"--------------------------------------------------\n",
"Postcode = 3133.0\n",
"*************************\n",
" Suburb count\n",
"177 Vermont 30\n",
"178 Vermont South 27\n",
"--------------------------------------------------\n",
"Postcode = 3134.0\n",
"*************************\n",
" Suburb count\n",
"179 Ringwood 44\n",
"180 Ringwood North 23\n",
"181 Warranwood 3\n",
"--------------------------------------------------\n",
"Postcode = 3135.0\n",
"*************************\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Suburb count\n",
"182 Heathmont 32\n",
"183 Ringwood East 27\n",
"--------------------------------------------------\n",
"Postcode = 3136.0\n",
"*************************\n",
" Suburb count\n",
"184 Croydon 83\n",
"185 Croydon Hills 11\n",
"186 Croydon North 16\n",
"187 Croydon South 10\n",
"188 croydon 1\n",
"--------------------------------------------------\n",
"Postcode = 3137.0\n",
"*************************\n",
" Suburb count\n",
"189 Kilsyth 17\n",
"--------------------------------------------------\n",
"Postcode = 3138.0\n",
"*************************\n",
" Suburb count\n",
"190 Mooroolbark 29\n",
"--------------------------------------------------\n",
"Postcode = 3140.0\n",
"*************************\n",
" Suburb count\n",
"191 Lilydale 4\n",
"--------------------------------------------------\n",
"Postcode = 3141.0\n",
"*************************\n",
" Suburb count\n",
"192 South Yarra 276\n",
"--------------------------------------------------\n",
"Postcode = 3142.0\n",
"*************************\n",
" Suburb count\n",
"193 Toorak 135\n",
"--------------------------------------------------\n",
"Postcode = 3143.0\n",
"*************************\n",
" Suburb count\n",
"194 Armadale 118\n",
"--------------------------------------------------\n",
"Postcode = 3144.0\n",
"*************************\n",
" Suburb count\n",
"195 Kooyong 5\n",
"196 Malvern 100\n",
"--------------------------------------------------\n",
"Postcode = 3145.0\n",
"*************************\n",
" Suburb count\n",
"197 Caulfield East 15\n",
"198 Malvern East 183\n",
"--------------------------------------------------\n",
"Postcode = 3146.0\n",
"*************************\n",
" Suburb count\n",
"199 Glen Iris 255\n",
"--------------------------------------------------\n",
"Postcode = 3147.0\n",
"*************************\n",
" Suburb count\n",
"200 Ashburton 90\n",
"201 Ashwood 83\n",
"--------------------------------------------------\n",
"Postcode = 3148.0\n",
"*************************\n",
" Suburb count\n",
"202 Chadstone 61\n",
"--------------------------------------------------\n",
"Postcode = 3149.0\n",
"*************************\n",
" Suburb count\n",
"203 Mount Waverley 135\n",
"--------------------------------------------------\n",
"Postcode = 3150.0\n",
"*************************\n",
" Suburb count\n",
"204 Glen Waverley 116\n",
"205 Wheelers Hill 52\n",
"--------------------------------------------------\n",
"Postcode = 3151.0\n",
"*************************\n",
" Suburb count\n",
"206 Burwood East 33\n",
"--------------------------------------------------\n",
"Postcode = 3152.0\n",
"*************************\n",
" Suburb count\n",
"207 Wantirna 35\n",
"208 Wantirna South 35\n",
"--------------------------------------------------\n",
"Postcode = 3153.0\n",
"*************************\n",
" Suburb count\n",
"209 Bayswater 27\n",
"210 Bayswater North 17\n",
"--------------------------------------------------\n",
"Postcode = 3154.0\n",
"*************************\n",
" Suburb count\n",
"211 The Basin 3\n",
"--------------------------------------------------\n",
"Postcode = 3155.0\n",
"*************************\n",
" Suburb count\n",
"212 Boronia 33\n",
"--------------------------------------------------\n",
"Postcode = 3156.0\n",
"*************************\n",
" Suburb count\n",
"213 Ferntree Gully 46\n",
"214 Lysterfield 1\n",
"--------------------------------------------------\n",
"Postcode = 3158.0\n",
"*************************\n",
" Suburb count\n",
"215 Upwey 2\n",
"--------------------------------------------------\n",
"Postcode = 3160.0\n",
"*************************\n",
" Suburb count\n",
"216 Tecoma 1\n",
"--------------------------------------------------\n",
"Postcode = 3161.0\n",
"*************************\n",
" Suburb count\n",
"217 Caulfield North 48\n",
"--------------------------------------------------\n",
"Postcode = 3162.0\n",
"*************************\n",
" Suburb count\n",
"218 Caulfield 14\n",
"219 Caulfield South 63\n",
"--------------------------------------------------\n",
"Postcode = 3163.0\n",
"*************************\n",
" Suburb count\n",
"220 Carnegie 183\n",
"221 Glen Huntly 40\n",
"222 Murrumbeena 110\n",
"--------------------------------------------------\n",
"Postcode = 3165.0\n",
"*************************\n",
" Suburb count\n",
"223 Bentleigh East 405\n",
"--------------------------------------------------\n",
"Postcode = 3166.0\n",
"*************************\n",
" Suburb count\n",
"224 Hughesdale 42\n",
"225 Huntingdale 7\n",
"226 Oakleigh 64\n",
"227 Oakleigh East 23\n",
"--------------------------------------------------\n",
"Postcode = 3167.0\n",
"*************************\n",
" Suburb count\n",
"228 Oakleigh South 87\n",
"--------------------------------------------------\n",
"Postcode = 3168.0\n",
"*************************\n",
" Suburb count\n",
"229 Clayton 31\n",
"230 Notting Hill 7\n",
"--------------------------------------------------\n",
"Postcode = 3169.0\n",
"*************************\n",
" Suburb count\n",
"231 Clarinda 12\n",
"232 Clayton South 29\n",
"--------------------------------------------------\n",
"Postcode = 3170.0\n",
"*************************\n",
" Suburb count\n",
"233 Mulgrave 64\n",
"--------------------------------------------------\n",
"Postcode = 3171.0\n",
"*************************\n",
" Suburb count\n",
"234 Springvale 26\n",
"--------------------------------------------------\n",
"Postcode = 3172.0\n",
"*************************\n",
" Suburb count\n",
"235 Dingley Village 36\n",
"236 Springvale South 8\n",
"--------------------------------------------------\n",
"Postcode = 3173.0\n",
"*************************\n",
" Suburb count\n",
"237 Keysborough 23\n",
"--------------------------------------------------\n",
"Postcode = 3174.0\n",
"*************************\n",
" Suburb count\n",
"238 Noble Park 41\n",
"--------------------------------------------------\n",
"Postcode = 3175.0\n",
"*************************\n",
" Suburb count\n",
"239 Dandenong 33\n",
"240 Dandenong North 33\n",
"--------------------------------------------------\n",
"Postcode = 3177.0\n",
"*************************\n",
" Suburb count\n",
"241 Doveton 15\n",
"242 Eumemmerring 1\n",
"--------------------------------------------------\n",
"Postcode = 3178.0\n",
"*************************\n",
" Suburb count\n",
"243 Rowville 25\n",
"--------------------------------------------------\n",
"Postcode = 3179.0\n",
"*************************\n",
" Suburb count\n",
"244 Scoresby 9\n",
"--------------------------------------------------\n",
"Postcode = 3180.0\n",
"*************************\n",
" Suburb count\n",
"245 Knoxfield 14\n",
"--------------------------------------------------\n",
"Postcode = 3181.0\n",
"*************************\n",
" Suburb count\n",
"246 Prahran 156\n",
"247 Windsor 63\n",
"--------------------------------------------------\n",
"Postcode = 3182.0\n",
"*************************\n",
" Suburb count\n",
"248 St Kilda 234\n",
"--------------------------------------------------\n",
"Postcode = 3183.0\n",
"*************************\n",
" Suburb count\n",
"249 Balaclava 47\n",
"--------------------------------------------------\n",
"Postcode = 3184.0\n",
"*************************\n",
" Suburb count\n",
"250 Elwood 162\n",
"--------------------------------------------------\n",
"Postcode = 3185.0\n",
"*************************\n",
" Suburb count\n",
"251 Elsternwick 77\n",
"252 Gardenvale 10\n",
"253 Ripponlea 12\n",
"--------------------------------------------------\n",
"Postcode = 3186.0\n",
"*************************\n",
" Suburb count\n",
"254 Brighton 254\n",
"--------------------------------------------------\n",
"Postcode = 3187.0\n",
"*************************\n",
" Suburb count\n",
"255 Brighton East 213\n",
"--------------------------------------------------\n",
"Postcode = 3188.0\n",
"*************************\n",
" Suburb count\n",
"256 Hampton 179\n",
"257 Hampton East 67\n",
"--------------------------------------------------\n",
"Postcode = 3189.0\n",
"*************************\n",
" Suburb count\n",
"258 Moorabbin 88\n",
"--------------------------------------------------\n",
"Postcode = 3190.0\n",
"*************************\n",
" Suburb count\n",
"259 Highett 69\n",
"--------------------------------------------------\n",
"Postcode = 3191.0\n",
"*************************\n",
" Suburb count\n",
"260 Sandringham 59\n",
"--------------------------------------------------\n",
"Postcode = 3192.0\n",
"*************************\n",
" Suburb count\n",
"261 Cheltenham 120\n",
"--------------------------------------------------\n",
"Postcode = 3193.0\n",
"*************************\n",
" Suburb count\n",
"262 Beaumaris 69\n",
"263 Black Rock 19\n",
"--------------------------------------------------\n",
"Postcode = 3194.0\n",
"*************************\n",
" Suburb count\n",
"264 Mentone 74\n",
"--------------------------------------------------\n",
"Postcode = 3195.0\n",
"*************************\n",
" Suburb count\n",
"265 Aspendale 25\n",
"266 Aspendale Gardens 10\n",
"267 Mordialloc 46\n",
"268 Parkdale 62\n",
"269 Waterways 1\n",
"--------------------------------------------------\n",
"Postcode = 3196.0\n",
"*************************\n",
" Suburb count\n",
"270 Bonbeach 10\n",
"271 Chelsea 26\n",
"272 Chelsea Heights 14\n",
"273 Edithvale 25\n",
"--------------------------------------------------\n",
"Postcode = 3197.0\n",
"*************************\n",
" Suburb count\n",
"274 Carrum 21\n",
"275 Patterson Lakes 5\n",
"--------------------------------------------------\n",
"Postcode = 3198.0\n",
"*************************\n",
" Suburb count\n",
"276 Seaford 34\n",
"--------------------------------------------------\n",
"Postcode = 3199.0\n",
"*************************\n",
" Suburb count\n",
"277 Frankston 71\n",
"278 Frankston South 34\n",
"--------------------------------------------------\n",
"Postcode = 3200.0\n",
"*************************\n",
" Suburb count\n",
"279 Frankston North 13\n",
"--------------------------------------------------\n",
"Postcode = 3201.0\n",
"*************************\n",
" Suburb count\n",
"280 Carrum Downs 20\n",
"--------------------------------------------------\n",
"Postcode = 3202.0\n",
"*************************\n",
" Suburb count\n",
"281 Heatherton 2\n",
"--------------------------------------------------\n",
"Postcode = 3204.0\n",
"*************************\n",
" Suburb count\n",
"282 Bentleigh 223\n",
"283 McKinnon 22\n",
"284 Ormond 102\n",
"--------------------------------------------------\n",
"Postcode = 3205.0\n",
"*************************\n",
" Suburb count\n",
"285 South Melbourne 118\n",
"--------------------------------------------------\n",
"Postcode = 3206.0\n",
"*************************\n",
" Suburb count\n",
"286 Albert Park 79\n",
"287 Middle Park 45\n",
"--------------------------------------------------\n",
"Postcode = 3207.0\n",
"*************************\n",
" Suburb count\n",
"288 Port Melbourne 214\n",
"--------------------------------------------------\n",
"Postcode = 3335.0\n",
"*************************\n",
" Suburb count\n",
"289 Plumpton 6\n",
"290 Rockbank 3\n",
"--------------------------------------------------\n",
"Postcode = 3337.0\n",
"*************************\n",
" Suburb count\n",
"291 Kurunjang 13\n",
"292 Melton 25\n",
"293 Melton West 26\n",
"--------------------------------------------------\n",
"Postcode = 3338.0\n",
"*************************\n",
" Suburb count\n",
"294 Brookfield 3\n",
"295 Melton South 42\n",
"--------------------------------------------------\n",
"Postcode = 3340.0\n",
"*************************\n",
" Suburb count\n",
"296 Bacchus Marsh 3\n",
"297 Darley 1\n",
"--------------------------------------------------\n",
"Postcode = 3427.0\n",
"*************************\n",
" Suburb count\n",
"298 Diggers Rest 6\n",
"--------------------------------------------------\n",
"Postcode = 3428.0\n",
"*************************\n",
" Suburb count\n",
"299 Bulla 1\n",
"--------------------------------------------------\n",
"Postcode = 3429.0\n",
"*************************\n",
" Suburb count\n",
"300 Sunbury 112\n",
"--------------------------------------------------\n",
"Postcode = 3431.0\n",
"*************************\n",
" Suburb count\n",
"301 Riddells Creek 1\n",
"--------------------------------------------------\n",
"Postcode = 3437.0\n",
"*************************\n",
" Suburb count\n",
"302 Bullengarook 4\n",
"303 Gisborne 19\n",
"304 Gisborne South 2\n",
"--------------------------------------------------\n",
"Postcode = 3438.0\n",
"*************************\n",
" Suburb count\n",
"305 New Gisborne 3\n",
"--------------------------------------------------\n",
"Postcode = 3750.0\n",
"*************************\n",
" Suburb count\n",
"306 Wollert 39\n",
"--------------------------------------------------\n",
"Postcode = 3752.0\n",
"*************************\n",
" Suburb count\n",
"307 South Morang 71\n",
"--------------------------------------------------\n",
"Postcode = 3754.0\n",
"*************************\n",
" Suburb count\n",
"308 Doreen 28\n",
"309 Mernda 66\n",
"--------------------------------------------------\n",
"Postcode = 3756.0\n",
"*************************\n",
" Suburb count\n",
"310 Wallan 8\n",
"--------------------------------------------------\n",
"Postcode = 3757.0\n",
"*************************\n",
" Suburb count\n",
"311 Whittlesea 4\n",
"--------------------------------------------------\n",
"Postcode = 3775.0\n",
"*************************\n",
" Suburb count\n",
"312 Yarra Glen 1\n",
"--------------------------------------------------\n",
"Postcode = 3777.0\n",
"*************************\n",
" Suburb count\n",
"313 Healesville 1\n",
"--------------------------------------------------\n",
"Postcode = 3782.0\n",
"*************************\n",
" Suburb count\n",
"314 Emerald 3\n",
"--------------------------------------------------\n",
"Postcode = 3786.0\n",
"*************************\n",
" Suburb count\n",
"315 Ferny Creek 1\n",
"--------------------------------------------------\n",
"Postcode = 3793.0\n",
"*************************\n",
" Suburb count\n",
"316 Monbulk 1\n",
"--------------------------------------------------\n",
"Postcode = 3795.0\n",
"*************************\n",
" Suburb count\n",
"317 Silvan 2\n",
"--------------------------------------------------\n",
"Postcode = 3796.0\n",
"*************************\n",
" Suburb count\n",
"318 Mount Evelyn 3\n",
"--------------------------------------------------\n",
"Postcode = 3802.0\n",
"*************************\n",
" Suburb count\n",
"319 Endeavour Hills 16\n",
"--------------------------------------------------\n",
"Postcode = 3803.0\n",
"*************************\n",
" Suburb count\n",
"320 Hallam 7\n",
"--------------------------------------------------\n",
"Postcode = 3805.0\n",
"*************************\n",
" Suburb count\n",
"321 Narre Warren 14\n",
"--------------------------------------------------\n",
"Postcode = 3806.0\n",
"*************************\n",
" Suburb count\n",
"322 Berwick 34\n",
"--------------------------------------------------\n",
"Postcode = 3807.0\n",
"*************************\n",
" Suburb count\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"323 Beaconsfield 2\n",
"--------------------------------------------------\n",
"Postcode = 3808.0\n",
"*************************\n",
" Suburb count\n",
"324 Beaconsfield Upper 2\n",
"--------------------------------------------------\n",
"Postcode = 3809.0\n",
"*************************\n",
" Suburb count\n",
"325 Officer 3\n",
"--------------------------------------------------\n",
"Postcode = 3810.0\n",
"*************************\n",
" Suburb count\n",
"326 Pakenham 10\n",
"--------------------------------------------------\n",
"Postcode = 3910.0\n",
"*************************\n",
" Suburb count\n",
"327 Langwarrin 14\n",
"--------------------------------------------------\n",
"Postcode = 3975.0\n",
"*************************\n",
" Suburb count\n",
"328 Lynbrook 2\n",
"--------------------------------------------------\n",
"Postcode = 3976.0\n",
"*************************\n",
" Suburb count\n",
"329 Hampton Park 5\n",
"--------------------------------------------------\n",
"Postcode = 3977.0\n",
"*************************\n",
" Suburb count\n",
"330 Cranbourne 9\n",
"331 Cranbourne East 1\n",
"332 Cranbourne North 3\n",
"333 Cranbourne West 4\n",
"334 Sandhurst 1\n",
"335 Skye 3\n",
"--------------------------------------------------\n",
"Postcode = 3978.0\n",
"*************************\n",
" Suburb count\n",
"336 Clyde North 3\n",
"--------------------------------------------------\n"
]
}
],
"source": [
"for postcode , group in df_postcode_suburb.groupby(['Postcode']):\n",
" print(\"Postcode = %s\" %(postcode))\n",
" print(25*'*')\n",
" print(group[['Suburb', 'count']])\n",
" print(50*'-')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Is Postalcode an important feature in predicting Price ?"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 3600x720 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"gr = sns.catplot(data=training_set, x=\"Postcode\", y=\"Price\", kind=\"bar\", ci=None,height=10, aspect=5)\n",
"plt.xlabel(\"Postcode\")\n",
"plt.ylabel(\"Average Price\")\n",
"l1 = gr.set_yticklabels(rotation = 45)\n",
"l2 = gr.set_xticklabels(rotation = 45)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 3600x720 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"gr = sns.catplot(data=training_set, x=\"Suburb\", y=\"Price\", kind=\"bar\", ci=None,height=10, aspect=5)\n",
"plt.xlabel(\"Suburb\")\n",
"plt.ylabel(\"Average Price\")\n",
"l1 = gr.set_yticklabels(rotation = 45)\n",
"l2 = gr.set_xticklabels(rotation = 45)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Drop \"CouncilArea\", \"Regionname\", \"Lattitude\", \"Suburb\", \"Longtitude\""
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"training_set.drop([\"Address\", \"CouncilArea\", \"Regionname\", \"Lattitude\", \"Suburb\", \"Longtitude\"], axis=1, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Rooms', 'Type', 'Price', 'Method', 'SellerG', 'Date', 'Distance',\n",
" 'Postcode', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea',\n",
" 'YearBuilt', 'Propertycount'],\n",
" dtype='object')"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.columns"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Rooms</th>\n",
" <th>Type</th>\n",
" <th>Price</th>\n",
" <th>Method</th>\n",
" <th>SellerG</th>\n",
" <th>Date</th>\n",
" <th>Distance</th>\n",
" <th>Postcode</th>\n",
" <th>Bedroom2</th>\n",
" <th>Bathroom</th>\n",
" <th>Car</th>\n",
" <th>Landsize</th>\n",
" <th>BuildingArea</th>\n",
" <th>YearBuilt</th>\n",
" <th>Propertycount</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>17957</th>\n",
" <td>2</td>\n",
" <td>u</td>\n",
" <td>580000.0</td>\n",
" <td>VB</td>\n",
" <td>Nelson</td>\n",
" <td>26/08/2017</td>\n",
" <td>4.0</td>\n",
" <td>3057.0</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>61.0</td>\n",
" <td>1970.0</td>\n",
" <td>5533.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6753</th>\n",
" <td>3</td>\n",
" <td>t</td>\n",
" <td>1025000.0</td>\n",
" <td>SP</td>\n",
" <td>Jellis</td>\n",
" <td>3/12/2016</td>\n",
" <td>2.6</td>\n",
" <td>3121.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>14949.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1833</th>\n",
" <td>2</td>\n",
" <td>t</td>\n",
" <td>380000.0</td>\n",
" <td>VB</td>\n",
" <td>Nelson</td>\n",
" <td>7/11/2016</td>\n",
" <td>5.9</td>\n",
" <td>3055.0</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>85.0</td>\n",
" <td>1970.0</td>\n",
" <td>7082.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19795</th>\n",
" <td>2</td>\n",
" <td>u</td>\n",
" <td>456000.0</td>\n",
" <td>S</td>\n",
" <td>Brad</td>\n",
" <td>21/10/2017</td>\n",
" <td>8.5</td>\n",
" <td>3044.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>7485.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17054</th>\n",
" <td>3</td>\n",
" <td>h</td>\n",
" <td>1300000.0</td>\n",
" <td>S</td>\n",
" <td>hockingstuart</td>\n",
" <td>19/08/2017</td>\n",
" <td>16.7</td>\n",
" <td>3150.0</td>\n",
" <td>3.0</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>733.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>15321.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Rooms Type Price Method SellerG Date Distance \\\n",
"17957 2 u 580000.0 VB Nelson 26/08/2017 4.0 \n",
"6753 3 t 1025000.0 SP Jellis 3/12/2016 2.6 \n",
"1833 2 t 380000.0 VB Nelson 7/11/2016 5.9 \n",
"19795 2 u 456000.0 S Brad 21/10/2017 8.5 \n",
"17054 3 h 1300000.0 S hockingstuart 19/08/2017 16.7 \n",
"\n",
" Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt \\\n",
"17957 3057.0 2.0 1.0 1.0 NaN 61.0 1970.0 \n",
"6753 3121.0 NaN NaN NaN NaN NaN NaN \n",
"1833 3055.0 2.0 1.0 1.0 0.0 85.0 1970.0 \n",
"19795 3044.0 NaN NaN NaN NaN NaN NaN \n",
"17054 3150.0 3.0 2.0 3.0 733.0 NaN NaN \n",
"\n",
" Propertycount \n",
"17957 5533.0 \n",
"6753 14949.0 \n",
"1833 7082.0 \n",
"19795 7485.0 \n",
"17054 15321.0 "
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.head()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Type False\n",
"Method False\n",
"SellerG False\n",
"Date False\n",
"dtype: bool"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.select_dtypes(['object']).isnull().any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The Date field stores dates as day/month/year strings. Since the dataset also has YearBuilt, extract the year of sale as sold_year."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"training_set.Date = pd.to_datetime(training_set.Date, dayfirst=True)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"training_set[\"sold_year\"] = training_set.Date.dt.year"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"training_set.drop(\"Date\", axis=1, inplace=True)"
]
},
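{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dates in this dataset are written day first (for example 26/08/2017). The cell below is a minimal sketch, on a made-up two-row sample rather than the real data, of how an explicit day-first format keeps day and month from being swapped. The names `sample` and `parsed` are illustrative assumptions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Made-up sample in the same day/month/year layout as the Date column.\n",
"sample = pd.Series([\"3/12/2016\", \"26/08/2017\"])\n",
"\n",
"# An explicit format reads 3/12/2016 as 3 December, not 12 March.\n",
"parsed = pd.to_datetime(sample, format=\"%d/%m/%Y\")\n",
"\n",
"print(parsed.dt.month.tolist())  # [12, 8]\n",
"print(parsed.dt.year.tolist())   # [2016, 2017]"
]
},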
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### EDA (exploratory data analysis) of numerical columns and fixing problems."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Rooms False\n",
"Price False\n",
"Distance False\n",
"Postcode False\n",
"Bedroom2 True\n",
"Bathroom True\n",
"Car True\n",
"Landsize True\n",
"BuildingArea True\n",
"YearBuilt True\n",
"Propertycount True\n",
"sold_year False\n",
"dtype: bool"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.select_dtypes(['float64','int64']).isnull().any()"
]
},
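{
"cell_type": "markdown",
"metadata": {},
"source": [
"The check above only reports whether a column has any missing values. A possible next step, sketched here on a made-up frame (`toy` is an assumption, not part of the dataset), is to look at the share of missing values per column, which helps decide between imputing a column and dropping it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Made-up frame standing in for training_set.\n",
"toy = pd.DataFrame({\n",
"    \"Bathroom\": [1.0, np.nan, 2.0, 1.0],\n",
"    \"Car\": [1.0, 2.0, np.nan, np.nan],\n",
"})\n",
"\n",
"# Fraction of missing values per numeric column, largest first.\n",
"missing_share = toy.select_dtypes(\"number\").isnull().mean().sort_values(ascending=False)\n",
"print(missing_share)  # Car 0.50, Bathroom 0.25"
]
},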
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# --------------------------------------------------------------------"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(5163, 15)"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set[training_set.Bedroom2.isnull()].shape"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x19bc14d7d68>"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"training_set.Bedroom2.hist(bins=20)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"training_set[\"Bedroom2\"] = training_set[\"Bedroom2\"].fillna(training_set[\"Bedroom2\"].mean())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# --------------------------------------------------------------------"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(5169, 15)"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set[training_set.Bathroom.isnull()].shape"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x19bc16ab2e8>"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"training_set.Bathroom.hist(bins=10)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.Bathroom.mode()[0]"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"training_set[\"Bathroom\"] = training_set[\"Bathroom\"].fillna(training_set[\"Bathroom\"].mode()[0])"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0, 15)"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set[training_set.Bathroom.isnull()].shape"
]
},
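{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mean, median and mode can give quite different fill values for skewed count columns. The cell below is a small sketch on made-up numbers (`s` and `filled` are illustrative names) comparing the three and filling by assignment rather than in place."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Made-up skewed counts with two gaps.\n",
"s = pd.Series([1.0, 1.0, 1.0, 2.0, 2.0, 8.0, np.nan, np.nan])\n",
"\n",
"print(s.mean())     # 2.5, pulled up by the single large value\n",
"print(s.median())   # 1.5\n",
"print(s.mode()[0])  # 1.0\n",
"\n",
"# Assign the result back instead of relying on inplace=True.\n",
"filled = s.fillna(s.mode()[0])\n",
"print(filled.isnull().sum())  # 0"
]
},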
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# --------------------------------------------------------------------"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(5472, 15)"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set[training_set.Car.isnull()].shape"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x19bc051ad30>"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"training_set.Car.hist(bins=10)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"training_set[\"Car\"] = training_set[\"Car\"].fillna(training_set[\"Car\"].median())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# --------------------------------------------------------------------"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(7443, 15)"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set[training_set.Landsize.isnull()].shape"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x19bc1268208>"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"training_set.Landsize.hist(bins=20)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 14353.000000\n",
"mean 570.962447\n",
"std 1956.911573\n",
"min 0.000000\n",
"25% 220.000000\n",
"50% 513.000000\n",
"75% 664.000000\n",
"max 146699.000000\n",
"Name: Landsize, dtype: float64"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.Landsize.describe()"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1540, 15)"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set[training_set.Landsize == 0].shape"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3778, 15)"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set[training_set.Landsize > 656].shape"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3571, 15)"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set[training_set.Landsize < 219].shape"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0, 15)"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set[training_set.Landsize < 0].shape"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(12813, 15)"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set[training_set.Landsize.notnull() & (training_set.Landsize > 0)].shape"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"training_set[\"Landsize_log\"] = np.log(training_set[training_set.Landsize.notnull() & (training_set.Landsize > 0)]['Landsize'])"
]
},
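{
"cell_type": "markdown",
"metadata": {},
"source": [
"When a null check is combined with a comparison, the comparison needs its own parentheses because `&` binds more tightly than `>`. The cell below is a sketch on made-up land sizes (`land`, `positive` and `land_log` are illustrative names). It also shows that rows left out of the mask, including zeros whose logarithm would be minus infinity, end up as NaN in the log column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Made-up land sizes: one missing, one zero, two positive.\n",
"land = pd.Series([np.nan, 0.0, 1.0, 513.0])\n",
"\n",
"# Parenthesise the comparison so it is evaluated before &.\n",
"positive = land.notnull() & (land > 0)\n",
"print(positive.tolist())  # [False, False, True, True]\n",
"\n",
"# Take the log of the positive values only; the other rows become NaN.\n",
"land_log = np.log(land[positive]).reindex(land.index)\n",
"print(land_log.tolist())"
]
},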
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 12813.000000\n",
"mean 6.127162\n",
"std 0.705189\n",
"min 0.000000\n",
"25% 5.717028\n",
"50% 6.315358\n",
"75% 6.530878\n",
"max 11.896138\n",
"Name: Landsize_log, dtype: float64"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set[\"Landsize_log\"].describe()"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x19bc1661ba8>"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD4CAYAAAAAczaOAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjAsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+17YcXAAASeklEQVR4nO3dcZBdZXnH8e8jUUFiSSi6Q5NMl44ZK5IKuANpmelsQCGAY/hDZuJQDTad/IMtdjIjoR2HVqGDo4h1qrQZSY1KjQzKkAEVM4Edx5mCELEEiDQppBCSEm1CdAG1a5/+cd+1S9jN3t179967+34/Mzv3nPe855z34d787rnnnnuIzESSVIfXdHsAkqTOMfQlqSKGviRVxNCXpIoY+pJUkXndHsCxnHLKKdnf3z/t9V988UVOPPHE9g2oi6ylN1lLb5pLtcDU69mxY8dPM/NN4y3r6dDv7+/n4Ycfnvb6Q0NDDA4Otm9AXWQtvclaetNcqgWmXk9E/OdEyzy9I0kVMfQlqSKGviRVxNCXpIoY+pJUEUNfkipi6EtSRQx9SaqIoS9JFenpX+RKc0H/hnvaur29N17a1u2pLh7pS1JFDH1JqoihL0kVMfQlqSKGviRVxNCXpIoY+pJUEUNfkipi6EtSRQx9SaqIoS9JFTH0Jakihr4kVcTQl6SKGPqSVBFDX5IqYuhLUkUMfUmqiKEvSRUx9CWpIk2FfkTsjYidEfGjiHi4tJ0cEdsiYnd5XFjaIyI+FxF7IuLRiDh7zHbWlP67I2LNzJQkSZrIVI70V2TmmZk5UOY3ANszcymwvcwDXAwsLX/rgFug8SYBXAecC5wDXDf6RiFJ6oxWTu+sAjaX6c3AZWPav5wNDwALIuJU4CJgW2YeyszDwDZgZQv7lyRNUWTm5J0ingYOAwn8U2ZujIgXMnPBmD6HM3NhRNwN3JiZ3y/t24FrgEHg+My8vrR/DHg5Mz991L7W0fiEQF9f3zu3bNky7eKGh4eZP3/+tNfvJdbSm5qpZedzR9q6z2WLTmrr9kbV9rzMJlOtZ8WKFTvGnJV5hXlNbuO8zNwfEW8GtkXEj4/RN8Zpy2O0v7IhcyOwEWBgYCAHBwebHOKrDQ0N0cr6vcRaelMztVy54Z627nPvFcfe33TV9rzMJu2sp6nTO5m5vzweBO6kcU7++XLahvJ4sHTfBywZs/piYP8x2iVJHTJp6EfEiRHxxtFp4ELgMWArMHoFzhrgrjK9FfhguYpnOXAkMw8A9wIXRsTC8gXuhaVNktQhzZze6QPujIjR/v+Smd+JiIeA2yNiLfAMcHnp/y3gEmAP8BLwIYDMPBQRnwAeKv0+npmH2laJJGlSk4Z+Zj4FvGOc9v8GLhinPYGrJtjWJmDT1IcpSWoHf5ErSRUx9CWpIoa+JFXE0Jekihj6klQRQ1+SKmLoS1JFDH1JqoihL0kVMfQlqSKGviRVxNCXpIoY+pJUEUNfkipi6EtSRQx9SaqIoS9JFTH0Jakihr4kVcTQl6SKGPqSVBFDX5IqYuhLUkUMfUmqyLxuD0DqNf0b7mm67/plI1w5hf5St3mkL0kVMfQlqSJNh35EHBcRj0TE3WX+tIh4MCJ2R8TXI+J1pf31ZX5PWd4/ZhvXlvYnI+KidhcjSTq2qRzpXw3sGjP/SeDmzFwKHAbWlva1wOHMfAtwc+lHRJwOrAbeDqwEvhARx7U2fEnSVDQV+hGxGLgU+GKZD+B84I7SZTNwWZleVeYpyy8o/VcBWzLzl5n5NLAHOKcdRUiSmtPs1TufBT4KvLHM/zbwQmaOlPl9wKIyvQh4FiAzRyLiSOm/CHhgzDbHrvMbEbEOWAfQ19fH0NBQs7W8yvDwcEvr9xJr6Zz1y0Ym71T0nTC1/u0wU//tev15mYq5VAu0t55JQz8i3gMczMwdETE42jxO15xk2bHW+f+GzI3ARoCBgYEcHBw8ukvTho
aGaGX9XmItnTOVSzDXLxvhpp2dvfJ57xWDM7LdXn9epmIu1QLtraeZV+t5wHsj4hLgeOC3aBz5L4iIeeVofzGwv/TfBywB9kXEPOAk4NCY9lFj15EkdcCk5/Qz89rMXJyZ/TS+iL0vM68A7gfeV7qtAe4q01vLPGX5fZmZpX11ubrnNGAp8IO2VSJJmlQrn0uvAbZExPXAI8Ctpf1W4CsRsYfGEf5qgMx8PCJuB54ARoCrMvPXLexfkjRFUwr9zBwChsr0U4xz9U1m/gK4fIL1bwBumOogJUnt4S9yJakihr4kVcTQl6SKGPqSVBFDX5IqYuhLUkUMfUmqiKEvSRUx9CWpIoa+JFXE0Jekihj6klQRQ1+SKmLoS1JFDH1JqoihL0kVMfQlqSKGviRVxNCXpIoY+pJUEUNfkipi6EtSRQx9SaqIoS9JFTH0Jakihr4kVcTQl6SKGPqSVJFJQz8ijo+IH0TEv0XE4xHxt6X9tIh4MCJ2R8TXI+J1pf31ZX5PWd4/ZlvXlvYnI+KimSpKkjS+Zo70fwmcn5nvAM4EVkbEcuCTwM2ZuRQ4DKwt/dcChzPzLcDNpR8RcTqwGng7sBL4QkQc185iJEnHNmnoZ8NwmX1t+UvgfOCO0r4ZuKxMryrzlOUXRESU9i2Z+cvMfBrYA5zTliokSU2JzJy8U+OIfAfwFuDzwKeAB8rRPBGxBPh2Zp4REY8BKzNzX1n2H8C5wN+Udb5a2m8t69xx1L7WAesA+vr63rlly5ZpFzc8PMz8+fOnvX4vsZbO2fnckab79p0Az788g4MZx7JFJ83Idnv9eZmKuVQLTL2eFStW7MjMgfGWzWtmA5n5a+DMiFgA3Am8bbxu5TEmWDZR+9H72ghsBBgYGMjBwcFmhjiuoaEhWlm/l1hL51y54Z6m+65fNsJNO5v6Z9Q2e68YnJHt9vrzMhVzqRZobz1TunonM18AhoDlwIKIGH21Lwb2l+l9wBKAsvwk4NDY9nHWkSR1QDNX77ypHOETEScA7wJ2AfcD7yvd1gB3lemtZZ6y/L5snEPaCqwuV/ecBiwFftCuQiRJk2vmc+mpwOZyXv81wO2ZeXdEPAFsiYjrgUeAW0v/W4GvRMQeGkf4qwEy8/GIuB14AhgBriqnjSRJHTJp6Gfmo8BZ47Q/xThX32TmL4DLJ9jWDcANUx+mJKkd/EWuJFXE0Jekihj6klQRQ1+SKmLoS1JFDH1JqoihL0kVMfQlqSKGviRVxNCXpIoY+pJUEUNfkipi6EtSRQx9SaqIoS9JFTH0Jakihr4kVcTQl6SKGPqSVBFDX5IqYuhLUkUMfUmqiKEvSRUx9CWpIoa+JFXE0Jekiszr9gCkVvRvuKfbQ5BmlUmP9CNiSUTcHxG7IuLxiLi6tJ8cEdsiYnd5XFjaIyI+FxF7IuLRiDh7zLbWlP67I2LNzJUlSRpPM6d3RoD1mfk2YDlwVUScDmwAtmfmUmB7mQe4GFha/tYBt0DjTQK4DjgXOAe4bvSNQpLUGZOGfmYeyMwflumfA7uARcAqYHPpthm4rEyvAr6cDQ8ACyLiVOAiYFtmHsrMw8A2YGVbq5EkHdOUvsiNiH7gLOBBoC8zD0DjjQF4c+m2CHh2zGr7SttE7ZKkDmn6i9yImA98A/hIZv4sIibsOk5bHqP96P2so3FaiL6+PoaGhpod4qsMDw+3tH4vsZbxrV820pbtTFffCZ0fw0y9DnyN9a521tNU6EfEa2kE/m2Z+c3S/HxEnJqZB8rpm4OlfR+wZMzqi4H9pX3wqPaho/eVmRuBjQADAwM5ODh4dJemDQ0N0cr6vcRaxndll6/eWb9shJt2dvYiuL1XDM7Idn2N9a521tPM1TsB3ArsyszPjFm0FRi9AmcNcNeY9g+Wq3iWA0fK6Z97gQsjYmH5AvfC0iZJ6pBmDlHOAz4A7IyIH5W2vwJuBG6PiLXAM8DlZdm3gEuAPcBLwIcAMvNQRHwCeKj0+3hmHmpLFZ
Kkpkwa+pn5fcY/Hw9wwTj9E7hqgm1tAjZNZYCSpPbxNgySVBFDX5IqYuhLUkUMfUmqiKEvSRUx9CWpIoa+JFXE0Jekihj6klQRQ1+SKmLoS1JFDH1JqoihL0kVMfQlqSKGviRVxNCXpIoY+pJUEUNfkipi6EtSRQx9SaqIoS9JFTH0Jakihr4kVcTQl6SKGPqSVBFDX5IqYuhLUkUMfUmqyKShHxGbIuJgRDw2pu3kiNgWEbvL48LSHhHxuYjYExGPRsTZY9ZZU/rvjog1M1OOJOlYmjnS/xKw8qi2DcD2zFwKbC/zABcDS8vfOuAWaLxJANcB5wLnANeNvlFIkjpn0tDPzO8Bh45qXgVsLtObgcvGtH85Gx4AFkTEqcBFwLbMPJSZh4FtvPqNRJI0wyIzJ+8U0Q/cnZlnlPkXMnPBmOWHM3NhRNwN3JiZ3y/t24FrgEHg+My8vrR/DHg5Mz89zr7W0fiUQF9f3zu3bNky7eKGh4eZP3/+tNfvJdYyvp3PHWnLdqar7wR4/uXO7nPZopNmZLu+xnrXVOtZsWLFjswcGG/ZvLaNqiHGactjtL+6MXMjsBFgYGAgBwcHpz2YoaEhWlm/l1jL+K7ccE9btjNd65eNcNPOdv8zOra9VwzOyHZ9jfWudtYz3at3ni+nbSiPB0v7PmDJmH6Lgf3HaJckddB0Q38rMHoFzhrgrjHtHyxX8SwHjmTmAeBe4MKIWFi+wL2wtEmSOmjSz6UR8TUa5+RPiYh9NK7CuRG4PSLWAs8Al5fu3wIuAfYALwEfAsjMQxHxCeCh0u/jmXn0l8OSpBk2aehn5vsnWHTBOH0TuGqC7WwCNk1pdJKktvIXuZJUEUNfkipi6EtSRQx9SaqIoS9JFTH0Jakihr4kVcTQl6SKGPqSVBFDX5IqYuhLUkUMfUmqiKEvSRUx9CWpIoa+JFWks/9zT1Wvf8M9rF820vX/t61UK4/0Jakihr4kVcTQl6SKGPqSVBFDX5IqYuhLUkW8ZFOaZfrbfLnr3hsvbev21Ns80pekihj6klQRQ1+SKmLoS1JF/CJXqtzoF8PtvCeSXw73ro4f6UfEyoh4MiL2RMSGTu9fkmrW0dCPiOOAzwMXA6cD74+I0zs5BkmqWadP75wD7MnMpwAiYguwCniiw+NQk9p9Tbjq0O3XTTdu3z1bTmlFZnZuZxHvA1Zm5p+V+Q8A52bmh8f0WQesK7NvBZ5sYZenAD9tYf1eYi29yVp601yqBaZez+9m5pvGW9DpI/0Yp+0V7zqZuRHY2JadRTycmQPt2Fa3WUtvspbeNJdqgfbW0+kvcvcBS8bMLwb2d3gMklStTof+Q8DSiDgtIl4HrAa2dngMklStjp7eycyRiPgwcC9wHLApMx+fwV225TRRj7CW3mQtvWku1QJtrKejX+RKkrrL2zBIUkUMfUmqyJwM/blyq4eIWBIR90fEroh4PCKu7vaYWhURx0XEIxFxd7fH0qqIWBARd0TEj8tz9IfdHtN0RcRfltfYYxHxtYg4vttjalZEbIqIgxHx2Ji2kyNiW0TsLo8LuznGZk1Qy6fKa+zRiLgzIha0so85F/pz7FYPI8D6zHwbsBy4ahbXMupqYFe3B9Emfw98JzN/H3gHs7SuiFgE/AUwkJln0LjIYnV3RzUlXwJWHtW2AdiemUuB7WV+NvgSr65lG3BGZv4B8O/Ata3sYM6FPmNu9ZCZvwJGb/Uw62Tmgcz8YZn+OY1QWdTdUU1fRCwGLgW+2O2xtCoifgv4Y+BWgMz8VWa+0N1RtWQecEJEzAPewCz6/Uxmfg84dFTzKmBzmd4MXNbRQU3TeLVk5nczc6TMPkDj903TNhdDfxHw7Jj5fczioBwVEf3AWcCD3R1JSz4LfBT4324PpA1+D/gJ8M/ldNUXI+LEbg9qOjLzOeDTwDPAAeBIZn63u6NqWV9mHoDGwRPw5i6Pp13+FPh2KxuYi6E/6a0eZpuImA98A/hIZv6s2+OZjo
h4D3AwM3d0eyxtMg84G7glM88CXmT2nEJ4hXK+exVwGvA7wIkR8SfdHZWOFhF/TeOU722tbGcuhv6cutVDRLyWRuDflpnf7PZ4WnAe8N6I2EvjlNv5EfHV7g6pJfuAfZk5+snrDhpvArPRu4CnM/Mnmfk/wDeBP+rymFr1fEScClAeD3Z5PC2JiDXAe4ArssUfV83F0J8zt3qIiKBxznhXZn6m2+NpRWZem5mLM7OfxnNyX2bO2qPJzPwv4NmIeGtpuoDZe4vwZ4DlEfGG8pq7gFn6pfQYW4E1ZXoNcFcXx9KSiFgJXAO8NzNfanV7cy70yxceo7d62AXcPsO3ephJ5wEfoHFU/KPyd0m3B6Xf+HPgtoh4FDgT+Lsuj2dayqeVO4AfAjtp5MKsuY1BRHwN+FfgrRGxLyLWAjcC746I3cC7y3zPm6CWfwDeCGwrGfCPLe3D2zBIUj3m3JG+JGlihr4kVcTQl6SKGPqSVBFDX5IqYuhLUkUMfUmqyP8BvI2y2c0ksJsAAAAASUVORK5CYII=\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"training_set[\"Landsize_log\"].hist(bins=15)"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [],
"source": [
"Landsize_log_mean = training_set[\"Landsize_log\"].mean()\n",
"training_set[\"Landsize_log\"].fillna(value=Landsize_log_mean, inplace=True)\n",
"training_set[\"Landsize_log\"] = training_set[\"Landsize_log\"].apply(lambda x: Landsize_log_mean if x == 0 else x) "
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"training_set.drop('Landsize', axis=1, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# --------------------------------------------------------------------"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"training_set.fillna(value=training_set[[\"BuildingArea\", \"YearBuilt\", \"Propertycount\"]].mean(), inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# --------------------------------------------------------------------"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Rooms False\n",
"Type False\n",
"Price False\n",
"Method False\n",
"SellerG False\n",
"Distance False\n",
"Postcode False\n",
"Bedroom2 False\n",
"Bathroom False\n",
"Car False\n",
"BuildingArea False\n",
"YearBuilt False\n",
"Propertycount False\n",
"sold_year False\n",
"Landsize_log False\n",
"dtype: bool"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.isnull().any()"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Rooms int64\n",
"Type object\n",
"Price float64\n",
"Method object\n",
"SellerG object\n",
"Distance float64\n",
"Postcode float64\n",
"Bedroom2 float64\n",
"Bathroom float64\n",
"Car float64\n",
"BuildingArea float64\n",
"YearBuilt float64\n",
"Propertycount float64\n",
"sold_year int64\n",
"Landsize_log float64\n",
"dtype: object"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Next, look more closely at the categorical/text columns. After the analysis we have to convert them to numerical data.\n",
" * Before we convert categorical data to numbers, we must check whether the individual categories have any effect on the mean of the target variable.\n",
" * Example: suppose 10 houses belong to Method **\"S\"** and 100 houses belong to Method **\"SP\"**.\n",
" The total Price of the \"S\" houses is 100000 and the total Price of the \"SP\" houses is 1000000, so **the average (mean) Price of \"S\" and \"SP\" houses is the same: 10000**. Statistically, that means **Method \"S\" vs. \"SP\" has no direct effect on the Price variable** (<font color='red'>note: from a business point of view there might still be an effect; check the statistical reading with a business analyst</font>). In this scenario we should go with **One-Hot Encoding or dummy-variable** creation. Otherwise we can go with **Label Encoding**."
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Type', 'Method', 'SellerG'], dtype='object')"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.select_dtypes(['object']).columns"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"S 13955\n",
"SP 2882\n",
"PI 2626\n",
"VB 2178\n",
"SA 155\n",
"Name: Method, dtype: int64"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set['Method'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Is there a significant effect of different categories in Method on Price"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Method\n",
"PI 1.122780e+06\n",
"S 1.049267e+06\n",
"SA 9.966541e+05\n",
"SP 8.802131e+05\n",
"VB 1.193416e+06\n",
"Name: Price, dtype: float64"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.groupby(['Method'])['Price'].mean()"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(-21.325000000000003, 0.5, 'Mean Price')"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 360x360 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.catplot(data=training_set, x=\"Method\", y=\"Price\", kind=\"bar\", ci=None)\n",
"plt.ylabel(\"Mean Price\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The visualization above shows a small difference in mean Price across the categories of Method. We need hypothesis testing to check whether this holds in the population."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ANOVA: Tukey's Honestly Significant Difference test"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Multiple Comparison of Means - Tukey HSD, FWER=0.05 \n",
"==================================================================\n",
"group1 group2 meandiff p-adj lower upper reject\n",
"------------------------------------------------------------------\n",
" PI S -73512.4251 0.001 -110261.7338 -36763.1163 True\n",
" PI SA -126125.7769 0.1129 -268930.766 16679.2122 False\n",
" PI SP -242566.7525 0.001 -289174.5642 -195958.9408 True\n",
" PI VB 70636.5441 0.0011 20566.1169 120706.9713 True\n",
" S SA -52613.3518 0.8216 -192150.1757 86923.4721 False\n",
" S SP -169054.3274 0.001 -204403.2795 -133705.3754 True\n",
" S VB 144148.9692 0.001 104345.6027 183952.3356 True\n",
" SA SP -116440.9756 0.1688 -258892.0255 26010.0743 False\n",
" SA VB 196762.321 0.0017 53141.0788 340383.5631 True\n",
" SP VB 313203.2966 0.001 264151.4424 362255.1507 True\n",
"------------------------------------------------------------------\n"
]
}
],
"source": [
"import statsmodels.stats.multicomp as multi\n",
"\n",
"mc1 = multi.MultiComparison(training_set[\"Price\"],training_set[\"Method\"])\n",
"res1 = mc1.tukeyhsd()\n",
"print(res1.summary())"
]
},
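{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the omnibus one-way ANOVA F-test, which can be run alongside the pairwise Tukey HSD comparisons. It uses `scipy.stats.f_oneway` on synthetic prices; the `demo` frame, its group means and the sample sizes are made-up assumptions, not the Melbourne data. A small p-value only says that at least one group mean differs; the Tukey table is what identifies the pairs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from scipy import stats\n",
"\n",
"# Synthetic stand-in data (assumption): three sale methods with different mean prices\n",
"rng = np.random.default_rng(0)\n",
"demo = pd.DataFrame({\n",
"    'Method': np.repeat(['S', 'SP', 'VB'], 200),\n",
"    'Price': np.concatenate([\n",
"        rng.normal(1_050_000, 300_000, 200),\n",
"        rng.normal(880_000, 300_000, 200),\n",
"        rng.normal(1_190_000, 300_000, 200),\n",
"    ]),\n",
"})\n",
"\n",
"# One array of prices per category, then a single F-test across all groups\n",
"groups = [grp['Price'].to_numpy() for _, grp in demo.groupby('Method')]\n",
"f_stat, p_value = stats.f_oneway(*groups)\n",
"print(f'F = {f_stat:.2f}, p = {p_value:.3g}')"
]
},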
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Most of the pairwise Tukey HSD comparisons show a significant difference in \"Price\" between the categories of \"Method\". Hence we could use Label Encoding (note that the cells below one-hot encode Method with `pd.get_dummies`)"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
"lst_all_method_cats = ['S', 'SP','PI','PN','SN','NB','VB','W','SA','SS','N/A']"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"S 13955\n",
"SP 2882\n",
"PI 2626\n",
"VB 2178\n",
"SA 155\n",
"N/A 0\n",
"SS 0\n",
"W 0\n",
"NB 0\n",
"SN 0\n",
"PN 0\n",
"Name: Method, dtype: int64"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.Method = pd.Categorical(training_set.Method, categories=lst_all_method_cats)\n",
"training_set.Method.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
"training_set = pd.get_dummies(training_set, columns=[\"Method\"], prefix=[\"Method\"], drop_first=True)"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Method_SP',\n",
" 'Method_PI',\n",
" 'Method_PN',\n",
" 'Method_SN',\n",
" 'Method_NB',\n",
" 'Method_VB',\n",
" 'Method_W',\n",
" 'Method_SA',\n",
" 'Method_SS',\n",
" 'Method_N/A']"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[col for col in training_set.columns if 'Method' in col ]"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Method_SP</th>\n",
" <th>Method_PI</th>\n",
" <th>Method_PN</th>\n",
" <th>Method_SN</th>\n",
" <th>Method_NB</th>\n",
" <th>Method_VB</th>\n",
" <th>Method_W</th>\n",
" <th>Method_SA</th>\n",
" <th>Method_SS</th>\n",
" <th>Method_N/A</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>17957</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6753</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1833</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19795</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17054</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Method_SP Method_PI Method_PN Method_SN Method_NB Method_VB \\\n",
"17957 0 0 0 0 0 1 \n",
"6753 1 0 0 0 0 0 \n",
"1833 0 0 0 0 0 1 \n",
"19795 0 0 0 0 0 0 \n",
"17054 0 0 0 0 0 0 \n",
"\n",
" Method_W Method_SA Method_SS Method_N/A \n",
"17957 0 0 0 0 \n",
"6753 0 0 0 0 \n",
"1833 0 0 0 0 \n",
"19795 0 0 0 0 \n",
"17054 0 0 0 0 "
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set[[col for col in training_set.columns if 'Method' in col]].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# --------------------------------------------------------------------"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"h 14781\n",
"u 4733\n",
"t 2282\n",
"Name: Type, dtype: int64"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set['Type'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Type\n",
"h 1.203264e+06\n",
"t 9.375948e+05\n",
"u 6.246408e+05\n",
"Name: Price, dtype: float64"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.groupby(['Type'])['Price'].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### We see a significant difference, hence we can go with Label Encoding"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 14781\n",
"2 4733\n",
"1 2282\n",
"Name: Type, dtype: int64"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set['Type'] = pd.Categorical(training_set['Type'])\n",
"training_set['Type'] = training_set['Type'].cat.codes\n",
"training_set['Type'].value_counts()"
]
},
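{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of how `cat.codes` assigns integers: the categories are stored in sorted order, so `h`, `t`, `u` become 0, 1, 2, which matches the counts above. The `demo_type` series is a made-up example. Keeping the code-to-label mapping makes it possible to decode the values later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical sample of the Type column\n",
"demo_type = pd.Series(['h', 'u', 't', 'h', 'u'])\n",
"type_cat = pd.Categorical(demo_type)\n",
"\n",
"# Categories are stored in sorted order; each code is a position in that list\n",
"code_to_label = dict(enumerate(type_cat.categories))\n",
"print(code_to_label)\n",
"print(list(type_cat.codes))"
]
},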
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# --------------------------------------------------------------------"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(330,)"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set['SellerG'].value_counts().shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Check the behaviour of the top sellers."
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [],
"source": [
"top_sellers = list(training_set['SellerG'].value_counts()[training_set['SellerG'].value_counts().values > 100].index)"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Nelson', 'Jellis', 'Barry', 'hockingstuart', 'Ray', 'Buxton', 'Marshall', 'Biggin', 'Fletchers', 'Brad', 'Woodards', 'McGrath', 'Greg', 'YPA', 'Noel', 'Jas', 'Stockdale', 'Miles', 'Sweeney', 'RT', 'Harcourts', 'Gary', 'Hodges', 'Raine', 'HAR', 'Love', 'RW', 'Kay', \"O'Brien\", 'Williams', 'Village', 'Douglas', 'C21', 'Chisholm', 'Purplebricks']\n"
]
}
],
"source": [
"print(top_sellers)"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [],
"source": [
"top_sellers_train_set = training_set[training_set['SellerG'].isin(top_sellers)]"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x720 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"top_sel = sns.catplot(data=top_sellers_train_set, x=\"SellerG\", y=\"Price\", kind=\"bar\", height=10, ci=None)\n",
"plt.ylabel(\"Mean Price\")\n",
"l1 = top_sel.set_yticklabels(rotation = 45)\n",
"l2 = top_sel.set_xticklabels(rotation = 45)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### We see a significant difference, hence we can go with Label Encoding"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"199 2174\n",
"138 2047\n",
"21 1919\n",
"319 1694\n",
"245 1263\n",
"Name: SellerG, dtype: int64"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set['SellerG'] = pd.Categorical(training_set['SellerG'])\n",
"training_set['SellerG'] = training_set['SellerG'].cat.codes\n",
"training_set['SellerG'].value_counts().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Now the data is ready, let's build a Linear Regression model with scikit-learn's LinearRegression"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Rooms int64\n",
"Type int8\n",
"Price float64\n",
"SellerG int16\n",
"Distance float64\n",
"Postcode float64\n",
"Bedroom2 float64\n",
"Bathroom float64\n",
"Car float64\n",
"BuildingArea float64\n",
"YearBuilt float64\n",
"Propertycount float64\n",
"sold_year int64\n",
"Landsize_log float64\n",
"Method_SP uint8\n",
"Method_PI uint8\n",
"Method_PN uint8\n",
"Method_SN uint8\n",
"Method_NB uint8\n",
"Method_VB uint8\n",
"Method_W uint8\n",
"Method_SA uint8\n",
"Method_SS uint8\n",
"Method_N/A uint8\n",
"dtype: object"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_set.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [],
"source": [
"input_features = [x for x in training_set.columns if x not in ['Price']]"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Rooms',\n",
" 'Type',\n",
" 'SellerG',\n",
" 'Distance',\n",
" 'Postcode',\n",
" 'Bedroom2',\n",
" 'Bathroom',\n",
" 'Car',\n",
" 'BuildingArea',\n",
" 'YearBuilt',\n",
" 'Propertycount',\n",
" 'sold_year',\n",
" 'Landsize_log',\n",
" 'Method_SP',\n",
" 'Method_PI',\n",
" 'Method_PN',\n",
" 'Method_SN',\n",
" 'Method_NB',\n",
" 'Method_VB',\n",
" 'Method_W',\n",
" 'Method_SA',\n",
" 'Method_SS',\n",
" 'Method_N/A']"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"input_features"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [],
"source": [
"X_train = training_set[input_features].values\n",
"y_train = training_set['Price'].values"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [],
"source": [
"lr = LinearRegression()\n",
"lr_model = lr.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [],
"source": [
"y_train_pred = lr_model.predict(X_train)"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.48467536024042435\n"
]
}
],
"source": [
"from sklearn.metrics import r2_score\n",
"r2 = r2_score(y_train, y_train_pred)\n",
"print(r2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Error analysis and finetuning the model"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [],
"source": [
"residuals = y_train - y_train_pred"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [],
"source": [
"def standard_scale(val):\n",
" mean_vals = np.mean(val)\n",
" std_vals = np.std(val)\n",
" z_vals = [(x - mean_vals)/std_vals for x in val]\n",
" return z_vals"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [],
"source": [
"def plotScatterPlot(residuals, y):\n",
" std_residuals = standard_scale(residuals)\n",
" std_predicted = standard_scale(y)\n",
" plt.scatter(std_predicted, std_residuals, alpha = 0.6, color='y')\n",
" l = plt.axhline(y=0, color='r')\n",
" plt.xlabel('STD Predicted values')\n",
" plt.ylabel('STD Residual values')\n",
" plt.axis([-6, 6, -10, 10])\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### plotScatterPlot(residuals, y_train_pred)"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plotScatterPlot(residuals, y_train_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### We found heteroscedasticity in the residuals. One solution is to apply a log transformation to the y variable and rebuild the model. Let's try that."
]
},
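{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to back up the visual impression from a residual plot is the Breusch-Pagan test in statsmodels. The sketch below is only an illustration on synthetic data where the noise grows with `x`; the variable names and the data are assumptions, not part of the analysis above. A small p-value suggests that the residual variance depends on the regressors."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import statsmodels.api as sm\n",
"from statsmodels.stats.diagnostic import het_breuschpagan\n",
"\n",
"# Synthetic data (assumption): the noise standard deviation grows with x\n",
"rng = np.random.default_rng(1)\n",
"x = rng.uniform(1.0, 10.0, 500)\n",
"y = 2.0 * x + rng.normal(0.0, x)\n",
"\n",
"# Fit ordinary least squares with an intercept\n",
"exog = sm.add_constant(x)\n",
"ols_fit = sm.OLS(y, exog).fit()\n",
"\n",
"# Breusch-Pagan regresses the squared residuals on the regressors\n",
"lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_fit.resid, exog)\n",
"print(f'LM statistic = {lm_stat:.2f}, p-value = {lm_pvalue:.3g}')"
]
},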
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"numpy.ndarray"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(y_train)"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [],
"source": [
"y_train_log = np.log(y_train)"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [],
"source": [
"lr_model = lr_model.fit(X_train, y_train_log)"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [],
"source": [
"y_train_log_pred = lr_model.predict(X_train)"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.5775719437885163\n"
]
}
],
"source": [
"r2 = r2_score(y_train_log, y_train_log_pred)\n",
"print(r2)"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [],
"source": [
"res = y_train_log - y_train_log_pred"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plotScatterPlot(res, y_train_log_pred)"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.ensemble import GradientBoostingRegressor\n",
"gbrt = GradientBoostingRegressor(max_depth=4, n_estimators=300, learning_rate=0.1, random_state=42)\n",
"gbrt.fit(X_train, y_train)\n",
"y_pred_gbrt = gbrt.predict(X_train)"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.8431136902758644\n"
]
}
],
"source": [
"r2 = r2_score(y_train, y_pred_gbrt)\n",
"print(r2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Apply model on TEST set and finetune model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Apply EDA outcome on TEST set"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [],
"source": [
"test_set = test_set.drop(test_set[test_set['Postcode'].isnull()].index)\n",
"\n",
"test_set.drop([\"Address\", \"CouncilArea\", \"Regionname\", \"Lattitude\", \"Suburb\", \"Longtitude\"], axis=1, inplace=True)\n",
"\n",
"test_set.Date = pd.to_datetime(test_set.Date)\n",
"test_set[\"sold_year\"] = test_set.Date.apply(lambda x: x.year)\n",
"test_set.drop(\"Date\", axis=1, inplace=True)\n",
"\n",
"test_set.Bedroom2.fillna(value=test_set.Bedroom2.mean(), inplace=True)\n",
"test_set.Bathroom.fillna(value=test_set.Bathroom.mean(), inplace=True)\n",
"test_set.Car.fillna(value=test_set.Car.median(), inplace=True)\n",
"\n",
"# NaN > 0 evaluates to False, so this mask also excludes missing Landsize values\n",
"test_set[\"Landsize_log\"] = np.log(test_set.loc[test_set.Landsize > 0, 'Landsize'])\n",
"Landsize_log_mean = test_set[\"Landsize_log\"].mean()\n",
"test_set[\"Landsize_log\"].fillna(value=Landsize_log_mean, inplace=True)\n",
"test_set[\"Landsize_log\"] = test_set[\"Landsize_log\"].apply(lambda x: Landsize_log_mean if x == 0 else x) \n",
"test_set.drop('Landsize', axis=1, inplace=True)\n",
"\n",
"test_set.fillna(value=test_set[[\"BuildingArea\", \"YearBuilt\", \"Propertycount\"]].mean(), inplace=True)\n",
"\n",
"lst_all_method_cats = ['S', 'SP','PI','PN','SN','NB','VB','W','SA','SS','N/A']\n",
"test_set.Method = pd.Categorical(test_set.Method, categories=lst_all_method_cats)\n",
"test_set = pd.get_dummies(test_set, columns=[\"Method\"], prefix=[\"Method\"], drop_first=True)\n",
"\n",
"test_set['Type'] = pd.Categorical(test_set['Type'])\n",
"test_set['Type'] = test_set['Type'].cat.codes\n",
"\n",
"test_set['SellerG'] = pd.Categorical(test_set['SellerG'])\n",
"test_set['SellerG'] = test_set['SellerG'].cat.codes"
]
},
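{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above recomputes means, medians and category codes from the test set itself, so the same seller can receive a different integer code in the training and test data. A hedged sketch of one alternative, on the made-up frames `train_demo` and `test_demo` (hypothetical names): learn the fill values and the category list on the training data and reuse them for the test data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Made-up frames standing in for the training and test sets\n",
"train_demo = pd.DataFrame({'SellerG': ['Nelson', 'Jellis', 'Barry', 'Nelson'],\n",
"                           'Car': [1.0, 2.0, None, 2.0]})\n",
"test_demo = pd.DataFrame({'SellerG': ['Jellis', 'Unseen', 'Barry'],\n",
"                          'Car': [None, 1.0, 3.0]})\n",
"\n",
"# Learn the category list and the fill value from the training data only\n",
"seller_categories = pd.Categorical(train_demo['SellerG']).categories\n",
"car_median = train_demo['Car'].median()\n",
"\n",
"# Apply the same mapping and fill value to both sets\n",
"for frame in (train_demo, test_demo):\n",
"    frame['SellerG'] = pd.Categorical(frame['SellerG'], categories=seller_categories).codes\n",
"    frame['Car'] = frame['Car'].fillna(car_median)\n",
"\n",
"# Sellers that never appear in the training data are encoded as -1\n",
"print(test_demo)"
]
},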
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [],
"source": [
"input_features = [x for x in training_set.columns if x not in ['Price']]\n",
"\n",
"X_test = test_set[input_features].values\n",
"y_test = test_set['Price'].values"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [],
"source": [
"y_test_pred_gbrt = gbrt.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.7504911229602481\n"
]
}
],
"source": [
"r2 = r2_score(y_test, y_test_pred_gbrt)\n",
"print(r2)"
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"GridSearchCV(cv=3, error_score='raise-deprecating',\n",
" estimator=GradientBoostingRegressor(alpha=0.9,\n",
" criterion='friedman_mse',\n",
" init=None, learning_rate=0.1,\n",
" loss='ls', max_depth=3,\n",
" max_features=None,\n",
" max_leaf_nodes=None,\n",
" min_impurity_decrease=0.0,\n",
" min_impurity_split=None,\n",
" min_samples_leaf=1,\n",
" min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0,\n",
" n_estimators=100,\n",
" n_iter_no_change=None,\n",
" presort='auto',\n",
" random_state=15, subsample=1.0,\n",
" tol=0.0001,\n",
" validation_fraction=0.1,\n",
" verbose=0, warm_start=False),\n",
" iid='warn', n_jobs=None,\n",
" param_grid=[{'learning_rate': [0.009, 0.1, 0.11, 0.12],\n",
" 'max_depth': [6, 7, 8], 'n_estimators': [300, 350]}],\n",
" pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n",
" scoring='neg_mean_squared_error', verbose=0)"
]
},
"execution_count": 127,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"param_grid = [\n",
" {'max_depth':[6,7,8], \n",
" 'n_estimators':[300, 350], \n",
" 'learning_rate':[0.09, 0.1, 0.11, 0.12]} ]\n",
"grd_gbr_model = GradientBoostingRegressor(random_state=15)\n",
"grid_search = GridSearchCV(grd_gbr_model, param_grid, cv=3,\n",
" scoring='neg_mean_squared_error')\n",
"grid_search.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 300}"
]
},
"execution_count": 128,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grid_search.best_params_"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [],
"source": [
"best_y_pred_gbrt = grid_search.best_estimator_.predict(X_train)"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.9320388408375453\n"
]
}
],
"source": [
"r2 = r2_score(y_train, best_y_pred_gbrt)\n",
"print(r2)"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [],
"source": [
"best_y_test_pred_gbrt = grid_search.best_estimator_.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.7575570409693281\n"
]
}
],
"source": [
"r2 = r2_score(y_test, best_y_test_pred_gbrt)\n",
"print(r2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Use Log transformed Y - Variable to train"
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"GridSearchCV(cv=3, error_score='raise-deprecating',\n",
" estimator=GradientBoostingRegressor(alpha=0.9,\n",
" criterion='friedman_mse',\n",
" init=None, learning_rate=0.1,\n",
" loss='ls', max_depth=3,\n",
" max_features=None,\n",
" max_leaf_nodes=None,\n",
" min_impurity_decrease=0.0,\n",
" min_impurity_split=None,\n",
" min_samples_leaf=1,\n",
" min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0,\n",
" n_estimators=100,\n",
" n_iter_no_change=None,\n",
" presort='auto',\n",
" random_state=15, subsample=1.0,\n",
" tol=0.0001,\n",
" validation_fraction=0.1,\n",
" verbose=0, warm_start=False),\n",
" iid='warn', n_jobs=None,\n",
" param_grid=[{'learning_rate': [0.009, 0.1, 0.11, 0.12],\n",
" 'max_depth': [6, 7, 8], 'n_estimators': [300, 350]}],\n",
" pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n",
" scoring='neg_mean_squared_error', verbose=0)"
]
},
"execution_count": 133,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grd_gbr_model = GradientBoostingRegressor(random_state=15)\n",
"grid_search = GridSearchCV(grd_gbr_model, param_grid, cv=3,\n",
" scoring='neg_mean_squared_error')\n",
"grid_search.fit(X_train, y_train_log)"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'learning_rate': 0.1, 'max_depth': 6, 'n_estimators': 350}"
]
},
"execution_count": 134,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grid_search.best_params_"
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [],
"source": [
"best_y_pred_gbrt_log = grid_search.best_estimator_.predict(X_train)"
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.9206247970542722\n"
]
}
],
"source": [
"r2 = r2_score(y_train_log, best_y_pred_gbrt_log)\n",
"print(r2)"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [],
"source": [
"y_test_log = np.log(y_test)"
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [],
"source": [
"best_y_test_pred_gbrt_log = grid_search.best_estimator_.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.8348759436378475\n"
]
}
],
"source": [
"r2 = r2_score(y_test_log, best_y_test_pred_gbrt_log)\n",
"print(r2)"
]
},
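{
"cell_type": "markdown",
"metadata": {},
"source": [
"The R2 scores in this part are computed on log(Price), so they are not directly comparable with the earlier scores on the original Price scale. A hedged sketch on synthetic data (the data-generating formula and names such as `X_demo` are assumptions): train on the log target, then map the predictions back with `np.exp` before scoring in price units."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.ensemble import GradientBoostingRegressor\n",
"from sklearn.metrics import r2_score\n",
"\n",
"# Synthetic, strictly positive target whose log is roughly linear in the features\n",
"rng = np.random.default_rng(42)\n",
"X_demo = rng.uniform(0.0, 10.0, size=(500, 3))\n",
"y_demo = np.exp(13.0 + 0.1 * X_demo[:, 0] - 0.05 * X_demo[:, 1] + rng.normal(0.0, 0.2, 500))\n",
"\n",
"# Simple hold-out split: first 400 rows for training, last 100 for testing\n",
"X_tr, X_te = X_demo[:400], X_demo[400:]\n",
"y_tr, y_te = y_demo[:400], y_demo[400:]\n",
"\n",
"# Train on log(y), predict on the log scale\n",
"demo_model = GradientBoostingRegressor(random_state=42).fit(X_tr, np.log(y_tr))\n",
"pred_log = demo_model.predict(X_te)\n",
"\n",
"# Score once on the log scale and once after mapping back to price units\n",
"print('R2 on log scale:  ', r2_score(np.log(y_te), pred_log))\n",
"print('R2 on price scale:', r2_score(y_te, np.exp(pred_log)))"
]
},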
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Play with model Hyper-parameters to fix overfitting"
]
},
{
"cell_type": "code",
"execution_count": 180,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.ensemble import GradientBoostingRegressor\n",
"gbrt1 = GradientBoostingRegressor(max_depth=2, n_estimators=550, learning_rate=0.1, random_state=42)\n",
"gbrt1.fit(X_train, y_train_log)\n",
"y_pred_gbrt_log1 = gbrt1.predict(X_train)"
]
},
{
"cell_type": "code",
"execution_count": 181,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.8448632507182897\n"
]
}
],
"source": [
"r2 = r2_score(y_train_log, y_pred_gbrt_log1)\n",
"print(r2)"
]
},
{
"cell_type": "code",
"execution_count": 182,
"metadata": {},
"outputs": [],
"source": [
"y_test_pred_gbrt_log1 = gbrt1.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 183,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.8202256090463653\n"
]
}
],
"source": [
"r2 = r2_score(y_test_log, y_test_pred_gbrt_log1)\n",
"print(r2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}