@Bhasha03
Created December 12, 2019 05:10
Created on Cognitive Class Labs
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
" <a href=\"https://www.bigdatauniversity.com\"><img src = \"https://ibm.box.com/shared/static/ugcqz6ohbvff804xp84y4kqnvvk3bq1g.png\" width = 300, align = \"center\"></a>\n",
"\n",
"<h1 align=center><font size = 5>Data Analysis with Python</font></h1>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# House Sales in King County, USA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>id</b> :a notation for a house\n",
"\n",
"<b> date</b>: Date house was sold\n",
"\n",
"\n",
"<b>price</b>: Price is prediction target\n",
"\n",
"\n",
"<b>bedrooms</b>: Number of Bedrooms/House\n",
"\n",
"\n",
"<b>bathrooms</b>: Number of bathrooms/bedrooms\n",
"\n",
"<b>sqft_living</b>: square footage of the home\n",
"\n",
"<b>sqft_lot</b>: square footage of the lot\n",
"\n",
"\n",
"<b>floors</b> :Total floors (levels) in house\n",
"\n",
"\n",
"<b>waterfront</b> :House which has a view to a waterfront\n",
"\n",
"\n",
"<b>view</b>: Has been viewed\n",
"\n",
"\n",
"<b>condition</b> :How good the condition is Overall\n",
"\n",
"<b>grade</b>: overall grade given to the housing unit, based on King County grading system\n",
"\n",
"\n",
"<b>sqft_above</b> :square footage of house apart from basement\n",
"\n",
"\n",
"<b>sqft_basement</b>: square footage of the basement\n",
"\n",
"<b>yr_built</b> :Built Year\n",
"\n",
"\n",
"<b>yr_renovated</b> :Year when house was renovated\n",
"\n",
"<b>zipcode</b>:zip code\n",
"\n",
"\n",
"<b>lat</b>: Latitude coordinate\n",
"\n",
"<b>long</b>: Longitude coordinate\n",
"\n",
"<b>sqft_living15</b> :Living room area in 2015(implies-- some renovations) This might or might not have affected the lotsize area\n",
"\n",
"\n",
"<b>sqft_lot15</b> :lotSize area in 2015(implies-- some renovations)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# runnin'"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import seaborn as sns\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler,PolynomialFeatures\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You will require the following libraries "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1.0 Importing the Data "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Load the csv: "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"file_name='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/kc_house_data_NaN.csv'\n",
"df=pd.read_csv(file_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"we use the method <code>head</code> to display the first 5 columns of the dataframe."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>id</th>\n",
" <th>date</th>\n",
" <th>price</th>\n",
" <th>bedrooms</th>\n",
" <th>bathrooms</th>\n",
" <th>sqft_living</th>\n",
" <th>sqft_lot</th>\n",
" <th>floors</th>\n",
" <th>waterfront</th>\n",
" <th>...</th>\n",
" <th>grade</th>\n",
" <th>sqft_above</th>\n",
" <th>sqft_basement</th>\n",
" <th>yr_built</th>\n",
" <th>yr_renovated</th>\n",
" <th>zipcode</th>\n",
" <th>lat</th>\n",
" <th>long</th>\n",
" <th>sqft_living15</th>\n",
" <th>sqft_lot15</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>7129300520</td>\n",
" <td>20141013T000000</td>\n",
" <td>221900.0</td>\n",
" <td>3.0</td>\n",
" <td>1.00</td>\n",
" <td>1180</td>\n",
" <td>5650</td>\n",
" <td>1.0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>7</td>\n",
" <td>1180</td>\n",
" <td>0</td>\n",
" <td>1955</td>\n",
" <td>0</td>\n",
" <td>98178</td>\n",
" <td>47.5112</td>\n",
" <td>-122.257</td>\n",
" <td>1340</td>\n",
" <td>5650</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>6414100192</td>\n",
" <td>20141209T000000</td>\n",
" <td>538000.0</td>\n",
" <td>3.0</td>\n",
" <td>2.25</td>\n",
" <td>2570</td>\n",
" <td>7242</td>\n",
" <td>2.0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>7</td>\n",
" <td>2170</td>\n",
" <td>400</td>\n",
" <td>1951</td>\n",
" <td>1991</td>\n",
" <td>98125</td>\n",
" <td>47.7210</td>\n",
" <td>-122.319</td>\n",
" <td>1690</td>\n",
" <td>7639</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>5631500400</td>\n",
" <td>20150225T000000</td>\n",
" <td>180000.0</td>\n",
" <td>2.0</td>\n",
" <td>1.00</td>\n",
" <td>770</td>\n",
" <td>10000</td>\n",
" <td>1.0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>6</td>\n",
" <td>770</td>\n",
" <td>0</td>\n",
" <td>1933</td>\n",
" <td>0</td>\n",
" <td>98028</td>\n",
" <td>47.7379</td>\n",
" <td>-122.233</td>\n",
" <td>2720</td>\n",
" <td>8062</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>2487200875</td>\n",
" <td>20141209T000000</td>\n",
" <td>604000.0</td>\n",
" <td>4.0</td>\n",
" <td>3.00</td>\n",
" <td>1960</td>\n",
" <td>5000</td>\n",
" <td>1.0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>7</td>\n",
" <td>1050</td>\n",
" <td>910</td>\n",
" <td>1965</td>\n",
" <td>0</td>\n",
" <td>98136</td>\n",
" <td>47.5208</td>\n",
" <td>-122.393</td>\n",
" <td>1360</td>\n",
" <td>5000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>1954400510</td>\n",
" <td>20150218T000000</td>\n",
" <td>510000.0</td>\n",
" <td>3.0</td>\n",
" <td>2.00</td>\n",
" <td>1680</td>\n",
" <td>8080</td>\n",
" <td>1.0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>8</td>\n",
" <td>1680</td>\n",
" <td>0</td>\n",
" <td>1987</td>\n",
" <td>0</td>\n",
" <td>98074</td>\n",
" <td>47.6168</td>\n",
" <td>-122.045</td>\n",
" <td>1800</td>\n",
" <td>7503</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 22 columns</p>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 id date price bedrooms bathrooms \\\n",
"0 0 7129300520 20141013T000000 221900.0 3.0 1.00 \n",
"1 1 6414100192 20141209T000000 538000.0 3.0 2.25 \n",
"2 2 5631500400 20150225T000000 180000.0 2.0 1.00 \n",
"3 3 2487200875 20141209T000000 604000.0 4.0 3.00 \n",
"4 4 1954400510 20150218T000000 510000.0 3.0 2.00 \n",
"\n",
" sqft_living sqft_lot floors waterfront ... grade sqft_above \\\n",
"0 1180 5650 1.0 0 ... 7 1180 \n",
"1 2570 7242 2.0 0 ... 7 2170 \n",
"2 770 10000 1.0 0 ... 6 770 \n",
"3 1960 5000 1.0 0 ... 7 1050 \n",
"4 1680 8080 1.0 0 ... 8 1680 \n",
"\n",
" sqft_basement yr_built yr_renovated zipcode lat long \\\n",
"0 0 1955 0 98178 47.5112 -122.257 \n",
"1 400 1951 1991 98125 47.7210 -122.319 \n",
"2 0 1933 0 98028 47.7379 -122.233 \n",
"3 910 1965 0 98136 47.5208 -122.393 \n",
"4 0 1987 0 98074 47.6168 -122.045 \n",
"\n",
" sqft_living15 sqft_lot15 \n",
"0 1340 5650 \n",
"1 1690 7639 \n",
"2 2720 8062 \n",
"3 1360 5000 \n",
"4 1800 7503 \n",
"\n",
"[5 rows x 22 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Question 1 \n",
"Display the data types of each column using the attribute dtype, then take a screenshot and submit it, include your code in the image. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Unnamed: 0 int64\n",
"id int64\n",
"date object\n",
"price float64\n",
"bedrooms float64\n",
"bathrooms float64\n",
"sqft_living int64\n",
"sqft_lot int64\n",
"floors float64\n",
"waterfront int64\n",
"view int64\n",
"condition int64\n",
"grade int64\n",
"sqft_above int64\n",
"sqft_basement int64\n",
"yr_built int64\n",
"yr_renovated int64\n",
"zipcode int64\n",
"lat float64\n",
"long float64\n",
"sqft_living15 int64\n",
"sqft_lot15 int64\n",
"dtype: object"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the method describe to obtain a statistical summary of the dataframe."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>id</th>\n",
" <th>price</th>\n",
" <th>bedrooms</th>\n",
" <th>bathrooms</th>\n",
" <th>sqft_living</th>\n",
" <th>sqft_lot</th>\n",
" <th>floors</th>\n",
" <th>waterfront</th>\n",
" <th>view</th>\n",
" <th>...</th>\n",
" <th>grade</th>\n",
" <th>sqft_above</th>\n",
" <th>sqft_basement</th>\n",
" <th>yr_built</th>\n",
" <th>yr_renovated</th>\n",
" <th>zipcode</th>\n",
" <th>lat</th>\n",
" <th>long</th>\n",
" <th>sqft_living15</th>\n",
" <th>sqft_lot15</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>21613.00000</td>\n",
" <td>2.161300e+04</td>\n",
" <td>2.161300e+04</td>\n",
" <td>21600.000000</td>\n",
" <td>21603.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>2.161300e+04</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>...</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>10806.00000</td>\n",
" <td>4.580302e+09</td>\n",
" <td>5.400881e+05</td>\n",
" <td>3.372870</td>\n",
" <td>2.115736</td>\n",
" <td>2079.899736</td>\n",
" <td>1.510697e+04</td>\n",
" <td>1.494309</td>\n",
" <td>0.007542</td>\n",
" <td>0.234303</td>\n",
" <td>...</td>\n",
" <td>7.656873</td>\n",
" <td>1788.390691</td>\n",
" <td>291.509045</td>\n",
" <td>1971.005136</td>\n",
" <td>84.402258</td>\n",
" <td>98077.939805</td>\n",
" <td>47.560053</td>\n",
" <td>-122.213896</td>\n",
" <td>1986.552492</td>\n",
" <td>12768.455652</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>6239.28002</td>\n",
" <td>2.876566e+09</td>\n",
" <td>3.671272e+05</td>\n",
" <td>0.926657</td>\n",
" <td>0.768996</td>\n",
" <td>918.440897</td>\n",
" <td>4.142051e+04</td>\n",
" <td>0.539989</td>\n",
" <td>0.086517</td>\n",
" <td>0.766318</td>\n",
" <td>...</td>\n",
" <td>1.175459</td>\n",
" <td>828.090978</td>\n",
" <td>442.575043</td>\n",
" <td>29.373411</td>\n",
" <td>401.679240</td>\n",
" <td>53.505026</td>\n",
" <td>0.138564</td>\n",
" <td>0.140828</td>\n",
" <td>685.391304</td>\n",
" <td>27304.179631</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.00000</td>\n",
" <td>1.000102e+06</td>\n",
" <td>7.500000e+04</td>\n",
" <td>1.000000</td>\n",
" <td>0.500000</td>\n",
" <td>290.000000</td>\n",
" <td>5.200000e+02</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>1.000000</td>\n",
" <td>290.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1900.000000</td>\n",
" <td>0.000000</td>\n",
" <td>98001.000000</td>\n",
" <td>47.155900</td>\n",
" <td>-122.519000</td>\n",
" <td>399.000000</td>\n",
" <td>651.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>5403.00000</td>\n",
" <td>2.123049e+09</td>\n",
" <td>3.219500e+05</td>\n",
" <td>3.000000</td>\n",
" <td>1.750000</td>\n",
" <td>1427.000000</td>\n",
" <td>5.040000e+03</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>7.000000</td>\n",
" <td>1190.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1951.000000</td>\n",
" <td>0.000000</td>\n",
" <td>98033.000000</td>\n",
" <td>47.471000</td>\n",
" <td>-122.328000</td>\n",
" <td>1490.000000</td>\n",
" <td>5100.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>10806.00000</td>\n",
" <td>3.904930e+09</td>\n",
" <td>4.500000e+05</td>\n",
" <td>3.000000</td>\n",
" <td>2.250000</td>\n",
" <td>1910.000000</td>\n",
" <td>7.618000e+03</td>\n",
" <td>1.500000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>7.000000</td>\n",
" <td>1560.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1975.000000</td>\n",
" <td>0.000000</td>\n",
" <td>98065.000000</td>\n",
" <td>47.571800</td>\n",
" <td>-122.230000</td>\n",
" <td>1840.000000</td>\n",
" <td>7620.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>16209.00000</td>\n",
" <td>7.308900e+09</td>\n",
" <td>6.450000e+05</td>\n",
" <td>4.000000</td>\n",
" <td>2.500000</td>\n",
" <td>2550.000000</td>\n",
" <td>1.068800e+04</td>\n",
" <td>2.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>8.000000</td>\n",
" <td>2210.000000</td>\n",
" <td>560.000000</td>\n",
" <td>1997.000000</td>\n",
" <td>0.000000</td>\n",
" <td>98118.000000</td>\n",
" <td>47.678000</td>\n",
" <td>-122.125000</td>\n",
" <td>2360.000000</td>\n",
" <td>10083.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>21612.00000</td>\n",
" <td>9.900000e+09</td>\n",
" <td>7.700000e+06</td>\n",
" <td>33.000000</td>\n",
" <td>8.000000</td>\n",
" <td>13540.000000</td>\n",
" <td>1.651359e+06</td>\n",
" <td>3.500000</td>\n",
" <td>1.000000</td>\n",
" <td>4.000000</td>\n",
" <td>...</td>\n",
" <td>13.000000</td>\n",
" <td>9410.000000</td>\n",
" <td>4820.000000</td>\n",
" <td>2015.000000</td>\n",
" <td>2015.000000</td>\n",
" <td>98199.000000</td>\n",
" <td>47.777600</td>\n",
" <td>-121.315000</td>\n",
" <td>6210.000000</td>\n",
" <td>871200.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows × 21 columns</p>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 id price bedrooms bathrooms \\\n",
"count 21613.00000 2.161300e+04 2.161300e+04 21600.000000 21603.000000 \n",
"mean 10806.00000 4.580302e+09 5.400881e+05 3.372870 2.115736 \n",
"std 6239.28002 2.876566e+09 3.671272e+05 0.926657 0.768996 \n",
"min 0.00000 1.000102e+06 7.500000e+04 1.000000 0.500000 \n",
"25% 5403.00000 2.123049e+09 3.219500e+05 3.000000 1.750000 \n",
"50% 10806.00000 3.904930e+09 4.500000e+05 3.000000 2.250000 \n",
"75% 16209.00000 7.308900e+09 6.450000e+05 4.000000 2.500000 \n",
"max 21612.00000 9.900000e+09 7.700000e+06 33.000000 8.000000 \n",
"\n",
" sqft_living sqft_lot floors waterfront view \\\n",
"count 21613.000000 2.161300e+04 21613.000000 21613.000000 21613.000000 \n",
"mean 2079.899736 1.510697e+04 1.494309 0.007542 0.234303 \n",
"std 918.440897 4.142051e+04 0.539989 0.086517 0.766318 \n",
"min 290.000000 5.200000e+02 1.000000 0.000000 0.000000 \n",
"25% 1427.000000 5.040000e+03 1.000000 0.000000 0.000000 \n",
"50% 1910.000000 7.618000e+03 1.500000 0.000000 0.000000 \n",
"75% 2550.000000 1.068800e+04 2.000000 0.000000 0.000000 \n",
"max 13540.000000 1.651359e+06 3.500000 1.000000 4.000000 \n",
"\n",
" ... grade sqft_above sqft_basement yr_built \\\n",
"count ... 21613.000000 21613.000000 21613.000000 21613.000000 \n",
"mean ... 7.656873 1788.390691 291.509045 1971.005136 \n",
"std ... 1.175459 828.090978 442.575043 29.373411 \n",
"min ... 1.000000 290.000000 0.000000 1900.000000 \n",
"25% ... 7.000000 1190.000000 0.000000 1951.000000 \n",
"50% ... 7.000000 1560.000000 0.000000 1975.000000 \n",
"75% ... 8.000000 2210.000000 560.000000 1997.000000 \n",
"max ... 13.000000 9410.000000 4820.000000 2015.000000 \n",
"\n",
" yr_renovated zipcode lat long sqft_living15 \\\n",
"count 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 \n",
"mean 84.402258 98077.939805 47.560053 -122.213896 1986.552492 \n",
"std 401.679240 53.505026 0.138564 0.140828 685.391304 \n",
"min 0.000000 98001.000000 47.155900 -122.519000 399.000000 \n",
"25% 0.000000 98033.000000 47.471000 -122.328000 1490.000000 \n",
"50% 0.000000 98065.000000 47.571800 -122.230000 1840.000000 \n",
"75% 0.000000 98118.000000 47.678000 -122.125000 2360.000000 \n",
"max 2015.000000 98199.000000 47.777600 -121.315000 6210.000000 \n",
"\n",
" sqft_lot15 \n",
"count 21613.000000 \n",
"mean 12768.455652 \n",
"std 27304.179631 \n",
"min 651.000000 \n",
"25% 5100.000000 \n",
"50% 7620.000000 \n",
"75% 10083.000000 \n",
"max 871200.000000 \n",
"\n",
"[8 rows x 21 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2.0 Data Wrangling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Question 2 \n",
"Drop the columns <code>\"id\"</code> and <code>\"Unnamed: 0\"</code> from axis 1 using the method <code>drop()</code>, then use the method <code>describe()</code> to obtain a statistical summary of the data. Take a screenshot and submit it, make sure the inplace parameter is set to <code>True</code>"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>price</th>\n",
" <th>bedrooms</th>\n",
" <th>bathrooms</th>\n",
" <th>sqft_living</th>\n",
" <th>sqft_lot</th>\n",
" <th>floors</th>\n",
" <th>waterfront</th>\n",
" <th>view</th>\n",
" <th>condition</th>\n",
" <th>grade</th>\n",
" <th>sqft_above</th>\n",
" <th>sqft_basement</th>\n",
" <th>yr_built</th>\n",
" <th>yr_renovated</th>\n",
" <th>zipcode</th>\n",
" <th>lat</th>\n",
" <th>long</th>\n",
" <th>sqft_living15</th>\n",
" <th>sqft_lot15</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>2.161300e+04</td>\n",
" <td>21600.000000</td>\n",
" <td>21603.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>2.161300e+04</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" <td>21613.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>5.400881e+05</td>\n",
" <td>3.372870</td>\n",
" <td>2.115736</td>\n",
" <td>2079.899736</td>\n",
" <td>1.510697e+04</td>\n",
" <td>1.494309</td>\n",
" <td>0.007542</td>\n",
" <td>0.234303</td>\n",
" <td>3.409430</td>\n",
" <td>7.656873</td>\n",
" <td>1788.390691</td>\n",
" <td>291.509045</td>\n",
" <td>1971.005136</td>\n",
" <td>84.402258</td>\n",
" <td>98077.939805</td>\n",
" <td>47.560053</td>\n",
" <td>-122.213896</td>\n",
" <td>1986.552492</td>\n",
" <td>12768.455652</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>3.671272e+05</td>\n",
" <td>0.926657</td>\n",
" <td>0.768996</td>\n",
" <td>918.440897</td>\n",
" <td>4.142051e+04</td>\n",
" <td>0.539989</td>\n",
" <td>0.086517</td>\n",
" <td>0.766318</td>\n",
" <td>0.650743</td>\n",
" <td>1.175459</td>\n",
" <td>828.090978</td>\n",
" <td>442.575043</td>\n",
" <td>29.373411</td>\n",
" <td>401.679240</td>\n",
" <td>53.505026</td>\n",
" <td>0.138564</td>\n",
" <td>0.140828</td>\n",
" <td>685.391304</td>\n",
" <td>27304.179631</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>7.500000e+04</td>\n",
" <td>1.000000</td>\n",
" <td>0.500000</td>\n",
" <td>290.000000</td>\n",
" <td>5.200000e+02</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>290.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1900.000000</td>\n",
" <td>0.000000</td>\n",
" <td>98001.000000</td>\n",
" <td>47.155900</td>\n",
" <td>-122.519000</td>\n",
" <td>399.000000</td>\n",
" <td>651.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>3.219500e+05</td>\n",
" <td>3.000000</td>\n",
" <td>1.750000</td>\n",
" <td>1427.000000</td>\n",
" <td>5.040000e+03</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>3.000000</td>\n",
" <td>7.000000</td>\n",
" <td>1190.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1951.000000</td>\n",
" <td>0.000000</td>\n",
" <td>98033.000000</td>\n",
" <td>47.471000</td>\n",
" <td>-122.328000</td>\n",
" <td>1490.000000</td>\n",
" <td>5100.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>4.500000e+05</td>\n",
" <td>3.000000</td>\n",
" <td>2.250000</td>\n",
" <td>1910.000000</td>\n",
" <td>7.618000e+03</td>\n",
" <td>1.500000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>3.000000</td>\n",
" <td>7.000000</td>\n",
" <td>1560.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1975.000000</td>\n",
" <td>0.000000</td>\n",
" <td>98065.000000</td>\n",
" <td>47.571800</td>\n",
" <td>-122.230000</td>\n",
" <td>1840.000000</td>\n",
" <td>7620.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>6.450000e+05</td>\n",
" <td>4.000000</td>\n",
" <td>2.500000</td>\n",
" <td>2550.000000</td>\n",
" <td>1.068800e+04</td>\n",
" <td>2.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>4.000000</td>\n",
" <td>8.000000</td>\n",
" <td>2210.000000</td>\n",
" <td>560.000000</td>\n",
" <td>1997.000000</td>\n",
" <td>0.000000</td>\n",
" <td>98118.000000</td>\n",
" <td>47.678000</td>\n",
" <td>-122.125000</td>\n",
" <td>2360.000000</td>\n",
" <td>10083.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>7.700000e+06</td>\n",
" <td>33.000000</td>\n",
" <td>8.000000</td>\n",
" <td>13540.000000</td>\n",
" <td>1.651359e+06</td>\n",
" <td>3.500000</td>\n",
" <td>1.000000</td>\n",
" <td>4.000000</td>\n",
" <td>5.000000</td>\n",
" <td>13.000000</td>\n",
" <td>9410.000000</td>\n",
" <td>4820.000000</td>\n",
" <td>2015.000000</td>\n",
" <td>2015.000000</td>\n",
" <td>98199.000000</td>\n",
" <td>47.777600</td>\n",
" <td>-121.315000</td>\n",
" <td>6210.000000</td>\n",
" <td>871200.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" price bedrooms bathrooms sqft_living sqft_lot \\\n",
"count 2.161300e+04 21600.000000 21603.000000 21613.000000 2.161300e+04 \n",
"mean 5.400881e+05 3.372870 2.115736 2079.899736 1.510697e+04 \n",
"std 3.671272e+05 0.926657 0.768996 918.440897 4.142051e+04 \n",
"min 7.500000e+04 1.000000 0.500000 290.000000 5.200000e+02 \n",
"25% 3.219500e+05 3.000000 1.750000 1427.000000 5.040000e+03 \n",
"50% 4.500000e+05 3.000000 2.250000 1910.000000 7.618000e+03 \n",
"75% 6.450000e+05 4.000000 2.500000 2550.000000 1.068800e+04 \n",
"max 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 \n",
"\n",
" floors waterfront view condition grade \\\n",
"count 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 \n",
"mean 1.494309 0.007542 0.234303 3.409430 7.656873 \n",
"std 0.539989 0.086517 0.766318 0.650743 1.175459 \n",
"min 1.000000 0.000000 0.000000 1.000000 1.000000 \n",
"25% 1.000000 0.000000 0.000000 3.000000 7.000000 \n",
"50% 1.500000 0.000000 0.000000 3.000000 7.000000 \n",
"75% 2.000000 0.000000 0.000000 4.000000 8.000000 \n",
"max 3.500000 1.000000 4.000000 5.000000 13.000000 \n",
"\n",
" sqft_above sqft_basement yr_built yr_renovated zipcode \\\n",
"count 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 \n",
"mean 1788.390691 291.509045 1971.005136 84.402258 98077.939805 \n",
"std 828.090978 442.575043 29.373411 401.679240 53.505026 \n",
"min 290.000000 0.000000 1900.000000 0.000000 98001.000000 \n",
"25% 1190.000000 0.000000 1951.000000 0.000000 98033.000000 \n",
"50% 1560.000000 0.000000 1975.000000 0.000000 98065.000000 \n",
"75% 2210.000000 560.000000 1997.000000 0.000000 98118.000000 \n",
"max 9410.000000 4820.000000 2015.000000 2015.000000 98199.000000 \n",
"\n",
" lat long sqft_living15 sqft_lot15 \n",
"count 21613.000000 21613.000000 21613.000000 21613.000000 \n",
"mean 47.560053 -122.213896 1986.552492 12768.455652 \n",
"std 0.138564 0.140828 685.391304 27304.179631 \n",
"min 47.155900 -122.519000 399.000000 651.000000 \n",
"25% 47.471000 -122.328000 1490.000000 5100.000000 \n",
"50% 47.571800 -122.230000 1840.000000 7620.000000 \n",
"75% 47.678000 -122.125000 2360.000000 10083.000000 \n",
"max 47.777600 -121.315000 6210.000000 871200.000000 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.drop(columns=[\"id\",\"Unnamed: 0\"],axis=1,inplace=True)\n",
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"we can see we have missing values for the columns <code> bedrooms</code> and <code> bathrooms </code>"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"number of NaN values for the column bedrooms : 13\n",
"number of NaN values for the column bathrooms : 10\n"
]
}
],
"source": [
"print(\"number of NaN values for the column bedrooms :\", df['bedrooms'].isnull().sum())\n",
"print(\"number of NaN values for the column bathrooms :\", df['bathrooms'].isnull().sum())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"We can replace the missing values of the column <code>'bedrooms'</code> with the mean of the column <code>'bedrooms' </code> using the method replace. Don't forget to set the <code>inplace</code> parameter top <code>True</code>"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"mean=df['bedrooms'].mean()\n",
"df['bedrooms'].replace(np.nan,mean, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"We also replace the missing values of the column <code>'bathrooms'</code> with the mean of the column <code>'bedrooms' </codse> using the method replace.Don't forget to set the <code> inplace </code> parameter top <code> Ture </code>"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"mean=df['bathrooms'].mean()\n",
"df['bathrooms'].replace(np.nan,mean, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"number of NaN values for the column bedrooms : 0\n",
"number of NaN values for the column bathrooms : 0\n"
]
}
],
"source": [
"print(\"number of NaN values for the column bedrooms :\", df['bedrooms'].isnull().sum())\n",
"print(\"number of NaN values for the column bathrooms :\", df['bathrooms'].isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3.0 Exploratory data analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Question 3\n",
"Use the method value_counts to count the number of houses with unique floor values, use the method .to_frame() to convert it to a dataframe.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>floors</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1.0</th>\n",
" <td>10680</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2.0</th>\n",
" <td>8241</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1.5</th>\n",
" <td>1910</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3.0</th>\n",
" <td>613</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2.5</th>\n",
" <td>161</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3.5</th>\n",
" <td>8</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" floors\n",
"1.0 10680\n",
"2.0 8241\n",
"1.5 1910\n",
"3.0 613\n",
"2.5 161\n",
"3.5 8"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['floors'].value_counts().to_frame()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 4\n",
"Use the function <code>boxplot</code> in the seaborn library to determine whether houses with a waterfront view or without a waterfront view have more price outliers ."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0, 0.5, 'price')"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"var1=sns.boxplot(data=[df['waterfront'],df['price']])\n",
"var1.set_xlabel('waterfront')\n",
"var1.set_ylabel('price')\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x7f62bc561cc0>"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.regplot(df['sqft_above'],df['price'],scatter=True,data = df, ci = None)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 5\n",
"Use the function <code> regplot</code> in the seaborn library to determine if the feature <code>sqft_above</code> is negatively or positively correlated with price."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"We can use the Pandas method <code>corr()</code> to find the feature other than price that is most correlated with price."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"zipcode -0.053203\n",
"id -0.016762\n",
"long 0.021626\n",
"Unnamed: 0 0.027372\n",
"condition 0.036362\n",
"yr_built 0.054012\n",
"sqft_lot15 0.082447\n",
"sqft_lot 0.089661\n",
"yr_renovated 0.126434\n",
"floors 0.256794\n",
"waterfront 0.266369\n",
"lat 0.307003\n",
"bedrooms 0.308890\n",
"sqft_basement 0.323816\n",
"view 0.397293\n",
"bathrooms 0.525885\n",
"sqft_living15 0.585379\n",
"sqft_above 0.605567\n",
"grade 0.667434\n",
"sqft_living 0.702035\n",
"price 1.000000\n",
"Name: price, dtype: float64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.corr()['price'].sort_values()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Module 4: Model Development"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import libraries "
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import sklearn as sk\n",
"import sklearn.metrics  # ensures sk.metrics is loaded\n",
"from sklearn.linear_model import LinearRegression\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"We can fit a linear regression model using the longitude feature <code>'long'</code> and calculate the R^2."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.00046769430149007363"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = df[['long']]\n",
"Y = df['price']\n",
"lm = LinearRegression()\n",
"lm\n",
"lm.fit(X,Y)\n",
"lm.score(X, Y)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"y_data = df['price']\n",
"x_data=df.drop('price',axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"number of test samples : 3242\n",
"number of training samples: 18371\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"\n",
"x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=1)\n",
"\n",
"\n",
"print(\"number of test samples :\", x_test.shape[0])\n",
"print(\"number of training samples:\",x_train.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 6\n",
"Fit a linear regression model to predict the <code>'price'</code> using the feature 'sqft_living' then calculate the R^2. Take a screenshot of your code and the value of the R^2."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"features =[\"floors\", \"waterfront\",\"lat\" ,\"bedrooms\" ,\"sqft_basement\" ,\"view\" ,\"bathrooms\",\"sqft_living15\",\"sqft_above\",\"grade\",\"sqft_living\"] "
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"lre=LinearRegression()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n",
" normalize=False)"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lre.fit(x_train[['floors']], y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 7\n",
"Fit a linear regression model to predict the <code>'price'</code> using the list of features:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then calculate the R^2. Take a screenshot of your code."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.05890918979465243"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lre.score(x_test[['floors']], y_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# R^2 needs true and predicted prices; here the model fitted on 'floors' is scored on the test split\n",
"sk.metrics.r2_score(y_true=y_test, y_pred=lre.predict(x_test[['floors']]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### This will help with Question 8\n",
"\n",
"Create a list of tuples. The first element of each tuple contains the name of the estimator:\n",
"\n",
"<code>'scale'</code>\n",
"\n",
"<code>'polynomial'</code>\n",
"\n",
"<code>'model'</code>\n",
"\n",
"The second element of each tuple contains the model constructor:\n",
"\n",
"<code>StandardScaler()</code>\n",
"\n",
"<code>PolynomialFeatures(include_bias=False)</code>\n",
"\n",
"<code>LinearRegression()</code>\n"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"Input=[('scale',StandardScaler()),('polynomial', PolynomialFeatures(include_bias=False)),('model',LinearRegression())]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 8\n",
"Use the list to create a pipeline object to predict the 'price', fit the object using the features in the list <code>features</code>, and calculate the R^2."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(memory=None,\n",
" steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('polynomial', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)), ('model', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n",
" normalize=False))])"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe=Pipeline(Input)\n",
"pipe"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(memory=None,\n",
" steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('polynomial', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)), ('model', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n",
" normalize=False))])"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fit on the columns listed in `features`, as Question 8 asks\n",
"pipe.fit(df[features], Y)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipe.score(df[features], Y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Module 5: Model Evaluation and Refinement"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import the necessary modules."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"done\n"
]
}
],
"source": [
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.model_selection import train_test_split\n",
"print(\"done\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will split the data into training and testing sets."
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"number of test samples : 3240\n",
"number of training samples: 18357\n"
]
}
],
"source": [
"features =[\"floors\", \"waterfront\",\"lat\" ,\"bedrooms\" ,\"sqft_basement\" ,\"view\" ,\"bathrooms\",\"sqft_living15\",\"sqft_above\",\"grade\",\"sqft_living\"] \n",
"X = df[features ]\n",
"Y = df['price']\n",
"\n",
"x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=1)\n",
"\n",
"\n",
"print(\"number of test samples :\", x_test.shape[0])\n",
"print(\"number of training samples:\",x_train.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 9\n",
"Create and fit a Ridge regression object using the training data, setting the regularization parameter to 0.1 and calculate the R^2 using the test data. \n"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import Ridge\n",
"from sklearn.metrics import r2_score"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ridge(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=None,\n",
" normalize=False, random_state=None, solver='auto', tol=0.001)"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r = Ridge(alpha=0.1)\n",
"r"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_data = df['price']\n",
"x_data = df.drop('price', axis=1)\n",
"X_train, X_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3, random_state=3)\n",
"\n",
"# Fit on the numeric columns in `features` and report R^2 on the held-out data\n",
"r.fit(X_train[features], y_train)\n",
"r2_score(y_true=y_test, y_pred=r.predict(X_test[features]))"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" date bedrooms bathrooms sqft_living sqft_lot floors \\\n",
"4131 20150418T000000 3.0 1.75 1600 9579 1.0 \n",
"17459 20140506T000000 5.0 4.00 4510 15175 2.0 \n",
"2192 20141002T000000 4.0 2.75 3190 9023 2.0 \n",
"12418 20141208T000000 1.0 1.00 620 8261 1.0 \n",
"15773 20141002T000000 4.0 1.00 1430 27153 1.5 \n",
"2945 20150423T000000 3.0 1.75 1710 5000 1.0 \n",
"2911 20140508T000000 2.0 1.00 720 5040 1.0 \n",
"8071 20150506T000000 3.0 1.75 1930 6000 1.0 \n",
"17127 20150319T000000 3.0 2.50 1670 2575 2.0 \n",
"8961 20140520T000000 4.0 2.50 2800 8494 2.0 \n",
"5294 20140915T000000 3.0 1.75 1270 4800 1.0 \n",
"2871 20140529T000000 2.0 1.00 900 423838 1.0 \n",
"13174 20150107T000000 4.0 2.50 2050 3784 2.0 \n",
"6261 20150326T000000 5.0 1.75 2290 4320 2.0 \n",
"20720 20150211T000000 3.0 2.50 1440 2500 2.0 \n",
"4703 20150225T000000 4.0 2.50 2410 4708 2.0 \n",
"8315 20150212T000000 4.0 1.00 1540 5000 1.5 \n",
"845 20140509T000000 3.0 2.25 1640 11050 1.0 \n",
"20179 20150225T000000 3.0 3.50 2460 14155 2.0 \n",
"18111 20141111T000000 5.0 3.25 4010 12110 2.0 \n",
"6677 20140919T000000 3.0 2.25 2070 7207 1.0 \n",
"20119 20141029T000000 3.0 3.25 1680 1478 2.0 \n",
"6321 20141016T000000 5.0 2.50 2344 8000 1.0 \n",
"174 20140929T000000 4.0 2.25 2590 8190 2.0 \n",
"4912 20140624T000000 2.0 1.00 2034 13392 1.0 \n",
"10694 20150225T000000 3.0 1.00 1140 6120 1.5 \n",
"3953 20150226T000000 3.0 2.50 2100 7587 2.0 \n",
"20328 20150423T000000 3.0 2.50 2283 3996 2.0 \n",
"4782 20141230T000000 5.0 2.50 2580 11250 1.0 \n",
"15507 20150416T000000 4.0 2.25 2640 8800 1.0 \n",
"... ... ... ... ... ... ... \n",
"13714 20141020T000000 4.0 2.75 2600 19275 1.0 \n",
"10098 20150225T000000 3.0 2.25 2960 8330 1.0 \n",
"10061 20141114T000000 4.0 2.50 3240 22795 2.0 \n",
"17643 20150427T000000 3.0 1.75 1350 4000 1.5 \n",
"942 20141022T000000 4.0 3.00 1490 6766 1.5 \n",
"17817 20150422T000000 3.0 2.00 2740 101930 1.0 \n",
"8917 20141007T000000 3.0 1.75 2560 8400 1.0 \n",
"12034 20140715T000000 2.0 2.00 1440 213008 2.0 \n",
"20416 20140618T000000 3.0 2.25 1584 2800 2.0 \n",
"16590 20140607T000000 4.0 2.50 2460 4200 2.0 \n",
"18909 20140611T000000 3.0 2.00 1640 5280 1.5 \n",
"13879 20141110T000000 3.0 1.00 1070 10563 1.0 \n",
"16553 20150407T000000 4.0 2.50 3130 5200 2.0 \n",
"8293 20150306T000000 3.0 2.50 2980 43301 1.0 \n",
"16003 20141010T000000 4.0 3.00 2410 8284 1.0 \n",
"5010 20150416T000000 2.0 1.00 620 4455 1.0 \n",
"13408 20141021T000000 4.0 2.75 2540 4400 1.5 \n",
"6197 20150514T000000 5.0 3.00 3320 5354 2.0 \n",
"7813 20140805T000000 4.0 3.25 3100 3900 2.0 \n",
"5355 20140514T000000 3.0 2.50 3030 30007 1.5 \n",
"4699 20150108T000000 4.0 4.00 4050 9517 2.0 \n",
"17008 20141226T000000 4.0 2.50 2650 5706 2.0 \n",
"13883 20141022T000000 4.0 2.50 2050 9143 2.0 \n",
"13127 20141208T000000 2.0 2.25 1390 1222 3.0 \n",
"2828 20150305T000000 4.0 3.50 3070 4440 2.0 \n",
"11035 20140730T000000 4.0 1.75 1820 13600 1.5 \n",
"18521 20150330T000000 4.0 2.75 2220 5310 1.0 \n",
"1777 20140805T000000 3.0 2.00 1210 7136 1.0 \n",
"4261 20140508T000000 4.0 1.50 2130 8800 1.0 \n",
"19582 20140801T000000 2.0 1.00 900 5413 1.0 \n",
"\n",
" waterfront view condition grade sqft_above sqft_basement \\\n",
"4131 0 0 3 8 1180 420 \n",
"17459 0 0 3 10 4510 0 \n",
"2192 0 0 3 9 3190 0 \n",
"12418 0 0 3 5 620 0 \n",
"15773 0 0 4 5 1430 0 \n",
"2945 0 0 4 7 1110 600 \n",
"2911 0 0 3 6 720 0 \n",
"8071 0 0 3 8 1130 800 \n",
"17127 0 0 3 8 1670 0 \n",
"8961 0 0 3 8 2800 0 \n",
"5294 0 0 3 7 1270 0 \n",
"2871 0 2 5 6 900 0 \n",
"13174 0 0 3 8 2050 0 \n",
"6261 0 0 3 7 1980 310 \n",
"20720 0 0 3 7 1440 0 \n",
"4703 0 0 3 8 2410 0 \n",
"8315 0 0 4 7 1090 450 \n",
"845 0 0 4 8 1640 0 \n",
"20179 0 0 3 8 1900 560 \n",
"18111 0 0 3 11 4010 0 \n",
"6677 0 0 3 8 1720 350 \n",
"20119 0 0 3 8 1360 320 \n",
"6321 0 0 4 8 1560 784 \n",
"174 0 0 4 8 2590 0 \n",
"4912 1 4 5 7 1159 875 \n",
"10694 0 0 3 7 1140 0 \n",
"3953 0 0 3 9 2100 0 \n",
"20328 0 0 3 8 2283 0 \n",
"4782 0 0 3 7 1410 1170 \n",
"15507 0 0 3 8 1620 1020 \n",
"... ... ... ... ... ... ... \n",
"13714 0 0 3 8 1620 980 \n",
"10098 0 3 4 10 2260 700 \n",
"10061 0 0 3 8 3240 0 \n",
"17643 0 0 4 7 1350 0 \n",
"942 0 1 5 7 1490 0 \n",
"17817 0 2 3 9 2740 0 \n",
"8917 0 0 3 7 1970 590 \n",
"12034 0 0 4 7 1440 0 \n",
"20416 0 0 3 7 1584 0 \n",
"16590 0 0 3 8 2460 0 \n",
"18909 0 0 5 6 1640 0 \n",
"13879 0 0 3 7 1070 0 \n",
"16553 0 0 3 7 3130 0 \n",
"8293 0 0 4 8 1930 1050 \n",
"16003 0 0 5 7 1210 1200 \n",
"5010 0 0 3 6 620 0 \n",
"13408 0 0 5 7 1630 910 \n",
"6197 0 0 3 9 3320 0 \n",
"7813 0 2 5 9 2090 1010 \n",
"5355 0 0 4 10 3030 0 \n",
"4699 0 0 3 11 3360 690 \n",
"17008 0 0 3 9 2650 0 \n",
"13883 0 0 3 8 2050 0 \n",
"13127 0 0 3 7 1340 50 \n",
"2828 0 0 3 9 2030 1040 \n",
"11035 0 0 3 7 1120 700 \n",
"18521 0 0 5 7 1170 1050 \n",
"1777 0 0 3 7 1210 0 \n",
"4261 0 0 3 7 1100 1030 \n",
"19582 0 0 3 7 900 0 \n",
"\n",
" yr_built yr_renovated zipcode lat long sqft_living15 \\\n",
"4131 1977 0 98072 47.7662 -122.159 1750 \n",
"17459 1969 2002 98040 47.5309 -122.228 3510 \n",
"2192 2005 0 98075 47.5898 -121.989 3159 \n",
"12418 1939 0 98106 47.5138 -122.364 1180 \n",
"15773 1934 0 98065 47.5372 -121.744 1880 \n",
"2945 1944 0 98107 47.6689 -122.370 920 \n",
"2911 1955 0 98106 47.5267 -122.360 1357 \n",
"8071 1956 0 98119 47.6516 -122.375 1870 \n",
"17127 2000 0 98027 47.5310 -122.047 1670 \n",
"8961 2004 0 98038 47.3521 -122.009 3740 \n",
"5294 1953 0 98117 47.6930 -122.361 1490 \n",
"2871 1925 0 98022 47.2280 -122.088 1810 \n",
"13174 2001 0 98034 47.7189 -122.181 2050 \n",
"6261 1928 0 98105 47.6640 -122.310 2870 \n",
"20720 2008 0 98146 47.5123 -122.358 1440 \n",
"4703 2002 0 98092 47.3229 -122.182 2517 \n",
"8315 1922 0 98115 47.6754 -122.294 1590 \n",
"845 1972 0 98059 47.4723 -122.121 1870 \n",
"20179 2014 0 98155 47.7743 -122.279 2440 \n",
"18111 2003 0 98059 47.5228 -122.151 4010 \n",
"6677 1973 0 98177 47.7735 -122.371 2350 \n",
"20119 2009 0 98126 47.5674 -122.369 1530 \n",
"6321 1976 0 98023 47.3185 -122.377 2344 \n",
"174 1980 0 98006 47.5619 -122.125 2260 \n",
"4912 1947 0 98070 47.3312 -122.503 1156 \n",
"10694 1926 0 98115 47.6822 -122.309 1800 \n",
"3953 1990 0 98023 47.3072 -122.391 2330 \n",
"20328 2008 0 98031 47.4221 -122.192 1760 \n",
"4782 1964 0 98198 47.3970 -122.313 2240 \n",
"15507 1980 0 98072 47.7552 -122.148 2500 \n",
"... ... ... ... ... ... ... \n",
"13714 1978 0 98006 47.5523 -122.162 2230 \n",
"10098 1953 0 98177 47.7035 -122.385 2960 \n",
"10061 1998 0 98053 47.6329 -121.969 2570 \n",
"17643 1925 0 98103 47.6581 -122.354 1880 \n",
"942 1915 0 98136 47.5446 -122.382 1990 \n",
"17817 1999 0 98045 47.5056 -121.770 2140 \n",
"8917 1959 0 98011 47.7668 -122.197 1970 \n",
"12034 1990 0 98070 47.3604 -122.457 1630 \n",
"20416 2012 0 98002 47.3454 -122.214 1584 \n",
"16590 1998 0 98074 47.6041 -122.020 2460 \n",
"18909 1910 0 98002 47.3089 -122.213 1160 \n",
"13879 1969 0 98072 47.7687 -122.166 1840 \n",
"16553 2005 0 98042 47.3828 -122.098 3020 \n",
"8293 1978 0 98077 47.7631 -122.093 2890 \n",
"16003 1969 0 98034 47.7202 -122.220 2050 \n",
"5010 1927 0 98117 47.6877 -122.395 1180 \n",
"13408 1925 0 98103 47.6832 -122.343 1560 \n",
"6197 2004 0 98103 47.6542 -122.331 2330 \n",
"7813 1923 0 98109 47.6385 -122.348 2110 \n",
"5355 1992 0 98077 47.7430 -122.036 3360 \n",
"4699 1990 0 98040 47.5769 -122.215 3330 \n",
"17008 2005 0 98042 47.3515 -122.164 2760 \n",
"13883 1992 0 98166 47.4597 -122.355 1510 \n",
"13127 2009 0 98052 47.6754 -122.121 1480 \n",
"2828 1922 2007 98116 47.5732 -122.411 1780 \n",
"11035 1959 0 98032 47.3743 -122.295 1810 \n",
"18521 1951 0 98144 47.5801 -122.294 1540 \n",
"1777 2003 0 98031 47.3996 -122.203 1210 \n",
"4261 1962 0 98032 47.3830 -122.288 1480 \n",
"19582 1947 0 98125 47.7047 -122.307 1280 \n",
"\n",
" sqft_lot15 \n",
"4131 9829 \n",
"17459 13500 \n",
"2192 5615 \n",
"12418 8244 \n",
"15773 27153 \n",
"2945 5000 \n",
"2911 5120 \n",
"8071 6000 \n",
"17127 2897 \n",
"8961 8494 \n",
"5294 4800 \n",
"2871 94960 \n",
"13174 3366 \n",
"6261 4320 \n",
"20720 5000 \n",
"4703 5290 \n",
"8315 5000 \n",
"845 11050 \n",
"20179 14080 \n",
"18111 12334 \n",
"6677 7980 \n",
"20119 2753 \n",
"6321 8000 \n",
"174 8335 \n",
"4912 15961 \n",
"10694 4080 \n",
"3953 8119 \n",
"20328 3992 \n",
"4782 11780 \n",
"15507 11700 \n",
"... ... \n",
"13714 10119 \n",
"10098 8840 \n",
"10061 29761 \n",
"17643 4000 \n",
"942 6526 \n",
"17817 83635 \n",
"8917 8400 \n",
"12034 161172 \n",
"20416 2800 \n",
"16590 4200 \n",
"18909 7875 \n",
"13879 9638 \n",
"16553 5200 \n",
"8293 35915 \n",
"16003 7940 \n",
"5010 5000 \n",
"13408 3920 \n",
"6197 4040 \n",
"7813 3900 \n",
"5355 34983 \n",
"4699 9436 \n",
"17008 5749 \n",
"13883 9484 \n",
"13127 1369 \n",
"2828 4800 \n",
"11035 11970 \n",
"18521 4200 \n",
"1777 5765 \n",
"4261 8120 \n",
"19582 6380 \n",
"\n",
"[6484 rows x 19 columns] date bedrooms bathrooms sqft_living sqft_lot floors \\\n",
"14614 20150411T000000 4.0 2.50 2050 6705 1.0 \n",
"5805 20150120T000000 3.0 2.50 2510 5544 2.0 \n",
"20435 20140812T000000 5.0 3.50 2815 4900 2.0 \n",
"13849 20150417T000000 3.0 2.75 2190 7258 2.0 \n",
"6932 20140811T000000 2.0 1.00 720 5000 1.0 \n",
"19953 20140929T000000 3.0 2.50 1427 4337 2.0 \n",
"11899 20140605T000000 3.0 2.50 1610 6176 2.0 \n",
"7707 20140917T000000 3.0 1.50 2040 6750 1.0 \n",
"6520 20141223T000000 4.0 2.50 3580 97574 2.0 \n",
"10916 20140909T000000 4.0 2.00 1750 8116 1.0 \n",
"961 20150413T000000 3.0 2.00 1010 7380 1.0 \n",
"7201 20140929T000000 3.0 1.75 2260 8512 1.0 \n",
"8550 20140627T000000 5.0 1.75 2330 14322 1.0 \n",
"5310 20141104T000000 3.0 1.50 1650 8676 1.0 \n",
"7407 20140709T000000 4.0 2.75 2100 4480 1.5 \n",
"1020 20140729T000000 4.0 2.75 3130 21810 2.0 \n",
"20033 20140618T000000 4.0 2.50 3250 5000 2.0 \n",
"21210 20140516T000000 2.0 2.50 1310 1500 2.0 \n",
"180 20140725T000000 3.0 2.50 1670 5797 2.0 \n",
"8915 20141104T000000 3.0 1.75 1380 4590 1.0 \n",
"16356 20150220T000000 3.0 1.00 1510 8760 1.0 \n",
"8800 20141009T000000 1.0 1.00 730 5005 1.0 \n",
"13564 20140728T000000 3.0 2.50 1590 3121 2.0 \n",
"9838 20140709T000000 4.0 2.50 3020 7465 2.0 \n",
"11066 20150506T000000 2.0 1.00 850 5000 1.0 \n",
"12956 20150226T000000 3.0 2.25 1470 8682 1.0 \n",
"17596 20150401T000000 4.0 3.25 4200 210394 2.0 \n",
"2387 20141024T000000 3.0 1.50 1180 7000 1.0 \n",
"20081 20141103T000000 3.0 1.75 1650 1180 3.0 \n",
"15490 20141028T000000 2.0 1.00 1210 7040 1.0 \n",
"... ... ... ... ... ... ... \n",
"11435 20140922T000000 3.0 2.50 2120 2374 2.0 \n",
"10759 20150423T000000 2.0 2.25 1370 1248 2.0 \n",
"4926 20141017T000000 3.0 2.50 2980 18935 1.5 \n",
"14843 20140812T000000 4.0 2.50 2007 4968 2.0 \n",
"9834 20150320T000000 5.0 2.50 2900 6650 1.0 \n",
"6557 20140804T000000 2.0 1.00 940 8384 1.0 \n",
"4718 20140710T000000 3.0 2.75 2220 4000 2.0 \n",
"7016 20140731T000000 3.0 1.50 1510 6600 1.0 \n",
"858 20140627T000000 3.0 1.00 1520 213444 1.5 \n",
"4019 20140619T000000 3.0 2.00 2140 7200 1.0 \n",
"12929 20140610T000000 3.0 2.00 2180 4976 1.5 \n",
"11580 20150407T000000 3.0 2.00 1840 8140 1.0 \n",
"8127 20140722T000000 4.0 2.75 1760 9222 1.0 \n",
"18690 20140502T000000 5.0 2.50 2210 9655 1.0 \n",
"2710 20140708T000000 2.0 2.50 2560 2500 2.0 \n",
"1498 20150316T000000 4.0 3.00 3470 4750 2.0 \n",
"3610 20140821T000000 3.0 2.50 2060 8893 2.0 \n",
"19960 20141001T000000 3.0 3.00 1290 1112 3.0 \n",
"6542 20141121T000000 3.0 2.50 1990 4936 2.0 \n",
"1447 20140825T000000 3.0 2.25 1500 7308 1.0 \n",
"7061 20140618T000000 3.0 2.00 1380 8682 1.0 \n",
"18089 20140922T000000 3.0 2.50 1900 5194 2.0 \n",
"11115 20140528T000000 3.0 2.50 1970 23180 1.0 \n",
"19091 20150116T000000 5.0 4.00 4720 493534 2.0 \n",
"11261 20140604T000000 3.0 1.75 1670 9600 1.0 \n",
"6400 20141016T000000 3.0 1.75 710 5050 1.0 \n",
"15288 20150413T000000 3.0 2.00 1500 5200 1.0 \n",
"11513 20140825T000000 3.0 2.50 2140 7715 2.0 \n",
"1688 20140717T000000 2.0 1.75 1210 131115 1.5 \n",
"5994 20140828T000000 2.0 1.00 610 4000 1.0 \n",
"\n",
" waterfront view condition grade sqft_above sqft_basement \\\n",
"14614 0 0 4 7 1230 820 \n",
"5805 0 0 3 7 2510 0 \n",
"20435 0 0 3 9 2815 0 \n",
"13849 0 0 3 8 2190 0 \n",
"6932 0 0 5 6 720 0 \n",
"19953 0 0 3 7 1427 0 \n",
"11899 0 0 3 7 1610 0 \n",
"7707 0 0 3 7 1280 760 \n",
"6520 0 0 3 9 3580 0 \n",
"10916 0 0 4 5 1750 0 \n",
"961 0 0 3 7 1010 0 \n",
"7201 0 0 3 7 1130 1130 \n",
"8550 0 0 4 7 1180 1150 \n",
"5310 0 0 4 8 1130 520 \n",
"7407 0 0 4 7 1780 320 \n",
"1020 0 0 4 10 3130 0 \n",
"20033 0 0 3 8 3250 0 \n",
"21210 0 0 3 8 1160 150 \n",
"180 0 0 3 7 1670 0 \n",
"8915 0 0 2 7 930 450 \n",
"16356 0 0 4 6 1510 0 \n",
"8800 0 0 4 5 730 0 \n",
"13564 0 0 3 7 1590 0 \n",
"9838 0 0 3 9 3020 0 \n",
"11066 0 0 3 5 850 0 \n",
"12956 0 0 3 7 1160 310 \n",
"17596 0 0 4 10 4200 0 \n",
"2387 0 0 4 7 1180 0 \n",
"20081 0 0 3 8 1650 0 \n",
"15490 0 0 3 7 1210 0 \n",
"... ... ... ... ... ... ... \n",
"11435 0 0 3 8 1770 350 \n",
"10759 0 0 3 7 1200 170 \n",
"4926 0 0 3 11 2980 0 \n",
"14843 0 0 3 9 2007 0 \n",
"9834 0 0 3 7 1450 1450 \n",
"6557 0 0 3 5 940 0 \n",
"4718 0 0 3 8 1700 520 \n",
"7016 0 0 3 6 1510 0 \n",
"858 0 3 5 8 1520 0 \n",
"4019 0 0 4 8 1480 660 \n",
"12929 0 2 4 8 1680 500 \n",
"11580 0 0 4 7 1040 800 \n",
"8127 0 0 3 7 1140 620 \n",
"18690 0 0 3 8 1460 750 \n",
"2710 0 0 5 8 1690 870 \n",
"1498 0 2 3 9 2370 1100 \n",
"3610 0 0 3 8 2060 0 \n",
"19960 0 0 3 7 1290 0 \n",
"6542 0 0 3 8 1990 0 \n",
"1447 0 0 4 7 1210 290 \n",
"7061 0 0 4 7 1380 0 \n",
"18089 0 0 3 7 1900 0 \n",
"11115 1 4 3 8 1100 870 \n",
"19091 0 0 5 9 3960 760 \n",
"11261 0 0 5 8 1670 0 \n",
"6400 0 0 4 6 710 0 \n",
"15288 0 0 3 7 1060 440 \n",
"11513 0 0 3 7 2140 0 \n",
"1688 0 0 5 7 1210 0 \n",
"5994 0 0 4 6 610 0 \n",
"\n",
" yr_built yr_renovated zipcode lat long sqft_living15 \\\n",
"14614 1973 0 98034 47.7242 -122.217 1610 \n",
"5805 2001 0 98053 47.6903 -122.042 2660 \n",
"20435 2011 0 98030 47.3424 -122.179 2798 \n",
"13849 2000 0 98003 47.3486 -122.301 2190 \n",
"6932 1951 0 98126 47.5195 -122.374 810 \n",
"19953 2009 0 98042 47.3857 -122.162 1443 \n",
"11899 1994 0 98030 47.3657 -122.173 1680 \n",
"7707 1950 0 98117 47.7013 -122.369 1970 \n",
"6520 2004 0 98038 47.3901 -122.071 2510 \n",
"10916 1943 0 98056 47.5097 -122.181 1440 \n",
"961 1982 0 98074 47.6273 -122.062 1650 \n",
"7201 1948 0 98125 47.7129 -122.304 2240 \n",
"8550 1968 0 98059 47.4768 -122.155 1690 \n",
"5310 1979 0 98133 47.7471 -122.352 1400 \n",
"7407 1928 0 98105 47.6691 -122.294 2050 \n",
"1020 1993 0 98077 47.7083 -122.073 3330 \n",
"20033 2008 0 98059 47.4988 -122.148 3230 \n",
"21210 2006 0 98122 47.6112 -122.309 1320 \n",
"180 1988 0 98030 47.3505 -122.179 1670 \n",
"8915 1950 0 98115 47.6841 -122.293 1320 \n",
"16356 1946 0 98002 47.3015 -122.216 1040 \n",
"8800 1945 0 98117 47.6992 -122.364 1630 \n",
"13564 1994 0 98118 47.5515 -122.284 1090 \n",
"9838 2004 0 98075 47.5982 -121.980 3100 \n",
"11066 1907 0 98055 47.4874 -122.207 910 \n",
"12956 1985 0 98003 47.2729 -122.299 1670 \n",
"17596 1993 0 98024 47.5607 -121.961 2370 \n",
"2387 1977 0 98023 47.2959 -122.373 1630 \n",
"20081 2014 0 98105 47.6638 -122.319 1650 \n",
"15490 1952 0 98155 47.7450 -122.297 1210 \n",
"... ... ... ... ... ... ... \n",
"11435 2005 0 98052 47.6740 -122.142 2480 \n",
"10759 2000 0 98102 47.6399 -122.320 1800 \n",
"4926 1990 0 98077 47.7133 -122.079 3670 \n",
"14843 2009 0 98092 47.3301 -122.191 2189 \n",
"9834 1964 0 98168 47.4935 -122.332 1600 \n",
"6557 1947 0 98146 47.5065 -122.364 1290 \n",
"4718 1914 2000 98122 47.6170 -122.291 1800 \n",
"7016 1938 0 98168 47.4821 -122.331 990 \n",
"858 1988 0 98027 47.5081 -122.093 2640 \n",
"4019 1966 0 98056 47.5084 -122.185 2070 \n",
"12929 1930 0 98126 47.5730 -122.380 1850 \n",
"11580 1975 0 98003 47.3106 -122.325 1600 \n",
"8127 1971 0 98023 47.3099 -122.362 1800 \n",
"18690 1976 0 98011 47.7698 -122.222 2080 \n",
"2710 1901 0 98112 47.6233 -122.300 1890 \n",
"1498 2014 0 98116 47.5917 -122.386 2420 \n",
"3610 1987 0 98006 47.5615 -122.165 2650 \n",
"19960 2008 0 98125 47.7282 -122.296 1230 \n",
"6542 2004 0 98075 47.5911 -122.018 2250 \n",
"1447 1968 0 98055 47.4621 -122.187 1480 \n",
"7061 1966 0 98148 47.4238 -122.322 1410 \n",
"18089 2004 0 98058 47.4391 -122.117 2230 \n",
"11115 1937 1998 98136 47.5495 -122.398 3030 \n",
"19091 1975 0 98027 47.4536 -122.009 2160 \n",
"11261 1961 0 98040 47.5754 -122.233 1900 \n",
"6400 1950 0 98126 47.5194 -122.375 900 \n",
"15288 1977 0 98042 47.3653 -122.090 1640 \n",
"11513 1991 0 98198 47.3840 -122.322 1990 \n",
"1688 1950 0 98070 47.4599 -122.450 2020 \n",
"5994 1918 0 98136 47.5469 -122.391 870 \n",
"\n",
" sqft_lot15 \n",
"14614 7292 \n",
"5805 5614 \n",
"20435 4900 \n",
"13849 8645 \n",
"6932 5000 \n",
"19953 4347 \n",
"11899 7414 \n",
"7707 6750 \n",
"6520 27068 \n",
"10916 7865 \n",
"961 9030 \n",
"7201 8040 \n",
"8550 10010 \n",
"5310 8499 \n",
"7407 4480 \n",
"1020 21810 \n",
"20033 5507 \n",
"21210 1581 \n",
"180 6183 \n",
"8915 4692 \n",
"16356 7828 \n",
"8800 5667 \n",
"13564 4900 \n",
"9838 5587 \n",
"11066 4815 \n",
"12956 8359 \n",
"17596 184694 \n",
"2387 7500 \n",
"20081 1960 \n",
"15490 7205 \n",
"... ... \n",
"11435 3043 \n",
"10759 3360 \n",
"4926 18225 \n",
"14843 5852 \n",
"9834 8246 \n",
"6557 8384 \n",
"4718 4000 \n",
"7016 6600 \n",
"858 213444 \n",
"4019 7220 \n",
"12929 5000 \n",
"11580 6720 \n",
"8127 9222 \n",
"18690 8633 \n",
"2710 5000 \n",
"1498 4761 \n",
"3610 8500 \n",
"19960 9000 \n",
"6542 4815 \n",
"1447 7400 \n",
"7061 10594 \n",
"18089 5194 \n",
"11115 34689 \n",
"19091 219542 \n",
"11261 9600 \n",
"6400 5050 \n",
"15288 5200 \n",
"11513 7628 \n",
"1688 185565 \n",
"5994 5160 \n",
"\n",
"[15129 rows x 19 columns]\n"
]
}
],
"source": [
"print(X_test,X_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 10\n",
"Perform a second order polynomial transform on both the training data and testing data. Create and fit a Ridge regression object using the training data, setting the regularisation parameter to 0.1. Calculate the R^2 utilising the test data provided. Take a screenshot of your code and the R^2."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Apply the polynomial transform to the 2D feature matrices, not to the 1D target\n",
"pr = PolynomialFeatures(degree=2)\n",
"x_train_pr = pr.fit_transform(X_train[features])\n",
"x_test_pr = pr.transform(X_test[features])\n",
"\n",
"# Fit Ridge on the transformed training data and report R^2 on the test data\n",
"poly_ridge = Ridge(alpha=0.1)\n",
"poly_ridge.fit(x_train_pr, y_train)\n",
"poly_ridge.score(x_test_pr, y_test)"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [],
"source": [
"# poly = LinearRegression()\n",
"# poly.fit(x_train_pr, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>Once you complete your notebook you will have to share it. Select the icon on the top right a marked in red in the image below, a dialogue box should open, select the option all&nbsp;content excluding sensitive code cells.</p>\n",
" <p><img width=\"600\" src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/save_notebook.png\" alt=\"share notebook\" style=\"display: block; margin-left: auto; margin-right: auto;\"/></p>\n",
" <p></p>\n",
" <p>You can then share the notebook&nbsp; via a&nbsp; URL by scrolling down as shown in the following image:</p>\n",
" <p style=\"text-align: center;\"><img width=\"600\" src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/url_notebook.png\" alt=\"HTML\" style=\"display: block; margin-left: auto; margin-right: auto;\" /></p>\n",
" <p>&nbsp;</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>About the Authors:</h2> \n",
"\n",
"<a href=\"https://www.linkedin.com/in/joseph-s-50398b136/\">Joseph Santarcangelo</a> has a PhD in Electrical Engineering; his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Other contributors: <a href=\"https://www.linkedin.com/in/michelleccarey/\">Michelle Carey</a>, <a href=\"www.linkedin.com/in/jiahui-mavis-zhou-a4537814a\">Mavis Zhou</a> "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}