paulk-asert/HousePrices.ipynb

## HousePrices.ipynb
{"cells":[{"metadata":{},"cell_type":"markdown","source":"# House price prediction using regression\n\nThis example uses the [Kaggle house sale prices for King County](https://www.kaggle.com/harlfoxem/housesalesprediction/data) dataset.\nIt captures actual house price sales for about a year in that region as well\nas some interesting features of each house such as number of bedrooms,\nnumber of bathrooms, size of living area, etc.\n\nWe'll use [Tablesaw](https://tablesaw.tech/) to store and manipulate our data\nand the linear regression class from the [Smile](http://haifengl.github.io/)\nmachine learning library. So, we'll add those libraries to the classpath and define some imports to simplify access to the classes we need."},{"metadata":{"trusted":true},"cell_type":"code","source":"%%classpath add mvn\ntech.tablesaw tablesaw-beakerx 0.36.0\ncom.github.haifengl smile-core 1.5.3","execution_count":158,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"%import tech.tablesaw.api.*\n%import smile.regression.OLS","execution_count":159,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"We'll also enable a BeakerX display widget for Tablesaw tables."},{"metadata":{"trusted":true},"cell_type":"code","source":"tech.tablesaw.beakerx.TablesawDisplayer.register()\nOutputCell.HIDDEN","execution_count":160,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Exploring the data\n\nWe start by loading data and printing its shape and structure."},{"metadata":{"scrolled":false,"trusted":true},"cell_type":"code","source":"records = Table.read().csv(\"../resources/kc_house_data.csv\")\nrecords.shape()","execution_count":161,"outputs":[{"data":{"text/plain":"21613 rows X 21 cols"},"execution_count":161,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"records.structure()","execution_count":162,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"5285462e-c30b-44b5-a264-891c334b3a83","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{},"cell_type":"markdown","source":"We might want to explore the _number of bedrooms_ feature. We can display a summary of that feature and examine some outliers."},{"metadata":{"trusted":true},"cell_type":"code","source":"records.column(\"bedrooms\").summary().print()","execution_count":163,"outputs":[{"data":{"text/plain":"         Column: bedrooms          \n Measure   |        Value         |\n-----------------------------------\n        n  |               21613  |\n      sum  |               72854  |\n     Mean  |   3.370841623097218  |\n      Min  |                   0  |\n      Max  |                  33  |\n    Range  |                  33  |\n Variance  |  0.8650150097573497  |\n Std. Dev  |   0.930061831147451  |"},"execution_count":163,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"records.where(records.column(\"bedrooms\").isGreaterThan(10))","execution_count":164,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"3f623167-2b25-4043-bce7-980e5ba011a9","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{},"cell_type":"markdown","source":"We can remove the 33 bedroom record as an outlier:"},{"metadata":{"trusted":true},"cell_type":"code","source":"records = records.dropWhere(records.column(\"bedrooms\").isGreaterThan(30))\nrecords.shape()","execution_count":165,"outputs":[{"data":{"text/plain":"21612 rows X 21 cols"},"execution_count":165,"metadata":{},"output_type":"execute_result"}]},{"metadata":{},"cell_type":"markdown","source":"We might want to explore the _number of bathrooms_ feature. We can find the maximum number of bathrooms and display a histogram."},{"metadata":{"trusted":true},"cell_type":"code","source":"maxBathrooms = records.column('bathrooms').toList().max()","execution_count":166,"outputs":[{"data":{"text/plain":"8.0"},"execution_count":166,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"plot = new Histogram(title: 'Bathroom histogram',\n                     binCount: maxBathrooms,\n                     xLabel: '#bathrooms',\n                     yLabel: '#houses',\n                     data: records.column('bathrooms').toList())","execution_count":167,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"9acb423e-6a52-4515-8d59-5ee775e2c577","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{},"cell_type":"markdown","source":"### Linear regression\n\nWe might posture that the more bedrooms in a house, the higher the price. If the relationship is linear, linear regression will give us the line of best fit according to this assumption. Ordinary least squares finds such a line by minimising residual errors. Let's use that algorithm from the Smile library: "},{"metadata":{"trusted":true},"cell_type":"code","source":"cols = ['bedrooms', 'price']\npriceModel = new OLS(records.select(*cols).smile().numericDataset('price'))","execution_count":168,"outputs":[{"data":{"text/plain":"Linear Model:\n\nResiduals:\n\t       Min\t        1Q\t    Median\t        3Q\t       Max\n\t-993338.9867\t-203008.4337\t-65410.8645\t105991.5663\t6824398.8589\n\nCoefficients:\n            Estimate        Std. Error        t value        Pr(>|t|)\nIntercept110315.7263         9108.4603        12.1113          0.0000 ***\nbedrooms\t 127547.5691         2610.1285        48.8664          0.0000 ***\n---------------------------------------------------------------------\nSignificance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 348398.6399 on 21610 degrees of freedom\nMultiple R-squared: 0.0995,    Adjusted R-squared: 0.0995\nF-statistic: 2387.9246 on 1 and 21610 DF,  p-value: 0.000\n"},"execution_count":168,"metadata":{},"output_type":"execute_result"}]},{"metadata":{},"cell_type":"markdown","source":"We want the R-squared value as close to 1 as possible. The value below 0.1 indicates that bedrooms aren't a good indicator of price. We can explore bathrooms instead."},{"metadata":{"trusted":true},"cell_type":"code","source":"cols = ['bathrooms', 'price']\npriceModel = new OLS(records.select(*cols).smile().numericDataset('price'))","execution_count":169,"outputs":[{"data":{"text/plain":"Linear Model:\n\nResiduals:\n\t       Min\t        1Q\t    Median\t        3Q\t       Max\n\t-1438177.6355\t-184517.8475\t-41517.8475\t113231.1207\t5925318.2373\n\nCoefficients:\n            Estimate        Std. Error        t value        Pr(>|t|)\nIntercept 10687.9535         6210.8477         1.7209          0.0853 .\nbathrooms\t 250331.9576         2759.5824        90.7137          0.0000 ***\n---------------------------------------------------------------------\nSignificance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 312443.3785 on 21610 degrees of freedom\nMultiple R-squared: 0.2758,    Adjusted R-squared: 0.2757\nF-statistic: 8228.9774 on 1 and 21610 DF,  p-value: 0.000\n"},"execution_count":169,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"p0 = priceModel.predict([0] as double[])","execution_count":170,"outputs":[{"data":{"text/plain":"10687.953547666326"},"execution_count":170,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"pMax = priceModel.predict([maxBathrooms] as double[])","execution_count":171,"outputs":[{"data":{"text/plain":"2013343.6143124562"},"execution_count":171,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"scrolled":true,"trusted":true},"cell_type":"code","source":"plot = new Plot(title: 'Price x Bathrooms', xLabel: 'Bathrooms', yLabel: 'Price')\nplot << new Points(x: records.column('bathrooms').toList(),\n                   y: records.column('price').toList())\nplot << new Line(x: [0, maxBathrooms], y: [p0, pMax])","execution_count":172,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"51476a8b-8ae3-43be-b2b3-8ac21934fe17","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{},"cell_type":"markdown","source":"Bathrooms are a better indicator than bedrooms but still not a great indicator. Let's try the size of the living area."},{"metadata":{"trusted":true},"cell_type":"code","source":"maxSqftLiving = records.column('sqft_living').toList().max()\ncols = ['sqft_living', 'price']\npriceModel = new OLS(records.select(*cols).smile().numericDataset('price'))","execution_count":173,"outputs":[{"data":{"text/plain":"Linear Model:\n\nResiduals:\n\t       Min\t        1Q\t    Median\t        3Q\t       Max\n\t-1476117.9717\t-147471.4363\t-24025.9920\t106199.1818\t4362019.7516\n\nCoefficients:\n            Estimate        Std. Error        t value        Pr(>|t|)\nIntercept-43603.3525         4402.7891        -9.9036          0.0000 ***\nsqft_living\t    280.6293            1.9364       144.9217          0.0000 ***\n---------------------------------------------------------------------\nSignificance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 261454.2971 on 21610 degrees of freedom\nMultiple R-squared: 0.4929,    Adjusted R-squared: 0.4928\nF-statistic: 21002.3042 on 1 and 21610 DF,  p-value: 0.000\n"},"execution_count":173,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"p0 = priceModel.predict([0] as double[])\npMax = priceModel.predict([maxSqftLiving] as double[])\nplot = new Plot(title: 'Price x Sqft living', xLabel: 'Sqft living', yLabel: 'Price')\nplot << new Points(x: records.column('sqft_living').toList(),\n                   y: records.column('price').toList())\nplot << new Line(x: [0, maxSqftLiving], y: [p0, pMax])","execution_count":174,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"20a1997d-7b24-4a96-a2ae-cbf3e498b8a1","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{},"cell_type":"markdown","source":"This is much better but we can improve by considering multiple features."},{"metadata":{"trusted":true},"cell_type":"code","source":"cols = ['sqft_living', 'bathrooms', 'grade', 'view', 'bedrooms',\n        'sqft_above', 'yr_renovated', 'waterfront']\npriceModel = new OLS(records.select(*cols + ['price']).smile().numericDataset('price'))","execution_count":175,"outputs":[{"data":{"text/plain":"Linear Model:\n\nResiduals:\n\t       Min\t        1Q\t    Median\t        3Q\t       Max\n\t-1283979.2993\t-123010.7210\t-17886.3115\t95539.4141\t4574026.4384\n\nCoefficients:\n            Estimate        Std. Error        t value        Pr(>|t|)\nIntercept-502995.0577        14353.5849       -35.0432          0.0000 ***\nsqft_living\t    230.3461            4.4683        51.5509          0.0000 ***\nbathrooms\t -22264.7003         3261.1853        -6.8272          0.0000 ***\ngrade\t 101718.4289         2259.2452        45.0232          0.0000 ***\nview\t  59202.3221         2398.4641        24.6834          0.0000 ***\nbedrooms\t -31514.1662         2243.0584       -14.0496          0.0000 ***\nsqft_above\t    -47.6528            4.2197       -11.2929          0.0000 ***\nyr_renovated\t     64.8914            3.9634        16.3725          0.0000 ***\nwaterfront\t 566677.4369        19979.5034        28.3629          0.0000 ***\n---------------------------------------------------------------------\nSignificance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 231841.5286 on 21603 degrees of freedom\nMultiple R-squared: 0.6014,    Adjusted R-squared: 0.6012\nF-statistic: 4073.7645 on 8 and 21603 DF,  p-value: 0.000\n"},"execution_count":175,"metadata":{},"output_type":"execute_result"}]},{"metadata":{},"cell_type":"markdown","source":"We have an even better R-squared value but visualizing it is a little more difficult. This time we will plot the actual values vs the values predicted by our regression model. "},{"metadata":{"trusted":true},"cell_type":"code","source":"plot = new Plot(title: 'Actual vs predicted price', xLabel: 'Actual', yLabel: 'Predicted')\npredictions = cols.collect{ records.column(it).toList() }.transpose().collect{ priceModel.predict(it as double[]) }\nactuals = records.column('price').toList()\nplot << new Points(x: actuals, y: predictions)\nto = [actuals.max(), predictions.max()].min()\nfrom = [actuals.min(), predictions.min()].min()\nplot << new Line(x: [from, to], y: [from, to])","execution_count":177,"outputs":[{"output_type":"display_data","data":{"method":"display_data","application/vnd.jupyter.widget-view+json":{"version_minor":0,"model_id":"25a97fc3-2905-4d06-85dc-85e69b6984e6","version_major":2}},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]}],"metadata":{"anaconda-cloud":{},"kernelspec":{"name":"groovy","display_name":"Groovy","language":"groovy"},"language_info":{"nbconverter_exporter":"","codemirror_mode":"groovy","name":"Groovy","mimetype":"","file_extension":".groovy","version":"2.5.6"},"toc":{"nav_menu":{},"number_sections":false,"sideBar":false,"skip_h1_title":false,"base_numbering":1,"title_cell":"Table of Contents","title_sidebar":"Contents","toc_cell":false,"toc_position":{},"toc_section_display":false,"toc_window_display":false}},"nbformat":4,"nbformat_minor":2}