@mdfarragher
Created November 7, 2019 14:23

Assignment: Load California housing data

In this assignment you're going to build an app that can load a dataset with the prices of houses in California. The data is not ready for training yet and needs a bit of processing.

The first thing you'll need is a data file with house prices. The data from the 1990 California census has exactly what we need. This is a CSV file with 17,000 records that looks like this:

Data File

The file contains information on 17k housing blocks all over the state of California:

  • Column 1: The longitude of the housing block
  • Column 2: The latitude of the housing block
  • Column 3: The median age of all the houses in the block
  • Column 4: The total number of rooms in all houses in the block
  • Column 5: The total number of bedrooms in all houses in the block
  • Column 6: The total number of people living in all houses in the block
  • Column 7: The total number of households in all houses in the block
  • Column 8: The median income of all people living in all houses in the block
  • Column 9: The median house value for all houses in the block

We can use this data to train an app to predict the value of any house in and outside the state of California.

Unfortunately we cannot train on this dataset directly. The data needs to be processed first to make it suitable for training. This is what you will do in this assignment.

Let's get started and install the NuGet packages we need:

https://gist.github.com/faf6dac3c232baf895a4502ea7a00db9

Installing package Microsoft.ML..............done! Successfully added reference to package Microsoft.ML, version 1.3.1

Microsoft.ML is the Microsoft machine learning package. We will use it to build all the applications in this course.

Now we're ready to add code. Let's start with a bunch of using statements:

https://gist.github.com/589e20835e18dc6137b0d38ca5c7dd31

Note the XPlot.Plotly. This is the awesome XPlot plotting library that Jupyter loads by default. We'll use it in this assignment to plot the data in our California Housing dataset.

Now we are ready to add classes. We're going to need one class to hold all the information for a single housing block:

https://gist.github.com/a3d11466453532769e3e10c3a5c74e95

The HouseBlockData class holds all the data for one single housing block. Note how each field is tagged with a LoadColumn attribute that will tell the CSV data loading code which column to import data from.
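The attribute-to-column mapping is easiest to see outside ML.NET. Here is a hedged Python sketch of the same idea: an explicit map from field name to zero-based CSV column index (the Python names and helper are illustrative, not part of the assignment's code):

```python
import csv
import io
from dataclasses import dataclass

# Plays the role of the [LoadColumn] attributes: field name -> CSV column index.
COLUMN_MAP = {
    "longitude": 0, "latitude": 1, "housing_median_age": 2,
    "total_rooms": 3, "total_bedrooms": 4, "population": 5,
    "households": 6, "median_income": 7, "median_house_value": 8,
}

@dataclass
class HouseBlockData:
    longitude: float
    latitude: float
    housing_median_age: float
    total_rooms: float
    total_bedrooms: float
    population: float
    households: float
    median_income: float
    median_house_value: float

def load_house_blocks(csv_text):
    """Parse CSV text into HouseBlockData records via the column map."""
    return [
        HouseBlockData(**{name: float(row[col]) for name, col in COLUMN_MAP.items()})
        for row in csv.reader(io.StringIO(csv_text))
    ]

# First data row of the dataset shown later in this assignment.
blocks = load_house_blocks("-114.31,34.19,15,5612,1283,1015,472,1.4936,66900")
```

The point is that the loading code never guesses: every field states exactly which column it comes from, just like the LoadColumn attributes do.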

Now we need to load the data in memory:

https://gist.github.com/4c8fbc9689081ffa540c8f40b240baf4

This code calls the LoadFromTextFile method to load the CSV data in memory. Note the HouseBlockData type argument that tells the method which class to use to load the data.

So we have the data in memory as a data view. Now let's convert that to an enumeration of HouseBlockData instances:

https://gist.github.com/94003a728c5e193a6fd6e94dc95120e8

This code calls CreateEnumerable to convert the data view to an enumeration of HouseBlockData instances.

Now we can plot the median house value by latitude and longitude. Let's see what happens:

https://gist.github.com/db5e2c618edc4674bd8f10680dfc9d89

Yup, that looks like California. Notice the two high-value areas around San Francisco and Los Angeles, and how the house value gradually drops as we move further eastward.

We're now going to search for a linear relationship between the median house value and any of the other input variables. Let's start by creating a plot of the median house value as a function of median income and see what happens.

If there is a linear relationship between median house value and median income, we expect the plot to show a straight line. So let's check that now:

https://gist.github.com/1612dd467df28924cf8b6cf763bc52b3

As the median income increases, the median house value also increases. There's a big spread in the house values but a vague 'cigar' shape is visible which suggests a linear relationship between these two variables.
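A visual 'cigar' shape can be backed up with a number: the Pearson correlation coefficient, which approaches 1 for a perfect rising line. This is not part of the assignment's code; it's a small Python sketch on made-up, roughly linear income/value pairs:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient: +1 means a perfect rising line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Synthetic (income, house value in $1000s) pairs with a noisy linear trend.
incomes = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
values = [70, 95, 110, 140, 150, 185, 200, 230]
r = pearson(incomes, values)   # close to 1: a strong linear relationship
```

Running the same calculation on the real income and house value columns would quantify how linear the cigar shape actually is.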

But look at the horizontal line at 500,000. What's that all about?

This is what clipping looks like. The creator of this dataset has clipped all housing blocks with a median house value above $500,000 to $500,000. We see this appear in the graph as a horizontal line that disrupts the linear cigar shape.

Let's start by using data scrubbing to get rid of these clipped records:

https://gist.github.com/91ecc33c3c983533a0370115239d0243

The FilterRowsByColumn method keeps only those records with a median house value below the 500,000 cutoff and removes everything else, including the clipped records, from the dataset.
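The same scrubbing step, sketched in plain Python for clarity (the strict comparison is an assumption that reflects the goal: records clipped at the ceiling must be dropped):

```python
CLIP_CEILING = 500_000

def scrub(house_values):
    """Keep only values strictly below the clipping ceiling."""
    return [v for v in house_values if v < CLIP_CEILING]

# Two clipped records mixed in with three genuine ones.
kept = scrub([66_900, 500_000, 80_100, 500_000, 350_000])
```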

Let's check if that worked:

https://gist.github.com/683669c431569b5fc9981f206c2c0f80

Much better! Notice how the horizontal line at $500k is gone now?

Now let's take a closer look at the CSV data:

https://gist.github.com/1c36de103a945bf75a3472589bc1d540

| index | Longitude | Latitude | HousingMedianAge | TotalRooms | TotalBedrooms | Population | Households | MedianIncome | MedianHouseValue |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15 | 5612 | 1283 | 1015 | 472 | 1.4936 | 66900 |
| 1 | -114.47 | 34.4 | 19 | 7650 | 1901 | 1129 | 463 | 1.82 | 80100 |
| 2 | -114.56 | 33.69 | 17 | 720 | 174 | 333 | 117 | 1.6509 | 85700 |
| 3 | -114.57 | 33.64 | 14 | 1501 | 337 | 515 | 226 | 3.1917 | 73400 |
| 4 | -114.57 | 33.57 | 20 | 1454 | 326 | 624 | 262 | 1.925 | 65500 |
| 5 | -114.58 | 33.63 | 29 | 1387 | 236 | 671 | 239 | 3.3438 | 74000 |
| 6 | -114.58 | 33.61 | 25 | 2907 | 680 | 1841 | 633 | 2.6768 | 82400 |
| 7 | -114.59 | 34.83 | 41 | 812 | 168 | 375 | 158 | 1.7083 | 48500 |
| 8 | -114.59 | 33.61 | 34 | 4789 | 1175 | 3134 | 1056 | 2.1782 | 58400 |
| 9 | -114.6 | 34.83 | 46 | 1497 | 309 | 787 | 271 | 2.1908 | 48100 |

Notice how most columns contain numbers in a range of roughly 0..10,000? The median house value column is an outlier because it contains values in a range of 0..500,000.

Remember from our discussion of training data science models that all columns should be in a similar range?

So let's fix that now by using data scaling. We're going to divide the median house value by 1,000 to bring it down to a range more in line with the other data columns.

Let's add the following class:

https://gist.github.com/5aad08fff3e0f55c70139350d7d7489f

and a bit more code:

https://gist.github.com/ed7cc80dc7bcf89772a335e01c71eef4

Machine learning models in ML.NET are built with pipelines which are sequences of data-loading, transformation, and learning components.

This pipeline has only one component:

  • CustomMapping which takes the median house values, divides them by 1,000 and stores them in a new column called NormalizedMedianHouseValue. Note that we need the new ToMedianHouseValue class to access this new column in code.
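The mapping itself is just a division. In plain Python, the same transformation looks like this (a sketch of the idea, not the ML.NET code):

```python
def to_normalized_house_value(median_house_value):
    """Scale the house value into roughly the same range as the other columns."""
    return median_house_value / 1_000

# First three house values from the data shown earlier.
normalized = [to_normalized_house_value(v) for v in (66_900, 80_100, 85_700)]
```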

Let's see if the conversion worked. But first we're going to need a quick helper method to print the results of the machine learning pipeline:

https://gist.github.com/d5059f8f4be673e133c8f53bd097052a

This code sets up an output formatter for Jupyter that can display DataDebuggerPreview values which we get from running the machine learning pipeline.

Let's run the pipeline now, grab the first 10 results and display them:

https://gist.github.com/1f94a037764938a94ea969132bcd44b7

| index | Longitude | Latitude | HousingMedianAge | TotalRooms | TotalBedrooms | Population | Households | MedianIncome | MedianHouseValue | NormalizedMedianHouseValue |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15 | 5612 | 1283 | 1015 | 472 | 1.4936 | 66900 | 66.9 |
| 1 | -114.47 | 34.4 | 19 | 7650 | 1901 | 1129 | 463 | 1.82 | 80100 | 80.1 |
| 2 | -114.56 | 33.69 | 17 | 720 | 174 | 333 | 117 | 1.6509 | 85700 | 85.7 |
| 3 | -114.57 | 33.64 | 14 | 1501 | 337 | 515 | 226 | 3.1917 | 73400 | 73.4 |
| 4 | -114.57 | 33.57 | 20 | 1454 | 326 | 624 | 262 | 1.925 | 65500 | 65.5 |
| 5 | -114.58 | 33.63 | 29 | 1387 | 236 | 671 | 239 | 3.3438 | 74000 | 74 |
| 6 | -114.58 | 33.61 | 25 | 2907 | 680 | 1841 | 633 | 2.6768 | 82400 | 82.4 |
| 7 | -114.59 | 34.83 | 41 | 812 | 168 | 375 | 158 | 1.7083 | 48500 | 48.5 |
| 8 | -114.59 | 33.61 | 34 | 4789 | 1175 | 3134 | 1056 | 2.1782 | 58400 | 58.4 |
| 9 | -114.6 | 34.83 | 46 | 1497 | 309 | 787 | 271 | 2.1908 | 48100 | 48.1 |

The Fit method sets up the pipeline, creates a machine learning model and stores it in the model variable. The Transform method then runs all data through the pipeline and stores the result in transformedData. And finally the Preview method extracts a 10-row preview from the transformed data.

Notice the NormalizedMedianHouseValue column at the end? It contains house values divided by 1,000. The pipeline is working!
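The Fit/Transform/Preview pattern is easy to mimic outside ML.NET. Here is a minimal Python sketch of the same three-step flow (illustrative toy classes, not the real API):

```python
class ScaleColumnEstimator:
    """Toy estimator: fit() produces a model, the model transforms every row."""
    def __init__(self, factor):
        self.factor = factor

    def fit(self, rows):
        # This transform needs no learned state, so fit just builds the model.
        return ScaleColumnModel(self.factor)

class ScaleColumnModel:
    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [r / self.factor for r in rows]

data = [66_900, 80_100, 85_700, 73_400]
model = ScaleColumnEstimator(1_000).fit(data)   # like pipeline.Fit(...)
transformed = model.transform(data)             # like model.Transform(...)
preview = transformed[:2]                       # like Preview(maxRows: ...)
```

Many pipeline components do learn state during Fit (binning boundaries, one-hot key maps), which is why the fit and transform steps are kept separate.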

Now let's fix the latitude and longitude. We're reading them in directly, but remember that we discussed how Geo data should always be binned, one-hot encoded, and crossed?

Let's do that now. We'll start by adding the following classes:

https://gist.github.com/c273674287bb3379365fe39cf3a9708f

We're going to use these classes in the upcoming code snippets.

Now we will extend the pipeline with extra steps to process the latitude and longitude:

https://gist.github.com/03527d34ed716ab2b37779c741c5ef03

Note how we're extending the data loading pipeline with extra components. The new components are:

  • A NormalizeBinning component that bins the longitude values into 10 bins
  • A NormalizeBinning component that bins the latitude values into 10 bins
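NormalizeBinning derives its bin boundaries from the data distribution, but the core idea can be sketched with simple equal-width bins (the California latitude bounds below are assumptions for illustration, so the outputs won't match ML.NET's exactly):

```python
def bin_value(x, lo, hi, bin_count=10):
    """Map x to bin_index / (bin_count - 1), giving values 0/9 .. 9/9."""
    width = (hi - lo) / bin_count
    index = min(int((x - lo) / width), bin_count - 1)  # clamp the top edge
    return index / (bin_count - 1)

# Rough latitude extent of California (illustrative bounds).
binned = [bin_value(lat, lo=32.5, hi=42.0) for lat in (32.5, 34.19, 41.99)]
```

This also explains the fractional values you'll see in the binned columns: with 10 bins, every value is a multiple of 1/9, such as 0.11111111 or 0.44444445.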

Let's see if that worked:

https://gist.github.com/d491bdb5e1c0387e21a268e58a8b07dd

| index | Longitude | Latitude | HousingMedianAge | TotalRooms | TotalBedrooms | Population | Households | MedianIncome | MedianHouseValue | NormalizedMedianHouseValue | BinnedLongitude | BinnedLatitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15 | 5612 | 1283 | 1015 | 472 | 1.4936 | 66900 | 66.9 | 0 | 0.44444445 |
| 1 | -114.47 | 34.4 | 19 | 7650 | 1901 | 1129 | 463 | 1.82 | 80100 | 80.1 | 0 | 0.5555556 |
| 2 | -114.56 | 33.69 | 17 | 720 | 174 | 333 | 117 | 1.6509 | 85700 | 85.7 | 0 | 0.11111111 |
| 3 | -114.57 | 33.64 | 14 | 1501 | 337 | 515 | 226 | 3.1917 | 73400 | 73.4 | 0 | 0.11111111 |
| 4 | -114.57 | 33.57 | 20 | 1454 | 326 | 624 | 262 | 1.925 | 65500 | 65.5 | 0 | 0 |
| 5 | -114.58 | 33.63 | 29 | 1387 | 236 | 671 | 239 | 3.3438 | 74000 | 74 | 0 | 0.11111111 |
| 6 | -114.58 | 33.61 | 25 | 2907 | 680 | 1841 | 633 | 2.6768 | 82400 | 82.4 | 0 | 0 |
| 7 | -114.59 | 34.83 | 41 | 812 | 168 | 375 | 158 | 1.7083 | 48500 | 48.5 | 0 | 0.5555556 |
| 8 | -114.59 | 33.61 | 34 | 4789 | 1175 | 3134 | 1056 | 2.1782 | 58400 | 58.4 | 0 | 0 |
| 9 | -114.6 | 34.83 | 46 | 1497 | 309 | 787 | 271 | 2.1908 | 48100 | 48.1 | 0 | 0.5555556 |

Check out the BinnedLongitude and BinnedLatitude columns at the end. Each unique longitude and latitude value has been grouped into a set of 10 bins.

Let's plot the bins to get a feel for what just happened:

https://gist.github.com/d0851dfe795145f8a2736a3cfa25a533

I've added a quick helper class called BinnedHouseBlockData to access the two new binned columns, and the plotting code is exactly the same as before.

Check out the result. The plot again shows median house value by latitude and longitude, but now all locations have been binned into a 10x10 grid of tiles. This helps a machine learning algorithm pick up coarse-grained location patterns without getting bogged down in details.

Now let's one-hot encode the binned latitude and longitude:

https://gist.github.com/fc00028624cbc2f968cc6fcc7c37ba1c

Note how we're extending the data loading pipeline again. The new components are:

  • A OneHotEncoding component that one-hot encodes the longitude bins
  • A OneHotEncoding component that one-hot encodes the latitude bins
  • A CustomMapping component that crosses the one-hot encoded vectors of the longitude and latitude. ML.NET has no built-in support for crossing one-hot encoded vectors, so we do it manually with a nested for loop and store the result in a new column called Location.
  • A final DropColumns component to delete all columns from the data view that we don't need anymore.
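The one-hot and crossing steps are easy to verify by hand. Here is a small Python sketch of the nested-loop cross described above (an illustration, not the ML.NET code itself):

```python
def one_hot(index, length=10):
    """A vector of zeros with a single 1 at the given position."""
    vec = [0.0] * length
    vec[index] = 1.0
    return vec

def cross(lon_vec, lat_vec):
    # Flattened outer product via a nested loop: because each input vector
    # has exactly one 1, the 100-element result also has exactly one 1.
    return [lon * lat for lon in lon_vec for lat in lat_vec]

location = cross(one_hot(0), one_hot(4))  # longitude bin 0, latitude bin 4
```

The position of the 1 encodes the (longitude bin, latitude bin) pair: longitude bin i and latitude bin j land at index i * 10 + j.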

Let's see if this worked:

https://gist.github.com/d888365573d926ffffba075a89fbbedc

| index | HousingMedianAge | TotalRooms | TotalBedrooms | Population | Households | MedianIncome | NormalizedMedianHouseValue | Location |
|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 5612 | 1283 | 1015 | 472 | 1.4936 | 66.9 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 1 | 19 | 7650 | 1901 | 1129 | 463 | 1.82 | 80.1 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 2 | 17 | 720 | 174 | 333 | 117 | 1.6509 | 85.7 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 3 | 14 | 1501 | 337 | 515 | 226 | 3.1917 | 73.4 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 4 | 20 | 1454 | 326 | 624 | 262 | 1.925 | 65.5 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 5 | 29 | 1387 | 236 | 671 | 239 | 3.3438 | 74 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 6 | 25 | 2907 | 680 | 1841 | 633 | 2.6768 | 82.4 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 7 | 41 | 812 | 168 | 375 | 158 | 1.7083 | 48.5 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 8 | 34 | 4789 | 1175 | 3134 | 1056 | 2.1782 | 58.4 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 9 | 46 | 1497 | 309 | 787 | 271 | 2.1908 | 48.1 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |

Note how we now have an extra column called Location with a 100-element buffer of Single values. This is the result of our feature cross of longitude and latitude. Each vector contains almost all zeroes, with only a single 1.

Let's display the crossed vector to make sure everything is working:

https://gist.github.com/df878ffd34b57a923bfd4b717e0031b0

| index | Position of the single 1 in the 100-element Location vector |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3 | 2 |
| 4 | 3 |
| 5 | 2 |
| 6 | 3 |
| 7 | 1 |
| 8 | 3 |
| 9 | 1 |

That looks great. There's only a single 1 in every row, just as expected.
