@mdfarragher
Created November 7, 2019 14:23

Assignment: Load California housing data

In this assignment you're going to build an app that can load a dataset with the prices of houses in California. The data is not ready for training yet and needs a bit of processing.

The first thing you'll need is a data file with house prices. The data from the 1990 California census has exactly what we need. This is a CSV file with 17,000 records that looks like this:

Data File

The file contains information on 17k housing blocks all over the state of California:

  • Column 1: The longitude of the housing block
  • Column 2: The latitude of the housing block
  • Column 3: The median age of all the houses in the block
  • Column 4: The total number of rooms in all houses in the block
  • Column 5: The total number of bedrooms in all houses in the block
  • Column 6: The total number of people living in all houses in the block
  • Column 7: The total number of households in all houses in the block
  • Column 8: The median income of all people living in all houses in the block
  • Column 9: The median house value for all houses in the block

We can use this data to train an app to predict the value of any house in and outside the state of California.

Unfortunately we cannot train on this dataset directly. The data needs to be processed first to make it suitable for training. This is what you will do in this assignment.

Let's get started and install the NuGet packages we need:

https://gist.github.com/faf6dac3c232baf895a4502ea7a00db9

Installing package Microsoft.ML..............done! Successfully added reference to package Microsoft.ML, version 1.3.1

Microsoft.ML is the Microsoft machine learning package. We will use it to build all the applications in this course.

Now we're ready to add code. Let's start with a bunch of using statements:

https://gist.github.com/589e20835e18dc6137b0d38ca5c7dd31

Note the XPlot.Plotly. This is the awesome XPlot plotting library that Jupyter loads by default. We'll use it in this assignment to plot the data in our California Housing dataset.

Now we are ready to add classes. We're going to need one class to hold all the information for a single housing block:

https://gist.github.com/a3d11466453532769e3e10c3a5c74e95

The HouseBlockData class holds all the data for one single housing block. Note how each field is tagged with a LoadColumn attribute that will tell the CSV data loading code which column to import data from.
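The attribute-to-column mapping is easiest to see outside ML.NET. Here is a hedged Python sketch of the same idea: an explicit map from field name to zero-based CSV column index (the Python names and helper are illustrative, not part of the assignment's code):

```python
import csv
import io
from dataclasses import dataclass

# Plays the role of the [LoadColumn] attributes: field name -> CSV column index.
COLUMN_MAP = {
    "longitude": 0, "latitude": 1, "housing_median_age": 2,
    "total_rooms": 3, "total_bedrooms": 4, "population": 5,
    "households": 6, "median_income": 7, "median_house_value": 8,
}

@dataclass
class HouseBlockData:
    longitude: float
    latitude: float
    housing_median_age: float
    total_rooms: float
    total_bedrooms: float
    population: float
    households: float
    median_income: float
    median_house_value: float

def load_house_blocks(csv_text):
    """Parse CSV text into HouseBlockData records via the column map."""
    return [
        HouseBlockData(**{name: float(row[col]) for name, col in COLUMN_MAP.items()})
        for row in csv.reader(io.StringIO(csv_text))
    ]

# First data row of the dataset shown later in this assignment.
blocks = load_house_blocks("-114.31,34.19,15,5612,1283,1015,472,1.4936,66900")
```

The point is that the loading code never guesses: every field states exactly which column it comes from, just like the LoadColumn attributes do.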

Now we need to load the data in memory:

https://gist.github.com/4c8fbc9689081ffa540c8f40b240baf4

This code calls the LoadFromTextFile method to load the CSV data in memory. Note the HouseBlockData type argument that tells the method which class to use to load the data.

So we have the data in memory as a data view. Now let's convert that to an enumeration of HouseBlockData instances:

https://gist.github.com/94003a728c5e193a6fd6e94dc95120e8

This code calls CreateEnumerable to convert the data view to an enumeration of HouseBlockData instances.

Now we can plot the median house value by latitude and longitude. Let's see what happens:

https://gist.github.com/db5e2c618edc4674bd8f10680dfc9d89

Yup, that looks like California. Notice the two high-value areas around San Francisco and Los Angeles, and how the house value gradually drops as we move further eastward.

We're now going to search for a linear relationship between the median house value and any of the other input variables. Let's start by creating a plot of the median house value as a function of median income and see what happens.

If there is a linear relationship between median house value and median income, we expect the plot to show a straight line. So let's check that now:

https://gist.github.com/1612dd467df28924cf8b6cf763bc52b3

As the median income increases, the median house value also increases. There's a big spread in the house values but a vague 'cigar' shape is visible which suggests a linear relationship between these two variables.
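A visual 'cigar' shape can be backed up with a number: the Pearson correlation coefficient, which approaches 1 for a perfect rising line. This is not part of the assignment's code; it's a small Python sketch on made-up, roughly linear income/value pairs:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient: +1 means a perfect rising line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Synthetic (income, house value in $1000s) pairs with a noisy linear trend.
incomes = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
values = [70, 95, 110, 140, 150, 185, 200, 230]
r = pearson(incomes, values)   # close to 1: a strong linear relationship
```

Running the same calculation on the real income and house value columns would quantify how linear the cigar shape actually is.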

But look at the horizontal line at 500,000. What's that all about?

This is what clipping looks like. The creator of this dataset has clipped all housing blocks with a median house value above $500,000 to $500,000. We see this appear in the graph as a horizontal line that disrupts the linear cigar shape.

Let's start by using data scrubbing to get rid of these clipped records:

https://gist.github.com/91ecc33c3c983533a0370115239d0243

The FilterRowsByColumn method keeps only those records with a median house value below the 500,000 cutoff and removes everything else, including the clipped records, from the dataset.
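The same scrubbing step, sketched in plain Python for clarity (the strict comparison is an assumption that reflects the goal: records clipped at the ceiling must be dropped):

```python
CLIP_CEILING = 500_000

def scrub(house_values):
    """Keep only values strictly below the clipping ceiling."""
    return [v for v in house_values if v < CLIP_CEILING]

# Two clipped records mixed in with three genuine ones.
kept = scrub([66_900, 500_000, 80_100, 500_000, 350_000])
```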

Let's check if that worked:

https://gist.github.com/683669c431569b5fc9981f206c2c0f80

Much better! Notice how the horizontal line at $500k is gone now?

Now let's take a closer look at the CSV data:

https://gist.github.com/1c36de103a945bf75a3472589bc1d540

| index | Longitude | Latitude | HousingMedianAge | TotalRooms | TotalBedrooms | Population | Households | MedianIncome | MedianHouseValue |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15 | 5612 | 1283 | 1015 | 472 | 1.4936 | 66900 |
| 1 | -114.47 | 34.4 | 19 | 7650 | 1901 | 1129 | 463 | 1.82 | 80100 |
| 2 | -114.56 | 33.69 | 17 | 720 | 174 | 333 | 117 | 1.6509 | 85700 |
| 3 | -114.57 | 33.64 | 14 | 1501 | 337 | 515 | 226 | 3.1917 | 73400 |
| 4 | -114.57 | 33.57 | 20 | 1454 | 326 | 624 | 262 | 1.925 | 65500 |
| 5 | -114.58 | 33.63 | 29 | 1387 | 236 | 671 | 239 | 3.3438 | 74000 |
| 6 | -114.58 | 33.61 | 25 | 2907 | 680 | 1841 | 633 | 2.6768 | 82400 |
| 7 | -114.59 | 34.83 | 41 | 812 | 168 | 375 | 158 | 1.7083 | 48500 |
| 8 | -114.59 | 33.61 | 34 | 4789 | 1175 | 3134 | 1056 | 2.1782 | 58400 |
| 9 | -114.6 | 34.83 | 46 | 1497 | 309 | 787 | 271 | 2.1908 | 48100 |

Notice how most columns contain numbers in a range of roughly 0..10,000? The median house value column is an outlier because it contains values in a range of 0..500,000.

Remember from our discussion of training data science models that all columns should be in a similar range?

So let's fix that now by using data scaling. We're going to divide the median house value by 1,000 to bring it down to a range more in line with the other data columns.

Let's add the following class:

https://gist.github.com/5aad08fff3e0f55c70139350d7d7489f

and a bit more code:

https://gist.github.com/ed7cc80dc7bcf89772a335e01c71eef4

Machine learning models in ML.NET are built with pipelines which are sequences of data-loading, transformation, and learning components.

This pipeline has only one component:

  • CustomMapping which takes the median house values, divides them by 1,000 and stores them in a new column called NormalizedMedianHouseValue. Note that we need the new ToMedianHouseValue class to access this new column in code.
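The mapping itself is just a division. In plain Python, the same transformation looks like this (a sketch of the idea, not the ML.NET code):

```python
def to_normalized_house_value(median_house_value):
    """Scale the house value into roughly the same range as the other columns."""
    return median_house_value / 1_000

# First three house values from the data shown earlier.
normalized = [to_normalized_house_value(v) for v in (66_900, 80_100, 85_700)]
```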

Let's see if the conversion worked. But first we're going to need a quick helper method to print the results of the machine learning pipeline:

https://gist.github.com/d5059f8f4be673e133c8f53bd097052a

This code sets up an output formatter for Jupyter that can display DataDebuggerPreview values which we get from running the machine learning pipeline.

Let's run the pipeline now, grab the first 10 results and display them:

https://gist.github.com/1f94a037764938a94ea969132bcd44b7

| index | Longitude | Latitude | HousingMedianAge | TotalRooms | TotalBedrooms | Population | Households | MedianIncome | MedianHouseValue | NormalizedMedianHouseValue |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15 | 5612 | 1283 | 1015 | 472 | 1.4936 | 66900 | 66.9 |
| 1 | -114.47 | 34.4 | 19 | 7650 | 1901 | 1129 | 463 | 1.82 | 80100 | 80.1 |
| 2 | -114.56 | 33.69 | 17 | 720 | 174 | 333 | 117 | 1.6509 | 85700 | 85.7 |
| 3 | -114.57 | 33.64 | 14 | 1501 | 337 | 515 | 226 | 3.1917 | 73400 | 73.4 |
| 4 | -114.57 | 33.57 | 20 | 1454 | 326 | 624 | 262 | 1.925 | 65500 | 65.5 |
| 5 | -114.58 | 33.63 | 29 | 1387 | 236 | 671 | 239 | 3.3438 | 74000 | 74 |
| 6 | -114.58 | 33.61 | 25 | 2907 | 680 | 1841 | 633 | 2.6768 | 82400 | 82.4 |
| 7 | -114.59 | 34.83 | 41 | 812 | 168 | 375 | 158 | 1.7083 | 48500 | 48.5 |
| 8 | -114.59 | 33.61 | 34 | 4789 | 1175 | 3134 | 1056 | 2.1782 | 58400 | 58.4 |
| 9 | -114.6 | 34.83 | 46 | 1497 | 309 | 787 | 271 | 2.1908 | 48100 | 48.1 |

The Fit method sets up the pipeline, creates a machine learning model and stores it in the model variable. The Transform method then runs all data through the pipeline and stores the result in transformedData. And finally the Preview method extracts a 10-row preview from the transformed data.

Notice the NormalizedMedianHouseValue column at the end? It contains house values divided by 1,000. The pipeline is working!
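The Fit/Transform/Preview pattern is easy to mimic outside ML.NET. Here is a minimal Python sketch of the same three-step flow (illustrative toy classes, not the real API):

```python
class ScaleColumnEstimator:
    """Toy estimator: fit() produces a model, the model transforms every row."""
    def __init__(self, factor):
        self.factor = factor

    def fit(self, rows):
        # This transform needs no learned state, so fit just builds the model.
        return ScaleColumnModel(self.factor)

class ScaleColumnModel:
    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [r / self.factor for r in rows]

data = [66_900, 80_100, 85_700, 73_400]
model = ScaleColumnEstimator(1_000).fit(data)   # like pipeline.Fit(...)
transformed = model.transform(data)             # like model.Transform(...)
preview = transformed[:2]                       # like Preview(maxRows: ...)
```

Many pipeline components do learn state during Fit (binning boundaries, one-hot key maps), which is why the fit and transform steps are kept separate.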

Now let's fix the latitude and longitude. We're reading them in directly, but remember that we discussed how Geo data should always be binned, one-hot encoded, and crossed?

Let's do that now. We'll start by adding the following classes:

https://gist.github.com/c273674287bb3379365fe39cf3a9708f

We're going to use these classes in the upcoming code snippets.

Now we will extend the pipeline with extra steps to process the latitude and longitude:

https://gist.github.com/03527d34ed716ab2b37779c741c5ef03

Note how we're extending the data loading pipeline with extra components. The new components are:

  • A NormalizeBinning component that bins the longitude values into 10 bins
  • A NormalizeBinning component that bins the latitude values into 10 bins
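NormalizeBinning derives its bin boundaries from the data distribution, but the core idea can be sketched with simple equal-width bins (the California latitude bounds below are assumptions for illustration, so the outputs won't match ML.NET's exactly):

```python
def bin_value(x, lo, hi, bin_count=10):
    """Map x to bin_index / (bin_count - 1), giving values 0/9 .. 9/9."""
    width = (hi - lo) / bin_count
    index = min(int((x - lo) / width), bin_count - 1)  # clamp the top edge
    return index / (bin_count - 1)

# Rough latitude extent of California (illustrative bounds).
binned = [bin_value(lat, lo=32.5, hi=42.0) for lat in (32.5, 34.19, 41.99)]
```

This also explains the fractional values you'll see in the binned columns: with 10 bins, every value is a multiple of 1/9, such as 0.11111111 or 0.44444445.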

Let's see if that worked:

https://gist.github.com/d491bdb5e1c0387e21a268e58a8b07dd

| index | Longitude | Latitude | HousingMedianAge | TotalRooms | TotalBedrooms | Population | Households | MedianIncome | MedianHouseValue | NormalizedMedianHouseValue | BinnedLongitude | BinnedLatitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15 | 5612 | 1283 | 1015 | 472 | 1.4936 | 66900 | 66.9 | 0 | 0.44444445 |
| 1 | -114.47 | 34.4 | 19 | 7650 | 1901 | 1129 | 463 | 1.82 | 80100 | 80.1 | 0 | 0.5555556 |
| 2 | -114.56 | 33.69 | 17 | 720 | 174 | 333 | 117 | 1.6509 | 85700 | 85.7 | 0 | 0.11111111 |
| 3 | -114.57 | 33.64 | 14 | 1501 | 337 | 515 | 226 | 3.1917 | 73400 | 73.4 | 0 | 0.11111111 |
| 4 | -114.57 | 33.57 | 20 | 1454 | 326 | 624 | 262 | 1.925 | 65500 | 65.5 | 0 | 0 |
| 5 | -114.58 | 33.63 | 29 | 1387 | 236 | 671 | 239 | 3.3438 | 74000 | 74 | 0 | 0.11111111 |
| 6 | -114.58 | 33.61 | 25 | 2907 | 680 | 1841 | 633 | 2.6768 | 82400 | 82.4 | 0 | 0 |
| 7 | -114.59 | 34.83 | 41 | 812 | 168 | 375 | 158 | 1.7083 | 48500 | 48.5 | 0 | 0.5555556 |
| 8 | -114.59 | 33.61 | 34 | 4789 | 1175 | 3134 | 1056 | 2.1782 | 58400 | 58.4 | 0 | 0 |
| 9 | -114.6 | 34.83 | 46 | 1497 | 309 | 787 | 271 | 2.1908 | 48100 | 48.1 | 0 | 0.5555556 |

Check out the BinnedLongitude and BinnedLatitude columns at the end. Each unique longitude and latitude value has been grouped into a set of 10 bins.

Let's plot the bins to get a feel for what just happened:

https://gist.github.com/d0851dfe795145f8a2736a3cfa25a533

I've added a quick helper class called BinnedHouseBlockData to access the two new binned columns, and the plotting code is exactly the same as before.

Check out the result. The plot again shows median house value by latitude and longitude, but now all locations have been binned into a 10x10 grid of tiles. This helps a machine learning algorithm pick up coarse-grained location patterns without getting bogged down in details.

Now let's one-hot encode the binned latitude and longitude:

https://gist.github.com/fc00028624cbc2f968cc6fcc7c37ba1c

Note how we're extending the data loading pipeline again. The new components are:

  • A OneHotEncoding component that one-hot encodes the longitude bins
  • A OneHotEncoding component that one-hot encodes the latitude bins
  • A CustomMapping component that crosses the one-hot encoded vectors of the longitude and latitude. ML.NET has no built-in support for crossing one-hot encoded vectors, so we do it manually with a nested for loop and store the result in a new column called Location.
  • A final DropColumns component to delete all columns from the data view that we don't need anymore.
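The one-hot and crossing steps are easy to verify by hand. Here is a small Python sketch of the nested-loop cross described above (an illustration, not the ML.NET code itself):

```python
def one_hot(index, length=10):
    """A vector of zeros with a single 1 at the given position."""
    vec = [0.0] * length
    vec[index] = 1.0
    return vec

def cross(lon_vec, lat_vec):
    # Flattened outer product via a nested loop: because each input vector
    # has exactly one 1, the 100-element result also has exactly one 1.
    return [lon * lat for lon in lon_vec for lat in lat_vec]

location = cross(one_hot(0), one_hot(4))  # longitude bin 0, latitude bin 4
```

The position of the 1 encodes the (longitude bin, latitude bin) pair: longitude bin i and latitude bin j land at index i * 10 + j.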

Let's see if this worked:

https://gist.github.com/d888365573d926ffffba075a89fbbedc

| index | HousingMedianAge | TotalRooms | TotalBedrooms | Population | Households | MedianIncome | NormalizedMedianHouseValue | Location |
|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 5612 | 1283 | 1015 | 472 | 1.4936 | 66.9 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 1 | 19 | 7650 | 1901 | 1129 | 463 | 1.82 | 80.1 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 2 | 17 | 720 | 174 | 333 | 117 | 1.6509 | 85.7 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 3 | 14 | 1501 | 337 | 515 | 226 | 3.1917 | 73.4 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 4 | 20 | 1454 | 326 | 624 | 262 | 1.925 | 65.5 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 5 | 29 | 1387 | 236 | 671 | 239 | 3.3438 | 74 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 6 | 25 | 2907 | 680 | 1841 | 633 | 2.6768 | 82.4 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 7 | 41 | 812 | 168 | 375 | 158 | 1.7083 | 48.5 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 8 | 34 | 4789 | 1175 | 3134 | 1056 | 2.1782 | 58.4 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |
| 9 | 46 | 1497 | 309 | 787 | 271 | 2.1908 | 48.1 | { Microsoft.ML.Data.VBuffer<System.Single>: IsDense: True, Length: 100 } |

Note how we now have an extra column called Location with a 100-element buffer of Single values. This is the result of our feature cross of longitude and latitude. Each vector contains almost all zeroes, with only a single 1.

Let's display the crossed vector to make sure everything is working:

https://gist.github.com/df878ffd34b57a923bfd4b717e0031b0

| index | Position of the single 1 in the 100-element Location vector |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3 | 2 |
| 4 | 3 |
| 5 | 2 |
| 6 | 3 |
| 7 | 1 |
| 8 | 3 |
| 9 | 1 |

That looks great. There's only a single 1 in every row, just as expected.
