@janderit
Created November 21, 2015 14:19
(*
The goal in this exercise is to create a model to predict
how many bicycles will be used in a day, given the
available information.
Data source: UC Irvine Machine Learning dataset, "bike sharing":
https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
We will first explore the data, using the CSV type provider
and fsharp.Charting to visualize it.
Then we will develop a regression model to predict usage,
using Math.NET and a bit of linear algebra, starting simple,
and progressively increasing power and complexity.
*)
(*
Step 1. Opening the data using the CSV Type Provider
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Type Providers make it extremely easy to access data in
various formats: point to a sample, generate a type based
on the sample, and start exploring using IntelliSense.
*)
// TODO: run the following code, step-by-step
#I @"../packages/"
#r @"FSharp.Data.2.2.5/lib/net40/FSharp.Data.dll"
open FSharp.Data
// we create a type based on sample data
type Dataset = CsvProvider<"day.csv">
type Datapoint = Dataset.Row
// we can now read data...
let dataset = Dataset.Load("day.csv")
let data = dataset.Rows
// ... which is statically typed
let firstObservation = data |> Seq.head
printfn "First date: %s" (firstObservation.Dteday.ToShortDateString())
// we can print out the file headers
match dataset.Headers with
| None -> printfn "No headers found"
| Some(headers) ->
    headers
    |> Seq.iter (fun header -> printfn "%s" header)
// ... or print the total number of users for 5 first days
data
|> Seq.take 5
|> Seq.iter (fun day -> printfn "Total users: %i" day.Cnt)
// TODO: explore the dataset
// what was the day with the most wind?
data
|> Seq.maxBy (fun e -> e.Windspeed)
|> (fun e -> e.Dteday.ToLongDateString())
// what is the average number of riders?
data |> Seq.averageBy (fun e -> float e.Cnt)
// what is the average number of riders
// during holidays? On Sundays?
data |> Seq.filter (fun e -> e.Holiday) |> Seq.averageBy (fun e -> float e.Cnt)
data |> Seq.filter (fun e -> e.Weekday = 0) |> Seq.averageBy (fun e -> float e.Cnt)
(*
Step 2. Visually exploring the data using F# Charting
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
It is usually easier to understand a dataset by looking at
it with charts. Let's use F# Charting to visualize our data.
*)
// TODO: run the following code, step-by-step
#load @"FSharp.Charting.0.90.13/FSharp.Charting.fsx"
open FSharp.Charting
// Creating basic charts with F# Charting
Chart.Column [ 1 .. 10 ]
Chart.Bar [ 1 .. 10 ]
Chart.Line [ 1 .. 10 ]
Chart.Point [ (1.0,1); (2.5,3); (3.9,8)]
Chart.Line [ "Monday", 1; "Tuesday", 2; "Wednesday", 3 ]
Chart.Line [ for x in 1 .. 20 -> x, x * x ]
// Using additional features
Chart.Line [ "Monday", 1; "Tuesday", 2; "Wednesday", 3 ]
|> Chart.WithTitle("Value by day")
// Combining charts
Chart.Combine [
    Chart.Line [ 1 .. 10 ]
    Chart.Line [ 3 .. 12 ]
]
// TODO: plot the number of users, Cnt, over time.
// Is the curve flat? Is there a trend?
let users =
    data
    |> Seq.map (fun e -> e.Dteday, e.Cnt)
    |> Chart.Line
// Scatterplots (Chart.Point) are often helpful to see if
// some features / variables are related.
// TODO: plot windspeed against count, temperature against
// count. Which one seems more informative?
let users1 =
    data
    |> Seq.map (fun e -> e.Windspeed, e.Cnt)
    |> Chart.Point
data |> Seq.map (fun e -> e.Weathersit, e.Hum) |> Chart.Point
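// The TODO above also asks for temperature against count; a minimal sketch:
data |> Seq.map (fun e -> e.Temp, e.Cnt) |> Chart.Point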
(*
Step 3. Defining a Regression Model to predict Cnt
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
We want to create a model that, given data for a day, will
return a number, the predicted total number of users for
that day. This is called a Regression Model.
*)
// TODO: run the following code, step-by-step
// Model structure: take a day, predict the number of users
type Model = Datapoint -> float
// Our first model will be a straight line through the data
// The parameter theta is called the 'coefficient'.
let prediction (theta:float) (day:Datapoint) =
    theta * (float day.Instant)
// compare the results of 3 models, using 3 different
// values for Theta (1.0, 5.0, 20.0), against the 'real'
// value we are trying to predict.
Chart.Combine [
    Chart.Line (data |> Seq.map (fun day -> day.Cnt))
    Chart.Line (data |> Seq.map (fun day -> prediction 1.0 day))
    Chart.Line (data |> Seq.map (fun day -> prediction 5.0 day))
    Chart.Line (data |> Seq.map (fun day -> prediction 20.0 day))
]
(*
Step 4. What is a 'good model'?
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
What is a good model? A model where for each point, the
error (the difference between the correct and the predicted
values) is small.
*)
// TODO: run the following code, step-by-step
// There are many possible ways to measure error; for
// instance, the absolute difference between the 2.
let absError (model:Model) (day:Datapoint) =
    abs (model day - float day.Cnt)
let meanAbsError (model:Model) (data:Datapoint seq) =
    data |> Seq.averageBy (absError model)
// TODO: for theta = 0.0, 1.0, .. 20.0, compute the
// model error. What is the best value of Theta?
let thetas = [ 0 .. 20 ] |> List.map float
thetas
|> List.map (fun theta -> theta, meanAbsError (prediction theta) data)
|> Chart.Line
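// A small sketch answering the TODO above: pick the theta with the lowest
// mean absolute error (bestTheta is an illustrative name):
let bestTheta = thetas |> List.minBy (fun theta -> meanAbsError (prediction theta) data)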
(*
Step 5. More complex models
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
So far we have used only one coefficient. We can build more
complex models, by adding coefficients, for instance:
Prediction = Theta0 + Theta1 * day.Instant + Theta2 * day.Temp
*)
// TODO: run the following code, step-by-step
let complexPrediction (theta0:float, theta1:float, theta2:float) (day:Datapoint) =
    theta0 + theta1 * (float day.Instant) + theta2 * (float day.Temp)
// now we have 3 coefficients, instead of just one
// note that complexModel still has a signature of
// Datapoint -> float: it is still a valid model
let complexModel = complexPrediction (500.0, 7.5, 2500.0)
let complexModelError = meanAbsError complexModel data
Chart.Combine [
    Chart.Line (data |> Seq.map (fun day -> day.Cnt))
    Chart.Line (data |> Seq.map (fun day -> complexModel day))
]
(*
Step 6. Using Math.NET & Algebra to find the best Theta
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
WARNING: MATH ALERT. The following section contains a bit
of algebra. If you don't follow everything, that's OK: we
are just using algebra to create an algorithm that will
automatically find the best model. In Section 7, we will
use our algorithm, showing a practical usage.
Manually searching for the best value of Theta is painful.
Fortunately, using Linear Algebra, we can find in one shot
the best solution for Theta, using what is called the
'normal form'.
*)
// TODO: run the following code, step-by-step
#r @"MathNet.Numerics.3.8.0/lib/net40/MathNet.Numerics.dll"
#r @"MathNet.Numerics.FSharp.3.8.0/lib/net40/MathNet.Numerics.FSharp.dll"
open MathNet
open MathNet.Numerics.LinearAlgebra
open MathNet.Numerics.LinearAlgebra.Double
type Vec = Vector<float>
type Mat = Matrix<float>
// instead of defining explicitly coefficients
// (theta0,theta1,theta2, ...), we are going to
// transform our model: we use a vector Theta to
// store all these values, and we will transform
// datapoints into another vector, so that our
// prediction becomes simply the product of 2 vectors:
let predict (theta:Vec) (v:Vec) = theta * v
// using Normal Form to estimate a model.
// For more information, see:
// https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)
let estimate (Y:Vec) (X:Mat) =
    (X.Transpose() * X).Inverse() * X.Transpose() * Y
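// Quick sanity check (illustrative, not part of the original exercise): the normal
// form theta = (X'X)^-1 X'Y minimizes the squared error, so for points lying exactly
// on y = 1 + 2x it should recover coefficients close to [1; 2]:
let checkX = matrix [ [ 1.0; 1.0 ]; [ 1.0; 2.0 ]; [ 1.0; 3.0 ] ]
let checkY = vector [ 3.0; 5.0; 7.0 ]
estimate checkY checkX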
// We create a 'Featurizer', a function that will turn
// a Datapoint into a Vector:
type Featurizer = Datapoint -> float list
let predictor (f:Featurizer) (theta:Vec) =
    f >> vector >> (*) theta
let evaluate = meanAbsError
// Given a Featurizer (what data we want to use
// from the Datapoints) and data, this will
// return the 'best' coefficients, as well as a
// function that predicts a value, given a Datapoint:
let model (f:Featurizer) (data:Datapoint seq) =
    let Yt, Xt =
        data
        |> Seq.toList
        |> List.map (fun obs -> float obs.Cnt, f obs)
        |> List.unzip
    let theta = estimate (vector Yt) (matrix Xt)
    let predict = predictor f theta
    theta, predict
(*
Step 7. Using our algorithm to create better models
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Now that we have an algorithm in place, let's use it on our
dataset, to see if we can produce better prediction models!
*)
// illustration: compare a model with no constant term against one with a
// constant (a sketch of the no-constant variant follows the first model below).
// We split the dataset into 2 parts: one for training,
// one for testing the quality of our model.
let train = Dataset.Load("train.csv")
let test = Dataset.Load("test.csv")
// for our first model, we will use a constant (1.0), and
// day.Instant as features.
let featurizer0 (obs:Datapoint) =
    [ 1.0
      float obs.Instant ]
let (theta0,model0) = model featurizer0 train.Rows
Chart.Combine [
    Chart.Line [ for obs in data -> float obs.Cnt ]
    Chart.Line [ for obs in data -> model0 obs ] ]
// we can now compare how good the model does,
// both on train and test sets. A good model should
// have similar performance on both.
evaluate model0 train.Rows |> printfn "Training set: %.0f"
evaluate model0 test.Rows |> printfn "Testing set: %.0f"
Chart.Point [for obs in data -> float obs.Cnt, model0 obs ]
theta0 |> Seq.iteri (fun i x -> printfn "Coeff. %i: %.1f" i x)
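// Sketch of the no-constant variant mentioned above (featurizerNoConstant and
// modelNC are illustrative names, not part of the original exercise):
let featurizerNoConstant (obs:Datapoint) =
    [ float obs.Instant ]
let (thetaNC, modelNC) = model featurizerNoConstant train.Rows
evaluate modelNC test.Rows |> printfn "Testing set, no constant: %.0f"
// without the constant the fitted line is forced through the origin,
// which typically increases the error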
// TODO: add temperature to the features
// is this better? worse?
let freq = 2.0*System.Math.PI / 365.25
let featurizer1 (obs:Datapoint) =
    [ 1.0
      float obs.Instant
      -System.Math.Cos(freq * float obs.Instant)
      float obs.Temp
      float obs.Temp * float obs.Temp
      (if obs.Holiday then 1.0 else 0.0)
      float obs.Hum
      float obs.Windspeed ]
let (theta1,model1) = model featurizer1 train.Rows
evaluate model1 test.Rows |> printfn "Testing set: %.0f"
evaluate model1 train.Rows |> printfn "Training set: %.0f"
evaluate model1 data |> printfn "Full dataset: %.0f"
Chart.Point [for obs in test.Rows -> float obs.Cnt, model1 obs ]
theta1 |> Seq.iteri (fun i x -> printfn "Coeff. %i: %.1f" i x)
// TODO: visualize the result
// TODO: try out some others, eg windspeed
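// A minimal sketch for the visualization TODO above, mirroring the earlier chart for model0:
Chart.Combine [
    Chart.Line [ for obs in data -> float obs.Cnt ]
    Chart.Line [ for obs in data -> model1 obs ] ]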
(*
Step 8. Using discrete features
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
So far we have used numeric values (ex: temperature, wind)
as model input. Let's see if we can use information from
values that are not numbers, such as "is it a holiday"?
*)
let featurizer2 (obs:Datapoint) =
    [ 1.0
      float obs.Instant
      float obs.Temp
      (if obs.Holiday then 1.0 else 0.0) ]
let (theta2,model2) = model featurizer2 train.Rows
evaluate model2 test.Rows |> printfn "Testing set: %.0f"
Chart.Point [for obs in data -> float obs.Cnt, model2 obs ]
// Coeff. 3 shows you the impact of holidays:
// on average, we lose 760 users.
theta2 |> Seq.iteri (fun i x -> printfn "Coeff. %i: %.1f" i x)
// TODO: are Mondays bigger days than Saturdays?
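// One possible way to answer the Monday vs. Saturday TODO above, using indicator
// features (sketch; featurizerDays is an illustrative name, and this assumes the
// dataset's weekday encoding where 1 = Monday and 6 = Saturday):
let featurizerDays (obs:Datapoint) =
    [ 1.0
      float obs.Instant
      float obs.Temp
      (if obs.Weekday = 1 then 1.0 else 0.0)   // Monday
      (if obs.Weekday = 6 then 1.0 else 0.0) ] // Saturday
let (thetaDays, modelDays) = model featurizerDays train.Rows
// comparing the last two coefficients indicates which day attracts more riders
thetaDays |> Seq.iteri (fun i x -> printfn "Coeff. %i: %.1f" i x)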
(*
Bonus section: even more features!
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
At that point, we pretty much threw everything we had into
our model - is that all we can do? Well, no: we can do
potentially more, by combining features into new ones.
*)
// If you look at the temperature vs. users chart, it
// looks like very low or very high temperatures result in
// lower usage:
data
|> Seq.map (fun day -> day.Temp, day.Cnt)
|> Chart.Point
// instead of modelling usage = theta0 + theta1 x temp,
// you could do instead
// usage = theta0 + theta1 x temp + theta2 x (temp x temp).
// TODO: include square of the temperature in the model
let featurizer3 (obs:Datapoint) =
    [ 1.0
      float obs.Instant
      // add your features here, e.g. temp and temp * temp
      // (a possible completion is sketched after the TODOs below)
    ]
let (theta3,model3) = model featurizer3 train.Rows
// TODO: Is the quality improving the same way on train and test?
// TODO: play with more features
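// A possible completion for the TODOs above (sketch; featurizer3' is an illustrative
// name). Comparing train vs. test error shows whether the extra feature generalizes:
let featurizer3' (obs:Datapoint) =
    [ 1.0
      float obs.Instant
      float obs.Temp
      float obs.Temp * float obs.Temp ]
let (theta3', model3') = model featurizer3' train.Rows
evaluate model3' train.Rows |> printfn "Training set: %.0f"
evaluate model3' test.Rows |> printfn "Testing set: %.0f"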
(*
Bonus section: accelerate computations with MKL
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
One of the beauties of linear algebra is that a lot of work has gone
into making it run fast. Math.NET supports MKL (Intel's Math Kernel
Library), a native provider with highly optimized CPU routines.
To use it, install the corresponding NuGet package,
configure Math.NET to use MKL, and compare speed:
System.Environment.CurrentDirectory <- __SOURCE_DIRECTORY__
open MathNet.Numerics
open MathNet.Numerics.Providers.LinearAlgebra.Mkl
Control.LinearAlgebraProvider <- MklLinearAlgebraProvider()
*)
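// A minimal timing sketch for the speed comparison suggested above. The MKL run
// assumes the MKL NuGet package for your platform is installed and referenced;
// the managed run works without it:
let time label f =
    let sw = System.Diagnostics.Stopwatch.StartNew()
    let result = f ()
    printfn "%s: %i ms" label sw.ElapsedMilliseconds
    result
time "managed provider" (fun () -> model featurizer1 train.Rows) |> ignore
// after enabling MKL as in the commented block above, re-run and compare:
// time "MKL provider" (fun () -> model featurizer1 train.Rows) |> ignore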