@danielparton
Created June 21, 2017 18:01

Linear Regression - lecture series II-IV

Model representation

Notation (used throughout course)

  • m = number of training examples
  • x = “input” variable / feature
  • y = “output” variable / “target” variable
  • (x, y) = one training example
  • $(x^{(i)}, y^{(i)})$ = the ith training example
  • h = hypothesis*, aka model
  • θ = model parameters

  • *Andrew Ng suggests the word “hypothesis” may not be completely appropriate, but it is widely used in the ML community
                    Training set
                         ||
                         \/
                  Learning algorithm
                         ||
                         \/
Size of house (x) => Hypothesis (h) => Estimated price (y)

In other words, h maps from x to y  

Univariate linear regression

Model

  $h_\theta(x) = \theta_0 + \theta_1x$  

Cost function

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m \left ( h_\theta (x^{(i)}) - y^{(i)}\right )^2$

$J(\theta_0,\theta_1)$ is the cost function

The aim is to vary $\theta_0$ and $\theta_1$ so as to minimize the cost function, i.e.:

$\min_{\theta_0, \theta_1} J(\theta_0, \theta_1) = \min_{\theta_0, \theta_1} \frac{1}{2m} \sum_{i=1}^m \left ( h_\theta(x^{(i)}) - y^{(i)} \right )^2$

This particular cost function is known as the squared error function - probably the most commonly used for linear regression problems. It is a convex function, so it has a single global minimum.
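As a minimal sketch, the squared-error cost function can be written in plain Python (the data values below are made up for illustration):

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(cost(0.0, 2.0, xs, ys))  # perfect fit (y = 2x), so the cost is 0.0
print(cost(0.0, 0.0, xs, ys))  # nonzero cost for a poor fit
```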

Gradient descent

Algorithm

For a univariate linear model:

$$\textrm{repeat until convergence } \{ \\ \quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \quad \textrm{(for } j = 0 \textrm{ and } j = 1\textrm{)} \\ \}$$

  • := means assignment (similar to setting a variable in Python or C)
  • α = learning rate - a weight factor which determines the step distance

Note that the step distance will decrease as the function approaches a minimum, since the gradient decreases in magnitude

Also note that the step distances in $\theta_0$ and $\theta_1$ should be calculated independently before either parameter is updated.

Caveat about this method: gradient descent is susceptible to local optima

Batch gradient descent uses all of the training data in the calculation of J at each step. Other forms of gradient descent can be used when $m$ is particularly large. For example, stochastic (“online”) gradient descent samples a subset of the training data at each step. Pure SGD uses only a single (randomly chosen) training sample. Often it is better to use a small sample - this is mini-batch gradient descent (often considered a type of SGD). Andrew Ng says he often uses a sample size of 10.

SGD plus backprop is the de facto standard for training artificial neural networks.
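The mini-batch variant described above can be sketched in plain Python. The synthetic data, learning rate, and iteration count are illustrative choices, not from the course; the batch size of 10 is the one mentioned above.

```python
import random

# Synthetic, noiseless data from y = 3x + 1 (illustrative only).
random.seed(0)
xs = [random.uniform(0.0, 10.0) for _ in range(1000)]
ys = [3.0 * x + 1.0 for x in xs]

theta0, theta1, alpha, batch = 0.0, 0.0, 0.01, 10
for _ in range(5000):
    # Sample a small random batch of training examples at each step.
    idx = random.sample(range(len(xs)), batch)
    grad0 = sum(theta0 + theta1 * xs[i] - ys[i] for i in idx) / batch
    grad1 = sum((theta0 + theta1 * xs[i] - ys[i]) * xs[i] for i in idx) / batch
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)  # should approach theta0 = 1, theta1 = 3
```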

Analytic partial derivatives of J for a univariate linear model

$$\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1) = \frac{\partial}{\partial\theta_j} \frac{1}{2m} \sum_{i=1}^m \left ( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right )^2 \\ j = 0: \ \frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1) = \frac{1}{m} \sum_{i=1}^m \left ( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right ) \\ j = 1: \ \frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1) = \frac{1}{m} \sum_{i=1}^m \left ( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right )x^{(i)}$$

(These can be derived using multivariate calculus.) Of course, the partial derivatives of more complex cost functions may be more difficult to derive.
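Putting the two partial derivatives together gives batch gradient descent for the univariate model. The following is a sketch in plain Python; the data and learning rate are illustrative choices.

```python
def gradient_descent(xs, ys, alpha=0.1, n_iters=1000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        # Compute both gradients before updating either parameter
        # (the simultaneous update noted above).
        grad0 = sum(theta0 + theta1 * x - y for x, y in zip(xs, ys)) / m
        grad1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys)) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

t0, t1 = gradient_descent([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(t0, t1)  # should approach theta0 = 0, theta1 = 2
```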

Multivariate linear regression

Multiple features x

Model

$h_\theta(x) = \theta_0x_0 + \theta_1x_1 + \ldots + \theta_nx_n$

Using vector notation:

$$ h_\theta(x) = \widehat \theta ^\intercal \widehat{x} \\ h_\theta(x) = \left [ \begin{array}{cccc} \theta_0 & \theta_1 & \ldots & \theta_n \end{array} \right ] \left [ \begin{array}{c} x_0 \\ x_1 \\ \vdots \\ x_n \\ \end{array} \right ] $$

  • $h_\theta(x)$ = model output (scalar)
  • n = number of features
  • $\widehat x$ = feature vector of length n+1 (includes the $x_0$ term, which is set to 1)
  • $\widehat \theta$ = parameter vector of length n+1
  • $x^{(i)}_j$ = jth feature of the ith training example

Note: in some cases it may be useful to combine features, e.g. combine house frontage and depth into land area.

Gradient descent with sum-of-squares cost function

Analytic partial derivatives of J

$$ \frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m} \sum_{i=1}^m \left ( h_\theta(x^{(i)}) - y^{(i)} \right )x_j^{(i)} $$

(Note: $x^{(i)}_0 = 1$)  
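The multivariate hypothesis and gradient above can be sketched in plain Python. Each row of X already includes the $x_0 = 1$ intercept term; the data and parameters below are made up for illustration.

```python
def hypothesis(theta, x):
    """h_theta(x) = theta^T x (dot product of parameter and feature vectors)."""
    return sum(t * xi for t, xi in zip(theta, x))

def gradient(theta, X, ys):
    """Partial derivative of J(theta) with respect to each theta_j."""
    m = len(X)
    errors = [hypothesis(theta, x) - y for x, y in zip(X, ys)]
    return [sum(e * x[j] for e, x in zip(errors, X)) / m
            for j in range(len(theta))]

X = [[1.0, 2.0, 1.0],   # m = 2 examples, n + 1 = 3 columns (x0 = 1 included)
     [1.0, 3.0, 2.0]]
theta = [0.5, 1.0, -1.0]
print(gradient(theta, X, [1.0, 2.0]))
```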

Polynomial regression

e.g. cubic:

$h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3$

Could also use square root, inverse, or other functions.

Note: feature scaling becomes even more important here.
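One way to see polynomial regression is that the cubic in x becomes a linear model over the feature vector [1, x, x², x³]. A sketch, with illustrative parameter values:

```python
def cubic_features(x):
    """Expand a single input x into the cubic feature vector [1, x, x^2, x^3]."""
    return [1.0, x, x ** 2, x ** 3]

def predict(theta, feats):
    """Linear model over the expanded features: theta^T feats."""
    return sum(t * f for t, f in zip(theta, feats))

theta = [1.0, 0.0, 0.5, 0.1]     # illustrative parameters
print(predict(theta, cubic_features(2.0)))  # 1 + 0 + 0.5*4 + 0.1*8 = 3.8
```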

Choice of regression input parameters

Feature scaling

Generally it is best to normalize features so they are centered on 0 and range from -1 to 1 (approximately). This helps GD to work efficiently. Typically:

$$ x_i := \frac{x_i - \mu_i}{\textrm{range}_i} \textrm{ or } \frac{x_i - \mu_i}{\sigma_i} $$
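Both scalings above (divide by the range, or by the standard deviation) can be sketched in plain Python; the house-size values are made up.

```python
import statistics

def scale_by_range(xs):
    """Mean normalization, dividing by the feature's range."""
    mu = statistics.mean(xs)
    return [(x - mu) / (max(xs) - min(xs)) for x in xs]

def scale_by_std(xs):
    """Mean normalization, dividing by the feature's standard deviation."""
    mu = statistics.mean(xs)
    sigma = statistics.stdev(xs)
    return [(x - mu) / sigma for x in xs]

sizes = [1000.0, 1500.0, 2000.0]
print(scale_by_range(sizes))  # [-0.5, 0.0, 0.5]
print(scale_by_std(sizes))    # [-1.0, 0.0, 1.0]
```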

Learning rate

Plot J(θ) vs. the number of iterations for multiple choices of α; J(θ) should decrease monotonically. It is often good to separate the α choices by factors of ~3, e.g. 0.001, 0.003, 0.01, 0.03, 0.1, 0.3

If J(θ) is decreasing very slowly, increase α; if it has become unstable, decrease α.

[Note: can use “learning rate decay”, where the learning rate decreases if there has been little change between epochs]

Number of iterations

This can be very hard to estimate a priori, but a guess can often be made from the above type of plot. It may be 30, 3000, or even 3e6.

Convergence

Automatic convergence tests are possible, but it can be difficult to choose an appropriate threshold. It is often easier to just check visually whether J(θ) has flattened out (an advantage here is that a plot shows something of the GD context, rather than just saying whether a threshold has been reached).

Normal equation

Solves $\arg \min_\theta J(\theta)$ analytically:

$$ \theta = (X^\intercal X)^{-1} X^\intercal y $$

A possible, informal derivation:

$y = X\theta$

$X^{\intercal}y = X^{\intercal}X\theta$

$(X^{\intercal}X)^{-1}X^{\intercal}y = \theta$

Note: no need for feature scaling.

Gradient descent vs normal equation

The normal equation may be better/quicker if n is ~1000 or less; 10,000 may be too much. This is due to the required pseudoinverse calculation, which typically scales as $O(n^3)$.

Gradient descent is better suited for big data, e.g. if n is massive (millions).

In some cases $X^\intercal X$ may be degenerate (non-invertible). This should be quite rare, and may be caused by:
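As a sketch: for the univariate case, $X^\intercal X$ is 2×2, so the normal equation can be solved in plain Python without a linear algebra library (the data values below are illustrative).

```python
def normal_equation_2d(xs, ys):
    """Solve theta = (X^T X)^{-1} X^T y with X = [[1, x_i]] for each example."""
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[m, sx], [sx, sxx]]; invert the 2x2 matrix directly.
    det = m * sxx - sx * sx
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (m * sxy - sx * sy) / det
    return theta0, theta1

print(normal_equation_2d([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # (0.0, 2.0)
```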

  • redundant features
  • too many features, e.g. n is much larger than m (deleting features may help, as may using regularization)
Advanced optimization algorithms

May be used instead of gradient descent:

  • Conjugate gradient
  • BFGS (Broyden-Fletcher-Goldfarb-Shanno)
  • L-BFGS (limited-memory BFGS)

Advantages:

  • No need to manually pick a learning rate (α)
    • These methods use a line search algorithm (cf. grid search) to pick an appropriate α value
  • Often faster than GD

Disadvantages:

  • Debugging may be more difficult
  • Different libraries will have different implementations and performance

The Octave/Matlab function to use is fminunc.

Side note

Matrix multiplication can be used to compute the outputs of a set of proposed model parameters in one step:

$H = X\theta$

  • X is the m × (n+1) matrix of training input data, aka the "design matrix"
  • θ is the (n+1) × a matrix of proposed model parameters, where a is the number of proposed models
  • H is the m × a matrix of model predictions, where each column comprises the predicted outputs of a single proposed model
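A plain-Python sketch of this batched evaluation; the design matrix and the two candidate parameter columns are made up for illustration.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Design matrix: m = 3 examples, n + 1 = 2 columns (x0 = 1 included).
X = [[1.0, 1.0],
     [1.0, 2.0],
     [1.0, 3.0]]

# a = 2 candidate models, one per column: theta = (0, 2) and theta = (1, 1).
Theta = [[0.0, 1.0],
         [2.0, 1.0]]

H = matmul(X, Theta)  # m x a matrix; column j holds model j's predictions
print(H)  # [[2.0, 2.0], [4.0, 3.0], [6.0, 4.0]]
```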