@danielparton
Created June 21, 2017 18:01

Linear Regression - lecture series II-IV

Model representation

Notation (used throughout course)

  • m = number of training examples
  • x = “input” variable / feature
  • y = “output” variable / “target” variable
  • (x, y) = one training example
  • $(x^{(i)}, y^{(i)})$ = the ith training example
  • h = hypothesis*, aka model
  • θ = model parameters

  • *Andrew Ng suggests the word “hypothesis” may not be completely appropriate, but it is widely used in the ML community
                    Training set
                         ||
                         \/
                  Learning algorithm
                         ||
                         \/
Size of house (x) => Hypothesis (h) => Estimated price (y)

In other words, h maps from x to y  

Univariate linear regression

Model

  $h_\theta(x) = \theta_0 + \theta_1x$  

Cost function

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m \left ( h_\theta (x^{(i)}) - y^{(i)}\right )^2$

$J(\theta_0,\theta_1)$ is the cost function

The aim is to vary $\theta_0$ and $\theta_1$ so as to minimize the cost function, i.e.:

$\min_{\theta_0, \theta_1} J(\theta_0, \theta_1) = \min_{\theta_0, \theta_1} \frac{1}{2m} \sum_{i=1}^m \left ( h_\theta(x^{(i)}) - y^{(i)} \right )^2$

This particular cost function is known as the squared error function - probably the most commonly used for linear regression problems. It is a convex function, so it has a single global minimum.
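As a minimal sketch, the squared-error cost function can be written in plain Python (the data values below are made up for illustration):

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(cost(0.0, 2.0, xs, ys))  # perfect fit (y = 2x), so the cost is 0.0
print(cost(0.0, 0.0, xs, ys))  # nonzero cost for a poor fit
```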

Gradient descent

Algorithm

For a univariate linear model:

$$\textrm{repeat until convergence } \{ \\ \quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \quad \textrm{(for } j = 0 \textrm{ and } j = 1\textrm{)} \\ \}$$

  • := means assignment (similar to setting a variable in Python or C)
  • α = learning rate - a weight factor which determines the step distance

Note that the step distance will decrease as the function approaches a minimum, since the gradient decreases in magnitude

Also note that the step distances in $\theta_0$ and $\theta_1$ should be calculated independently before either parameter is updated.

Caveat about this method: gradient descent is susceptible to local optima

Batch gradient descent uses all of the training data in the calculation of J at each step. Other forms of gradient descent can be used when $m$ is particularly large. For example, stochastic (“online”) gradient descent samples a subset of the training data at each step. Pure SGD uses only a single (randomly chosen) training sample. Often it is better to use a small sample - this is mini-batch gradient descent (often considered a type of SGD). Andrew Ng says he often uses a sample size of 10.

SGD plus backprop is the de facto standard for training artificial neural networks.
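The mini-batch variant described above can be sketched in plain Python. The synthetic data, learning rate, and iteration count are illustrative choices, not from the course; the batch size of 10 is the one mentioned above.

```python
import random

# Synthetic, noiseless data from y = 3x + 1 (illustrative only).
random.seed(0)
xs = [random.uniform(0.0, 10.0) for _ in range(1000)]
ys = [3.0 * x + 1.0 for x in xs]

theta0, theta1, alpha, batch = 0.0, 0.0, 0.01, 10
for _ in range(5000):
    # Sample a small random batch of training examples at each step.
    idx = random.sample(range(len(xs)), batch)
    grad0 = sum(theta0 + theta1 * xs[i] - ys[i] for i in idx) / batch
    grad1 = sum((theta0 + theta1 * xs[i] - ys[i]) * xs[i] for i in idx) / batch
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)  # should approach theta0 = 1, theta1 = 3
```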

Analytic partial derivatives of J for a univariate linear model

$$\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1) = \frac{\partial}{\partial\theta_j} \frac{1}{2m} \sum_{i=1}^m \left ( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right )^2 \\ j = 0: \ \frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1) = \frac{1}{m} \sum_{i=1}^m \left ( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right ) \\ j = 1: \ \frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1) = \frac{1}{m} \sum_{i=1}^m \left ( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right )x^{(i)}$$

(These can be derived using multivariate calculus.) Of course, the partial derivatives of more complex cost functions may be more difficult to derive.
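Putting the two partial derivatives together gives batch gradient descent for the univariate model. The following is a sketch in plain Python; the data and learning rate are illustrative choices.

```python
def gradient_descent(xs, ys, alpha=0.1, n_iters=1000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        # Compute both gradients before updating either parameter
        # (the simultaneous update noted above).
        grad0 = sum(theta0 + theta1 * x - y for x, y in zip(xs, ys)) / m
        grad1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys)) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

t0, t1 = gradient_descent([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(t0, t1)  # should approach theta0 = 0, theta1 = 2
```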

Multivariate linear regression

Multiple features x

Model

$h_\theta(x) = \theta_0x_0 + \theta_1x_1 + \ldots + \theta_nx_n$

Using vector notation:

$$ h_\theta(x) = \widehat \theta ^\intercal \widehat{x} \\ h_\theta(x) = \left [ \begin{array}{cccc} \theta_0 & \theta_1 & \ldots & \theta_n \end{array} \right ] \left [ \begin{array}{c} x_0 \\ x_1 \\ \vdots \\ x_n \\ \end{array} \right ] $$

  • $h_\theta(x)$ = model output (scalar)
  • n = number of features
  • $\widehat x$ = feature vector of length n+1 (includes the $x_0$ term, which is set to 1)
  • $\widehat \theta$ = parameter vector of length n+1
  • $x^{(i)}_j$ = jth feature of the ith training example

Note: in some cases it may be useful to combine features, e.g. combine house frontage and depth into land area.

Gradient descent with sum-of-squares cost function

Analytic partial derivatives of J

$$ \frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m} \sum_{i=1}^m \left ( h_\theta(x^{(i)}) - y^{(i)} \right )x_j^{(i)} $$

(Note: $x^{(i)}_0 = 1$)  
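The multivariate hypothesis and gradient above can be sketched in plain Python. Each row of X already includes the $x_0 = 1$ intercept term; the data and parameters below are made up for illustration.

```python
def hypothesis(theta, x):
    """h_theta(x) = theta^T x (dot product of parameter and feature vectors)."""
    return sum(t * xi for t, xi in zip(theta, x))

def gradient(theta, X, ys):
    """Partial derivative of J(theta) with respect to each theta_j."""
    m = len(X)
    errors = [hypothesis(theta, x) - y for x, y in zip(X, ys)]
    return [sum(e * x[j] for e, x in zip(errors, X)) / m
            for j in range(len(theta))]

X = [[1.0, 2.0, 1.0],   # m = 2 examples, n + 1 = 3 columns (x0 = 1 included)
     [1.0, 3.0, 2.0]]
theta = [0.5, 1.0, -1.0]
print(gradient(theta, X, [1.0, 2.0]))
```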

Polynomial regression

e.g. cubic:

$h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3$

Could also use square root, inverse, or other functions.

Note: feature scaling becomes even more important here.
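One way to see polynomial regression is that the cubic in x becomes a linear model over the feature vector [1, x, x², x³]. A sketch, with illustrative parameter values:

```python
def cubic_features(x):
    """Expand a single input x into the cubic feature vector [1, x, x^2, x^3]."""
    return [1.0, x, x ** 2, x ** 3]

def predict(theta, feats):
    """Linear model over the expanded features: theta^T feats."""
    return sum(t * f for t, f in zip(theta, feats))

theta = [1.0, 0.0, 0.5, 0.1]     # illustrative parameters
print(predict(theta, cubic_features(2.0)))  # 1 + 0 + 0.5*4 + 0.1*8 = 3.8
```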

Choice of regression input parameters

Feature scaling

Generally it is best to normalize features so they are centered on 0 and range from -1 to 1 (approximately). This helps GD to work efficiently. Typically:

$$ x_i := \frac{x_i - \mu_i}{\textrm{range}_i} \textrm{ or } \frac{x_i - \mu_i}{\sigma_i} $$
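Both scalings above (divide by the range, or by the standard deviation) can be sketched in plain Python; the house-size values are made up.

```python
import statistics

def scale_by_range(xs):
    """Mean normalization, dividing by the feature's range."""
    mu = statistics.mean(xs)
    return [(x - mu) / (max(xs) - min(xs)) for x in xs]

def scale_by_std(xs):
    """Mean normalization, dividing by the feature's standard deviation."""
    mu = statistics.mean(xs)
    sigma = statistics.stdev(xs)
    return [(x - mu) / sigma for x in xs]

sizes = [1000.0, 1500.0, 2000.0]
print(scale_by_range(sizes))  # [-0.5, 0.0, 0.5]
print(scale_by_std(sizes))    # [-1.0, 0.0, 1.0]
```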

Learning rate

Plot J(θ) vs. the number of iterations for multiple choices of α; J(θ) should decrease monotonically. It is often good to separate the α choices by factors of ~3, e.g. 0.001, 0.003, 0.01, 0.03, 0.1, 0.3

If J(θ) is decreasing very slowly, increase α; if it has become unstable, decrease α.

[Note: can use “learning rate decay”, where the learning rate decreases if there has been little change between epochs]

Number of iterations

This can be very hard to estimate a priori, but a guess can often be made from the above type of plot. It may be 30, 3000, or even 3e6.

Convergence

Automatic convergence tests are possible, but it can be difficult to choose an appropriate threshold. It is often easier to just check visually whether J(θ) has flattened out (an advantage here is that a plot shows something of the GD context, rather than just saying whether a threshold has been reached).

Normal equation

Solves $\arg \min_\theta J(\theta)$ analytically:

$$ \theta = (X^\intercal X)^{-1} X^\intercal y $$

A possible, informal derivation:

$y = X\theta$

$X^{\intercal}y = X^{\intercal}X\theta$

$(X^{\intercal}X)^{-1}X^{\intercal}y = \theta$

Note: no need for feature scaling.

Gradient descent vs normal equation

The normal equation may be better/quicker if n is ~1000 or less; 10,000 may be too much. This is due to the required pseudoinverse calculation, which typically scales as $O(n^3)$.

Gradient descent is better suited for big data, e.g. if n is massive (millions).

In some cases $X^\intercal X$ may be degenerate (non-invertible). This should be quite rare, and may be caused by:
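As a sketch: for the univariate case, $X^\intercal X$ is 2×2, so the normal equation can be solved in plain Python without a linear algebra library (the data values below are illustrative).

```python
def normal_equation_2d(xs, ys):
    """Solve theta = (X^T X)^{-1} X^T y with X = [[1, x_i]] for each example."""
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[m, sx], [sx, sxx]]; invert the 2x2 matrix directly.
    det = m * sxx - sx * sx
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (m * sxy - sx * sy) / det
    return theta0, theta1

print(normal_equation_2d([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # (0.0, 2.0)
```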

  • redundant features
  • too many features, e.g. n is much larger than m (deleting features may help, as may using regularization)
Advanced optimization algorithms

May be used instead of gradient descent:

  • Conjugate gradient
  • BFGS (Broyden-Fletcher-Goldfarb-Shanno)
  • L-BFGS (limited-memory BFGS)

Advantages:

  • No need to manually pick a learning rate (α)
    • These methods use a line search algorithm (cf. grid search) to pick an appropriate α value
  • Often faster than GD

Disadvantages:

  • Debugging may be more difficult
  • Different libraries will have different implementations and performance

The Octave/Matlab function to use is fminunc.

Side note

Matrix multiplication can be used to compute the outputs of a set of proposed model parameters in one step:

$H = X\theta$

  • X is the m × (n+1) matrix of training input data, aka the "design matrix"
  • θ is the (n+1) × a matrix of proposed model parameters, where a is the number of proposed models
  • H is the m × a matrix of model predictions, where each column comprises the predicted outputs of a single proposed model
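A plain-Python sketch of this batched evaluation; the design matrix and the two candidate parameter columns are made up for illustration.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Design matrix: m = 3 examples, n + 1 = 2 columns (x0 = 1 included).
X = [[1.0, 1.0],
     [1.0, 2.0],
     [1.0, 3.0]]

# a = 2 candidate models, one per column: theta = (0, 2) and theta = (1, 1).
Theta = [[0.0, 1.0],
         [2.0, 1.0]]

H = matmul(X, Theta)  # m x a matrix; column j holds model j's predictions
print(H)  # [[2.0, 2.0], [4.0, 3.0], [6.0, 4.0]]
```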