m = number of training samples
x = “input” variable / feature
y = “output” variable / “target” variable
(x,y) = one training example
- Andrew Ng suggests the word “hypothesis” may not be completely appropriate, but it is widely used in the ML community
Training set
    ↓
Learning algorithm
    ↓
Size of house (x) => Hypothesis (h) => Estimated price (y)
In other words, h maps from x to y
The aim is to vary the parameters θ0, θ1 so that hθ(x) is close to y for the training examples, i.e. to minimize the cost function J(θ)
For a univariate linear model: hθ(x) = θ0 + θ1x, with cost function J(θ0, θ1) = (1/(2m)) Σ (hθ(x(i)) − y(i))²
Gradient descent repeats the update θj := θj − α · ∂J/∂θj for j = 0, 1 (both updated simultaneously) until convergence
:= means assign (similar to setting a variable in Python or C)
α = learning rate - a scaling factor which determines the step distance
Note that the step distance will decrease as the function approaches a minimum, since the gradient decreases in magnitude
Also note that the step distances in each parameter direction can differ, since each update is scaled by its own partial derivative
Caveat about this method: gradient descent is susceptible to local optima (though the squared-error cost for linear regression is convex, so it has a single global minimum)
Batch gradient descent - when all training data is used in the calculation of J at each step. Other forms of gradient descent can be used when the training set is large, e.g. stochastic gradient descent (SGD), which updates using one example at a time, or mini-batch gradient descent.
SGD plus backprop is the de facto standard for training artificial neural networks.
For the linear regression cost function:
∂J/∂θ0 = (1/m) Σ (hθ(x(i)) − y(i))
∂J/∂θ1 = (1/m) Σ (hθ(x(i)) − y(i)) · x(i)
(derive using multivariate calculus) Of course, the partial derivatives of more complex cost functions may be more difficult to derive.
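As a concrete illustration, here is a minimal batch gradient descent sketch in Python/numpy for the univariate case; the toy data, learning rate and iteration count are made up for the example:

```python
import numpy as np

# Toy training data (made-up values): x = house size, y = price
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 3.5, 4.5])
m = len(y)

theta0, theta1 = 0.0, 0.0   # initial parameters
alpha = 0.1                 # learning rate
num_iters = 1000

for _ in range(num_iters):
    h = theta0 + theta1 * x                  # hypothesis h_theta(x) for all examples
    grad0 = (1.0 / m) * np.sum(h - y)        # dJ/dtheta0
    grad1 = (1.0 / m) * np.sum((h - y) * x)  # dJ/dtheta1
    # Simultaneous update of both parameters
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)
```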
Multiple features x1, x2, …, xn
hθ(x) = θ0x0 + θ1x1 + θ2x2 + … + θnxn
Using vector notation: hθ(x) = θᵀx
(Note: define x0 = 1, so that θ and x are both (n+1)-dimensional vectors)
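A two-line numpy illustration of the vectorized form (the numbers are made up):

```python
import numpy as np

theta = np.array([80.0, 0.1, 2.5])   # [theta0, theta1, theta2]
x = np.array([1.0, 1500.0, 3.0])     # x0 = 1, followed by the actual feature values
h = theta @ x                        # h_theta(x) = theta^T x
```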
e.g. cubic: hθ(x) = θ0 + θ1x + θ2x² + θ3x³ - polynomial terms can be treated as extra features (x1 = x, x2 = x², x3 = x³), which makes feature scaling especially important
Generally it is best to normalize features so they are centered on 0 and range from -1 to 1 (approximately). This helps GD to work efficiently. Typically: xj := (xj − μj) / sj, where μj is the mean of feature j over the training set and sj is the range (max − min) or the standard deviation.
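A sketch of this in numpy, assuming each row of X is a training example and each column a feature (here using the standard deviation for sj; the range would also work):

```python
import numpy as np

def normalize_features(X):
    """Mean-normalize each feature: roughly centred on 0 with comparable spread."""
    mu = X.mean(axis=0)        # per-feature mean
    sigma = X.std(axis=0)      # per-feature standard deviation (could use max - min instead)
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma   # keep mu and sigma to apply the same scaling to new inputs
```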
Plot J(θ) vs no. of iterations, using multiple choices of α. J(θ) should decrease monotonically. Often good to separate α choices by factors of ~3, e.g. 0.001, 0.003, 0.01, 0.03, 0.1, 0.3
If J(θ) is decreasing very slowly, then increase α. Or decrease if it has become unstable.
[Note: can use “learning rate decay”, where the learning rate decreases if there has been little change between epochs]
The number of iterations needed can be very hard to estimate a priori, but a guess can often be made based on the above type of plot. May be 30, 3,000, or 3e6.
Automatic convergence tests are possible, but it can be difficult to choose an appropriate threshold. Often easier to just check visually whether J(θ) has become flat (an advantage here is that a plot shows something of the GD context, rather than just saying whether a threshold has been reached).
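A small matplotlib sketch of that visual check (it assumes a cost_history list of J(θ) values was recorded during the gradient descent loop):

```python
import matplotlib.pyplot as plt

def plot_convergence(cost_history):
    """Plot J(theta) against iteration number to judge convergence by eye."""
    plt.plot(range(len(cost_history)), cost_history)
    plt.xlabel("Iteration")
    plt.ylabel("J(theta)")
    plt.show()
```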
Normal equation
Solves for θ analytically in a single step: θ = (XᵀX)⁻¹Xᵀy
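A numpy sketch of the normal equation (X is assumed to already contain the x0 = 1 column; the pseudoinverse is used so the degenerate case mentioned below is still handled):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form solution: theta = (X^T X)^-1 X^T y, via the pseudoinverse."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```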
Note: no need for feature scaling

Gradient descent vs normal equation
Normal equation may be better/quicker if n is 1,000 or less; 10,000 may be too much though. This is due to the required pseudoinverse calculation (which typically scales with O(n³)).
Gradient descent is better suited for big data, e.g. if n is massive (millions).
In some cases, (XᵀX)⁻¹ may be degenerate (non-invertible). Should be quite rare. May be caused by:
- redundant features
- too many features, e.g. n is much larger than m (deleting features may help, as may using regularization)

Advanced optimization algorithms
May be used instead of gradient descent:
- Conjugate gradient
- BFGS (Broyden-Fletcher-Goldfarb-Shanno)
- L-BFGS (limited-memory BFGS)

Advantages:
- No need to manually pick a learning rate (α)
- Use a line search algorithm (cf. grid search) to pick an appropriate α value
- Often faster than GD

Disadvantages:
- Debugging may be more difficult
- Different libraries will have different implementations and performance

Octave/Matlab function to use is fminunc

Side note
Use of matrix multiplication to define the outputs of a set of proposed model parameters:
H = Xθ
X is the m × (n+1) matrix of training input data, aka the "design matrix"
θ is the (n+1) × a matrix of proposed model parameters, where a is the number of proposed models
H is the m × a matrix of model predictions, where each column vector comprises the predicted outputs of a single proposed model
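A small numpy illustration of this side note, with made-up numbers: three training examples, one feature (plus x0 = 1), and a = 2 proposed models evaluated in a single matrix multiplication.

```python
import numpy as np

# Design matrix X: m = 3 examples, columns are [x0 = 1, house size]
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0],
              [1.0,  852.0]])

# Theta: (n+1) x a matrix of proposed parameters, one candidate model per column
Theta = np.array([[40.0, 200.0],
                  [ 0.25,  0.1]])

H = X @ Theta   # m x a matrix; column j holds the predictions of proposed model j
print(H)
```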