The gradient descent (GD) method iterates

    xnew = xold - rate(t) * grad(xold)

GD is a workhorse of machine learning because it is so simple, uses gradients only (not function values), and scales to very large x.
rate(t) is a step size or "learning rate" (a.k.a. η, Greek eta). It can be constant, or decreasing, taking smaller steps as the run goes on.
One can tune a rate function to work sometimes, but everybody knows:
- rate too small: many tiny steps, long run times; try doubling it
- rate too big: x oscillates; try halving it.
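For concreteness, a minimal plain-GD loop with a rate0 / t^rate_power schedule; a sketch only, with illustrative names (gd_plain is not in the gist):

    import numpy as np

    def gd_plain( gradfunc, x0, rate0=.1, rate_power=.5, nsteps=100 ):
        # xnew = xold - rate(t) * grad(xold),  rate(t) = rate0 / t^rate_power
        x = np.asarray( x0, dtype=float ).copy()
        for t in range( 1, nsteps + 1 ):
            rate = rate0 / t ** rate_power  # constant if rate_power = 0
            x -= rate * gradfunc( x )
        return x

    print( gd_plain( lambda x: x, [1., -2.] ))  # minimize |x|^2 / 2, grad = x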
gd_cross0.py below improves on standard GD by:
- watching where gradient components cross 0: g < 0 < gprev or g > 0 > gprev
- drawing a line through the 2 points (x, g) -- (xprev, gprev)
- setting xnew = the point where that line == 0, then re-evaluating gradfunc( xnew ) (sketched in code below).
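My reading of those three steps, per gradient component; a rough sketch, not the gist's own gd_cross0.py:

    import numpy as np

    def cross0_step( x, g, xprev, gprev ):
        """ where a gradient component changed sign, jump to the zero of the
            line through (xprev, gprev) -- (x, g); leave the rest of x alone. """
        x, g, xprev, gprev = [np.asarray( a, dtype=float )
                              for a in (x, g, xprev, gprev)]
        crossed = g * gprev < 0  # g < 0 < gprev  or  g > 0 > gprev
        xnew = x.copy()
        xnew[crossed] = ( x[crossed] * gprev[crossed] - xprev[crossed] * g[crossed] ) \
                        / ( gprev[crossed] - g[crossed] )
        return xnew  # then re-evaluate gradfunc( xnew )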
The method of false position is very similar to the above line-through-2-points a.k.a. secant method, but theoretically better: once a zero is bracketed in an interval a < x < b, it stays in there. Does that matter in practice, with noisy gradients? Todo: compare the two; test cases welcome.
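To make the difference concrete, a 1d sketch of the two update rules (hypothetical helpers, not from this repo): both take the zero of the line through 2 points, but false position then keeps whichever old endpoint still brackets the zero.

    def secant_step( x, fx, xprev, fprev ):
        # zero of the line through (xprev, fprev) -- (x, fx); can leave any bracket
        return ( x * fprev - xprev * fx ) / ( fprev - fx )

    def false_position_step( f, a, fa, b, fb ):
        # assumes fa * fb < 0; the new bracket stays inside [a, b]
        x = ( a * fb - b * fa ) / ( fb - fa )
        fx = f( x )
        if fa * fx < 0:
            return a, fa, x, fx  # zero in [a, x]
        return x, fx, b, fb      # zero in [x, b]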
gd_cross0.py
    a bare-bones implementation of GD with 0-crossing and Nesterov momentum
    (Nesterov sketched after this list)
    def gd_cross0( gradfunc, x0, params, ratefunc, callback=None, verbose=1 )
lingrad.py
    linear gradient Ax - b, A = diag( 1/cond .. 1 ) -- lots of theory
    *= (1 + ripple * sin( pi x ))
params.py
    class Params
ratefunc.py
    rate0 / t^rate_power
run-gd
    shell script: for dim rate ... run test-gd.py
test-gd.py
    run this with rate0= momentum= ... in ipython
zutil.py
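Since gd_cross0 advertises Nesterov momentum, here is one common form of that update, for orientation only; the gist's own code may differ:

    def nesterov_step( gradfunc, x, v, rate, momentum=.9 ):
        # v: running velocity; take the gradient at the look-ahead point
        g = gradfunc( x + momentum * v )
        v = momentum * v - rate * g
        return x + v, v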
(If you download this lot, mv 1-gd_cross0.py gd_cross0.py ; gist.github shows files in alphabetical order, so 0-* 1-* come first.)
See also:
Numerical Recipes p. 449 ff.
lightning AdaGrad -- fast cython SGD (not plain GD) for big, sparse classification
Open loop
cheers
-- denis
Last change: 20 June 2017