@denis-bz
Created November 18, 2016 14:23
downhill mnist 2016-11-18 Nov

mnist-downhill.py below runs the downhill gradient-descent optimizers on the MNIST handwritten digits (logistic regression, 50k training / 10k validation images).

Downhill in a nutshell:

  • gradient descent optimizers: SGD, RMSProp, Ada*, with momentum
  • a thin wrapper for theano (really thin: 1500 lines, half comments)
  • well-written, narrative docs
  • makes it easy to monitor variables, to see what an optimizer is doing (minimal sketch below).
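
For example, the build / iterate / monitor pattern used in mnist-downhill.py below, in outline. This is only a sketch: loss, W, b, x, y, W2loss and the two downhill.Dataset objects are defined in the full script further down; each iteration yields dicts of train / validation monitors ('loss' plus the expressions you register).

    opt = downhill.build( 'rmsprop', loss=loss, params=[W, b], inputs=[x, y],
        monitors=[('W2loss', W2loss)] )     # extra expressions to watch each epoch
    for tm, vm in opt.iterate( train_dataset, valid_dataset,
            learning_rate=.002, momentum=.9 ):
        print tm['loss'], vm['loss'], tm['W2loss']   # train / valid monitor dicts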

Log summaries

Downhill's rmsprop and sgd both reach error rates ~ 7 %:

-- rmsprop/rmsprop-rate.002.log
error rate: 7.29 %  valid-loss 2.3     train-loss 0.286   epoch 3  stepq [-71 -31   0  17  48] 
error rate: 6.93 %  valid-loss 0.259   train-loss 0.265   epoch 6  stepq [-46 -21   0  12  33] 
...
error rate: 6.75 %  valid-loss 0.249   train-loss 0.241   epoch 21  stepq [-27 -14  -1   7  17] 


-- sgd/sgd-rate1.log
error rate: 8.40 %  valid-loss 2.3     train-loss 0.334   epoch 3  stepq [-31 -11   0  13  29] 
error rate: 7.99 %  valid-loss 0.297   train-loss 0.302   epoch 6  stepq [-16  -7   0   7  17] 
...
error rate: 6.94 %  valid-loss 0.254   train-loss 0.254   epoch 42  stepq [-6 -2  0  2  6] 

RMSProp is faster and takes bigger steps, ending at different W and b. (RMSProp implementations differ; downhill follows Graves 2013, "Generating Sequences With Recurrent Neural Networks".)
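
For reference, a rough numpy sketch of that Graves-style update, outside downhill: running means of the gradient and of its square, step scaled by the estimated gradient standard deviation, plus momentum. The rho, eps and learning-rate values here are illustrative, not downhill's defaults.

    import numpy as np

    def rmsprop_step( w, grad, state, lr=.002, mom=.9, rho=.95, eps=1e-4 ):
        """ one Graves-2013-style RMSProp step; state = (g1, g2, step) """
        g1, g2, step = state
        g1 = rho * g1 + (1 - rho) * grad          # running mean of the gradient
        g2 = rho * g2 + (1 - rho) * grad ** 2     # running mean of its square
        step = mom * step - lr * grad / np.sqrt( g2 - g1 ** 2 + eps )
        return w + step, (g1, g2, step)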

Obviously, tuning hyperparameters will change these results. And one should, of course, cross-validate.

Fwiw, scikit-learn KNeighborsClassifier gets to 2.5 % errors in 11 seconds, with no tuning other than n_neighbors=3 .
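
That baseline is roughly the sketch below, on the same load_mnist and 50k / 10k split as the script; runtime of course depends on the machine.

    from sklearn.neighbors import KNeighborsClassifier
    from skdata_mnist import load_mnist

    (xtrain, ytrain), (xvalid, yvalid) = load_mnist( normalize=1 )
    knn = KNeighborsClassifier( n_neighbors=3 ).fit( xtrain, ytrain )
    knnerr = (knn.predict( xvalid ) != yvalid).mean() * 100
    print "knn error rate: %.2f %%" % knnerr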

More methods than test cases ?

Say we have 5 different methods for handling a disease, and test them on N patients. The outcome is an N x 5 matrix of scores, all in the range say 80 % to 100 %. How big should N, the number of test cases, be to evaluate which method is best ? I'd say at least 5; 10 would be better -- cf. Wendel's theorem.
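
Wendel's theorem: the probability that N points drawn at random on a sphere in d dimensions all fit in some half-space is 2^(1-N) * sum_{k=0}^{d-1} C(N-1, k). Reading d as the number of methods and N as the number of test cases is only a rough analogy, but it gives a feel for the numbers:

    from scipy.special import comb

    def wendel( N, d ):
        """ P( N random points on a sphere in d dimensions lie in one half-space ) """
        return 2.0 ** (1 - N) * sum( comb( N - 1, k ) for k in range( d ))

    print wendel( 5, 5 )    # 1.0
    print wendel( 10, 5 )   # 0.5
    print wendel( 20, 5 )   # ~ .01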

SGD, RMSProp and Ada* are 5+ methods, but MNIST is only one (1) test case.

(Evaluating "which method is best" is a problem in at least 3 dimensions: not just "score", but cost, understandable code, understandable model, what's "in", my value added ...)

Comments are welcome, test cases most welcome.

cheers
-- denis

Last change: 19 Nov 2016

""" downhill: classify mnist digits with logistic regression
see http://downhill.readthedocs.io -- 10 pages, well written
optimizers: SGD RMSProp Ada* with momentum
"""
# see also: theanets.readthedocs.io
# https://gist.github.com/denis-bz Downhill-mnist.md
from __future__ import division
import sys
import climate
import logging
import numpy as np
import theano
import theano.tensor as TT
import downhill # $downhill/base.py $downhill/adaptive.py
from skdata_mnist import load_mnist
from etc import etcutil as nu
__version__ = "2016-11-19 Nov denis-bz-py t-online de"
np.set_printoptions( threshold=20, edgeitems=10, linewidth=140,
    formatter = dict( float = lambda x: "%.2g" % x ))  # float arrays %.2g

def val( x ):
    return x.get_value() if hasattr( x, "get_value" ) \
        else x
print "\n", 80 * "-"
print " ".join(sys.argv)
#...............................................................................
algo = 'rmsprop'
rate = 1 if algo == 'sgd' else .002
momentum = .9
nesterov = False
w2loss = 0 # loss + w2loss (W^2).mean
# .001 no diff, .01 a bit worse -- loss and score both noisy
patience = 3 # min_improvement in patience * validate_every epochs, else quit
min_improvement = .01
validate_every = 3 # default 10
batch_size = 128
max_updates = None
seed = 1
tag = ".tmp" # save > tag.npz
save = 0
log = 1
# to change these params, run this.py a=1 b=None 'c = ...' in sh or ipython
for arg in sys.argv[1:]:
    exec( arg )
np.random.seed( seed )
Params = """
algo %s
rate %.3g
momentum %.3g
nesterov %d
w2loss %.3g
patience %d
min_improvement %.3g
validate_every %d
batch_size %d
seed %d
""" % ( algo, rate, momentum, nesterov, w2loss, patience, min_improvement,
validate_every, batch_size, seed )
print Params
print "versions: downhill %s theano %s" % (
downhill.__version__, theano.__version__ )
#...............................................................................
(xtrain, ytrain), (xvalid, yvalid) = load_mnist( normalize=1 )
train_dataset = downhill.Dataset( [xtrain, ytrain], name="train",
    batch_size=batch_size, rng=seed )
valid_dataset = downhill.Dataset( [xvalid, yvalid], name="valid",
    batch_size=batch_size, rng=seed )
# better val, val2 -- error bars
#...............................................................................
# theano dataflow graphs: inputs x y, state W b -> loss
x = TT.fmatrix('x') # data, 28x28 pixels 0 .. 1
y = TT.ivector('y') # labels, ints 0 .. 9
W = theano.shared( np.zeros( (784, 10), dtype=np.float32 ), name='W' )
b = theano.shared( np.zeros( 10, dtype=np.float32 ), name='b' )
xdotW = TT.dot( x, W ) + b
p_y_given_x = TT.nnet.softmax( xdotW )
probf = theano.function( [x], p_y_given_x )
# from http://deeplearning.net/tutorial/code/logistic_sgd.py
# def negative_log_likelihood(self, y):
W2loss = w2loss * (W * W).mean()
loss = -TT.mean( TT.log( p_y_given_x )[TT.arange(y.shape[0]), y]) \
    + W2loss
# grad = TT.grad( loss, [x] ) # not [x, y] ?
# gradf = theano.function( [x, y], grad, name='gradf' )
def predict( x ):
    return np.argmax( x.dot( W.get_value() ) + b.get_value(), axis=1 )

def error_percent( x, y ):
    ypred = predict( x )
    return (y != ypred).mean() * 100, ypred
if log:
    climate.log.TTY_Formatter._DATE_FORMAT = ' '  # dates kill diff
    climate.enable_default_logging()
#...............................................................................
opt = downhill.build(
    algo=algo,
    loss=loss,
    params=[W, b],
    inputs=[x, y],
    monitors=[('W2loss', W2loss)],
    monitor_gradients=True,
    )
# monitor, save these --
mon = nu.Bag(  # a dict with mon.key == mon["key"], mon.<tab> in ipython
    tloss = [],
    vloss = [],
    W = [],
    b = [],
    )
minerr = np.inf
wprev = 0
iter = 0
stepq = np.r_[1, 10, 50, 90, 99]
print "step quantiles:", stepq
print ""
#...............................................................................
for tm, vm in opt.iterate( train_dataset, valid_dataset,
        learning_rate=rate,
        momentum=momentum,
        nesterov=nesterov,
        patience=patience,
        min_improvement=min_improvement,
        validate_every=validate_every,
        max_gradient_elem=0,
        max_gradient_norm=0,  # not both
        max_updates=max_updates,
        ):
    iter += 1
    tm = nu.Bag(tm)
    vm = nu.Bag(vm)  # most recent, default validate_every=10
    tloss = tm.loss
    vloss = vm.loss
    Wval = W.get_value()
    wstep = Wval - wprev
    wprev = Wval
    for _k, _v in mon.items():
        _v.append( val( eval( _k )))  # mon.err.append( err ) ...
    if (iter % validate_every) == 0:
        verr, ypredict = error_percent( xvalid, yvalid )  # noisier than loss
        if verr < minerr:
            minerr = verr
            minepoch = iter - 1
            minpredict = ypredict
        p = nu.ints( np.percentile( wstep, q=stepq ) * 100 )
        print "error rate: %.2f %%  valid-loss %-6.3g  train-loss %-6.3g  epoch %d  stepq %s " % (
            verr, vm.loss, tm.loss, iter, p )

for _k, _v in mon.items():
    mon[_k] = np.array( _v )  # lists -> arrays

try:
    from etc import confus
    confus.pconfus( yvalid, minpredict, label=algo )  # print confusion matrix
except ImportError:
    pass
# rmsprop steps >> sgd steps ?
W = mon.W[minepoch] # 784, 10
print "best W ", nu.quantiles( W, q=[1, 10, 50, 90, 99] )
print "best b:", mon.b[minepoch]
if save:  # to plot
    out = tag + ".npz"
    print "\nsaving to", out
    mon.Params = Params
    mon.minerr = minerr
    mon.minepoch = minepoch  # 0-origin
    mon.minpredict = minpredict.astype(np.uint8)
    mon.yvalidate = yvalid
    # pdict.pdict( mon )
    nu.mkdirpart( out )
    np.savez( out, **mon )
#!/bin/bash
# run xx.py [args] > log
trap exit 2 3 15
# pmset -g live | egrep -w sleep # mac
every=3
nesterov=0 # no diff ?
max=None
patience=3
rate=.002
w2loss=0
[[ $1 ]] &&
export "$@" # -> py
#...............................................................................
for algo in rmsprop sgd # adadelta adagrad adam # $downhill/adaptive.py
do
case $algo in
sgd ) rate=1 ;;
* ) rate=$rate
esac
dir=$algo
mkdir -p $dir 2> /dev/null
tag=$dir/$algo-rate$rate
[[ $every != 3 ]] && tag=$tag-every$every
[[ $nesterov == 1 ]] && tag=$tag-nesterov$nesterov
[[ $w2loss != 0 ]] && tag=$tag-w2loss$w2loss
log=$tag.log
mv $log $log- 2> /dev/null
#...............................................................................
py -from -time mnist-downhill.py tag=\"$tag\" algo=\"$algo\" \
max_updates=$max \
nesterov=$nesterov \
patience=$patience \
rate=$rate \
save=1 \
validate_every=$every \
w2loss=$w2loss \
"$@" \
> $log 2>&1
log-sum $log
done
'''Load the MNIST digits dataset, skdata.mnist.dataset.MNIST'''
# from $downhill/examples/mnist-sparse-factorization.py
# cf. ~/py/bz/etc/mnist.py from sklearn
import numpy as np
import skdata.mnist # $site/skdata/mnist/dataset.py
def load_mnist( normalize=1 ):
    '''Load the MNIST digits dataset, skdata.mnist.dataset.MNIST'''
    mnist = skdata.mnist.dataset.MNIST()  # ~/.skdata/mnist/*ubyte.gz
    mnist.meta  # trigger download if needed.

    def arr(n, dtype):
        arr = mnist.arrays[n]
        return arr.reshape((len(arr), -1)).astype(dtype)

    train_images = arr('train_images', np.float32) / 256  #bz 128 - 1
    train_labels = arr('train_labels', np.uint8)
    if normalize:
        print "load_mnist: rows /= |row|, cos distance"
        train_images /= np.linalg.norm( train_images, axis=1 )[:,np.newaxis]
    xtrain, ytrain = train_images[:50000], train_labels[:50000, 0]
    xvalid, yvalid = train_images[50000:], train_labels[50000:, 0]
    print "xtrain:", _asum(xtrain)  # 50k, 784
    print "xvalid:", _asum(xvalid)  # 10k, 784
    print "ytrain: %s ... counts %s" % (ytrain[:10], np.bincount(ytrain))
    print "yvalid: %s ... counts %s" % (yvalid[:10], np.bincount(yvalid))
    return (xtrain, ytrain), (xvalid, yvalid)

def _asum( X ):
    """ array summary: "shape type min av max" """
    if not hasattr( X, "dtype" ):
        return str(X)
    return "%s %s min av max %.3g %.3g %.3g" % (
        X.shape, X.dtype, X.min(), X.mean(), X.max() )