@denis-bz
Created November 18, 2016 14:23
downhill mnist 2016-11-18 Nov

mnist-downhill.py below runs the downhill gradient-descent optimizers on the MNIST handwritten digits (logistic regression, 50k training / 10k validation images).

Downhill in a nutshell:

  • gradient descent optimizers: SGD, RMSProp, Ada*, with momentum
  • a thin wrapper for theano (really thin: 1500 lines, half comments)
  • well-written, narrative docs
  • makes it easy to monitor variables, to see what an optimizer is doing (minimal sketch below).
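
For example, the build / iterate / monitor pattern used in mnist-downhill.py below, in outline. This is only a sketch: loss, W, b, x, y, W2loss and the two downhill.Dataset objects are defined in the full script further down; each iteration yields dicts of train / validation monitors ('loss' plus the expressions you register).

    opt = downhill.build( 'rmsprop', loss=loss, params=[W, b], inputs=[x, y],
        monitors=[('W2loss', W2loss)] )     # extra expressions to watch each epoch
    for tm, vm in opt.iterate( train_dataset, valid_dataset,
            learning_rate=.002, momentum=.9 ):
        print tm['loss'], vm['loss'], tm['W2loss']   # train / valid monitor dicts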

Log summaries

Downhill's rmsprop and sgd both reach error rates ~ 7 %:

-- rmsprop/rmsprop-rate.002.log
error rate: 7.29 %  valid-loss 2.3     train-loss 0.286   epoch 3  stepq [-71 -31   0  17  48] 
error rate: 6.93 %  valid-loss 0.259   train-loss 0.265   epoch 6  stepq [-46 -21   0  12  33] 
...
error rate: 6.75 %  valid-loss 0.249   train-loss 0.241   epoch 21  stepq [-27 -14  -1   7  17] 


-- sgd/sgd-rate1.log
error rate: 8.40 %  valid-loss 2.3     train-loss 0.334   epoch 3  stepq [-31 -11   0  13  29] 
error rate: 7.99 %  valid-loss 0.297   train-loss 0.302   epoch 6  stepq [-16  -7   0   7  17] 
...
error rate: 6.94 %  valid-loss 0.254   train-loss 0.254   epoch 42  stepq [-6 -2  0  2  6] 

RMSProp is faster and takes bigger steps, ending at different W and b. (RMSProp implementations differ; downhill follows Graves 2013, "Generating Sequences With Recurrent Neural Networks".)
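
For reference, a rough numpy sketch of that Graves-style update, outside downhill: running means of the gradient and of its square, step scaled by the estimated gradient standard deviation, plus momentum. The rho, eps and learning-rate values here are illustrative, not downhill's defaults.

    import numpy as np

    def rmsprop_step( w, grad, state, lr=.002, mom=.9, rho=.95, eps=1e-4 ):
        """ one Graves-2013-style RMSProp step; state = (g1, g2, step) """
        g1, g2, step = state
        g1 = rho * g1 + (1 - rho) * grad          # running mean of the gradient
        g2 = rho * g2 + (1 - rho) * grad ** 2     # running mean of its square
        step = mom * step - lr * grad / np.sqrt( g2 - g1 ** 2 + eps )
        return w + step, (g1, g2, step)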

Obviously, tuning hyperparameters will change these results. And one should, of course, cross-validate.

Fwiw, scikit-learn KNeighborsClassifier gets to 2.5 % errors in 11 seconds, with no tuning other than n_neighbors=3 .
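
That baseline is roughly the sketch below, on the same load_mnist and 50k / 10k split as the script; runtime of course depends on the machine.

    from sklearn.neighbors import KNeighborsClassifier
    from skdata_mnist import load_mnist

    (xtrain, ytrain), (xvalid, yvalid) = load_mnist( normalize=1 )
    knn = KNeighborsClassifier( n_neighbors=3 ).fit( xtrain, ytrain )
    knnerr = (knn.predict( xvalid ) != yvalid).mean() * 100
    print "knn error rate: %.2f %%" % knnerr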

More methods than test cases ?

Say we have 5 different methods for handling a disease, and test them on N patients. The outcome is an N x 5 matrix of scores, all in the range say 80 % to 100 %. How big should N, the number of test cases, be to evaluate which method is best ? I'd say at least 5; 10 would be better -- cf. Wendel's theorem.
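
Wendel's theorem: the probability that N points drawn at random on a sphere in d dimensions all fit in some half-space is 2^(1-N) * sum_{k=0}^{d-1} C(N-1, k). Reading d as the number of methods and N as the number of test cases is only a rough analogy, but it gives a feel for the numbers:

    from scipy.special import comb

    def wendel( N, d ):
        """ P( N random points on a sphere in d dimensions lie in one half-space ) """
        return 2.0 ** (1 - N) * sum( comb( N - 1, k ) for k in range( d ))

    print wendel( 5, 5 )    # 1.0
    print wendel( 10, 5 )   # 0.5
    print wendel( 20, 5 )   # ~ .01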

SGD, RMSProp and Ada* are 5+ methods, but MNIST is only one (1) test case.

(Evaluating "which method is best" is a problem in at least 3 dimensions: not just "score", but cost, understandable code, understandable model, what's "in", my value added ...)

Comments are welcome, test cases most welcome.

cheers
-- denis

Last change: 19 Nov 2016

""" downhill: classify mnist digits with logistic regression
see http://downhill.readthedocs.io -- 10 pages, well written
optimizers: SGD RMSProp Ada* with momentum
"""
# see also: theanets.readthedocs.io
# https://gist.github.com/denis-bz Downhill-mnist.md
from __future__ import division
import sys
import climate
import logging
import numpy as np
import theano
import theano.tensor as TT
import downhill # $downhill/base.py $downhill/adaptive.py
from skdata_mnist import load_mnist
from etc import etcutil as nu
__version__ = "2016-11-19 Nov denis-bz-py t-online de"
np.set_printoptions( threshold=20, edgeitems=10, linewidth=140,
    formatter = dict( float = lambda x: "%.2g" % x ))  # float arrays %.2g

def val( x ):
    return x.get_value() if hasattr( x, "get_value" ) \
        else x
print "\n", 80 * "-"
print " ".join(sys.argv)
#...............................................................................
algo = 'rmsprop'
rate = 1 if algo == 'sgd' else .002
momentum = .9
nesterov = False
w2loss = 0 # loss + w2loss (W^2).mean
# .001 no diff, .01 a bit worse -- loss and score both noisy
patience = 3 # min_improvement in patience * validate_every epochs, else quit
min_improvement = .01
validate_every = 3 # default 10
batch_size = 128
max_updates = None
seed = 1
tag = ".tmp" # save > tag.npz
save = 0
log = 1
# to change these params, run this.py a=1 b=None 'c = ...' in sh or ipython
for arg in sys.argv[1:]:
    exec( arg )
np.random.seed( seed )
Params = """
algo %s
rate %.3g
momentum %.3g
nesterov %d
w2loss %.3g
patience %d
min_improvement %.3g
validate_every %d
batch_size %d
seed %d
""" % ( algo, rate, momentum, nesterov, w2loss, patience, min_improvement,
validate_every, batch_size, seed )
print Params
print "versions: downhill %s theano %s" % (
downhill.__version__, theano.__version__ )
#...............................................................................
(xtrain, ytrain), (xvalid, yvalid) = load_mnist( normalize=1 )
train_dataset = downhill.Dataset( [xtrain, ytrain], name="train",
    batch_size=batch_size, rng=seed )
valid_dataset = downhill.Dataset( [xvalid, yvalid], name="valid",
    batch_size=batch_size, rng=seed )
# better val, val2 -- error bars
#...............................................................................
# theano dataflow graphs: inputs x y, state W b -> loss
x = TT.fmatrix('x') # data, 28x28 pixels 0 .. 1
y = TT.ivector('y') # labels, ints 0 .. 9
W = theano.shared( np.zeros( (784, 10), dtype=np.float32 ), name='W' )
b = theano.shared( np.zeros( 10, dtype=np.float32 ), name='b' )
xdotW = TT.dot( x, W ) + b
p_y_given_x = TT.nnet.softmax( xdotW )
probf = theano.function( [x], p_y_given_x )
# from http://deeplearning.net/tutorial/code/logistic_sgd.py
# def negative_log_likelihood(self, y):
W2loss = w2loss * (W * W).mean()
loss = -TT.mean( TT.log( p_y_given_x )[TT.arange(y.shape[0]), y]) \
    + W2loss
# grad = TT.grad( loss, [x] ) # not [x, y] ?
# gradf = theano.function( [x, y], grad, name='gradf' )
def predict( x ):
    return np.argmax( x.dot( W.get_value() ) + b.get_value(), axis=1 )

def error_percent( x, y ):
    ypred = predict( x )
    return (y != ypred).mean() * 100, ypred
if log:
    climate.log.TTY_Formatter._DATE_FORMAT = ' '  # dates kill diff
    climate.enable_default_logging()
#...............................................................................
opt = downhill.build(
    algo=algo,
    loss=loss,
    params=[W, b],
    inputs=[x, y],
    monitors=[('W2loss', W2loss)],
    monitor_gradients=True,
    )
# monitor, save these --
mon = nu.Bag(  # a dict with mon.key == mon["key"], mon.<tab> in ipython
    tloss = [],
    vloss = [],
    W = [],
    b = [],
    )
minerr = np.inf
wprev = 0
iter = 0
stepq = np.r_[1, 10, 50, 90, 99]
print "step quantiles:", stepq
print ""
#...............................................................................
for tm, vm in opt.iterate( train_dataset, valid_dataset,
        learning_rate=rate,
        momentum=momentum,
        nesterov=nesterov,
        patience=patience,
        min_improvement=min_improvement,
        validate_every=validate_every,
        max_gradient_elem=0,
        max_gradient_norm=0,  # not both
        max_updates=max_updates,
        ):
    iter += 1
    tm = nu.Bag(tm)
    vm = nu.Bag(vm)  # most recent, default validate_every=10
    tloss = tm.loss
    vloss = vm.loss
    Wval = W.get_value()
    wstep = Wval - wprev
    wprev = Wval
    for _k, _v in mon.items():
        _v.append( val( eval( _k )))  # mon.err.append( err ) ...
    if (iter % validate_every) == 0:
        verr, ypredict = error_percent( xvalid, yvalid )  # noisier than loss
        if verr < minerr:
            minerr = verr
            minepoch = iter - 1
            minpredict = ypredict
        p = nu.ints( np.percentile( wstep, q=stepq ) * 100 )
        print "error rate: %.2f %%  valid-loss %-6.3g  train-loss %-6.3g  epoch %d  stepq %s " % (
            verr, vm.loss, tm.loss, iter, p )

for _k, _v in mon.items():
    mon[_k] = np.array( _v )  # lists -> arrays

try:
    from etc import confus
    confus.pconfus( yvalid, minpredict, label=algo )  # print confusion matrix
except ImportError:
    pass
# rmsprop steps >> sgd steps ?
W = mon.W[minepoch] # 784, 10
print "best W ", nu.quantiles( W, q=[1, 10, 50, 90, 99] )
print "best b:", mon.b[minepoch]
if save:  # to plot
    out = tag + ".npz"
    print "\nsaving to", out
    mon.Params = Params
    mon.minerr = minerr
    mon.minepoch = minepoch  # 0-origin
    mon.minpredict = minpredict.astype(np.uint8)
    mon.yvalidate = yvalid
    # pdict.pdict( mon )
    nu.mkdirpart( out )
    np.savez( out, **mon )
#!/bin/bash
# run xx.py [args] > log
trap exit 2 3 15
# pmset -g live | egrep -w sleep # mac
every=3
nesterov=0 # no diff ?
max=None
patience=3
rate=.002
w2loss=0
[[ $1 ]] &&
export "$@" # -> py
#...............................................................................
for algo in rmsprop sgd # adadelta adagrad adam # $downhill/adaptive.py
do
case $algo in
sgd ) rate=1 ;;
* ) rate=$rate
esac
dir=$algo
mkdir -p $dir 2> /dev/null
tag=$dir/$algo-rate$rate
[[ $every != 3 ]] && tag=$tag-every$every
[[ $nesterov == 1 ]] && tag=$tag-nesterov$nesterov
[[ $w2loss != 0 ]] && tag=$tag-w2loss$w2loss
log=$tag.log
mv $log $log- 2> /dev/null
#...............................................................................
py -from -time mnist-downhill.py tag=\"$tag\" algo=\"$algo\" \
max_updates=$max \
nesterov=$nesterov \
patience=$patience \
rate=$rate \
save=1 \
validate_every=$every \
w2loss=$w2loss \
"$@" \
> $log 2>&1
log-sum $log
done
'''Load the MNIST digits dataset, skdata.mnist.dataset.MNIST'''
# from $downhill/examples/mnist-sparse-factorization.py
# cf. ~/py/bz/etc/mnist.py from sklearn
import numpy as np
import skdata.mnist # $site/skdata/mnist/dataset.py
def load_mnist( normalize=1 ):
    '''Load the MNIST digits dataset, skdata.mnist.dataset.MNIST'''
    mnist = skdata.mnist.dataset.MNIST()  # ~/.skdata/mnist/*ubyte.gz
    mnist.meta  # trigger download if needed.

    def arr(n, dtype):
        arr = mnist.arrays[n]
        return arr.reshape((len(arr), -1)).astype(dtype)

    train_images = arr('train_images', np.float32) / 256  #bz 128 - 1
    train_labels = arr('train_labels', np.uint8)
    if normalize:
        print "load_mnist: rows /= |row|, cos distance"
        train_images /= np.linalg.norm( train_images, axis=1 )[:,np.newaxis]
    xtrain, ytrain = train_images[:50000], train_labels[:50000, 0]
    xvalid, yvalid = train_images[50000:], train_labels[50000:, 0]
    print "xtrain:", _asum(xtrain)  # 50k, 784
    print "xvalid:", _asum(xvalid)  # 10k, 784
    print "ytrain: %s ... counts %s" % (ytrain[:10], np.bincount(ytrain))
    print "yvalid: %s ... counts %s" % (yvalid[:10], np.bincount(yvalid))
    return (xtrain, ytrain), (xvalid, yvalid)

def _asum( X ):
    """ array summary: "shape type min av max" """
    if not hasattr( X, "dtype" ):
        return str(X)
    return "%s %s min av max %.3g %.3g %.3g" % (
        X.shape, X.dtype, X.min(), X.mean(), X.max() )