Algorithms for classification, in particular binary classification, have two different objectives:
- a smooth, fast, approximate loss function used in most optimizers
- the real loss, measured on a test set: expensive to calculate, so usually done only at the end.
How close are the two, smooth approximate loss vs. real rough test error ?
Everyone knows that they differ, but by how much ?
The plots and log summaries below look at this question for a handful of test cases,
run with a simple adagrad.py based on lightning AdaGrad and squared-hinge loss.
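For concreteness, here is a minimal numpy sketch of the two objectives, assuming labels y in {-1, +1}
and a linear model with weights w; the 0/1 definition of testerr() here is my assumption, not necessarily
what adagrad.py does:

    import numpy as np

    def squared_hinge_loss(w, X, y):
        # smooth surrogate the optimizer minimizes: mean of max(0, 1 - y * X.w)^2
        margins = y * X.dot(w)
        return np.mean(np.maximum(0, 1 - margins) ** 2)

    def testerr(w, X, y):
        # rough 0/1 error we actually care about: % of points misclassified
        return 100 * np.mean(np.sign(X.dot(w)) != y)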
Most SGD optimizers run classif.fit( Xtrain, ytrain ) or the like on a training set,
for some user-specified number of iterations (aka epochs), and return the final coefficients w_final.
It's up to the user to evaluate testerr( w_final, Xtest, ytest ),
and ideally to cross-validate in an outer loop.
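In sklearn-style code that workflow looks roughly like the sketch below; AdaGradClassifier and its
parameters follow lightning's API from memory, so treat the names as illustrative:

    from lightning.classification import AdaGradClassifier

    # usual workflow: train for a fixed number of epochs, evaluate once at the end
    classif = AdaGradClassifier(loss="squared_hinge", n_iter=3)
    classif.fit(Xtrain, ytrain)          # the optimizer sees only the smooth training loss
    w_final = classif.coef_.ravel()      # final coefficients, wherever the last step landed
    print("end test error: %.1f %%" % testerr(w_final, Xtest, ytest))
    # ideally, wrap this in an outer cross-validation loop over train / test splits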
Here I call testerr() 1000 times per iteration,
to compare the minimum error and the end error in each iteration
(a sketch of the tracking code follows the logs):
-- 2newsgroups/2newsgroups-rate1.log
# test error minimum, end
iter 0: testerr 14.5 % at 12527, 15.1 % at 15075 30 % 0 wmin: quantiles [-4.8 -0.09 0 0.099 4.2]
iter 1: testerr 13.6 % at 14261, 14.7 % at 15075 39 % 0 wmin: quantiles [-4.9 -0.11 0 0.11 4.2]
iter 2: testerr 13.6 % at 14261, 14.8 % at 15075 42 % 0 wmin: quantiles [-4.9 -0.11 0 0.11 4.2]
-- ijcnn/ijcnn-rate10.log
iter 0: testerr 5.8 % at 34771, 6.9 % at 73359 66 % 0 wmin: quantiles [-10 -2.4 -0.57 1.6 5.1]
iter 1: testerr 5.8 % at 5207, 7.1 % at 73359 62 % 0 wmin: quantiles [-9.7 -2.5 -0.094 2.4 4.2]
iter 2: testerr 5.8 % at 5207, 7.8 % at 73359 61 % 0 wmin: quantiles [-9.7 -2.5 -0.094 2.4 4.2]
-- mnist49/mnist49-rate1.log
iter 0: testerr 3.1 % at 10307, 4.0 % at 11024 74 % 0 wmin: quantiles [-2.7 -0.75 0 0.88 3.6]
iter 1: testerr 3.0 % at 7000, 3.2 % at 11024 77 % 0 wmin: quantiles [-2.8 -0.8 0 1 4]
iter 2: testerr 2.9 % at 3913, 3.5 % at 11024 78 % 0 wmin: quantiles [-3 -0.82 0 1 4.2]
-- realsim/realsim-rate1.log
iter 0: testerr 2.7 % at 54028, 2.9 % at 57846 76 % 0 wmin: quantiles [-6.4 -0.46 0 0.46 3.6]
iter 1: testerr 2.6 % at 25741, 2.7 % at 57846 81 % 0 wmin: quantiles [-6.1 -0.47 -0.0037 0.46 3.6]
iter 2: testerr 2.6 % at 4396, 2.7 % at 57846 81 % 0 wmin: quantiles [-6.2 -0.49 -0.0058 0.48 3.8]
-- webspam/webspam-rate1.log
iter 0: testerr 7.7 % at 271039, 7.9 % at 279999 42 % 0 wmin: quantiles [-13 -2.2 0 1.4 6.5]
iter 1: testerr 7.4 % at 277479, 7.5 % at 279999 46 % 0 wmin: quantiles [-15 -2.3 0 1.6 7.1]
iter 2: testerr 7.4 % at 267679, 7.6 % at 279999 47 % 0 wmin: quantiles [-16 -2.4 0 1.6 7.4]
# (n_samples, n_features) --
2newsgroups (18846, 101631) csr .1 % non-0
ijcnn (91701, 22)
mnist49 (13782, 784)
realsim (72309, 20958) csr .25 %
webspam (350000, 254) csr 33 %
(The wmin quantiles are irrelevant to the error comparison,
but they're useful for comparing different methods, error rates etc.)
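The minimum-vs-end numbers above come from evaluating testerr() along the whole optimization path.
Here is a minimal sketch of such a tracker, assuming a per-step hook that gets the current weights w
and step number t; the hook mechanism is my assumption, not adagrad.py's actual interface:

    import numpy as np

    class TesterrTracker:
        """Record test error along the optimization path, report min and end."""
        def __init__(self, Xtest, ytest, every=100):
            self.Xtest, self.ytest, self.every = Xtest, ytest, every
            self.steps, self.errs = [], []

        def __call__(self, w, t):
            if t % self.every == 0:
                self.steps.append(t)
                self.errs.append(testerr(w, self.Xtest, self.ytest))

        def report(self):
            j = int(np.argmin(self.errs))
            print("testerr %.1f %% at %d, %.1f %% at %d"
                  % (self.errs[j], self.steps[j], self.errs[-1], self.steps[-1]))

Calling testerr() this often is exactly the expense that the usual fit-then-evaluate workflow avoids;
here it is the whole point of the experiment.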
Summary: minimum test error is a bit less than final, but not a lot -- a 0.1 % change is only 1 per 1000.
Well, how much the minimum can beat the end depends on how noisy the test error is along the way.
If the objective function is linear or quadratic plus i.i.d. noise with a known noise model,
then there's a body of Extreme value theory that might be applicable.
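As a toy illustration of the kind of statement extreme value theory makes, assume (a strong assumption)
that the test error near the end is just a flat level plus i.i.d. Gaussian noise of standard deviation sigma;
then the minimum of n evaluations lies roughly sigma * sqrt(2 ln n) below that level:

    import numpy as np
    rng = np.random.default_rng(0)

    # toy model: test error = flat level + i.i.d. Gaussian noise, n evaluations
    n, sigma = 1000, 0.3          # e.g. 1000 testerr() calls, 0.3 % noise
    noise = rng.normal(0, sigma, size=n)
    print("end - min: %.2f,  rough EVT estimate sigma*sqrt(2 ln n): %.2f"
          % (noise[-1] - noise.min(), sigma * np.sqrt(2 * np.log(n))))

Real test-error curves are of course neither flat nor i.i.d., so this is only a rough yardstick.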
Practice seems far from this: most optimization programs don't even have callbacks,
so users can't even look at the noise.
(Ideally an optimizer would call callback( locals() )
so that users can log / print / plot
as it wanders up hill, down dale.)
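A sketch of what such a hook could look like in a hand-rolled SGD loop; this is not any particular
library's API:

    def sgd(w, grad, nsteps, eta=0.1, callback=None):
        # plain SGD with an optional per-step callback( locals() )
        for t in range(nsteps):
            w = w - eta * grad(w, t)
            if callback is not None:
                callback(locals())   # hand the caller everything: w, t, eta, ...
        return w

    # e.g. print the weight norm every 1000 steps:
    # sgd(w0, grad, 10000, callback=lambda env:
    #     env["t"] % 1000 == 0 and print(env["t"], np.linalg.norm(env["w"])))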
The problem of trading off
- moving around
- looking around
on a noisy terrain (1d, 2d, Nd) has surely been studied, but I cannot find the right search terms, nor simple examples; can anyone help ?
http://AndrewGelman.com/2013/01/23/when-are-complicated-models-helpful-in-psychology-research-and-when-are-they-overkill/
"Are advanced statistical techniques really needed, or even legitimate, with ... very rough data ?
Or is it just fishing in the hope of discovering patterns that are not really there ?"
cheers
-- denis
Last change: 6 Feb 2017