How noisy is test error in classification ?

Algorithms for classification, in particular binary classification, have two different objectives:

  • a smooth, fast, approximate Loss function used in most optimizers
  • real loss measured on a test set: expensive to calculate, so usually done only at the end.

How close are the two, smooth approximate loss vs. real rough test error ? Everyone knows that they differ, but by how much ? The plots and log summaries below look at this question for a handful of test cases, run with a simple adagrad.py based on lightning AdaGrad and squared-hinge loss.
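
For concreteness, here is a minimal numpy sketch of the two objectives; squared_hinge_loss and testerr are illustrative names, not the actual functions in adagrad.py:

import numpy as np

def squared_hinge_loss(w, X, y):
    """ smooth surrogate loss: mean squared hinge, labels y in {-1, +1} """
    margins = 1 - y * X.dot(w)          # works for dense or scipy.sparse csr X
    return np.mean(np.maximum(margins, 0) ** 2)

def testerr(w, Xtest, ytest):
    """ real loss: fraction of test points misclassified (0/1 error) """
    yhat = np.sign(Xtest.dot(w))
    return np.mean(yhat != ytest)

The optimizer follows the smooth surrogate; the plots and logs below follow the rough 0/1 error on a held-out test set.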

Plots

[plots: ijcnn-rate10, mnist49-rate1]

Minimum vs. end-of-iteration test error

Most SGD optimizers run classif.fit( Xtrain, ytrain ) or the like on a training set for some user-specified number of iterations (aka epochs), and return the final coefficients w_final. It's up to the user to evaluate testerr( w_final, Xtest, ytest ), and ideally to cross-validate in an outer loop. Here I call testerr() 1000 times per iteration, to compare the minimum error and the end error within each iteration (a sketch of this bookkeeping follows the summary below):

-- 2newsgroups/2newsgroups-rate1.log
	# test error minimum,        end
iter 0: testerr 14.5 % at 12527, 15.1 % at 15075  30 % 0  wmin: quantiles [-4.8 -0.09 0 0.099 4.2] 
iter 1: testerr 13.6 % at 14261, 14.7 % at 15075  39 % 0  wmin: quantiles [-4.9 -0.11 0 0.11 4.2] 
iter 2: testerr 13.6 % at 14261, 14.8 % at 15075  42 % 0  wmin: quantiles [-4.9 -0.11 0 0.11 4.2] 

-- ijcnn/ijcnn-rate10.log
iter 0: testerr 5.8 % at 34771, 6.9 % at 73359  66 % 0  wmin: quantiles [-10 -2.4 -0.57 1.6 5.1]
iter 1: testerr 5.8 % at  5207, 7.1 % at 73359  62 % 0  wmin: quantiles [-9.7 -2.5 -0.094 2.4 4.2]
iter 2: testerr 5.8 % at  5207, 7.8 % at 73359  61 % 0  wmin: quantiles [-9.7 -2.5 -0.094 2.4 4.2]

-- mnist49/mnist49-rate1.log
iter 0: testerr 3.1 % at 10307, 4.0 % at 11024  74 % 0  wmin: quantiles [-2.7 -0.75 0 0.88 3.6]
iter 1: testerr 3.0 % at  7000, 3.2 % at 11024  77 % 0  wmin: quantiles [-2.8 -0.8 0 1 4]
iter 2: testerr 2.9 % at  3913, 3.5 % at 11024  78 % 0  wmin: quantiles [-3 -0.82 0 1 4.2]

-- realsim/realsim-rate1.log
iter 0: testerr 2.7 % at 54028, 2.9 % at 57846  76 % 0  wmin: quantiles [-6.4 -0.46 0 0.46 3.6]
iter 1: testerr 2.6 % at 25741, 2.7 % at 57846  81 % 0  wmin: quantiles [-6.1 -0.47 -0.0037 0.46 3.6]
iter 2: testerr 2.6 % at  4396, 2.7 % at 57846  81 % 0  wmin: quantiles [-6.2 -0.49 -0.0058 0.48 3.8]

-- webspam/webspam-rate1.log
iter 0: testerr 7.7 % at 271039, 7.9 % at 279999  42 % 0  wmin: quantiles [-13 -2.2 0 1.4 6.5]
iter 1: testerr 7.4 % at 277479, 7.5 % at 279999  46 % 0  wmin: quantiles [-15 -2.3 0 1.6 7.1]
iter 2: testerr 7.4 % at 267679, 7.6 % at 279999  47 % 0  wmin: quantiles [-16 -2.4 0 1.6 7.4]


# dataset      (n_samples, n_features)  sparsity --
2newsgroups    (18846, 101631)   csr, 0.1 % non-0
ijcnn          (91701, 22)
mnist49        (13782, 784)
realsim        (72309, 20958)    csr, 0.25 % non-0
webspam        (350000, 254)     csr, 33 % non-0

(The wmin quantiles are irrelevant to the error comparison, but they are useful when comparing different methods, error rates, etc.)

Summary: the minimum test error is a bit below the end-of-iteration error, but not by much -- a 0.1 % change is only 1 in 1000.
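
As a rough sketch of the bookkeeping behind these logs (not the actual code that produced them): sgd_step stands in for adagrad.py's per-sample update, and testerr is the 0/1 error from the first sketch above.

import numpy as np

def fit_track_testerr(sgd_step, w, Xtrain, ytrain, Xtest, ytest,
                      n_iter=3, n_evals=1000):
    """ run n_iter passes over the training set; call testerr() ~ n_evals
        times per pass, and report minimum vs. end-of-pass test error """
    n = Xtrain.shape[0]
    eval_every = max(1, n // n_evals)
    for it in range(n_iter):
        errs = []
        for t, j in enumerate(np.random.permutation(n), 1):
            w = sgd_step(w, Xtrain[j], ytrain[j])   # one AdaGrad / SGD update
            if t % eval_every == 0:
                errs.append((testerr(w, Xtest, ytest), t))
        emin, tmin = min(errs)
        eend, tend = errs[-1]
        print("iter %d: testerr %.1f %% at %d, %.1f %% at %d"
              % (it, 100 * emin, tmin, 100 * eend, tend))
    return w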

How far is the next noisy minimum ?

Well, it depends on "how noisy". If the objective function is linear or quadratic plus i.i.d. noise from a known noise model, then there is a body of Extreme value theory that might be applicable. Practice seems far away: most optimization programs don't even have callbacks, so users cannot even look at the noise. (An optimizer should call callback( locals() ) at each step, so that users can log / print / plot as it wanders up hill, down dale.)
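
For example, a toy gradient-descent loop with such a hook; the names minimize and log_progress are hypothetical, not an existing optimizer's API:

import numpy as np

def minimize(grad, w, n_steps, learning_rate=0.1, callback=None):
    """ toy gradient descent that hands its local variables to a user
        callback each step, so the user can watch the noisy progress """
    for step in range(n_steps):
        g = grad(w)
        w = w - learning_rate * g
        if callback is not None:
            callback(locals())   # user sees w, g, step, learning_rate ...
    return w

def log_progress(env):
    """ example callback: print the iterate norm every 100 steps """
    if env["step"] % 100 == 0:
        print(env["step"], np.linalg.norm(env["w"]))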

The problem of trading off

  • moving around
  • looking around

on a noisy terrain (1d, 2d, Nd) has surely been studied, but I cannot find the right search terms, nor simple examples; can anyone help ?

See also

http://AndrewGelman.com/2013/01/23/when-are-complicated-models-helpful-in-psychology-research-and-when-are-they-overkill/
"Are advanced statistical techniques really needed, or even legitimate, with ... very rough data ?
Or is it just fishing in the hope of discovering patterns that are not really there ?"

Comments are welcome, test cases most welcome.

cheers
-- denis

Last change: 6 Feb 2017
