# JohannesBuchner/statistics-minimal.rst Last active Jul 31, 2016

# ArXiV minimal statistics checklist

This checklist helps you identify and fix common errors and misinterpretations in your analysis, or in a paper you are refereeing.

1. If you use p-values (from a KS test, Pearson correlation, etc.).
1. What do you think a low p-value says?
1. You have absolutely disproved the null hypothesis (e.g. "no correlation" is ruled out, the data are not sampled from this model, there is no difference between the population means).
2. You have found the probability of the null hypothesis being true.
3. You have absolutely proved your experimental hypothesis (e.g. that there is a difference between the population means, there is a correlation, the data are sampled from this model).
4. You can deduce the probability of the experimental hypothesis being true.
5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.
6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.

• If you have chosen any of the six options above, you are wrong. p-values cannot give you any of this information.
• p-values are not probabilities of a hypothesis being true. They only specify how frequently data at least as extreme as yours would occur if the null hypothesis were true. The null hypothesis may not be true, and a rare occurrence happens if you try often enough, so a low p-value cannot tell you that the null hypothesis is improbable (a prior is needed for that).
• If you want to compare two models, try Bayesian model comparison instead.

1. Do you declare the p-value "significant" (or some creative variant thereof) based on some threshold?
• p-values are not probabilities, but random variables between 0 and 1, so sometimes they are low by chance.
• Choosing a "significance" threshold (0.1, 0.05, 3sigma, 5sigma) based on your result is poor practice.
• More important than the significance is to report the size of the effect, and place upper and lower limits on it.
• Beware that p-values are not the frequency of making a mistake or the probability of detection. Following this article, for a p-value of 0.05 (claiming a detection of something), the probability that there really is something is only ~64%. For 2 sigma it is 80%, for p<0.01 it is 90%, and for 3 sigma it is 99%. The reason is that most hypotheses you test are probably not true (the article assumes 10% of all tested hypotheses are true). To first order, you can estimate the probability that you really detected something as `100% - p-value / (fraction of hypotheses that are true)`.
• To actually determine the frequency of making a mistake, make Monte Carlo simulations without the effect, generating the same number of data points at the same time/frequencies etc. and apply your detection method.
• Publication bias additionally makes reported p-values unreliable (because they are random variables, and preferentially only low values are reported).
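The percentages quoted above can be reproduced with a short calculation. The sketch below is an illustration, not the article's own code: it takes the 10% fraction of true hypotheses from the text and additionally assumes a detection power of 80%, which is not stated in this checklist. The function name `prob_detection_is_real` is made up for the example.

```python
def prob_detection_is_real(alpha, prior_true=0.10, power=0.80):
    """Fraction of 'significant' results that correspond to a real effect.

    alpha      -- significance threshold (false positive rate under the null)
    prior_true -- fraction of tested hypotheses that are true (10% in the text)
    power      -- chance of detecting a real effect (ASSUMED 80%, not from the text)
    """
    true_positives = power * prior_true
    false_positives = alpha * (1.0 - prior_true)
    return true_positives / (true_positives + false_positives)


for label, alpha in [("p < 0.05", 0.05), ("p < 0.01", 0.01)]:
    print(f"{label}: {prob_detection_is_real(alpha):.0%}")
# p < 0.05: 64%
# p < 0.01: 90%
```

Lowering `prior_true` or `power` makes the result worse, which is why the same threshold can mean very different things in different fields.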

1. Do you do multiple tests on the same data? For example, you have 10 variables, and try to find any correlations (N*(N-1)/2 combinations). Or you test many pixels for a significant detection.
• You need to correct for the number of tests, so the threshold should be set much lower.
• For example, control the False Discovery Rate (FDR) using the Benjamini-Hochberg procedure.
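As a minimal sketch of the Benjamini-Hochberg step-up procedure (assuming independent or positively correlated tests), the function below sorts the p-values, finds the largest rank k whose p-value is at most k/m times the target FDR, and declares every test up to that rank a discovery. The function name and the example p-values are invented for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where a test is declared a discovery."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k / m * fdr.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            largest_k = rank
    # Every test ranked at or below largest_k is a discovery.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Ten made-up p-values, e.g. from ten correlation tests on the same data.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values, fdr=0.05))
```

With a plain 0.05 threshold, five of these ten tests would be called significant; after the correction only the two smallest p-values survive.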
1. If you do a KS test:
1. Do you do a KS test in 2 or more dimensions?
• It is not valid there, unless you can prove that the statistic is distribution-free in your application.
• Instead, use Monte Carlo simulations.
1. Do you do a KS test of your data against a fitted model (against the same data)?
• The KS test results are invalid then, and p-values are misleading. Do not use this!
• Use Monte Carlo simulations instead; also look into bootstrapping and cross-validation.
1. For the Anderson-Darling test (which is better than KS) or the Cramér–von Mises test, both points above about the KS test also apply.
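One way to do the Monte Carlo calibration for a KS test against a fitted model is a parametric bootstrap: simulate data sets from the fitted model, refit each one, and compare the resulting KS distances with the observed one. The sketch below assumes a Gaussian model and uses only the Python standard library; the function names are hypothetical, and in practice you would substitute your own model and fitting routine.

```python
import random
from statistics import NormalDist, fmean, stdev


def ks_statistic(sample, cdf):
    """Maximum distance between the empirical CDF and a model CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def fit_and_ks(sample):
    """Fit a Gaussian to the sample, then compute KS against that same fit."""
    model = NormalDist(fmean(sample), stdev(sample))
    return model, ks_statistic(sample, model.cdf)


def calibrated_p_value(data, n_sim=500, seed=1):
    """Monte Carlo p-value: repeat fit + KS on data drawn from the fitted model."""
    rng = random.Random(seed)
    model, d_obs = fit_and_ks(data)
    n_exceed = 0
    for _ in range(n_sim):
        fake = [rng.gauss(model.mean, model.stdev) for _ in range(len(data))]
        _, d_sim = fit_and_ks(fake)  # refit every simulated data set
        if d_sim >= d_obs:
            n_exceed += 1
    return (n_exceed + 1) / (n_sim + 1)


# Made-up data for illustration: 50 points that really are Gaussian.
rng = random.Random(42)
data = [rng.gauss(3.0, 2.0) for _ in range(50)]
print("calibrated p-value:", calibrated_p_value(data))
```

The important design choice is that the fit is repeated inside the loop, so the simulated distances include the same "fitted to its own data" advantage as the observed one.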

1. You do not detect anything with significance.
Don't be sad. A non-detection means you can place an upper limit on the effect size! Don't let the opportunity slip.
1. If you do a likelihood ratio to compare two models
1. Do you test near the boundary of a parameter? For example, do you test for the presence of an additional component?
• The likelihood ratio is not chi-square distributed there.
• You need to do simulations (using the null model) to predict the distribution of likelihood ratios.
• Or: Use Bayesian model comparison, or an information criterion (AIC, WAIC, WBIC).
1. Is one model a special case of the other model?
• If not, the likelihood ratio is not chi-square distributed.
• You need to do simulations (using the null model) to predict the distribution of likelihood ratios.
• Or: Use Bayesian model comparison.
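The simulation approach can be sketched with a toy boundary problem. It assumes data drawn from a unit-variance Gaussian whose mean `mu` cannot be negative (think of the amplitude of an additional component), with the null model `mu = 0`. In this toy case the statistic 2 * log(likelihood ratio) reduces to `n * max(0, mean)^2`. All names and numbers are illustrative.

```python
import random
from statistics import fmean


def lr_statistic(sample):
    """2 * log likelihood ratio for N(mu, 1) with mu >= 0 versus mu = 0."""
    mu_hat = max(0.0, fmean(sample))  # best-fit mean, clipped at the boundary
    return len(sample) * mu_hat ** 2


def simulate_null_lr(n_points, n_sim=2000, seed=0):
    """Distribution of the statistic when the null model (mu = 0) is true."""
    rng = random.Random(seed)
    return [
        lr_statistic([rng.gauss(0.0, 1.0) for _ in range(n_points)])
        for _ in range(n_sim)
    ]


def monte_carlo_p_value(observed_lr, null_lrs):
    """Fraction of null simulations at least as extreme as the observation."""
    n_exceed = sum(1 for lr in null_lrs if lr >= observed_lr)
    return (n_exceed + 1) / (len(null_lrs) + 1)


null_lrs = simulate_null_lr(n_points=20)
frac_zero = sum(1 for lr in null_lrs if lr == 0.0) / len(null_lrs)
print("fraction of null simulations with statistic exactly 0:", frac_zero)
print("p-value for an observed statistic of 2.71:", monte_carlo_p_value(2.71, null_lrs))
```

In this toy model roughly half of the null simulations land exactly on zero, which a chi-square distribution with one degree of freedom does not allow, so reading the p-value from a chi-square table would be off.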

1. If you do Goodness-of-Fit.
1. Do you use the chi^2 statistic with binning chosen based on the data (for example, algorithms that provide adaptive binning)?
2. Do you use the chi^2 statistic on non-linear models?
3. Do you use the chi^2 statistic on small datasets?
• The chi^2 statistic is then not valid or problematic. See "Dos and don'ts of reduced chi-squared"
• (What is good advice here?)
• An alternative to Goodness-of-Fit is to first make an over-fitting, empirical model, and then show that a physically motivated model has a similar likelihood (and given its higher prior should be preferred).
1. If you do Bayesian inference
1. Do you check how sensitive your results are to the priors?
• Try out other priors and see if the results are any different. [Assumed constants are also priors (delta function priors).]
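A prior sensitivity check can be as simple as repeating the inference under a few reasonable priors and comparing the answers. The sketch below assumes a toy problem, estimating a rate from a count of successes, where Beta priors give the posterior mean in closed form. The priors and the counts are made up for illustration.

```python
def posterior_mean(successes, trials, prior_a, prior_b):
    """Posterior mean of a rate under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + trials)


# Three candidate priors (illustrative choices, not recommendations).
priors = {
    "flat Beta(1, 1)": (1.0, 1.0),
    "Jeffreys Beta(0.5, 0.5)": (0.5, 0.5),
    "skeptical Beta(1, 9)": (1.0, 9.0),
}

for successes, trials in [(3, 10), (300, 1000)]:
    means = {
        name: posterior_mean(successes, trials, a, b)
        for name, (a, b) in priors.items()
    }
    spread = max(means.values()) - min(means.values())
    print(f"{successes}/{trials}: spread between priors = {spread:.3f}")
    for name, mean in means.items():
        print(f"  {name}: posterior mean = {mean:.3f}")
```

If the spread between priors is comparable to the uncertainty you report, the result is prior-dominated and the paper should say so.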

More common Statistical Mistakes in the Astronomical Literature: by Eric Feigelson http://bccp.berkeley.edu/beach_program/COTB14Feigelson3.pdf