This checklist helps you identify and fix common errors and misinterpretations, in your own analysis or in a paper you are refereeing.
- If you use p-values (from a KS test, Pearson correlation, etc.):
- What do you think a low p-value says?
- (a) You have absolutely disproved the null hypothesis (e.g. "no correlation" is ruled out, the data are not sampled from this model, there is no difference between the population means).
- (b) You have found the probability of the null hypothesis being true.
- (c) You have absolutely proved your experimental hypothesis (e.g. that there is a difference between the population means, there is a correlation, the data are sampled from this model).
- (d) You can deduce the probability of the experimental hypothesis being true.
- (e) You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.
- (f) You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.
Choose the right answer from (a)-(f).
- If you have chosen any of (a)-(f), you are wrong: p-values cannot give you any of this information.
- p-values are not probabilities of hypotheses. A p-value only specifies how frequently data at least as extreme as yours would occur if the null hypothesis were true (see the sketch below). The null hypothesis may not be true, and rare outcomes do occur if you try often enough, so a p-value alone cannot tell you that the null hypothesis is improbable (a prior is needed for that).
- If you want to compare two models, try Bayesian model comparison instead.
Further reading:
- The Null Ritual: What You Always Wanted to Know About Significance Testing but Were Afraid to Ask http://library.mpib-berlin.mpg.de/ft/gg/GG_Null_2004.pdf
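As a concrete illustration, here is a minimal sketch (a toy numpy/scipy simulation, not part of the original checklist): two unrelated variables are drawn repeatedly, so the null hypothesis of "no correlation" is true by construction, yet p < 0.05 still occurs in roughly 5% of repetitions.

```python
# Toy demonstration: under a true null hypothesis, p-values are uniformly
# distributed between 0 and 1, so "significant" values occur by chance.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n_repeats, n_points = 10000, 30

pvalues = np.empty(n_repeats)
for i in range(n_repeats):
    x = rng.normal(size=n_points)
    y = rng.normal(size=n_points)   # independent of x: the null is true
    r, p = pearsonr(x, y)
    pvalues[i] = p

print("fraction with p < 0.05:", (pvalues < 0.05).mean())   # ~0.05, not 0
```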
- Do you declare the p-value "significant" (or some creative variant thereof) based on some threshold?
- p-values are not probabilities of hypotheses, but random variables between 0 and 1 (uniformly distributed if the null hypothesis is true), so sometimes they are low purely by chance.
- Choosing a "significance" threshold (0.1, 0.05, 3 sigma, 5 sigma) after looking at your result is poor practice.
- More important than the significance is to report the size of the effect, and place upper and lower limits on it.
- Beware that p-values are not the frequency of making a mistake or the probability of detection. Following the article below, for a p-value of 0.05 (claiming a detection of something), the probability that there really is something is only ~64%. For 2 sigma it is 80%, for p<0.01 it is 90%, for 3 sigma it is 99%. This is because most hypotheses you test are probably not true (the article assumes 10% of all tested hypotheses are true). To first order you can estimate the probability that you really detected something as
100% - p-value / (fraction of hypotheses that are true)
- To actually determine how often your method makes false detections, run Monte Carlo simulations without the effect, generating the same number of data points at the same times/frequencies etc., and apply your detection method (see the sketch below).
- Publication bias additionally makes reported p-values unreliable (because they are random variables, preferentially only low values get reported).
Further reading:
- http://www.medpagetoday.com/Blogs/TheMethodsMan/52171
- A false detection is also known as a false positive or type I error; its rate (the frequency of rejecting a true null hypothesis) is alpha. This is distinct from the false discovery rate (FDR), the fraction of your claimed detections that are false; see the multiple-testing item below.
- A missed detection is also known as a false negative or type II error; its rate, beta, is related to the power of the test, 1-beta. For low-power tests with a high missed-detection rate (say, >2%), highly significant features will often be false detections. http://andrewgelman.com/2014/11/17/power-06-looks-like-get-used/
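A minimal sketch of such a null Monte Carlo calibration (toy data and a placeholder detection statistic, assumed for illustration; substitute your own data, sampling and detection method):

```python
# Calibrate how often your detection method triggers on pure noise.
import numpy as np

rng = np.random.default_rng(1)
n_points = 100            # same number of data points as your real data set
n_simulations = 10000

def my_detection_statistic(data):
    # placeholder: the most extreme point in units of the scatter
    return np.max(np.abs(data)) / np.std(data)

observed = my_detection_statistic(rng.normal(size=n_points))   # put your real data here

# generate data sets *without* the effect (pure noise, same sampling and errors)
null_stats = np.array([my_detection_statistic(rng.normal(size=n_points))
                       for _ in range(n_simulations)])

# the frequency of claiming a detection at least this strong when nothing is there
print("false detection frequency:", (null_stats >= observed).mean())
```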
- Do you do multiple tests on the same data? For example, you have 10 variables and try to find any correlations (N*(N-1)/2 = 45 pairs). Or you test many pixels for a significant detection.
- You need to correct for the number of tests, so the detection threshold should be set much lower.
- For example, control the False Discovery Rate (FDR) using the Benjamini-Hochberg procedure (see the sketch below).
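A minimal sketch of the Benjamini-Hochberg procedure in plain numpy (the alpha level and the toy p-values are arbitrary examples; statsmodels.stats.multitest.multipletests with method="fdr_bh" provides the same functionality):

```python
# Benjamini-Hochberg: control the expected fraction of false discoveries.
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean mask of which tests to report, controlling the FDR at alpha."""
    p = np.asarray(pvalues)
    m = len(p)
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    # find the largest rank k with p_(k) <= k/m * alpha, reject everything up to it
    below = p[order] <= ranks / m * alpha
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        significant[order[:k + 1]] = True
    return significant

# example: 45 correlation tests among 10 variables, where most nulls are true
pvals = [0.001, 0.008, 0.04, 0.06] + [0.2 + 0.01 * i for i in range(41)]
print(benjamini_hochberg(pvals, alpha=0.05))
```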
- If you do a KS test:
- Do you do a KS test in 2 or more dimensions?
- The KS test is not valid there: the statistic is no longer distribution-free (you would have to prove that it is for your specific application).
- Instead, use Monte Carlo simulations.
- Do you do a KS test of your data against a fitted model (against the same data)?
- The KS test results are invalid then, and p-values are misleading. Do not use this!
- Use Monte Carlo simulations instead; look into bootstrapping and cross-validation (see the sketch below).
- Both caveats above (multiple dimensions, testing against a fitted model) also apply to the Anderson-Darling test (which is more sensitive than KS) and the Cramér–von Mises test.
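A minimal parametric-bootstrap sketch for the fitted-model case (a toy example assuming a Gaussian model whose mean and width are estimated from the same data; substitute your own data and model): the naive KS p-value is misleading, the calibrated one is not.

```python
# Calibrate the KS statistic when the model parameters were fitted to the data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(loc=1.0, scale=2.0, size=80)        # stand-in for your data

# fit the model to the data, then compute the KS statistic against that fit
mu, sigma = data.mean(), data.std(ddof=1)
d_obs, p_naive = stats.kstest(data, "norm", args=(mu, sigma))

# parametric bootstrap: simulate from the fitted model, refit, recompute D
n_sim = 2000
d_sim = np.empty(n_sim)
for i in range(n_sim):
    fake = rng.normal(mu, sigma, size=data.size)
    mu_i, sigma_i = fake.mean(), fake.std(ddof=1)
    d_sim[i] = stats.kstest(fake, "norm", args=(mu_i, sigma_i)).statistic

print("naive p-value:", p_naive, " calibrated p-value:", (d_sim >= d_obs).mean())
```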
Further reading:
- You do not detect anything with significance.
- Don't be sad. A non-detection means you can place an upper limit on the effect size! Don't let the opportunity slip.
- If you use a likelihood ratio to compare two models:
- Do you test near the boundary of a parameter? For example, do you test for the presence of an additional component (whose normalization cannot be negative)?
- The likelihood ratio is not chi-square distributed there (the assumptions behind Wilks' theorem are violated).
- You need to do simulations (using the null model) to predict the distribution of likelihood ratios (see the sketch below).
- Or: use Bayesian model comparison, or an information criterion (AIC, WAIC, WBIC).
- Is one model a special case of the other model?
- If not, the likelihood ratio is not chi-square distributed.
- You need to do simulations (using the null model) to predict the distribution of likelihood ratios.
- Or: Use Bayesian model comparison.
Further reading:
- Statistics, Handle with Care: Detecting Multiple Model Components with the Likelihood Ratio Test http://adsabs.harvard.edu/abs/2002ApJ...571..545P
- Bayesian model comparison and Information Criteria: http://adsabs.harvard.edu/abs/2008ConPh..49...71T
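A minimal sketch of the null-simulation approach (a toy setup assumed for illustration: Gaussian errors, null model = constant, alternative = constant plus an emission line of fixed position and width whose amplitude is bounded at zero, i.e. a boundary case):

```python
# Obtain the likelihood-ratio distribution by simulating from the null model
# instead of assuming a chi-squared distribution.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
sigma = 1.0
line_profile = np.exp(-0.5 * ((x - 5.0) / 0.3) ** 2)   # fixed position and width

def chi2_null(data):
    return np.sum((data - data.mean()) ** 2) / sigma**2   # best-fit constant

def chi2_alt(data):
    def chi2(params):
        const, ampl = params
        model = const + max(ampl, 0.0) * line_profile      # amplitude bounded at 0
        return np.sum((data - model) ** 2) / sigma**2
    return minimize(chi2, x0=[data.mean(), 0.1], method="Nelder-Mead").fun

def likelihood_ratio(data):
    return chi2_null(data) - chi2_alt(data)                # = 2 * delta(log L)

lr_obs = likelihood_ratio(rng.normal(1.0, sigma, size=x.size))   # put your real data here

# distribution of the statistic when there is *no* line
lr_null = np.array([likelihood_ratio(rng.normal(1.0, sigma, size=x.size))
                    for _ in range(1000)])

print("fraction of null simulations with LR >= observed:", (lr_null >= lr_obs).mean())
```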
- If you do Goodness-of-Fit tests:
- Do you use the chi^2 statistic with binning chosen based on the data (for example, adaptive binning algorithms)?
- Do you use the chi^2 statistic on non-linear models?
- Do you use the chi^2 statistic on small datasets?
- The chi^2 statistic is then invalid or problematic. See "Dos and don'ts of reduced chi-squared" (Andrae, Schulze-Hartung & Melchior 2010, arXiv:1012.3754).
- (What is good advice here? One option, in the spirit of the rest of this checklist, is to calibrate the distribution of your fit statistic with Monte Carlo simulations of the fitted model, as in the sketch below, rather than assuming a chi^2 distribution with a guessed number of degrees of freedom.)
- An alternative to Goodness-of-Fit testing is to first fit an over-flexible, empirical model, and then show that the physically motivated model reaches a similar likelihood (and, given its higher prior probability, should be preferred).
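One possible Monte Carlo calibration, sketched below (a toy non-linear model assumed for illustration; not advice from the original checklist): instead of assuming the best-fit chi^2 follows a chi^2 distribution with N - n_params degrees of freedom, simulate data from the fitted model and measure the distribution directly.

```python
# Calibrate the goodness-of-fit distribution of a non-linear model by simulation.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(4)
x = np.linspace(0.1, 10, 40)
sigma = 0.1

def model(x, a, b):
    return a * x ** b                       # non-linear in b

# stand-ins for your real data and best fit
y = model(x, 2.0, 0.5) + rng.normal(0, sigma, size=x.size)
popt, _ = curve_fit(model, x, y, p0=[1.0, 1.0])
chi2_obs = np.sum(((y - model(x, *popt)) / sigma) ** 2)

# distribution of the best-fit chi^2 if the fitted model were the truth
chi2_sim = []
for _ in range(1000):
    y_fake = model(x, *popt) + rng.normal(0, sigma, size=x.size)
    p_fake, _ = curve_fit(model, x, y_fake, p0=popt)
    chi2_sim.append(np.sum(((y_fake - model(x, *p_fake)) / sigma) ** 2))

print("goodness-of-fit p-value from simulations:", np.mean(np.array(chi2_sim) >= chi2_obs))
```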
- If you do Bayesian inference:
- Do you check how sensitive your results are to the priors?
- Try out other priors and check whether the results change (see the sketch below). [Assumed constants are also priors (delta-function priors).]
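A minimal sketch of such a prior-sensitivity check (a toy problem assumed for illustration: inferring a Gaussian mean on a grid, with two arbitrary priors): if the credible intervals differ noticeably, the data alone do not constrain the parameter and the prior choice matters.

```python
# Compare posteriors (and credible intervals) under two different priors.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.normal(2.0, 1.0, size=15)
mu_grid = np.linspace(-5, 10, 2000)

log_like = np.array([stats.norm.logpdf(data, mu, 1.0).sum() for mu in mu_grid])

priors = {
    "flat prior on [-5, 10]": np.ones_like(mu_grid),
    "normal(0, 1) prior":     stats.norm.pdf(mu_grid, 0.0, 1.0),
}

for name, prior in priors.items():
    post = np.exp(log_like - log_like.max()) * prior
    post /= np.trapz(post, mu_grid)                        # normalise on the grid
    cdf = np.cumsum(post) * (mu_grid[1] - mu_grid[0])
    lo, hi = np.interp([0.05, 0.95], cdf, mu_grid)
    print(f"{name}: 90% credible interval = [{lo:.2f}, {hi:.2f}]")
```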
More common statistical mistakes in the astronomical literature are covered by Eric Feigelson: http://bccp.berkeley.edu/beach_program/COTB14Feigelson3.pdf
Join the Astrostatistics Facebook group to learn more and ask questions.