This checklist helps you identify and fix common errors and misinterpretations, in your own analysis or in a paper you are refereeing.
- If you use p-values (from a KS test, Pearson correlation, etc.):
- What do you think a low p-value says?
- (a) You have absolutely disproved the null hypothesis (e.g. "no correlation" is ruled out, the data are not sampled from this model, there is no difference between the population means).
- (b) You have found the probability of the null hypothesis being true.
- (c) You have absolutely proved your experimental hypothesis (e.g. that there is a difference between the population means, there is a correlation, the data are sampled from this model).
- (d) You can deduce the probability of the experimental hypothesis being true.
- (e) You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.
- (f) You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.
Choose the right answer from (a)-(f).
- If you have chosen any of (a)-(f), you are wrong: p-values cannot give you any of this information.
- p-values are not probabilities of hypotheses. A p-value only specifies how frequently data at least as extreme as yours would occur if the null hypothesis were true (see the sketch below). The null hypothesis may not be true, and rare outcomes do occur if you try often enough, so a p-value alone cannot tell you that the null hypothesis is improbable (a prior is needed for that).
- If you want to compare two models, try Bayesian model comparison instead.
Further reading:
- The Null Ritual: What You Always Wanted to Know About Significance Testing but Were Afraid to Ask http://library.mpib-berlin.mpg.de/ft/gg/GG_Null_2004.pdf
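As a concrete illustration, here is a minimal sketch (a toy numpy/scipy simulation, not part of the original checklist): two unrelated variables are drawn repeatedly, so the null hypothesis of "no correlation" is true by construction, yet p < 0.05 still occurs in roughly 5% of repetitions.

```python
# Toy demonstration: under a true null hypothesis, p-values are uniformly
# distributed between 0 and 1, so "significant" values occur by chance.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n_repeats, n_points = 10000, 30

pvalues = np.empty(n_repeats)
for i in range(n_repeats):
    x = rng.normal(size=n_points)
    y = rng.normal(size=n_points)   # independent of x: the null is true
    r, p = pearsonr(x, y)
    pvalues[i] = p

print("fraction with p < 0.05:", (pvalues < 0.05).mean())   # ~0.05, not 0
```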
- Do you declare the p-value "significant" (or some creative variant thereof) based on some threshold?
- p-values are not probabilities of hypotheses, but random variables between 0 and 1 (uniformly distributed if the null hypothesis is true), so sometimes they are low purely by chance.
- Choosing a "significance" threshold (0.1, 0.05, 3 sigma, 5 sigma) after looking at your result is poor practice.
- More important than the significance is to report the size of the effect, and place upper and lower limits on it.
- Beware that p-values are not the frequency of making a mistake or the probability of detection. Following the article below, for a p-value of 0.05 (claiming a detection of something), the probability that there really is something is only ~64%. For 2 sigma it is 80%, for p<0.01 it is 90%, for 3 sigma it is 99%. This is because most hypotheses you test are probably not true (the article assumes 10% of all tested hypotheses are true). To first order you can estimate the probability that you really detected something as
100% - p-value / (fraction of hypotheses that are true)
- To actually determine how often your method makes false detections, run Monte Carlo simulations without the effect, generating the same number of data points at the same times/frequencies etc., and apply your detection method (see the sketch below).
- Publication bias additionally makes reported p-values unreliable (because they are random variables, preferentially only low values get reported).
Further reading:
- http://www.medpagetoday.com/Blogs/TheMethodsMan/52171
- A false detection is also known as a false positive or type I error; its rate (the frequency of rejecting a true null hypothesis) is alpha. This is distinct from the false discovery rate (FDR), the fraction of your claimed detections that are false; see the multiple-testing item below.
- A missed detection is also known as a false negative or type II error; its rate, beta, is related to the power of the test, 1-beta. For low-power tests with a high missed-detection rate (say, >2%), highly significant features will often be false detections. http://andrewgelman.com/2014/11/17/power-06-looks-like-get-used/
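A minimal sketch of such a null Monte Carlo calibration (toy data and a placeholder detection statistic, assumed for illustration; substitute your own data, sampling and detection method):

```python
# Calibrate how often your detection method triggers on pure noise.
import numpy as np

rng = np.random.default_rng(1)
n_points = 100            # same number of data points as your real data set
n_simulations = 10000

def my_detection_statistic(data):
    # placeholder: the most extreme point in units of the scatter
    return np.max(np.abs(data)) / np.std(data)

observed = my_detection_statistic(rng.normal(size=n_points))   # put your real data here

# generate data sets *without* the effect (pure noise, same sampling and errors)
null_stats = np.array([my_detection_statistic(rng.normal(size=n_points))
                       for _ in range(n_simulations)])

# the frequency of claiming a detection at least this strong when nothing is there
print("false detection frequency:", (null_stats >= observed).mean())
```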
- Do you do multiple tests on the same data? For example, you have 10 variables and try to find any correlations (N*(N-1)/2 = 45 pairs). Or you test many pixels for a significant detection.
- You need to correct for the number of tests, so the detection threshold should be set much lower.
- For example, control the False Discovery Rate (FDR) using the Benjamini-Hochberg procedure (see the sketch below).
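A minimal sketch of the Benjamini-Hochberg procedure in plain numpy (the alpha level and the toy p-values are arbitrary examples; statsmodels.stats.multitest.multipletests with method="fdr_bh" provides the same functionality):

```python
# Benjamini-Hochberg: control the expected fraction of false discoveries.
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean mask of which tests to report, controlling the FDR at alpha."""
    p = np.asarray(pvalues)
    m = len(p)
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    # find the largest rank k with p_(k) <= k/m * alpha, reject everything up to it
    below = p[order] <= ranks / m * alpha
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        significant[order[:k + 1]] = True
    return significant

# example: 45 correlation tests among 10 variables, where most nulls are true
pvals = [0.001, 0.008, 0.04, 0.06] + [0.2 + 0.01 * i for i in range(41)]
print(benjamini_hochberg(pvals, alpha=0.05))
```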
- If you do a KS test:
- Do you do a KS test in 2 or more dimensions?
- The KS test is not valid there: the statistic is no longer distribution-free (you would have to prove that it is for your specific application).
- Instead, use Monte Carlo simulations.
- Do you do a KS test of your data against a fitted model (against the same data)?
- The KS test results are invalid then, and p-values are misleading. Do not use this!
- Use Monte Carlo simulations instead; look into bootstrapping and cross-validation (see the sketch below).
- Both caveats above (multiple dimensions, testing against a fitted model) also apply to the Anderson-Darling test (which is more sensitive than KS) and the Cramér–von Mises test.
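A minimal parametric-bootstrap sketch for the fitted-model case (a toy example assuming a Gaussian model whose mean and width are estimated from the same data; substitute your own data and model): the naive KS p-value is misleading, the calibrated one is not.

```python
# Calibrate the KS statistic when the model parameters were fitted to the data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(loc=1.0, scale=2.0, size=80)        # stand-in for your data

# fit the model to the data, then compute the KS statistic against that fit
mu, sigma = data.mean(), data.std(ddof=1)
d_obs, p_naive = stats.kstest(data, "norm", args=(mu, sigma))

# parametric bootstrap: simulate from the fitted model, refit, recompute D
n_sim = 2000
d_sim = np.empty(n_sim)
for i in range(n_sim):
    fake = rng.normal(mu, sigma, size=data.size)
    mu_i, sigma_i = fake.mean(), fake.std(ddof=1)
    d_sim[i] = stats.kstest(fake, "norm", args=(mu_i, sigma_i)).statistic

print("naive p-value:", p_naive, " calibrated p-value:", (d_sim >= d_obs).mean())
```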
Further reading:
- You do not detect anything with significance.
- Don't be sad. A non-detection means you can place an upper limit on the effect size! Don't let the opportunity slip.
- If you use a likelihood ratio to compare two models:
- Do you test near the boundary of a parameter? For example, do you test for the presence of an additional component (whose normalization cannot be negative)?
- The likelihood ratio is not chi-square distributed there (the assumptions behind Wilks' theorem are violated).
- You need to do simulations (using the null model) to predict the distribution of likelihood ratios (see the sketch below).
- Or: use Bayesian model comparison, or an information criterion (AIC, WAIC, WBIC).
- Is one model a special case of the other model?
- If not, the likelihood ratio is not chi-square distributed.
- You need to do simulations (using the null model) to predict the distribution of likelihood ratios.
- Or: Use Bayesian model comparison.
Further reading:
- Statistics, Handle with Care: Detecting Multiple Model Components with the Likelihood Ratio Test http://adsabs.harvard.edu/abs/2002ApJ...571..545P
- Bayesian model comparison and Information Criteria: http://adsabs.harvard.edu/abs/2008ConPh..49...71T
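A minimal sketch of the null-simulation approach (a toy setup assumed for illustration: Gaussian errors, null model = constant, alternative = constant plus an emission line of fixed position and width whose amplitude is bounded at zero, i.e. a boundary case):

```python
# Obtain the likelihood-ratio distribution by simulating from the null model
# instead of assuming a chi-squared distribution.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
sigma = 1.0
line_profile = np.exp(-0.5 * ((x - 5.0) / 0.3) ** 2)   # fixed position and width

def chi2_null(data):
    return np.sum((data - data.mean()) ** 2) / sigma**2   # best-fit constant

def chi2_alt(data):
    def chi2(params):
        const, ampl = params
        model = const + max(ampl, 0.0) * line_profile      # amplitude bounded at 0
        return np.sum((data - model) ** 2) / sigma**2
    return minimize(chi2, x0=[data.mean(), 0.1], method="Nelder-Mead").fun

def likelihood_ratio(data):
    return chi2_null(data) - chi2_alt(data)                # = 2 * delta(log L)

lr_obs = likelihood_ratio(rng.normal(1.0, sigma, size=x.size))   # put your real data here

# distribution of the statistic when there is *no* line
lr_null = np.array([likelihood_ratio(rng.normal(1.0, sigma, size=x.size))
                    for _ in range(1000)])

print("fraction of null simulations with LR >= observed:", (lr_null >= lr_obs).mean())
```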
- If you do Goodness-of-Fit tests:
- Do you use the chi^2 statistic with binning chosen based on the data (for example, adaptive binning algorithms)?
- Do you use the chi^2 statistic on non-linear models?
- Do you use the chi^2 statistic on small datasets?
- The chi^2 statistic is then invalid or problematic. See "Dos and don'ts of reduced chi-squared" (Andrae, Schulze-Hartung & Melchior 2010, arXiv:1012.3754).
- (What is good advice here? One option, in the spirit of the rest of this checklist, is to calibrate the distribution of your fit statistic with Monte Carlo simulations of the fitted model, as in the sketch below, rather than assuming a chi^2 distribution with a guessed number of degrees of freedom.)
- An alternative to Goodness-of-Fit testing is to first fit an over-flexible, empirical model, and then show that the physically motivated model reaches a similar likelihood (and, given its higher prior probability, should be preferred).
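One possible Monte Carlo calibration, sketched below (a toy non-linear model assumed for illustration; not advice from the original checklist): instead of assuming the best-fit chi^2 follows a chi^2 distribution with N - n_params degrees of freedom, simulate data from the fitted model and measure the distribution directly.

```python
# Calibrate the goodness-of-fit distribution of a non-linear model by simulation.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(4)
x = np.linspace(0.1, 10, 40)
sigma = 0.1

def model(x, a, b):
    return a * x ** b                       # non-linear in b

# stand-ins for your real data and best fit
y = model(x, 2.0, 0.5) + rng.normal(0, sigma, size=x.size)
popt, _ = curve_fit(model, x, y, p0=[1.0, 1.0])
chi2_obs = np.sum(((y - model(x, *popt)) / sigma) ** 2)

# distribution of the best-fit chi^2 if the fitted model were the truth
chi2_sim = []
for _ in range(1000):
    y_fake = model(x, *popt) + rng.normal(0, sigma, size=x.size)
    p_fake, _ = curve_fit(model, x, y_fake, p0=popt)
    chi2_sim.append(np.sum(((y_fake - model(x, *p_fake)) / sigma) ** 2))

print("goodness-of-fit p-value from simulations:", np.mean(np.array(chi2_sim) >= chi2_obs))
```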
- If you do Bayesian inference:
- Do you check how sensitive your results are to the priors?
- Try out other priors and check whether the results change (see the sketch below). [Assumed constants are also priors (delta-function priors).]
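A minimal sketch of such a prior-sensitivity check (a toy problem assumed for illustration: inferring a Gaussian mean on a grid, with two arbitrary priors): if the credible intervals differ noticeably, the data alone do not constrain the parameter and the prior choice matters.

```python
# Compare posteriors (and credible intervals) under two different priors.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.normal(2.0, 1.0, size=15)
mu_grid = np.linspace(-5, 10, 2000)

log_like = np.array([stats.norm.logpdf(data, mu, 1.0).sum() for mu in mu_grid])

priors = {
    "flat prior on [-5, 10]": np.ones_like(mu_grid),
    "normal(0, 1) prior":     stats.norm.pdf(mu_grid, 0.0, 1.0),
}

for name, prior in priors.items():
    post = np.exp(log_like - log_like.max()) * prior
    post /= np.trapz(post, mu_grid)                        # normalise on the grid
    cdf = np.cumsum(post) * (mu_grid[1] - mu_grid[0])
    lo, hi = np.interp([0.05, 0.95], cdf, mu_grid)
    print(f"{name}: 90% credible interval = [{lo:.2f}, {hi:.2f}]")
```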
More common statistical mistakes in the astronomical literature are covered by Eric Feigelson: http://bccp.berkeley.edu/beach_program/COTB14Feigelson3.pdf
Join the Astrostatistics Facebook group to learn more and ask questions.