NHST draft writeup

Null Hypothesis Significance Testing (NHST) is a procedure in which we try to learn something about the data by forming an hypothesis and then ruling out (or "rejecting") that hypothesis. Conceptually, NHST is similar to "deductive reasoning" in philosophy, or "differential diagnosis" in medicine: we can arrive at a single unambiguous conclusion only by ruling out all other possibilities. Therefore we can "prove" that the alternative hypothesis is true by ruling out the only other possibility, the null hypothesis.

In order to perform NHST, we have to specify a null hypothesis, typically denoted H0, and an alternative hypothesis, typically denoted HA. H0 and HA must be mutually exclusive for NHST to make sense: if H0 is true then HA must be false, and if HA is true then H0 must be false.

NHST has two possible outcomes: we reject the null hypothesis in favor of the alternative hypothesis, or we fail to reject the null hypothesis. Failing to reject the null hypothesis does not mean that we reject the alternative! It only means that we are unable to tell which hypothesis is true, and that we cannot rule out H0. This is a very important concept to understand.

Suppose that I have 5 students' homework grades: 85, 89, 78, 91, 86. The average grade last year was 83, and I don't believe that I would see the same average this year. Therefore I would like to test whether the true average grade G is different from 83 -- i.e. if there were no random fluctuations, all 5 students would have the same grade, and that grade would be something other than 83. Therefore I set up a hypothesis test with H0: G = 83 and HA: G ≠ 83. My objective now is to either reject H0, and thereby conclude that G ≠ 83 (HA is true), or fail to reject H0, and therefore be unable to rule out the possibility that G = 83. This is admittedly a silly example, but I will use it to illustrate the NHST process.
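
As a concrete starting point, here is how the data and null value might be written down in R (the variable names `grades` and `mu0` are my own choices for this tutorial, not part of any standard API):

```r
# The five observed homework grades
grades <- c(85, 89, 78, 91, 86)

# The hypothesized mean under H0: G = 83; HA: G != 83 is implicit
# in the two-sided test we will construct below
mu0 <- 83

mean(grades)  # 85.8 -- our first hint that the sample differs from 83
```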

To set up the test, we usually represent our data as a sequence of random variables, X1, ..., Xn, where n is the number of observations in the data set. The observed data then can be interpreted as a single "realization"/"observation"/"draw" from each of these random variables, x1, ..., xn. We usually also assume that all of the Xi rv's are independent and follow the same probability distribution - the "iid" assumption. All of the tests that you learn in an intro stats course will use this setup.

I know that all of these students have roughly similar intelligence and capabilities. Therefore it is safe to assume that these observations are random and identically distributed: the differences in grades are due only to random fluctuations, and not due to underlying differences between the students. I also assume, perhaps naively, that they did not collaborate on the homework. If that assumption is true, their grades are also independent: the random fluctuations that produced Student 1's grade are not related to the fluctuations that produced Student 4's grade.

To be somewhat more precise, Student 1's grade x1 is a draw from the X1 random variable, Student 2's grade x2 is a draw from the X2 random variable, etc. X1, ..., X5 are iid. Often you will see books and tutorials glossing over this arrangement by saying that all of the students' grades are drawn from the same random variable X. This is a valid way to set up the problem, but most of mathematical statistics uses the X1, ..., Xn formulation, and most formulas make more sense when presented that way. So it's worth getting used to seeing data represented this way.

Next, we compute a special quantity called a test statistic. I will denote the test statistic here as T, but note that different letters are used to represent the test statistic in different contexts. The test statistic is always defined as a function of the Xi random variables. Remember that a function of random variables is itself a random variable. That is, T itself is a random variable with some probability distribution.

As a random variable, we can observe draws/observations/realizations from T, which we will denote t. For example, if our test statistic is defined as T = X1 + ... + Xn, then a realization of T will be t = x1 + ... + xn. So if we have a dataset consisting of x1 = 13, x2 = 12, x3 = 7, x4 = -1, then the realized test statistic for this dataset will be t = 13 + 12 + 7 + (-1) = 31.
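
A minimal sketch of this sum-statistic example in R, using the made-up values above:

```r
# Realizations x1, ..., x4 of the random variables X1, ..., X4
x <- c(13, 12, 7, -1)

# The test statistic T = X1 + ... + Xn, realized on this dataset
t_obs <- sum(x)
t_obs  # 31
```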

Without getting into technical details (yet), I know that when I am testing an hypothesis about the mean of a distribution, the test statistic should be t = (m - 83) / (s / sqrt(n)), where m is the sample mean, s is the sample standard deviation, and n is the sample size. I also know that the test statistic should follow the Student's T distribution with n - 1 = 4 degrees of freedom. In this case m = 85.8 and s = 4.97 (computations omitted for brevity; now is a good time to practice and verify that you get the same results). I plug these into the formula and conclude that the test statistic is 1.26.
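
Here is the same computation in R, so you can verify the numbers yourself:

```r
grades <- c(85, 89, 78, 91, 86)
n <- length(grades)  # 5
m <- mean(grades)    # 85.8
s <- sd(grades)      # 4.97 (rounded)

# One-sample t statistic for H0: G = 83
t_stat <- (m - 83) / (s / sqrt(n))
t_stat               # about 1.26
```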

In NHST, T must be carefully constructed. The formula for T must be chosen so that we can compute the probability distribution of T when the null hypothesis is true. That is, we need to be able to compute P(T <= t | H0). I will denote this quantity as F0(t), so F0(t) = P(T <= t | H0).

Under the alternative hypothesis, the distribution of the test statistic might be anything. I will denote this as FA(t) = P(T <= t | HA). In fact, the NHST procedure is only possible if the distribution of the test statistic under the alternative is very different from its distribution under the null. That is, FA(t) must be very different from F0(t).
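
To see what "very different" can look like, we can compare the density of the test statistic under H0 with its density under one particular alternative. If the grades are normally distributed and the true mean is not 83, the t statistic above follows a noncentral t distribution; the noncentrality parameter in this sketch is chosen purely for illustration:

```r
# Density of T under H0: central Student's t with 4 degrees of freedom
curve(dt(x, df = 4), from = -5, to = 8, ylab = "density")

# Density of T under one illustrative alternative: noncentral t (ncp = 3)
curve(dt(x, df = 4, ncp = 3), add = TRUE, lty = 2)
```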

Once we have computed t on the data, the test proceeds by computing the probability (under H0) that a randomly-drawn T would be greater in absolute value than the t value we computed. That is, assuming t is positive, we compute p = P({T < -t} ∪ {T > t} | H0). By the rules of probability, p = P({T < -t} ∪ {T > t} | H0) = P(T <= -t | H0) + (1 - P(T <= t | H0)) = F0(-t) + (1 - F0(t)). (The distinction between < and <= does not matter here, because T has a continuous distribution.) This probability p is the p-value of the test. Being a probability, the p-value is always between 0 and 1.
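
This formula translates directly into a small helper function. Here is a sketch (the name `p_two_sided` is mine, not a standard R function), written against an arbitrary null CDF F0:

```r
# Two-sided p-value: p = F0(-|t|) + (1 - F0(|t|))
p_two_sided <- function(t, F0) {
  F0(-abs(t)) + (1 - F0(abs(t)))
}

# Example with the Student's T(4) null distribution from our test:
# p_two_sided(1.26, function(t) pt(t, df = 4))  # about 0.276
```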

If the p-value is very small (very close to 0), it means that observing such a large (or small) t value has a very low probability under H0. If this probability is below a pre-determined threshold, then we say that this t value is too improbable to have been produced by the distribution F0. If we believe that t is not a realization from the F0 distribution, then T must not follow the F0 distribution. And if T does not follow the F0 distribution, then H0 cannot be true. Therefore we must reject H0, and we must accept HA as its only alternative.

But if t is not unreasonably large (or small), then the p-value will be large (not very close to 0). In this case we have no grounds to doubt that T follows the F0 distribution. It is very important to realize that we do not know what FA is, and we are unable to rule out the possibility that HA might still be true. All we have done is fail to reject H0.

I can use R to compute P(T <= -1.26 | H0) + (1 - P(T <= 1.26 | H0)) for a T(4) distribution: pt(-1.26, df = 4) + (1 - pt(1.26, df = 4)). This returns 0.276. So p = 0.276. If the true mean student homework score G is 83, then in random samples of 5 students, we would observe an absolute t value of 1.26 or greater (i.e. we would observe t > 1.26 or t < -1.26) in about 27.6% of such samples.

To understand the final step, you need to understand the four possible results of conducting an hypothesis test:

|  | H0 is true | H0 is false (HA is true) |
| --- | --- | --- |
| Reject H0 (in favor of HA) | False positive (Type 1 error) | True positive |
| Do not reject H0 (can't distinguish H0/HA) | True negative | False negative (Type 2 error) |

α is defined as the probability that we reject H0 when H0 is actually true, α = P(Reject | H0). This mistake is often called a "Type 1" error. β is defined as the probability that we fail to reject H0 when H0 is not true, i.e. when HA is true. This mistake is often called a "Type 2" error. In almost all real-world problems, if we adjust the test procedure such that α (probability of type 1 error) is very small, we cause β (probability of type 2 error) to increase. Hypothesis tests are typically designed such that we set our tolerance for Type 1 errors, and allow the probability of Type 2 errors to float in response.

This tolerance for Type 1 error, α, is known as the significance level of the test. What does it mean to decide our tolerance for Type 1 errors? It means that we must determine the maximum acceptable probability of rejecting a true null hypothesis. In some cases, this maximum acceptable probability is very small. Other times, we don't need such a strict test, so we set α to a larger number, in order to get a better chance of rejecting the null hypothesis if it is indeed false. That is, we can sometimes accept a larger probability of Type 1 errors (α) in order to reduce the probability of Type 2 errors (β).

Let's leave the student homework example aside briefly. Consider an hypothesis test that is used as a medical screening for a serious health condition. The null hypothesis would be that the medical condition is absent, and the alternative is that the medical condition is present. Suppose that the treatment for the condition is effective, but it can have dangerous side effects. Therefore we do not want to start treatment unless we are very certain that the screening test is correct. In such a scenario, we can tolerate only a very small probability of rejecting a true H0. Therefore we set α to a small number, maybe 0.01. Computing β is outside the scope of this tutorial, but it will depend on the specific test that is being conducted.

In the case of student homeworks, I know that the dataset is very small, and I believe that the amount of random variation in the data is large. As such, I believe there will be a high probability of failing to reject H0 except in very obvious cases, even when it is false. This is also not a high-stakes test, so I decide that I am willing to accept a 15% chance of rejecting a true null hypothesis. That is, I am willing to accept a 15% chance of committing a Type 1 error.
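
The meaning of α as a long-run error rate can be sanity-checked by simulation. The sketch below assumes, purely for illustration, that grades under H0 are normally distributed with mean 83 and standard deviation 5; it repeatedly draws samples of 5 grades, runs the test, and counts how often a true H0 is rejected:

```r
set.seed(42)
alpha <- 0.15
n_sims <- 10000

rejections <- replicate(n_sims, {
  # Simulate 5 grades with H0 true (normality and sd = 5 are assumptions
  # made only for this illustration)
  x <- rnorm(5, mean = 83, sd = 5)
  t_stat <- (mean(x) - 83) / (sd(x) / sqrt(5))
  p <- 2 * pt(-abs(t_stat), df = 4)
  p <= alpha
})

mean(rejections)  # close to 0.15, our chosen Type 1 error rate
```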

The final step in NHST is therefore simply to compare the p-value (for our two-sided test, p = F0(-t) + (1 - F0(t))) to the significance level (α). If p <= α, we reject H0; otherwise, we fail to reject H0.

The p-value for the T test on the student homework data was 0.276, but my chosen α level is 0.15. Therefore I cannot reject the null hypothesis that the mean homework grade is 83. So even though my sample mean was 85.8, I do not have enough evidence to distinguish between H0 (G = 83) and HA (G ≠ 83) on such a small sample with so much variance in the data.

Alternatively, we can work backwards from α to compute a quantity called the critical value of the hypothesis test, which I will denote t0. You can then compare t to t0. The correct way to compare these two quantities will depend on the specific test being conducted. In NHST, you get the same results if you compare t to t0, or if you compare p to α. I usually recommend the p-to-α comparison, for reasons I will explain at the end of this document. Usually computing the critical value requires inverting the distribution function F0, i.e. "finding the function Q0(p) such that P(T <= Q0(p)) = p." I use the letter "Q" here because the inverse of the distribution function is called the quantile function.

In the student homework example, I used a two-sided test, and the distribution of T is symmetric about the mean, meaning that there is the same probability density above and below the mean. If the probability density of T (under H0) is plotted on a chart, the region where H0 is rejected is located at both the far left and far right ends of the curve. This region is called the critical region. The critical region begins at the critical value and extends to the edge of the domain. The critical value itself (t0) is the value such that P({T <= -t0} ∪ {T > t0} | H0) <= α.

In terms of our quantile function, the critical value is Q(1 - α/2) = t0, or equivalently Q(α/2) = -t0. Without getting lost in the mathematics, we divide α by 2 because we need to evenly divide the critical region into two separate areas: the left tail and the right tail. The critical region starts at t0 and extends rightward as far as the maximum possible value of T, and it starts at -t0 and extends leftward as far as the minimum possible value of T. This is the area where t > t0 or t < -t0, meaning that the absolute value of t in my sample (|t|) is greater than the critical value t0. The area where we can't reject H0 is the rest of the distribution, in the middle of the density plot.

1 - (0.15 / 2) is 0.925, and I can use R to compute t0 such that P({T <= -t0} ∪ {T > t0} | H0) <= 0.15 for a T(4) distribution, i.e. Q(0.925) for a T(4) distribution: qt(0.925, df = 4). This returns 1.78. My test statistic t was 1.26, which is less than the critical value of 1.78, but also greater than the critical value of -1.78. I am conducting a two-sided test, so the critical region is everything outside these critical values. Therefore my test statistic is not in the critical region, and I obtain the same result as when I computed the p-value and compared it to the significance level.
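
In R, the two decision rules can be placed side by side to confirm that they agree (a sketch using the numbers from this example):

```r
alpha <- 0.15
t_stat <- 1.26
df <- 4

# Rule 1: compare the p-value to alpha
p <- pt(-abs(t_stat), df) + (1 - pt(abs(t_stat), df))
p <= alpha                   # FALSE: do not reject H0

# Rule 2: compare |t| to the critical value t0 = Q(1 - alpha/2)
t0 <- qt(1 - alpha / 2, df)  # about 1.78
abs(t_stat) > t0             # FALSE: do not reject H0
```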

Occasionally in statistics we are interested in a one-sided test, where we know that everything to one side of H0 is impossible. For example, if we are testing the hypothesis that the average number of calls received by a call center on weekends is 0, we only have to test H0: Calls = 0 against HA: Calls > 0. Calls < 0 is impossible, so we don't need to (and shouldn't) include that in HA. Note that you must be very careful not to confuse H0: Calls = 0 with H0: Calls <= 0. For example, it is tempting in the student homework example to write H0: G <= 83. This is not a valid one-sided test, and using the formulas for a one-sided test will produce the wrong answer. To conduct an hypothesis test where H0 is a range of values (and not just a single value), we need a technique called "composite hypothesis testing", because H0 is a "composite" hypothesis. This is an advanced grad-school-level technique that you unfortunately will not learn in an intro stats course.
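For completeness, here is how a one-sided p-value would be computed in R, assuming (only to keep the example concrete; the real null distribution depends on the specific test) that the test statistic follows the same T(4) distribution under H0:

```r
# One-sided test with HA pointing to the right, so only the right tail counts:
# p = P(T > t | H0) = 1 - F0(t)
t_stat <- 1.26  # illustrative value
p_one_sided <- 1 - pt(t_stat, df = 4)
p_one_sided     # about 0.14
```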

Many textbooks focus on comparing test statistics with critical values, and do not focus on comparing p values with significance levels. However, all of this complexity around the critical value and the critical region can make it easy to mistakenly choose the wrong critical value for your test. Furthermore, the notation for critical values can be very ugly, and the process of obtaining and comparing critical values in my opinion obscures the purpose and mechanics of NHST. On the other hand, if you are comfortable with the rules of probability, you will have no trouble figuring out the area in which p <= α, and you can usually read the critical region right from the equation for p.

For example, in most two-sided tests the p value is defined as p = P({T <= -t} ∪ {T > t}). You can see from this equation that a larger t causes the area under the probability density to decrease, by moving the covered area farther out along the tails of the distribution. Therefore the critical region must also be located in the tails of the distribution, specifically in the ranges T <= -t0 and T > t0. But you don't need to know that in order to compare p <= α, as long as you are correctly computing p. And the process of figuring out how to compute p will help you figure out where the critical region should be anyway.
